One of the primary goals of studying data is to understand facts and figures better. A better understanding of any phenomenon around us equips us to make better judgments. But in data science we almost always deal with messy data that is hard to understand in its given format. To understand data better, it has to be communicated and represented in the simplest possible form.
Everything should be made as simple as possible, but not simpler. - Einstein
Data reformatting, data cleaning, and data visualization make the data understandable and easy to explain to the relevant stakeholders.
To give our readers a flavor of all these concepts in practice, this post presents part 1 of a case study of Census 2011 data. In this case study, we try to understand various socio-economic metrics of India through presentable, easy-to-understand visualizations. The complete source code is also given at the end of this post for your reference. Come on board the learning journey.
Case Study – Census 2011, Part 1
Mohan works as a consultant for a policy think tank. He is studying key performance indicators of the different states within India to arrive at a better understanding of socio-economic conditions across the country. He decided to analyze the 2011 India census data. The original census data is released and owned by the Registrar General and Census Commissioner of India under the Ministry of Home Affairs, Government of India. The data Mohan is using is listed below.
- Data source: http://censusindia.gov.in/2011-Common/CensusData2011.html
- Scraped source: https://github.com/nishusharma1608/India-Census-2011-Analysis/blob/master/india-districts-census-2011.csv
Mohan is particularly interested in answering the following questions.
- How is the literacy rate distributed across the Indian states?
- How is the sex ratio distributed across the Indian states?
Let’s help Mohan see the big picture.
How Is The Literacy Rate Distributed Across States?
The data we are given is district-wise. To get state-wise data, we aggregated the district figures by state and then calculated each state’s literacy rate. The code can be seen at the end of this post.
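The aggregation step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for this post: the column names (`State name`, `Population`, `Literate`) and the three district rows are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical district-level rows. The column names and all figures
# are illustrative assumptions, not values from the census file.
districts = [
    {"State name": "State A", "District name": "A1", "Population": 1000, "Literate": 600},
    {"State name": "State A", "District name": "A2", "Population": 3000, "Literate": 2400},
    {"State name": "State B", "District name": "B1", "Population": 2000, "Literate": 1000},
]


def state_literacy_rates(rows):
    """Sum population and literates per state, then return the rate in percent."""
    totals = defaultdict(lambda: {"Population": 0, "Literate": 0})
    for row in rows:
        state = totals[row["State name"]]
        state["Population"] += row["Population"]
        state["Literate"] += row["Literate"]
    return {
        name: 100.0 * t["Literate"] / t["Population"]
        for name, t in totals.items()
    }


rates = state_literacy_rates(districts)
print(rates)  # {'State A': 75.0, 'State B': 50.0}
```

Note that the counts are summed first and divided afterwards. Averaging the district-level rates instead would give small districts the same weight as large ones.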
Figure 1. Scatterplot of literacy rates across various states and UTs in India as per Census 2011. The blue vertical line represents the national average literacy rate.
One important thing to note here: literacy is defined as the ability to read, write, and use arithmetic, and it is measured for people aged seven years and above. In our calculations, we have also counted children aged 0-6 years in the total population. The correct way to calculate the rate would be to subtract the child population from the total population and then compute the percentage. Because our data has no column giving the number of children aged 0-6, our results are slightly lower than the actual literacy rates. But we are concerned with the pattern here, not the exact values.
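To make the effect of the denominator concrete, here is a small sketch comparing the two calculations. All numbers are hypothetical, and `children_0_6` is an assumed input that, as noted above, is not available in the dataset used here.

```python
def crude_literacy_rate(literate, population):
    """Literates as a percentage of the whole population (what this post computes)."""
    return 100.0 * literate / population


def effective_literacy_rate(literate, population, children_0_6):
    """Literates as a percentage of the population aged seven and above."""
    return 100.0 * literate / (population - children_0_6)


# Hypothetical figures for a single region.
population = 1000
children_0_6 = 150
literate = 680

crude = crude_literacy_rate(literate, population)
effective = effective_literacy_rate(literate, population, children_0_6)
print(crude, effective)  # 68.0 80.0
```

Because the crude version divides by a larger number, it always comes out lower than the effective rate whenever the child population is non-zero.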
Let’s look at the literacy rates of the various states and UTs in the scatterplot in Figure 1. The Y-axis lists all the states and Union Territories as per Census 2011, and the X-axis shows the literacy rate as a percentage.
The scatterplot in Figure 1 conveys the facts about the literacy rate. The blue vertical line represents the overall literacy rate of India. The brown vertical lines mark one standard deviation on either side of the mean, which indicates how much the rates vary. Some of the high-level conclusions that can be drawn from this plot are listed below.
- States like Bihar, Arunachal Pradesh, Rajasthan, and Uttar Pradesh have very low literacy rates compared to other states.
- States and UTs like Kerala, Goa, Lakshadweep, and Mizoram have higher literacy rates.
- The states that fall below the national average need attention from policymakers.
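The one-standard-deviation band used in the plot can be computed along these lines. This is only a sketch: the five rates are hypothetical, the centre line is taken to be the mean of the state rates, and the use of the population standard deviation (`pstdev`) rather than the sample version is an assumption.

```python
import statistics

# Hypothetical state literacy rates in percent (not the census values).
rates = {
    "State A": 50.0,
    "State B": 60.0,
    "State C": 70.0,
    "State D": 80.0,
    "State E": 90.0,
}

values = list(rates.values())
mean = statistics.mean(values)
std = statistics.pstdev(values)  # assumption: population standard deviation
lower, upper = mean - std, mean + std

# States outside the band on either side.
below_band = [name for name, rate in rates.items() if rate < lower]
above_band = [name for name, rate in rates.items() if rate > upper]

print(f"mean={mean:.1f}, band=({lower:.1f}, {upper:.1f})")
print("below:", below_band, "above:", above_band)
```

With these made-up numbers, only State A falls below the band and only State E lies above it.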
One thing to observe is that a large number of states have literacy rates to the right of the national average. Why, then, is the national average only around 63%? This is because a large share of the people who are not literate belong to a few populous states, which brings down the national literacy rate. Policymakers should pay special attention to these states to improve the national literacy rate.
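A toy example shows how most states can sit above the national rate. The four regions below are hypothetical; the point is only that the national rate is weighted by population, while a simple average of state rates is not.

```python
# Hypothetical (name, population, literate) tuples: three small regions
# with high literacy and one large region with low literacy.
states = [
    ("Small A", 100, 80),
    ("Small B", 100, 80),
    ("Small C", 100, 80),
    ("Large D", 700, 350),
]

state_rates = [100.0 * literate / population for _, population, literate in states]

# Simple average of the state rates: every state counts equally.
unweighted_mean = sum(state_rates) / len(state_rates)

# National rate: total literates over total population.
total_population = sum(population for _, population, _ in states)
total_literate = sum(literate for _, _, literate in states)
national_rate = 100.0 * total_literate / total_population

states_above_national = sum(rate > national_rate for rate in state_rates)
print(unweighted_mean, national_rate, states_above_national)  # 72.5 59.0 3
```

Three of the four regions are above the national rate of 59%, yet the single large region pulls the national figure well below the simple average of 72.5%.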
Let’s see how the population is distributed across Indian states/UTs in Figure 2. This plot suggests that a few states account for a high proportion of the population. As per this plot, here are the five most populous states along with their approximate populations (1 Cr = 1 crore = 10 million).
- Uttar Pradesh – 20 Cr
- Maharashtra – 11 Cr
- Bihar – 11 Cr
- West Bengal – 9 Cr
- Andhra Pradesh – 9 Cr
This means that five states constitute almost half of the Indian population.
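The "almost half" claim can be checked with quick arithmetic. The five figures are the approximate values listed above; the national total of roughly 121 crore for Census 2011 is an assumption brought in for this check, since it is not stated in the post.

```python
# Approximate populations in crore, as listed above.
top_five = {
    "Uttar Pradesh": 20,
    "Maharashtra": 11,
    "Bihar": 11,
    "West Bengal": 9,
    "Andhra Pradesh": 9,
}

# Assumption: India's Census 2011 population taken as roughly 121 crore.
india_total = 121

top_five_total = sum(top_five.values())
share = 100.0 * top_five_total / india_total
print(f"{top_five_total} Cr of {india_total} Cr = {share:.1f}%")  # 60 Cr of 121 Cr = 49.6%
```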
Figure 2. The population of various states and UTs across India as per Census 2011
Now, let’s see the literacy rates of these five most populous states in Figure 3. The literacy rates of three of these five states fall well below the national average.
Figure 3. Literacy rates of the five most populous states in India. Three of them fall short of the national average. This leads to the skewed distribution of literacy rates across states shown in Figure 1.
When we have skewed data, as is the case with literacy rates across Indian states, it is usually better to look at the spread of the data around the median rather than around the mean, because the median is less sensitive to outliers. You can learn more about this in the measurement of central tendencies. From the boxplot shown in Figure 4, we can conclude the following.
- 50% of Indian states have literacy rates between 60% and 76%.
- 25% of the states have a literacy rate of less than 60%, and they need attention from policymakers.
- 25% of the states are doing fairly well and have a literacy rate between 76% and 85%.
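The three statements above are readings of the quartiles that a boxplot draws. A rough sketch of how such quartiles can be computed is given below; the nine rates are hypothetical values chosen for illustration, and the `inclusive` interpolation method is an assumption, since plotting libraries differ slightly in how they compute quartiles.

```python
import statistics

# Hypothetical literacy rates (percent) for nine states; not the census values.
rates = [52.0, 58.0, 60.0, 64.0, 68.0, 72.0, 76.0, 80.0, 85.0]

# n=4 splits the data into quartiles; "inclusive" treats the list as the
# complete set of states rather than a sample from a larger population.
q1, median, q3 = statistics.quantiles(rates, n=4, method="inclusive")

# Values inside the box (between the first and third quartile).
in_box = [rate for rate in rates if q1 <= rate <= q3]

print(q1, median, q3)  # 60.0 68.0 76.0
print(len(in_box), "of", len(rates), "values lie inside the box")
```

The box spans the first to the third quartile, so about half of the values fall inside it, a quarter lie below it, and a quarter lie above it.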
Figure 4. The blue points represent the literacy rates of various states/UTs in India as per Census 2011.
How Is The Sex Ratio Distributed?
The sex ratio is the number of females per 1000 males. In India, it is especially significant because the ratio is skewed towards men.
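The definition translates directly into a one-line calculation. The sketch below uses two hypothetical regions with made-up male and female counts.

```python
def sex_ratio(females, males):
    """Number of females per 1000 males."""
    return 1000.0 * females / males


# Hypothetical (males, females) counts per region.
counts = {
    "Region A": (2000, 2100),
    "Region B": (4000, 3600),
}

ratios = {name: sex_ratio(females, males) for name, (males, females) in counts.items()}
print(ratios)  # {'Region A': 1050.0, 'Region B': 900.0}
```

A value above 1000 means there are more females than males, and a value below 1000 means the opposite.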
To find the distribution of the sex ratio across Indian states and union territories, we computed the sex ratio for each state and union territory from the given dataset. Figure 5 shows the scatterplot of the sex ratio of each state and UT within India. The blue vertical line indicates the mean sex ratio of India, and the two brown vertical lines mark one standard deviation below and above the mean. Most of the states and UTs have sex ratios within one standard deviation of the mean, that is, approximately between 860 and 1020.
- Very few states/UTs, such as Pondicherry and Kerala, have a sex ratio greater than 1000. What factors lead to these higher sex ratios? This should be analyzed so that the best practices can be emulated in other parts of the country.
- On the other hand, three Union territories have extremely low sex ratios. On the left side of the scatterplot, you can spot that Daman and Diu, Dadra and Nagar Haveli, and Chandigarh have very low sex ratios. The government should analyze the reasons for such low sex ratios and come up with short-term and long-term plans to address this issue.
Figure 5. The sex ratio of States and Union territories in India
From the scatterplot itself, we were able to understand the distribution of the sex ratio across the different states and UTs of India. We saw that the sex ratios of most states/UTs fall within one standard deviation of the mean. Let’s see what the box plot for this distribution looks like.
Figure 6. Boxplot of sex ratio distribution across Indian states and Union territories
The boxplot shows that there are three outliers in the data that differ markedly from the rest of the states and UTs. On the left side of the box plot, we can see that two UTs (Daman and Diu, Dadra and Nagar Haveli) have very low sex ratios, while on the right side, we can see that one state (Kerala) has a sex ratio of more than 1000.
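Boxplots commonly flag a point as an outlier when it lies more than 1.5 times the interquartile range beyond the box. Assuming that conventional rule was used here (the post does not say), the check can be sketched as below. The names and values are hypothetical and only mimic the shape of the result: two very low values and one very high one.

```python
import statistics

# Hypothetical sex ratios (females per 1000 males); not the census values.
ratios = {
    "UT A": 620.0,
    "UT B": 775.0,
    "State C": 900.0,
    "State D": 920.0,
    "State E": 930.0,
    "State F": 940.0,
    "State G": 950.0,
    "State H": 960.0,
    "State I": 1084.0,
}

q1, _, q3 = statistics.quantiles(ratios.values(), n=4, method="inclusive")
iqr = q3 - q1

# Assumed whisker rule: 1.5 * IQR beyond the first and third quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

low_outliers = [name for name, value in ratios.items() if value < lower_fence]
high_outliers = [name for name, value in ratios.items() if value > upper_fence]

print(lower_fence, upper_fence)  # 825.0 1025.0
print(low_outliers, high_outliers)  # ['UT A', 'UT B'] ['State I']
```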
Even from the distributions of just the literacy rate and the sex ratio, we can immediately infer that India is a diverse country. A state like Kerala has a very high literacy rate and a very high sex ratio. In contrast, the state of Haryana has a very low sex ratio, and the state of Bihar has an extremely low literacy rate. States with substantial populations and very low literacy rates, such as Bihar and Uttar Pradesh, need the most attention. Making the weakest link stronger makes the whole chain better. To improve the sex ratio in India as a whole, the Kerala model should be studied, and its best practices should be implemented on a pilot basis in states like Bihar, UP, and Haryana.
The code used for this analysis can be found in the link given below.
In this post, we introduced the Census of India 2011 dataset and limited the goal of the analysis to literacy rates and the sex ratio within India. We calculated the literacy rate and sex ratio for each state and union territory and visualized their distributions as scatter plots and box plots. We have given our comments based on the analysis. In the next posts, we will explore other metrics emerging from the Census 2011 dataset.