Descriptive Statistics
Contents
import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')
Descriptive Statistics#
Let’s try to get some useful summary statistics of the variables in the data set. We can use describe()
get a quick summary of the variables.
df.describe()
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 2.434620e+05 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 | ... | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 | 243462.0 |
mean | 6.055354e+13 | 65.918899 | 14.177925 | 37.154234 | 14.586740 | 15.360073 | 21.363317 | 29.195509 | 1.633787 | 0.107035 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
std | 2.754098e+11 | 369.915675 | 67.735635 | 228.748166 | 88.340501 | 113.563218 | 87.713504 | 249.507029 | 31.753710 | 7.473124 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 6.001400e+13 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
25% | 6.037274e+13 | 2.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
50% | 6.059064e+13 | 7.000000 | 1.000000 | 4.000000 | 2.000000 | 3.000000 | 2.000000 | 1.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
75% | 6.073020e+13 | 32.000000 | 7.000000 | 17.000000 | 7.000000 | 9.000000 | 12.000000 | 8.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
max | 6.115041e+13 | 72275.000000 | 9319.000000 | 47140.000000 | 19589.000000 | 33722.000000 | 10538.000000 | 63521.000000 | 5007.000000 | 1994.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
8 rows × 53 columns
As you can see, this gives us the count, mean, standard deviation, minimum, 25th percentile (first quartile), 50th percentile (median), 75th percentile (third quartile), and maximum. Notice that it gives us these summaries even for the variables that don’t make sense (the geocode variables).
This gives a lot more information than you might want. What if we just wanted to look at a few columns? You might do something like the following.
df[["C000","CA01","CA02","CA03"]].describe()
C000 | CA01 | CA02 | CA03 | |
---|---|---|---|---|
count | 243462.000000 | 243462.000000 | 243462.000000 | 243462.000000 |
mean | 65.918899 | 14.177925 | 37.154234 | 14.586740 |
std | 369.915675 | 67.735635 | 228.748166 | 88.340501 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 2.000000 | 0.000000 | 1.000000 | 0.000000 |
50% | 7.000000 | 1.000000 | 4.000000 | 2.000000 |
75% | 32.000000 | 7.000000 | 17.000000 | 7.000000 |
max | 72275.000000 | 9319.000000 | 47140.000000 | 19589.000000 |
You might also decide that you only want to find certain values, like the mean. You can do that too.
df[["C000","CA01","CA02","CA03"]].mean()
C000 65.918899
CA01 14.177925
CA02 37.154234
CA03 14.586740
dtype: float64
More Descriptive Statistics#
Before we answer the question posed at the beginning of the chapter, we want to do a little more exploration of the data set. Many times, this is where we might find out how exactly the question should be framed. For example, we might want to know the distribution of jobs by age group for blocks with greater than 50 jobs. How would we do this?
Recall that we made a subset of the df
Data Frame earlier called df_over50
. We can use describe()
on this subsetted Data Frame.
df_over50 = df[df['C000'] > 50]
df_over50.describe()
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4.719100e+04 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | 47191.000000 | ... | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 | 47191.0 |
mean | 6.055942e+13 | 301.949588 | 65.180797 | 171.319023 | 65.449768 | 64.763726 | 96.130406 | 141.055455 | 6.926045 | 0.493653 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
std | 2.681706e+11 | 797.716541 | 142.816052 | 497.462352 | 192.378928 | 251.825776 | 180.693235 | 552.778265 | 71.682382 | 16.937156 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
min | 6.001400e+13 | 51.000000 | 0.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
25% | 6.037311e+13 | 77.000000 | 15.000000 | 41.000000 | 17.000000 | 16.000000 | 27.000000 | 20.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
50% | 6.059074e+13 | 131.000000 | 30.000000 | 70.000000 | 29.000000 | 32.000000 | 47.000000 | 43.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
75% | 6.075010e+13 | 274.000000 | 65.000000 | 149.000000 | 58.000000 | 63.000000 | 99.000000 | 102.000000 | 0.000000 | 0.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
max | 6.115041e+13 | 72275.000000 | 9319.000000 | 47140.000000 | 19589.000000 | 33722.000000 | 10538.000000 | 63521.000000 | 5007.000000 | 1994.000000 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20190826.0 |
8 rows × 53 columns
We can also create new variables. Here, we’re going to create an indicator for whether a census block has more than 50 jobs in it.
df["over50"] = df.C000 > 50
df[["C000","over50"]].head()
C000 | over50 | |
---|---|---|
0 | 30 | False |
1 | 4 | False |
2 | 3 | False |
3 | 11 | False |
4 | 10 | False |
Note that we don’t have a variable in df
called over50
before we run the first line in the above cell. What we’re doing with that one line is creating a new column called over50
and filling it with the values in the conditional statement df.C000 > 50
. We’ve then printed out the first few rows using head()
to see that it’s working properly.
Descriptive Statistics for Categorical Data#
So far, we’ve gone over some descriptive statistics for numerical data, like finding mean or standard deviations. However, you might also want to find descriptive statistics of categorical data. To do this, we’ll take a look at the crosswalk file.
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca//ca_xwalk.csv.gz'
xwalk = pd.read_csv(data_url,compression='gzip',low_memory=False)
xwalk.head()
tabblk2010 | st | stusps | stname | cty | ctyname | trct | trctname | bgrp | bgrpname | ... | stanrcname | necta | nectaname | mil | milname | stwib | stwibname | blklatdd | blklondd | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60030100002053 | 6 | CA | California | 6003 | Alpine County, CA | 6003010000 | 100 (Alpine, CA) | 60030100002 | 2 (Tract 100, Alpine, CA) | ... | NaN | 99999 | NaN | NaN | NaN | 6000005 | 05 Golden Sierra Consortium WIB | 38.795161 | -119.752439 | 20211018 |
1 | 60070013003023 | 6 | CA | California | 6007 | Butte County, CA | 6007001300 | 13 (Butte, CA) | 60070013003 | 3 (Tract 13, Butte, CA) | ... | NaN | 99999 | NaN | NaN | NaN | 6000016 | 16 NoRTEC WIB | 39.718686 | -121.816335 | 20211018 |
2 | 60150002031279 | 6 | CA | California | 6015 | Del Norte County, CA | 6015000203 | 2.03 (Del Norte, CA) | 60150002031 | 1 (Tract 2.03, Del Norte, CA) | ... | NaN | 99999 | NaN | NaN | NaN | 6000016 | 16 NoRTEC WIB | 41.508867 | -124.030365 | 20211018 |
3 | 60150002031336 | 6 | CA | California | 6015 | Del Norte County, CA | 6015000203 | 2.03 (Del Norte, CA) | 60150002031 | 1 (Tract 2.03, Del Norte, CA) | ... | NaN | 99999 | NaN | NaN | NaN | 6000016 | 16 NoRTEC WIB | 41.479259 | -123.954100 | 20211018 |
4 | 60150002031264 | 6 | CA | California | 6015 | Del Norte County, CA | 6015000203 | 2.03 (Del Norte, CA) | 60150002031 | 1 (Tract 2.03, Del Norte, CA) | ... | NaN | 99999 | NaN | NaN | NaN | 6000016 | 16 NoRTEC WIB | 41.488986 | -123.983607 | 20211018 |
5 rows × 43 columns
With categorical data, we usually want to look at breakdowns by the group. Let’s use the Micropolitan/Metropolitan area as an example. We can get the count of all non-NA values for each of these areas using a combination of groupby
and count
.
count_by_cbsa = xwalk.groupby('cbsaname').count()
count_by_cbsa.head()
tabblk2010 | st | stusps | stname | cty | ctyname | trct | trctname | bgrp | bgrpname | ... | stanrcname | necta | nectaname | mil | milname | stwib | stwibname | blklatdd | blklondd | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
cbsaname | |||||||||||||||||||||
Bakersfield, CA | 35280 | 35280 | 35280 | 35280 | 35280 | 35280 | 35280 | 35280 | 35280 | 35280 | ... | 0 | 35280 | 0 | 1511 | 1511 | 35280 | 35280 | 35280 | 35280 | 35280 |
Chico, CA | 6517 | 6517 | 6517 | 6517 | 6517 | 6517 | 6517 | 6517 | 6517 | 6517 | ... | 0 | 6517 | 0 | 0 | 0 | 6517 | 6517 | 6517 | 6517 | 6517 |
Clearlake, CA | 5164 | 5164 | 5164 | 5164 | 5164 | 5164 | 5164 | 5164 | 5164 | 5164 | ... | 0 | 5164 | 0 | 0 | 0 | 5164 | 5164 | 5164 | 5164 | 5164 |
Crescent City, CA | 2062 | 2062 | 2062 | 2062 | 2062 | 2062 | 2062 | 2062 | 2062 | 2062 | ... | 0 | 2062 | 0 | 0 | 0 | 2062 | 2062 | 2062 | 2062 | 2062 |
El Centro, CA | 8875 | 8875 | 8875 | 8875 | 8875 | 8875 | 8875 | 8875 | 8875 | 8875 | ... | 0 | 8875 | 0 | 103 | 103 | 8875 | 8875 | 8875 | 8875 | 8875 |
5 rows × 42 columns
The groupby
method first creates groups according to a certain variable. Here, we chose cbsaname
since that is the Micropolitan/Metropolitan name in the crosswalk data. We then just did a count
to count the number of non-NA values in each of the other variables. We could have also used something like sum if there was a value that we wanted to sum over, for example.
Checkpoint: Descriptive Statistics on Your Data#
Using the tools described above, describe the data for the state of your choice. Find measures such as mean and median. Try to think about the distribution of jobs by different characteristics like age group and industry. Which age group had the most jobs in the state? Which industry?
Using the crosswalk file for the state of your choice, try to get some breakdowns by county. Which county has the most census blocks?