import numpy as np
import pandas as pd

# Read the California LODES Workplace Area Characteristics (WAC) file directly from the Census Bureau
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')

Descriptive Statistics#

Let’s try to get some useful summary statistics of the variables in the data set. We can use describe() to get a quick summary of the variables.

df.describe()
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
count 2.434620e+05 243462.000000 243462.000000 243462.000000 243462.000000 243462.000000 243462.000000 243462.000000 243462.000000 243462.000000 ... 243462.0 243462.0 243462.0 243462.0 243462.0 243462.0 243462.0 243462.0 243462.0 243462.0
mean 6.055354e+13 65.918899 14.177925 37.154234 14.586740 15.360073 21.363317 29.195509 1.633787 0.107035 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
std 2.754098e+11 369.915675 67.735635 228.748166 88.340501 113.563218 87.713504 249.507029 31.753710 7.473124 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
min 6.001400e+13 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
25% 6.037274e+13 2.000000 0.000000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
50% 6.059064e+13 7.000000 1.000000 4.000000 2.000000 3.000000 2.000000 1.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
75% 6.073020e+13 32.000000 7.000000 17.000000 7.000000 9.000000 12.000000 8.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
max 6.115041e+13 72275.000000 9319.000000 47140.000000 19589.000000 33722.000000 10538.000000 63521.000000 5007.000000 1994.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0

8 rows × 53 columns

As you can see, this gives us the count, mean, standard deviation, minimum, 25th percentile (first quartile), 50th percentile (median), 75th percentile (third quartile), and maximum. Notice that it gives us these summaries even for variables where they don’t really make sense, such as w_geocode, which is a block identifier rather than a measurement.
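
If the identifier-style columns are getting in the way, one option is to drop them before summarizing. As a quick sketch (assuming w_geocode and createdate are the columns you want to leave out):

df.drop(columns=['w_geocode','createdate']).describe()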

This gives a lot more information than you might want. What if we just wanted to look at a few columns? You might do something like the following.

df[["C000","CA01","CA02","CA03"]].describe()
C000 CA01 CA02 CA03
count 243462.000000 243462.000000 243462.000000 243462.000000
mean 65.918899 14.177925 37.154234 14.586740
std 369.915675 67.735635 228.748166 88.340501
min 1.000000 0.000000 0.000000 0.000000
25% 2.000000 0.000000 1.000000 0.000000
50% 7.000000 1.000000 4.000000 2.000000
75% 32.000000 7.000000 17.000000 7.000000
max 72275.000000 9319.000000 47140.000000 19589.000000

You might also decide that you only want to find certain values, like the mean. You can do that too.

df[["C000","CA01","CA02","CA03"]].mean()
C000    65.918899
CA01    14.177925
CA02    37.154234
CA03    14.586740
dtype: float64
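
If you want a handful of statistics at once without the full describe() output, you can also pass a list of aggregation names to agg. A brief illustration on the same columns:

df[["C000","CA01","CA02","CA03"]].agg(['mean','median','std'])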

More Descriptive Statistics#

Before we answer the question posed at the beginning of the chapter, we want to do a little more exploration of the data set. Many times, this is where we might find out how exactly the question should be framed. For example, we might want to know the distribution of jobs by age group for blocks with greater than 50 jobs. How would we do this?

We can make a subset of the df DataFrame, which we’ll call df_over50, containing only the blocks with more than 50 jobs, and then use describe() on this subsetted DataFrame.

# Keep only the blocks with more than 50 total jobs
df_over50 = df[df['C000'] > 50]
df_over50.describe()
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
count 4.719100e+04 47191.000000 47191.000000 47191.000000 47191.000000 47191.000000 47191.000000 47191.000000 47191.000000 47191.000000 ... 47191.0 47191.0 47191.0 47191.0 47191.0 47191.0 47191.0 47191.0 47191.0 47191.0
mean 6.055942e+13 301.949588 65.180797 171.319023 65.449768 64.763726 96.130406 141.055455 6.926045 0.493653 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
std 2.681706e+11 797.716541 142.816052 497.462352 192.378928 251.825776 180.693235 552.778265 71.682382 16.937156 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
min 6.001400e+13 51.000000 0.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
25% 6.037311e+13 77.000000 15.000000 41.000000 17.000000 16.000000 27.000000 20.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
50% 6.059074e+13 131.000000 30.000000 70.000000 29.000000 32.000000 47.000000 43.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
75% 6.075010e+13 274.000000 65.000000 149.000000 58.000000 63.000000 99.000000 102.000000 0.000000 0.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0
max 6.115041e+13 72275.000000 9319.000000 47140.000000 19589.000000 33722.000000 10538.000000 63521.000000 5007.000000 1994.000000 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 20190826.0

8 rows × 53 columns
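
To get at the age-group question more directly, one option is to compute each age group’s share of all jobs across these blocks. This is a small sketch; it assumes CA01, CA02, and CA03 are the age-group counts (age 29 or younger, 30 to 54, and 55 or older) that together add up to C000.

df_over50[["CA01","CA02","CA03"]].sum() / df_over50["C000"].sum()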

We can also create new variables. Here, we’re going to create an indicator for whether a census block has more than 50 jobs in it.

df["over50"] = df.C000 > 50
df[["C000","over50"]].head()
C000 over50
0 30 False
1 4 False
2 3 False
3 11 False
4 10 False

Note that there is no variable in df called over50 before we run the first line in the cell above. That single line creates a new column called over50 and fills it with the result of the conditional statement df.C000 > 50. We’ve then printed out the first few rows using head() to check that it’s working properly.
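
This indicator also makes quick tabulations easy. For example, we can count how many blocks fall on each side of the 50-job cutoff with value_counts:

df["over50"].value_counts()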

Descriptive Statistics for Categorical Data#

So far, we’ve gone over some descriptive statistics for numerical data, like finding means or standard deviations. However, you might also want descriptive statistics for categorical data. To do this, we’ll take a look at the crosswalk file.

# Read the California geography crosswalk file
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/ca_xwalk.csv.gz'
xwalk = pd.read_csv(data_url,compression='gzip',low_memory=False)
xwalk.head()
tabblk2010 st stusps stname cty ctyname trct trctname bgrp bgrpname ... stanrcname necta nectaname mil milname stwib stwibname blklatdd blklondd createdate
0 60030100002053 6 CA California 6003 Alpine County, CA 6003010000 100 (Alpine, CA) 60030100002 2 (Tract 100, Alpine, CA) ... NaN 99999 NaN NaN NaN 6000005 05 Golden Sierra Consortium WIB 38.795161 -119.752439 20211018
1 60070013003023 6 CA California 6007 Butte County, CA 6007001300 13 (Butte, CA) 60070013003 3 (Tract 13, Butte, CA) ... NaN 99999 NaN NaN NaN 6000016 16 NoRTEC WIB 39.718686 -121.816335 20211018
2 60150002031279 6 CA California 6015 Del Norte County, CA 6015000203 2.03 (Del Norte, CA) 60150002031 1 (Tract 2.03, Del Norte, CA) ... NaN 99999 NaN NaN NaN 6000016 16 NoRTEC WIB 41.508867 -124.030365 20211018
3 60150002031336 6 CA California 6015 Del Norte County, CA 6015000203 2.03 (Del Norte, CA) 60150002031 1 (Tract 2.03, Del Norte, CA) ... NaN 99999 NaN NaN NaN 6000016 16 NoRTEC WIB 41.479259 -123.954100 20211018
4 60150002031264 6 CA California 6015 Del Norte County, CA 6015000203 2.03 (Del Norte, CA) 60150002031 1 (Tract 2.03, Del Norte, CA) ... NaN 99999 NaN NaN NaN 6000016 16 NoRTEC WIB 41.488986 -123.983607 20211018

5 rows × 43 columns

With categorical data, we usually want to look at breakdowns by the group. Let’s use the Micropolitan/Metropolitan area as an example. We can get the count of all non-NA values for each of these areas using a combination of groupby and count.

count_by_cbsa = xwalk.groupby('cbsaname').count()
count_by_cbsa.head()
tabblk2010 st stusps stname cty ctyname trct trctname bgrp bgrpname ... stanrcname necta nectaname mil milname stwib stwibname blklatdd blklondd createdate
cbsaname
Bakersfield, CA 35280 35280 35280 35280 35280 35280 35280 35280 35280 35280 ... 0 35280 0 1511 1511 35280 35280 35280 35280 35280
Chico, CA 6517 6517 6517 6517 6517 6517 6517 6517 6517 6517 ... 0 6517 0 0 0 6517 6517 6517 6517 6517
Clearlake, CA 5164 5164 5164 5164 5164 5164 5164 5164 5164 5164 ... 0 5164 0 0 0 5164 5164 5164 5164 5164
Crescent City, CA 2062 2062 2062 2062 2062 2062 2062 2062 2062 2062 ... 0 2062 0 0 0 2062 2062 2062 2062 2062
El Centro, CA 8875 8875 8875 8875 8875 8875 8875 8875 8875 8875 ... 0 8875 0 103 103 8875 8875 8875 8875 8875

5 rows × 42 columns

The groupby method first creates groups according to a certain variable. Here, we chose cbsaname, since that is the Micropolitan/Metropolitan area name in the crosswalk data. We then used count to tally the number of non-NA values in each of the other variables. We could also have used something like sum if there were a value we wanted to add up within each group, for example.
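
As an illustration of that, we could go back to the jobs data, group on the over50 indicator we created earlier, and sum the total job counts in each group:

df.groupby('over50')['C000'].sum()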

Checkpoint: Descriptive Statistics on Your Data#

Using the tools described above, describe the data for the state of your choice. Find measures such as mean and median. Try to think about the distribution of jobs by different characteristics like age group and industry. Which age group had the most jobs in the state? Which industry?

Using the crosswalk file for the state of your choice, try to get some breakdowns by county. Which county has the most census blocks?