Data#
In this book, we will use a set of public datasets from the Longitudinal Employer-Household Dynamics (LEHD) program of the United States Census Bureau. In particular, we will use the LEHD Origin-Destination Employment Statistics (LODES) data. These data are tabulated from administrative records and give information about the workplaces and residences of workers at the census block level. There are four main types of data that we will use.
Workplace Area Characteristics (WAC): Census block level. Job totals for workplaces in the census block.
Residence Area Characteristics (RAC): Census block level. Job totals by the census block in which the workers live.
Origin-Destination (OD): Origin census block - Destination census block pair level.
Crosswalk (xwalk): Census block level. Lists every census block within the state, along with geographic information about each block (e.g., city, county).
Workplace Area Characteristics (WAC) and Residence Area Characteristics (RAC)#
The WAC and RAC data generally look something like the following:
import pandas as pd
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/wac/md_wac_S000_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()
 | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 240010001001023 | 8 | 3 | 4 | 1 | 4 | 4 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
1 | 240010001001025 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
2 | 240010001001054 | 10 | 2 | 3 | 5 | 7 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
3 | 240010001001113 | 2 | 0 | 2 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
4 | 240010001002061 | 8 | 4 | 4 | 0 | 7 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
5 rows × 53 columns
Here, each row represents a census block (this particular table contains data from Maryland). The w_geocode variable gives the block code, which serves as the unique identifier for the census block, and the C000 variable represents the total number of jobs in that census block. Most of the remaining variables break down the number of jobs by various categories. For example, CA01, CA02, and CA03 break down the jobs by age group:
CA01: Number of jobs for workers age 29 or younger
CA02: Number of jobs for workers age 30 to 54
CA03: Number of jobs for workers age 55 or older
So, the sum of those columns should be equal to the value in C000.
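The age-group identity can be checked directly. The following is a minimal sketch that rebuilds the five Maryland rows shown in the table as a small DataFrame (rather than downloading the full file) and compares the sum of CA01, CA02, and CA03 against C000. The names wac_sample, age_cols, age_total, and ages_match_total are illustrative and do not come from the LODES data.

```python
import pandas as pd

# The five rows displayed above, restricted to the total and age-group columns.
wac_sample = pd.DataFrame({
    "w_geocode": ["240010001001023", "240010001001025", "240010001001054",
                  "240010001001113", "240010001002061"],
    "C000": [8, 1, 10, 2, 8],
    "CA01": [3, 0, 2, 0, 4],
    "CA02": [4, 1, 3, 2, 4],
    "CA03": [1, 0, 5, 0, 0],
})

# Row-wise sum of the three age-group columns.
age_cols = ["CA01", "CA02", "CA03"]
age_total = wac_sample[age_cols].sum(axis=1)

# True when the age groups add up to the total job count in every block.
ages_match_total = bool((age_total == wac_sample["C000"]).all())
print(ages_match_total)
```

The same comparison should work on the full table returned by pd.read_csv, since it only relies on the column names shown above.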
The same applies to the RAC data, except that jobs are counted by the census block where the worker lives rather than the block where the job is located. So, the C000 column in the RAC data represents all workers who lived in that census block, and the CA01, CA02, and CA03 variables represent the number of workers within each age group who lived in that census block.
Note that for both of these datasets, the unit of observation is the census block.
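Only the WAC file is loaded in the example above. As a rough sketch, the helper below builds a file URL by following the pattern of that WAC URL. The function lodes_area_url and its parameter names are hypothetical (they are not part of pandas or of any Census Bureau tool), and the RAC URL it produces is an assumption made by analogy with the WAC URL, so it should be checked against the LODES documentation before use.

```python
BASE_URL = "https://lehd.ces.census.gov/data/lodes/LODES7"


def lodes_area_url(state, kind, year, segment="S000", job_type="JT00"):
    """Build a WAC or RAC URL that mirrors the WAC URL used above.

    Hypothetical helper: the RAC layout is assumed to match the WAC layout.
    """
    if kind not in ("wac", "rac"):
        raise ValueError("kind must be 'wac' or 'rac'")
    filename = f"{state}_{kind}_{segment}_{job_type}_{year}.csv.gz"
    return f"{BASE_URL}/{state}/{kind}/{filename}"


wac_url = lodes_area_url("md", "wac", 2015)  # reproduces the URL used above
rac_url = lodes_area_url("md", "rac", 2015)  # assumed by analogy
print(wac_url)
print(rac_url)
```

Either URL can then be passed to pd.read_csv in the same way as in the WAC example.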
Origin-Destination#
The Origin-Destination file looks like this:
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/od/md_od_main_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()
 | w_geocode | h_geocode | S000 | SA01 | SA02 | SA03 | SE01 | SE02 | SE03 | SI01 | SI02 | SI03 | createdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 240010001001023 | 240010001002184 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 20190826 |
1 | 240010001001023 | 240010001003108 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 20190826 |
2 | 240010001001023 | 240010002003023 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 20190826 |
3 | 240010001001023 | 240010022001060 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 20190826 |
4 | 240010001001023 | 240430107002095 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 20190826 |
Here, each row represents a w_geocode-h_geocode pair. That is, each row is a pair of census blocks for which at least one person worked in the w_geocode census block and lived in the h_geocode census block. The S000 variable represents how many people lived in the h_geocode census block and worked in the w_geocode census block.
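Because each row is a block pair, totals for a workplace or for a home area come from summing S000 over rows. The sketch below illustrates this with the five rows displayed in the table, so its totals cover only those rows and not the full file. It assumes the block codes are held as strings, and it uses the fact that the first five digits of a 15-digit block code are the state and county FIPS codes. The names od_sample, h_county, workers_by_workplace, and workers_by_home_county are illustrative.

```python
import pandas as pd

# The five rows displayed above, keeping only the block codes and the total.
# Block codes are stored as strings so they can be sliced and keep leading zeros.
od_sample = pd.DataFrame({
    "w_geocode": ["240010001001023"] * 5,
    "h_geocode": ["240010001002184", "240010001003108", "240010002003023",
                  "240010022001060", "240430107002095"],
    "S000": [1, 1, 1, 1, 1],
})

# Workers commuting into each workplace block (within this sample only).
workers_by_workplace = od_sample.groupby("w_geocode")["S000"].sum()

# First five digits of a block code: state (2 digits) + county (3 digits).
od_sample["h_county"] = od_sample["h_geocode"].str[:5]
workers_by_home_county = od_sample.groupby("h_county")["S000"].sum()

print(workers_by_workplace)
print(workers_by_home_county)
```

When reading the real file, passing dtype={"w_geocode": str, "h_geocode": str} to pd.read_csv keeps the codes as strings, which matters for states whose FIPS code begins with a zero.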
Crosswalk#
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/md_xwalk.csv.gz'
pd.read_csv(URL, compression='gzip').head()
 | tabblk2010 | st | stusps | stname | cty | ctyname | trct | trctname | bgrp | bgrpname | ... | stanrcname | necta | nectaname | mil | milname | stwib | stwibname | blklatdd | blklondd | createdate |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 240037312011001 | 24 | MD | Maryland | 24003 | Anne Arundel County, MD | 24003731201 | 7312.01 (Anne Arundel, MD) | 240037312011 | 1 (Tract 7312.01, Anne Arundel, MD) | ... | NaN | 99999 | NaN | NaN | NaN | 24001001 | 01 Anne Arundel WIA | 39.086213 | -76.536457 | 20211018 |
1 | 240037012001003 | 24 | MD | Maryland | 24003 | Anne Arundel County, MD | 24003701200 | 7012 (Anne Arundel, MD) | 240037012001 | 1 (Tract 7012, Anne Arundel, MD) | ... | NaN | 99999 | NaN | NaN | NaN | 24001001 | 01 Anne Arundel WIA | 38.926495 | -76.537151 | 20211018 |
2 | 240037025001034 | 24 | MD | Maryland | 24003 | Anne Arundel County, MD | 24003702500 | 7025 (Anne Arundel, MD) | 240037025001 | 1 (Tract 7025, Anne Arundel, MD) | ... | NaN | 99999 | NaN | NaN | NaN | 24001001 | 01 Anne Arundel WIA | 38.951701 | -76.550784 | 20211018 |
3 | 240037027022009 | 24 | MD | Maryland | 24003 | Anne Arundel County, MD | 24003702702 | 7027.02 (Anne Arundel, MD) | 240037027022 | 2 (Tract 7027.02, Anne Arundel, MD) | ... | NaN | 99999 | NaN | NaN | NaN | 24001001 | 01 Anne Arundel WIA | 39.011417 | -76.527626 | 20211018 |
4 | 240037025004020 | 24 | MD | Maryland | 24003 | Anne Arundel County, MD | 24003702500 | 7025 (Anne Arundel, MD) | 240037025004 | 4 (Tract 7025, Anne Arundel, MD) | ... | NaN | 99999 | NaN | NaN | NaN | 24001001 | 01 Anne Arundel WIA | 38.947590 | -76.538524 | 20211018 |
5 rows × 43 columns
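The crosswalk is typically useful for attaching larger geographies to block-level data. The following is a minimal sketch of that idea, joining block-level job totals to the crosswalk on the block code and then summing by county. The crosswalk rows are taken from the table above, but the job totals in wac_toy are made up for illustration, and the names xwalk_sample, wac_toy, merged, and jobs_by_county are hypothetical.

```python
import pandas as pd

# Three crosswalk rows from the table above (a subset of the 43 columns).
xwalk_sample = pd.DataFrame({
    "tabblk2010": ["240037312011001", "240037012001003", "240037025001034"],
    "cty": ["24003", "24003", "24003"],
    "ctyname": ["Anne Arundel County, MD"] * 3,
})

# Made-up job totals for two of those blocks (illustrative values only).
wac_toy = pd.DataFrame({
    "w_geocode": ["240037312011001", "240037025001034"],
    "C000": [12, 7],
})

# Attach county information to each block; every block code should match
# at most one crosswalk row, which validate="many_to_one" enforces.
merged = wac_toy.merge(
    xwalk_sample,
    how="left",
    left_on="w_geocode",
    right_on="tabblk2010",
    validate="many_to_one",
)

# Aggregate block-level job totals up to the county level.
jobs_by_county = merged.groupby("ctyname")["C000"].sum()
print(jobs_by_county)
```

For the join to work on the downloaded files, the key columns need the same type on both sides, for example by reading w_geocode and tabblk2010 as strings.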
For more information about the datasets used in the examples, please refer to the data documentation provided at this link.