Data

In this book, we will use a set of public datasets from the Longitudinal Employer-Household Dynamics (LEHD) program at the United States Census Bureau. In particular, we will use the LEHD Origin-Destination Employment Statistics (LODES) data. These data are tabulated from administrative records and describe where workers work and where they live, at the census block level. We will use four main types of files.

  • Workplace Area Characteristics (WAC): Census block level. Job totals for the workplaces located in each census block.

  • Residence Area Characteristics (RAC): Census block level. Job totals for the workers who live in each census block.

  • Origin-Destination (OD): Origin census block - Destination census block pair level. Job totals for workers who live in the origin block and work in the destination block.

  • Crosswalk (xwalk): Census block level. Lists every census block in a state, along with information about each block (e.g. city, county).
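The download URLs used in this chapter follow a regular naming pattern, so the four file types can be addressed with one small helper. The sketch below is illustrative only: `lodes_url` and its parameters are hypothetical names, the pattern is inferred from the three URLs that appear in this chapter, and the RAC path is assumed to mirror the WAC path.

```python
BASE_URL = "https://lehd.ces.census.gov/data/lodes/LODES7"


def lodes_url(state, kind, year=2015, segment="S000", job_type="JT00", part="main"):
    """Build the URL of one LODES7 file (hypothetical helper, not an official API)."""
    state = state.lower()
    if kind in ("wac", "rac"):
        # e.g. md_wac_S000_JT00_2015.csv.gz (RAC naming assumed to match WAC)
        name = f"{state}_{kind}_{segment}_{job_type}_{year}.csv.gz"
        return f"{BASE_URL}/{state}/{kind}/{name}"
    if kind == "od":
        # e.g. md_od_main_JT00_2015.csv.gz
        name = f"{state}_od_{part}_{job_type}_{year}.csv.gz"
        return f"{BASE_URL}/{state}/od/{name}"
    if kind == "xwalk":
        # The crosswalk sits directly under the state folder and has no year.
        return f"{BASE_URL}/{state}/{state}_xwalk.csv.gz"
    raise ValueError(f"unknown file kind: {kind!r}")


# One URL per file type for Maryland.
urls = {kind: lodes_url("md", kind) for kind in ("wac", "rac", "od", "xwalk")}
for kind, url in urls.items():
    print(kind, url)
```

Each URL can then be passed to `pd.read_csv(url, compression='gzip')`, as in the cells in this chapter.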

Workplace Area Characteristics (WAC) and Residence Area Characteristics (RAC)

The WAC and RAC files look broadly alike. For example, the 2015 WAC file for Maryland begins as follows:

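import pandas as pd

# 2015 Maryland WAC file: all worker segments (S000), all job types (JT00)
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/wac/md_wac_S000_JT00_2015.csv.gz'
# pandas reads the gzip-compressed CSV directly from the URL
pd.read_csv(URL, compression='gzip').head()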
         w_geocode  C000  CA01  CA02  CA03  CE01  CE02  CE03  CNS01  CNS02  ...  CFA02  CFA03  CFA04  CFA05  CFS01  CFS02  CFS03  CFS04  CFS05  createdate
0  240010001001023     8     3     4     1     4     4     0      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
1  240010001001025     1     0     1     0     0     1     0      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
2  240010001001054    10     2     3     5     7     3     0      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
3  240010001001113     2     0     2     0     0     1     1      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
4  240010001002061     8     4     4     0     7     1     0      0      0  ...      0      0      0      0      0      0      0      0      0    20190826

5 rows × 53 columns

Here, each row represents a census block (this particular table contains data from Maryland). The w_geocode column holds the block code, which uniquely identifies the census block, and the C000 variable gives the total number of jobs in that block. The remaining variables break down that total by various categories. For example, CA01, CA02, and CA03 break down the jobs by worker age group:

  • CA01: Number of jobs for workers age 29 or younger

  • CA02: Number of jobs for workers age 30 to 54

  • CA03: Number of jobs for workers age 55 or older

Because every job falls into exactly one of these age groups, the three columns should sum to the value in C000.
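This additivity can be checked directly. The sketch below rebuilds the five Maryland rows shown in the table above (only the total and age columns) rather than downloading the file, and counts the rows where the age groups do not add up to C000. The variable names are illustrative.

```python
import pandas as pd

# The five WAC rows displayed above, restricted to the total and age columns.
wac = pd.DataFrame({
    "w_geocode": ["240010001001023", "240010001001025", "240010001001054",
                  "240010001001113", "240010001002061"],
    "C000": [8, 1, 10, 2, 8],
    "CA01": [3, 0, 2, 0, 4],
    "CA02": [4, 1, 3, 2, 4],
    "CA03": [1, 0, 5, 0, 0],
})

# Row-wise sum of the three age-group columns.
age_total = wac[["CA01", "CA02", "CA03"]].sum(axis=1)

# Rows where the age breakdown disagrees with the overall total.
mismatches = wac[age_total != wac["C000"]]
print(len(mismatches))  # 0 for these five rows
```

The same comparison can be run on the full file, or on any other group of breakdown columns, by changing the column list.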

The same applies to the RAC data, except that jobs are counted by where the worker lives rather than where the job is located, and the block identifier is h_geocode. So, the C000 column in the RAC data represents the jobs held by workers who lived in that census block. The CA01, CA02, and CA03 variables represent the number of those jobs held by workers in each age group.

Note that for both of these datasets, the unit of observation is the census block.
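A census block code is a 15-digit identifier made up of a 2-digit state code, a 3-digit county code, a 6-digit tract code, and a 4-digit block code. The sketch below splits a block code into those parts; `split_block_geocode` is a hypothetical helper name. It also pads with leading zeros, because a code read as an integer (the default when loading these files with pandas) loses its leading zero for states whose code starts with 0. Passing `dtype=str` for the geocode columns in `pd.read_csv` is one way to avoid that.

```python
def split_block_geocode(geocode):
    """Split a 15-digit census block code into its components (hypothetical helper)."""
    # Restore any leading zero that was dropped when the code was read as an integer.
    code = str(geocode).zfill(15)
    if len(code) != 15 or not code.isdigit():
        raise ValueError(f"not a 15-digit block code: {geocode!r}")
    return {
        "state": code[:2],     # 2-digit state FIPS code
        "county": code[2:5],   # 3-digit county code within the state
        "tract": code[5:11],   # 6-digit census tract code
        "block": code[11:],    # 4-digit block code within the tract
    }


# First block of the Maryland WAC table shown above.
parts = split_block_geocode("240010001001023")
print(parts)

# An illustrative 14-digit integer, as produced when a leading zero is dropped.
padded = split_block_geocode(10010201001000)
print(padded["state"])
```

The first five digits (state plus county) are the same county code that appears in the cty column of the crosswalk file.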

Origin-Destination

The Origin-Destination file looks like this:

# 2015 Maryland OD file, "main" part (home and workplace both in the state), all job types (JT00)
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/od/md_od_main_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()
         w_geocode        h_geocode  S000  SA01  SA02  SA03  SE01  SE02  SE03  SI01  SI02  SI03  createdate
0  240010001001023  240010001002184     1     0     1     0     0     1     0     0     1     0    20190826
1  240010001001023  240010001003108     1     0     1     0     0     1     0     0     1     0    20190826
2  240010001001023  240010002003023     1     0     0     1     1     0     0     0     1     0    20190826
3  240010001001023  240010022001060     1     0     1     0     0     1     0     0     1     0    20190826
4  240010001001023  240430107002095     1     1     0     0     1     0     0     0     1     0    20190826

Here, each row represents a w_geocode-h_geocode pair, that is, a pair of census blocks for which at least one person worked in the w_geocode census block and lived in the h_geocode census block. The S000 variable counts the jobs held by people who lived in the h_geocode census block and worked in the w_geocode census block.
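Because each row is a home-work pair, block-level flows can be rolled up to coarser geographies by grouping on a prefix of the two block codes. The sketch below rebuilds the five OD rows shown above and sums S000 into county-to-county flows, using the first five digits of each code (state plus county). It assumes the codes are held as strings, and the column names `h_county` and `w_county` are illustrative.

```python
import pandas as pd

# The five OD rows displayed above, with block codes stored as strings.
od = pd.DataFrame({
    "w_geocode": ["240010001001023"] * 5,
    "h_geocode": ["240010001002184", "240010001003108", "240010002003023",
                  "240010022001060", "240430107002095"],
    "S000": [1, 1, 1, 1, 1],
})

# The first five digits of a block code identify the state and county.
od["w_county"] = od["w_geocode"].str[:5]
od["h_county"] = od["h_geocode"].str[:5]

# Total jobs for each (home county, work county) pair.
county_flows = od.groupby(["h_county", "w_county"])["S000"].sum()
print(county_flows)
```

In these five rows, four jobs are held by residents of the same county as the workplace (24001) and one by a resident of county 24043.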

Crosswalk

The crosswalk file looks like this:

# Maryland geography crosswalk: one row per census block
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/md_xwalk.csv.gz'
pd.read_csv(URL, compression='gzip').head()
        tabblk2010  st  stusps    stname    cty                  ctyname         trct                    trctname          bgrp                             bgrpname  ...  stanrcname  necta  nectaname  mil  milname     stwib            stwibname   blklatdd    blklondd  createdate
0  240037312011001  24      MD  Maryland  24003  Anne Arundel County, MD  24003731201  7312.01 (Anne Arundel, MD)  240037312011  1 (Tract 7312.01, Anne Arundel, MD)  ...         NaN  99999        NaN  NaN      NaN  24001001  01 Anne Arundel WIA  39.086213  -76.536457    20211018
1  240037012001003  24      MD  Maryland  24003  Anne Arundel County, MD  24003701200     7012 (Anne Arundel, MD)  240037012001     1 (Tract 7012, Anne Arundel, MD)  ...         NaN  99999        NaN  NaN      NaN  24001001  01 Anne Arundel WIA  38.926495  -76.537151    20211018
2  240037025001034  24      MD  Maryland  24003  Anne Arundel County, MD  24003702500     7025 (Anne Arundel, MD)  240037025001     1 (Tract 7025, Anne Arundel, MD)  ...         NaN  99999        NaN  NaN      NaN  24001001  01 Anne Arundel WIA  38.951701  -76.550784    20211018
3  240037027022009  24      MD  Maryland  24003  Anne Arundel County, MD  24003702702  7027.02 (Anne Arundel, MD)  240037027022  2 (Tract 7027.02, Anne Arundel, MD)  ...         NaN  99999        NaN  NaN      NaN  24001001  01 Anne Arundel WIA  39.011417  -76.527626    20211018
4  240037025004020  24      MD  Maryland  24003  Anne Arundel County, MD  24003702500     7025 (Anne Arundel, MD)  240037025004     4 (Tract 7025, Anne Arundel, MD)  ...         NaN  99999        NaN  NaN      NaN  24001001  01 Anne Arundel WIA  38.947590  -76.538524    20211018

5 rows × 43 columns
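The crosswalk is mainly useful as a lookup table: joining it to a WAC or RAC table on the block code attaches county, tract, and other geography labels, which can then be used for aggregation. The sketch below shows one way to do this with a left merge. The crosswalk rows are taken from the table above, but the job counts in `wac` are made up purely for illustration and are not real LODES values.

```python
import pandas as pd

# Three crosswalk rows from the table above (block code, county code, county name).
xwalk = pd.DataFrame({
    "tabblk2010": ["240037312011001", "240037012001003", "240037025001034"],
    "cty": ["24003", "24003", "24003"],
    "ctyname": ["Anne Arundel County, MD"] * 3,
})

# HYPOTHETICAL job totals for two of those blocks (illustrative values only).
wac = pd.DataFrame({
    "w_geocode": ["240037312011001", "240037025001034"],
    "C000": [12, 5],
})

# Attach geography labels to each workplace block. validate= raises an error
# if a block code appears more than once in the crosswalk.
merged = wac.merge(
    xwalk,
    left_on="w_geocode",
    right_on="tabblk2010",
    how="left",
    validate="many_to_one",
)

# Total jobs per county.
county_jobs = merged.groupby(["cty", "ctyname"])["C000"].sum().reset_index()
print(county_jobs)
```

The join keys must have the same type in both tables. Here both are strings, and if both files are read with the pandas defaults, both are integers, so the merge works in either case.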

For more information about the datasets used in the examples, please refer to the data documentation provided at this link.