Data

In this book, we will use a set of public datasets from the Longitudinal Employer-Household Dynamics (LEHD) program of the United States Census Bureau. In particular, we will use the LEHD Origin-Destination Employment Statistics (LODES) data. These data are tabulated from administrative records and give information about the workplaces and residences of workers at the census block level. We will use four main types of files.

Workplace Area Characteristics (WAC): Census block level. Job totals by the census block in which the jobs are located.
Residence Area Characteristics (RAC): Census block level. Job totals by the census block in which the workers live.
Origin-Destination (OD): Origin census block - destination census block pair level. Job totals for each pair of home and workplace census blocks.
Crosswalk (xwalk): Census block level. Lists every census block in a state, along with the larger geographies that contain it (e.g. city, county).

Workplace Area Characteristics (WAC) and Residence Area Characteristics (RAC)

The WAC and RAC data generally look something like the following:

import pandas as pd

# Maryland WAC file: total workforce segment (S000), all jobs (JT00), 2015
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/wac/md_wac_S000_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()

	w_geocode	C000	CA01	CA02	CA03	CE01	CE02	CE03	...	createdate
0	240010001001023	8	3	4	1	4	4	0	...	20190826
1	240010001001025	1	0	1	0	0	1	0	...	20190826
2	240010001001054	10	2	3	5	7	3	0	...	20190826
3	240010001001113	2	0	2	0	0	1	1	...	20190826
4	240010001002061	8	4	4	0	7	1	0	...	20190826

5 rows × 53 columns

Here, each row represents a census block (this particular table contains data from Maryland). The w_geocode column holds the block code, which serves as the unique identifier for the census block, and the C000 variable gives the total number of jobs in that census block. The rest of the variables break down the number of jobs by various categories. For example, CA01, CA02, and CA03 break down the jobs by age group:

CA01: Number of jobs for workers age 29 or younger
CA02: Number of jobs for workers age 30 to 54
CA03: Number of jobs for workers age 55 or older

So, the sum of those columns should be equal to the value in C000.
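
The check described above can be sketched on the five rows shown in the table. This is only an illustration: `wac_sample` is a hand-typed copy of the displayed values (selected columns only), not the full file.

```python
import pandas as pd

# First five rows of the Maryland WAC table shown above (selected columns only).
wac_sample = pd.DataFrame({
    "w_geocode": [
        "240010001001023",
        "240010001001025",
        "240010001001054",
        "240010001001113",
        "240010001002061",
    ],
    "C000": [8, 1, 10, 2, 8],
    "CA01": [3, 0, 2, 0, 4],
    "CA02": [4, 1, 3, 2, 4],
    "CA03": [1, 0, 5, 0, 0],
})

# Add the three age-group columns row by row and compare with the total.
age_cols = ["CA01", "CA02", "CA03"]
age_total = wac_sample[age_cols].sum(axis=1)
matches = age_total.eq(wac_sample["C000"])

print(matches.all())
```

Running the same comparison on the full table is a quick sanity check after loading a new file.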

The same applies to the RAC data, except that jobs are counted by where the workers live rather than where they work. So, the C000 column in the RAC data counts the jobs held by workers who lived in that census block. The CA01, CA02, and CA03 variables give the same count broken down by the age group of the workers who lived in that census block.
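
Since the WAC and RAC files share a layout, it can be convenient to build their URLs from the same template. The sketch below is based only on the WAC URL used above; the helper name `lodes_area_url` is hypothetical, and the RAC path is assumed to follow the same naming pattern as the WAC path.

```python
# Base path taken from the WAC URL used earlier in this section.
BASE = "https://lehd.ces.census.gov/data/lodes/LODES7"


def lodes_area_url(state, kind, year, segment="S000", job_type="JT00"):
    """Build the URL of a WAC or RAC file (hypothetical helper).

    Assumes RAC files are named like the WAC file shown above,
    with 'rac' in place of 'wac'.
    """
    if kind not in ("wac", "rac"):
        raise ValueError("kind must be 'wac' or 'rac'")
    state = state.lower()
    filename = f"{state}_{kind}_{segment}_{job_type}_{year}.csv.gz"
    return f"{BASE}/{state}/{kind}/{filename}"


wac_url = lodes_area_url("md", "wac", 2015)
rac_url = lodes_area_url("md", "rac", 2015)
print(wac_url)
print(rac_url)
```

Either URL can then be passed to `pd.read_csv(..., compression='gzip')` exactly as in the cell above.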

Note that for both of these datasets, the unit of observation is the census block.
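
Because each row should be exactly one census block, it can be worth confirming that the block code is unique after loading. The sketch below also reads the block code as text, since block codes for states whose FIPS code begins with 0 would lose that leading digit if parsed as integers. The helper `read_block_table` and the third block code in the sample are made up for illustration; the first two rows come from the table above.

```python
import io

import pandas as pd


def read_block_table(source, id_col):
    """Read a block-level CSV, keeping the block code as a string.

    Raises ValueError if the block code does not identify rows uniquely.
    (Hypothetical helper, not part of pandas or LODES.)
    """
    df = pd.read_csv(source, dtype={id_col: str})
    if not df[id_col].is_unique:
        raise ValueError(f"{id_col} does not uniquely identify rows")
    return df


# Small in-memory stand-in for a WAC file. The last block code is invented
# to show a leading zero surviving the read.
csv_text = (
    "w_geocode,C000\n"
    "240010001001023,8\n"
    "240010001001025,1\n"
    "060000000000001,3\n"
)

blocks = read_block_table(io.StringIO(csv_text), "w_geocode")
print(blocks["w_geocode"].tolist())
```

The same function accepts a file path or URL in place of the in-memory buffer.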

Origin-Destination

The Origin-Destination file looks like this:

# Maryland OD "main" file: all jobs (JT00), 2015
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/od/md_od_main_JT00_2015.csv.gz'
pd.read_csv(URL, compression='gzip').head()

	w_geocode	h_geocode	S000	SA01	SA02	SA03	SE01	SE02	SI02	createdate
0	240010001001023	240010001002184	1	0	1	0	0	1	1	20190826
1	240010001001023	240010001003108	1	0	1	0	0	1	1	20190826
2	240010001001023	240010002003023	1	0	0	1	1	0	1	20190826
3	240010001001023	240010022001060	1	0	1	0	0	1	1	20190826
4	240010001001023	240430107002095	1	1	0	0	1	0	1	20190826

Here, each of the rows represents a w_geocode-h_geocode pair. That is, each row is a pair of census blocks for which there was at least one person who worked in the w_geocode census block and lived in the h_geocode census block. The S000 variable represents how many people lived in the h_geocode census block and worked in the w_geocode census block.
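
Since each row is a home-workplace pair, totals for either side can be recovered by grouping. The sketch below uses a hand-typed copy of the five rows shown above (`od_sample`), so the totals cover only those rows. It also uses the fact that the first five digits of a 15-digit block code are the state and county FIPS codes, which requires the codes to be stored as strings.

```python
import pandas as pd

# The five OD rows displayed above (geocodes and S000 only), typed as strings.
od_sample = pd.DataFrame({
    "w_geocode": ["240010001001023"] * 5,
    "h_geocode": [
        "240010001002184",
        "240010001003108",
        "240010002003023",
        "240010022001060",
        "240430107002095",
    ],
    "S000": [1, 1, 1, 1, 1],
})

# Jobs located in each workplace block, summed over all home blocks.
jobs_by_workplace = od_sample.groupby("w_geocode")["S000"].sum()

# Home county of each worker: state (2 digits) + county (3 digits).
od_sample["h_county"] = od_sample["h_geocode"].str[:5]
jobs_by_home_county = od_sample.groupby("h_county")["S000"].sum()

print(jobs_by_workplace)
print(jobs_by_home_county)
```

In this small sample, all five jobs are in the same workplace block, and the workers come from two different home counties.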

Crosswalk

The crosswalk file looks like this:

# Maryland geography crosswalk: one row per census block
URL = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/md_xwalk.csv.gz'
pd.read_csv(URL, compression='gzip').head()

	tabblk2010	st	stusps	stname	cty	ctyname	trct	trctname	bgrp	bgrpname	...	stanrcname	necta	nectaname	mil	milname	stwib	stwibname	blklatdd	blklondd	createdate
0	240037312011001	24	MD	Maryland	24003	Anne Arundel County, MD	24003731201	7312.01 (Anne Arundel, MD)	240037312011	1 (Tract 7312.01, Anne Arundel, MD)	...	NaN	99999	NaN	NaN	NaN	24001001	01 Anne Arundel WIA	39.086213	-76.536457	20211018
1	240037012001003	24	MD	Maryland	24003	Anne Arundel County, MD	24003701200	7012 (Anne Arundel, MD)	240037012001	1 (Tract 7012, Anne Arundel, MD)	...	NaN	99999	NaN	NaN	NaN	24001001	01 Anne Arundel WIA	38.926495	-76.537151	20211018
2	240037025001034	24	MD	Maryland	24003	Anne Arundel County, MD	24003702500	7025 (Anne Arundel, MD)	240037025001	1 (Tract 7025, Anne Arundel, MD)	...	NaN	99999	NaN	NaN	NaN	24001001	01 Anne Arundel WIA	38.951701	-76.550784	20211018
3	240037027022009	24	MD	Maryland	24003	Anne Arundel County, MD	24003702702	7027.02 (Anne Arundel, MD)	240037027022	2 (Tract 7027.02, Anne Arundel, MD)	...	NaN	99999	NaN	NaN	NaN	24001001	01 Anne Arundel WIA	39.011417	-76.527626	20211018
4	240037025004020	24	MD	Maryland	24003	Anne Arundel County, MD	24003702500	7025 (Anne Arundel, MD)	240037025004	4 (Tract 7025, Anne Arundel, MD)	...	NaN	99999	NaN	NaN	NaN	24001001	01 Anne Arundel WIA	38.947590	-76.538524	20211018

5 rows × 43 columns
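
A typical use of the crosswalk is to attach county information to a block-level table and then aggregate. The following is a minimal sketch under stated assumptions: the crosswalk rows are copied from the output above, but the job counts in `wac_sample` are made up purely for illustration, and joining w_geocode to tabblk2010 assumes both hold the same block codes as strings.

```python
import pandas as pd

# Three crosswalk rows from the output above (selected columns only).
xwalk_sample = pd.DataFrame({
    "tabblk2010": ["240037312011001", "240037012001003", "240037025001034"],
    "cty": ["24003", "24003", "24003"],
    "ctyname": ["Anne Arundel County, MD"] * 3,
})

# Illustrative WAC-style rows for the same blocks. The C000 values are
# invented for this example and are NOT real LODES figures.
wac_sample = pd.DataFrame({
    "w_geocode": ["240037312011001", "240037012001003", "240037025001034"],
    "C000": [5, 2, 7],
})

# Attach county information to each block, then total jobs by county.
merged = wac_sample.merge(
    xwalk_sample,
    left_on="w_geocode",
    right_on="tabblk2010",
    how="left",
    validate="one_to_one",  # each block should appear once on both sides
)
jobs_by_county = merged.groupby(["cty", "ctyname"], as_index=False)["C000"].sum()

print(jobs_by_county)
```

The `validate` argument makes pandas raise an error if a block code is duplicated on either side, which guards against accidentally double counting jobs.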

For more information about the datasets used in the examples, please refer to the LODES data documentation provided by the United States Census Bureau.

Introduction to Python and SQL for Data Analysis
