import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3

from plotnine import *

Example: Finding Jobs by County over Years#

In this section, we will combine everything we have learned in this book so far to create a workflow for getting the number of jobs by county over many years. Here are the steps we will take:

  • Get WAC data and crosswalk from the internet.

  • Store the data within a database.

  • Use SQL to join the WAC to the crosswalk to get the number of jobs by county.

  • Create a Data Frame which we can use to create a visualization.

  • Graph the jobs over years by county.

Within this whole process, we will use functions and loops to make it easier to do this over many years.

Creating a New Database#

We first start by creating a new database to store our data. Creating a connection to a new database name that doesn’t exist will do this for us.

conn = sqlite3.connect("md_lodes.db")

Next, we define a function that will help us get the WAC data that we want. Try to read through the function and determine what it does. Read back to the previous Python material if you aren’t sure about some parts.

def store_wac_md(year, con):
    '''
    Gets WAC data for Maryland for a given year and stores in a database in the connection
    
    Arguments:
    year: Year to get data for
    con: Connection to the database
    
    Returns:
    None
    '''
    
    # Bring in data from internet
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'md/wac/md_wac_S000_JT00_{year}.csv.gz'
    file_name = base_url + file_specs
    df_md = pd.read_csv(file_name,compression='gzip')
    
    # Store in database
    df_md.columns = df_md.columns.str.lower()
    df_md.to_sql('md_wac_' + str(year), conn, if_exists='replace')

We then use a loop to store the data within the database we created.

for i in range(2007,2017):
    store_wac_md(i, conn)

We just need to run the process for bringing in the crosswalk data once. We’ll store that in the database too.

data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/md_xwalk.csv.gz'
md_xwalk = pd.read_csv(data_url,compression='gzip')
md_xwalk.columns = md_xwalk.columns.str.lower()
md_xwalk.to_sql('md_xwalk', conn, if_exists='replace')
145247

We’re going to need another function to automate doing the join, then bring that into Python as a Data Frame.

def jobs_by_county(year, conn):
    '''
    Joins the WAC to the xwalk for a given year and calculates jobs by county. 
    
    Arguments:
    year: Year to get data for
    con: Connection to the database
    
    Returns:
    A Data Frame with three columns, county, jobs, and year
    '''    
    
    base_query = f'''
    SELECT ctyname as county, count(*) as jobs, {year} as year
    FROM md_wac_{year} wac
    JOIN md_xwalk xwalk
    ON wac.w_geocode = xwalk.tabblk2010
    GROUP BY ctyname
    '''
    
    return pd.read_sql(base_query, conn)
jobs_by_county(2015,conn)
county jobs year
0 Allegany County, MD 643 2015
1 Anne Arundel County, MD 2720 2015
2 Baltimore County, MD 4246 2015
3 Baltimore city, MD 3827 2015
4 Calvert County, MD 476 2015
5 Caroline County, MD 321 2015
6 Carroll County, MD 1263 2015
7 Cecil County, MD 663 2015
8 Charles County, MD 721 2015
9 Dorchester County, MD 325 2015
10 Frederick County, MD 1708 2015
11 Garrett County, MD 378 2015
12 Harford County, MD 1470 2015
13 Howard County, MD 1444 2015
14 Kent County, MD 291 2015
15 Montgomery County, MD 5216 2015
16 Prince George's County, MD 3423 2015
17 Queen Anne's County, MD 393 2015
18 Somerset County, MD 201 2015
19 St. Mary's County, MD 507 2015
20 Talbot County, MD 491 2015
21 Washington County, MD 1178 2015
22 Wicomico County, MD 866 2015
23 Worcester County, MD 679 2015

We’ll use a loop to get the data we want stacked on top of each other so that we have a long dataset with three columns: county, jobs, and year.

md_jobs_by_county = pd.DataFrame()
for i in range(2007,2017):
    md_jobs_by_county = pd.concat([md_jobs_by_county, jobs_by_county(i,conn)], 
                                  ignore_index= True)
md_jobs_by_county.head()
county jobs year
0 Allegany County, MD 1057 2007
1 Anne Arundel County, MD 3614 2007
2 Baltimore County, MD 6694 2007
3 Baltimore city, MD 4931 2007
4 Calvert County, MD 782 2007

Finally, we can plot the jobs over years by county. Here, we use ggplot to do the plotting, but you can use other methods as well.

(ggplot(md_jobs_by_county)
 + aes(x = 'year', y = 'jobs', color = 'county') 
 + geom_line())
../../_images/04-jobs-county-years_16_0.png
<ggplot: (336039843)>

Checkpoint: Python and SQL Workflow#

Make sure you understand how all of the code above comes together. Try adjusting the code above so that the graph only has a few counties so that it is easier to read. Try doing the same for a different state.