import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import sqlite3

from plotnine import *

Example: Finding Jobs by County over Years#

In this section, we will combine everything we have learned in this book so far to create a workflow for getting the number of jobs by county over many years. Here are the steps we will take:

Get WAC data and crosswalk from the internet.
Store the data within a database.
Use SQL to join the WAC to the crosswalk to get the number of jobs by county.
Create a Data Frame which we can use to create a visualization.
Graph the jobs over years by county.

Within this whole process, we will use functions and loops to make it easier to do this over many years.

Creating a New Database#

We first start by creating a new database to store our data. Creating a connection to a new database name that doesn’t exist will do this for us.

conn = sqlite3.connect("md_lodes.db")

Next, we define a function that will help us get the WAC data that we want. Try to read through the function and determine what it does. Read back to the previous Python material if you aren’t sure about some parts.

def store_wac_md(year, con):
    '''
    Gets WAC data for Maryland for a given year and stores in a database in the connection
    
    Arguments:
    year: Year to get data for
    con: Connection to the database
    
    Returns:
    None
    '''
    
    # Bring in data from internet
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'md/wac/md_wac_S000_JT00_{year}.csv.gz'
    file_name = base_url + file_specs
    df_md = pd.read_csv(file_name,compression='gzip')
    
    # Store in database
    df_md.columns = df_md.columns.str.lower()
    df_md.to_sql('md_wac_' + str(year), conn, if_exists='replace')

We then use a loop to store the data within the database we created.

for i in range(2007,2017):
    store_wac_md(i, conn)

We just need to run the process for bringing in the crosswalk data once. We’ll store that in the database too.

data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/md/md_xwalk.csv.gz'
md_xwalk = pd.read_csv(data_url,compression='gzip')
md_xwalk.columns = md_xwalk.columns.str.lower()
md_xwalk.to_sql('md_xwalk', conn, if_exists='replace')

We’re going to need another function to automate doing the join, then bring that into Python as a Data Frame.

def jobs_by_county(year, conn):
    '''
    Joins the WAC to the xwalk for a given year and calculates jobs by county. 
    
    Arguments:
    year: Year to get data for
    con: Connection to the database
    
    Returns:
    A Data Frame with three columns, county, jobs, and year
    '''    
    
    base_query = f'''
    SELECT ctyname as county, count(*) as jobs, {year} as year
    FROM md_wac_{year} wac
    JOIN md_xwalk xwalk
    ON wac.w_geocode = xwalk.tabblk2010
    GROUP BY ctyname
    '''
    
    return pd.read_sql(base_query, conn)

jobs_by_county(2015,conn)

	county	jobs	year
0	Allegany County, MD	643	2015
1	Anne Arundel County, MD	2720	2015
2	Baltimore County, MD	4246	2015
3	Baltimore city, MD	3827	2015
4	Calvert County, MD	476	2015
5	Caroline County, MD	321	2015
6	Carroll County, MD	1263	2015
7	Cecil County, MD	663	2015
8	Charles County, MD	721	2015
9	Dorchester County, MD	325	2015
10	Frederick County, MD	1708	2015
11	Garrett County, MD	378	2015
12	Harford County, MD	1470	2015
13	Howard County, MD	1444	2015
14	Kent County, MD	291	2015
15	Montgomery County, MD	5216	2015
16	Prince George's County, MD	3423	2015
17	Queen Anne's County, MD	393	2015
18	Somerset County, MD	201	2015
19	St. Mary's County, MD	507	2015
20	Talbot County, MD	491	2015
21	Washington County, MD	1178	2015
22	Wicomico County, MD	866	2015
23	Worcester County, MD	679	2015

We’ll use a loop to get the data we want stacked on top of each other so that we have a long dataset with three columns: county, jobs, and year.

md_jobs_by_county = pd.DataFrame()
for i in range(2007,2017):
    md_jobs_by_county = pd.concat([md_jobs_by_county, jobs_by_county(i,conn)], 
                                  ignore_index= True)
md_jobs_by_county.head()

	county	jobs	year
0	Allegany County, MD	1057	2007
1	Anne Arundel County, MD	3614	2007
2	Baltimore County, MD	6694	2007
3	Baltimore city, MD	4931	2007
4	Calvert County, MD	782	2007

Finally, we can plot the jobs over years by county. Here, we use ggplot to do the plotting, but you can use other methods as well.

(ggplot(md_jobs_by_county)
 + aes(x = 'year', y = 'jobs', color = 'county') 
 + geom_line())

../../_images/04-jobs-county-years_16_0.png

<ggplot: (336039843)>

Checkpoint: Python and SQL Workflow#

Make sure you understand how all of the code above comes together. Try adjusting the code above so that the graph only has a few counties so that it is easier to read. Try doing the same for a different state.

Introduction to Python and SQL for Data Analysis

Example: Finding Jobs by County over Years

Contents

Example: Finding Jobs by County over Years#

Creating a New Database#

Checkpoint: Python and SQL Workflow#