# Start (as usual) by loading libraries
import numpy as np
import pandas as pd
import matplotlib as mplib
import matplotlib.pyplot as plt
import seaborn as sns

Example: Line Chart Over Years#

In this last section, we’ll look at a line chart while incorporating all the other methods we’ve gone over in this notebook. Here, we want to look at the change in number of jobs by year, separating them into each age group so that we can look at the trends for each age group as well as compare the trends across age groups.

In order to do this, we’ll need to get that data from multiple Data Frames, since we want to combine data from multiple years. We’ll put them all in lists, then bring it all together into one Data Frame, setting the index of that Data Frame to the correct year, then plot the line chart. Since we want to separate it by age group

Recall that we’ve already brought in data from 2009 to 2015 previously. We’ll use that data for now, replicating that code here. You can try changing the values to add more years, or use a different state. We’ll show all of the code, then go over the individual parts.

def get_wac(year, state = "ca"):
    '''
    Gets the WAC data for a given state and year.
    ---
    state: string, two-letter code of state for which we want the data
    year: int, the year we want to bring in data for
    '''
    
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'{state}/wac/{state}_wac_S000_JT00_{year}.csv.gz'
    file_name = base_url + file_specs
    
    # print("The URL for the file is at: " + file_name)
    output = pd.read_csv(file_name,compression='gzip')
    return(output)

# Initialize an empty dictionary.
wac_all_years = {}

# This loop might take a little bit of time.
# If you want to see progress while it runs, uncomment the second line in the loop.
for i in range(2009,2016):
    wac_all_years[i] = get_wac(year = i)
    # print("WAC for " + str(i) + " obtained.")

# Initialize the lists we'll use to create Data Frame
ca_c000 = []
ca_ca01 = []
ca_ca02 = []
ca_ca03 = []

# Loop through all years and get total jobs by age group
for i in range(2009,2016):
    tempdf = wac_all_years[i]
    ca_c000.append(tempdf.C000.sum())
    ca_ca01.append(tempdf.CA01.sum())
    ca_ca02.append(tempdf.CA02.sum())
    ca_ca03.append(tempdf.CA03.sum())

# Create the overall Data Frame
plot_df = pd.DataFrame({"C000": ca_c000, "CA01": ca_ca01, 
                       "CA02": ca_ca02, "CA03": ca_ca03},
                       index=range(2009,2016))

# Now to plot
fig, ax = plt.subplots(figsize=(8,6))

# Add each plot
plot_df.C000.plot(kind='line', ax=ax, legend=True)
plot_df.CA01.plot(kind='line', ax=ax, legend=True)
plot_df.CA02.plot(kind='line', ax=ax, legend=True)
plot_df.CA03.plot(kind='line', ax=ax, legend=True)

# Just to make x-axis look nice
ax.get_xaxis().get_major_formatter().set_useOffset(False) 

We start by initializing four lists. These are the lists in which we’ll store the total number of jobs for each age group by year.

ca_c000 = []
ca_ca01 = []
ca_ca02 = []
ca_ca03 = []

The lists each correspond to an age group (or the total of all age groups), so we’ll plot lines for four different categories: Total, 29 or younger, 30 to 54, and 55 or older. Next, we go through the loop.

for i in range(2009,2016):
    tempdf = wac_all_years[i]
    ca_c000.append(tempdf.C000.sum())
    ca_ca01.append(tempdf.CA01.sum())
    ca_ca02.append(tempdf.CA02.sum())
    ca_ca03.append(tempdf.CA03.sum())

We’re going to be looping through each year from 2009 to 2015. In each iteration, we start by storing the Data Frame for that year in tempdf. This step isn’t absolutely necessary, as we could have chosen to replace each instance of “tempdf” with “wac_all_years[i]” in the rest of the loop, but I’ve done it to make the code more readable. Next, for each of the four categories, we’ll take the appropriate column, calculate the sum, then append it to the appropriate list. This all takes place in one line for each category.

plot_df = pd.DataFrame({"C000": ca_c000, "CA01": ca_ca01, 
                       "CA02": ca_ca02, "CA03": ca_ca03},
                       index=range(2009,2016))

We then create a new Data Frame called plot_df that has a column for each age group and a row for each year. Notice that we create a dictionary with column names as keys and the lists containing the elements as the values. In addition, we specify the index to be the years. You can check the contents of plot_df to get a better idea of what we’ve constructed.

plot_df

	C000	CA01	CA02	CA03
2009	14122178	3415234	8222618	2484326
2010	14462669	3361853	8429360	2671456
2011	14664138	3351381	8469390	2843367
2012	14709292	3189236	8493125	3026931
2013	15182968	3268571	8707699	3206698
2014	15614666	3361131	8880201	3373334
2015	16048747	3451786	9045644	3551317

Lastly, we plot the figure using similar methods as before, except with kind='line'.

fig, ax = plt.subplots(figsize=(8,6))

plot_df.C000.plot(kind='line', ax=ax, legend=True)
plot_df.CA01.plot(kind='line', ax=ax, legend=True)
plot_df.CA02.plot(kind='line', ax=ax, legend=True)
plot_df.CA03.plot(kind='line', ax=ax, legend=True)

ax.get_xaxis().get_major_formatter().set_useOffset(False) 

The last line is simply to make the years look nicer on the x-axis. You can try plotting without the last line (i.e. comment it out) to see what happens without it.

Using Seaborn#

To obtain the above graph, you can actually use the seaborn function lineplot with the plot_df DataFrame to get the same thing.

sns.lineplot(data=plot_df)

<Axes: >

Saving Figures#

Of course, some of the figures we make might help us visualize when exploring the data, but we might also want to save them to include in presentations or to show others without needing to run code. We can save the figure by using the savefig() method. Notice that we include the “.png” file extension in the name of the file. Here, we also specify a dpi.

fig.savefig('jobsbyage.png', dpi=600)

In order to save a seaborn plot, you can use the same savefig method, but you need to first extract the Figure object.

g = sns.lineplot(data=plot_df)
f = g.get_figure()
f.savefig('sns_jobsbyage.png')

../../_images/03-multiple-years_13_0.png

Checkpoint: Visualize Your Data#

Try using the methods we’ve described above, try visualizing your data set. What do the visualizations tell you? How are they different from the data from California? How are they the same? Does this make sense?

Introduction to Python and SQL for Data Analysis

Example: Line Chart Over Years

Contents

Example: Line Chart Over Years#

Using Seaborn#

Saving Figures#

Checkpoint: Visualize Your Data#