# Start (as usual) by loading libraries
import numpy as np
import pandas as pd
import matplotlib as mplib
import matplotlib.pyplot as plt
import seaborn as sns

Example: Line Chart Over Years

In this last section, we’ll build a line chart that brings together the methods we’ve covered in this notebook. We want to look at how the number of jobs changes from year to year, broken out by age group, so that we can see the trend within each age group and compare trends across groups.

To do this, we’ll need to combine data from several years, each stored in its own Data Frame. We’ll collect the yearly totals in lists, bring them together into one Data Frame indexed by year, then plot the line chart. Since we want to separate the data by age group, we’ll keep one list per age group.

Recall that we brought in data from 2009 to 2015 earlier. We’ll use that data for now, replicating the code here. You can try changing the values to add more years or to use a different state. We’ll show all of the code first, then go over the individual parts.

def get_wac(year, state = "ca"):
    '''
    Gets the WAC data for a given state and year.
    ---
    state: string, two-letter code of state for which we want the data
    year: int, the year we want to bring in data for
    '''
    
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'{state}/wac/{state}_wac_S000_JT00_{year}.csv.gz'
    file_name = base_url + file_specs
    
    # print("The URL for the file is at: " + file_name)
    output = pd.read_csv(file_name, compression='gzip')
    return output

# Initialize an empty dictionary.
wac_all_years = {}

# This loop might take a little bit of time.
# If you want to see progress while it runs, uncomment the second line in the loop.
for i in range(2009,2016):
    wac_all_years[i] = get_wac(year = i)
    # print("WAC for " + str(i) + " obtained.")

# Initialize the lists we'll use to create the Data Frame
ca_c000 = []
ca_ca01 = []
ca_ca02 = []
ca_ca03 = []

# Loop through all years and get total jobs by age group
for i in range(2009,2016):
    tempdf = wac_all_years[i]
    ca_c000.append(tempdf.C000.sum())
    ca_ca01.append(tempdf.CA01.sum())
    ca_ca02.append(tempdf.CA02.sum())
    ca_ca03.append(tempdf.CA03.sum())

# Create the overall Data Frame
plot_df = pd.DataFrame({"C000": ca_c000, "CA01": ca_ca01, 
                       "CA02": ca_ca02, "CA03": ca_ca03},
                       index=range(2009,2016))

# Now to plot
fig, ax = plt.subplots(figsize=(8,6))

# Add each plot
plot_df.C000.plot(kind='line', ax=ax, legend=True)
plot_df.CA01.plot(kind='line', ax=ax, legend=True)
plot_df.CA02.plot(kind='line', ax=ax, legend=True)
plot_df.CA03.plot(kind='line', ax=ax, legend=True)

# Just to make x-axis look nice
ax.get_xaxis().get_major_formatter().set_useOffset(False) 
[Figure: line chart of jobs by year, 2009 to 2015, with one line each for C000, CA01, CA02, and CA03]

We start by initializing four lists. These are the lists in which we’ll store the total number of jobs for each age group by year.

ca_c000 = []
ca_ca01 = []
ca_ca02 = []
ca_ca03 = []

The lists each correspond to an age group (or the total of all age groups), so we’ll plot lines for four different categories: Total, 29 or younger, 30 to 54, and 55 or older. Next, we go through the loop.

for i in range(2009,2016):
    tempdf = wac_all_years[i]
    ca_c000.append(tempdf.C000.sum())
    ca_ca01.append(tempdf.CA01.sum())
    ca_ca02.append(tempdf.CA02.sum())
    ca_ca03.append(tempdf.CA03.sum())

We loop through each year from 2009 to 2015. In each iteration, we start by storing the Data Frame for that year in tempdf. This step isn’t strictly necessary, since we could have replaced each instance of “tempdf” with “wac_all_years[i]” in the rest of the loop, but I’ve done it to make the code more readable. Next, for each of the four categories, we take the appropriate column, calculate its sum, and append the result to the appropriate list. This all takes place in one line per category.
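As an aside, the same yearly totals can be computed without keeping four separate lists. The sketch below is one possible alternative, not the approach used in this notebook. To keep it runnable without downloading anything, it uses a small made-up dictionary called toy_wac (a hypothetical stand-in for wac_all_years) with invented numbers.

```python
import pandas as pd

# Hypothetical stand-in for wac_all_years: tiny made-up frames keyed by year,
# so the example runs without downloading the real WAC files.
toy_wac = {
    2009: pd.DataFrame({"C000": [10, 20], "CA01": [2, 5],
                        "CA02": [5, 10], "CA03": [3, 5]}),
    2010: pd.DataFrame({"C000": [12, 22], "CA01": [3, 5],
                        "CA02": [6, 11], "CA03": [3, 6]}),
}

cols = ["C000", "CA01", "CA02", "CA03"]

# df[cols].sum() returns a Series of column totals for one year.
# A DataFrame built from {year: Series} has years as columns,
# so we transpose to get one row per year.
toy_plot_df = pd.DataFrame(
    {year: df[cols].sum() for year, df in toy_wac.items()}
).T

print(toy_plot_df)
```

The result has the same layout as the Data Frame we build next: one row per year and one column per category.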

plot_df = pd.DataFrame({"C000": ca_c000, "CA01": ca_ca01, 
                       "CA02": ca_ca02, "CA03": ca_ca03},
                       index=range(2009,2016))

We then create a new Data Frame called plot_df that has a column for each age group and a row for each year. Notice that we create a dictionary with column names as keys and the lists containing the elements as the values. In addition, we specify the index to be the years. You can check the contents of plot_df to get a better idea of what we’ve constructed.

plot_df
          C000     CA01     CA02     CA03
2009  14122178  3415234  8222618  2484326
2010  14462669  3361853  8429360  2671456
2011  14664138  3351381  8469390  2843367
2012  14709292  3189236  8493125  3026931
2013  15182968  3268571  8707699  3206698
2014  15614666  3361131  8880201  3373334
2015  16048747  3451786  9045644  3551317
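To see how the dictionary keys and the index map onto this table, here is a small sketch that rebuilds just the 2009 and 2010 rows by hand, using the values shown above. The name check_df is only for illustration. It also runs a quick sanity check that the three age groups add up to the total.

```python
import pandas as pd

# The 2009 and 2010 rows of the table above, typed in by hand.
check_df = pd.DataFrame(
    {"C000": [14122178, 14462669],
     "CA01": [3415234, 3361853],
     "CA02": [8222618, 8429360],
     "CA03": [2484326, 2671456]},
    index=range(2009, 2011),
)

# Label-based lookup: the row label is the year, the column label is the dict key.
print(check_df.loc[2010, "CA01"])

# Sanity check: the three age groups should sum to the total in every row.
age_sum = check_df[["CA01", "CA02", "CA03"]].sum(axis=1)
print((age_sum == check_df["C000"]).all())
```

A check like this is a cheap way to catch a mislabeled column before plotting.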

Lastly, we plot the figure using similar methods as before, except with kind='line'.

fig, ax = plt.subplots(figsize=(8,6))

plot_df.C000.plot(kind='line', ax=ax, legend=True)
plot_df.CA01.plot(kind='line', ax=ax, legend=True)
plot_df.CA02.plot(kind='line', ax=ax, legend=True)
plot_df.CA03.plot(kind='line', ax=ax, legend=True)

ax.get_xaxis().get_major_formatter().set_useOffset(False) 

The last line is simply to make the years look nicer on the x-axis. You can try plotting without the last line (i.e. comment it out) to see what happens without it.
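For a bit more detail on what that line does: when tick values are large compared to the range they span, matplotlib's default tick formatter may label the ticks as small numbers relative to a shared offset, which is hard to read for years. The sketch below uses a short hand-typed series (the first three C000 totals from the table) and the illustrative names demo_series, fig_demo and ax_demo. It only inspects the formatter setting and does not need a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# First three C000 totals from the table, indexed by year.
demo_series = pd.Series([14122178, 14462669, 14664138],
                        index=[2009, 2010, 2011], name="C000")

fig_demo, ax_demo = plt.subplots(figsize=(8, 6))
demo_series.plot(kind="line", ax=ax_demo)

# The x-axis formatter decides how tick labels are written.
formatter = ax_demo.get_xaxis().get_major_formatter()

# Turn offset notation off so ticks are written out as plain numbers.
formatter.set_useOffset(False)
print(formatter.get_useOffset())

plt.close(fig_demo)
```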

Using Seaborn

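You can also get essentially the same graph in one line by passing the plot_df DataFrame to the seaborn function lineplot. When given a wide-form DataFrame like this one, seaborn draws one line per column against the index; by default it also varies the line style for each column.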

sns.lineplot(data=plot_df)
<Axes: >
[Figure: seaborn line chart of plot_df, with one line per column]

Saving Figures

Some of the figures we make are only meant to help us explore the data, but we might also want to save others to include in presentations or to show people without needing to run the code. We can save a figure with its savefig() method. Notice that we include the “.png” file extension in the file name, which determines the output format. Here, we also specify the resolution with dpi (dots per inch).

fig.savefig('jobsbyage.png', dpi=600)
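The size of the saved image in pixels is roughly the figure size in inches multiplied by the dpi, so an 8 by 6 inch figure at dpi=600 comes out at about 4800 by 3600 pixels. The sketch below illustrates this with a lower dpi and a temporary folder so that it doesn't clutter your working directory. The names fig_save, out_path and jobsbyage_demo.png are just for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt

fig_save, ax_save = plt.subplots(figsize=(8, 6))
ax_save.plot([2009, 2010, 2011], [14122178, 14462669, 14664138])

# Expected pixel size = size in inches * dpi.
demo_dpi = 100
width_px, height_px = fig_save.get_size_inches() * demo_dpi
print(width_px, height_px)

# Save into a temporary folder and confirm the file was written.
out_path = os.path.join(tempfile.mkdtemp(), "jobsbyage_demo.png")
fig_save.savefig(out_path, dpi=demo_dpi)
print(os.path.exists(out_path))

plt.close(fig_save)
```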

In order to save a seaborn plot, you can use the same savefig method, but you need to first extract the Figure object.

g = sns.lineplot(data=plot_df)
f = g.get_figure()
f.savefig('sns_jobsbyage.png')
[Figure: seaborn line chart of plot_df, as saved to sns_jobsbyage.png]

Checkpoint: Visualize Your Data

Using the methods we’ve described above, try visualizing your data set. What do the visualizations tell you? How are they different from the California data? How are they the same? Does this make sense?