# Start (as usual) by loading libraries
import numpy as np
import pandas as pd

Functions

First, we start off by creating a function. You’re familiar with functions in general already, since you’ve used them to do various things like determining the type of an object (by using type()) or finding the mean of a list of numbers (using mean()). Here, we’ll walk through creating your own function. This helps simplify your code and makes it much more readable if you’re going to be doing the same thing many times, as you won’t have to copy and paste your code each time you want to do a certain task.

Let’s start with a very basic function. Suppose we want a function that takes an argument and returns the squared value.

def squarer (x):
    '''
    Squares a value
    '''
    
    y = x ** 2
    return(y) 

# Test it out
squarer(4)
16

Here, we’re creating a function called squarer which takes an input, x, and outputs the squared value. Let’s break it down line by line.

def squarer (x):

The def keyword indicates that we’re defining a function, and it is followed by the name we want to give the function. Then, in parentheses, we list any arguments we want the function to take. If we don’t want it to take any arguments, we can just leave the parentheses empty. Lastly, we end the line with a colon.

This takes us to the next lines.

'''
Squares a value
'''

First, note that these lines are indented. To be part of the function, the lines that come after def must be indented. These few lines are what is called a docstring: a string placed at the top of the function that does not affect what the function computes, but that Python stores as the function’s documentation. It describes the function so that you know what it does later on. Try using the help function to look at it.

help(squarer)
Help on function squarer in module __main__:

squarer(x)
    Squares a value
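As a small aside, the same docstring can also be read directly from the function object. The sketch below redefines squarer so that it runs on its own and prints the text stored in its __doc__ attribute, which is where help() finds the description.

```python
def squarer(x):
    '''
    Squares a value
    '''
    y = x ** 2
    return y

# The docstring is stored on the function itself, in the __doc__ attribute
print(squarer.__doc__)
```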

Next, we get to the body of the function.

y = x ** 2
return(y) 

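Here, we have two lines. The first assigns the value x ** 2 to y (note that ** is the operator for raising something to a power). The second uses the return statement to output it. The parentheses around y are optional, since return is a statement rather than a function.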

Lastly, we have unindented lines, which aren’t part of the function.

squarer(4)

This just calls the function with the argument 4 to check that it works. It gives us a value of 16.
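To round off the walkthrough, here is a minimal sketch of two points mentioned above: storing a returned value in a variable so it can be reused, and defining a function that takes no arguments. The function get_default_year is a hypothetical helper made up for illustration and is not used elsewhere in this section.

```python
def squarer(x):
    '''
    Squares a value
    '''
    y = x ** 2
    return y

def get_default_year():
    '''
    Returns a fixed year (hypothetical helper, for illustration only)
    '''
    return 2015

# Store the returned value in a variable so it can be reused later
result = squarer(4)
print(result)      # 16
print(result + 1)  # 17

# A function with no arguments is still called with (empty) parentheses
print(get_default_year())  # 2015
```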

Creating a Function to Bring In Data

Let’s now make a function called get_ca_wac() that takes the year as an argument and outputs the California Workplace Area Characteristics (WAC) dataset. We’ll show the code, then explain it in detail.

def get_ca_wac(year):
    '''
    Arguments
    year: int, the year we want to bring in data for
    
    Returns
    A pandas DataFrame with the California WAC for a specific year
    '''
    
    file_name = f'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_{year}.csv.gz'
    output = pd.read_csv(file_name,compression='gzip')
    return(output)

We need to change the location of the file depending on what year we want to get. We do this by using what is called an f-string, or “formatted string literal”. We put f in front of the string, then use curly braces to mark whatever we want to insert into the string. In the code above, we are creating a string from this template: https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_{year}.csv.gz

Here, we have a part with curly braces, {year}. This means that whenever we see {year} in the string, we replace it with whatever is in year. This gives us the string we want. For example, if the year we want is 2015, the file is at https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz.
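The substitution described above can be tried on its own, without downloading anything. This is a minimal sketch that only builds and prints strings; the variable next_year_file is introduced here purely to show that the braces can hold an expression as well as a variable name.

```python
year = 2015

# The value stored in year replaces {year} when the string is created
file_name = f'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_{year}.csv.gz'
print(file_name)

# The braces can also contain an expression, which is evaluated first
next_year_file = f'ca_wac_S000_JT00_{year + 1}.csv.gz'
print(next_year_file)
```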

In this function, we store the DataFrame that we want in output. The line

output = pd.read_csv(file_name,compression='gzip')

should look relatively familiar to you, since we’ve used the read_csv() function before. In this case, we also pass another argument, compression='gzip', since the file is compressed and the LODES documentation tells us it was compressed using the gzip algorithm. (By default, pandas tries to infer the compression from the file extension, so this argument mainly makes the intent explicit.)

Finally, we use the return statement to give the result of our function, which is the DataFrame that we stored in output.

Let’s try using this function to get the dataset from 2015.

df_2015 = get_ca_wac(2015)

Now, we’ve shown the creation of a function for just the California Workplace Area Characteristics dataset. Let’s say you actually want to bring in data for multiple states (perhaps even all the states). You could adjust the code above to reflect the correct state in the URL for each state, but that would take a very long time and a lot of tedious editing of code. How might we create a function to make such a task easier?

If you recall, we created a function so that we could easily change the year while keeping the rest of the code the same. In this case, we need to adjust our code so that it can take different states. Therefore, we can take the function we created above and make some slight adjustments so that we can specify the state as one of its arguments.

Try thinking about how you might adjust the code above and compare it to what we have done below.

def get_wac(year, state = "ca"):
    '''
    Arguments
    year: int, the year we want to bring in data for
    state: string, two-letter code of state for which we want the data (default "ca")
    
    Returns
    A pandas DataFrame with the WAC for a specific state and year
    '''
    
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'{state}/wac/{state}_wac_S000_JT00_{year}.csv.gz'
    file_name = base_url + file_specs
    
    # print("The URL for the file is at: " + file_name)
    output = pd.read_csv(file_name,compression='gzip')
    return(output)

First, notice that we now have two arguments that we can pass to the function: year and state. In addition, when defining the function, we’ve given state a default value of "ca" by using state = "ca". This just means that, when calling get_wac, we can either give a value for state, or we can leave it blank, in which case it will default to state = "ca".

Within the URL string, we need to replace all instances of ca with the string in state. I’ve separated the URL into two parts to make it easier to read. The first half, called base_url, doesn’t change, so we can include it no matter the choice of state or year. The second half, called file_specs, needs state and year to be specified, which we do with an f-string, as before. We can use the + operator to combine two strings into one.
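To see the default argument and the string concatenation at work without downloading any data, the URL-building step can be pulled out on its own. The function build_wac_url below is a hypothetical helper written for this illustration (it is not part of get_wac), and it only constructs the string; it does not check that a file actually exists for a given state and year.

```python
def build_wac_url(year, state="ca"):
    '''
    Hypothetical helper: builds the WAC file URL without downloading it
    '''
    base_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/'
    file_specs = f'{state}/wac/{state}_wac_S000_JT00_{year}.csv.gz'
    return base_url + file_specs

# state is omitted, so it falls back to the default "ca"
print(build_wac_url(2015))

# state is given explicitly, so it replaces the default
print(build_wac_url(2015, state="ny"))
```

Comparing the two printed URLs is a quick way to confirm that both occurrences of the state code change together.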

Let’s try using the function and do a quick check to see if it seems to be working correctly.

df_2015 = get_wac(year = 2015, state = 'ca')
df_2015.head()
        w_geocode  C000  CA01  CA02  CA03  CE01  CE02  CE03  CNS01  CNS02  ...  CFA02  CFA03  CFA04  CFA05  CFS01  CFS02  CFS03  CFS04  CFS05  createdate
0  60014001001007    30     2    16    12     4     2    24      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
1  60014001001008     4     0     1     3     0     0     4      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
2  60014001001011     3     2     1     0     0     3     0      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
3  60014001001017    11     3     3     5     2     2     7      0      0  ...      0      0      0      0      0      0      0      0      0    20190826
4  60014001001024    10     3     3     4     7     1     2      0      0  ...      0      0      0      0      0      0      0      0      0    20190826

5 rows × 53 columns

Here, I’ve specified 'ca' as the state even though it’s the default we set when defining the function. This is just for clarity; running the code without the state = 'ca' argument would do exactly the same thing.

Coding Tip: Notice that there’s a line that’s been commented out near the bottom: “# print("The URL for the file is at: " + file_name)”. Try uncommenting it (delete the “#”), run the cell defining the function, then use the function again. This prints out helpful information (in this case, the URL that we’ve constructed) as the function runs. Using print() calls in this way is very useful for debugging when your function isn’t working the way you think it should.

Checkpoint: Creating Functions

Try using the functions above to bring in the data for your state. Do you need to make any changes to the functions? Why or why not?

Suppose you wanted to adjust the functions so that you can specify whether you want to bring in the Residence Area Characteristics (rac), or the Workplace Area Characteristics (wac), instead of bringing in the wac dataset by default. How would you adjust the code? Try doing it yourself.

What about if you wanted to change the function to also work for the Origin-Destination (od) data? How would the function change then? Make sure you look carefully at the file name.

Hint: You can create if-else statements using something like:

test = 1
if test < 2:
    print('The number is less than 2')
else:
    print('The number is not less than 2')
The number is less than 2

This if-else checks whether test is smaller than 2, then prints the appropriate message based on what’s stored in test. As you can see, the same indentation rules apply as with functions.
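An if-else can also sit inside a function, where its lines are indented one level further than the rest of the function body. The sketch below wraps the example above in a hypothetical function, describe_number, that returns the message instead of printing it.

```python
def describe_number(test):
    '''
    Returns a message saying whether test is less than 2
    '''
    if test < 2:
        message = 'The number is less than 2'
    else:
        message = 'The number is not less than 2'
    return message

print(describe_number(1))  # The number is less than 2
print(describe_number(5))  # The number is not less than 2
```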