import numpy as np
import pandas as pd

# LODES Workplace Area Characteristics (WAC) data for California, 2015
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url, compression='gzip')

Checking for Missing Values#

Now that we have our data set, let’s do a quick check for missing values. This is typically one of the first things you’ll want to do when exploring a new data set.

Below, we’ve shown two different ways of writing the same thing. Using isnull() gives us a DataFrame of the same shape, with True wherever a value is missing and False otherwise. Then, sum() adds up each column. Since Python treats True as 1 and False as 0, the sum of each column gives us the total number of missing values for each variable.

df_null = df.isnull()
df_null.sum()
w_geocode     0
C000          0
CA01          0
CA02          0
CA03          0
CE01          0
CE02          0
CE03          0
CNS01         0
CNS02         0
CNS03         0
CNS04         0
CNS05         0
CNS06         0
CNS07         0
CNS08         0
CNS09         0
CNS10         0
CNS11         0
CNS12         0
CNS13         0
CNS14         0
CNS15         0
CNS16         0
CNS17         0
CNS18         0
CNS19         0
CNS20         0
CR01          0
CR02          0
CR03          0
CR04          0
CR05          0
CR07          0
CT01          0
CT02          0
CD01          0
CD02          0
CD03          0
CD04          0
CS01          0
CS02          0
CFA01         0
CFA02         0
CFA03         0
CFA04         0
CFA05         0
CFS01         0
CFS02         0
CFS03         0
CFS04         0
CFS05         0
createdate    0
dtype: int64

We did this in two separate lines, but that’s actually not necessary. In fact, we can do it all in one go:

df.isnull().sum()
w_geocode     0
C000          0
CA01          0
CA02          0
CA03          0
CE01          0
CE02          0
CE03          0
CNS01         0
CNS02         0
CNS03         0
CNS04         0
CNS05         0
CNS06         0
CNS07         0
CNS08         0
CNS09         0
CNS10         0
CNS11         0
CNS12         0
CNS13         0
CNS14         0
CNS15         0
CNS16         0
CNS17         0
CNS18         0
CNS19         0
CNS20         0
CR01          0
CR02          0
CR03          0
CR04          0
CR05          0
CR07          0
CT01          0
CT02          0
CD01          0
CD02          0
CD03          0
CD04          0
CS01          0
CS02          0
CFA01         0
CFA02         0
CFA03         0
CFA04         0
CFA05         0
CFS01         0
CFS02         0
CFS03         0
CFS04         0
CFS05         0
createdate    0
dtype: int64
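Every column comes back with zero missing values here. With a wider or messier data set, it can help to show only the columns that do contain missing values. A minimal sketch, using the same df as above:

# Keep only the columns with at least one missing value
null_counts = df.isnull().sum()
null_counts[null_counts > 0]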

We can also drop any duplicated rows.

df_no_dups = df.drop_duplicates()
df_no_dups.shape # Check how many rows there are after dropping duplicates
(243462, 53)
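Before dropping anything, it can be useful to know how many duplicates there were in the first place. A minimal sketch using duplicated(), which flags every repeat of an earlier row:

# Count fully duplicated rows (True marks second and later occurrences)
df.duplicated().sum()

By default, drop_duplicates() keeps the first occurrence of each duplicated row; passing keep='last' or keep=False (drop every copy) changes that behavior.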

Checking for Inconsistencies#

If you check the data documentation, you’ll see that C000 is the total number of jobs. Therefore, it would make sense for each of the other groups of columns to add up to the values in C000. For example, you’d expect CA01, CA02, and CA03 (jobs broken down by age group) to add up to C000 for each row. Let’s check whether this is true.

We’ll first take the sum of CA01, CA02, and CA03 in each row and put it in a new column called CA_sum. Then, we’ll compare the new CA_sum column to the existing C000 column to see if they match. The full code is shown below, followed by an explanation of each piece.

# Create a list with the columns we want to add up
vars_to_check = ['CA01','CA02','CA03']

# Use apply() to sum the columns within each row (axis=1 means row-wise)
df['CA_sum'] = df[vars_to_check].apply(sum, axis=1)

# Check how many rows don't match
sum(df.CA_sum != df.C000)
0

We first created a list called vars_to_check, which contains the columns that we want to add up. Then, we took those columns from df and used the apply() method, which applies the same function to each row (or to each column, if we pass axis=0 instead of axis=1). In this case, we want the sum of each row, so the function we pass is sum. We want to keep this sum, so we assign the result to a new column in df, CA_sum. Notice that this is the first place 'CA_sum' appears, because this is where we create it.
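As an aside, apply(sum, axis=1) spells out exactly what is happening, but pandas also has a built-in row-wise sum that is usually faster on large data sets. A minimal sketch of the same calculation, vectorized:

# Vectorized equivalent: sum across the columns within each row
df['CA_sum'] = df[vars_to_check].sum(axis=1)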

Lastly, we want to count the rows in which C000 and CA_sum differ. We do this using

df.CA_sum != df.C000

which outputs a Series of True and False values: True if the value in CA_sum is not equal to the value in C000 for that row, and False otherwise. We can then use Python’s built-in sum function to count how many rows didn’t match. If the data are consistent, the sum should be 0.
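The counting trick works because True and False behave like 1 and 0. A toy example makes this concrete (the values here are made up, not from our data):

# Booleans act as integers when summed
mismatch = pd.Series([True, False, True, False])
sum(mismatch) # 2, i.e. two rows didn't match in this toy Series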

Checking for Outliers#

Suppose we want to check if there are any outliers in total number of jobs by census block. We can sort the values in C000 in order to figure this out. Let’s say we want to find the top ten census blocks by total number of jobs.

df.sort_values("C000",ascending=False).head(10)
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate CA_sum
52952 60372074001016 72275 5546 47140 19589 2129 6625 63521 0 0 ... 0 0 0 0 0 0 0 0 20190826 72275
64327 60373104003007 50650 6975 31905 11770 33722 5798 11130 0 0 ... 0 0 0 0 0 0 0 0 20190826 50650
59358 60372655101002 24628 3643 15326 5659 2238 4323 18067 45 0 ... 0 0 0 0 0 0 0 0 20190826 24628
186347 60750177002001 23553 1858 15018 6677 1237 1750 20566 0 0 ... 0 0 0 0 0 0 0 0 20190826 23553
130619 60599800001001 23071 9319 9436 4316 7413 10538 5120 0 0 ... 0 0 0 0 0 0 0 0 20190826 23071
48981 60371437001000 22683 4736 12397 5550 14933 3456 4294 0 0 ... 0 0 0 0 0 0 0 0 20190826 22683
97829 60377030012000 21971 1963 13008 7000 14741 925 6305 0 0 ... 0 0 0 0 0 0 0 0 20190826 21971
64362 60373105011027 20663 3268 13510 3885 7717 4444 8502 0 0 ... 0 0 0 0 0 0 0 0 20190826 20663
212152 60855081012002 20104 3603 14572 1929 246 1643 18215 0 0 ... 0 0 0 0 0 0 0 0 20190826 20104
52965 60372074001037 19469 1995 12795 4679 1603 1839 16027 0 0 ... 0 0 0 0 0 0 0 0 20190826 19469

10 rows × 54 columns

Let’s break this down piece by piece. First, we use the sort_values() method to sort the DataFrame by C000. We pass ascending=False so that the highest values are at the top (the default is to sort in ascending order). On its own, this would give us

df.sort_values("C000",ascending=False)

However, we don’t want to look at everything. Here, we use head(10) to keep only the top ten rows after sorting. This gives us the final code, df.sort_values("C000",ascending=False).head(10).
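As a side note, pandas also provides nlargest(), which combines the sort and the head() in a single call. A minimal sketch that should return the same ten rows:

# Equivalent shortcut: the ten rows with the largest C000
df.nlargest(10, 'C000')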

Checkpoint: Descriptive Statistics on Your Data#

Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions (a few starter snippets are sketched after the list):

  • Are there any missing values?

  • Are there any inconsistencies in the data?

  • Are there missing values that may not have been coded as missing?

  • Are there any interesting outliers?
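If you’re not sure where to start, here is a minimal sketch of a few checks. Note that -1 is only a hypothetical example of a sentinel code that might mean “missing”; check the data documentation for the codes actually used.

# Descriptive statistics (count, mean, min, max, quartiles) per column
df.describe()

# Explicit missing values per column
df.isnull().sum()

# Values that may mean "missing" without being coded as NaN
# (-1 is a hypothetical sentinel here)
(df == -1).sum()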