Checking for Missing Values
import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url, compression='gzip')
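It's often a good idea to take a quick look at a new data set right after loading it. As a small optional sketch before the checks below:
# Get a quick sense of the data: dimensions and first few rows
print(df.shape)  # (number of rows, number of columns)
df.head()        # display the first five rows of the data frame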
Checking for Missing Values
Now that we have our data set, let’s do a quick check for missing values. This is typically one of the first things you’ll want to do when exploring a new data set.
Below, we've shown two different ways of writing the same thing. Using isnull() gives us a data frame of the same size with True and False values indicating whether each value is missing. Then, sum() sums each column. Since Python treats True as 1 and False as 0, the sum of each column gives us the total number of missing values for each variable.
df_null = df.isnull()
df_null.sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
We did this in two separate lines, but that’s actually not necessary. In fact, we can do it all in one go:
df.isnull().sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
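With over 50 columns, scanning this output for nonzero counts gets tedious. One optional convenience is to keep only the columns that have at least one missing value (here, that filter would come back empty, since there are no missing values at all):
# Keep only the columns with at least one missing value
null_counts = df.isnull().sum()
null_counts[null_counts > 0]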
We can also drop any duplicated rows.
df_no_dups = df.drop_duplicates()
df_no_dups.shape # Check how many rows there are after dropping duplicates
(243462, 53)
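If you want to know how many duplicated rows are being dropped, duplicated() returns a Series of True and False values that you can sum, just like with isnull():
# Count duplicated rows: True counts as 1, False as 0
df.duplicated().sum()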
Checking for Inconsistencies
If you check the data documentation, you'll see that C000 is the total number of jobs. Therefore, it would make sense for each of the other groups of columns to add up to the values in C000. For example, you'd expect CA01, CA02, and CA03 to add up to C000 for each row. Let's check to see if this is true.
We’ll first take the sum of CA01, CA02, and CA03 in each row and put that in a new column called CA_sum. Then, we’ll compare our new CA_sum column to the existing C000 column to see if they match. We’ll first show all the code, then explain each section.
# Create a list with the columns we want to add up
vars_to_check = ['CA01', 'CA02', 'CA03']
# Use apply to sum the columns for each row
df['CA_sum'] = df[vars_to_check].apply(sum, axis=1)
# Check how many rows don't match
sum(df.CA_sum != df.C000)
0
We first created a list called vars_to_check, which contains the columns that we want to add up. Then, we took those columns from df and used the apply() method, which applies the same function to each row (or to each column, if we used axis=0 instead of axis=1). In this case, we want to find the sum of each row, so the first argument is sum. We want to create a new column that contains this sum, so we assign the result to a new column in df, CA_sum. Notice that this is the first place we see 'CA_sum', because this is where we are creating it.
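As an aside, pandas data frames also have their own sum() method, so the same column could have been created without apply(); this is simply an equivalent alternative to the code above:
# Equivalent: sum along each row with the data frame's own sum() method
df['CA_sum'] = df[vars_to_check].sum(axis=1)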
Lastly, we want to check how many rows there are in which C000 and CA_sum differ. We do this by using
df.CA_sum != df.C000
which outputs a Series of True and False values: True if the value in CA_sum is not equal to the value in C000 for that row, and False otherwise. We can then use Python's built-in sum function to add up how many times they didn't match, since True counts as 1 and False as 0. If there are no inconsistencies, the sum should be 0.
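The same pattern extends to the other groups of columns. As a rough sketch, you could loop over several groupings at once; note that exactly which groups should total C000 is spelled out in the data documentation, and the groupings below are only inferred from the column names in this file:
# Check several column groups against the C000 total (groupings inferred
# from the column names; confirm them against the data documentation)
groups = {
    'CE': ['CE01', 'CE02', 'CE03'],
    'CNS': ['CNS{:02d}'.format(i) for i in range(1, 21)],
    'CR': ['CR01', 'CR02', 'CR03', 'CR04', 'CR05', 'CR07'],
    'CT': ['CT01', 'CT02'],
    'CS': ['CS01', 'CS02'],
}
for name, cols in groups.items():
    mismatches = (df[cols].sum(axis=1) != df['C000']).sum()
    print(name, mismatches)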
Checking for Outliers
Suppose we want to check whether there are any outliers in the total number of jobs by census block. We can sort the values in C000 to figure this out. Let's say we want to find the top ten census blocks by total number of jobs.
df.sort_values("C000", ascending=False).head(10)
| | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA_sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52952 | 60372074001016 | 72275 | 5546 | 47140 | 19589 | 2129 | 6625 | 63521 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 72275 |
| 64327 | 60373104003007 | 50650 | 6975 | 31905 | 11770 | 33722 | 5798 | 11130 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 50650 |
| 59358 | 60372655101002 | 24628 | 3643 | 15326 | 5659 | 2238 | 4323 | 18067 | 45 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 24628 |
| 186347 | 60750177002001 | 23553 | 1858 | 15018 | 6677 | 1237 | 1750 | 20566 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23553 |
| 130619 | 60599800001001 | 23071 | 9319 | 9436 | 4316 | 7413 | 10538 | 5120 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23071 |
| 48981 | 60371437001000 | 22683 | 4736 | 12397 | 5550 | 14933 | 3456 | 4294 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 22683 |
| 97829 | 60377030012000 | 21971 | 1963 | 13008 | 7000 | 14741 | 925 | 6305 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 21971 |
| 64362 | 60373105011027 | 20663 | 3268 | 13510 | 3885 | 7717 | 4444 | 8502 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20663 |
| 212152 | 60855081012002 | 20104 | 3603 | 14572 | 1929 | 246 | 1643 | 18215 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20104 |
| 52965 | 60372074001037 | 19469 | 1995 | 12795 | 4679 | 1603 | 1839 | 16027 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 19469 |
10 rows × 54 columns
Let's break this down piece by piece. First, we use the sort_values() method to sort the data frame by C000. We use ascending=False so that the highest values are at the top (the default is to sort in ascending order). This would give us
df.sort_values("C000", ascending=False)
However, we don't want to look at everything. Here, we use head() to give us only the top ten rows after sorting. This gives us the final code, df.sort_values("C000", ascending=False).head(10).
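Pandas also provides nlargest() as a shortcut for this sort-then-head pattern; the following is equivalent:
# Equivalent shortcut: the ten rows with the largest C000 values
df.nlargest(10, 'C000')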
Checkpoint: Descriptive Statistics on Your Data
Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions:
Are there any missing values?
Are there any inconsistencies in the data?
Are there missing values that may not have been coded as missing?
Are there any interesting outliers?
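As one possible starting point for the last two questions, describe() summarizes each numeric column; suspiciously extreme maximums, or placeholder-looking values such as -1 or 999, can hint at outliers or at missing values that weren't coded as missing:
# Summary statistics for each numeric column: look for extreme values
# or placeholders that might really mean "missing"
df.describe()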