import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')
Checking for Missing Values
Now that we have our data set, let’s do a quick check for missing values. This is typically one of the first things you’ll want to do when exploring a new data set.
Below, we’ve shown two different ways of writing the same thing. Using isnull() gives us a data frame of the same size with True and False values depending on whether each value was missing or not. Then, sum() sums each column. Since Python treats True as 1 and False as 0, the sum of each column gives us the total number of missing values for each variable.
df_null = df.isnull()
df_null.sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
We did this in two separate lines, but that’s actually not necessary. In fact, we can do it all in one go:
df.isnull().sum()
w_geocode 0
C000 0
CA01 0
CA02 0
CA03 0
CE01 0
CE02 0
CE03 0
CNS01 0
CNS02 0
CNS03 0
CNS04 0
CNS05 0
CNS06 0
CNS07 0
CNS08 0
CNS09 0
CNS10 0
CNS11 0
CNS12 0
CNS13 0
CNS14 0
CNS15 0
CNS16 0
CNS17 0
CNS18 0
CNS19 0
CNS20 0
CR01 0
CR02 0
CR03 0
CR04 0
CR05 0
CR07 0
CT01 0
CT02 0
CD01 0
CD02 0
CD03 0
CD04 0
CS01 0
CS02 0
CFA01 0
CFA02 0
CFA03 0
CFA04 0
CFA05 0
CFS01 0
CFS02 0
CFS03 0
CFS04 0
CFS05 0
createdate 0
dtype: int64
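Every count here is 0, so this particular file happens to have no missing values. In a data set that did have them, you might want to list just the affected columns. Here’s a minimal sketch of one way to do that, filtering the counts themselves:

null_counts = df.isnull().sum()
null_counts[null_counts > 0]  # Keep only columns with at least one missing value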
We can also drop any duplicated rows.
df_no_dups = df.drop_duplicates()
df_no_dups.shape # Check how many rows there are after dropping duplicates
(243462, 53)
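Comparing this shape to df.shape tells you how many rows were dropped. If you just want a count of duplicate rows without creating a new data frame, one option is the duplicated() method, which returns a Series of True and False values that we can sum, just like the missing-value check above:

df.duplicated().sum()  # Number of rows that are exact copies of an earlier row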
Checking for Inconsistencies
If you check the data documentation, you’ll see that C000 is the total number of jobs. Therefore, it would make sense for the other groups of columns to add up to the values in C000. For example, you’d expect CA01, CA02, and CA03 to add up to C000 for each row. Let’s check to see if this is true.
We’ll first take the sum of CA01, CA02, and CA03 in each row and put that in a new column called CA_sum. Then, we’ll compare our new CA_sum column to the existing C000 column to see if they match. We’ll first show all the code, then explain each section.
# Create a list with the columns we want to add up
vars_to_check = ['CA01','CA02','CA03']
# Using apply to sum the columns for each row
df['CA_sum'] = df[vars_to_check].apply(sum,axis=1)
# Check how many rows don't match
sum(df.CA_sum != df.C000)
0
We first created a list called vars_to_check, which contains the columns that we want to add up. Then, we took those columns from df and used the apply() method, which applies the same function to each row (or to each column, if we passed axis=0 instead of axis=1). In this case, we want to find the sum of each row, so the first argument is sum. We want to create a new column that contains this sum, so we assign the result to a new column in df, CA_sum. Notice that this is the first place we see 'CA_sum', because this is where we create it.
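As an aside, apply() works here, but pandas data frames also have a built-in sum() method that can do the same row-wise sum directly. A minimal equivalent sketch:

df['CA_sum'] = df[vars_to_check].sum(axis=1)  # Same result as the apply() version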
Lastly, we want to check in how many rows C000 and CA_sum differ. We do this by using

df.CA_sum != df.C000

which outputs a Series of True and False values: True if the value in CA_sum is not equal to the value in C000 for that row, and False otherwise. We can then pass that Series to Python’s built-in sum function to add up how many times the values didn’t match. If there are no mismatches, the sum should be 0.
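If the sum had come back nonzero, a natural next step would be to look at the rows where the parts don’t add up. Here’s a minimal sketch that reuses the same comparison as a filter (the three columns selected are just for readability):

mismatches = df[df.CA_sum != df.C000]  # Rows where the parts don't add up to the total
mismatches[['w_geocode','C000','CA_sum']].head()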
Checking for Outliers
Suppose we want to check if there are any outliers in the total number of jobs by census block. We can sort the values in C000 in order to figure this out. Let’s say we want to find the top ten census blocks by total number of jobs.
df.sort_values("C000",ascending=False).head(10)
|  | w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | CA_sum |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 52952 | 60372074001016 | 72275 | 5546 | 47140 | 19589 | 2129 | 6625 | 63521 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 72275 |
| 64327 | 60373104003007 | 50650 | 6975 | 31905 | 11770 | 33722 | 5798 | 11130 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 50650 |
| 59358 | 60372655101002 | 24628 | 3643 | 15326 | 5659 | 2238 | 4323 | 18067 | 45 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 24628 |
| 186347 | 60750177002001 | 23553 | 1858 | 15018 | 6677 | 1237 | 1750 | 20566 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23553 |
| 130619 | 60599800001001 | 23071 | 9319 | 9436 | 4316 | 7413 | 10538 | 5120 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 23071 |
| 48981 | 60371437001000 | 22683 | 4736 | 12397 | 5550 | 14933 | 3456 | 4294 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 22683 |
| 97829 | 60377030012000 | 21971 | 1963 | 13008 | 7000 | 14741 | 925 | 6305 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 21971 |
| 64362 | 60373105011027 | 20663 | 3268 | 13510 | 3885 | 7717 | 4444 | 8502 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20663 |
| 212152 | 60855081012002 | 20104 | 3603 | 14572 | 1929 | 246 | 1643 | 18215 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 20104 |
| 52965 | 60372074001037 | 19469 | 1995 | 12795 | 4679 | 1603 | 1839 | 16027 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 | 19469 |

10 rows × 54 columns
Let’s break this down piece by piece. First, we use the sort_values() method to sort the data frame by C000. We use ascending=False so that the highest values are at the top (the default is to sort in ascending order). This would give us

df.sort_values("C000",ascending=False)

However, we don’t want to look at everything. Here, we use head() to give us only the top ten values after sorting. This gives us the final code, df.sort_values("C000",ascending=False).head(10).
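As an aside, pandas also has an nlargest() method that combines the sort and head() steps into one call. A minimal equivalent sketch:

df.nlargest(10,"C000")  # Ten rows with the largest values of C000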
Checkpoint: Descriptive Statistics on Your Data
Using the tools described above, look at the data you loaded in earlier. Make sure you know the answers to each of the following questions:
Are there any missing values?
Are there any inconsistencies in the data?
Are there missing values that may not have been coded as missing?
Are there any interesting outliers?
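As a starting point for the third question, keep in mind that missing values are sometimes coded as placeholder numbers rather than left blank. One minimal sketch is to scan the summary statistics for implausible values:

df.describe()  # Look for suspicious minimums or maximums, e.g. -999 placeholders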