import numpy as np 
import pandas as pd 
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')

Exploring the Data Frame#

Now that we’ve loaded in the data set as a Data Frame, let’s check the number of rows and columns. We can do this by looking at the shape attribute of a data frame.

df.shape
(243462, 53)

It looks like there are 243,462 rows and 53 columns.

Let’s also find out the names of all the variables in this data set.

df.columns
Index(['w_geocode', 'C000', 'CA01', 'CA02', 'CA03', 'CE01', 'CE02', 'CE03',
       'CNS01', 'CNS02', 'CNS03', 'CNS04', 'CNS05', 'CNS06', 'CNS07', 'CNS08',
       'CNS09', 'CNS10', 'CNS11', 'CNS12', 'CNS13', 'CNS14', 'CNS15', 'CNS16',
       'CNS17', 'CNS18', 'CNS19', 'CNS20', 'CR01', 'CR02', 'CR03', 'CR04',
       'CR05', 'CR07', 'CT01', 'CT02', 'CD01', 'CD02', 'CD03', 'CD04', 'CS01',
       'CS02', 'CFA01', 'CFA02', 'CFA03', 'CFA04', 'CFA05', 'CFS01', 'CFS02',
       'CFS03', 'CFS04', 'CFS05', 'createdate'],
      dtype='object')

To get more information about the contents of the Data Frame, we can use the .info() method. This will give us the number of non-null values and the type of data (these have all been read in as integers) for each column.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243462 entries, 0 to 243461
Data columns (total 53 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   w_geocode   243462 non-null  int64
 1   C000        243462 non-null  int64
 2   CA01        243462 non-null  int64
 3   CA02        243462 non-null  int64
 4   CA03        243462 non-null  int64
 5   CE01        243462 non-null  int64
 6   CE02        243462 non-null  int64
 7   CE03        243462 non-null  int64
 8   CNS01       243462 non-null  int64
 9   CNS02       243462 non-null  int64
 10  CNS03       243462 non-null  int64
 11  CNS04       243462 non-null  int64
 12  CNS05       243462 non-null  int64
 13  CNS06       243462 non-null  int64
 14  CNS07       243462 non-null  int64
 15  CNS08       243462 non-null  int64
 16  CNS09       243462 non-null  int64
 17  CNS10       243462 non-null  int64
 18  CNS11       243462 non-null  int64
 19  CNS12       243462 non-null  int64
 20  CNS13       243462 non-null  int64
 21  CNS14       243462 non-null  int64
 22  CNS15       243462 non-null  int64
 23  CNS16       243462 non-null  int64
 24  CNS17       243462 non-null  int64
 25  CNS18       243462 non-null  int64
 26  CNS19       243462 non-null  int64
 27  CNS20       243462 non-null  int64
 28  CR01        243462 non-null  int64
 29  CR02        243462 non-null  int64
 30  CR03        243462 non-null  int64
 31  CR04        243462 non-null  int64
 32  CR05        243462 non-null  int64
 33  CR07        243462 non-null  int64
 34  CT01        243462 non-null  int64
 35  CT02        243462 non-null  int64
 36  CD01        243462 non-null  int64
 37  CD02        243462 non-null  int64
 38  CD03        243462 non-null  int64
 39  CD04        243462 non-null  int64
 40  CS01        243462 non-null  int64
 41  CS02        243462 non-null  int64
 42  CFA01       243462 non-null  int64
 43  CFA02       243462 non-null  int64
 44  CFA03       243462 non-null  int64
 45  CFA04       243462 non-null  int64
 46  CFA05       243462 non-null  int64
 47  CFS01       243462 non-null  int64
 48  CFS02       243462 non-null  int64
 49  CFS03       243462 non-null  int64
 50  CFS04       243462 non-null  int64
 51  CFS05       243462 non-null  int64
 52  createdate  243462 non-null  int64
dtypes: int64(53)
memory usage: 98.4 MB

We can use the head and tail methods in order to look at the first or last few rows of the data frame.

df.head() # Default is to show first 5 rows.
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
0 60014001001007 30 2 16 12 4 2 24 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 60014001001008 4 0 1 3 0 0 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 60014001001011 3 2 1 0 0 3 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 60014001001017 11 3 3 5 2 2 7 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 60014001001024 10 3 3 4 7 1 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

5 rows × 53 columns

df.head(10) # We can specify how many rows we want to see.
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
0 60014001001007 30 2 16 12 4 2 24 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 60014001001008 4 0 1 3 0 0 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 60014001001011 3 2 1 0 0 3 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 60014001001017 11 3 3 5 2 2 7 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 60014001001024 10 3 3 4 7 1 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
5 60014001001026 3 0 2 1 0 2 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
6 60014001001027 13 3 3 7 4 5 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
7 60014001001032 13 2 4 7 3 2 8 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
8 60014001001033 2 0 0 2 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
9 60014001001034 1 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

10 rows × 53 columns

df.tail(10) # Same as head, except the last 10 instead of first 10
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
243452 61150411004001 1 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243453 61150411004011 1 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243454 61150411004020 1 1 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243455 61150411004025 3 0 0 3 3 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243456 61150411004027 1 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243457 61150411004035 3 0 2 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243458 61150411004037 18 3 7 8 7 9 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243459 61150411004047 11 2 5 4 9 0 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243460 61150411004051 8 6 1 1 0 7 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243461 61150411004053 10 1 5 4 3 3 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

10 rows × 53 columns

Attributes vs Variables: Note that we used head(), with parentheses, while we used just shape or column, without parentheses. This is because shape and column are attributes and head is a method. To put it another way, shape and columns are characteristics that each Data Frame object has, and we’re just displaying the values that exist already. On the other hand, head is a method, or a function that you perform specifically on a certain type of object (in this case, a Data Frame object).

Accessing the Data Frame#

What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.

Accessing Columns#

To access individual columns, we can use square brackets or we can simply use dot notation.

# Look at just total number of jobs (C000)
df["C000"] 
0         30
1          4
2          3
3         11
4         10
          ..
243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, Length: 243462, dtype: int64
# This does the same thing
df.C000 
0         30
1          4
2          3
3         11
4         10
          ..
243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, Length: 243462, dtype: int64

Remember, in Python, we are working with objects that have certain types. When we run the above code, we are accessing a specific column of a Data Frame, and that itself is a different type of object called a Series.

type(df.C000)
pandas.core.series.Series

This can be useful for working with individual columns, because we can then use Series methods to do things like find the mean or standard deviation.

# Mean number of jobs in census blocks that had jobs
df.C000.mean()
65.9188990479007
# Standard deviation
df.C000.std()
369.9156753256301

Accessing Rows#

What if we want to get certain rows? We can also use loc with square brackets. We use a colon to indicate that we want a series of indices with a start and end. We can also leave one side of the colon empty to indicate that we want the rest of the values on that end.

# Show rows 10 - 20. Remember, the first row is row 0
df.loc[10:20] 
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
10 60014001001036 14 2 9 3 3 5 6 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
11 60014001001038 3 0 0 3 0 0 3 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
12 60014001001041 1 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
13 60014001001042 9 0 9 0 0 0 9 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
14 60014001001043 3 1 1 1 0 2 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
15 60014001001044 1 0 0 1 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
16 60014001001047 2 0 2 0 0 0 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
17 60014001001049 1 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
18 60014001001053 1 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
19 60014001001054 7 0 4 3 1 2 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
20 60014001001056 1 0 1 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

11 rows × 53 columns

df.loc[:10]
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
0 60014001001007 30 2 16 12 4 2 24 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 60014001001008 4 0 1 3 0 0 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 60014001001011 3 2 1 0 0 3 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 60014001001017 11 3 3 5 2 2 7 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 60014001001024 10 3 3 4 7 1 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
5 60014001001026 3 0 2 1 0 2 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
6 60014001001027 13 3 3 7 4 5 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
7 60014001001032 13 2 4 7 3 2 8 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
8 60014001001033 2 0 0 2 2 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
9 60014001001034 1 0 0 1 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
10 60014001001036 14 2 9 3 3 5 6 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

11 rows × 53 columns

df.loc[:] # This gives all rows
w_geocode C000 CA01 CA02 CA03 CE01 CE02 CE03 CNS01 CNS02 ... CFA02 CFA03 CFA04 CFA05 CFS01 CFS02 CFS03 CFS04 CFS05 createdate
0 60014001001007 30 2 16 12 4 2 24 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
1 60014001001008 4 0 1 3 0 0 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
2 60014001001011 3 2 1 0 0 3 0 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
3 60014001001017 11 3 3 5 2 2 7 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
4 60014001001024 10 3 3 4 7 1 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
243457 61150411004035 3 0 2 1 1 1 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243458 61150411004037 18 3 7 8 7 9 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243459 61150411004047 11 2 5 4 9 0 2 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243460 61150411004051 8 6 1 1 0 7 1 0 0 ... 0 0 0 0 0 0 0 0 0 20190826
243461 61150411004053 10 1 5 4 3 3 4 0 0 ... 0 0 0 0 0 0 0 0 0 20190826

243462 rows × 53 columns

In addition, we can use loc to access certain columns as well as certain indices in the Data Frame.

# Look at rows 10 - 20 for total number of jobs (C000)
df.loc[10:20,"C000"] 
10    14
11     3
12     1
13     9
14     3
15     1
16     2
17     1
18     1
19     7
20     1
Name: C000, dtype: int64

To get a range of columns, we can use the same colon notation.

# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group
df.loc[10:20,'C000':'CA03']
C000 CA01 CA02 CA03
10 14 2 9 3
11 3 0 0 3
12 1 0 0 1
13 9 0 9 0
14 3 1 1 1
15 1 0 0 1
16 2 0 2 0
17 1 0 1 0
18 1 0 1 0
19 7 0 4 3
20 1 0 1 0

An alternative to loc is iloc. This takes rows from specific positions in the Data Frame rather than the row labels. Most of the time, row labels are going to be numbered sequentially, so loc and iloc should act similarly. However, sometimes, especially when creating subsets of the data, you might end up with row labels that aren’t ordered sequentially and go up by one. In those cases, iloc might be more useful.

Another use case for iloc is in using negative numbers.

df.iloc[-5:]['C000']
243457     3
243458    18
243459    11
243460     8
243461    10
Name: C000, dtype: int64

In this case, we were able to use the “-5:” to indicate that we want the last 5 rows of the data frame. Note that we can’t do the same with .loc. This is because .loc retrieves the rows from a particular label in the Data Frame, while .iloc retrieves them from particular positions.

Checkpoint: Explore Other Data#

Copy the code you used in the previous section to bring in another dataset. Try to access different values within the dataset. Can you isolate a certain variable? Look at the documentation and identify a variable to pull out of the dataset.