Exploring the Data Frame
Contents
import numpy as np
import pandas as pd
data_url = 'https://lehd.ces.census.gov/data/lodes/LODES7/ca/wac/ca_wac_S000_JT00_2015.csv.gz'
df = pd.read_csv(data_url,compression='gzip')
Exploring the Data Frame#
Now that we’ve loaded in the data set as a Data Frame, let’s check the number of rows and columns. We can do this by looking at the shape
attribute of a data frame.
df.shape
(243462, 53)
It looks like there are 243,462 rows and 53 columns.
Let’s also find out the names of all the variables in this data set.
df.columns
Index(['w_geocode', 'C000', 'CA01', 'CA02', 'CA03', 'CE01', 'CE02', 'CE03',
'CNS01', 'CNS02', 'CNS03', 'CNS04', 'CNS05', 'CNS06', 'CNS07', 'CNS08',
'CNS09', 'CNS10', 'CNS11', 'CNS12', 'CNS13', 'CNS14', 'CNS15', 'CNS16',
'CNS17', 'CNS18', 'CNS19', 'CNS20', 'CR01', 'CR02', 'CR03', 'CR04',
'CR05', 'CR07', 'CT01', 'CT02', 'CD01', 'CD02', 'CD03', 'CD04', 'CS01',
'CS02', 'CFA01', 'CFA02', 'CFA03', 'CFA04', 'CFA05', 'CFS01', 'CFS02',
'CFS03', 'CFS04', 'CFS05', 'createdate'],
dtype='object')
To get more information about the contents of the Data Frame, we can use the .info()
method. This will give us the number of non-null values and the type of data (these have all been read in as integers) for each column.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243462 entries, 0 to 243461
Data columns (total 53 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 w_geocode 243462 non-null int64
1 C000 243462 non-null int64
2 CA01 243462 non-null int64
3 CA02 243462 non-null int64
4 CA03 243462 non-null int64
5 CE01 243462 non-null int64
6 CE02 243462 non-null int64
7 CE03 243462 non-null int64
8 CNS01 243462 non-null int64
9 CNS02 243462 non-null int64
10 CNS03 243462 non-null int64
11 CNS04 243462 non-null int64
12 CNS05 243462 non-null int64
13 CNS06 243462 non-null int64
14 CNS07 243462 non-null int64
15 CNS08 243462 non-null int64
16 CNS09 243462 non-null int64
17 CNS10 243462 non-null int64
18 CNS11 243462 non-null int64
19 CNS12 243462 non-null int64
20 CNS13 243462 non-null int64
21 CNS14 243462 non-null int64
22 CNS15 243462 non-null int64
23 CNS16 243462 non-null int64
24 CNS17 243462 non-null int64
25 CNS18 243462 non-null int64
26 CNS19 243462 non-null int64
27 CNS20 243462 non-null int64
28 CR01 243462 non-null int64
29 CR02 243462 non-null int64
30 CR03 243462 non-null int64
31 CR04 243462 non-null int64
32 CR05 243462 non-null int64
33 CR07 243462 non-null int64
34 CT01 243462 non-null int64
35 CT02 243462 non-null int64
36 CD01 243462 non-null int64
37 CD02 243462 non-null int64
38 CD03 243462 non-null int64
39 CD04 243462 non-null int64
40 CS01 243462 non-null int64
41 CS02 243462 non-null int64
42 CFA01 243462 non-null int64
43 CFA02 243462 non-null int64
44 CFA03 243462 non-null int64
45 CFA04 243462 non-null int64
46 CFA05 243462 non-null int64
47 CFS01 243462 non-null int64
48 CFS02 243462 non-null int64
49 CFS03 243462 non-null int64
50 CFS04 243462 non-null int64
51 CFS05 243462 non-null int64
52 createdate 243462 non-null int64
dtypes: int64(53)
memory usage: 98.4 MB
We can use the head
and tail
methods in order to look at the first or last few rows of the data frame.
df.head() # Default is to show first 5 rows.
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
5 rows × 53 columns
df.head(10) # We can specify how many rows we want to see.
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
5 | 60014001001026 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
6 | 60014001001027 | 13 | 3 | 3 | 7 | 4 | 5 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
7 | 60014001001032 | 13 | 2 | 4 | 7 | 3 | 2 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
8 | 60014001001033 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
9 | 60014001001034 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
10 rows × 53 columns
df.tail(10) # Same as head, except the last 10 instead of first 10
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
243452 | 61150411004001 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243453 | 61150411004011 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243454 | 61150411004020 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243455 | 61150411004025 | 3 | 0 | 0 | 3 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243456 | 61150411004027 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243457 | 61150411004035 | 3 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243458 | 61150411004037 | 18 | 3 | 7 | 8 | 7 | 9 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243459 | 61150411004047 | 11 | 2 | 5 | 4 | 9 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243460 | 61150411004051 | 8 | 6 | 1 | 1 | 0 | 7 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243461 | 61150411004053 | 10 | 1 | 5 | 4 | 3 | 3 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
10 rows × 53 columns
Attributes vs Variables: Note that we used
head()
, with parentheses, while we used justshape
orcolumn
, without parentheses. This is becauseshape
andcolumn
are attributes and head is a method. To put it another way,shape
andcolumns
are characteristics that each Data Frame object has, and we’re just displaying the values that exist already. On the other hand,head
is a method, or a function that you perform specifically on a certain type of object (in this case, a Data Frame object).
Accessing the Data Frame#
What if we want to only look at certain cells, or certain columns? We can use a variety of commands to do just that.
Accessing Columns#
To access individual columns, we can use square brackets or we can simply use dot notation.
# Look at just total number of jobs (C000)
df["C000"]
0 30
1 4
2 3
3 11
4 10
..
243457 3
243458 18
243459 11
243460 8
243461 10
Name: C000, Length: 243462, dtype: int64
# This does the same thing
df.C000
0 30
1 4
2 3
3 11
4 10
..
243457 3
243458 18
243459 11
243460 8
243461 10
Name: C000, Length: 243462, dtype: int64
Remember, in Python, we are working with objects that have certain types. When we run the above code, we are accessing a specific column of a Data Frame, and that itself is a different type of object called a Series.
type(df.C000)
pandas.core.series.Series
This can be useful for working with individual columns, because we can then use Series methods to do things like find the mean or standard deviation.
# Mean number of jobs in census blocks that had jobs
df.C000.mean()
65.9188990479007
# Standard deviation
df.C000.std()
369.9156753256301
Accessing Rows#
What if we want to get certain rows? We can also use loc
with square brackets. We use a colon to indicate that we want a series of indices with a start and end. We can also leave one side of the colon empty to indicate that we want the rest of the values on that end.
# Show rows 10 - 20. Remember, the first row is row 0
df.loc[10:20]
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
10 | 60014001001036 | 14 | 2 | 9 | 3 | 3 | 5 | 6 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
11 | 60014001001038 | 3 | 0 | 0 | 3 | 0 | 0 | 3 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
12 | 60014001001041 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
13 | 60014001001042 | 9 | 0 | 9 | 0 | 0 | 0 | 9 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
14 | 60014001001043 | 3 | 1 | 1 | 1 | 0 | 2 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
15 | 60014001001044 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
16 | 60014001001047 | 2 | 0 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
17 | 60014001001049 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
18 | 60014001001053 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
19 | 60014001001054 | 7 | 0 | 4 | 3 | 1 | 2 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
20 | 60014001001056 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
11 rows × 53 columns
df.loc[:10]
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
5 | 60014001001026 | 3 | 0 | 2 | 1 | 0 | 2 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
6 | 60014001001027 | 13 | 3 | 3 | 7 | 4 | 5 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
7 | 60014001001032 | 13 | 2 | 4 | 7 | 3 | 2 | 8 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
8 | 60014001001033 | 2 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
9 | 60014001001034 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
10 | 60014001001036 | 14 | 2 | 9 | 3 | 3 | 5 | 6 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
11 rows × 53 columns
df.loc[:] # This gives all rows
w_geocode | C000 | CA01 | CA02 | CA03 | CE01 | CE02 | CE03 | CNS01 | CNS02 | ... | CFA02 | CFA03 | CFA04 | CFA05 | CFS01 | CFS02 | CFS03 | CFS04 | CFS05 | createdate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 60014001001007 | 30 | 2 | 16 | 12 | 4 | 2 | 24 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
1 | 60014001001008 | 4 | 0 | 1 | 3 | 0 | 0 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
2 | 60014001001011 | 3 | 2 | 1 | 0 | 0 | 3 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
3 | 60014001001017 | 11 | 3 | 3 | 5 | 2 | 2 | 7 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
4 | 60014001001024 | 10 | 3 | 3 | 4 | 7 | 1 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
243457 | 61150411004035 | 3 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243458 | 61150411004037 | 18 | 3 | 7 | 8 | 7 | 9 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243459 | 61150411004047 | 11 | 2 | 5 | 4 | 9 | 0 | 2 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243460 | 61150411004051 | 8 | 6 | 1 | 1 | 0 | 7 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243461 | 61150411004053 | 10 | 1 | 5 | 4 | 3 | 3 | 4 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 20190826 |
243462 rows × 53 columns
In addition, we can use loc
to access certain columns as well as certain indices in the Data Frame.
# Look at rows 10 - 20 for total number of jobs (C000)
df.loc[10:20,"C000"]
10 14
11 3
12 1
13 9
14 3
15 1
16 2
17 1
18 1
19 7
20 1
Name: C000, dtype: int64
To get a range of columns, we can use the same colon notation.
# Look at rows 10 - 20 for total number of jobs (C000) and jobs by age group
df.loc[10:20,'C000':'CA03']
C000 | CA01 | CA02 | CA03 | |
---|---|---|---|---|
10 | 14 | 2 | 9 | 3 |
11 | 3 | 0 | 0 | 3 |
12 | 1 | 0 | 0 | 1 |
13 | 9 | 0 | 9 | 0 |
14 | 3 | 1 | 1 | 1 |
15 | 1 | 0 | 0 | 1 |
16 | 2 | 0 | 2 | 0 |
17 | 1 | 0 | 1 | 0 |
18 | 1 | 0 | 1 | 0 |
19 | 7 | 0 | 4 | 3 |
20 | 1 | 0 | 1 | 0 |
An alternative to loc
is iloc
. This takes rows from specific positions in the Data Frame rather than the row labels. Most of the time, row labels are going to be numbered sequentially, so loc
and iloc
should act similarly. However, sometimes, especially when creating subsets of the data, you might end up with row labels that aren’t ordered sequentially and go up by one. In those cases, iloc
might be more useful.
Another use case for iloc
is in using negative numbers.
df.iloc[-5:]['C000']
243457 3
243458 18
243459 11
243460 8
243461 10
Name: C000, dtype: int64
In this case, we were able to use the “-5:
” to indicate that we want the last 5 rows of the data frame. Note that we can’t do the same with .loc
. This is because .loc
retrieves the rows from a particular label in the Data Frame, while .iloc
retrieves them from particular positions.
Checkpoint: Explore Other Data#
Copy the code you used in the previous section to bring in another dataset. Try to access different values within the dataset. Can you isolate a certain variable? Look at the documentation and identify a variable to pull out of the dataset.