%LOAD ../../lodes.db

Joins#

One of the nice things about relational databases is organization using multiple tables that are linked together in some way. For example, suppose we have one table with 6 rows called Table A:

blockid	C000
1	5
2	10
3	2
4	6
5	22
6	9

And another table with 5 rows called Table B:

blockid	CA01
2	2
5	4
6	1
7	2
8	0

Let’s say we want to combine Table A and Table B so that we have one table that contains information about blockid, C000, and CA01. We want to do this by matching the two tables by what they have in common, blockid. That is, we want a table that looks like this (let’s call this Table C):

blockid	C000	CA01
2	10	2
5	22	4
6	9	1

Table C has each blockid that was in both Table A and Table B. It also contains the appropriate values for C000 and CA01 corresponding to each blockid. This kind of matching can be quite tricky to figure out manually, since there are different numbers of rows in each table, not all of the blockid values match for the two tables, and there are some blockid values that aren’t in both. Fortunately for us, SQL is well-equipped to handle this task using the JOIN statement.

SQL Code and How It Works#

Now that we have established a plan for how we’re joining two tables together, let’s take a look at the SQL code that performs this join and break it down.

SELECT * FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010 
LIMIT 10;

Let’s look at the first two lines.

SELECT * FROM ca_wac_2015
JOIN ca_xwalk

Here, we want to SELECT each column from a data table that we get from joining the tables ca_wac_2015 and ca_xwalk. The second line takes the ca_wac_2015 table and joins the ca_xwalk table to it.

We can’t just mash two tables together though – we need some way of making sure that the appropriate rows match. We do this with the third line:

ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010

This part specifies what we’re joining on. That is, what is the ID variable that is in both tables that we want to match. Notice that they don’t need to be named the same in both tables, though you do need to specify what they are in each table, even if they are the same, as well as which table they are from.

If you run the full code below, you should see the first 10 rows (because of the LIMIT 10) of the joined table. You should be able to scroll through all of the variables and see that we’ve managed to merge the ca_wac_2015 and ca_xwalk tables together according to their census block IDs.

SELECT ca_wac_2015.w_geocode, ca_xwalk.tabblk2010, ca_wac_2015.c000, ca_wac_2015.ca01, ca_wac_2015.ca02, ca_wac_2015.ca03, 
ca_xwalk.ctyname, ca_xwalk.cbsaname 
FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010 
LIMIT 10;

w_geocode	tabblk2010	c000	ca01	ca02	ca03	ctyname	cbsaname
060014001001007	060014001001007	30	2	16	12	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001008	060014001001008	4	0	1	3	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001011	060014001001011	3	2	1	0	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001017	060014001001017	11	3	3	5	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001024	060014001001024	10	3	3	4	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001026	060014001001026	3	0	2	1	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001027	060014001001027	13	3	3	7	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001032	060014001001032	13	2	4	7	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001033	060014001001033	2	0	0	2	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001034	060014001001034	1	0	0	1	Alameda County, CA	San Francisco-Oakland-Hayward, CA

Here, we’ve chosen to display the two census block ID variables we’re joining on so that you can see the matching, as well as a few characteristics from each table. Notice that we’ve specified the table before each variable. That’s generally only necessary when both tables have a column with the same name, but we’ve done it here for clarity. The following will do the exact same thing and run just fine since the two tables don’t share any of the column names.

SELECT w_geocode, tabblk2010, c000, ca01, ca02, ca03, ctyname, cbsaname 
FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010 
LIMIT 10;

w_geocode	tabblk2010	c000	ca01	ca02	ca03	ctyname	cbsaname
060014001001007	060014001001007	30	2	16	12	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001008	060014001001008	4	0	1	3	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001011	060014001001011	3	2	1	0	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001017	060014001001017	11	3	3	5	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001024	060014001001024	10	3	3	4	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001026	060014001001026	3	0	2	1	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001027	060014001001027	13	3	3	7	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001032	060014001001032	13	2	4	7	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001033	060014001001033	2	0	0	2	Alameda County, CA	San Francisco-Oakland-Hayward, CA
060014001001034	060014001001034	1	0	0	1	Alameda County, CA	San Francisco-Oakland-Hayward, CA

Checkpoint: Get Census Data and Join#

Try joining the Residence Area Characteristics table with the Crosswalk table in a similar manner, as well as changing some of the columns to display. As you construct your query, make sure you answer the following question regarding the join:

What are they being joined on? That is, what is the “ID” variable you’re matching on in each table?
What information does the resulting table give you? For example, after we join the WAC table to the geography crosswalk table, we are now able to determine counties or metropolitan/micropolitan areas for census block containing workplaces.

Introduction to Python and SQL for Data Analysis

Joins

Contents

Joins#

SQL Code and How It Works#

Checkpoint: Get Census Data and Join#