Joins
Contents
%LOAD ../../lodes.db
Joins#
One of the nice things about relational databases is organization using multiple tables that are linked together in some way. For example, suppose we have one table with 6 rows called Table A:
blockid |
C000 |
---|---|
1 |
5 |
2 |
10 |
3 |
2 |
4 |
6 |
5 |
22 |
6 |
9 |
And another table with 5 rows called Table B:
blockid |
CA01 |
---|---|
2 |
2 |
5 |
4 |
6 |
1 |
7 |
2 |
8 |
0 |
Let’s say we want to combine Table A and Table B so that we have one table that contains information about blockid
, C000
, and CA01
. We want to do this by matching the two tables by what they have in common, blockid
. That is, we want a table that looks like this (let’s call this Table C):
blockid |
C000 |
CA01 |
---|---|---|
2 |
10 |
2 |
5 |
22 |
4 |
6 |
9 |
1 |
Table C has each blockid
that was in both Table A and Table B. It also contains the appropriate values for C000
and CA01
corresponding to each blockid
. This kind of matching can be quite tricky to figure out manually, since there are different numbers of rows in each table, not all of the blockid
values match for the two tables, and there are some blockid
values that aren’t in both. Fortunately for us, SQL is well-equipped to handle this task using the JOIN
statement.
SQL Code and How It Works#
Now that we have established a plan for how we’re joining two tables together, let’s take a look at the SQL code that performs this join and break it down.
SELECT * FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010
LIMIT 10;
Let’s look at the first two lines.
SELECT * FROM ca_wac_2015
JOIN ca_xwalk
Here, we want to SELECT
each column from a data table that we get from joining the tables ca_wac_2015
and ca_xwalk
. The second line takes the ca_wac_2015
table and joins the ca_xwalk
table to it.
We can’t just mash two tables together though – we need some way of making sure that the appropriate rows match. We do this with the third line:
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010
This part specifies what we’re joining on. That is, what is the ID variable that is in both tables that we want to match. Notice that they don’t need to be named the same in both tables, though you do need to specify what they are in each table, even if they are the same, as well as which table they are from.
If you run the full code below, you should see the first 10 rows (because of the LIMIT 10
) of the joined table. You should be able to scroll through all of the variables and see that we’ve managed to merge the ca_wac_2015
and ca_xwalk
tables together according to their census block IDs.
SELECT ca_wac_2015.w_geocode, ca_xwalk.tabblk2010, ca_wac_2015.c000, ca_wac_2015.ca01, ca_wac_2015.ca02, ca_wac_2015.ca03,
ca_xwalk.ctyname, ca_xwalk.cbsaname
FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010
LIMIT 10;
w_geocode | tabblk2010 | c000 | ca01 | ca02 | ca03 | ctyname | cbsaname |
---|---|---|---|---|---|---|---|
060014001001007 | 060014001001007 | 30 | 2 | 16 | 12 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001008 | 060014001001008 | 4 | 0 | 1 | 3 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001011 | 060014001001011 | 3 | 2 | 1 | 0 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001017 | 060014001001017 | 11 | 3 | 3 | 5 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001024 | 060014001001024 | 10 | 3 | 3 | 4 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001026 | 060014001001026 | 3 | 0 | 2 | 1 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001027 | 060014001001027 | 13 | 3 | 3 | 7 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001032 | 060014001001032 | 13 | 2 | 4 | 7 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001033 | 060014001001033 | 2 | 0 | 0 | 2 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001034 | 060014001001034 | 1 | 0 | 0 | 1 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
Here, we’ve chosen to display the two census block ID variables we’re joining on so that you can see the matching, as well as a few characteristics from each table. Notice that we’ve specified the table before each variable. That’s generally only necessary when both tables have a column with the same name, but we’ve done it here for clarity. The following will do the exact same thing and run just fine since the two tables don’t share any of the column names.
SELECT w_geocode, tabblk2010, c000, ca01, ca02, ca03, ctyname, cbsaname
FROM ca_wac_2015
JOIN ca_xwalk
ON ca_wac_2015.w_geocode = ca_xwalk.tabblk2010
LIMIT 10;
w_geocode | tabblk2010 | c000 | ca01 | ca02 | ca03 | ctyname | cbsaname |
---|---|---|---|---|---|---|---|
060014001001007 | 060014001001007 | 30 | 2 | 16 | 12 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001008 | 060014001001008 | 4 | 0 | 1 | 3 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001011 | 060014001001011 | 3 | 2 | 1 | 0 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001017 | 060014001001017 | 11 | 3 | 3 | 5 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001024 | 060014001001024 | 10 | 3 | 3 | 4 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001026 | 060014001001026 | 3 | 0 | 2 | 1 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001027 | 060014001001027 | 13 | 3 | 3 | 7 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001032 | 060014001001032 | 13 | 2 | 4 | 7 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001033 | 060014001001033 | 2 | 0 | 0 | 2 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
060014001001034 | 060014001001034 | 1 | 0 | 0 | 1 | Alameda County, CA | San Francisco-Oakland-Hayward, CA |
Checkpoint: Get Census Data and Join#
Try joining the Residence Area Characteristics table with the Crosswalk table in a similar manner, as well as changing some of the columns to display. As you construct your query, make sure you answer the following question regarding the join:
What are they being joined on? That is, what is the “ID” variable you’re matching on in each table?
What information does the resulting table give you? For example, after we join the WAC table to the geography crosswalk table, we are now able to determine counties or metropolitan/micropolitan areas for census block containing workplaces.