Using Joins With Aggregation Functions
Contents
%LOAD ../../lodes.db
Using Joins With Aggregation Functions#
Suppose we wanted to know the distribution of the total number of jobs in California by county. The employment information is in one table, and county information is in another table. We need to join them, then aggregate the jobs, by county. Even though this seems like a complicated multi-step process, we can actually do it all in one query. Let’s break it down into two parts: the join and the aggregation.
The Join#
We need to join the ca_wac_2015
table and the ca_xwalk
table. Since we aren’t worried about counties that have no jobs, we can do an inner join. Consider the following JOIN
(if you want to run it, I would suggest adding a LIMIT
at the end):
SELECT a.w_geocode, a.c000, b.ctyname
FROM ca_wac_2015 a
JOIN ca_xwalk b
ON a.w_geocode = b.tabblk2010;
This should provide the JOIN
that we want, as well as the relevant columns. We need to take the table we get from the join and apply the aggregation to it.
The Aggregation#
From our joined table, we need GROUP BY
county, then find the SUM()
. For now, let’s call our joined table “joinedtable
” and write the query based on this table. Since we’re also interested in what counties have the most jobs, we’ll order by the sum in descending order.
SELECT ctyname, SUM(c000)
FROM joinedtable
GROUP BY ctyname
ORDER BY SUM(c000) DESC
LIMIT 30;
But we’ve actually already figured out how to get table “joinedtable
” using the JOIN
s above. All we need to do is put the JOIN
in.
SELECT b.ctyname, SUM(a.c000)
FROM ca_wac_2015 a
JOIN ca_xwalk b
ON a.w_geocode = b.tabblk2010
GROUP BY ctyname
ORDER BY SUM(c000) DESC
LIMIT 30;
ctyname | SUM(a.c000) |
---|---|
Los Angeles County, CA | 4443133 |
Orange County, CA | 1578369 |
San Diego County, CA | 1338649 |
Santa Clara County, CA | 1006868 |
Alameda County, CA | 751240 |
San Francisco County, CA | 700616 |
San Bernardino County, CA | 685642 |
Riverside County, CA | 649800 |
Sacramento County, CA | 631522 |
San Mateo County, CA | 387932 |
Contra Costa County, CA | 359762 |
Fresno County, CA | 349119 |
Ventura County, CA | 304533 |
Kern County, CA | 284472 |
San Joaquin County, CA | 230784 |
Sonoma County, CA | 193045 |
Santa Barbara County, CA | 182850 |
Stanislaus County, CA | 178292 |
Monterey County, CA | 170326 |
Tulare County, CA | 153744 |
Placer County, CA | 144750 |
Solano County, CA | 144473 |
Marin County, CA | 112471 |
San Luis Obispo County, CA | 106785 |
Santa Cruz County, CA | 99832 |
Yolo County, CA | 76969 |
Butte County, CA | 76587 |
Napa County, CA | 73604 |
Merced County, CA | 70947 |
Shasta County, CA | 59933 |
Checkpoint: Putting It All Together#
Look back to the motivating question: What are the characteristics of the distribution of jobs by county and by metropolitan/micropolitan area?
Using what you know about joins and aggregation functions, try to answer the following questions:
Which counties have the most jobs?
Which counties do the workers live in the most?
Which metropolitan/micropolitan areas have the most jobs?
What are some summary statistics of jobs at the county or metropolitan/micropolitan level (e.g. average number of jobs per county)?