%LOAD ../../lodes.db

Using Joins With Aggregation Functions#

Suppose we wanted to know the distribution of the total number of jobs in California by county. The employment information is in one table, and county information is in another table. We need to join them, then aggregate the jobs, by county. Even though this seems like a complicated multi-step process, we can actually do it all in one query. Let’s break it down into two parts: the join and the aggregation.

The Join#

We need to join the ca_wac_2015 table and the ca_xwalk table. Since we aren’t worried about counties that have no jobs, we can do an inner join. Consider the following JOIN (if you want to run it, I would suggest adding a LIMIT at the end):

SELECT a.w_geocode, a.c000, b.ctyname
FROM ca_wac_2015 a
JOIN ca_xwalk b
ON a.w_geocode = b.tabblk2010;

This should provide the JOIN that we want, as well as the relevant columns. We need to take the table we get from the join and apply the aggregation to it.

The Aggregation#

From our joined table, we need GROUP BY county, then find the SUM(). For now, let’s call our joined table “joinedtable” and write the query based on this table. Since we’re also interested in what counties have the most jobs, we’ll order by the sum in descending order.

SELECT ctyname, SUM(c000)
FROM joinedtable
GROUP BY ctyname
ORDER BY SUM(c000) DESC
LIMIT 30;

But we’ve actually already figured out how to get table “joinedtable” using the JOINs above. All we need to do is put the JOIN in.

SELECT b.ctyname, SUM(a.c000)
FROM ca_wac_2015 a
JOIN ca_xwalk b
ON a.w_geocode = b.tabblk2010
GROUP BY ctyname
ORDER BY SUM(c000) DESC
LIMIT 30;
ctyname SUM(a.c000)
Los Angeles County, CA 4443133
Orange County, CA 1578369
San Diego County, CA 1338649
Santa Clara County, CA 1006868
Alameda County, CA 751240
San Francisco County, CA 700616
San Bernardino County, CA 685642
Riverside County, CA 649800
Sacramento County, CA 631522
San Mateo County, CA 387932
Contra Costa County, CA 359762
Fresno County, CA 349119
Ventura County, CA 304533
Kern County, CA 284472
San Joaquin County, CA 230784
Sonoma County, CA 193045
Santa Barbara County, CA 182850
Stanislaus County, CA 178292
Monterey County, CA 170326
Tulare County, CA 153744
Placer County, CA 144750
Solano County, CA 144473
Marin County, CA 112471
San Luis Obispo County, CA 106785
Santa Cruz County, CA 99832
Yolo County, CA 76969
Butte County, CA 76587
Napa County, CA 73604
Merced County, CA 70947
Shasta County, CA 59933

Checkpoint: Putting It All Together#

Look back to the motivating question: What are the characteristics of the distribution of jobs by county and by metropolitan/micropolitan area?

Using what you know about joins and aggregation functions, try to answer the following questions:

  • Which counties have the most jobs?

  • Which counties do the workers live in the most?

  • Which metropolitan/micropolitan areas have the most jobs?

  • What are some summary statistics of jobs at the county or metropolitan/micropolitan level (e.g. average number of jobs per county)?