Using Aggregation Functions
Contents
%LOAD ../../lodes.db
Using Aggregation Functions#
We’ve explored how to create a variable that indicates whether the census block had over 100 jobs or not using the code below.
ALTER TABLE ca_wac_2015 ADD over100 BOOL;
UPDATE ca_wac_2015 SET over100 = 0;
UPDATE ca_wac_2015 SET over100 = 1 WHERE c000 > 100;
Let’s take a quick look at what we did.
SELECT w_geocode, c000, over100
FROM ca_wac_2015
LIMIT 10;
w_geocode | c000 | over100 |
---|---|---|
060014001001007 | 30 | 0 |
060014001001008 | 4 | 0 |
060014001001011 | 3 | 0 |
060014001001017 | 11 | 0 |
060014001001024 | 10 | 0 |
060014001001026 | 3 | 0 |
060014001001027 | 13 | 0 |
060014001001032 | 13 | 0 |
060014001001033 | 2 | 0 |
060014001001034 | 1 | 0 |
What if we wanted to know how many blocks had over 100 jobs and how many didn’t? We can now use the GROUP BY
statement.
SELECT over100, COUNT(over100)
FROM ca_wac_2015
GROUP BY over100
LIMIT 10;
over100 | COUNT(over100) |
---|---|
0 | 214364 |
1 | 29098 |
Here, the GROUP BY
statement groups it into the categories of the variable. Since we’ve chosen to display the count, we can see the counts. We can also change the order in which the results are displayed so that it’s in increasing order.
SELECT over100, COUNT(over100)
FROM ca_wac_2015
GROUP BY over100
ORDER BY COUNT(over100)
LIMIT 10;
over100 | COUNT(over100) |
---|---|
1 | 29098 |
0 | 214364 |
The ORDER BY
statement orders the rows that it displays according to whatever you put after it. In this case, we chose the count of over100
.
Using GROUP BY with Multiple Variables#
For the next few queries, let’s try using a different table. The ca_xwalk
table in the same lodes
database contains information about each of the census blocks in California. We can use this to, for example, look at aggregation by CBSA (metropolitan/micropolitan area) name and by county name.
SELECT cbsaname, ctyname, COUNT(*)
FROM ca_xwalk
GROUP BY cbsaname, ctyname
ORDER BY COUNT(*) DESC
LIMIT 10;
cbsaname | ctyname | COUNT(*) |
---|---|---|
Los Angeles-Long Beach-Anaheim, CA | Los Angeles County, CA | 109587 |
Riverside-San Bernardino-Ontario, CA | San Bernardino County, CA | 48173 |
San Diego-Carlsbad, CA | San Diego County, CA | 43415 |
Los Angeles-Long Beach-Anaheim, CA | Orange County, CA | 36873 |
Riverside-San Bernardino-Ontario, CA | Riverside County, CA | 35720 |
Bakersfield, CA | Kern County, CA | 35280 |
San Francisco-Oakland-Hayward, CA | Alameda County, CA | 23956 |
San Jose-Sunnyvale-Santa Clara, CA | Santa Clara County, CA | 22369 |
Fresno, CA | Fresno County, CA | 22096 |
Sacramento--Roseville--Arden-Arcade, CA | Sacramento County, CA | 19939 |
This first groups by CBSA (cbsaname
) name, then it groups by county (ctyname
), in that order. In this case, county is nested completely in the metropolitan/micropolitan area. In other cases in which we don’t have complete nesting, we would be able to see all possible combinations that exist in the data.
Further, notice that we used DESC
after ORDER BY
. This orders in descending order instead of ascending order, so that we can see the areas with the most census blocks at the top.
Conditional Statements After Aggregation#
Suppose we wanted to display only certain counts. We can use HAVING
to do this.
SELECT ctyname, cbsaname, COUNT(cbsaname)
FROM ca_xwalk
GROUP BY ctyname, cbsaname
HAVING count(cbsaname) > 20000
ORDER BY COUNT(*) DESC
LIMIT 10;
ctyname | cbsaname | COUNT(cbsaname) |
---|---|---|
Los Angeles County, CA | Los Angeles-Long Beach-Anaheim, CA | 109587 |
San Bernardino County, CA | Riverside-San Bernardino-Ontario, CA | 48173 |
San Diego County, CA | San Diego-Carlsbad, CA | 43415 |
Orange County, CA | Los Angeles-Long Beach-Anaheim, CA | 36873 |
Riverside County, CA | Riverside-San Bernardino-Ontario, CA | 35720 |
Kern County, CA | Bakersfield, CA | 35280 |
Alameda County, CA | San Francisco-Oakland-Hayward, CA | 23956 |
Santa Clara County, CA | San Jose-Sunnyvale-Santa Clara, CA | 22369 |
Fresno County, CA | Fresno, CA | 22096 |
This will only display the counts for which the count of cbsaname
is greater than 20000. Note that this is different from using WHERE
, since the conditional statement comes after the GROUP BY
statement. Basically, HAVING
gives us a way of using the same types of conditional statements after we do our aggregation.
Using Different Aggregation Functions#
What if we wanted to find the sum within each group, or the minimum or maximum value? We can use the appropriate aggregation function. To show this, let’s go back to our ca_wac_2015
table.
SELECT over100, COUNT(over100), AVG(c000) AS avg_jobs, MIN(c000) AS min_jobs, MAX(c000) AS max_jobs
FROM ca_wac_2015
GROUP BY over100
ORDER BY over100
LIMIT 10;
over100 | COUNT(over100) | avg_jobs | min_jobs | max_jobs |
---|---|---|---|---|
0 | 214364 | 14.4433300367599 | 1 | 100 |
1 | 29098 | 445.137707058904 | 101 | 72275 |
Here, we’re finding the counts, average, minimum, and maximum value of the total jobs in each census block within each group. Now, we’re not doing anything very insightful here, since the groups already split the blocks by how many jobs there are. However, as we’ll see later on, these aggregation functions can be very useful. For example, suppose we had the county data that’s in ca_xwalk
in this table. We could find the average number of jobs per census block for each county in this way.
Aliasing: You may have noticed that we included a part using “
AS
,” followed by a new name, in the first line. When you ran the code, you might have noticed that the column labels were changed to these new names. This is called aliasing, and is done for readability and ease of access. Later on, aliasing will also help us more easily reference tables within the same query.
Checkpoint: Checking Your Dataset#
Using the above methods, explore the tables we’ve provided or your own state’s data to answer the questions below. As before, make sure to include the queries with your answers.
Which county has the most census blocks?
Which Metropolitan/Micropolitan area has the most census blocks?
Which Origin census block - Destination census block combination has the most workers? How many workers are in this combination?
How would you find all counties containing at least 1000 census blocks?
For California, how many census blocks are there with a latitude above +36?
For California, which county has the most census blocks above the +36 latitude line? Which county has the most below?