%LOAD ../../lodes.db

Using Aggregation Functions#

We’ve explored how to create a variable that indicates whether the census block had over 100 jobs or not using the code below.

ALTER TABLE ca_wac_2015 ADD over100 BOOL;

UPDATE ca_wac_2015 SET over100 = 0;

UPDATE ca_wac_2015 SET over100 = 1 WHERE c000 > 100;

Let’s take a quick look at what we did.

SELECT w_geocode, c000, over100 
FROM ca_wac_2015 
LIMIT 10;

w_geocode	c000	over100
060014001001007	30	0
060014001001008	4	0
060014001001011	3	0
060014001001017	11	0
060014001001024	10	0
060014001001026	3	0
060014001001027	13	0
060014001001032	13	0
060014001001033	2	0
060014001001034	1	0

What if we wanted to know how many blocks had over 100 jobs and how many didn’t? We can now use the GROUP BY statement.

SELECT over100, COUNT(over100) 
FROM ca_wac_2015 
GROUP BY over100 
LIMIT 10;

over100	COUNT(over100)
0	214364
1	29098

Here, the GROUP BY statement groups it into the categories of the variable. Since we’ve chosen to display the count, we can see the counts. We can also change the order in which the results are displayed so that it’s in increasing order.

SELECT over100, COUNT(over100) 
FROM ca_wac_2015
GROUP BY over100 
ORDER BY COUNT(over100)
LIMIT 10;

over100	COUNT(over100)
1	29098
0	214364

The ORDER BY statement orders the rows that it displays according to whatever you put after it. In this case, we chose the count of over100.

Using GROUP BY with Multiple Variables#

For the next few queries, let’s try using a different table. The ca_xwalk table in the same lodes database contains information about each of the census blocks in California. We can use this to, for example, look at aggregation by CBSA (metropolitan/micropolitan area) name and by county name.

SELECT cbsaname, ctyname, COUNT(*)
FROM ca_xwalk
GROUP BY cbsaname, ctyname
ORDER BY COUNT(*) DESC
LIMIT 10;

cbsaname	ctyname	COUNT(*)
Los Angeles-Long Beach-Anaheim, CA	Los Angeles County, CA	109587
Riverside-San Bernardino-Ontario, CA	San Bernardino County, CA	48173
San Diego-Carlsbad, CA	San Diego County, CA	43415
Los Angeles-Long Beach-Anaheim, CA	Orange County, CA	36873
Riverside-San Bernardino-Ontario, CA	Riverside County, CA	35720
Bakersfield, CA	Kern County, CA	35280
San Francisco-Oakland-Hayward, CA	Alameda County, CA	23956
San Jose-Sunnyvale-Santa Clara, CA	Santa Clara County, CA	22369
Fresno, CA	Fresno County, CA	22096
Sacramento--Roseville--Arden-Arcade, CA	Sacramento County, CA	19939

This first groups by CBSA (cbsaname) name, then it groups by county (ctyname), in that order. In this case, county is nested completely in the metropolitan/micropolitan area. In other cases in which we don’t have complete nesting, we would be able to see all possible combinations that exist in the data.

Further, notice that we used DESC after ORDER BY. This orders in descending order instead of ascending order, so that we can see the areas with the most census blocks at the top.

Conditional Statements After Aggregation#

Suppose we wanted to display only certain counts. We can use HAVING to do this.

SELECT ctyname, cbsaname, COUNT(cbsaname)
FROM ca_xwalk
GROUP BY ctyname, cbsaname
HAVING count(cbsaname) > 20000
ORDER BY COUNT(*) DESC
LIMIT 10;

ctyname	cbsaname	COUNT(cbsaname)
Los Angeles County, CA	Los Angeles-Long Beach-Anaheim, CA	109587
San Bernardino County, CA	Riverside-San Bernardino-Ontario, CA	48173
San Diego County, CA	San Diego-Carlsbad, CA	43415
Orange County, CA	Los Angeles-Long Beach-Anaheim, CA	36873
Riverside County, CA	Riverside-San Bernardino-Ontario, CA	35720
Kern County, CA	Bakersfield, CA	35280
Alameda County, CA	San Francisco-Oakland-Hayward, CA	23956
Santa Clara County, CA	San Jose-Sunnyvale-Santa Clara, CA	22369
Fresno County, CA	Fresno, CA	22096

This will only display the counts for which the count of cbsaname is greater than 20000. Note that this is different from using WHERE, since the conditional statement comes after the GROUP BY statement. Basically, HAVING gives us a way of using the same types of conditional statements after we do our aggregation.

Using Different Aggregation Functions#

What if we wanted to find the sum within each group, or the minimum or maximum value? We can use the appropriate aggregation function. To show this, let’s go back to our ca_wac_2015 table.

SELECT over100, COUNT(over100), AVG(c000) AS avg_jobs, MIN(c000) AS min_jobs, MAX(c000) AS max_jobs
FROM ca_wac_2015 
GROUP BY over100 
ORDER BY over100
LIMIT 10;

over100	COUNT(over100)	avg_jobs	min_jobs	max_jobs
0	214364	14.4433300367599	1	100
1	29098	445.137707058904	101	72275

Here, we’re finding the counts, average, minimum, and maximum value of the total jobs in each census block within each group. Now, we’re not doing anything very insightful here, since the groups already split the blocks by how many jobs there are. However, as we’ll see later on, these aggregation functions can be very useful. For example, suppose we had the county data that’s in ca_xwalk in this table. We could find the average number of jobs per census block for each county in this way.

Aliasing: You may have noticed that we included a part using “AS,” followed by a new name, in the first line. When you ran the code, you might have noticed that the column labels were changed to these new names. This is called aliasing, and is done for readability and ease of access. Later on, aliasing will also help us more easily reference tables within the same query.

Checkpoint: Checking Your Dataset#

Using the above methods, explore the tables we’ve provided or your own state’s data to answer the questions below. As before, make sure to include the queries with your answers.

Which county has the most census blocks?
Which Metropolitan/Micropolitan area has the most census blocks?
Which Origin census block - Destination census block combination has the most workers? How many workers are in this combination?
How would you find all counties containing at least 1000 census blocks?
For California, how many census blocks are there with a latitude above +36?
For California, which county has the most census blocks above the +36 latitude line? Which county has the most below?

Introduction to Python and SQL for Data Analysis

Using Aggregation Functions

Contents

Using Aggregation Functions#

Using GROUP BY with Multiple Variables#

Conditional Statements After Aggregation#

Using Different Aggregation Functions#

Checkpoint: Checking Your Dataset#