Checking Number of Rows and Duplicates
Contents
%LOAD ../../lodes.db
Checking Number of Rows and Duplicates#
Let’s say we want to find out how many rows there are. You can do this by using a COUNT
.
SELECT COUNT(*)
FROM ca_wac_2015;
COUNT(*) |
---|
243462 |
Here, we used COUNT(*)
, which does a count of all rows, regardless of NULL
values. We can instead do a count of all non-NULL
values of a certain variable by including that variable instead of *
.
SELECT COUNT(w_geocode)
FROM ca_wac_2015;
COUNT(w_geocode) |
---|
243462 |
We can also look at both of these at the same time. This can be helpful for checking for NULL
values of multiple columns at a time, since we’ll be able to see which ones have less than the total number of rows.
SELECT COUNT(*), COUNT(w_geocode), COUNT(C000)
FROM ca_wac_2015;
COUNT(*) | COUNT(w_geocode) | COUNT(C000) |
---|---|---|
243462 | 243462 | 243462 |
But wait; what if there are duplicates in the data? We can check for them by using DISTINCT
.
SELECT DISTINCT w_geocode
FROM ca_wac_2015
LIMIT 100;
w_geocode |
---|
060014001001007 |
060014001001008 |
060014001001011 |
060014001001017 |
060014001001024 |
060014001001026 |
060014001001027 |
060014001001032 |
060014001001033 |
060014001001034 |
060014001001036 |
060014001001038 |
060014001001041 |
060014001001042 |
060014001001043 |
060014001001044 |
060014001001047 |
060014001001049 |
060014001001053 |
060014001001054 |
060014001001056 |
060014001001057 |
060014001001058 |
060014001001060 |
060014001001062 |
060014001001065 |
060014001001068 |
060014001001074 |
060014002001001 |
060014002001002 |
060014002001004 |
060014002001005 |
060014002001009 |
060014002001010 |
060014002001011 |
060014002001012 |
060014002001016 |
060014002001018 |
060014002001019 |
060014002001020 |
060014002001021 |
060014002001022 |
060014002002000 |
060014002002002 |
060014002002003 |
060014002002004 |
060014002002005 |
060014002002006 |
060014002002008 |
060014002002013 |
060014002002014 |
060014002002015 |
060014003001006 |
060014003001007 |
060014003001008 |
060014003001009 |
060014003001010 |
060014003001011 |
060014003001012 |
060014003001013 |
060014003001014 |
060014003001015 |
060014003001016 |
060014003001017 |
060014003002002 |
060014003002003 |
060014003002004 |
060014003002005 |
060014003002006 |
060014003002007 |
060014003002008 |
060014003002009 |
060014003002010 |
060014003002013 |
060014003002015 |
060014003002017 |
060014003002018 |
060014003002019 |
060014003002020 |
060014003002022 |
060014003002023 |
060014003003001 |
060014003003002 |
060014003003005 |
060014003003006 |
060014003003017 |
060014003003018 |
060014003003019 |
060014003003021 |
060014003003022 |
060014003003024 |
060014003003026 |
060014003003027 |
060014003003030 |
060014003004002 |
060014003004003 |
060014003004004 |
060014003004005 |
060014003004006 |
060014003004007 |
This shows us all of the rows with distinct w_geocode
values; that is, all of the distinct census block ids. Let’s count how many there are. To count them, all we have to do is put COUNT()
around the DISTINCT
part.
SELECT COUNT(DISTINCT w_geocode)
FROM ca_wac_2015;
COUNT(DISTINCT w_geocode) |
---|
243462 |
Building Up a Query Notice that we wanted to count the number of distinct rows, but we first started from querying the rows with distinct
w_geocode
first before adding in theCOUNT
. Though this is a simple example, this process of building up a query as we go is important, especially when we get to more complicated tasks. When writing a query, try to think about the basic parts first, and feel free to run intermediate steps (making sure to includeLIMIT
) as you go.
Checkpoint: Counting Rows, Using Conditional Statements and Creating Variables#
We’ve included the 2015 OD, RAC, WAC, and geography crosswalk data for both California and Illinois for you in tables. The Illinois data follow the same format as the California, except with il
replacing ca
(e.g. il_wac_2015
for the Illinois 2015 WAC). Try using the methods described in this section to further explore the tables. Answer the questions below, making sure to write out the queries used to answer the questions.
How many counties are there?
How many total census blocks are there?
How many Metropolitan/Micropolitan areas are there?