This is the final episode of the SQL for Data Analysis – Tutorial for Beginners series. Today I’ll show you some more advanced SQL moves! Though junior data analysts can live without them, I still recommend that you learn and use them as they will bring your SQL game to the next level and make you more efficient at your daily job. (Or if you don’t have a job yet, they will help impress the people in front of you at your data analyst job interview.) I’m going to show you: subqueries, the SQL CASE statement and the SQL HAVING clause!
Note: to get the most out of this article, I recommend not just reading it, but actually doing the coding part with me!
Before we start…
… I suggest going through these articles first – if you haven’t done so yet:
- Set up your own data server to practice: How to set up Python, SQL, R and Bash (for non-devs)
- Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
- SQL for Data Analysis ep1 (SQL basics)
- SQL for Data Analysis ep2 (SQL WHERE clause)
- SQL for Data Analysis ep3 (SQL functions and GROUP BY)
- SQL for Data Analysis ep4 (SQL best practices)
- SQL for Data Analysis ep5 (SQL JOIN)
Query in a query (aka. SQL subqueries)
There are some cases in which a simple SQL query isn’t enough to answer your question. Open your SQL Workbench and let’s start with an easy task (we will use our flight_delays data set again):
Select the average departure delay by tail numbers (or, in other words, by plane) from the table – and return the tail numbers (and only the tail numbers) of the planes that have the top 10 average delay times.
Note: Let’s try to solve it with the tools we have learned so far!
This is the solution:
SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10;
The result is:
Nice. Except that this is not the 100% correct answer for this question. The description was really specific that we only want to see the
tailnums in our results – and not the average departure time value. The problem is that you have to have the
AVG(depdelay) in your SELECT statement – since you are using it to find out the average departure delays of the planes. At this point you can’t solve the problem with your original query, so you have two choices:
- You can create a new table and put this result there. Then you can select the first column (tailnum) of that freshly created table… This is not the preferred way, because it’s a bit complicated – and it’s not too flexible either.
- A better solution is to handle this query as a table – but in reality, put it as a subquery into another query.
SELECT tailnum FROM (SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10) as my_original_query;
If it looks complicated let me simplify it:
SELECT tailnum FROM [your_first_query] as [some_name];
Pretty cool. We have just gotten the results and you have just learned that after the FROM clause you can always use subqueries instead of tables. And just to let you know, it doesn’t apply only to the FROM clause (e.g. see the next test yourself section, where we will do it with the WHERE clause.)
Note 1: While this solution is very handy, I have to admit that sometimes creating new tables is more efficient in terms of CPU-time. That would be a bit too advanced of a topic, so I won’t go deeper into it in this article.
Test yourself #1 – query in a query in a query…
If you get the concept, here’s an advanced SQL exercise to test yourself with:
- the top 10 destinations
- where the planes with the top 10 average departure delays (see previous example)
- showed up the most
Hint 1: this is a difficult query; first try to sketch and design on paper how it will work!
Hint 2: okay, it’s not that difficult, as you will be able to reuse the SQL query from the previous section.
3… 2… 1… Go!
Here’s my solution…
Note: I’m pretty sure there are other SQL queries that would deliver the same results and you are more than welcome to write them in the comment section.
SELECT COUNT(tailnum), dest FROM flight_delays WHERE tailnum IN (SELECT tailnum FROM (SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10) as my_original_query) GROUP BY dest ORDER BY count DESC LIMIT 10;
Oh, wow! A query in a query in a query… To help you to visualize:
See? Nothing new has happened here – only that we have combined sub-queries with other sub-queries.
SQL CASE – aka the if statement of SQL
Let’s say we want to see how many planes departed early, late or on time. With our current toolset we can do this by running 3 queries – one by one. E.g. the number of flights that departed right on time would look like this:
SELECT COUNT(*) FROM flight_delays WHERE depdelay = 0 LIMIT 10;
To have all the variations you can change the
= in the WHERE statement to
< and to
The result will be:
Departed on time:
Good news! You can also do this using a single (slightly more advanced) SQL query if you use the CASE SQL keyword. Try this:
SELECT COUNT(*), CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment FROM flight_delays GROUP BY segment LIMIT 10;
Let’s take a look at the new part:
CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment
If you are familiar with if statements, it’s not too hard to understand what’s happening here: it categorizes the flights based on the depdelay column into 3 categories.
Note: if it’s not clear to you at first sight, I recommend running the query without the GROUP BY statement, like this:
SELECT *, CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment FROM flight_delays LIMIT 10;
See? The trick is that the CASE WHEN statement is creating a new column at the end of the table.
To be honest I don’t use the SQL CASE statement too often during my daily job… but sometimes it’s very handy when I have to do some quick and dirty data cleaning/transformation (e.g. change all the values from ‘M’ to ‘Male’ and from ‘F’ to ‘Female’ in a table) or turn a continuous value into a categorical value (like we did in the above example).
SQL HAVING clause
You will need SQL HAVING for complex queries. The only good reason that the SQL HAVING clause exists is the fact that you can’t use SQL WHERE (LINK) with aggregate functions. But why is that? Here’s a short explanation!
Let’s say we want to see something simple from our table: how many times an airport shows up. You should go with this query:
SELECT COUNT(*) as nmbr, dest FROM flight_delays GROUP BY dest;
Now we want to see only those airports that show up more than 10,000 times (means that the COUNT(*) is greater than 10,000).
I think a junior SQL person would try this one.
NOTE: THIS QUERY WON’T WORK.
SELECT COUNT(*) as nmbr, dest FROM flight_delays GROUP BY dest WHERE nmbr > 10000;
It looks logical, but again: this query won’t work. Why? The answer is in my previous article (SQL best practices). But I am going to highlight it here as well. This is the order of the keywords your computer sees when processing your query:
As you can see, the WHERE clause comes before the GROUP BY. This means that any aggregation column created by your GROUP BY clause can’t be part of your WHERE filter. And that’s where the SQL HAVING clause comes into the picture. HAVING is the new WHERE – at least for the aggregated values in your SQL query. (Okay, this sounds very nerdy.) You have to change your previous wrong query to this:
SELECT COUNT(*) as nmbr, dest FROM flight_delays GROUP BY dest HAVING COUNT(*) > 10000;
The only unfortunate thing is that you can’t use your aliases in your HAVING statement, so you need to type the whole COUNT function again… But we can deal with that!
Nice!! This were some advanced SQL queries! And you have gone through them! Even better: you are done with my SQL for data analysis tutorial series! Congrats!
If you have read and worked through all 6 episodes, you can proudly say that you are familiar with SQL. If you want to be better, I’d like to ask you to do 3 things: practice, practice and practice!
I’ve created a unique, practice-oriented 6-week online course. Check it out here: The Junior Data Scientist’s First Month!
If you don’t want to miss my upcoming articles, video tutorials, webinars, etc., subscribe to my weekly Newsletter!