This is the final episode of the SQL for Data Analysis – Tutorial for Beginners series. Today I’ll show you some more advanced SQL moves! Though junior data analysts can live without them, I still recommend to learn and use them as they will bring your SQL game to the next level and make you more efficient at your daily job. (Or if you don’t have a job yet, they will help to impress the people in front of you on your data analyst job interview.) I’m going to show you: subqueries, the SQL CASE statement and the SQL HAVING clause!
Note: to get the most out of this article, you I recommend not just read, but actually do the coding part with me!
Before we start…
… I suggest to go through these articles first – if you haven’t done yet:
- Setup your own data server to practice: How to setup Python, SQL, R and Bash (for non-devs)
- Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
- SQL for Data Analysis ep1 (SQL basics)
- SQL for Data Analysis ep2 (SQL WHERE clause)
- SQL for Data Analysis ep3 (SQL functions and GROUP BY)
- SQL for Data Analysis ep4 (SQL best practices)
- SQL for Data Analysis ep5 (SQL JOIN)
Query in a query (aka. SQL subqueries)
There are some cases, when a simple SQL query isn’t enough to answer your question. Open your SQL Workbench and let’s start with an easy task (we will use our flight_delays data set again):
Select the average departure delay by tail numbers (or with other words by plane) from the table – and return the tail numbers (and only the tail numbers) of the planes, that have the top 10 average delay times.
Note: Let’s try to solve it with the tools we have learned so far!
This is the solution:
SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10;
The result is:
Nice. Except, that this is not the 100% correct answer for this question. The description was really specific about that we only want to see the
tailnums in our results – and not the average departure time value. The problem is that you have to have the
AVG(depdelay) in your SELECT statement – as you are using it to find out the average departure delays of the planes. At this point you can’t solve the problem in your original query, so you have two choices:
- You can create a new table and put this result there. Then you can select the first column (tailnum) of that freshly created table… This is not the preferred way, because it’s a bit complicated – and it’s not too flexible either.
- A better solution is to handle this query as a table – but in reality put it as a subquery into another query.
SELECT tailnum FROM (SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10) as my_original_query;
If it looks complicated let me simplify it:
SELECT tailnum FROM [your_first_query] as [some_name];
Pretty cool. We have just got the results and you have just learned that after the FROM clause you can always use subqueries instead of tables. And just to let you know, it doesn’t apply only for the FROM clause (eg. see the next test yourself section, where we will do it with the WHERE clause.)
Note 1: While this solution is very handy, I have to admit, that sometimes creating new tables is more efficient in terms of CPU-time. That would be a bit too advanced topic, so I won’t go deeper into it in this article.
Test yourself #1 – query in a query in a query…
If you get the concept, here’s an advanced SQL exercise to test yourself with:
- the top 10 destinations
- where the planes with the top 10 average departure delays (see previous example)
- showed up the most
Hint 1: this is a difficult query, first try to sketch and design on paper, how it will work!
Hint 2: okay, it’s not that difficult, as you will be able to reuse the SQL query from the previous section.
3… 2… 1… Go!
Here’s my solution…
Note: I’m pretty sure there are other alternative SQL queries to get the same results and you are more than welcome to write them in the comment section.
SELECT COUNT(tailnum), dest FROM flight_delays WHERE tailnum IN (SELECT tailnum FROM (SELECT tailnum, AVG(depdelay) as avg_dd FROM flight_delays GROUP BY tailnum ORDER BY avg_dd DESC LIMIT 10) as my_original_query) GROUP BY dest ORDER BY count DESC LIMIT 10;
Oh, wow! A query in a query in a query… To help you to visualize:
See? Nothing new has happened here – only that we have combined sub-queries with other sub-queries.
SQL CASE – aka. the if statement of SQL
Let’s say, we want to see how many planes departed early, late or on time. With our current toolset we can do this by running 3 queries – one by one. Eg. the number of flights, that departed right on time would look like this:
SELECT COUNT(*) FROM flight_delays WHERE depdelay = 0 LIMIT 10;
To have all the variations you can change the
= in the WHERE statement to
< and to
The result will be:
Departed on time:
Good news! You can do this by a single (bit more advanced) SQL query if you use the CASE SQL keyword. Try this:
SELECT COUNT(*), CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment FROM flight_delays GROUP BY segment LIMIT 10;
Let’s take a look at the new part:
CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment
If you are familiar with if statements, it’s not too hard to understand, what’s happening here: it categorizes the flights based on the depdelay column into 3 categories.
Note: if it’s not clear to you for the first sight, I recommend to run the query without the GROUP BY statement, like this:
SELECT *, CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late' ELSE 'ontime' END as segment FROM flight_delays LIMIT 10;
See? The trick is, that the CASE WHEN statement is creating a new column at the end of the table.
To be honest I don’t use the SQL CASE statement too often during my daily job… but sometimes it’s very handy, when I have to do some quick and dirty data cleaning/transformation (eg. change all the values from ‘M’ to ‘Male’ and from ‘F’ to ‘Female’ in a table) or turn a continuous value into a categorical value (like we did above in the example).
SQL HAVING clause
You will need SQL HAVING in the complex queries. The only good reason, why SQL HAVING clause exists is the fact that, you can’t use SQL WHERE (LINK) with aggregate functions. But why is that? Here’s a short explanation!
Let’s say we want to see something simple from our table: how many times an airport shows up. You should go with this query:
SELECT COUNT(*) as nmbr, dest FROM flight_delays GROUP BY dest;
Now we want to see only those airports, that are showing up more than 10.000 times (means that the COUNT(*) is greater than 10.000).
I think a junior SQL person would try this one.
NOTE: THIS QUERY WON’T WORK.
SELECT COUNT(*) as nmbr, dest FROM flight_delays GROUP BY dest WHERE nmbr > 10000;
It looks logical, but again: this query won’t work. Why? The answer is in my previous article (SQL best practices). But I am going to highlight it here as well. This is the order of the keywords by how your computer is processing your query:
As you can see, the WHERE clause comes before the GROUP BY. It means, that any aggregation column created by your GROUP BY clause can’t be the part of your WHERE filter. And that’s where SQL HAVING clause comes into the picture. HAVING is the new WHERE – at least for the aggregated values in your SQL query. (Okay, this sounds very nerdy.) You have to change your previous wrong query to this:
SELECT COUNT(*) as nmbr,
GROUP BY dest
HAVING COUNT(*) > 10000;
The only unfortunate thing is that you can’t use you aliases in your HAVING statement, so you need to type the whole COUNT function again… But let’s deal with that!
Nice!! These were some advanced SQL stuff! And you have gone through it! Even better! You are done with my SQL for data analysis tutorial series! Congrats!
If you have read and done all the 6 episodes, you can proudly say, that you are familiar with SQL. If you want to be better, I’d like to ask you to do 3 things: practice, practice and practice!
I’ll soon provide a 30-days SQL challenge to do so on my Youtube Channel!
If you don’t want to miss my upcoming articles, video tutorials, webinars, etc. subscribe to my weekly Newsletter!