SQL for Data Analysis – Tutorial – ep6 (advanced stuff)

This is the final episode of the SQL for Data Analysis – Tutorial for Beginners series. Today I’ll show you some more advanced SQL moves! Though junior data analysts can live without them, I still recommend to learn and use them as they will bring your SQL game to the next level and make you more efficient at your daily job. (Or if you don’t have a job yet, they will help to impress the people in front of you on your data analyst job interview.) I’m going to show you: subqueries, the SQL CASE statement and the SQL HAVING clause!

Note: to get the most out of this article, you I recommend not just read, but actually do the coding part with me!

Before we start…

… I suggest to go through these articles first – if you haven’t done yet:

  1. Setup your own data server to practice: How to setup Python, SQL, R and Bash (for non-devs)
  2. Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
  3. SQL for Data Analysis ep1 (SQL basics)
  4. SQL for Data Analysis ep2 (SQL WHERE clause)
  5. SQL for Data Analysis ep3 (SQL functions and GROUP BY)
  6. SQL for Data Analysis ep4 (SQL best practices)
  7. SQL for Data Analysis ep5 (SQL JOIN)

Query in a query (aka. SQL subqueries)

There are some cases, when a simple SQL query isn’t enough to answer your question. Open your SQL Workbench and let’s start with an easy task (we will use our flight_delays data set again):

Select the average departure delay by tail numbers (or with other words by plane) from the table – and return the tail numbers (and only the tail numbers) of the planes, that have the top 10 average delay times.
Note: Let’s try to solve it with the tools we have learned so far!

This is the solution:

SELECT
  tailnum,
  AVG(depdelay) as avg_dd
FROM
  flight_delays
GROUP BY tailnum
ORDER BY avg_dd DESC
LIMIT 10;

The result is:

query in a query

Nice. Except, that this is not the 100% correct answer for this question. The description was really specific about that we only want to see the tailnums in our results – and not the average departure time value. The problem is that you have to have the AVG(depdelay) in your SELECT statement – as you are using it to find out the average departure delays of the planes. At this point you can’t solve the problem in your original query, so you have two choices:

  1. You can create a new table and put this result there. Then you can select the first column (tailnum) of that freshly created table… This is not the preferred way, because it’s a bit complicated – and it’s not too flexible either.
  2. A better solution is to handle this query as a table – but in reality put it as a subquery into another query.

Like this:

SELECT
  tailnum
FROM
  (SELECT
     tailnum,
     AVG(depdelay) as avg_dd
   FROM
     flight_delays
   GROUP BY tailnum
   ORDER BY avg_dd DESC
   LIMIT 10) as my_original_query;

sql subquery 1

If it looks complicated let me simplify it:

SELECT
  tailnum
FROM
  [your_first_query] as [some_name];

Pretty cool. We have just got the results and you have just learned that after the FROM clause you can always use subqueries instead of tables. And just to let you know, it doesn’t apply only for the FROM clause (eg. see the next test yourself section, where we will do it with the WHERE clause.)

Note 1: While this solution is very handy, I have to admit, that sometimes creating new tables is more efficient in terms of CPU-time. That would be a bit too advanced topic, so I won’t go deeper into it in this article.

Test yourself #1 – query in a query in a query…

If you get the concept, here’s an advanced SQL exercise to test yourself with:

Print:

  • the top 10 destinations
  • where the planes with the top 10 average departure delays (see previous example)
  • showed up the most

Hint 1: this is a difficult query, first try to sketch and design on paper, how it will work!
Hint 2: okay, it’s not that difficult, as you will be able to reuse the SQL query from the previous section.

3… 2… 1… Go!
.
.
.
Here’s my solution…
Note: I’m pretty sure there are other alternative SQL queries to get the same results and you are more than welcome to write them in the comment section.

SELECT COUNT(tailnum),
  dest
FROM flight_delays
WHERE tailnum IN
  (SELECT tailnum
   FROM
     (SELECT tailnum,
             AVG(depdelay) as avg_dd
      FROM flight_delays
      GROUP BY tailnum
      ORDER BY avg_dd DESC
      LIMIT 10) as my_original_query)
GROUP BY dest
ORDER BY count DESC
LIMIT 10;

Oh, wow! A query in a query in a query… To help you to visualize:

sql subquery 2

See? Nothing new has happened here – only that we have combined sub-queries with other sub-queries.

SQL CASE – aka. the if statement of SQL

Let’s say, we want to see how many planes departed early, late or on time. With our current toolset we can do this by running 3 queries – one by one. Eg. the number of flights, that departed right on time would look like this:

SELECT COUNT(*)
FROM flight_delays
WHERE depdelay = 0
LIMIT 10;

SQL CASE WHEN 1

To have all the variations you can change the = in the WHERE statement to < and to >.
The result will be:
Departed early: 3515594
Departed on time: 624472
Departed late: 3135222

Good news! You can do this by a single  (bit more advanced) SQL query if you use the CASE SQL keyword. Try this:

SELECT COUNT(*),
  CASE WHEN depdelay < 0 THEN 'early'
       WHEN depdelay > 0 THEN 'late'
       ELSE 'ontime'
  END as segment
FROM flight_delays
GROUP BY segment
LIMIT 10;

SQL CASE WHEN 2

Nice, huh?
Let’s take a look at the new part:

CASE WHEN depdelay < 0 THEN 'early'
WHEN depdelay > 0 THEN 'late'
ELSE 'ontime'
END as segment

If you are familiar with if statements, it’s not too hard to understand, what’s happening here: it categorizes the flights based on the depdelay column into 3 categories.

Note: if it’s not clear to you for the first sight, I recommend to run the query without the GROUP BY statement, like this:

 SELECT *,
 CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late'
 ELSE 'ontime'
 END as segment
 FROM flight_delays
 LIMIT 10;

SQL CASE WHEN 3

See? The trick is, that the CASE WHEN statement is creating a new column at the end of the table.

To be honest I don’t use the SQL CASE statement too often during my daily job… but sometimes it’s very handy, when I have to do some quick and dirty data cleaning/transformation (eg. change all the values from ‘M’ to ‘Male’ and from ‘F’ to ‘Female’ in a table) or turn a continuous value into a categorical value (like we did above in the example).

SQL HAVING clause

You will need SQL HAVING in the complex queries. The only good reason, why SQL HAVING clause exists is the fact that, you can’t use SQL WHERE (LINK) with aggregate functions. But why is that? Here’s a short explanation!

Let’s say we want to see something simple from our table: how many times an airport shows up. You should go with this query:

SELECT COUNT(*) as nmbr,
       dest
FROM flight_delays
GROUP BY dest;

Good.
Now we want to see only those airports, that are showing up more than 10.000 times (means that the COUNT(*) is greater than 10.000).
I think a junior SQL person would try this one.
NOTE: THIS QUERY WON’T WORK.

SELECT COUNT(*) as nmbr,
       dest
FROM flight_delays
GROUP BY dest
WHERE nmbr > 10000;

SQL HAVING 1

It looks logical, but again: this query won’t work. Why? The answer is in my previous article (SQL best practices). But I am going to highlight it here as well. This is the order of the keywords by how your computer is processing your query:

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. SELECT
  7. DISTINCT
  8. ORDER BY
  9. LIMIT

As you can see, the WHERE clause comes before the GROUP BY. It means, that any aggregation column created by your GROUP BY clause can’t be the part of your WHERE filter. And that’s where SQL HAVING clause comes into the picture. HAVING is the new WHERE – at least for the aggregated values in your SQL query. (Okay, this sounds very nerdy.) You have to change your previous wrong query to this:

SELECT COUNT(*) as nmbr,
dest
FROM flight_delays
GROUP BY dest
HAVING COUNT(*) > 10000;

SQL HAVING 2

The only unfortunate thing is that you can’t use you aliases in your HAVING statement, so you need to type the whole COUNT function again… But let’s deal with that!

Conclusion

Nice!! These were some advanced SQL stuff! And you have gone through it! Even better! You are done with my SQL for data analysis tutorial series! Congrats!
If you have read and done all the 6 episodes, you can proudly say, that you are familiar with SQL. If you want to be better, I’d like to ask you to do 3 things: practice, practice and practice!
I’ll soon provide a 30-days SQL challenge to do so on my Youtube Channel!

If you don’t want to miss my upcoming articles, video tutorials, webinars, etc. subscribe to my weekly Newsletter!

Cheers,
Tomi Mester

← Previous post

Next post →

2 Comments

  1. Great tutorials, Tomi! Thanks to you I was able to get up and running in no time and do a quick data analysis on new property sales for my wife. I ran the analysis on my Digital Ocean server using the bash functions (cat, cut, etc.) from your earlier tutorials, copied the data into my database, and then ran a query to get the addresses of new home sales in particular zip codes and within certain market values.

    Your tutorials are awesome, thank you!

    -Adam

    • hey Adam,

      I’m really glad that my articles are actually useful! 😉
      In fact, that gives me a lot of motivation to keep them coming – so thanks a lot for the feedback! I appreciate it!

      Cheers,
      Tomi

Leave a Reply