SQL for Data Analysis – Tutorial – ep6 (advanced stuff)

This is the final episode of the SQL for Data Analysis – Tutorial for Beginners series. Today I’ll show you some more advanced SQL moves! Though junior data analysts can live without them, I still recommend that you learn and use them as they will bring your SQL game to the next level and make you more efficient at your daily job. (Or if you don’t have a job yet, they will help impress the people in front of you at your data analyst job interview.) I’m going to show you: subqueries, the SQL CASE statement and the SQL HAVING clause!

Note: to get the most out of this article, I recommend not just reading it, but actually doing the coding part with me!

Before we start…

… I suggest going through these articles first – if you haven’t done so yet:

  1. Set up your own data server to practice: How to set up Python, SQL, R and Bash (for non-devs)
  2. Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
  3. SQL for Data Analysis ep1 (SQL basics)
  4. SQL for Data Analysis ep2 (SQL WHERE clause)
  5. SQL for Data Analysis ep3 (SQL functions and GROUP BY)
  6. SQL for Data Analysis ep4 (SQL best practices)
  7. SQL for Data Analysis ep5 (SQL JOIN)

Query in a query (aka. SQL subqueries)

There are some cases in which a simple SQL query isn’t enough to answer your question. Open your SQL Workbench and let’s start with an easy task (we will use our flight_delays data set again):

Select the average departure delay by tail numbers (or, in other words, by plane) from the table – and return the tail numbers (and only the tail numbers) of the planes that have the top 10 average delay times.
Note: Let’s try to solve it with the tools we have learned so far!

This is the solution:

SELECT
  tailnum,
  AVG(depdelay) as avg_dd
FROM
  flight_delays
GROUP BY tailnum
ORDER BY avg_dd DESC
LIMIT 10;

The result is:

query in a query

Nice. Except that this is not the 100% correct answer for this question. The description was really specific that we only want to see the tailnums in our results – and not the average departure time value. The problem is that you have to have the AVG(depdelay) in your SELECT statement – since you are using it to find out the average departure delays of the planes. At this point you can’t solve the problem with your original query, so you have two choices:

  1. You can create a new table and put this result there. Then you can select the first column (tailnum) of that freshly created table… This is not the preferred way, because it’s a bit complicated – and it’s not too flexible either.
  2. A better solution is to handle this query as a table – but in reality, put it as a subquery into another query.

Like this:

SELECT
  tailnum
FROM
  (SELECT
     tailnum,
     AVG(depdelay) as avg_dd
   FROM
     flight_delays
   GROUP BY tailnum
   ORDER BY avg_dd DESC
   LIMIT 10) as my_original_query;

sql subquery 1

If it looks complicated let me simplify it:

SELECT
  tailnum
FROM
  [your_first_query] as [some_name];

Pretty cool. We have just gotten the results and you have just learned that after the FROM clause you can always use subqueries instead of tables. And just to let you know, it doesn’t apply only to the FROM clause (e.g. see the next test yourself section, where we will do it with the WHERE clause.)

Note 1: While this solution is very handy, I have to admit that sometimes creating new tables is more efficient in terms of CPU-time. That would be a bit too advanced of a topic, so I won’t go deeper into it in this article.

Test yourself #1 – query in a query in a query…

If you get the concept, here’s an advanced SQL exercise to test yourself with:

Print:

  • the top 10 destinations
  • where the planes with the top 10 average departure delays (see previous example)
  • showed up the most

Hint 1: this is a difficult query; first try to sketch and design on paper how it will work!
Hint 2: okay, it’s not that difficult, as you will be able to reuse the SQL query from the previous section.

3… 2… 1… Go!
.
.
.
Here’s my solution…
Note: I’m pretty sure there are other SQL queries that would deliver the same results and you are more than welcome to write them in the comment section.

SELECT COUNT(tailnum),
  dest
FROM flight_delays
WHERE tailnum IN
  (SELECT tailnum
   FROM
     (SELECT tailnum,
             AVG(depdelay) as avg_dd
      FROM flight_delays
      GROUP BY tailnum
      ORDER BY avg_dd DESC
      LIMIT 10) as my_original_query)
GROUP BY dest
ORDER BY count DESC
LIMIT 10;

Oh, wow! A query in a query in a query… To help you to visualize:

sql subquery 2

See? Nothing new has happened here – only that we have combined sub-queries with other sub-queries.

SQL CASE – aka the if statement of SQL

Let’s say we want to see how many planes departed early, late or on time. With our current toolset we can do this by running 3 queries – one by one. E.g. the number of flights that departed right on time would look like this:

SELECT COUNT(*)
FROM flight_delays
WHERE depdelay = 0
LIMIT 10;

SQL CASE WHEN 1

To have all the variations you can change the = in the WHERE statement to < and to >.
The result will be:
Departed early: 3515594
Departed on time: 624472
Departed late: 3135222

Good news! You can also do this using a single (slightly more advanced) SQL query if you use the CASE SQL keyword. Try this:

SELECT COUNT(*),
  CASE WHEN depdelay < 0 THEN 'early'
       WHEN depdelay > 0 THEN 'late'
       ELSE 'ontime'
  END as segment
FROM flight_delays
GROUP BY segment
LIMIT 10;

SQL CASE WHEN 2

Nice, huh?
Let’s take a look at the new part:

CASE WHEN depdelay < 0 THEN 'early'
WHEN depdelay > 0 THEN 'late'
ELSE 'ontime'
END as segment

If you are familiar with if statements, it’s not too hard to understand what’s happening here: it categorizes the flights based on the depdelay column into 3 categories.

Note: if it’s not clear to you at first sight, I recommend running the query without the GROUP BY statement, like this:

 SELECT *,
 CASE WHEN depdelay < 0 THEN 'early' WHEN depdelay > 0 THEN 'late'
 ELSE 'ontime'
 END as segment
 FROM flight_delays
 LIMIT 10;

SQL CASE WHEN 3

See? The trick is that the CASE WHEN statement is creating a new column at the end of the table.

To be honest I don’t use the SQL CASE statement too often during my daily job… but sometimes it’s very handy when I have to do some quick and dirty data cleaning/transformation (e.g. change all the values from ‘M’ to ‘Male’ and from ‘F’ to ‘Female’ in a table) or turn a continuous value into a categorical value (like we did in the above example).

SQL HAVING clause

You will need SQL HAVING for complex queries. The only good reason that the SQL HAVING clause exists is the fact that you can’t use SQL WHERE (LINK) with aggregate functions. But why is that? Here’s a short explanation!

Let’s say we want to see something simple from our table: how many times an airport shows up. You should go with this query:

SELECT COUNT(*) as nmbr,
       dest
FROM flight_delays
GROUP BY dest;

Good.
Now we want to see only those airports that show up more than 10,000 times (means that the COUNT(*) is greater than 10,000).
I think a junior SQL person would try this one.
NOTE: THIS QUERY WON’T WORK.

SELECT COUNT(*) as nmbr,
       dest
FROM flight_delays
GROUP BY dest
WHERE nmbr > 10000;

SQL HAVING 1

It looks logical, but again: this query won’t work. Why? The answer is in my previous article (SQL best practices). But I am going to highlight it here as well. This is the order of the keywords your computer sees when processing your query:

  1. FROM
  2. ON
  3. JOIN
  4. WHERE
  5. GROUP BY
  6. SELECT
  7. DISTINCT
  8. ORDER BY
  9. LIMIT

As you can see, the WHERE clause comes before the GROUP BY. This means that any aggregation column created by your GROUP BY clause can’t be part of your WHERE filter. And that’s where the SQL HAVING clause comes into the picture. HAVING is the new WHERE – at least for the aggregated values in your SQL query. (Okay, this sounds very nerdy.) You have to change your previous wrong query to this:

SELECT COUNT(*) as nmbr,
       dest
FROM flight_delays
GROUP BY dest
HAVING COUNT(*) > 10000;

SQL HAVING 2

The only unfortunate thing is that you can’t use your aliases in your HAVING statement, so you need to type the whole COUNT function again… But we can deal with that!

Conclusion

Nice!! This were some advanced SQL queries! And you have gone through them! Even better: you are done with my SQL for data analysis tutorial series! Congrats!
If you have read and worked through all 6 episodes, you can proudly say that you are familiar with SQL. If you want to be better, I’d like to ask you to do 3 things: practice, practice and practice!

I’ve created a unique, practice-oriented 6-week online course. Check it out here: The Junior Data Scientist’s First Month!

If you don’t want to miss my upcoming articles, video tutorials, webinars, etc., subscribe to my weekly Newsletter!

Cheers,
Tomi Mester

← Previous post

Next post →

18 Comments

  1. Great tutorials, Tomi! Thanks to you I was able to get up and running in no time and do a quick data analysis on new property sales for my wife. I ran the analysis on my Digital Ocean server using the bash functions (cat, cut, etc.) from your earlier tutorials, copied the data into my database, and then ran a query to get the addresses of new home sales in particular zip codes and within certain market values.

    Your tutorials are awesome, thank you!

    -Adam

    • hey Adam,

      I’m really glad that my articles are actually useful! 😉
      In fact, that gives me a lot of motivation to keep them coming – so thanks a lot for the feedback! I appreciate it!

      Cheers,
      Tomi

  2. plingpling

    Thank you a lot for your tutorials. They are very usefully!

  3. Hi Tomi,

    I was looking to refresh on my SQL skills and this 6-episode-tutorial was exactly what I needed. I feel confident now to go on and start practicing on my own.

    Thank you,
    Tomi G

  4. Erin Petry

    very helpful tutorial – thank you!

    Although was tricky to do the FULL JOINS since i was using MySQL and there is no FULL / OUTER join ability with this RDBMS (so had to use a sub query and Left Join + UNION + Right Join)

  5. Select the average departure delay by tail numbers (or with other words by plane) from the table – and return the tail numbers (and only the tail numbers) of the planes, that have the top 10 average delay time

    SELECT tailnum
    FROM flight_delays
    GROUP By tailnum
    ORDER By AVG(depdelay) desc
    LIMIT 10;

    Worked as well. No need to introduce another sub query.

    • hi Nisha,

      thanks for the comment! You actually have right! 🙂
      Although, I’ll leave there the example for demonstrating subqueries.

      Cheers,
      Tomi

  6. I can’t find 30-days SQL challenge. Is it up yet?

  7. Hi Tomi,

    Very good posts and website that I have learned lots of knowledge of Data Analytics, thanks.

    Just wonder how this following code “Group by segment” works

    SELECT COUNT(*),
    CASE WHEN depdelay 0 THEN ‘late’
    ELSE ‘ontime’
    END as segment
    FROM flight_delays
    GROUP BY segment
    LIMIT 10;

    because Group by comes before Select, when SQL executes group by segment, SQL did not recognize “segment” which comes from Select clause.

    I don’t know if my understanding is correct, but you still got results.

    • hi Xiao,

      thanks for the question! To be honest, I don’t know the exact answer and I couldn’t find anything online that provides a definite answer for your question… Although, my best guess is that “CASE” executed – independently from the SELECT statement – before the GROUP BY statement.
      But if anyone finds a better answer, don’t hesitate to comment!

      Tomi

      • Hi Tomi,

        thanks for your reply.

        I just tested case statement + Group by , seems the combination does not work, getting error of “Invalid column name”. I am just curious about when you run your query, did you meet any error or simply got the output?
        PS. I used MS SQL.

        Thanks,
        Xiao

        • hi Xiao,
          hmm. tricky – TBH, I’m not really familiar with MS SQL, sorry… but it’s like 99.9% sure that issue comes from the difference between MS SQL and PostgreSQL.
          Tomi

  8. William

    Thank you Tomi for this awesome tutorial. I learned SQL from scratch and this tutorial is super helpful. Thank you very much!!!

Leave a Reply

The best way of learning Data Science is practicing.
Check out my 6-week online data science challenge:

“The Junior Data Scientist’s First Month” video course

(Early bird discount until this Friday!)