Combining tables is a key component, when you are doing data analysis. And SQL is really good at it! Of course each case is different, but I’ve run into analytical tasks too many times, where joining two (very big) data tables took around 20-30 minutes in Python and bash – and ~10-20 seconds in SQL. I’m not saying, I couldn’t have done the task in Python or bash at all… But for sure SQL JOIN was the easiest solution!
So let’s learn how to use SQL JOIN to top up your analytics projects!

But before we start…

… I highly recommend to go through these articles first – if you haven’t done yet:

  1. Setup your own data server to practice: How to install Python, SQL, R and Bash (for non-devs)
  2. Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
  3. SQL for Data Analysis ep1 (SQL basics)
  4. SQL for Data Analysis ep2 (SQL WHERE clause)
  5. SQL for Data Analysis ep3 (SQL functions and GROUP BY)
  6. SQL for Data Analysis ep4 (SQL best practices)

What is SQL JOIN?

Okay, back to SQL JOIN!
What does it mean joining two tables? Here’s a simple example.

Let’s say we have these two datasets:

Table 1

invention inventor
Rubik’s Cube Erno Rubik
Carburator Janos Csonka
Carburator Donat Banki
Ballpen Laszlo Biro
Dynamo Anyos Jedlik

 

Table 2

inventor profession
Erno Rubik architect
Janos Csonka engineer
Donat Banki engineer
Laszlo Biro journalist
Anyos Jedlik engineer

 

In table 1 we have inventors and inventions and in table 2 we have the same inventors and their original professions. Let’s say, we want to see which inventions was invented by an investor with what original profession. To see that, we have to merge the two tables, based on the column, that shows up in both: inventor.

This is what SQL JOIN is good for. After joining the two tables, this is what we will get:

invention inventor profession
Rubik’s Cube Erno Rubik architect
Carburator Janos Csonka engineer
Carburator Donat Banki engineer
Ballpen Laszlo Biro journalist
Dynamo Anyos Jedlik engineer

 

Now we can finally see, that Rubik’s Cube was invented by an architect!
Extra (not SQL related) task: find out, what’s the common in the 5 inventions I’ve listed!

Perfect, but what does this look like in practice?

Get some data

To put SQL JOIN in practice, we will get some data first.

Open up SQL Workbench! We are gonna create two new temporary data tables: playlist and toplist. Run these two queries in your SQL Workbench one by one.

CREATE TABLE playlist (
  artist VARCHAR,
  song VARCHAR);
CREATE TABLE toplist (
  tophit VARCHAR,
  play INT);

Your tables have been created. Load some data into them! Run these two queries one by one in your SQL Workbench:

INSERT INTO playlist (artist,song) VALUES
  ('ABBA','Dancing Queen'),
  ('ABBA','Gimme!'),
  ('ABBA','The Winner Takes It All'),
  ('ABBA','Mamma Mia'),
  ('ABBA','Take a Chance On Me'),
  ('Tove Lo','Cool Girl'),
  ('Tove Lo','Stay High'),
  ('Tove Lo','Talking Body'),
  ('Tove Lo','Habits'),
  ('Tove Lo','True Disaster'),
  ('Avicii','Wake Me Up'),
  ('Avicii','Waiting For Love'),
  ('Avicii','The Nights'),
  ('Avicii','Hey Brother'),
  ('Avicii','Levels'),
  ('Zara Larsson','Lush Life');
INSERT INTO toplist (tophit,play) VALUES
  ('Dancing Queen',95145796),
  ('Gimme!',32785696),
  ('The Winner Takes It All',34458597),
  ('Mamma Mia',47901900),
  ('Take a Chance On Me',30654536),
  ('Cool Girl',227055115),
  ('Stay High',263901766),
  ('Talking Body',272334711),
  ('Habits',214685822),
  ('True Disaster',27028538),
  ('Wake Me Up',520259542),
  ('Waiting For Love',399906192),
  ('The Nights',278063930),
  ('Hey Brother',321270703),
  ('Levels',206004691),
  ('Despacito',519689490);

Note: This is actually real data, that I’ve pulled from Spotify, when I wrote this article.

Let’s see if everything works! SELECT the full tables:

SELECT
  artist,
  song
FROM
  playlist;

playlist demo

SELECT
  tophit,
  play
FROM
  toplist;

toplist demo

Note: Why didn’t I used SELECT *? Let’s find it out in my SQL best practices article!

Ah, cool, some nice songs from nice artists, that people have played a lot already. Can you find out what’s common in the listed songs and artists? (Actually there is one exception – for a reason…)

An important note: when you close SQL Workbench, these temporary data tables will disappear. If you do so, you have to CREATE them again and then INSERT data into them!

SQL JOIN – basics

Otherwise, let’s just perform our first SQL JOIN!

SELECT *
FROM toplist
JOIN playlist
ON tophit = song;

SQL JOIN simple way

First of all, take a look at the results!
We have four columns, two plus two from the two tables. The values in the song and tophit columns are the same. In fact, that was the column that we have joined on, so this is not a big surprise. But hey, we have just merged two tables!

Let’s see, what has happened code-wise:

SELECT * –» We want to select every column…
FROM toplist –» …from the toplist data table…
JOIN playlist –» …but also we want to merge playlist table to the toplist table…
ON tophit = song; –» …and we want to connect those lines, where the value of the tophit column is matching with the value of the song column.

Not that complicated. Yet!
Now there is one important question here…
If you haven’t discovered it yet: there is one song (“Zara Larsson – Lush Life”) that exists in the playlist table, but not in the toplist table. And there is another one (“Despacito, 519689490”) that exists in the toplist table, but not in the playlist table. I did this on purpose, because I wanted to demonstrate, what happens, when you have to deal with partially missing data, which is actually quite often in real data projects.

But what happens with these songs after our SQL JOIN clause?
When you use this simplest way of SQL JOIN, your query will manage your data like this:

SQL JOIN

As you can see on the Venn-diagram: this basic type of SQL JOIN keeps only the data, that shows up in both tables. For that Despacito and Zara Larsson were removed.

Note: this method is called sometimes “INNER JOIN”.

SQL JOIN – same query, better syntax

Remember, that in the SQL best practices article I suggested not to use * in a SELECT statement, but the columns’ name instead. And it’s even more important, when you use SQL JOIN! So instead of this query…

SELECT *
FROM toplist
JOIN playlist
ON tophit = song;

…I rather recommend to use this one:

SELECT
  tophit,
   play,
   artist,
   song
FROM toplist
JOIN playlist
ON tophit = song;

And this is still not the best solution. The problem is, that when you look at this code, you can’t see instantly, what columns belongs to which table. Thus I like to add the table names also to the column names. It’s really simple. Eg. if the tophit column is in the toplist table, then instead of tophit, I’ll write: toplist.tophit.
If you apply this to the above query:

SELECT
  toplist.tophit,
  toplist.play,
  playlist.artist,
  playlist.song
FROM toplist
JOIN playlist
ON toplist.tophit = playlist.song;

SQL JOIN formatting

Much better. But one last small tweak! The playlist.song and the toplist.tophit columns are actually the same. We don’t need both of them… so remove one:

SELECT
  toplist.tophit,
  toplist.play,
  playlist.artist
FROM toplist
JOIN playlist
ON toplist.tophit = playlist.song;

SQL JOIN formatting 2

Now this is finally a pretty decent SQL JOIN.

FULL JOIN

But you might ask: ”What should we do to keep Despacito and Zara Larsson in the data set, even though they don’t show up in both data table?” And I’d answer: great question! To continue with the visualisation, this is what we want to achieve:

SQL FULL JOIN

This SQL JOIN is called FULL JOIN and to execute it, you should change only one tiny thing in our previous query: add FULL before the JOIN clause.

SELECT
  toplist.tophit,
  toplist.play,
  playlist.artist,
  playlist.song
FROM toplist
FULL JOIN playlist
ON toplist.tophit = playlist.song;

Boom! The magic has happened!

SQL FULL JOIN example

As you can see, Zara Larsson and Despacito are there, but on the joint fields they don’t have data, hence those fields stays empty… These empty fields are called NULLs in SQL. I have already mentioned “NULL” and it’s importance – but I’ll get back to that with more details later!

However now you know: just the fact, that a given value doesn’t exist in both data tables you want to join, doesn’t mean that you can’t join every line. You can, but if you do so, the missing values will stay empty.

LEFT JOIN AND RIGHT JOIN

And this brings us right to the next question. What if we want to apply the FULL JOIN thinking only on one of the tables?

Back to the visualisation! Either this…

SQL LEFT JOIN

… or this:

SQL RIGHT JOIN

Before I reveal the (by the way very simple) solution, I’d like to emphasize, that this problem occurs quite often in real data projects too.

Eg. let’s say, you are running an A/B test, where you have 2 data tables.

Table 1

user bucket
user1 A
user2 B
user3 A
user4 B

 

Table 2

user feature usage timestamp
user1 19:00:00.000
user2 19:32:11.000
user4 19:44:54.000
user5 19:48:23.000
user1 19:59:01.000
user5 20:01:10.000
user2 20:04:32.000
user4 20:09:54.000
user1 20:12:32.000

 

You want to include only 80% of your audience (user1, user2, user3, user4) in your A/B test, and you put these users into a table with the info about which bucket they belong to.

And you have another table, that works as a feature log (see more here: data collection) and it collects the feature usage for all users (user1, user2, user3, user4 and even user5).

To evaluate your A/B test, you have to combine the two tables. And you want to keep all users from the first table (even if they didn’t use the feature, ergo didn’t show up in the other table at all), but you don’t want to keep the users, who showed up only in the second table (used the feature, but not part of the A/B test).

What do you do?
This:

SQL LEFT JOIN 2

This is called LEFT JOIN. If you want to perform a LEFT JOIN, you simply have to add a LEFT to your JOIN clause.

Getting back to our playlist+toplist data sets, try something like this:

SELECT
  toplist.tophit,
  toplist.play,
  playlist.artist
FROM toplist
LEFT JOIN playlist
ON toplist.tophit = playlist.song;

SQL LEFT JOIN example 2

It keeps every line from the toplist table even if it doesn’t exist in the playlist table. But it keeps only those lines from the playlist table, that exists in the toplist table too. Good!

If you want to execute, the opposite, you should do a RIGHT JOIN instead:

SELECT
  toplist.tophit,
  toplist.play,
  playlist.artist
FROM toplist
RIGHT JOIN playlist
ON toplist.tophit = playlist.song;

Awesome!

Test yourself #1

You now everything, you have to know at this point about SQL JOIN! But the best way of learning is practicing! Here’s an assignment, that you should perform on our playlist and toplist tables:

Given the information in the playlist and toplist data tables:
How many plays each artist have in total?
.
.
.
Here’s my solution:

SELECT
  playlist.artist,
  SUM(toplist.play)
FROM playlist
FULL JOIN toplist
  ON playlist.song = toplist.tophit
GROUP BY artist;

SQL test yourself 1

And a little explanation:
SELECT –» We select…
playlist.artist, –» the artists from the playlist table…
SUM(toplist.play) –» and we summarize the number of plays from the toplist table…
FROM playlist –» we actually specify that we will use the playlist table…
FULL JOIN toplist –» and we also want to merge the toplist table (using the empty fields too)…
ON playlist.song = toplist.tophit –» the join matches those lines where the song and tophit columns have the same data…
GROUP BY artist; –» and GROUP BY this refers to the SUM function, above – we want to see the summaries by artists.

Test yourself #2

One more test, before I let you go!
Print the top 5 ABBA songs ordered by number of plays!
.
.
.
And the solution is:

SELECT
  playlist.artist,
  playlist.song,
  toplist.play
FROM playlist
FULL JOIN toplist
  ON playlist.song = toplist.tophit
WHERE playlist.artist = 'ABBA'
ORDER BY toplist.play DESC;

SQL test yourself 2

Well, nothing new here. 😉 But if you need explanation, just let me know in the comment section and I’ll give you one!

Conclusion

SQL JOIN is really important and you will use it quite often as a data analyst/scientist. This article gave you a solid base knowledge about it. Further down the road you will meet even more advanced applications of it, but using what you have learned from this article – combined what you have learned with the previous ones – will cover you most of the cases.

And there is only one more article left from my SQL for Data Analysis – Tutorial for Beginners series. In that I’ll introduce some advanced methods: the most interesting one will be the query-in-a-query, but I’ll also show you two more exciting clauses: HAVING and CASE!

If you don’t want to miss it – and also would like to get info about my upcoming articles, video tutorials, webinars, etc. subscribe to my weekly Newsletter!

Cheers,
Tomi Mester