SQL for Data Analysis – Tutorial for Beginners – ep2

Last updated on November 02, 2020

This is the second episode of my SQL for Data Analysis (for beginners) series, and today I’ll show you every tiny little detail of the SQL WHERE clause. It’s not by accident that I’ve dedicated a whole article to this topic; the WHERE clause is essential if you want to select the right bit of your data from your data table!

In the first half of this article I’ll show you the different operators. In the second half, we will finally import our favorite 7.000.000+ rows dataset (the one with the airplane delays). And at the end of the tutorial, I’ll give you a few assignments to test your SQL knowledge and practice a bit!

If you are new here, let’s start with these articles first:

  1. How to install Python, SQL, R and Bash (for non-devs)
  2. SQL for Data Analysis – Tutorial for Beginners – ep1
  3. How to install SQL Workbench for postgreSQL

Make sure your SQL data environment works!

If you have already done the above mentioned articles, you most probably have these in place… but just in case, please double-check if you have these three things:

  1. Your fully-functioning data server (from this article)
  2. SQL Workbench with an established connection to your data server (from this article)
  3. The zoo data set (from this article)

Check-check-check? Okay, you are ready to learn new things!

SQL WHERE Clause

Note: to get the most out of this article, you should not just read it, but actually do the coding part with me! So if you are on the phone, I suggest you save this article and continue on a computer!

In SQL for Data Analysis episode 1 we did this from the Terminal, but this time please re-run this query from SQL Workbench:

SELECT * FROM zoo WHERE animal = 'elephant';
SQL WHERE clause - a simple where

This is the most basic example of how to use the SQL WHERE clause. But you can (and have to) be more sophisticated.

SQL WHERE with comparison operators

Well first of all – you can do more than “something equals something.” It can be “greater than,” “less than”,” “not equals to,” etc…

Here’s a short list of the comparison operators that work with the SQL WHERE clause:

comparison operatorWhat does it mean?
=equals to
<>not equals to
!=not equals to
<less than
<=less-equal than
>greater than
>=greater-equal than

Let’s see some concrete examples:

1. How would you select all the animals, that are not zebras?

Solution:

SELECT * FROM zoo WHERE animal <> 'zebra';
SQL WHERE clause - not zebras

2. How would you select all the animals for whom the water_need is less than 300?

Solution:

SELECT * FROM zoo WHERE water_need < 300;
SQL WHERE clause - less than

Okay, I think you got it!
So let’s take this to the next level!

SQL WHERE clause with LIKE operator

Sometimes you want to do your selection based on a pattern. Take a look at this fictional data set:

user_nameuser_id
frank1001
franky1002
frankenstein1003
francis1004
frappuccino1005

Let’s say, you want to select all the rows where the user_name starts with ‘frank‘.

Now it’s time to boost your WHERE clause with a LIKE operator:

SELECT * FROM franks WHERE user_name LIKE 'frank%';

How does it work?

The LIKE logical operator is just like the >, = or != — so it’s a comparsion operator. But it tells SQL that you want to filter results based on a pattern — and not on an exact value. Once you typed LIKE you specify what the pattern is exactly. Let’s say it’s 'frank%'! Do you see anything tricky here? Yepp, it’s the % character – which in this case means that any string of characters (of any length) can follow frank.

Hence your result will be:

SQL WHERE clause - franks

To hammer this home, let’s go back to our zoo dataset and try to answer these questions:

1. How would you select all animals whose name contains at least one e character?

Solution:

SELECT * FROM zoo WHERE animal LIKE '%e%';
SQL WHERE clause - LIKE operator

It needed a % at the beginning and also at the end of the pattern, because the ‘e‘ character could be anywhere in the word, not just at the end or the beginning. (If you don’t get it, try removing the first % and you will understand it immediately!)

2. How would you select all animals whose name ends with ‘roo‘?

Solution:

SELECT * FROM zoo WHERE animal LIKE '%roo';
SQL WHERE clause - LIKE roo

There is another type of tricky character: _. (Oh, by the way, from now on I’ll just use its official name: wildcard.)

  • % means any string of characters of any length.
  • _ means any character you want, but only one.

E.g. the word tiger can be found using wildcards like 't%'. But you can use 't____' too.

Knowing this trick, here’s another short question:

3. How would you select all animals whose species name is exactly five characters long?

Solution:

SELECT * FROM zoo WHERE animal LIKE '_____';
SQL WHERE clause - LIKE 5 chars

This is how the SQL WHERE clause works with the LIKE operator.

How to combine different conditions in an SQL WHERE clause

You can combine multiple conditions — using logical operators! There are more, but here I’ll highlight the two most important: AND and OR.

Let’s say we want to select all the animals whose name is exactly five characters long, except for tigers. We have two conditions and we can do that with SQL WHERE boosted with the AND logical operator.

SELECT
  *
FROM
  zoo
WHERE
  animal LIKE '_____'
  AND
  animal <> 'tiger';
SQL WHERE clause - LIKE and AND

Nice, zebras only!

Now twist it a bit and select only those animals, that:

  • have a name exactly five characters long
  • are not tigers
  • have a water_need greater than 200

Solution:

SELECT
  *
FROM
  zoo
WHERE
      animal LIKE '_____'
  AND animal <> 'tiger'
  AND water_need > 200;
SQL WHERE clause - LIKE multiple ANDs


Yes: you can use multiple AND operators.

Finally, let’s see an example of the OR operator!

SELECT
  *
FROM
  zoo
WHERE
  water_need < 300
  OR
  animal = 'lion';

It returns all the animals that are lions, plus all the animals that have less than 300 water_need.

SQL WHERE clause OR operator

Get it?

  • AND returns every row where all the conditions are true.
  • OR returns every row where at least one of the conditions is true.

The SQL IN operator

Imagine a situation where you want to select all the animals whose unique id is any of these: 1001, 1008, 1012, 1015, 1018.

You have already learned a way to do that! Can you find it out?
It’s the SQL WHERE clause with multiple OR operators. Something like this:

SELECT
  *
FROM
  zoo
WHERE
  uniq_id = 1001
  OR uniq_id = 1008
  OR uniq_id = 1012
  OR uniq_id = 1015
  OR uniq_id = 1018;
SQL WHERE clause - multiple OR


While it’s a valid solution, it’s certainly not the most elegant way to get the job done! If you want to reduce the number of ORs in your SQL WHERE clause, you should use the IN operator instead.

This SQL query returns the exact same result as the previous one:

SELECT
  *
FROM
  zoo
WHERE
  uniq_id IN (1001,1008,1012,1015,1018);
SQL WHERE clause - where IN


It’s a little bit shorter — so it’s much nicer. (You know, when it gets to coding, we don’t like to repeat things.)

The SQL NOT operator

There is an extra logical operator, which is by the way extremely simple. NOT will modify your logical operator to do its opposite instead. Let’s get back to a previous example with the 5 characters long animals:

SELECT * FROM zoo WHERE animal LIKE '_____';

How to select all the animals that are not 5 characters long?

SELECT * FROM zoo WHERE animal NOT LIKE '_____';

SQL WHERE clause NOT operator

Simple as that.

Import a bigger dataset into postgreSQL

Are you bored with our 22-row zoo dataset? Me too! It’s time to go a bit bigger! Let’s import the 7.000.000+ row air-delays dataset we have used before in the bash tutorials! Follow these steps to have it imported into your SQL database!

UPDATE: I’ve realized that this might be a bit complex – not difficult, just complex – so I put all the instructions into a short video too! Click here or read below:

Note: we will work in bash. If you haven’t done my bash tutorial series yet, I highly recommend doing at least the first episode, but if you don’t want to, it’s also okay to simply follow my lead step by step below.

  1. Open Terminal and login (ssh) to your data server!
  2. Download the flight_delays data!
    wget http://stat-computing.org/dataexpo/2009/2007.csv.bz2
    Note: If you have this already, skip forward to 5.
    As of  29 December 2019, I realized that the dataset has been removed from its original place. Now, you can download the dataset using this code instead:
    wget 46.101.230.157/sql_tutorial/2007.csv.bz2
  3. Set up dtrx! That’s a command line tool for unzipping stuff!
    (Note: this might have been already set up; if so, skip this step!)
    sudo apt-get install dtrx
  4. Unzip the .csv file!
    dtrx 2007.csv.bz2
    Note: It will take around ~60 seconds to process the whole file, so don’t worry, your Terminal is not freezing, it just needs some time.
  5. Format your data!
    cat 2007.csv |cut -d',' -f1,2,3,4,5,7,10,11,14,15,16,17,18,19 | grep -v ',NA' > sql_ready.csv
  6. Now we have to give permission to our postgreSQL user to create tables and load data into them. This will need multiple steps. Here’s a gif first (note: my username is dataguy – yours might be something else).
    psql_superuser
    First sudo to the user called postgres:
    sudo -u postgres -i

    Then start postgreSQL:
    psql

    The prompt will change to this:
    postgres=#!

    Type:
    ALTER USER [your_user_name] WITH SUPERUSER;

    This turns your original user into a super user.
    Exiting from postgreSQL. Type:
    \q

    Then exit from the user called postgres:
    exit

    Finally access your original user’s postgreSQL database from the command line:

    psql -d postgres

    Okay, this was the hard part…
  7. Now all you need to do is create the table by simply copy-pasting these lines into your terminal:
CREATE TABLE flight_delays (
  year INTEGER,
  month INTEGER,
  dayofmonth INTEGER,
  dayofweek INTEGER,
  deptime INTEGER,
  arrtime INTEGER,
  flightnum INTEGER,
  tailnum VARCHAR,
  airtime INTEGER,
  arrdelay INTEGER,
  depdelay INTEGER,
  origin VARCHAR,
  dest VARCHAR,
  distance INTEGER);
  1. And finally, copy the data from the .csv file you have just downloaded!
    COPY flight_delays FROM '/home/tomi/sql_ready.csv' DELIMITER ',' CSV HEADER;

    Note: make sure that you type your user name where I’ve typed tomi or dataguy (which are my user names…)
  2. Go back to SQL Workbench and make a simple SELECT statement… but make sure that you use the LIMIT clause, too. Why?
    Because now you have over 7.000.000 rows of data. PostgreSQL can handle it easily, sure, but your computer might be frozen if you try print all that data on your screen.

    So try something like this first:
    SELECT * FROM flight_delays LIMIT 10;
    SQL WHERE clause TEST bigger data set

Test yourself #1

Here’s assignment number one:

How many flights did the plane with the tail-number N253WN take on 23 April, 2007?

(Hint: as we haven’t learned the COUNT function yet, it’s enough if you print all the flights with the given conditions and count the lines for yourself.)
.
.
.
Done? Here’s my solution:

SELECT
  *
FROM
  flight_delays
WHERE month = 4
  AND year = 2007
  AND dayofmonth = 23
  AND tailnum = 'N253WN';
SQL WHERE clause - assignement 1

The only trick is that the day, the month and the year are in different columns, so you have to use multiple AND operators to filter simultaneously for all the conditions.

(Note: since the month, year and dayofmonth fields are integers, you don’t need apostrophes. But you need it for the tailnum field, because it contains letters too. I’ll get back to this topic later.)

Test yourself #2

Select all the rows with these conditions:

  • the flight was in April 2007
  • it was an even weekday (2nd, 4th or 6th day of the week)
  • the flight distance was less than 50 miles
  • either the departure or the arrival delay was less than 0 (means the plane took off or landed earlier than planned)

(Hint: you will have to use OR and AND operators together!)
.
.
.
And my solution is:

SELECT
  *
FROM
  flight_delays
WHERE month = 4
  AND dayofweek IN (2,4,6)
  AND distance < 50
  AND (arrdelay < 0
  OR depdelay < 0);
SQL WHERE clause - assignement 2

Can you see the trick here? In SQL WHERE clauses AND is stronger by default than OR – thus if you don’t use the parentheses, the execution order of the different filters would be like this:

WHERE (month = 4
  AND dayofweek IN (2,4,6)
  AND distance < 50
  AND arrdelay < 0)
  OR (depdelay < 0);

And that would have returned wrong results! Take-away is: always use parentheses when adding multiple conditions in your WHERE clause — so SQL can priortize the right filters.

Conclusion

Knowing, understanding and using SQL WHERE clauses with the right operators is crucial. In this article you have learned the most important parts of it.

Now – having a good base knowledge – it’s time to continue with the next important SQL topic! In my next article, I’ll introduce the basic SQL functions (MAX, MIN, SUM, COUNT) and I’ll show you some more important clauses (ORDER BY, GROUP BY, DISTINCT).

You can continue by clicking here.

SQL for Aspiring Data Scientists (7-day online course)

I’ve created an online course that will take you from zero to intermediate level with SQL in 7 days. Go ahead and check it out here:

sql for aspiring data scientists online course header 2

More info…


Cheers,
Tomi Mester

Sources:

← Previous post

Next post →

22 Comments

  1. Katrina

    Hi, Tomi.

    I have been reading this with great interest. The one thing that I kept wondering is the use of parenthesis in the second test. Could the last line of the solution be also expressed as:
    AND ((arrdelay OR depdelay)<0)
    It is probably a wasteful use of parenthesis, but I intuitively went for it, having worked with Boolean before, so was wondering if it's something that would work. I haven't been able to do the coding myself as I'm having some issues with the laptop, but hoping to get on it soon.

    • hey Katrina,

      thanks for the comment.
      Even if it sounds logical, unfortunately it doesn’t work in PostgreSQL.

      The postgreSQL logic is:
      1. Evaluate if arrdelay < 0 is TRUE/FALSE. 2. Evaluate if depdelay < 0 is TRUE/FALSE. 3. Then either of the above 2 is TRUE (with the OR operator). And it doesn't work in the advanced way as you have described! 🙂 Cheers, Tomi

  2. Hi Tomi,

    Thanks for putting these together. I’m following step by step, but I’m stuck at cat 2007.csv |cut -d’,’ -f1,2,3,4,5,7,10,11,14,15,16,17,18,19 | grep -v ‘,NA’ > sql_ready.csv

    I’m getting an error: 2007.csv: No such file or directory.

    Not sure why. Any suggestion? Thanks

    • hi Kay,

      the problem could be two things:
      a) You are in the wrong directory! (If it’s so, navigate back to the right directory!)
      b) You either didn’t download or accidentally removed the 2007.csv file! (If it’s so, download it again as described in the article!)

      Cheers,
      Tomi

  3. Man, I always get this error when tryin to import the data

    ERROR: extra data after last expected column

    • hi Simas,
      I think you missed a column when you create your SQL table!
      Pls double-check my video and make sure you do everything I do there! : )
      Cheers,
      Tomi

  4. Your tutorials have been great Tomi. Thank you. Just a side note for anyone that may have the “2007.csv: No such file or directory.” error. For anyone copying Step 8 from “Import a bigger dataset into postgreSQL”

    “COPY flight_delays FROM ‘/home/tomi/practice/sql_ready.csv’ DELIMITER ‘,’ CSV HEADER;” which is copied from the article should have “practice” removed like it is in the video. That was the issue I faced and hopefully it might help others.

    • Thanks a lot for the comment Sean!
      Yes, being in the right directory is really important…
      Based on your comment, I have just updated that part of the article – so maybe it will help people to avoid possible confusion! 🙂

      Thanks and cheers,
      Tomi

  5. Stefano

    Hello! I get this error after creating the table flight_delays and trying to copy sql.csv data into it:

    ” ERROR: invalid input syntax for integer: “NA”
    CONTEXT: COPY flight_delays, line 40, column deptime: “NA”

    No idea why it happens. I believe I followed all the steps properly.

    • Stefano

      Lol I find the answer, my bad, I had made a mistake when formating the data into sql.csv in this command ” cat 2007.csv |cut -d’,’ -f1,2,3,4,5,7,10,11,14,15,16,17,18,19 | grep -v ‘,NA’ > sql_ready.csv ” sorry about that.

  6. Justin H.

    Tomi,

    The tutorials have been well paced and the gaps we have to fill in generally are well done. Thank you for the time and effort!

    I have a question and I’m getting lost in Stack and Google with trying to upload my own file.

    I cannot load using the same syntax because in one of my columns I have a comma between two ” ” It is a csv file I’ve downloaded from a google survey I do everyday with my athletes. An example here:

    11/7/18,Jackson Picker,3,3,3,3,4,3,7,"feeling alright, same situation with the hip.",4.1,,

    With this as the real issue:
    "feeling alright, same situation with the hip."

    It needs to be read without the comma. I’ve tried sed, grep, awk to no success either.

    I’ve tried in psql on my remote server with the quest_ready.csv file in that directory:

    COPY questionnaire FROM '/home/juharris/quest_ready.csv' WITH CSV DELIMITER ',' QUOTE ‘“‘ CSV HEADER;

    Its been a week and still won’t go!!!!

    Other tables go but this one is the whole reason why I’m learning SQL/Python!

    Other people have to have come across this with data that people submit text responses.

    • hey Justin,

      thanks for the question and no worries, that’s a quite typical data cleaning issue. : )
      It can be done in bash or Python (I prefer bash.)

      In this very particular case this will work:
      sed "s/, / /g"

      But of course, you’ll have to check this for all lines.
      Although it’s a good enough assumption that in comments they’ll use spaces after commas… Anyways, it would get much easier if you could save the file separated by ; or tabs.

      Tomi

  7. Hey Tomi,

    Can you explain this line cat 2007.csv |cut -d',' -f1,2,3,4,5,7,10,11,14,15,16,17,18,19 | grep -v ',NA' > sql_ready.csv actually How it works?

  8. Hi Tomi, I play around with my own test , and I have a question. So just answer when you’re comfortably available!

    So let’s say I’d like to filter the flight data of
    >> Month 3 and 5 wher
    >> day of week is only 6 and 7
    >> airtime only 200or function – so it failed. The only workaround I managed is to include this at the end:

    and month NOT in (1,2,4,6,7,8,9,10,11,12);

    So.. despite receiving the result I wanted, I still feel like without the NOT function there should be another solution with just smartly using the and / or function and parentheses.. Would you show me how you’d do this prompt?

    • well stupid me forgot to attach my failed code:

      select * from flight_delays
      where month in (3,5)
      and dayofweek in (6,7)
      and (airtime > 200 and airtime 250 and airtime < 270);

      Then adding the NOT function at the end.

      • hey Mariska,

        I’m not 100% sure what’s the original task based on your description but there is a missing comparsion operator for sure in your dataset:

        SELECT *
        FROM flight_delays
        WHERE month IN (3,5)
        AND dayofweek IN (6,7)
        AND ((airtime > 200 AND airtime < 220) OR (airtime > 250 AND airtime < 270));

        I suppose this is what you meant. And this query will just run! : )

        Hope this helps!
        Tomi

  9. Unfortunately it looks like the link you provided to download the data: http://stat-computing.org/dataexpo/2009/2007.csv.bz2 no longer works.

  10. Zalan Taller

    Hey Tomi,

    I am gettin this warning after trying to unzip the csv file:
    dtrx: WARNING: extracting /home/tallizali/2007.csv.bz2 to 2007.csv.5

    Do you have any idea what should I do differently?

    Thanks a lot in advance!

    Zalan

    • hey Zalan,

      that means that you have extracted the same file multiple time (hence 2007.csv.5 and not 2007.csv.

      You just have to remove the all the 2007.csv files with the rm 2007.csv* command — then rerun the dtrx line and you’ll be fine.

Leave a Reply