SQL for Data Analysis – Tutorial for Beginners – ep1

SQL is a must, if you want to be a Data Analyst or a Data Scientist. I have worked with many online businesses in the last few years – from 5 people startups up to 5000+ employees multinational companies and haven’t seen a single company who would not use SQL for Data Analysis (and for many more things) in some way.

SQL is simple and easy-to-understand. Thus not just engineers, developers or data analysts/scientists can use it, but anyone who is willing to invest a few days into learning and practicing it.

I have created this SQL series to be the most practical and most hands-on SQL tutorial for aspiring Data Analysts and Data Scientists. It will start from the very beginning, so if you have never touched coding/programming/querying, that won’t be an issue!

Let’s go!

SQL for data analysis 0 - intro

Note: as this is gonna be a practical tutorial, I encourage you to do the coding part with me. If you do so, you will need a data server to practice. So I recommend to go through this article first: https://data36.com/data-coding-101-install-python-sql-r-bash/

SQL for data analysis? What is it? And why is it good?

SQL stands for “Structured Query Language”.
But I like to describe it as “Excel on steroids” too. Here’s the very same data set in Excel and in SQL:

SQL for data analysis 1 Excel vs SQL

As you can see, SQL and Excel are quite similar. Both of them present data in a 2-dimensional table format with rows and columns. Both are very structured, very transparent.

However there are differences as well. There are 2 major things that you should know as a beginner in SQL:

  1. The first one is the performance. While Excel is great with smaller data sets, it becomes really slow and inefficient, when you use different formulas for example on a file with 100k+ rows. In SQL even 10M+ rows can be processed fairly quickly. (For instance in the upcoming SQL for Data Analysis articles we are going to use a 7M+ rows data set.)
  2. The second is how you access your data. Excel is primarily a Graphical User Interface (aka. GUI). You can scroll, type your formula, select your range with your cursor, etc. Very handy. In SQL, you don’t have this interface, you have to type so called “SQL queries” instead. See the gifs below:
SUM function in Excel

SUM in Excel

SUM in SQL

SUM in SQL

For the first look this can feel like a hassle. But believe me, once you’ve understood the basics of SQL, you will find it clearer and more efficient than Excel. Mentioning only two advantages: joining tables is much easier in SQL than it is in Excel. Also automating and reusing your stuff will become more convenient with SQL too.

When you will use SQL for data analysis, you will use it (most probably) for simple tasks: aggregate things, join data sets, use simple statistical and mathematical methods. But you will be able to do these more efficiently and on much bigger data sets, than you did it before.

What about Python, R and bash?

If you have done my previous bash tutorials or if you have attended some of my live webinars, you might ask: is SQL better or worse, than Python, R or bash?
The answer is… well, there is no categorical answer for this question!

Once you will start to apply these languages on real life analytics projects, you will see that Python and R are good in some things and SQL is good in other things. The main differences will be syntax, “features” and performance… But I really don’t want to go into that topic right now, because:

a) I promised a practical tutorial and that topic is a bit more theoretical (even philosophical)

b) it’s a bit more advanced stuff anyway.

c) at this level, you don’t have to worry about performance at all – and when you will, most probably a senior data scientist or a data engineer will help you out.

Note: If you are really curious though, I gave an ~1 hour presentation here comparing Python, R, SQL and bash.

Note2: Some big data frameworks (eg. Spark, Hive, etc.) also support SQL!

But now, let’s jump on the practical part!

STEP 0 – Install your own SQL environment to practice!

Note: if you have your own SQL environment already, you can skip further to STEP 2 – Get some data!

First, please go through this article. It’s a step by step tutorial about how to set up your own Data Server and install bash, Python, R and SQL on it.

Note: in my “SQL for Data Analysis” articles I’ll use postgreSQL! There are other types of SQL languages (eg. another well-known open source one is mySQL). The good news is that all SQL languages are very similar – if you learn postgreSQL, it will be a matter of hours (or even minutes) to adapt to another one. I picked postgreSQL, because it’s very popular (lots of the online businesses are using it) and at the same time it’s claimed as the most advanced open source SQL language.

If you went through the above linked article, please double-check, if you have these 3 things:

  1. A data server with Terminal (or iTerm) access.
  2. PostgreSQL installed on your data server.
  3. Pgadmin4 (or SQL Workbench) on your computer.

If something is missing, please read the article again: Install bash, Python, R and SQL!

Note: As an SQL query tool I prefer SQL Workbench over pgadmin4, but this is kind of question of taste. In my SQL for data analysis tutorials I’ll use SQL Workbench (here’s a tutorial about how to install it on your computer), but feel free to use pgadmin4. It won’t make any difference!

STEP 1 – Login to your SQL database via the command line!

Great times! You will write your first SQL query in a minute! First access your SQL database from the command line! You have done this once, we will just repeat the same process:

  1. open Terminal (or iTerm).
  2. ssh to your data server
    In my case I type:
    ssh tomi@[my_ip_adress]
    SQL login to your database
  3. Once I’m logged into the server, I want to access my postgreSQL database. As I have already given access to my user, I just have to type this command.
    psql -U tomi -d postgres –» psql is the command itself and -U specify your username (in my case “tomi”), and -d specify your database’s name (in my case postgres – the same goes for you.) Your prompt should change to this:
    postgres=>SQL login to your database2And done! You have full access to your SQL database!
  4. As a test, type this:
    \dt
    It will list all your data tables. You have only one so far… But this will change soon!
    SQL tables

Again! If something is missing or not working, please re-read this article: Install bash, Python, R and SQL!

Note: you will see sometimes that an SQL databases are referred as “relational databases”.  For our purpose “relational database” and “SQL database” will be pretty much the same.

STEP 2 – Get some data!

In this tutorial, we will use a really small data set, called zoo.
You can download it from here in raw .tsv format and you can just play around with it in Excel. Or in bash/command line, if you have already done bash tutorials.

But let’s continue with SQL:

1) Create a table, where we can load our data into. In my further tutorials I’ll explain exactly, what’s going on here, but for now, you can just simply copy-paste this few lines into your terminal:

CREATE TABLE zoo (
    animal varchar(10),
    uniq_id integer PRIMARY KEY,
    water_need integer
);

postgresql create table

2) Double check if the table was created:
\dt

Weee! We have our new data table: zoo!
postgresql table created

3) Load the data!
Again: I’ll explain this later, but now, just simply copy-paste this SQL query:

INSERT INTO zoo (animal,uniq_id,water_need) VALUES
    ('elephant',1001,500),
    ('elephant',1002,600),
    ('elephant',1003,550),
    ('tiger',1004,300),
    ('tiger',1005,320),
    ('tiger',1006,330),
    ('tiger',1007,290),
    ('tiger',1008,310),
    ('zebra',1009,200),
    ('zebra',1010,220),
    ('zebra',1011,240),
    ('zebra',1012,230),
    ('zebra',1013,220),
    ('zebra',1014,100),
    ('zebra',1015,80),
    ('lion',1016,420),
    ('lion',1017,600),
    ('lion',1018,500),
    ('lion',1019,390),
    ('kangaroo',1020,410),
    ('kangaroo',1021,430),
    ('kangaroo',1022,410);

If everything is fine, you should get this message back:
INSERT 0 22

postgresql table load data

The most important SQL statement: SELECT

It’s time to learn the most essential SQL statement. This is:

SELECT * FROM table_name;

You will use SELECT every time, when you will read, filter, transform, aggregate or do whatever with your data. Also 90% of the things I’ll show you in these tutorials will be somehow the modification of this one single query. For a start let’s SELECT everything from our zoo data table.

SELECT * FROM zoo;

SELECT SQL FROM

I’ve got my full table back in a nice and readable format. (If you want to quit from this view, please hit Q on your keyboard.)

SELECT * FROM zoo;
I guess it doesn’t even need an explanation – as the syntax itself is really close to English language – but just in case:
SELECT is the main statement and it tells SQL that we want to read something from our data table.
* usually refers to “everything” – in this case it means, that we would like to select every column.
FROM tells SQL, that we are gonna specify the table where we want to read our data from.
zoo is the name of the table. You can replace this with any other table name, if you have more tables.
; this is an SQL specific syntax. Every query should be closed by a semicolon. If you accidentally miss it, SQL will expect you to continue your query and won’t return anything on your screen.

SELECT SQL FROM 2

You will use this syntax a lot from now on.

SELECT columns

You can replace the * character with actual column names. Try this query:
SELECT animal, water_need FROM zoo;

SELECT SQL FROM 3

(If you want to quit from this view, please hit Q on your keyboard.)

Exactly, what you have expected: you got the columns “animal” and “water_need” on your screen, but not the “uniq_id” anymore.
This way you can SELECT any columns. You just have to specify the name of the columns separated with commas.

You can do something like this too:
SELECT animal, animal, animal FROM zoo;

SELECT SQL FROM 4

You would see the same column multiple times… But as it makes no sense, I’d recommend just simply not doing it. 🙂

Show the first few lines of your data – LIMIT clause

Right now we are working with a data table of 22 rows. But in the next episode of this “SQL for Data Analysis” series we will use a 7M+ rows data file to do more advanced analytics stuff. That’s a big change and in that case it will be more than reasonable not to print all your data on your screen every time, but take a small sample of the first few lines instead. To do this, use the LIMIT clause – which is a small “extension” on the top of your previously introduced “base-query”:

SELECT * FROM zoo LIMIT 10;

This will print the first 10 lines only. (Remember how was it done in bash? It was the head -10 command.)

SELECT SQL LIMIT

Of course you can decide, how much lines you want to print and specify it right after LIMIT!

Filter specific rows – WHERE clause

You can select specific rows based on values with the WHERE clause. Eg.:

SELECT * FROM zoo WHERE animal = 'elephant';

SQL WHERE

SELECT * FROM zoo –» this is the “base-query”
WHERE –» this tells to SQL that you want to filter to something
animal = 'elephant' –» animal is the name of the column where you are looking for the given value. And elephant is the value itself. In SQL you have to add the column, where you are looking for your value.
; –» Never forget the semicolon!

In the very next episode of the SQL for Data Analysis series I’ll go into details, how to get the most out of WHERE! For now it’s enough if you know: that’s for filter to rows.

Test yourself #1

This is an introductory article, so this first assignment will be a pretty easy one as well:
Select the first 3 zebras from the zoo table!
The solution will be more or less the summary of this article!
.
.
.
Ready?
Here’s my solution:
SELECT * FROM zoo WHERE animal = 'zebra' LIMIT 3;

SQL for data analysis assignment

SQL is easy, right?

And one more thing: the syntax…

Later on, I’ll show you some best practices and how to keep your SQL queries clean and efficient, but in this first article let me emphasize 2 things only:

  1. All your queries should end with a semicolon (;). If you accidentally miss it, SQL will expect you to continue your query and won’t return anything on your screen. Eg.
    postgres=> SELECT * FROM zoo
    postgres->
    Not so good…This is better:
    postgres=> SELECT * FROM zoo;
  2. SQL is not case-sensitive, when it comes to the SQL keywords (SELECT, WHERE, LIMIT, etc…). Eg.
    SELECT * FROM zoo;
    Works just as much as:
    select * from zoo;The case-sensitivity of table names, column names and values are the question of your settings. In our current setup (with postgreSQL) the table names and column names are NOT case sensitive, but the field values are. Eg.
    SELECT * FROM zoo WHERE animal = 'elephant'; –» WORKS
    SELECT * FROM ZOO WHERE ANIMAL = ‘elephant’; –» WORKS
    SELECT * FROM ZOO WHERE ANIMAL = 'ELEPHANT'; –» DOES NOT WORK

Note that conventionally most people use SQL Keywords as caps (SELECT, WHERE, LIMIT, etc…). It helps a lot to read your code.

Conclusion

Quite a nice first step to learn and use SQL for data analysis! Congrats! Now you can write your first very basic queries… but this is just the beginning! Let’s continue with the next episode and learn, how you can use the WHERE clause to filter stuff from your data sets as a pro.

If you don’t want to miss the next episode – and also would like to get info about my upcoming webinars, subscribe to my weekly Newsletter!

Cheers,
Tomi Mester

← Previous post

Next post →

5 Comments

  1. Awesome stuff Tomi, please keep going 🙂

  2. Tomi,

    Great intro article. I’ve been looking forward to the SQL programming part of your series. I only have very limited experience with SQLite and I’ve been meaning to get into something more robust, like postgreSQL – so this is perfect!

    -Adam

  3. Feel free to comment or ask questions here in regards SQL + Analytics!

Leave a Reply