You have already learnt a lot about the basics of SQL for data analysis. I figured, it would be nice to have now an episode focusing only on SQL best practices. So I wrote one! In this article you will learn:

  • How to format your SQL query to make it more reusable?
  • When to use capital and lowercase characters?
  • How to use aliases?
  • How to add comments?
  • And more…

If you are new here, it’s better to start with these articles (but for this particular article it’s not required):

  1. Setup your own data server (including your SQL server) to practice: How to setup Python, SQL, R and Bash (for non-devs)
  2. Install SQL Workbench to manage your SQL stuff better: How to install SQL Workbench for postgreSQL
  3. Read the first three episodes of the SQL for Data Analysis series ep1, ep2 and ep3
  4. Make sure, that you have the flight_delays data set imported – and if you don’t, check this video.

All set? Let’s see some SQL best practices!

SQL best practice #0: The order of your keywords

The order of your SQL clauses counts in your query. This is not even a best practice, this is a must. Looking only at the SQL clauses we have learned so far, this is the proper order:

  1. SELECT
  2. FROM
  3. WHERE
  4. GROUP BY
  5. ORDER BY
  6. LIMIT

If you don’t use it in this order, your query won’t work. Simple as that.

It’s good to know, that this is not the same order, how the computer processes the query. The logical order is this:

  1. FROM
  2. WHERE
  3. GROUP BY
  4. SELECT
  5. ORDER BY
  6. LIMIT

SQL first checks which data table we will work with. Then it checks the filters. After that it groups the data. Finally it gets the data – and if it’s needed sorts it and prints only the first X lines.
It’s great to keep this logic in mind, but the writing order is more important for now – so learn that one!

SQL best practice #1: Whitespaces and newlines

You have seen me to write queries in different styles.
Eg.

A)

SELECT * FROM flight_delays WHERE origin = 'PHL';

B)

SELECT *
FROM flight_delays
WHERE origin = 'PHL';

C)

SELECT
  *
FROM
  flight_delays
WHERE
  origin = 'PHL';

These three are the same queries. White spaces and line breaks do not affect the outcome of an SQL query. The only thing that matters is the ; that clearly indicates the end of the SQL query. Apart from that you can use as many white spaces and line breaks between the different keywords as you want.

So which coding-style is the best? There is nothing like a set-in-stone rule to answer this question. It depends on many things (and mostly on personal preferences), but just in general: people tend to use version C) the most. Why? Because that’s the easiest to read. It doesn’t make a real difference in a short query like this one, but it does make a huge difference, when it will turn into a very complex one (even 100 or more lines). Being organized with your SQL queries is key to avoid unnecessary headaches while you are debugging your code.

Note: I have to admit, that I get lazy as well sometimes and when I write an ad-hoc one-time short SQL query, I type everything in one-line (Style A)). But I break it immediately, when I see, that I need to expand it and thus it will become more complex.

You can experiment with your own SQL style, but I suggest to keep these 3 simple recommendations:

  1. Use line breaks at least before the main clauses (eg. SELECT, FROM, WHERE, etc.). I prefer to use it before column names, table names, and each WHERE condition too…
  2. Use indentation for column names, operators (AND, OR) and similars!
  3. Stay consistent!

Example:

SELECT
  animal,
  water_need
FROM
  zoo
WHERE
  animal LIKE '_____'
  AND animal <> 'tiger'
  AND water_need > 200;

SQL best practice #2: Upper case vs lower case

I’ve already mentioned in episode 1, that SQL is not case sensitive for the reserved keywords (eg. SELECT, AND, OR, GROUP BY). It can be case sensitive for column names, table names and field values – this depends on your settings.
Note: for the postgreSQL database, we use in these tutorial articles, only field values are case sensitive.

Regardless of case sensitivity it’s quite common to keep these best practices:

  1. All reserved SQL keywords (eg. SELECT, AND, OR, GROUP BY, etc.) should be written with capitals.
  2. All field values, column names and table names should be written with lowercase (except special situations of course, when the name of the column contains uppercase characters initially).

For example:

SELECT
  animal,
  water_need
FROM
  zoo
WHERE
  animal LIKE '_____'
  AND animal <> 'tiger'
  AND water_need > 200;

Note: if you use SQL tools like SQL Workbench (LINK), the different keywords will also be colored, which helps a lot in reading and processing.

SQL best practice #3: Aliases

Using the SQL functions will give new names to your newly generated columns by default. For example, when we have calculated averages for different segments in the previous episode, the new column’s name – that contained the actual averages – was “avg”. Just to recall:

SELECT
  AVG(depdelay),
  origin
FROM flight_delays
GROUP BY origin;

SQL alias

Again: the name “avg” is the default name given by SQL. But we can change it to anything using aliases.

Eg. if in the above results we want to see “average_depdelay” instead of “avg”, we can achieve it like this:

SELECT
AVG(depdelay) AS average_depdelay,
origin
FROM flight_delays
GROUP BY origin;

SQL alias 2

We have just simply added AS average_depdelay after the function. So the SQL keyword itself looks like this:
AS any_new_name_you_want_to_see.

Now go bananas with the previous query:

SELECT
  AVG(depdelay) AS average_depdelay,
  origin AS o
FROM flight_delays AS fd
GROUP BY o;

SQL alias 3

Aliasing works with multiple columns and even with table names (which by the way is going to be very handy in the next episode, where I’ll introduce the SQL JOIN clause). And have you realized, that I have referred the “origin” column using it’s new name “o” in the GROUP BY clause too? (By the way it would have worked with “origin” as well.)

It’s good to know 2 things:

  1. Using AS assigns the new header name temporary. The change is effective only in the certain SQL query.
  2. Above I gave you some stupid examples… The point of whole aliasing is to simplify and shorten the names and make the readability and usability of your code better. So be pragmatic and smart, when you use aliases!

SQL best practice #4: Comments

Commenting your code is a highly recommended in any programming languages. For instance if you are working in a team, it helps the team to understand your code. But you should comment for your own best interest too! If you write a query today, I guarantee that in 1 year you won’t remember at all, what the heck that COUNT(origin) function was good for… So the next SQL best practice is about helping yourself remember stuff!
I like to look at commenting as sending messages to myself in the future or in an alternate dimension. And it’s always really nice to receive these kind messages from my past-self (of course most of the times I’m also very angry, why past me couldn’t be more detailed).

Anyway. Commenting is important, no question around that. But how can you do it in SQL?
Very simple. Type this: --. After double-dash, nothing in that line will be executed.

Try this:

SELECT -- hello this is a comment
  AVG(depdelay),
  origin
-- this is a comment too
FROM flight_delays
GROUP BY origin;

SQL comments

If you run it, you will see, that the two comments won’t affect your results.

SQL best practice #5: ORDER BY column name

There is a little “trick” you will meet sooner or later. So I want you to hear it first from me – to avoid any trouble with it in the future! When you use ORDER BY and GROUP BY, you can of course use column names (as we have just done that so far) or instead of that you can use column numbers (that’s the new thing).

Eg. using column names:

SELECT
  AVG(depdelay),
  origin
FROM flight_delays
GROUP BY origin
ORDER BY origin;

SQL ORDER BY column name

And the same query using column numbers:

SELECT
  AVG(depdelay),
  origin
FROM flight_delays
GROUP BY origin
ORDER BY 2;

The 2 after the ORDER BY clause refers to “origin” – that’s the second column in our query… (If it were 1, it would be ordered by the first column, AVG(depdelay))

SQL ORDER BY column number

Try it yourself… and then never use it again!
Wait, what? Why?
Imagine a very common scenario: you tweak your query and you add an extra column (eg. SUM(depdelays)) as first in your SELECT statement… If you use column names, everything will just work fine. If you use column numbers instead, your query will order by the wrong column (because you should have changed the column number in the ORDER BY clause too, but you might have forgotten it).
So do yourself a favour and don’t add unnecessary points of failure to your query – your job will be complicated enough without them too!
So the fifth SQL best practice is: Always use the actual name of the columns, when you refer to them (either in ORDER BY or GROUP BY) and never use the number.

SQL best practice #6: Avoid SELECT *

Remember to my first SQL article ? I said, this is the base query, you have to learn, use and expand most of the time:

SELECT SQL FROM 2

Now you are good enough in SQL to learn the terrible truth: we don’t use * in real data projects’ real queries – at all.
There are a tons of good reasons for that, but the top two are: efficiency and readability. If you need one column from a table, why would you print all of them? It would mean you have to move more data from your SQL server to your computer – slowing down processes unnecessarily. And at the same time, if you add * and not column names in your query, you won’t have any clue, what columns you have in your data table, when you want to change something…
Use column names, not *.

Conclusion

That was my 6+1 SQL best practices for today. I hope it will help you a lot to be more efficient and practical with your SQL tasks! For further read I suggest this nice website – I’ve just discovered it recently too, but I have found it very useful.

Stay with me, as on the next episode I’ll show you one of the most well-known and appreciated feature of SQL: JOIN.

If you don’t want to miss it – and also would like to get info about my upcoming articles, video tutorials, webinars, etc. subscribe to my weekly Newsletter!

Cheers,
Tomi Mester