It’s time to continue with learning data science and data coding in Bash! Last time I covered the very basic orientation commands. Now moving on with the most important data cleaning and data shaping commands.
Note1: to get the most out of this article, you should not just read, but actually do the coding part with me! So if you are on the phone, I suggest to save this article and continue on Desktop!
Note2: If you don’t have the data infrastructure yet, you can install it on your own in roughly 30-60 minutes: Data Coding 101 – Install Python, R, SQL and bash! Or alternatively you can use the Data36 Learn Server.
Note3: If you haven’t read I suggest to start with bash ep#1 first: Data Coding 101 – Intro to Bash ep#1.
Options in Bash
When you execute a command, you can execute the main function of it. Or with an “option”, you can execute a modification of it. Let’s see a concrete example and you will get it in a second. Last time we learnt the
wc command (stands for word count). Let’s go back to our practice library and try these 5 commands one by one:
wc -c flightdelays.csv
wc -w flightdelays.csv
wc -l flightdelays.csv
wc -L flightdelays.csv
wc by itself counted the lines, the words and the characters in your file, then printed it on your screen.
wc -c counted only the characters.
wc -l only the lines. And
wc -w only the words. But can you find out what did
wc -L do? Well, don’t spend too much time by guessing – because it’s almost impossible to find out by yourself… It counted the number of characters in the longest line of your file.
The thing you should take away from here is that you can add options to your commands with the dash plus a character combination (eg “
-v"). What kind of options exist? I’ll show you the most important ones for each command we go through here. Then later I’ll show you, how you can find more.
Print in a file with bash
Another important bash concept to understand is that what happens with your data, when you execute a command. By default the result will be printed on your Terminal screen. Alternatively you can “save” this result by putting it to a (new) file.
As a simple example, last time we’ve printed the first 10 lines of our file with
head to our screen (
head flightdelays.csv). But if you add an extra
> sign and a new filename, it will print the same result not on your screen, but to a new file – so you can use this modified data anytime in the future. Test this bash command:
head flightdelays.csv > first_ten_lines.csv
Note: It’s good to know, that if the “first_ten_lines.csv” file didn’t exist so far, then it will be created. If it did exist (eg. in the case, if you re-run this command in the future), then it will be overwritten.
Once you are done, list out (
ls) the files in your “practice” directory and you will find, that this new file (first_ten_lines.csv) is there for real. You can
cat it too. Drumroll… the new file has the first 10 lines of the flightdelays.csv indeed.
Test yourself #1
Here’s a little assignment to test yourself. It’s based on the aboves, so I can assure you: you can solve this. However you will need an extra hint:
tail commands have options to print not 10 lines, but as much as you want. Eg.
head -100 flightdelays.csv will print the first 100 lines of a file. The
tail -23 flightdealys.csv will print the last 23 lines of the file.
But then here’s the exercise:
Print the exact 1824th line of the flightdelays.csv file into a new file, called 1824.csv!
And here’s the solution. (Check out, if you solved it right):
head -1824 flightdelays.csv > tmp.csv– this prints the first 1824 lines of flightdelays.csv into a temporary file, that I named tmp.csv. (You can name it however you want to.)
tail -1 tmp.csv > 1824.csv– this one cuts the last line of the tmp file, then place that one line to the 1824.csv.Got it, right? The last line of the first 1824 lines is the 1824th line. 😉
cat 1824.csv– a simple proof-check. If you did everything right, you should get this data back (check the picture for the full line):
Nice job there!
The concept of pipe in Bash
This is the last main concept regarding data coding in bash that you have to know. And this one will make your job so much easier. Remember, I told you there are two alternatives to print your results: A) to your screen, B) to a new file? Well, there’s a third one: pass the results onto a new command. But what does this mean exactly?
Often when it comes to data cleaning and data transformation, one bash command is not enough to do the job – you need to apply several commands after each other. This is exactly, what we did in the exercise above. Run a command, put results into a temporary file, then run another command and so forth and so on.
With pipe, you can skip the temporary file part – and redirect your modified data directly to a new command. By the way the pipe character looks like this:
Let’s try it! Type in this into your command line:
head -1824 flightdelays.csv | tail -1
And you will see the same results you had before. The difference is, that all these happened in one line.
Again the above gives back totally the same as this one did:
head -1824 flightdelays.csv > tmp.csv
tail -1 tmp.csv
It’s just much shorter.
Don’t worry, if you don’t get this pipe-thingy immediately. I’ve held several workshops about data coding in bash, and I found that this is the most difficult to understand bash-concept for first-timers. Just re-read this subchapter as many times as you want, spend some time to process it – and if you are still not 100% sure, that’s just okay. We will use pipe so many times in my data coding tutorials, that I guarantee you will understand it sooner or later! 😉
Test yourself #2
Here’s another exercise:
Count the number of characters in the first 100 lines of flightdelays.csv! Do it by using the | pipe character.
Okay. Are you done?
Here’s my solution:
head -100 flightdelays.csv | wc -c
And the result is:
I can’t emphasise enough that you can get the same result, if you do this in 2 steps. But it’s not the shortest way.
head -100 flightdelays.csv > tmp.csv(Note that this command will overwrite your previous tmp.csv file.)
wc -c tmp.csv
Print specific columns: cut
Okay, you have learnt the 2 main concepts. In the second half of this article we will focus on some new command line tools. The first one is the
cut, which you can use to print specific columns. Why is it needed? Well, print the first 10 lines of your files on your screen again:
See the first line of your file? It’s the header. And boy, that’s a mess. It’s so long, that your Terminal had to break it into more lines… But usually you don’t need all the columns. Let’s say, that we want to run a predictive analysis and we need only these five columns:
- Year (column #1)
- Month (column #2)
- DayofMonth (column #3)
- ArrDelay (column #15)
- DepDealy (column #16)
Note: By the way, you can find more info about what columns is what: here.
To keep only these five columns, let’s use
cut -d',' -f1,2,3,15,16 flightdelays.csv
If you type this bash command, you will see an endless (okay not endless, but very long) data-flow on your screen. So let’s use
head with a pipe
| and keep the first 10 lines only:
cut -d',' -f1,2,3,15,16 flightdelays.csv |head
Some explanation what happened here:
We used the cut command with 3 arguments, 1 of these were the filename (flightdelays.csv) and the rest were options. (Remember? The stuff with the dash character!). One by one:
cut –» the main command
-d',' –» dash tells to bash, that this is gonna be an option,
d (stands for the word “delimiter”) tells that we will specify the delimiter and
',' tells the delimiter itself is gonna be a comma.
-f1,2,3,15,16 -–» again, dash means that this is gonna be an option,
f (stands for “field”) let bash know that we are gonna specify which exact columns we want to get back – and the numbers are the number of the columns.
flightdelays.csv –» the file we want to execute the command on
Note that the delimiter could be anything. If your file is delimited with
; then you can specify it with
cut -d’;’ -f1,2 filename.csv or if it’s delimited with
-, then it’s
cut -d’-’ -f1,2 filename.csv, etc…
Aaand: that’s it! This is how
cut works. Try to print other columns, maybe combining
cut with the
head command (use pipe!
|). Once you are confident with
cut, I’ll continue with…
Filter to specific rows: grep
grep. Grep is pretty much the same as the “filter” function in excel. It can help you to get back specific rows, based on a pattern (characters, words, numbers, etc…) what you are filtering on. The most basic way to use it is to print lines containing your filter-string (string is a fancy techie way to say “word” basically):
grep 'N749SW' flightdelays.csv
If you try this command (note that – as per usual – I used it with
head to have only the first 10 lines printed), on your screen you will see all the lines, that contains the ‘N749SW’ string. Anyway, they will pop up in the TailNum column. (It’s the unique identifier of the airplanes.)
So the format is:
grep 'your_filter_word' your_file.csv
As you can see we did not tell bash anywhere, in which column we would like to see our pattern. This is because it’s not possible unfortunately. Let me show you a simple case, when this becomes problematic. Try for instance to
grep for DAL (Dallas). See? You get all the lines with “DAL” – regardless if the string is in the Origin or in the Dest column.
If it doesn’t look like an issue for you, it will on the next Test yourself section. 😉
But before that I’ll show you some useful grep options:
grep -v 'N749SW' flightdelays.csv – the exact opposite of the original command. It gives back everything, but the lines with ‘N749SW’
grep -i 'n749sw' flightdelays.csv – by default grep is case-sensitive (means it matters if you type in capital or lower letter). If you use
-i, it won’t be case-sensitive anymore.
grep -n 'N749SW' flightdelays.csv – this one prints the line-number before the actual line. It will be handy, believe me.
I’ll show you some more, but it’s time to finish up this tutorial with the last section of…
Test yourself #3
This is the ultimate exercise, that covers everything you’ve learnt so far in this article (and in the previous one). Here it is:
Take the original file (flightdelays.csv) and print:
- the Arrival Delay (column name: ArrDelay) data
- for the flights that depart (column name: Origin) from San Francisco Airport (parameter name: SFO)
- print only the first 3 of those
- into a new file called: first3sfo.csv
Try to do it with only one line of command!
Hint1: You have to use the pipe | character multiple times. If it wasn’t clear so far: something like
tail file.csv | grep xxx | cut -d’,’ -f1,2,3 | head -3 > yyy.csv is a totally valid bash syntax! You are just continuously passing the data onto the new commands and eventually into your file.
Hint2: you don’t need to find the solution for the first try. Take advantage on the fact, that bash is a very iterative data coding language. You can try, fail, then fine-tune your command. Once you have the correct result on your screen, then you can save it into a new file.
Then I’ll show you the solution:
cut -d',' -f15,17 flightdelays.csv | grep SFO | head -3 | cut -d',' -f1 > first3sfo.csv
cat first3sfo.csv, you will get this results:
Some explanation step by step:
cut -d',' -f15,17 flightdelays.csv – Cut keeps only 2 columns from our file, the ArrDelay, what we are looking for, and the Origin, that we will use for filtering out the SFO departures. It’s important not to keep the Dest column, otherwise the next
grep command would use that column for filtering too.
grep SFO – This filters for all the lines, that has “SFO” in it. As we have kept 2 columns only and one of those is a number, this will filter only based on the “Origin” column.
head -3 – Keep the first 3 lines only.
cut -d',' -f1 – And after all, we don’t really need the Origin column anymore – as we used it only for
grep-ing the SFO departures. So just remove it. Did you notice, that the number of the ArrDelay column has changed (not 15th, but 1st)? This is because this last cut commands input was the already modified data, that contained only 2 columns.
Then print this whole stuff to a new file:
(For those, who would like to torture themselves by not using pipe, but tmp files – this is another possible way to get this done:
cut -d',' -f15,17 flightdelays.csv > tmp.csv
grep SFO tmp.csv > tmp2.csv
head -3 tmp2.csv > tmp3.csv
cut -d',' -f1 tmp3.csv > first3sfo.csv
But please: do the pipe version instead of this. :-))
A common mistake could be to change the order of the commands.
Eg. If you first
grep to SFO, it won’t just give back the flights departing from SFO, but the flights arriving to SFO as well – as I’ve described it already – so you will have some unnecessary “garbage” in your data. Or if you first use
head -3, these 3 lines won’t contain any SFO related data, so if you
grep after this to ‘SFO’, you will get back 0 lines. And so on… Make sure, you use the correct order.
Boom! You have just picked up another great piece of knowledge about data science in bash. You have learnt the 3 main concept of it: options, pipes and print in a file. And you have learnt 2 new commands as well: cut and grep.
I have tried to be as detailed as I could, but in the case if you need further clarification, do not hesitate to comment and ask!
I’ll continue from here (UPDATE: for episode 3, click here) and I’ll also provide free video tutorials – about data coding and data science – on my Youtube channel in the short future! If you don’t want to miss any of these, subscribe to my Newsletter!