Data Coding 101 – Intro to Bash – ep6 (last episode)

This is the last piece of my “Data Coding 101: Introduction to Bash” series, where I introduce the very basics of how to use the command line for data science. If you read all 6 articles – including this one – you will have a good enough knowledge base to start your first data project. Today I’ll show you 4 more command line tools that we commonly use in everyday data analyses: sed, awk, join and date.
And if you haven’t read the previous articles, I recommend starting here:

sed – substitute characters and strings

As we have discussed before, bash handles all your data as text. This means that if you want to clean your data, you should think about the process as you would with a text file. sed stands for “stream editor” and it’s a very cool tool for modifying text based on common patterns across your whole file.

Let’s see a concrete example. Go back to your flightdelays.csv file. (If you don’t know what that is, check this article and download it!)

head flightdelays.csv

You already know that head prints the first 10 lines of your file to the screen.

[screenshot: the first 10 lines of flightdelays.csv]

As you can see, the field separator is “,”. It’s usually not the most useful delimiter, so let’s change it to something else. First, try spaces instead. This is where sed comes into play:

head flightdelays.csv |sed 's/,/ /g'

[screenshot: the same 10 lines with the commas replaced by spaces]

Note: In episode 2 I’ve already explained the role of the | (pipe) character. If you are not sure what’s happening, read that first!

We have printed the first 10 lines, then modified the output with sed! Here’s what happens, step by step:

sed ‘s/,/ /g’ –» After sed, you have to specify what you would like to modify in your file and how. You can do that with an expression between apostrophes. In this case: s refers to “substitute” and defines that we want to remove something and put something else in its place. / is simply the delimiter between the different parameters of the sed command. , is the character you want to remove, and the space after the second / is the character you want to see instead. Eventually, g (which stands for “global”) tells bash that this should happen for every appearance of the , character. (If you don’t add g, sed will change only the first , in each line.) The same explained in a picture:

[image: the sed ‘s/,/ /g’ command, explained piece by piece]
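If you want to see the difference that the g flag makes for yourself, here’s a quick throwaway sketch (echo is only there to generate a sample line):

echo 'a,b,c' |sed 's/,/ /'

This prints a b,c (only the first comma is replaced), while

echo 'a,b,c' |sed 's/,/ /g'

prints a b c (every comma is replaced).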

some more cool things with sed

Most people use sed only for substituting characters, but there is much more potential in it.

First of all, you can change whole strings. Eg. try this:
head flightdelays.csv |sed 's/SMF/DATA36_TUTORIALS/g'

See? All the “SMF” strings have been replaced with “DATA36_TUTORIALS”.

[screenshot: every “SMF” replaced with “DATA36_TUTORIALS”]

Okay, maybe this is not the most useful thing you would do with your file… But you get the point!

One important thing you have to know about sed is that it works with regular expressions (also known as regexp or regex). I’ll write a whole article about this topic, but for now, here’s the Wikipedia article about it. Please be aware that if you want to do special modifications with sed (e.g. change all the numbers to “x”), you can do it, but you have to be familiar with regexp. (Or you have to be really good at googling! ;-)) By the way, the above example (changing all the numbers in your file to “x”) would look something like this:
head flightdelays.csv |sed 's/[0-9]/x/g'
Where [0-9] is a regular expression that matches any digit.

[screenshot: every digit replaced with “x”]

It’s also good to know that sed is not just good for substitution, but also for deleting specific lines. E.g. in my recent Medium article (Learning languages very quickly — with the help of some very basic Data Science) I removed all the empty lines from my file with this command:

sed '/^$/d' –» where ^$ is a regular expression for empty lines and d refers to “delete”
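A quick way to try this out for yourself (printf just fabricates three lines, the middle one empty):

printf 'first\n\nsecond\n' |sed '/^$/d'

This prints first and second, with the empty line between them removed.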

You can similarly:

  • insert a line before another line with a matching pattern (by using i)
  • insert a line after another line with a matching pattern (by using a)
  • replace a matching line with another line (by using c)
  • etc… (a few quick sketches below)
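For reference, here are minimal sketches of these (they use the GNU sed one-liner syntax and a made-up file.txt and pattern – on other systems you may need the i\, a\ and c\ forms with the text on the next line):

sed '/pattern/i this line gets inserted before every matching line' file.txt
sed '/pattern/a this line gets appended after every matching line' file.txt
sed '/pattern/c this line replaces every matching line' file.txt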

sed is a cool tool! Use it!

Note: this is the best online tutorial about sed that I’ve ever read: http://www.grymoire.com/Unix/Sed.html

date – using dates in the command line

A much simpler tool: date. With this command line tool, you can print dates in practically any format. Here are 3 examples:

Bash Command:
date
Result:

Sun Apr 2 23:12:45 UTC 2017

Bash Command:
date +%D
Result:

04/02/17

Bash Command:
date +%Y-%m-%d
Result:

2017-04-02

Note: I wrote this article on the 2nd of April 2017, hence it printed that date and time.

As you can see, there’s a pretty wide range of formatting options here. The only thing you need to do is type date, then a + character (which means that you want to add an attribute to specify your format), then simply add the formatting options (e.g. %Y or %m) and any other characters (usually between apostrophes) that you want to see there. Example:

Bash Command:
date +%Y-%m-%d': '%A
Result:

2017-04-02: Sunday

[screenshot: the date command with the different formats]

I encourage you to open the manual of date with the man date command. There you will find all the possible formats.
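A few more formats I find handy (note that the -d option, used here for relative dates, is a GNU date feature and may not be available on every system):

date +%H:%M:%S
date +%s
date -d 'yesterday' +%Y-%m-%d

The first prints the current time, the second the number of seconds since the Unix epoch, and the third yesterday’s date.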

join – to join two text files

join is a concept that will be very useful in SQL as well. But for now, let’s stay with the command line. Here’s a little example. You can simply follow along here, but you can also try this on your own data server if you create these files:

file1.csv

animal weight
elephant 500
monkey 300
tiger 300
tiger 320
zebra 200
zebra 220

file2.csv

animal color
elephant grey
tiger yellow
zebra white
zebra black

Now apply join to these 2 files…
join file1.csv file2.csv

…you will get these results:

animal weight color
elephant 500 grey
tiger 300 yellow
tiger 320 yellow
zebra 200 white
zebra 200 black
zebra 220 white
zebra 220 black

So what happened here?
In one sentence: the two files have been merged based on the first column!

A bit more detail though:
First things first, join automatically recognized the delimiter (it was a space) and identified the columns. If your delimiter is not a space or a tab, you have to define this value with the -t option (same as it was with sort, for instance).
Secondly, it automatically picked the first column of both files as the field that we should join on. That’s just perfect in this case. But if you would like to use another column, you can specify that with an option as well (see the little sketch below).

And third: the join itself just happened. All the animals have colors and weights in one file. As you can see, the monkey didn’t show up after the merge; this is because there’s no monkey in the second file at all. And as you can also see, we have 4 zebra lines after the join. This is because the merge created all the possible combinations of the zebra’s weights and colors.
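For the record, here’s what a join with non-default settings could look like – a sketch with made-up comma-delimited files, joining on the 2nd column of the first file and the 1st column of the second:

join -t',' -1 2 -2 1 first.csv second.csv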

Note: the biggest difference between bash’s join and SQL’s JOIN is that in bash you merge text files and in SQL you merge tables. From our human perspective it looks pretty much the same, but from the computer’s perspective it’s quite different. Because of this – without going into further details…

…it’s very important to know that bash’s join only works properly on sorted files! If you don’t sort your files first, you will get an error message – or worse, incorrect results. Takeaway: always sort before you join.
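In practice the safe pattern looks like this (our two small example files happen to be sorted already, but it’s a good habit anyway):

sort file1.csv > file1_sorted.csv
sort file2.csv > file2_sorted.csv
join file1_sorted.csv file2_sorted.csv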

For now it’s enough if you know only these very basic things about join. A bit further down in this article I’ll show you something more exciting with it.

But first, get familiar with one last command line tool:

awk – for expert text-processing

awk is like another programming language on top of bash – and you don’t have to know this one in detail either. But it’s still good to keep in mind that you can call this language from bash and pipe it together with your other command line tools.
awk is really great for advanced line-based processing. E.g. if you want to print only those lines where the 3rd column is greater than 2,500, your tool will be awk.
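Just to show the syntax, a minimal sketch of that filter (with a made-up your_file.csv – the condition before the braces selects the lines, print outputs them):

awk -F',' '$3 > 2500 {print}' your_file.csv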

I’ll give you here only 2 practical examples of how awk will be useful for you.

awk – reorder columns

Let’s say we want to have 4 specific columns from our good old flightdelays.csv file:

  • ArrDelay
  • DepDelay
  • Origin
  • Dest

Easy. We’ll just use cut:
cut -d',' -f15,16,17,18 flightdelays.csv

[screenshot: the output of the cut command]

Cool. But the order of the columns would be better like this: Origin, ArrDelay, Dest, DepDelay. You might try to switch the columns in your cut command like this:
cut -d',' -f17,15,18,16 flightdelays.csv

But I have bad news. With cut you can’t change the order of the columns.

[screenshot: the cut output – the column order hasn’t changed]

See? Same order! Why is that? Well, that’s just a limitation of the cut command – and you can easily work around it with awk. Try this:
awk -F',' '{print $17,$15,$18,$16}' flightdelays.csv

[screenshot: the awk output with the reordered columns]

Very similar to cut! Except that here the option for the delimiter is -F (standing for “field-separator”) and not -d. And the way you define the fields themselves is a bit different too: '{print $17,$15,$18,$16}'. As you might have noticed, the output separator isn’t a comma anymore, but a space – this is awk’s default. But you can put anything in there – similarly to date or echo – if you put it between quotation marks. E.g.:

awk -F',' '{print $17"___"$15"---"$18","$16}' flightdelays.csv |head

[screenshot: the awk output with the custom separators]

Epicness!
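And if you’d rather keep the comma as the output delimiter without typing it between every field, awk’s output field separator (OFS) variable can do that for you – a small sketch:

awk -F',' -v OFS=',' '{print $17,$15,$18,$16}' flightdelays.csv |head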

awk – sum rows

This is a more straightforward one. Let’s say you want to sum up all the ArrDelays (like you would do in Excel with the SUM function). First, you cut the 15th column. Then you compute the sum with awk. Here’s the whole command:

cut -d',' -f15 flightdelays.csv |awk '{sss+=$1} END {print sss}'
The result is: 74151049

You might ask why we used the cut command first – why didn’t we do the whole thing with one awk command, since we have just seen that awk can cut columns too. The answer here is computing time. awk works a bit differently than cut: it basically goes through your file line by line and always adds the next value to a variable (called sss in this case) – that’s what sss+=$1 is there for. This is fine as long as there is only one column, but if awk has to find that one column in every line… well, that can take longer. So that’s why! We use the best tool for each given task – so we don’t have to wait forever for the results…

Note: If you are curious, I still recommend trying to get this number purely with awk… and seeing the difference for yourself…
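For reference, the pure-awk version would look something like this (summing the 15th field directly instead of piping through cut):

awk -F',' '{sss+=$15} END {print sss}' flightdelays.csv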

Test yourself

Nice. We got through 4 new tools. It’s time to wrap them up in one ultimate exercise.
It goes like this:

Take the flightdelays.csv file and calculate the sum of all the ArrDelays that happened on Mondays! For the sake of practice, let’s say you can only use these four columns:

  • 1st column: Year
  • 2nd column: Month
  • 3rd column: Day
  • 15th column: ArrDelay

Hint: You will have to use sed, date and awk for sure. And in my solution I’ll use join as well!
.
.
.
And the solution is here!
Put the whole thing into a script called arrdelay_calculator.sh:

#!/usr/bin/env bash

cut flightdelays.csv -d',' -f1,2,3,15 |sed 's/,/-/' |sed 's/,/-/' |sed 's/,/\t/' > tmp1.csv

a=0
while [ $a -le 364 ]
do
i=$a'day'
date -d'2007-1-1'+$i +%Y-%-m-%-d' '%A >> tmp2.csv
a=$((a+1))
done

sort tmp1.csv > tmp3.csv
grep 'Monday' tmp2.csv |sort > tmp4.csv

join tmp3.csv tmp4.csv |cut -d' ' -f2 |awk '{sss+=$1} END {print sss}'

rm tmp1.csv
rm tmp2.csv
rm tmp3.csv
rm tmp4.csv

If you want to understand the script, run it line by line!
But of course, I’ll give you a little bit of explanation here:
cut flightdelays.csv -d',' -f1,2,3,15 |sed 's/,/-/' |sed 's/,/-/' |sed 's/,/\t/' > tmp1.csv
First, I cut the columns I need and formatted the output to be a bit more readable: the first two commas became dashes (so the first three columns turn into a date) and the third comma became a tab. Then I redirected the whole thing into a temporary file.
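To see what each line turns into, here’s the same sed chain on a made-up sample row (year, month, day, ArrDelay); GNU sed interprets \t in the replacement as a tab, just like in the script:

echo '2007,1,14,60' |sed 's/,/-/' |sed 's/,/-/' |sed 's/,/\t/'

This prints 2007-1-14, a tab, then 60.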


a=0
while [ $a -le 364 ]
do
i=$a'day'
date -d'2007-1-1'+$i +%Y-%-m-%-d' '%A >> tmp2.csv
a=$((a+1))
done

I printed all the dates of 2007 (together with the day of the week) into another temporary file. I made sure that the date format is exactly the same as what I have in my other file (e.g. 2007-1-1)!

sort tmp1.csv > tmp3.csv
grep 'Monday' tmp2.csv |sort > tmp4.csv

I sorted my temporary files and put the results into further temporary files – and at the same time I removed every day but Mondays from my date file (remember: always sort before join!).

join tmp3.csv tmp4.csv |cut -d' ' -f2 |awk '{sss+=$1} END {print sss}'
Finally, I did the join, cut out the ArrDelay column and calculated the sum of the ArrDelays!

rm tmp1.csv
rm tmp2.csv
rm tmp3.csv
rm tmp4.csv

I also removed all my temporary files (this is going to be important when you want to run the script a second time, since tmp2.csv is appended to with >>).

The result is by the way: 11436525

Conclusion

I promise I’ll have some more exercises on bash/command line in the future – maybe in video format! But for now, I have good news: you made it! You went through all 6 bash articles and you have learned the essentials of the command line! Now I recommend looking for some pet projects and sharpening your skills by learning-by-doing!
And in the meantime stay with me, because very soon I’ll come up with my next initiatives: SQL and Python tutorials! They will be similar to this one: very easy to understand for everyone, and starting from the very basics! If you don’t want to miss them, subscribe to my Newsletter!

Cheers,
Tomi Mester


4 Comments

  1. Hi Tomi,

    This is a nice series of techniques on basic data processing on the Linux command line, and I’ll definitely point my students to it. However, I was surprised to see that you recommend the use of cut + awk rather than using awk alone when summing up a column of numbers. I’ve always used awk only under this circumstance, and never experienced the inefficiency that you mentioned. To check this out I thought I’d give it a try on some random files I have lying around. Here I’m adding up field #14 (in these files it’s the number of records in the line, so mostly in the range 1-48) in 27 CSV files with a combined total of more than 3 million lines.
    This is how long it takes with cut and awk:

    $ time cut -d, -f14 MDMR-*.DAT | awk '{sum += $1}END{print sum}'
    40376443

    real 1m3.617s
    user 1m4.436s
    sys 0m0.619s

    and here it is all done in awk:

    $ time awk -F, '{sum += $14}END{print sum}' MDMR-*.DAT
    40376443

    real 0m13.106s
    user 0m12.361s
    sys 0m0.743s

    So it takes almost 5 times as long when using cut.

    In case it matters, which I don’t think it does, I’m using GNU Awk 3.1.5 and cut from GNU coreutils 5.97.

    Filtering records purely in awk (as in /re/{…}) vs using grep and awk is a different story, however, as grep is super well optimised for this particular task.

    • hey Chris,

      thanks for the comment!
      Hm, now you made me think.
      Let me check this with some other files myself and if I find the same as you, I’ll definitely update my article!

      I guess the difference here is in the different computer capacities. Eg. in my tutorials I use a very-very cheap computer with 512Mbyte memory, 20GB SSD and only 1 CPU.
      But let me check!

      Cheers,
      Tomi

  2. Hi

    A small mistake (perhaps):
    Excerpt from your article above:
    …”define this value with the -t option (same as it was with cut for instance)”…

    I suppose you meant ‘sort’ instead of ‘cut’..

    Thanks
    Nishant

    P.S: Needless to say, your blog has been very helpful in getting me started with bash programming. Keep it up! 🙂

    • Hey Nishant!
      Thanks! Yeah, I meant like “same as it was the -d option for cut for instance”
      But you’re right, it’s much easier to understand with the example of sort!
      So I have updated the article!

      Thanks for the input! And glad you like the blog! 😉

      Cheers,
      Tomi
