This is the third episode of my Data Coding in Bash series here on data36.com! We have already set up a fully functioning data server, learnt the basic orientation commands, and learnt the basics of data cleaning and shaping in bash. In this article we will take the first steps with basic mathematical and statistical functions, which will get us closer and closer to doing actual data science in the command line. Plus, we will also start with bash scripting!
Note 1: To get the most out of this article, don't just read – actually do the coding part with me! So if you are on your phone, I suggest saving this article and continuing on a desktop!
Note 2: If you don’t have the data infrastructure yet, you can install it on your own in roughly 30-60 minutes: Data Coding 101 – Install Python, R, SQL and bash! Or alternatively you can use the Data36 Learn Server.
Note 3: If you haven't read them yet, I suggest starting with the previous bash articles:
Just for the record: I do like pre-built statistical, mathematical and predictive analytics packages and modules, regardless of whether we are talking about bash, Python or R. What I do not like is a data scientist who is not 100% aware of what's happening in the background when he or she uses these pre-built tools. I'm a big fan of learning by doing, so in my opinion the best way to make sure you understand the math behind these packages is to build them yourself – at least once in your life.
Today we will do something like this. But we will start with the easiest ones: MAX, MIN and MEDIAN.
Now back to the command line! I have two tiny demo data sets for you. You can download them to your server with wget. Make sure that you are logged in to your server, and create a new folder (I'll call it "sortdemo") next to "practice" ("practice" is the folder that we created in the previous bash articles), then download the demo data sets with:
wget https://www.data36.com/demo1.csv --no-check-certificate
wget https://www.data36.com/demo2.csv --no-check-certificate
cat both files and you will see that demo1.csv contains random numbers and demo2.csv random names. (Generated by a random generator, by the way – so if you find your name on the list, that's a pure accident.)
Sorting in bash
There is a pretty easy-to-remember command line tool in bash for sorting things: the sort command. First, try to sort your name list:
sort demo2.csv
What you will get back:
Barbara Diaz
Christopher Howard
Craig Carter
Diana Hughes
Doris Brown
Eric Richardson
George Coleman
Helen Powell
James Young
Jean Turner
Jeffrey Cooper
Jesse Thompson
Jimmy Phillips
Joshua Thomas
Kathy Sanchez
Michael Bell
Nancy Jones
Paula Gonzalez
Ralph Gonzales
Roger Robinson
Ronald Foster
Stephanie Harris
Terry Henderson
Thomas Wilson
Timothy Moore
Sorted in alphabetical order. Beautiful!
Try the same on the numbers:
sort demo1.csv
And the result is:
1
1
10
12
12
13
15
2
4
4
5
5
7
7
7
8
9
Hm. Not exactly how we learnt it in elementary school. The problem is that sort, by default, sorts things in alphabetical order. The method it follows: compare by the first character, then by the second character, then by the third, and so on…
While this works really well on the names, it causes some confusion with the numbers. Fortunately, you can let sort know that you want to sort numbers and not words by adding an option: -n ("numeric-sort"). Type this into your command line:
sort -n demo1.csv
Now we are talking!
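If you want a minimal illustration of the difference between the two sorting modes, you can pipe a couple of numbers through both variants (no demo file needed – this is just a quick sketch):

printf '2\n10\n' | sort –» alphabetical: "10" comes before "2", because '1' < '2' as characters
printf '2\n10\n' | sort -n –» numeric: 2 comes before 10, as expected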
Important options for sort
sort is pretty straightforward! Here are some more options that you will find handy. Try them one by one!
sort -V –» natural sort of (version) numbers within text (Note: JDS course participants, you will need this ;-))
sort -r demo2.csv –» -r means reverse. Instead of an ascending order, you will have a descending order.
sort -r -n demo1.csv –» It works with numbers as well if you add the -n option too (eg. it could be useful for creating a top-list).
sort -n -u demo1.csv –» Remove duplicates! It gives back every repeating value only once. (U is for unique.)
sort -t',' -k2 -n filename.csv –» You can't try this on demo1.csv, but feel free to go and try it on the flightdelays.csv file.
-k means "key", and -k2 tells sort that the key of the sorting should be the second column and not the first (which is the default). But as usual, bash doesn't know what the separator between your columns is, so you should specify it with -t','. It's pretty much the same as what -d',' did with cut. You might then ask why the syntax for the same option is different between sort and cut. I have three answers for this:
- I don’t know. (Which is – I guess – not a real answer.)
- -d is reserved in sort for another option ("dictionary-order").
- The creators of sort (Mike Haertel and Paul Eggert) are not the same people as the creators of cut (David M. Ihnat, David MacKenzie, and Jim Meyering), so they simply didn't follow the same naming convention.
This is not the only case where you will see inconsistency across command line tools, so it's better to just get used to it!
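To make the -t and -k combination more tangible, here is a small sketch – the file name and its contents below are made up purely for illustration:

printf 'anna,42\nbob,7\ncarol,19\n' > scores_demo.csv –» creates a tiny two-column CSV (name,score)
sort -t',' -k2 -n scores_demo.csv –» sorts by the 2nd column, numerically, using ',' as the separator

The output should be bob,7 first, then carol,19, then anna,42 – sorted by the score column instead of the name.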
Test yourself #1
Calculate MAX, MIN and MEDIAN!
Now that you have learnt sort, go back to your flightdelays.csv data set.
I have 3 exercises for you. All three are doable using the command line tools you've learnt so far.
- What was the MAXIMUM delay in Arrivals?
- What was the MINIMUM delay in Arrivals? (Early arrivals count too, so you will have here a negative number most probably.)
- What was the MEDIAN of all the delays in Arrivals?
Spend some time trying to answer these questions. Once you are done, continue reading and you will find my solutions.
Here they are:
Solution for MAX and MIN
(The result is given in minutes, which works out to around 43 hours. I would not like to be on that plane.)
To get this result, type this into the command line:
cut -d',' -f15 flightdelays.csv |sort -n |tail -1
With the cut command, we keep only the 15th column, which is "ArrDelay". Then we pipe this data into sort -n. Finally, tail -1 prints only the last line, which is the MAXIMUM value we were looking for.
(A negative number, which means that plane arrived roughly 6 hours early.)
The approach is pretty much the same:
cut -d',' -f15 flightdelays.csv |sort -n |head -1
The only thing I've changed is that instead of tail -1, I put head -1 at the end, which gives back the first, not the last, number of the sorted column.
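A side note: the 15th column also contains the "ArrDelay" header and some "NA" values (you will meet them in the median section below). GNU sort -n typically treats such non-numeric lines as zeros, so the MAX and MIN still come out right, but if you prefer to be explicit, a more defensive variant filters them out first – this is just a sketch of that idea:

cut -d',' -f15 flightdelays.csv |grep -v 'ArrDelay' |grep -v 'NA' |sort -n |tail -1 –» MAX
cut -d',' -f15 flightdelays.csv |grep -v 'ArrDelay' |grep -v 'NA' |sort -n |head -1 –» MIN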
Solution for MEDIAN
Back to middle school: what is the median? Simply put, the median is the "middle" value of a data set. E.g. take the data set [1,2,3,5,6,13,100]. Sort it, then take the middle one: 5. That's the median. If your list has an even number of elements (e.g. [1,2,5,7]), then the median is the mean of the middle 2 numbers (in this case (2+5)/2 = 3.5).
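As a quick warm-up, here is a minimal sketch of the head/tail trick we are about to use on the big file, applied to demo1.csv – assuming it holds the 17 numbers shown earlier, the middle (9th) sorted value is the median:

sort -n demo1.csv |head -9 |tail -1 –» prints 7, the 9th of the 17 sorted values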
STEP 1) discover the data:
cut -d',' -f15 flightdelays.csv –» Never miss this step! Keep watching your terminal screen and you will realize that you have some missing data ("NA") and an unnecessary header ("ArrDelay") in your data set. Both of these should be removed.
(Note: in real-life data science we have better tools than simply watching the screen. We are not there yet in these tutorials, but soon…)
STEP 2) Filter the garbage with grep, and count the remaining lines with wc -l:
cut -d',' -f15 flightdelays.csv |grep -v 'NA' |grep -v 'ArrDelay' |wc -l
The result is 7275288. The median is the middle value of this list (make sure you sort it first), so you should divide 7275288 by 2, which gives 3637644. And since 7275288 happens to be an even number, you need to pick the 3637644th and 3637645th values.
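If you don't feel like doing the division in your head, bash arithmetic expansion can do it for you:

echo $((7275288 / 2)) –» prints 3637644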
STEP 3) Pick the numbers:
cut -d',' -f15 flightdelays.csv |grep -v 'NA' |grep -v 'ArrDelay' |sort -n |head -3637645 |tail -2
Our two winning numbers come back, and the median is their mean: 0.
Note: notice how iterative bash is! We got to the result in 3 steps, with small modifications to the command at each step.
Text-editor for the command line
So far we have interacted with bash only by typing commands directly into the command line. But you can use text editor tools too. Well, they don't look exactly like what you are used to on Windows/Mac…
It looks really retro (hey, after all, it's from 1994), but it will be super useful when you start with bash scripting, or when you simply want to create a new text file (or when you want to look hip in Starbucks).
Anyway. When you set up your data server, you also set up mcedit. If you didn't, please do it now:
sudo apt-get install mc
(Type your password. And if it asks if you want to continue, just say yes.)
Once it's installed, open the mcedit command line tool:
mcedit
Boom! Ready for text editing! (You should see the blue editor screen shown above.) What I particularly like about mcedit is that you can use your mouse here (finally, back to "graphical user interfaces")! So when you want to exit, you can click the Quit button in the bottom right corner, or when you want to search for something, you can click the "7) Search" button in the bottom menu.
Anyway – type some text in, click the Quit button and save your file.
You will have a new file in your folder that you can print like any other file – with the cat command.
If you want to modify this file, just go back to mcedit:
Print unique values and count them!
Let’s create a new text file with some numbers in it!
Let's add these numbers to your file (one number per line):
1
1
2
2
2
3
3
3
2
2
2
1
1
I named my new file demo3.csv (just to keep it consistent with the naming of my demo files).
And here's a new command line tool: uniq. Try it:
uniq demo3.csv
uniq unifies repeated adjacent lines into one. The result that you get back is:
1
2
3
2
1
As you can see, it only unifies adjacent lines. Sometimes that's useful on its own (e.g. when you want to remove duplicates), but in data science we usually use it together with sort, so you get a list of each unique value in your data set.
sort demo3.csv |uniq
It gives back:
1
2
3
Note that this result is exactly the same as what you would get by typing sort -u. Type this:
sort -u demo3.csv
1
2
3
So why am I showing this to you? Because uniq has some pretty cool options. Here's the one I use most.
sort demo3.csv |uniq -c –» -c stands for “count”
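Assuming demo3.csv contains the thirteen numbers from above, the output should look something like this (the leading spaces come from uniq's count column):

      4 1
      6 2
      3 3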
It counts the number of occurrences of every unique value in your file. We had four 1s, six 2s and three 3s in our demo3.csv file.
Test yourself #2
How is this useful? Let's do this exercise and you will find out!
Back to your flightdelays.csv file!
The question is: how many different airports do we have in our flightdelays.csv file?
(well, the real answer is 310)
cut flightdelays.csv -d',' -f18 |sort |uniq |wc -l
You cut the Dest(ination) column (it's the 18th), sort it, then uniq the airport names, then count the number of lines with wc -l. The result is 311, but mind that there is a header line ("Dest"), which you should not count. That makes the real result 310.
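If you'd rather not subtract the header by hand, one variation is to drop the first line of the file before counting – assuming the header really is the first line of flightdelays.csv:

tail -n +2 flightdelays.csv |cut -d',' -f18 |sort |uniq |wc -l –» should print 310 directly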
Test yourself #3
List the top 3 destination airports (by the number of arriving flights)!
297481 DFW
375716 ORD
413805 ATL
cut flightdelays.csv -d',' -f18 |sort |uniq -c |sort -n |tail -3
You cut the 18th column (Dest) again, sort the results and use uniq -c to count the number of occurrences of each value. You turn this into a "top list" with sort -n, and you keep the top 3 with tail -3.
You have just found out, with bash and the command line, which were the most frequent destination airports in 2007! Now imagine that these are not airports but users, and that we are talking not about airport landings but about feature usage… This is something actual online data analysts do from time to time in their jobs!
Create a bash script
In practical data science it's a pretty common case that you are working on live or semi-live data. E.g. many startups run their analyses every midnight, so when the decision makers arrive at the office the next morning, they can see refreshed numbers. Of course, that doesn't mean an analyst sits there pushing buttons at midnight. Rather, smart analysts and data scientists create scripts that run automatically every midnight.
To set up these automated data processes, you need a few things:
- You have to have a remote data server. (done)
- You have to learn data coding in at least one language. (work in progress)
- You have to turn your commands into scripts. (coming up now)
- You have to automate them. (I'll get back to that later; a tiny preview is sketched right below.)
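Just as a preview of the automation part, a scheduled run is usually nothing more than a single crontab line (edited via crontab -e); the paths below are hypothetical placeholders:

0 0 * * * /home/your_user/demoscript.sh >> /home/your_user/demoscript.log 2>&1 –» runs the script every midnight and appends its output to a log file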
I remember when I wrote my first bash script (which was actually my first data script as well), I was surprised by how easy it was. A data script is nothing but the commands – the ones you would type into the command line – one after another, in a text file. That's it.
Let’s try it out!
For a start, let's go with a very simple script. Open a new file in mcedit (I'll call it demoscript.sh) and type this into your editor:
#!/usr/bin/env bash
echo "The top 3 airports:"
cut flightdelays.csv -d',' -f18 |sort |uniq -c |sort -n |tail -3
echo "The number of unique airports:"
cut flightdelays.csv -d',' -f18 |sort |uniq |wc -l
What your script will do:
#!/usr/bin/env bash –» this line is the so-called "shebang" – its only role is to tell Ubuntu that your script is written in bash. (You can write similar scripts in Python, R, SQL, etc… I'll show you how!)
echo "The top 3 airports:" –» This will print to your screen this string “The top 3 airports”
cut flightdelays.csv -d',' -f18 |sort |uniq -c |sort -n |tail -3 –» You know this line.
echo "The number of unique airports:" –» Same printing function.
cut flightdelays.csv -d',' -f18 |sort |uniq |wc -l –» You know this line too.
If you are done: click “10 Quit” and save your file.
Run a bash script
By convention, your script's extension is ".sh", which stands for shell script. (If you don't add the ".sh", it won't cause any practical difference.)
Let's try to run your script in your folder with this syntax:
./demoscript.sh
The answer will be an error message (“Permission denied”).
The issue is that running a script from the command line requires execute permission. Fortunately, you can grant this permission to yourself by typing this command:
chmod 700 demoscript.sh
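If you are curious what chmod 700 actually did, ls -l shows the permission bits – after the command you should see read, write and execute rights for the owner only (the size and date below are just placeholders):

ls -l demoscript.sh
-rwx------ 1 your_user your_user 231 Jan  1 00:00 demoscript.sh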
Try running it again with ./demoscript.sh – and the magic happens:
Your script executes the commands you typed in it line by line and gives back the results on your Terminal screen!
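As a small extra step, here is a sketch of the same script that takes the CSV file name as its first argument instead of hard-wiring flightdelays.csv – the name demoscript2.sh is made up, so feel free to call it anything:

#!/usr/bin/env bash
FILE="$1"
echo "The top 3 airports:"
cut "$FILE" -d',' -f18 |sort |uniq -c |sort -n |tail -3
echo "The number of unique airports:"
cut "$FILE" -d',' -f18 |sort |uniq |wc -l

You would run it the same way as before: chmod 700 demoscript2.sh, then ./demoscript2.sh flightdelays.csv.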
Well, this was just a very brief intro into bash scripting, but I guess you have already learnt a lot today! 😉
(Note: Here are two examples of slightly more complicated bash scripts:
- I wrote this script when I wanted to find the best rental deal in Budapest back in 2014. It automatically went through the website of the top Hungarian real estate portal every 5 minutes and sent me an email when it found something new that fit my criteria. I managed to find an apartment for 40% less than the market price!
- This second script is much more complex! It actually calls other bash and SQL scripts as well. This was my first pet project back in 2012. It collects and analyzes the news from ABC, BBC and CNN. It pulls and connects articles from the different news portals if they are about the same topic. Finally, it compares them by credibility based on the wording of the articles.)
We went through many important things in this article. sort and uniq are very useful commands for doing actual analysis on your data. Bash scripting, supported by mcedit, will be a great ally during your data science career as well!
In the next blog post I'll continue with the top command line tools and tricks that make data coding life easier! (Update: I have released the next episode too!)
Stay tuned! If you don’t want to miss any of my data content (videos, articles, downloadable stuff), subscribe to my Newsletter!