Note: to get the most out of this article, you should not just read, but actually do the coding part with me!
If you don’t have the data infrastructure yet, you can install it on your own in roughly 30-60 minutes: just read and follow my previous article, Data Coding 101 – Install Python, R, SQL and bash!
Alternatively, you can use the Data36 Learn Server.
Anyway, let’s get into it and take the first steps in learning the most basic data coding language: bash.
What is bash and why should you learn it?
Bash (often referred to as “the command line”) is not a language built specifically for data science. It’s a command language that lets you interact with your operating system (in this case Ubuntu). You type commands and bash interprets them. Without going into further technical details, the one thing you should understand is this: you won’t be learning bash just for data science. You will use it for so much more!
- To move, copy or rename files between folders and computers.
- To integrate and/or automate your SQL, Python and R scripts.
- To set up new data tools on your data server.
Learning (at least the most essential) bash commands will open up many new opportunities for you in your data science projects. And, well, even when it comes to job seeking…
But besides all of this, it’s obviously a very powerful tool for data analysis as well. In this article I’ll walk you through the first steps of data analysis in the command line!
Open your Terminal! Meet the prompt.
If you have read the above-mentioned article, or the email you got when you registered for the Data36 Learn Server, you should be familiar with Terminal. So open it up, log in to your data server and you will see something like this.
Without typing any command let’s take a look at the command line itself:
In bash you will have this “$” sign on every line.
The text before the dollar sign is called the “prompt”. It usually tells you your username, the folder you are currently in on your remote computer, and other things you can set (but we don’t care about those in this article). The prompt changes when you go to Python (>>>) or SQL (postgres=#), so you will always know whether you are in bash or in another language.
The point is that all the bash commands I’ll show you in this article should be typed after the $.
Get some data!
To practice a bit, we will download an open data set. “Open” means it’s free to use. There is a beautiful data set (originally published here) with 7,000,000+ lines of data that you can simply download by typing this into the command line:
Remember, you are logged in to your data server! So when you are in the Terminal, you are on your remote server. This means the file will be downloaded there and not to your local computer. So don’t worry, you can’t mess anything up there! 🙂
The file is compressed, so you need to unpack it first. On Ubuntu 14.04, dtrx is a cool unpacking tool. Let’s install it on your remote server with this command:
sudo apt-get install dtrx
(As usual, if it asks for your sudo password, type it. If it asks whether you want to continue the setup, say yes.)
Then unzip the file:
dtrx 2007.csv.bz2
It will take around 60 seconds to process the whole file, so don’t worry: your Terminal is not frozen, it just needs some time. Once it’s done, you will have a very cool data set with the arrival and departure details of all commercial flights in the US in 2007. Yes, this data is open to the public. 🙂
Where am I now? Orientation in bash.
By default you start in your user’s home folder (in my case it’s /home/tomi). I’ll get back to this soon, but first, let’s see what we have in our current folder. Type this bash command:
ls
ls stands for “list” and it literally lists all the files in your folder. Not surprisingly, you have 2 files: the one you have just downloaded (2007.csv.bz2) and the one you have just unpacked (2007.csv).
If you came here from the Junior Data Scientist’s First Month video course, also check out the ls -v command. The -v parameter adds a small – but important – modification to the original command: it will print everything in natural order! See more about it in the course!
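To see what “natural order” means in practice, here is a minimal sketch. It assumes the GNU version of ls (the one that ships with Ubuntu), and the folder and file names are made up purely for illustration.

```shell
# Throwaway folder with three made-up file names (hypothetical, for illustration only).
demo=$(mktemp -d)
cd "$demo"
touch file1.txt file2.txt file10.txt

# Plain listing: names are sorted as text, and the exact order depends on your locale.
ls

# Natural order: the numbers inside the names are compared by value.
natural=$(ls -v)
echo "$natural"
# file1.txt
# file2.txt
# file10.txt
```

With -v, file10.txt comes after file2.txt, which is what you would usually expect when your files are numbered.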
I know you are very excited to see what data we have in our 2007.csv file, but let me show you some more important “orientation”-related bash commands before that.
pwd
(pwd stands for “print working directory”.) This will show you exactly where you are on your remote computer. I’m in the /home/tomi folder, as I’ve already mentioned above.
mkdir hello
(mkdir stands for “make directory”.) This creates a new folder called “hello” inside your current folder. Obviously, you can type anything instead of “hello”.
mkdir [folder name] is the general folder-creating command in bash. Let’s use ls to double-check your new folder. Oh, wow, it’s there! Yay!
It’s a folder, so you can go into it. The bash command to do that is:
cd [folder name]
(cd = “change directory”)
So in this case:
cd hello
If you use ls again, you will see that this folder is empty. And if you use pwd, you will see that you are indeed in the “hello” folder.
If you want to go one folder up, use:
cd ..
You can try pwd again to make sure that you know where you are.
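If you’d like to rehearse these orientation commands without touching your real folders, here is a small sketch. The throwaway folder created with mktemp and the variable names are my own additions for this example, not part of the setup above.

```shell
# Hypothetical sandbox: a throwaway folder, so your real files stay untouched.
sandbox=$(mktemp -d)
cd "$sandbox"

mkdir hello          # create the "hello" folder
ls                   # lists: hello

cd hello             # step into the new folder
inside=$(pwd)        # remember where we are now
echo "$inside"       # the path ends with /hello

cd ..                # go one folder up
outside=$(pwd)
echo "$outside"      # back in the sandbox folder
```

The two echo lines print the working directory before and after cd .., so you can see the last part of the path disappear when you step up.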
Now try out how copying files works in the command line:
cp [folder/file_to_copy] [folder/new_file_name]
As you can see, you need to give two parameters here. The first one is the name and the location of the original file, the second is the preferred new name and the location of the new file. Let’s do it:
cp /home/tomi/2007.csv /home/tomi/hello/delay2007.csv
This bash command copied the 2007.csv file from our original folder into our “hello” folder. Plus, the new file’s name is not 2007.csv anymore, but delay2007.csv.
Note: when you use cp, you don’t necessarily need to type the whole (absolute) path of the file. You can use so-called “relative paths” too. So as I’m in the /home/tomi/ folder, I can just type:
cp 2007.csv hello/anothercopy.csv
Same result! (Except that the new filename in this case will be “anothercopy.csv”)
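Here is a small, self-contained sketch of the absolute-versus-relative idea. It uses a throwaway folder and a tiny made-up sample.csv instead of the real 2007.csv, so all the names and the data in it are assumptions for illustration only.

```shell
# Hypothetical sandbox with a tiny made-up CSV file.
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'flight,delay\nAB123,15\n' > sample.csv
mkdir hello

# Absolute paths: spell out the full location of both files.
cp "$sandbox/sample.csv" "$sandbox/hello/delay_sample.csv"

# Relative paths: locations are read starting from the folder you are in.
cp sample.csv hello/anothercopy.csv

ls hello             # shows both copies
```

Both cp lines copy the same file; only the way the location is written differs.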
Test yourself #1
Here’s a quick assignment, if you want to test yourself! Try to solve it on your own! You won’t need anything but the bash commands I’ve described above. Once you are done, you can read how I’d do it.
Create a new folder with the name “practice”!
Copy your 2007.csv file into this new “practice” folder and name it “flightdelays.csv”.
Type this into the command line:
cd – (optional) this will automatically move you to your main folder (/home/tomi)
mkdir practice – this will create the folder “practice”
cp 2007.csv practice/flightdelays.csv – this will copy the file into the folder with the new name
cd practice – with that you can go into your new folder
ls – this will list the files in the practice folder, you will have only one: flightdelays.csv
Let’s take a look at the data file! Printing stuff in bash.
Okay! No more boring stuff, let’s dig into the raw data! I’ll continue with the new file we’ve copied into the practice folder. Our next bash command will print our whole .csv file.
cat flightdelays.csv
Here’s a video of what happens:
Okay, it might be nerdy to say this… but how cool does this look? As if we were in the Matrix. Well, Neo, this is our data file line by line, and as I said, it’s over 7 million lines, so it might take some time to print the whole file on your screen. In the video above I interrupted this process. You can do the same by pressing CTRL + C on your keyboard.
The cat command wasn’t really meaningful this time, but at least you now have a sense of what kind of data you have in this file. In data science it’s usually an important first step to spend some time exploring your raw data. However, if you want to do this more easily, you can print the first 10 lines or the last 10 lines of your file by typing these into the command line:
head flightdelays.csv
tail flightdelays.csv
When you typed head, you got back 10 lines, and the first one was the header of your data table. Depending on the file you use, sometimes you have a header line, sometimes you don’t.
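If 10 lines is too many or too few, both commands accept the standard -n option to set the number of lines. I haven’t covered this option above, so treat the following as an optional sketch; it runs on a tiny made-up file rather than on the real flight data.

```shell
# Hypothetical sandbox with a made-up CSV: one header line plus 20 data rows.
sandbox=$(mktemp -d)
cd "$sandbox"
echo "id,delay" > sample.csv
for i in $(seq 1 20); do
  echo "$i,$((i * 2))" >> sample.csv
done

first=$(head -n 3 sample.csv)   # the header and the first 2 data rows
last=$(tail -n 2 sample.csv)    # the last 2 data rows

echo "$first"
# id,delay
# 1,2
# 2,4
echo "$last"
# 19,38
# 20,40
```

Note that only head shows you the header line, because the header lives at the top of the file.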
The last bash command for today’s article: word count!
wc flightdelays.csv
Three numbers show up on your screen:
7,453,216: The number of lines in your file.
7,453,216: The number of words in your file. (Some explanation: wc treats anything separated by spaces, tabs or line breaks as a “word”. As you have seen before, in this file the values are separated by commas. So wc sees one line as one word. We will fix this issue later.)
702,878,193: The number of bytes in your file. In a plain ASCII text file every character takes one byte, so you can read this as the number of all the characters (spaces and line breaks included).
So it’s not just word count, but line count and character count too.
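To make the “one line is one word” effect visible, here is a minimal sketch on two tiny made-up files. The -l and -w options are standard wc flags that I haven’t introduced above; they print only the line count or only the word count.

```shell
# Hypothetical sandbox: the same values, once separated by commas, once by spaces.
sandbox=$(mktemp -d)
cd "$sandbox"
printf 'a,b,c\n1,2,3\n' > commas.csv
printf 'a b c\n1 2 3\n' > spaces.txt

wc commas.csv   # 2 lines, 2 words, 12 bytes
wc spaces.txt   # 2 lines, 6 words, 12 bytes

# Single counts; tr strips the padding spaces that some wc versions print.
lines=$(wc -l < commas.csv | tr -d ' ')
words_commas=$(wc -w < commas.csv | tr -d ' ')
words_spaces=$(wc -w < spaces.txt | tr -d ' ')
echo "$lines $words_commas $words_spaces"   # 2 2 6
```

Same content, same number of lines and bytes, but the word count triples as soon as the separator is a space instead of a comma.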
Hey! You’ve just started to work with a data file of over 7 million lines! This is awesome! (But before you start telling your friends that you are practicing with “big data”, I have to point out that this is not big data yet.)
You’ve also taken the first important steps to learn data coding and become a data scientist! I’ll continue from here (UPDATE: for episode 2, click here) and I’ll also provide free video tutorials on my Youtube channel (in the near future)! If you don’t want to miss any of these, subscribe to my Newsletter!