This is the ultimate step by step guide to installing Python, SQL, R and Bash and starting to learn coding for data science. It takes no more than 60 minutes and it’s 100% understandable for non-developers, too!
If you want to be a data scientist, a data analyst or just simply want to analyze the data of your business, sooner or later you will have to learn coding.
There are 4 important languages that you should know:
- Python
- R
- SQL
- Bash (sometimes referred to as “the command line”)
Here, on the Data36 blog, I provide many articles about coding for data science. However, as a foundation of every further conversation on this topic, you need to have Python, R, SQL and bash on your computer. Once you have them set up, you will be able to practice by yourself (and later build your own data projects too), as well as follow, practice and learn via my tutorial articles and videos.
In this article I’ll take you through getting these data tools step by step. At the end of the article you will have your own fully functioning data infrastructure with:
- bash (and mcedit),
- postgreSQL (and pgAdmin4),
- Python 3 (and Jupyter) and
- R (and RStudio)
- I will show you the exact same tools that are used in real life data science projects.
- All these tools are completely free!
Yes, the funny thing is that most of the super-scientific stuff that you can read about in different data science articles is made using open-source data tools. How cool is that?
Anyway: first things first – let’s go and install Python, SQL, R and bash on your computer!
Note 1: The article was last updated on 20 September 2018.
Note 2: This article is more enjoyable if you do the steps and the coding part with me. If you are reading this on your mobile, I suggest saving the article for later (e.g. send it to yourself in e-mail) and coming back here when you have ~60 minutes and are on your Desktop computer or notebook.
Note 3: To have your data infrastructure correctly set up, you simply need to follow my instructions below step by step. Most of the time you can just copy-paste my code, so don’t be afraid of working in Terminal and writing code. It’s easy, even if you are not a programmer (yet). The article is long, though, because I tried to explain each step as much as I could.
The Operating System
I use all my data tools (Python, R, SQL) on Ubuntu – which is a Linux operating system – and I suggest you do the same. My personal computer is a Mac, but you can have a PC too. In this case it doesn’t really matter, because we won’t install Ubuntu on our computer, but access it via the internet.
What we will do here is to connect to a remote server – type commands and make the remote server do the data analyses instead of our local computer.
(Note that you can set up Ubuntu on your personal/work computer too, if you really want to. But this is something we usually don’t do in real life, because with that solution we would limit our data processes to our computer’s capacity. Also we would lose some cool features.)
If you use a remote server for data analysis, you will be able to:
- Access your data infrastructure from any computer with a login-name and a password (even if you lose or break your personal notebook). Don’t worry, nobody else will be able to access your data on your remote server – it is completely private.
- Automate your data scripts (e.g. make them run every 3 hours, even if you turn off your notebook).
- Scale your computing power. You won’t be limited to your own computer’s capacity. Renting a few more processors or more memory is just one click away if you are using a remote server.
- Use Ubuntu without installing it on your computer.
The only downside of going with a remote server is that it costs money. Fortunately these prices are very low (starting at $5 per month).
I’ve been creating all my coding for data science tutorials (videos and articles) using the exact same data stack that I’ll describe in this article. So if you want to follow me, it will be much easier for you if you go through this article step by step and set up everything like I do. To make everything work properly, please make sure that you don’t miss any of the steps and you do them in the same order as it’s written here! The most important parts of course are the code snippets. All the snippets are marked:
Step 1: Get your remote server!
The next step is to find a hosting service to create your first remote server. I have used many services and so far I’ve found DigitalOcean to be the best. You can rent a server here for $5/month (this will be perfect for us for now).
First, go to their website and create an account: DigitalOcean.com
Disclaimer: the link above is a special invitation link – if you use that, you’ll get $50 free credit for 30 days (and I’ll get $25 free credit). If you don’t want to use my link, you can simply click here instead. Note that in this case you won’t get the $50 credit.
You will land here:
Register with your email address and you will get a confirmation email in your inbox. Confirm and you will see a screen where you can add your credit/debit card details or use PayPal. (For security reasons I always use PayPal.)
If you are done with that, you are just one step away from creating your first remote server. You will see the “Droplets” screen. Click “Create Droplet” (big green button, top right corner).
And you will end up here:
Make sure you are using these settings:
- “Choose an image:” Distributions: Ubuntu 18.04 x64
- “Choose a size:” Standard: $5/month
This will be more than enough for now. If needed, you will be able to scale it up in the future. As you can see, you’ll pay on an hourly basis. This means that if you use the server for only 4 hours and then delete it, you will pay roughly $0.03. This is a very good deal.
- “Add block storage:” You don’t need this.
- “Choose a datacenter region:” Choose the one that is the closest to you. E.g. if you are in San Diego, choose San Francisco and if you are in India, choose Bangalore. I’ll choose Frankfurt as I’m in Stockholm at the moment.
- “Select additional options:” You don’t need this.
- “Add your SSH keys:” You don’t need this.
- “Finalize and create:”
“How many Droplets:” 1
“Choose a hostname:” You can use anything. I chose “data36-learn-data-science.”
- Click “Create.”
Your server will be ready in ~60 secs.
CONGRATS! You have your first remote server, where you can install Python, R, SQL and bash and then practice coding for data science.
(Note: you can destroy this server at any time by clicking “Destroy.”)
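To get a feel for the hourly billing mentioned above, here is a minimal Python sketch. It assumes that the hourly rate is simply the monthly price divided by the hours in a 30-day month. That divisor and the function name are my own assumptions for illustration, and your provider's actual hourly rate may differ slightly.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def estimated_cost(monthly_price_usd, hours_used):
    """Estimate the cost of running a server for a number of hours."""
    hourly_rate = monthly_price_usd / HOURS_PER_MONTH
    return hourly_rate * hours_used


if __name__ == "__main__":
    cost = estimated_cost(5.0, 4)
    print(f"4 hours on a $5/month server: about ${cost:.3f}")
```

Whatever the exact rate is, a few hours of practice costs only a few cents.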
Step 2: Access your remote server!
It’s time to login to your freshly created remote data server. When you’ve created your server, you will receive an email from Digital Ocean. It will look something like this:
(Note: I removed my password. Yours won’t be **********, but numbers and characters.)
Make sure you save this email, because you will use this information in the future (especially the IP Address).
Depending on which operating system you use on your computer, you can access your server in different ways.
For Mac/Linux Users:
Open “Terminal” (on Mac I suggest downloading iTerm2 and using that instead of Terminal), then type:
ssh [Username]@[IP Address]
[Username] is the username from the email, in this case: root
[IP Address] is the IP Address from the email.
In my case I will type:
Hit enter and you are in…
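If it helps to see how the pieces of the ssh command fit together, here is a small Python sketch that assembles it from a username and an IP address. The function name and the example address 203.0.113.10 (an address reserved for documentation) are placeholders of mine, not the values from your email.

```python
def ssh_command(username, ip_address):
    """Return the ssh command line for logging in to a remote server."""
    return f"ssh {username}@{ip_address}"


if __name__ == "__main__":
    # Placeholder values: replace them with the ones from your email.
    print(ssh_command("root", "203.0.113.10"))
```

The same pattern applies later, when you log in with your own user instead of root.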
(The next paragraph is important for Windows users only. You can skip it and scroll down to “Both Windows/Mac”!)
For Windows Users:
First download and install a program called PuTTY from here.
If you open PuTTY, you need to add the details (from the email) in this window:
Host Name (or IP address): the IP Address from the email (eg. 126.96.36.199 in my case)
Connection type: SSH
Click “Open” and you are in. It will ask for your username (“login as:”). You can find this in the email as well. Type:
root
Both Windows/Mac (oh and Linux of course):
Nice, you SSH-d (logged in) into your remote Ubuntu server. From this point on, whenever you are in the terminal window and until you disconnect from your remote server, you are using Ubuntu 18.04. It also means that any changes you make here won’t affect your personal computer!
Let’s finalize things before we start setting up your data infrastructure!
If everything’s correct, the server asks some questions like:
Are you sure you want to continue connecting (yes/no)?
Type yes and hit enter.
Then it will ask for your password. Copy-paste it from the email and hit enter. (If this is your first time on the command line, you might find it weird that the stars don’t appear on the screen when you type your password, but this is how it is on Ubuntu. Even if you don’t see any characters typed in – don’t worry – it’s typed in.)
Then it will return some messages to you and ask you to change your password. First, type (copy-paste) the old password again, then type the new one (whatever you want).
And done! You can start to install the data tools!
(Note: if you have so far used only graphical user interfaces (e.g. SPSS, Excel, Rapidminer, etc.), I know, this command line thingy could be intimidating. But believe me, once you’ve played around for 30 mins on this interface, you will find it fun! ;-))
Step 3: Install Bash!
Or wait… Great news! Bash is already set up, since it’s the built-in language of Ubuntu 18.04. (Again: sometimes it’s referred to as “the command line.”)
I’ll get back to bash/command line in the next data coding tutorial articles and videos, but for now it’s enough if you know that you really need to care about learning this language because:
- You will use bash for every basic server operation – like moving files between folders, creating/deleting files, installing new programs (eg. Python, R or SQL, too), giving permissions to users, etc.
- It’s great for creating automations.
- It can be used as the “glue” between other data languages. (eg. moving something from SQL to Python, then from Python to R.)
For now, execute your first command: the “Hello, World!”
echo 'Hello, World!'
You will have “Hello, World!” printed on your screen.
Don’t ask why we did that. This is a nice habit of programmers, so we did it too, but let’s move forward quickly and execute our second, more important command. We are going to create a new user:
adduser [newusername]
You can type anything for the [newusername] part. I’ll type “dataguy.” Like this:
adduser dataguy
If you hit enter, you will have some text on your terminal screen, then you need to add a new password for this user, some more text, then the name (your name preferably) and you can leave the rest empty.
We have just created a new user!
This was needed for further steps: so far your username was “root” – and by default “root” user is not allowed to do a few important installation steps that we want to do.
Let’s execute one more command to give the right privileges to your new user:
usermod -aG sudo dataguy
(Obviously: don’t forget to replace “dataguy” with your new username.)
From now on, we won’t use the root user; we will use the new user you’ve created. So let’s log out from the root user:
exit
This command will close the connection between your computer and your remote server. Log back in with your new username! Do everything as described in “Step 2” above, but change root to your new username (in my case “dataguy”) and to your new password. As I’m on Mac I’ll type this – for instance:
Now you are logged in as a normal user. And you can continue by setting up Python3.
Step 4: Install Python 3 and Jupyter!
Note: previously this article was written for Python 2, but I have decided to upgrade it to Python 3. Python 2 won’t be supported after 2020, and Python 3 has been around since 2008. So if you are new to Python, it makes much more sense to learn the new Python 3 than the old Python 2.
More great news! Python is already installed on Ubuntu 18.04 too! You can try the Python way of “Hello, World!” very easily. Type into the command line:
python3
This will start Python. (While you are in Python, you can’t use Bash commands.) Now type:
print('Hello, World!')
Notice that you get the same effect as with the “echo” command in bash. “print” and “echo” do pretty much the same thing, but “print” works in Python and “echo” works in Bash. To leave Python, type:
exit()
This will stop Python and you will be back in Bash.
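To make the print/echo comparison concrete, here is a minimal sketch that you could save into a file and run with python3. It prints the text with Python’s print, then asks the echo command to print the same text and compares the two. The helper name echo_output is just something I made up for this illustration.

```python
import subprocess


def echo_output(text):
    """Run the echo command and return what it printed, without the newline."""
    result = subprocess.run(
        ["echo", text],
        stdout=subprocess.PIPE,      # capture the output instead of showing it
        universal_newlines=True,     # return text instead of raw bytes
        check=True,                  # raise an error if echo fails
    )
    return result.stdout.rstrip("\n")


if __name__ == "__main__":
    message = "Hello, World!"
    print(message)                          # the Python way
    print(echo_output(message) == message)  # does echo give the same text?
```

This is also a tiny taste of using one language as the “glue” around another: Python starts a command-line program and reads its output.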
To use Python more efficiently in the future, you’ll need to install some add-ons.
The easiest way to install things in the command line is using the apt-get application’s install feature. You only have to type apt-get install, then the name of the add-on (package) that you want to install. If the package exists, apt-get will find and install it. The list of available packages that apt-get keeps on your server is usually not the most recent one, so as a first step, refresh it with this command:
sudo apt-get update
(Note: sudo is an additional keyword before your apt-get command that runs the command with administrator (root) privileges. It works because you added your user to the sudo group earlier.)
The command line will ask for your password! Remember: it is not the one from the email anymore, but the one you set when you created the new user! Anytime it asks for your password, just type that one.
Now that apt-get has an up-to-date package list, give it a try and type:
sudo apt-get install mc
(If it asks whether to continue, just say yes.)
mc (Midnight Commander), which you have just installed, is a file manager that comes with a built-in text editor called mcedit. We will use it soon.
Next 2 steps (one by one):
sudo apt-get -y install python3-pip
sudo apt-get -y install python3-dev
Again, if it asks for your password, type it. If it asks whether to continue, say yes.
These commands installed pip and python3-dev on your server, which will help you to download Python-specific packages.
sudo -H pip3 install --upgrade pip
This command upgraded pip to the latest version.
Let’s install Jupyter:
sudo -H pip3 install jupyter
You have installed one of the coolest Python packages: Jupyter. This is a tool that helps you to create easy-to-use notebooks from your Python code. Why is it so awesome? In my Python for Data Science tutorial articles and videos I write more about it, but for now, let’s just configure and try it:
jupyter notebook --generate-config
This will create a config file for Jupyter on your server.
echo "c.NotebookApp.ip = '*'" >> /home/[your_username]/.jupyter/jupyter_notebook_config.py
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
echo "c.NotebookApp.allow_remote_access = True" >> /home/[your_username]/.jupyter/jupyter_notebook_config.py
(Note: this is one line of code! Only your browser breaks it into 2 lines!)*
These two commands append two lines to the end of the newly created config file. Those lines let you use your Jupyter notebook from a browser window (like Chrome or Firefox).
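If you are curious what “echo … >> file” does, here is a minimal Python sketch of the same idea: it opens a file in append mode and adds a line at the end without touching what is already there. It writes to a throwaway file in a temporary folder, not to your real Jupyter config, and the names append_line and demo_config.py are my own placeholders.

```python
import os
import tempfile


def append_line(path, line):
    """Append one line to a text file, like `echo "..." >> file` in Bash."""
    with open(path, "a") as config_file:
        config_file.write(line + "\n")


if __name__ == "__main__":
    # A throwaway file stands in for the real Jupyter config file.
    with tempfile.TemporaryDirectory() as folder:
        demo_path = os.path.join(folder, "demo_config.py")
        append_line(demo_path, "c.NotebookApp.ip = '*'")
        append_line(demo_path, "c.NotebookApp.allow_remote_access = True")
        with open(demo_path) as config_file:
            print(config_file.read())
```

Note that the Jupyter config file is itself a Python file, which is why the two added lines look like Python assignments.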
Now you can go ahead and use Jupyter by typing:
jupyter notebook --browser any
This command will start to run the Jupyter application on your remote server. While it’s running in Terminal, you should just open a browser and type in the address bar [IP Address of your remote server from the email]:8888
So in my case I open this “website” in Google Chrome:
188.8.131.52:8888. Well, it’s not a real website. It connects me to the interface of my Jupyter notebook.
On this screen you need to type a “password” or a “token” first. As we haven’t generated any password, you need to use the token, which you can easily find if you go back to your terminal window. Here:
If you manage to copy-paste your token, you will be logged into your Jupyter Notebook. You can then create your first Python Notebook in the top right corner: “New” –» “Python 3”
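In the terminal, the token usually shows up as part of a longer address that ends with “?token=…”. If picking it out by eye feels fiddly, here is a small Python sketch that extracts it from such an address. The function name and the example token abc123 are made up for this illustration; your real token will be a much longer string.

```python
from urllib.parse import urlparse, parse_qs


def extract_token(jupyter_url):
    """Return the value of the token parameter in a URL, or None if missing."""
    query = urlparse(jupyter_url).query
    values = parse_qs(query).get("token", [])
    return values[0] if values else None


if __name__ == "__main__":
    # Hypothetical address with a made-up token.
    print(extract_token("http://localhost:8888/?token=abc123"))
```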
On this surface you can try again printing the “Hello, World!” string. Once you have typed it, you can execute this command by hitting SHIFT+ENTER.
And done! You have installed Python 3 and Jupyter Notebook — and you can come back and use them any time.
Note 1: when you are done, don’t forget to shut down Jupyter in your Terminal window by hitting CTRL+C. If you want to use Jupyter again in the future, do the same as above. Type:
jupyter notebook --browser any
and open a browser…
Note 2: this setup is not the most secure version of using Jupyter, so I’d suggest not using any confidential data for now. (In another article, I’ll cover the security settings.)
Step 5: Install SQL and pgadmin4!
To continue you should be on Bash. You will know for sure, if you check the beginning of the line in your Terminal window (which is called a “prompt,” by the way). If you are on bash, it will look something like this (not necessarily green, it can be white or gray as well):
If you are not, just double-check that you haven’t missed anything above… Or just hit CTRL+C several times (that’s the hotkey to interrupt whatever is running on your terminal screen).
(If somehow accidentally you are still in Python, you will see “>>>” at the beginning of the line. If so, hit CTRL + D.)
When you are back to Bash, you can set up postgreSQL fairly quickly using a similar apt-get command as before:
sudo apt-get install postgresql postgresql-contrib
(If it asks for your password, type it – if it asks whether to continue, say yes!)
Done! You have postgreSQL just like that. Let’s try to access it!
When you installed SQL, it generated an SQL superuser called “postgres.” Right now this is the only user who can access your freshly created SQL database. The good thing is that you can sign in to this superuser’s account with this command:
sudo -i -u postgres
Notice the small change on the command line:
The superuser will be able to access SQL with this command (type it):
psql
You are in! You can type SQL commands!
First things first, let’s generate a new user, so you can access your database in the future with your normal user too (which is the preferred way).
CREATE USER [your_user_name] WITH PASSWORD '[your_preferred_password]';
In my case:
CREATE USER dataguy WITH PASSWORD 'the_same_password_i_used_so_far';
Exit from postgreSQL and go back to bash! Type:
\q
(This is the exit command in postgres.)
Then you have to log out from the superuser as well and go back to your normal user! Type:
exit
Now you can login with your normal user to your SQL database with this command:
psql -U dataguy -d postgres
Great! You are back to SQL again! Let’s try it out and run these SQL statements:
CREATE TABLE test(column1 TEXT, column2 INT);
INSERT INTO test VALUES ('Hello', 111);
INSERT INTO test VALUES ('World', 222);
SELECT * FROM test;
The first line generates a new table called “test.” The 2nd and the 3rd load some test data in it. The 4th prints all the values to the screen from the “test” table!
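If you would like to see the same four statements run from Python, here is a minimal sketch. To keep it self-contained it uses SQLite, a small database engine that ships with Python, instead of your postgreSQL server, so treat it as an illustration of the SQL statements rather than of the postgreSQL setup. The function name run_demo is my own.

```python
import sqlite3


def run_demo():
    """Create the test table, load two rows and return everything in it."""
    # ":memory:" means a temporary database that disappears afterwards.
    connection = sqlite3.connect(":memory:")
    cursor = connection.cursor()
    cursor.execute("CREATE TABLE test(column1 TEXT, column2 INT);")
    cursor.execute("INSERT INTO test VALUES ('Hello', 111);")
    cursor.execute("INSERT INTO test VALUES ('World', 222);")
    rows = cursor.execute("SELECT * FROM test;").fetchall()
    connection.close()
    return rows


if __name__ == "__main__":
    for row in run_demo():
        print(row)
```

Each row comes back as a pair of values, one for column1 and one for column2.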
You can learn about SQL from my SQL for Data Analysis tutorial series!
Exit from postgreSQL again:
\q
It’s time to set up pgadmin! This is a desktop application for postgreSQL that you can use to access your SQL database from your personal computer (without connecting to your remote server in terminal) and write queries much more easily. You will find this program very useful when you start writing complex queries.
As a first step – make your remote server ready to connect by typing these 5 lines of code (copy-paste them one by one):
sudo -i -u root
echo "listen_addresses = '*'" >> /etc/postgresql/*/main/postgresql.conf
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
echo 'host all all 0.0.0.0/0 md5' >> /etc/postgresql/*/main/pg_hba.conf
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
sudo /etc/init.d/postgresql restart
exit
What you are doing here is logging in to the root user and making some modifications in the config files of postgreSQL. (Remember: as you are on your remote server, the changes you make won’t affect your personal computer!)
Then download pgadmin4 from here: pgadmin4.
Select your OS, then download, install and run it!
Once you are done, you will see this screen:
Click the Add New Server Icon!
And fill in the popup:
- “Name:” anything you want (eg. “Data36 Test Server“)
- “Host name/address:” your remote server’s IP Address (in my case: 184.108.40.206)
- “Port:” 5432
- “Maintenance database:” postgres
- “User name:” [your_user_name]
- “Password:” your recently generated SQL password for your user
Click save and BOOM! You are connected to your database!
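If the connection does not work right away, one thing worth checking is whether the port is reachable at all. Here is a minimal Python sketch for that, using only the standard library. The function name, the 3-second timeout and the example address 203.0.113.10 (an address reserved for documentation) are my own choices; replace the address with your server’s IP address.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder address: use your remote server's IP address instead.
    print(port_is_open("203.0.113.10", 5432))
```

This only tells you that something is listening on the port. It does not check your username or password.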
At first sight, it’s not really straightforward, but you can discover the different folders on the left side. If you right click on the name of the server (on my screenshot: “Data36 Test Server”), you can disconnect from your server. Or connect the same way, when you want to get back.
Also if you left click on one of your databases (on my screenshot: “postgres”), then you select on the top menu “Tools” –» “Query tool,” you will be able to run SQL queries (execute with the little Flash Icon):
Notice that on my screenshot you can see the very same result that we got in the Terminal SQL! 🙂
Yay, you have SQL!
Only one small step left…
Step 6: Install R and RStudio!
R is the easiest tool to set up! That’s why I left it to the end.
First use apt-get again to install R:
sudo apt-get install r-base-core
(If it asks for your password, type it – if it asks whether to continue, say yes!)
Now you have R. You can test the “Hello, World!” here as well! Start R first:
R
print ("Hello, World!");
The syntax is a bit different than it was on Python and much different than it was in Bash.
You can exit from R:
q()
Save workspace image? [y/n/c] –» Say:
n
There is an application for R as well to make your coding life easier. It’s called RStudio and you can set it up using these 3 lines of code (copy-paste them one by one)!
sudo apt-get install gdebi-core
sudo gdebi rstudio-server-1.1.463-amd64.deb
Then just go to your browser and type [your IP Address] and port 8787. In my case:
You can log in with your username (e.g. dataguy) and password (the same ones you have been using to access your remote server so far). Then try “Hello, World!” here as well.
You have installed R and RStudio too! Congrats!
Nice job there!
You have created your own remote data server and you have installed Python, SQL, R and bash on it! This is a fantastic first step for you towards becoming a Data Scientist!
As I’ve mentioned several times during this article, I’ve been creating quite a few articles to show you how to use Python, SQL and bash. All of these start from the very basics. Feel free to start with any of these you prefer:
- SQL for Data Analysis tutorial series
- Python for Data Science tutorial series
- Bash for Analytics tutorial series
- If you want to learn more about how to become a data scientist, take my 50-minute video course: How to Become a Data Scientist. (It’s free!)
- Also check out my 6-week online course: The Junior Data Scientist’s First Month video course.
*shout-out to Johann for finding, reporting and solving the issue!
Sources and further reading
Data Science At The Command Line: http://datascienceatthecommandline.com