The ultimate step by step guide – for non-developers and wannabe data scientists! This is how to install R, Python, SQL and Bash in 30 minutes and start learning data coding today.
If you want to be a data scientist, a data analyst or just simply want to analyze the data of your startup (or any online business), sooner or later you will have to learn data coding.
There are 4 important languages you should learn:
- Python
- R
- SQL
- Bash (sometimes referred to as “the command line”)
In the next months I’ll publish more and more articles on data36.com about data coding. However, as a foundation for every further conversation on this topic, you need to have Python, R, SQL and bash on your computer. Once you have them, you will be able to practice by yourself (and later build your pet-projects too), as well as follow, practice and learn via my upcoming data coding tutorial articles and videos.
In this article I’ll walk you through, step by step, how to get these data tools. At the end of the article you will have your own fully functioning data infrastructure with:
- bash (and mcedit),
- postgreSQL (and pgAdmin4),
- Python 2.7 (and Jupyter) and
- R (and RStudio)
- I will show you here the exact same tools, that we are using in real life data science projects.
- All these tools are completely free!
Yes, the funny thing is that most of the super-scientific stuff you can read about in data science articles is made with open-source data tools. How cool is that?
Anyway: first things first – let’s go and get Python, SQL, R and bash on your computer!
Note: This article is most enjoyable if you do the steps and the coding part with me. If you are reading this on your mobile, I suggest saving the article for later (eg. send it to yourself in e-mail) and getting back here when you have ~60 minutes and you are on your desktop computer or notebook.
Note 2: To have your data infrastructure correctly set up, you simply need to follow my instructions below step by step. Most of the time you just copy-paste my code. So don’t be afraid to work in Terminal and write code. It’s easy, even if you are not a developer (yet). The article is long, however, because I tried to explain each step as much as I could.
The Operating System
I use all my data tools on Ubuntu (which is a Linux operating system) and I suggest you do the same. I use a Mac as a notebook, but you can have a PC too. In this case it doesn’t really matter, because we won’t install Ubuntu on our computer, but access it via the internet.
What we will do here is to connect to a remote server – type commands and make the remote server do the data analyses instead of our local computer.
(Note that you can set up Ubuntu on your personal/work computer too, if you really want to. But this is something we usually don’t do in real life, because with that solution we would limit our data processes to our computer’s capacity. Also we would lose some cool features.)
If you use a remote server for data analysis, you will be able to:
- Access your data infrastructure from any computer with a login-name and a password (even if you lose or break your personal notebook or something). Don’t worry, nobody else will be able to access your data on your remote server. This is a completely private thing.
- Automate your data scripts (eg. make them run every 3 hours, even if you turn off your notebook).
- Scale your stuff. You won’t be limited to your computer’s capacity. Renting a few more processors or memory is just one click-away, if you are using a remote server.
- Use Ubuntu without installing it on your computer.
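The “run every 3 hours” idea from the list above is usually handled by cron, the scheduler that ships with Ubuntu. Here is a minimal sketch of what such a schedule looks like. The script path /home/dataguy/my_script.sh is a made-up example (nothing in this article creates it), and the code only prints the schedule line instead of installing it:

```shell
# Build a crontab entry that runs a script every N hours, on the hour.
# The script path used below is a hypothetical example, not part of this guide.
cron_every_n_hours() {
    hours="$1"
    script="$2"
    # Fields: minute hour day-of-month month day-of-week command
    printf '0 */%s * * * %s\n' "$hours" "$script"
}

CRON_LINE="$(cron_every_n_hours 3 /home/dataguy/my_script.sh)"
echo "$CRON_LINE"
```

On a real server you would paste a line like this into the editor that opens when you type crontab -e.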
The only downside of going with a remote server is that it costs money. Fortunately these prices are very low (starting at $5 per month).
I will create all my data coding tutorials (videos and articles) using the exact same data stack that I describe in this article. So if you want to follow along, it will be much easier if you go through this article step by step and set up everything like I do. To make everything work properly, please make sure that you don’t miss any of the steps and that you do them in the same order as they are written here! The most important parts, of course, are the code parts.
UPDATE: Actually, you don’t need to go through on this whole article step-by-step, if you don’t want. 🙂 Based on reader requests I developed a solution, where all the steps below (the exact same steps) have been already done by me.
I built a data infrastructure – called the Data36 Learn server -, where you can simply log in and use Python (+ Jupyter), postgreSQL (+ pgAdmin), bash (+ mcedit) and R (+ RStudio). This also means, that you can leave the hassle, skip the rest of this article and jump instantly where you want: practice data coding.
Note: as I pay for the server, I’ll charge for this a monthly fee too.
More info here: Start with Python, SQL, R and bash in 1 minute!
If you don’t need my pre-built solution, just continue the article and you can still build everything by yourself following this guide!
Step 1: Get your remote server!
The next step is to find a hosting service to create your first remote server. I used many services and so far I’ve found DigitalOcean the best. You can rent here a server for $5/month (this will be perfect for us for now).
First, go to their website and create an account: DigitalOcean.com
Disclaimer: the link above is a special invitation by me – if you use that, you’ll get $10 free credit (and I’ll get $25 free credit). If you don’t want to use my link, you can simply click here instead. Note that in this case you won’t get the $10 credit.
Register with your email address and you will get a confirmation email in your inbox. Confirm it and you will see a screen where you can add your credit/debit card details or use PayPal. (For security reasons I always use PayPal.)
Once you are done with that, you are just one step away from creating your first remote server. You will see the “Droplets” screen. Click “Create Droplet” (big green button, top right corner).
On the next screen, make sure you are using these settings:
- “Choose an image” : Distributions: Ubuntu 14.04.5 x64
- “Choose a size” : Standard: $5/month
This will be more than enough for now. If needed, you can scale it up in the future. As you can see, you’ll pay on an hourly basis. This means that if you use the server for only 4 hours and then delete it, you will pay about $0.02. This is a very good deal.
- “Add block storage” : You don’t need this.
- “Choose a datacenter region”: Choose the one, that is the closest to you. Eg. if you are in San Diego, choose San Francisco and if you are in India, choose Bangalore. I’ll choose Frankfurt as I’m in Stockholm at the moment.
- “Select additional options” : You don’t need this.
- “Add your SSH keys”: You don’t need this.
- “Finalize and create”:
“How many Droplets”: 1
“Choose a hostname”: You can add here anything. I chose “data36-learn-datascience”
- Click “Create”.
Your server will be ready in ~60 secs.
CONGRATS! You have your first remote server, where you can practice data coding.
(Note: you can destroy this server at any time by clicking “Destroy”.)
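To get a feeling for the hourly billing mentioned in the settings above, here is a rough back-of-the-envelope sketch. It assumes the $5 monthly price is spread evenly over a 720-hour month. That is my simplification for illustration, not the hosting provider’s official rate, so treat the result as a ballpark figure of a few cents:

```shell
# Assumptions for illustration: a $5 monthly price billed evenly over a 720-hour month.
MONTHLY_PRICE=5
HOURS_PER_MONTH=720
HOURS_USED=4

# Shell arithmetic is integer-only, so awk does the floating-point division.
COST="$(awk -v p="$MONTHLY_PRICE" -v m="$HOURS_PER_MONTH" -v h="$HOURS_USED" \
    'BEGIN { printf "%.3f", p / m * h }')"
echo "Estimated cost for $HOURS_USED hours: \$$COST"
```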
Step 2: Access your remote server!
Now it’s time to login to your remote server. When you’ve created your server, you will receive an email from Digital Ocean. It will look something like this:
(Note: I removed my password. Yours won’t be **********, but numbers and characters. Also the IP Address you see here is fake, so don’t try to use it! ;-))
Make sure you save this email, because you will use this information in the future (especially the IP Address you’ve got).
Depending on which operating system you use on your computer, you can access your server in different ways.
For Mac/Linux Users:
Open “Terminal” (on Mac I suggest downloading iTerm2 and using that instead of Terminal), then type:
ssh [Username]@[IP Address]
[Username] is the username from the email, in this case: root
[IP Address] is the IP Address from the email you’ve got.
In my case that means typing ssh root@ followed by my server’s IP Address.
Hit enter and you are in…
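If the placeholders are confusing, this small sketch shows how the pieces fit together. The IP address 203.0.113.10 is a reserved documentation address that I picked for illustration, so replace it with the one from your email. The code only prints the command, it does not connect anywhere:

```shell
# Placeholder values: 203.0.113.10 is a reserved documentation address, not a real server.
SERVER_USER="root"
SERVER_IP="203.0.113.10"

# The login command is simply "ssh", a space, then username@address.
SSH_COMMAND="ssh ${SERVER_USER}@${SERVER_IP}"
echo "$SSH_COMMAND"
```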
(The next paragraph is important for Windows users only. You can skip it and scroll down to “Both Windows/Mac”!)
For Windows Users:
First download and install a program called PuTTY from here.
When you open PuTTY, add the details (from the email you’ve got) in this window:
Host Name (or IP address): the IP Address from the email (eg. 18.104.22.168 in my case)
Connection type: SSH
Click “Open” and you are in. It will ask for your username (“login as:”). You can find this in the email as well. In this case, type root and hit enter.
Both Windows/Mac (oh and Linux of course):
Nice, you SSH-d (logged in) into your remote Ubuntu server. From this point, whenever you are in the terminal window (as long as you are connected to your remote server there) you are using Ubuntu 14.04. It also means that any changes you make here won’t affect your personal computer!
Let’s finalize things, before we start with setting up your data infrastructure and start data coding!
If everything’s correct, the server asks a question like:
Are you sure you want to continue connecting (yes/no)?
Type yes and hit enter.
Then it will ask for your password. Copy-paste it from the email and hit enter. (If this is your first time on the command line, you might find it weird, that the stars don’t appear on the screen when you type your password, but this is how it is on Ubuntu. Even if you don’t see any characters typed in – don’t worry – it’s typed in.)
Then it will give back some messages to you and ask you to change your password. First, type (copy-paste) the old password again, then type the new one (whatever you want).
And done! You can start to install the data coding tools!
(Note: if you have only used graphical user interfaces so far (eg. SPSS, Excel, Rapidminer, etc.), I know this command line stuff could be intimidating. But believe me, once you’ve played around for 30 mins on this interface, you will find it fun! ;-))
Step 3: Install Bash!
Or wait… Great news! Bash is already set up as it’s the built-in language of Ubuntu 14.04. (Again: sometimes it’s referred to as “the command line”.)
I’ll get back to bash/command line in the next data coding tutorial articles and videos, but for now it’s enough if you know, that you really need to care about learning this language, because:
- You will use bash for every basic server operation – like moving files between folders, creating/deleting files, automating data scripts, installing new programs (eg. R or SQL!), giving permissions to users, etc.
- You will find it a powerful data tool as well. (Actually bash became my favorite data coding tool recently.)
For now, execute your first command. The “Hello, World!”:
echo 'Hello, World!'
You will have “Hello, World!” printed on your screen.
Don’t ask why we did that. This is a nice habit of developers, so we did it too, but let’s move forward quickly and execute our second, more important command. We are going to create a new user. On Ubuntu the adduser command does this:
adduser [newusername]
You can add anything to the [newusername] part. I’ll add “dataguy”, so in my case the command is adduser dataguy.
If you hit enter, you will see some text on your terminal screen, then you need to set a new password for this user, then some more text, then the name (your name preferably), and you can leave the rest empty.
What happened here is that we have created a new user.
This was needed for further steps. So far your username was “root”, and by default the “root” user is not allowed to do a few important things that we want to do.
Let’s execute one more command to give the right privileges to your new user:
usermod -aG sudo dataguy
(Obviously: don’t forget to replace “dataguy” with your new username.)
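If you want to double-check that the new user really ended up in the sudo group, the id command can list a user’s groups. Below is a small sketch of such a check. The sample group list is an assumption of what id -nG dataguy might print on your server, so that the example runs anywhere:

```shell
# Return success if a space-separated group list contains the given group.
list_has_group() {
    groups_list="$1"
    wanted="$2"
    for g in $groups_list; do
        [ "$g" = "$wanted" ] && return 0
    done
    return 1
}

# Sample output of `id -nG dataguy` after the usermod command (assumed for illustration).
SAMPLE_GROUPS="dataguy sudo"

if list_has_group "$SAMPLE_GROUPS" sudo; then
    RESULT="dataguy can use sudo"
else
    RESULT="dataguy is NOT in the sudo group"
fi
echo "$RESULT"
```

On the server itself you could call it with the real list: list_has_group "$(id -nG dataguy)" sudo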
From now on we won’t use root; we will use the new user you’ve created. So let’s log out from the root user:
exit
This command will close the connection between your computer and your remote server. Log back in with your new username! Do everything the same way as described in “Step 2” above, but change root to your new username (in my case “dataguy”) and use your new password. As I’m on Mac, for instance, I’ll type ssh dataguy@ followed by my server’s IP Address.
Now you are logged in as a normal user. And you can continue with setting up Python.
Step 4: Install Python and Jupyter!
More great news! Python is already installed on Ubuntu 14.04 too! You can try the Python way of “Hello, World!” very easily. Type this into the command line:
python
This will start Python. (While you are in Python, you can’t use Bash commands.) Now type:
print 'Hello, World!'
Notice that you get the same effect as with the “echo” command in bash. “print” and “echo” do pretty much the same thing, but “print” works in Python and “echo” works in Bash.
To leave Python, type exit() and hit enter (or hit CTRL+D). This will stop Python and you will be back in Bash.
To use Python more efficiently in the future, you’ll need to install some add-ons.
The easiest way to install things in the command line is using apt-get install, then the name of the add-on that you want to install. If the add-on exists, apt-get will find and install it. Unfortunately the list of available packages on your server is not up to date, so as a first step refresh it with this command:
sudo apt-get update
(Note: sudo is an extra addition that runs the command with administrator privileges, which are needed for installations. It works because your user is in the sudo group.)
It will ask for your password! Remember: it is not the one from the email anymore, but the one you set, when you’ve created the new user! Anytime when it asks for password, just type that one.
Now that your package list is up to date, give apt-get a try and type:
sudo apt-get install mc
(If it asks whether to continue, just say yes.)
mc (Midnight Commander), which you have just installed, is a file manager that comes with a handy text editor called mcedit. We will use it soon.
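If you are ever unsure whether a tool made it onto your server, you can ask bash directly. This is a small sketch using the built-in command -v check. The list of tool names is simply the set of programs this article installs, and the helper name is_installed is my own:

```shell
# Report whether a program is available on the PATH.
is_installed() {
    command -v "$1" >/dev/null 2>&1
}

# Check the tools this guide sets up; adjust the list as you like.
for tool in mc python psql R; do
    if is_installed "$tool"; then
        echo "$tool: installed"
    else
        echo "$tool: missing"
    fi
done
```

Early in the setup most of these will show up as missing. By the end of the article they should all be installed.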
Next 2 steps (one by one):
sudo apt-get -y install python-pip
sudo apt-get -y install python-dev
Again, if it asks for your password, type it. If it asks whether to continue, say yes.
These commands installed pip and python-dev on your server, which will help you to download python specific packages.
sudo pip install --upgrade pip
This command upgraded pip to its latest version.
Let’s install Jupyter:
sudo -H pip install jupyter
You have installed the coolest Python package: Jupyter. This is a tool that helps you to create easy-to-use notebooks from your Python codes. Why is it so awesome? I promise I’ll show you in my upcoming data coding tutorial articles and videos, but for now, let’s just configure and try it:
jupyter notebook --generate-config
This will create a config file for jupyter on your server.
echo "c.NotebookApp.ip = '*'" >> /home/[your_username]/.jupyter/jupyter_notebook_config.py
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
This will add one line to the newly created config file, which lets you use your Jupyter notebook from a browser window (like Chrome or Firefox).
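One thing to know about “>>” is that it appends every time you run it, so running the echo command twice leaves the same line in the config file twice. That is harmless here, but if you like things tidy, here is a sketch of an “append only if missing” helper. The function name append_once is my own, and the demo writes to a throwaway temporary file instead of your real config:

```shell
# Append a line to a file only if that exact line is not already there.
append_once() {
    line="$1"
    file="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Demo on a throwaway file; on the server the target would be
# /home/[your_username]/.jupyter/jupyter_notebook_config.py
DEMO_CONFIG="$(mktemp)"
append_once "c.NotebookApp.ip = '*'" "$DEMO_CONFIG"
append_once "c.NotebookApp.ip = '*'" "$DEMO_CONFIG"   # second call changes nothing

LINE_COUNT="$(grep -c "NotebookApp.ip" "$DEMO_CONFIG")"
echo "matching lines: $LINE_COUNT"
```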
Now you can go ahead and use Jupyter by typing:
jupyter notebook --browser any
This command will start the Jupyter application on your remote server. While it’s running in Terminal, just open a browser and type this into the address bar: [IP Address of your remote server from the email]:8888
So in my case I open in Google Chrome the:
22.214.171.124:8888 “website”. Well, it’s not a real website. It connects me to the interface of my Jupyter notebook.
On this screen you need to type a “password” or a “token” first. As we haven’t generated any password, you need to use the token, which you can easily find if you go back to your terminal window. It is part of the address that Jupyter prints there when it starts.
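The token is the long string after “token=” in that address. If you find it hard to spot, this sketch shows which part you need. The sample address and its token value are made up for illustration, and yours will be different:

```shell
# Example of an address as Jupyter prints it in the terminal; this token value is made up.
SAMPLE_URL="http://localhost:8888/?token=0123456789abcdef"

# Strip everything up to and including "token=" to keep only the token itself.
TOKEN="${SAMPLE_URL##*token=}"
echo "$TOKEN"
```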
If you manage to copy-paste your token, you will be logged into your Jupyter Notebook. And you can create your first Python Notebook on top right corner: “New” –» “Python 2”
In this notebook you can try the “Hello, World!” command again. Once you have typed it, you can execute it by hitting SHIFT+ENTER.
And done! Now you can use Python + Jupyter any time.
Note1: when you are done, don’t forget to shut down Jupyter in your terminal by hitting CTRL+C. If you want to use Jupyter again in the future, do the same as we did above: type
jupyter notebook --browser any
and open a browser…
Note2: this setup is not the most data-secure version of using Jupyter, so I’d suggest not to use any confidential data for now. (Later I’ll cover the security settings.)
Step 5: Install SQL and pgadmin4!
To continue you should be in Bash. You can tell by checking the beginning of the line in your Terminal window. If you are really in Bash, the line starts with a prompt of the form [your_username]@[hostname of your server], followed by a $ sign (it can be green, white or gray).
If you are not, double-check that you haven’t missed anything above… Or simply hit CTRL+C several times (that’s the hotkey to interrupt whatever is running on your terminal screen).
(If you are accidentally still in Python, you will see “>>>” at the beginning of the line. If so, hit CTRL + D.)
When you are back to Bash, you can set up postgreSQL fairly quickly by a similar apt-get command we’ve used before:
sudo apt-get install postgresql postgresql-contrib
(If it asks for your password, type it – if it asks if continue, say yes!)
Done! You have postgreSQL just like that. Let’s try to access it!
When you’ve installed SQL, it generated an SQL-super-user called “postgres”. Right now this is the only user, who can access your freshly created SQL database. The good thing is, that you can sign in to this superuser’s account with this command:
sudo -i -u postgres
Notice the small change on the command line: the prompt now starts with postgres@ instead of your username.
The superuser can access SQL with this command (type it):
psql
You are in! You can type SQL commands!
This first one will generate a new user. With that you will be able to access your database in the future with your normal user too (which is the preferred way).
CREATE USER [your_user_name] WITH PASSWORD '[your_preferred_password]';
In my case:
CREATE USER dataguy WITH PASSWORD '[the_same_password_i_used_so_far]';
Exit from postgreSQL and go back to bash! Type:
\q
(this is the exit command in postgres.)
Then you have to log out from the superuser as well and go back to your normal user! Type:
exit
Now you can login with your normal user to your SQL database with this command:
psql -U dataguy -d postgres
Great! You are back to SQL again! Let’s do some data coding and test SQL queries:
CREATE TABLE test(column1 TEXT, column2 INT);
INSERT INTO test VALUES ('Hello', 111);
INSERT INTO test VALUES ('World', 222);
SELECT * FROM test;
The first line creates a new table called “test”. The 2nd and the 3rd insert some values into it. The 4th prints all the values from the “test” table to the screen!
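Typing queries one by one is fine for practicing, but you can also keep them in a file and let psql run them all at once. Here is a sketch of that idea. It saves the four statements from above into a temporary file (the file location is just a demo choice) and counts them. The actual psql call is left as a comment because it only works on a server where postgreSQL is set up:

```shell
# Write the four test statements to a file (demo location: a temporary file).
SQL_FILE="$(mktemp)"
cat > "$SQL_FILE" <<'EOF'
CREATE TABLE test(column1 TEXT, column2 INT);
INSERT INTO test VALUES ('Hello', 111);
INSERT INTO test VALUES ('World', 222);
SELECT * FROM test;
EOF

# Every statement ends with a semicolon, so counting those lines counts the statements.
STATEMENT_COUNT="$(grep -c ';' "$SQL_FILE")"
echo "statements saved: $STATEMENT_COUNT"

# On the server you could then run the whole file in one go:
#   psql -U dataguy -d postgres -f "$SQL_FILE"
```

Keep in mind that if you have already created the “test” table by hand, the CREATE TABLE line will report an error the second time, because the table already exists.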
I’ll also get back to the usage of SQL later!
Exit from postgreSQL again:
\q
It’s time to set up pgadmin! This is a desktop application for postgreSQL that you can use to access your SQL database from your personal computer (without connecting to your remote server in terminal) and write queries much more easily. You will find this program very useful when you start writing complex queries.
As a first step, make your remote server ready to connect by typing these lines of code (copy-paste them one by one):
sudo -i -u root
echo "listen_addresses = '*'" >> /etc/postgresql/*/main/postgresql.conf
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
echo 'host all all 0.0.0.0/0 md5' >> /etc/postgresql/*/main/pg_hba.conf
(Note: this is one line of code! Only your browser breaks it into 2 lines!)
sudo /etc/init.d/postgresql restart
What you are doing here is to login to the root user and make some modification in the config files of postgreSQL. (Remember: as you are on your remote server, the changes you make won’t affect your personal computer!)
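The line that goes into pg_hba.conf looks cryptic, but it is just five fields separated by spaces. This sketch takes the line apart and labels each field. The explanations in the comments follow the postgreSQL documentation for this file, and the variable names are my own:

```shell
# The line added to pg_hba.conf in the step above.
HBA_LINE="host all all 0.0.0.0/0 md5"

# Split the line into its five whitespace-separated fields.
set -f                 # disable filename expansion while splitting
set -- $HBA_LINE
set +f
CONN_TYPE="$1"; DATABASE="$2"; DB_USER="$3"; ADDRESS="$4"; AUTH_METHOD="$5"

echo "connection type: $CONN_TYPE"    # host = connections over the network (TCP/IP)
echo "database:        $DATABASE"     # all  = every database
echo "user:            $DB_USER"      # all  = every database user
echo "client address:  $ADDRESS"      # 0.0.0.0/0 = any IPv4 address
echo "auth method:     $AUTH_METHOD"  # md5  = a password is required
```

Because 0.0.0.0/0 means “any address”, anyone who knows your IP can try to log in, so the password you set for your SQL user is what protects the database.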
Then download pgadmin4 from here: pgadmin4.
Select your OS, then download, install and run it!
Once you are done, you will see this screen:
Click the Add New Server Icon!
And fill the popup:
- “Name”: anything you want (eg. “Data36 Test Server“)
- “Host name/address”: your remote server’s IP Address (in my case: 126.96.36.199)
- “Port”: 5432
- “Maintenance database”: postgres
- “User name”: [your_user_name]
- “Password”: your recently generated SQL password for your user
Click save and BOOM! You are connected to your database!
At first sight it’s not really straightforward, but you can explore the different folders on the left side. If you right click on the name of the server (on my screenshot: “Data36 Test Server”), you can disconnect from your server. You can connect the same way when you want to get back.
Also if you left click on one of your databases (on my screenshot: “postgres”), then you select on the top menu “Tools” –» “Query tool”, you will be able to run SQL queries (execute with the little Flash Icon):
Notice that on my screenshot you can see the very same result, that we got in the Terminal SQL! 🙂
Yay, you have SQL!
Only one small step left…
Step 6: Install R and RStudio!
R is the easiest tool to set up! That’s why I left it to the end.
First use apt-get again to install R:
sudo apt-get install r-base-core
(If it asks for your password, type it – if it asks if continue, say yes!)
Now you have R. You can test the “Hello, World!” here as well! Start R first by typing R (a capital letter) and hitting enter, then type:
print ("Hello, World!");
The syntax is a bit different from Python’s and much more different from Bash’s.
You can exit from R by typing:
q()
R will ask: Save workspace image? [y/n/c]. Say n, as there is nothing to save yet.
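We have an application for R as well to make your data coding easier. It’s called RStudio and you can set it up with the commands below (copy-paste them one by one). Note that the gdebi command expects the installer file rstudio-server-1.0.136-amd64.deb to be in your current folder, so it has to be downloaded from the RStudio website first.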
sudo apt-get install gdebi-core
sudo gdebi rstudio-server-1.0.136-amd64.deb
sudo restart rstudio-server
Then just go to your browser and type [your IP Address]:8787 into the address bar.
You can login with your username (eg. dataguy) and password. (The same, you were using to access your remote server so far.) And try “Hello, World!” here as well.
You have R and RStudio too! Congrats!
Nice job there!
You have created your own remote data server and you have Bash, Python, SQL and R on it! This is a fantastic first step to become a Data Scientist!
As I’ve mentioned several times during this article, I’ll help you to learn and use these languages in the upcoming data coding tutorial videos and articles on data36.com. We will start from the very basics, I promise!