Pandas Tutorial 1: Pandas Basics (Reading Data Files, DataFrames, Data Selection)

Last updated on September 01, 2021

Pandas is one of the most popular Python libraries for Data Science and Analytics. I like to say it’s the “SQL of Python.” Why? Because pandas helps you to manage two-dimensional data tables in Python. Of course, it has many more features. In this pandas tutorial series, I’ll show you the most important (that is, the most often used) things that you have to know as an Analyst or a Data Scientist. This is the first episode and we will start from the basics!

Note 1: this is a hands-on tutorial, so I recommend doing the coding part with me!

Before we start

If you haven’t done so yet, I recommend going through these articles first:

To follow this pandas tutorial…

  1. You will need a fully functioning data server with Python3, numpy and pandas on it.
    Note 1 : Again, with this tutorial you can set up your data server and Python3. And with this article you can set up numpy and pandas, too.
    Note 2: or take this step-by-step data server set up video course.
  2. Next step: log in to your server and fire up Jupyter. Then open a new Jupyter Notebook in your favorite browser. (If you don’t know how to do that, I really do recommend going through the articles I linked in the “Before we start” section.)
    Note: I’ll also rename my Jupyter Notebook to “pandas_tutorial_1”.

     

    pandas tutorial 1

    Firing up Jupyter Notebook

  3. Import numpy and pandas to your Jupyter Notebook by running these two lines in a cell:
    import numpy as np
    import pandas as pd
    

    pandas tutorial 1 - import numpy import pandas

    Note: It’s conventional to refer to ‘pandas’ as ‘pd’. When you add the as pd at the end of your import statement, your Jupyter Notebook understands that from this point on every time you type pd, you are actually referring to the pandas library.

Okay, now we have everything! Let’s start with this pandas tutorial!
The first question is:

How to open data files in pandas

You might have your data in .csv files or SQL tables. Maybe Excel files. Or .tsv files. Or something else. But the goal is the same in all cases. If you want to analyze that data using pandas, the first step will be to read it into a data structure that’s compatible with pandas.

Pandas data structures

There are two types of data structures in pandas: Series and DataFrames.

Series: a pandas Series is a one dimensional data structure (“a one dimensional ndarray”) that can store values — and for every value it holds a unique index, too.

pandas tutorial 2 - pandas series example

Pandas Series example

DataFrame: a pandas DataFrame is a two (or more) dimensional data structure – basically a table with rows and columns. The columns have names and the rows have indexes.

pandas tutorial 3 - pandas dataframe example

Pandas DataFrame example

In this pandas tutorial, I’ll focus mostly on DataFrames. The reason is simple: most of the analytical methods I will talk about will make more sense in a 2D datatable than in a 1D array.

Loading a .csv file into a pandas DataFrame

Okay, time to put things into practice! Let’s load a .csv data file into pandas!
There is a function for it, called read_csv().

Start with a simple demo data set, called zoo! This time – for the sake of practicing – you will create a .csv file for yourself! Here’s the raw data:

animal,uniq_id,water_need
elephant,1001,500
elephant,1002,600
elephant,1003,550
tiger,1004,300
tiger,1005,320
tiger,1006,330
tiger,1007,290
tiger,1008,310
zebra,1009,200
zebra,1010,220
zebra,1011,240
zebra,1012,230
zebra,1013,220
zebra,1014,100
zebra,1015,80
lion,1016,420
lion,1017,600
lion,1018,500
lion,1019,390
kangaroo,1020,410
kangaroo,1021,430
kangaroo,1022,410

Go back to your Jupyter Home tab and create a new text file…

pandas tutorial 4 - new text file

…then copy-paste the above zoo data into this text file…

pandas tutorial 5 - zoo untitled

… and then rename this text file to zoo.csv!

pandas tutorial 5 - zoo data

Okay, this is our .csv file.
Now, go back to your Jupyter Notebook (that I named ‘pandas_tutorial_1’) and open this freshly created .csv file in it!

Again, the function that you have to use is: read_csv()
Type this to a new cell:

pd.read_csv('zoo.csv', delimiter = ',')

pandas tutorial 6 - read_csv zoo

And there you go! This is the zoo.csv data file, brought to pandas. This nice 2D table? Well, this is a pandas dataframe. The numbers on the left are the indexes. And the column names on the top are picked up from the first row of our zoo.csv file.

pandas tutorial 6 - read_csv zoo header

To be honest, though, you will probably never create a .csv data file for yourself, like we just did… you will use pre-existing data files. So you have to learn how to download .csv files to your server!

If you are here from the Junior Data Scientist’s First Month video course  then you have already dealt with downloading your .txt or .csv data files to your data server, so you must be pretty proficient in it… But if you are not here from the course (or if you want to learn another way to download a .csv file to your server and to get another exciting dataset), follow these steps:

I’ve uploaded a small sample dataset here: DATASET

(Link: 46.101.230.157/dilan/pandas_tutorial_read.csv)

If you click the link, the data file will be downloaded to your computer. But you don’t want to download this data file to your computer, right? You want to download it to your server and then load it to your Jupyter Notebook. It only takes two steps.

STEP 1) Go back to your Jupyter Notebook and type this command:

!wget 46.101.230.157/dilan/pandas_tutorial_read.csv

pandas tutorial 7 - download via wget

This downloaded the pandas_tutorial_read.csv file to your server. Just check it out:

pandas tutorial 7 - csv in the folder

See? It’s there.

If you click it…

pandas tutorial 8 - csv content

…you can even check out the data in it.

STEP 2) Now, go back again to your Jupyter Notebook and use the same read_csv function that we have used before (but don’t forget to change the file name and the delimiter value):

pd.read_csv('pandas_tutorial_read.csv', delimiter=';')

The data is loaded into pandas!

pandas tutorial 9 - read csv wrong header

Does something feel off? Yes, this time we didn’t have a header in our csv file, so we have to set it up manually! Add the names parameter to your function!

pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])

pandas tutorial 10 - fixed header

Better!
And with that, we finally loaded our .csv data into a pandas dataframe!

Note 1: Just so you know, there is an alternative method. (I don’t prefer it though.) You can load the .csv data using the URL directly. In this case the data won’t be downloaded to your data server.

pandas tutorial 11 - read csv from url or server directly

read the .csv directly from the server (using its URL)

Note 2: If you are wondering what’s in this data set – this is the data log of a travel blog. This is a log of one day only (if you are a JDS course participant, you will get much more of this data set on the last week of the course ;-)). I guess the names of the columns are fairly self-explanatory.

Selecting data from a dataframe in pandas

This is the first episode of this pandas tutorial series, so let’s start with a few very basic data selection methods – and in the next episodes we will go deeper!

1) Print the whole dataframe

The most basic method is to print your whole data frame to your screen. Of course, you don’t have to run the pd.read_csv() function again and again and again. Just store its output the first time you run it!

article_read = pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])

After that, you can call this article_read value anytime to print your DataFrame!

pandas tutorial 13 - print the whole dataframe

2) Print a sample of your dataframe

Sometimes, it’s handy not to print the whole dataframe and flood your screen with data. When a few lines is enough, you can print only the first 5 lines – by typing:

article_read.head()

pandas tutorial 14 - head

Or the last few lines by typing:

article_read.tail()

pandas tutorial 15 - tail

Or a few random lines by typing:

article_read.sample(5)

pandas tutorial 16 - sample

3) Select specific columns of your dataframe

This one is a bit tricky! Let’s say you want to print the ‘country’ and the ‘user_id’ columns only.
You should use this syntax:

article_read[['country', 'user_id']]

pandas tutorial 17 - select columns

Any guesses why we have to use double bracket frames? It seems a bit over-complicated, I admit, but maybe this will help you remember: the outer bracket frames tell pandas that you want to select columns, and the inner brackets are for the list (remember? Python lists go between bracket frames) of the column names.

By the way, if you change the order of the column names, the order of the returned columns will change, too:

article_read[['user_id', 'country']]

pandas tutorial 18 - select columns switched

This is the DataFrame of your selected columns.

Note: Sometimes (especially in predictive analytics projects), you want to get Series objects instead of DataFrames. You can get a Series using any of these two syntaxes (and selecting only one column):

article_read.user_id
article_read['user_id']

pandas tutorial - select one column pandas series

output is a Series object and not a DataFrame object

4) Filter for specific values in your dataframe

If the previous one was a bit tricky, this one will be really tricky!

Let’s say, you want to see a list of only the users who came from the ‘SEO’ source. In this case you have to filter for the ‘SEO’ value in the ‘source’ column:

article_read[article_read.source == 'SEO']

It’s worth it to understand how pandas thinks about data filtering:

STEP 1) First, between the bracket frames it evaluates every line: is the article_read.source column’s value 'SEO' or not? The results are boolean values (True or False).

pandas tutorial 20 - filter for values True False

STEP 2) Then from the article_read table, it prints every row where this value is True and doesn’t print any row where it’s False.

pandas tutorial 21 - filter for values result

Does it look over-complicated? Maybe. But this is the way it is, so let’s just learn it because you will use this a lot! 😉

Functions can be used after each other

It’s very important to understand that pandas’s logic is very linear (compared to SQL, for instance). So if you apply a function, you can always apply another one on it. In this case, the input of the latter function will always be the output of the previous function.

E.g. combine these two selection methods:

article_read.head()[['country', 'user_id']]

This line first selects the first 5 rows of our data set. And then it takes only the ‘country’ and the ‘user_id’ columns.

Could you get the same result with a different chain of functions? Of course you can:

article_read[['country', 'user_id']].head()

In this version, you select the columns first, then take the first five rows. The result is the same – the order of the functions (and the execution) is different.

One more thing. What happens if you replace the ‘article_read’ value with the original read_csv() function:

pd.read_csv('pandas_tutorial_read.csv', delimiter=';', names = ['my_datetime', 'event', 'country', 'user_id', 'source', 'topic'])[['country', 'user_id']].head()

This will work, too – only it’s ugly (and inefficient). But it’s really important that you understand that working with pandas is nothing but applying the right functions and methods, one by one.

Test yourself!

As always, here’s a short assignment to test yourself! Solve it, so the content of this article can sink in better!

Select the user_id, the country and the topic columns for the users who are from country_2! Print the first five rows only!

Okay, go ahead and solve it!

.
.
.

And here’s my solution!
It can be a one-liner:

article_read[article_read.country == 'country_2'][['user_id','topic', 'country']].head()

Or, to be more transparent, you can break this into more lines:

ar_filtered = article_read[article_read.country == 'country_2']
ar_filtered_cols = ar_filtered[['user_id','topic', 'country']]
ar_filtered_cols.head()

pandas tutorial 30 - test yourself solutions

Either way, the logic is the same. First you take your original dataframe (article_read), then you filter for the rows where the country value is country_2 ([article_read.country == 'country_2']), then you take the three columns that were required ([['user_id','topic', 'country']]) and eventually you take the first five rows only (.head()).

Conclusion

You are done with the first episode of my pandas tutorial series! Great job! In the next article, you can learn more about the different aggregation methods (e.g. sum, mean, max, min) and about grouping (so basically about segmentation). Stay with me: Pandas Tutorial, Episode 2!

Cheers,
Tomi Mester

← Previous post

Next post →

62 Comments

  1. Hi Tomi Mester,

    Nice tutorial about Pandas.

    For last command
    “article_read[article_read.country == ‘country_2’][[‘user_id’,’topic’, ‘country’]].head()”

    I guess it would be better if rewrite as
    article_read[[‘user_id’,’topic’, ‘country’]][article_read.country == ‘country_2’].head()
    as sames as sql notation SELECT [field] FROM [table] WHERE [filter].
    Just curious, because my background as SQL developer and i’m little bit confusing about pandas syntax.

    Thanks.

    • hey,

      well, the logic in pandas is more linear than in SQL (it’s more “inside-out” in SQL).
      Actually, both version will work fine.

      Your version might be a little bit more efficient (in terms of calculation time) though — because of the particular order you use (column selection first – then filtering for the rows), but we are talking about milliseconds here. 🙂

      Cheers,
      Tomi

  2. Andres Cardenas

    Excellent tutorial. I am just starting with pandas.
    Thanks

  3. Rachit Kulshreshtha

    Hi Tomi,

    Really loved the way you even showed more possible combinations even after getting the job done. These are the curious bits which really help a new learner and gives them the confidence to explore more.

  4. Salman valiyakath

    Really Helpful tutorial Thanks Tomi.

  5. Chrise Hemstone

    Great job I must say!

  6. Divya Negi

    I loved reading all through the end .
    Every example is nicely explained and as a beginner in the course i found it really easy to understand every step.

    Keep up the good work 🙂

  7. Tomi this is an excellent beginner tutorial…I’m very new to Python and I was able to follow this pandas tut and complete all the steps. I can’t wait for the next one!! Thank you for this great resource!

  8. Divya Reddy

    Hi Tomi,

    This has been by far the most simplified tutorial I have ever come across and so very well organized. Thank you so much, I am very motivated to finish the rest of the tutorial series.

  9. Priya

    Hi Tomi,
    Loved this tutorial. I would love to learn the whole python programming in this way. Actually I want to learn machine learning which requires python . I hope your tutorials can help me.

    • they can –» but search for machine learning in the side bar, and you’ll find my pandas/python based machine learning tutorials, too! 😉 Tomi

  10. roshan

    Nice tutorial. Helped alot

  11. Diego Camilo Peña Ramirez

    Hello, do you have github?

    Or something were can i find sample .csv files?

    Thank you

  12. Thanks for the free tutorial!
    It’s really clear and useful.

  13. Nandish Patil

    Very good tutorial and helping me learn faster, as compared to other tutorials.

  14. Behrouz

    Hi Tomi,

    You said that you don’t prefer reading the .csv directly from the server (using its URL). Why?

    • hi Behrouz,

      good question!
      It’s because the external server is something that I can’t control — so if they change, move or remove the data, my data project can be affected — without me knowing about it. That’s something that data scientists don’t really prefer! 😉
      Hope this makes sense!
      Tomi

  15. great tutorial. i love this

  16. Thank you. It explained very clearly for new users.

  17. whoah this weblog is excellent i really like reading your articles.

    Keep up the great work! You already know, lots of people are hunting
    around for this info, you could aid them greatly.

  18. Man, I can’t thank you enough for your Tutorial. It just helped me all the way. It could be sometimes really tricky and confusing when you are new to these things and suddenly come up with tricky questions. We need to start with the proper basics and thanks a lot to you.
    Please do let me know if you have more of such tutorials.

  19. Rajasekaran

    Great Tutorial , Thanks a lot

  20. Thank you for the great explanation! Appreciate it!

  21. This is really fun, thank you!

  22. nice to study about pandas…thank you

  23. brantly goodwin

    Tomi,

    Thank you for a great tutorial on pandas.

    One question that is causing me hadaches: when I type !wget 46.101.230.157/dilan/pandas_tutorial_read.csv, I get back the error message ‘wget’ is not recognized as an internal or external command,
    operable program or batch file.

    Any thoughts on where I might have gone wrong? I am using Jupyter 6.0.0 with Python 3.7.3.

    Many thanks.

    • hey Brantly,
      Are you using the same server setup as I do in the tutorials?
      My best guess right now is that you get the error, because you don’t have an Ubuntu based OS running in the background.
      Tomi

  24. Pandas Tutorial 1 is a great! Help me to understand these functions very well.
    Thanks

  25. excellent lesson, thank you!

  26. Just went through it. Really love it

  27. Muhammad Fardad

    wonderful work man
    I appreciated!

  28. Alisonia

    Excellent. This helps a lot.

    Million thanks.

  29. Hello Tomi,
    Amazing Tutorial on pandas. You have described all the points clearly. Thank you so much…

  30. saukhin

    Excellent quick and efficient introduction to the world of Pandas

  31. This one was really helpful – thanks!

  32. Hi Tomi

    really great piece of work for novice like me.
    this tutorial contains exactly what i was looking for. the explanation about DS returned by read_csv.
    really crisp and crystal now

    thanks buddy.
    already bookmarked you 🙂

Leave a Reply