Data Coding 101 – Intro To Bash – ep4 (with video)

Data Coding in Bash – this is episode 4! Let’s continue with some basic bash best practices. Today I’ll show you 9 tips that will make your data coding life in the command line much easier.
This week I’ll also start experimenting with a video format. Let me know in the comments whether you liked it or not. I suggest watching the video first; I’ve listed the tools and commands below anyway, so you can jump on your Terminal and try them after watching the vid!

Note: If you are new here, I suggest starting with these previous data coding and/or bash articles:

And here’s the video:

Do Bash + Data Coding Faster!

Bash best practice #1
Tab key: Any time you start typing in the command line and hit the TAB key, bash tries to complete the command or file name you have started, so you can spare some characters. (By default, if there are several possible matches, hitting TAB twice lists them.)

Bash best practice #2
Up/down arrows: They bring back your previous commands. So if you misspelled something or want to make a small modification, you don’t have to type the whole command again.

Bash best practice #3
history –» This bash tool prints your recently used commands on your terminal screen. (Pro tip: try it with grep! E.g. history | grep 'cut' will list all the commands you have used that contained cut.)
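
If you search your history often, the pro tip above can be wrapped into a tiny function. This is only a sketch: the name hgrep is made up for this illustration (it is not a standard command), and it assumes you are working in bash, where history is a built-in.

```shell
# hgrep is a hypothetical helper name, not a built-in command.
# It searches the history list of the current bash session for a pattern.
hgrep() {
  history | grep -- "$1"
}

# Usage in an interactive session:
# list every earlier command that contained "cut".
#   hgrep 'cut'
```

You could put the function into your ~/.bashrc so that it is available in every new terminal window.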

Bash best practice #4
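CTRL + L or clear: It cleans your screen. Better for your mind, better for your eyes! (Note: CTRL + R does something else. It starts a reverse search through your command history, which is often quicker than scrolling through the output of history.)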

Do Bash + Data Coding Cleaner!

First install CSVKit!
sudo pip install csvkit

Note: I first heard about the CSVKit bash tools in this book: Jeroen Janssens – Data Science at the Command Line.

Then try csvlook and csvstat:

Bash best practice #5
csvlook helps you see your csv files in a much cleaner, much more human-readable format. E.g. here’s a short sample from our flight delays data (the 2007.csv file):
cat 2007.csv |head |cut -d',' -f12,13,14,15| csvlook
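
If you want to try the same pipeline without the big flight delays file, here is a small sketch. The file name sample.csv, its columns and its values are all made up for this illustration, so the column numbers passed to cut differ from the command above. It assumes csvkit is installed and falls back to the raw output if csvlook is missing.

```shell
# sample.csv and its contents are invented for this illustration.
cat > sample.csv <<'EOF'
Year,Month,Origin,Dest,ArrDelay,DepDelay
2007,1,SMF,ONT,1,7
2007,1,SMF,PDX,8,13
2007,1,SMF,PDX,34,36
EOF

# Keep the first 10 lines, select columns 3-6, then pretty-print them.
if command -v csvlook >/dev/null 2>&1; then
  head sample.csv | cut -d',' -f3,4,5,6 | csvlook
else
  # csvlook is not installed: show the raw comma-separated output instead.
  head sample.csv | cut -d',' -f3,4,5,6
fi
```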

Bash best practice #6
csvstat returns some basic statistics about your dataset. Try:
cat demo1.csv | csvstat

(Even median is there! Remember last time how hard it was to get it?)

Note: csvstat unfortunately does not cope well with bigger files. So you can’t really use it on flightdelays.csv, for example, because that file is way too big.
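
One possible workaround, which is my assumption and not something shown in the video, is to run csvstat on the first rows of the big file only. The helper name sample_rows below is hypothetical. Keep in mind that the first rows of a file are not necessarily representative of the whole dataset, so treat the resulting statistics as a rough look only.

```shell
# sample_rows is a hypothetical helper name.
# It prints the header line plus the first N data rows of a CSV file
# (N defaults to 1000), so the output stays small enough for csvstat.
sample_rows() {
  file="$1"
  rows="${2:-1000}"
  head -n "$((rows + 1))" "$file"
}

# Usage (assumes csvkit is installed and 2007.csv is in the current folder):
#   sample_rows 2007.csv 5000 | csvstat
```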

Bash best practice #7
Enter-enter-enter
This will sound silly, but I assure you this is a real problem and this is a real solution for it. This is what data scientists do when they use the command line in real life. The problem: when you cat a file on your screen, then another one right after, it’s really hard to find the first row of the second file. The main reason is that the prompt looks like every other line in your files. If you’ve watched the video above, you have seen that I colored my prompt. That’s part of the solution, but to make sure I will find the first line of my second file, I usually hit enter 10-15 times before the second cat.

Right?
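
If you get tired of hitting enter, both parts of this trick can be sketched in a few lines. The helper name gap is made up for this illustration, and the prompt color (bold green) is an arbitrary choice, not necessarily the one from the video.

```shell
# gap is a hypothetical helper: print N empty lines (10 by default).
gap() {
  n="${1:-10}"
  i=0
  while [ "$i" -lt "$n" ]; do
    echo
    i=$((i + 1))
  done
}

# Usage: cat first.csv; gap; cat second.csv

# Optional: a colored prompt, so it stands out from the lines of your files.
# Bold green is an arbitrary choice; put this line in ~/.bashrc to keep it.
PS1='\[\e[1;32m\]\u@\h:\w\$ \[\e[0m\]'
```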

Do Bash + Data Coding Smarter!

As I’ve mentioned earlier, these articles of mine are just the beginning: they give you the approach and the basic bash tools. But in the long term I suggest 2 major ways to continuously extend your knowledge.

Bash best practice #8
man –» this is a command to learn more about specific command line tools. E.g. try:
man cat –» and you will get into the manual of cat. It works for almost every command. (man cut, man grep, etc.) The good thing about it is that each manual contains a full list of the options for the given command. (Press q to quit the manual.)

Bash best practice #9
Googling + StackOverflow
I know this is something I should not even mention, but still: if you have a question, you can be almost sure that someone has already asked it and someone else has already answered it somewhere on the internet. So just don’t forget to use Google first. Most of the answers are on a website called StackOverflow, by the way. If your question happens not to be there, StackOverflow is still awesome, because you can also ask questions there. There are many nice and smart people there, from whom you will get an answer fast, so don’t be shy! 😉

A great book about Data Science at the Command Line

And finally, it’s time to mention a great book about Bash!
Jeroen Janssens – Data Science at the Command Line

As far as I know, this is the one and only book that writes about bash as a tool for data scientists. It comes with many great tips and bash best practices! It assumes that you have some initial Python and/or R knowledge, but if you don’t, I still recommend it. If you have read my Data Coding 101 articles about bash so far, you won’t have any trouble understanding most of this nice book!

Conclusion

Today we went through some great tools to make your job in bash cleaner, faster and smarter.
Next week I’ll show you two major control flow components of bash: the if statement and the while loop. (And they are even more important, as we are gonna use the same logic later in Python and R as well!) Update: here’s the new episode!

If you don’t want to miss any of my new data content (articles, videos, e-books, etc.), subscribe to my Newsletter!

Cheers,
Tomi Mester

4 Comments

  1. Alejandro

    I love your tutorials! Clear and straightforward!

  2. Kumaran Rajendhiran

    In ubuntu and most linux systems, key combination to clear the screen is Ctrl+l not Ctrl+r. Ctrl+r is used for reverse search… And reverse search is a better tool than history command
