Soar your Bet in Data Science Using Unix Cmds
Last Updated on July 28, 2021 by Editorial Team
Author(s): Karthik Bhandary
If I asked you, “What do you use the command line/terminal for?”, you would probably tell me, “To run scripts, obviously!🙄”. I know that we use it to run scripts, but what else can we do with it? If you have been in the programming field for some time, you know where I am going with this. But if you are a newbie who has just gotten into the coding world, you probably don’t have much of an idea yet. That’s not a problem at all. It’s quite natural when you are just starting out.
If you don’t know the answer to that question, here it is:
greater control of an OS or application; faster management of many operating systems; ability to store scripts to automate regular tasks; basic command–line interface knowledge to help with troubleshooting, such as network connection issues.
We are obviously not going to talk about troubleshooting, but we definitely will talk about file management and controlling other tasks. If you are an aspiring Data Scientist or Data Analyst, I want you to know that the command line is going to be a very important, common, and powerful 💪 tool for managing files and modifying the data in them.
In this blog, I am not going to cover every single command available, only some of the important commands and techniques.
The “cat” command is used to view the contents of a file.
syntax: cat filename
Example: cat food/burger.csv
It will let you view the content of the burger.csv file.
The “less” command is used to view a file one page at a time, and we can give it more than one file. There are some special keys that we can use while viewing.
syntax: less filename filename …
While viewing, press the space bar to page down.
- :n to go to the next file.
- :p to go to the previous file.
- :q to quit.
There are times when we want to view just the first few or the last few lines of a file. We have commands to take care of that as well.
The “head” command is used to look at the first few lines of a file.
syntax: head filename
We can even specify the number of lines we want with “-n” followed by that number. For example:
head -n 3 filename
It should be intuitive by now that the “tail” command selects the last few lines of a file.
syntax: tail filename
Like head, it accepts “-n” followed by the number of lines you want.
tail -n 3 filename
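To see head and tail side by side, here is a minimal sketch. The file name dishes.csv and its contents are made up for illustration; the snippet creates them in a scratch directory so it can run anywhere.

```shell
# Create a scratch directory and a small sample file (hypothetical data).
workdir=$(mktemp -d)
printf 'dish,price\nidli,30\ndosa,50\nroti,10\nbiriyani,120\n' > "$workdir/dishes.csv"

# First 2 lines: the header plus the first record.
top=$(head -n 2 "$workdir/dishes.csv")

# Last 2 lines of the file.
bottom=$(tail -n 2 "$workdir/dishes.csv")

echo "$top"
echo "$bottom"

# Remove the scratch directory again.
rm -rf "$workdir"
```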
If you want to know what a command does, all you have to do is use the “man” command, which shows the manual page for that command.
syntax: man cmd_name
While the head and tail commands are used to select rows, the “cut” command is used to select columns.
It has some flags that we can use.
- -c: cut by character position.
- -b: cut by byte position.
- -d: set the field delimiter. cut uses tab as the default delimiter but can work with others through this option.
- -f: cut by field.
cut -f 2-5,8 -d " " filename.csv
Here we are selecting fields 2 to 5 and field 8, and we are treating “ ” (a space) as the delimiter/separator.
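Here is a small sketch of cut on a comma-separated file, which is the more common case in data work. The file dishes.csv and its three columns are invented for this example.

```shell
# Scratch directory with a tiny three-column CSV (hypothetical data).
workdir=$(mktemp -d)
printf 'dish,price,city\nidli,30,Chennai\ndosa,50,Mysuru\n' > "$workdir/dishes.csv"

# -d , sets the delimiter to a comma; -f 1,3 keeps the first and third fields.
picked=$(cut -d , -f 1,3 "$workdir/dishes.csv")
echo "$picked"

# -c 1-4 keeps the first four characters of every line instead.
chars=$(cut -c 1-4 "$workdir/dishes.csv")
echo "$chars"

rm -rf "$workdir"
```

Note that cut knows nothing about quoted fields, so a comma inside a quoted value would still be treated as a separator.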
Consider the following situation: we run a command and it returns an error because we are in the wrong directory. We want to move to the right directory, but we are not interested in retyping the whole command. We can use “!” followed by the command name to rerun the most recent command that starts with that name.
head roti.csv # Throws an error because we are in the wrong directory.
cd food # we go to the correct directory
!head # reruns the above head command.
When we want to select the lines containing specific text, we use the “grep” command.
It takes a piece of text followed by one or more filenames and prints all the lines in these files that contain the text.
grep 'Mamma mia!!' dosa.txt
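A quick sketch of grep with two flags I find handy, -v and -i. The file notes.txt and its lines are placeholders made up for the example.

```shell
# Scratch directory with a three-line text file (hypothetical data).
workdir=$(mktemp -d)
printf 'idli is soft\nDosa is crisp\nroti is flat\n' > "$workdir/notes.txt"

# Lines that contain the text "crisp".
match=$(grep 'crisp' "$workdir/notes.txt")

# -v inverts the match: lines that do NOT contain "crisp".
others=$(grep -v 'crisp' "$workdir/notes.txt")

# -i ignores case, so "dosa" also matches "Dosa".
nocase=$(grep -i 'dosa' "$workdir/notes.txt")

echo "$match"
echo "$others"
echo "$nocase"

rm -rf "$workdir"
```

Single quotes around the search text keep the shell from interpreting characters such as ! or $ before grep sees them.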
There are times when we want to store the result of a command. We can do that easily.
“>” redirects the output of a command into a file. It is used as follows.
head -n 5 food/idli.csv > top.csv
This stores the top 5 rows of the idli.csv in the top.csv file.
This is very important because we use this method all the time when working with data in the terminal.
We can combine two or more commands by pipelining, which uses the “|” (pipe) character, found above the Enter key on most keyboards. A pipe sends the output of the command on its left to the command on its right as input.
head -n 5 food/biriyani.csv | tail -n 3 > top_bottom.csv
This is a very important concept that you should remember. It is one of the concepts that will make your life easier.
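To make the head-then-tail pipeline concrete, here is a sketch on a file where the rows are easy to tell apart. The names rows.txt and middle.txt are assumptions for this example only.

```shell
# Scratch directory with six numbered rows (hypothetical data).
workdir=$(mktemp -d)
printf 'r1\nr2\nr3\nr4\nr5\nr6\n' > "$workdir/rows.txt"

# head keeps rows 1-5, tail keeps the last 3 of those (rows 3-5),
# and > writes the result to a file instead of the screen.
head -n 5 "$workdir/rows.txt" | tail -n 3 > "$workdir/middle.txt"

middle=$(cat "$workdir/middle.txt")
echo "$middle"

rm -rf "$workdir"
```

Keep in mind that > replaces whatever was already in the target file.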
There are times when we want to count the records by character, by word, or even by line. For this, we are going to use the “wc” (word count) command.
By using flags you can specify what you want to count.
- -c: bytes (the same as characters for plain ASCII text; use -m to count characters)
- -w: words
- -l: lines
grep "2017-07" seasonal/spring.csv | wc -l
In the above example, I am selecting every line that contains “2017-07” and counting the number of matching lines.
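The same idea as a runnable sketch. The file spring.csv below is a three-line stand-in that I made up, not the real dataset.

```shell
# Scratch directory with a tiny stand-in for the seasonal data (hypothetical).
workdir=$(mktemp -d)
printf '2017-07-01,tooth\n2017-08-02,hair\n2017-07-15,nail\n' > "$workdir/spring.csv"

# Count the lines that mention 2017-07.
# Some wc implementations pad the number with spaces, so tr strips them.
july=$(grep '2017-07' "$workdir/spring.csv" | wc -l | tr -d ' ')

# Count the words in the file; each line here is one word with no spaces.
words=$(wc -w < "$workdir/spring.csv" | tr -d ' ')

echo "$july"
echo "$words"

rm -rf "$workdir"
```

Reading the file through a pipe or < makes wc print only the number, without the filename after it.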
Until now, if we wanted to select more than one file, we just typed them all out. We can avoid that by using the wildcard character “*”, which matches any number of characters in a filename and so selects more than one file at a time.
cut -d, -f 1 seasonal/*.csv
Here I am getting the first comma-separated field from every .csv file in the seasonal directory.
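A small sketch of what the wildcard does. The directory layout and file contents are invented for the example; notes.txt is there only to show a file that does not match.

```shell
# Scratch directory with two CSV files and one non-CSV file (hypothetical data).
workdir=$(mktemp -d)
mkdir "$workdir/seasonal"
printf 'date,tooth\n2017-01-25,wisdom\n' > "$workdir/seasonal/spring.csv"
printf 'date,tooth\n2017-07-11,canine\n' > "$workdir/seasonal/summer.csv"
printf 'not a csv\n' > "$workdir/seasonal/notes.txt"

# The shell expands *.csv to every matching file before cut runs,
# so notes.txt is left out. Matches are expanded in sorted order.
firsts=$(cut -d , -f 1 "$workdir"/seasonal/*.csv)
echo "$firsts"

rm -rf "$workdir"
```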
We can use “>” with a pipeline, but it should appear at the end. For example:
cut -d, -f 2 seasonal/*.csv > teeth-only.txt | grep -v tooth
In the above example, all of cut's output is stored in teeth-only.txt, so nothing is left for grep to filter. Instead, you can do it like this:
cut -d, -f 2 seasonal/*.csv | grep -v tooth > teeth-only.txt
This is how you should use redirecting with pipelining.
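If you want to check the difference yourself, here is a sketch that runs both orderings on a tiny made-up file. The file names unfiltered.txt and teeth-only.txt are just labels for this example.

```shell
# Scratch directory with a tiny two-column CSV (hypothetical data).
workdir=$(mktemp -d)
printf 'date,tooth\n2017-01-25,wisdom\n2017-02-19,canine\n' > "$workdir/spring.csv"

# Redirect in the middle: cut's output goes to the file, not down the pipe.
# grep finds nothing to print and exits non-zero, hence the "|| true".
leftover=$(cut -d , -f 2 "$workdir/spring.csv" > "$workdir/unfiltered.txt" | grep -v tooth || true)
unfiltered=$(cat "$workdir/unfiltered.txt")

# Redirect at the end: grep filters first, then the result is saved.
cut -d , -f 2 "$workdir/spring.csv" | grep -v tooth > "$workdir/teeth-only.txt"
saved=$(cat "$workdir/teeth-only.txt")

echo "$unfiltered"
echo "$saved"

rm -rf "$workdir"
```

With the redirect in the middle, the header row "tooth" lands in the file unfiltered and nothing reaches grep. With the redirect at the end, only the filtered rows are saved.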
When you run a command, nothing seems to be happening, and you are not able to run another command, press Ctrl + C to stop it.
Variables are used to store data. Some common environment variables are HOME and USER.
Use uppercase when defining a variable. It is the common convention.
You can print a variable using the echo command, as follows.
echo $USER
We use $ because it lets the shell tell the value of a variable apart from a filename with the same name.
We store data as we do in any programming language, that is, by using the “=” sign. Unlike most languages, though, the shell does not allow spaces around it.
TRAINING=seasonal/summer.csv
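A minimal sketch of defining and reading a variable. It reuses the TRAINING name from above; the helper names with_dollar and without_dollar are my own, added only to capture the two results. The assignment is written with no spaces around =, which is what the shell expects.

```shell
# Assign the value; there must be no spaces around =.
TRAINING=seasonal/summer.csv

# $TRAINING is replaced by the stored value.
with_dollar=$(echo "$TRAINING")

# Without the $, the shell sees plain text and prints it as is.
without_dollar=$(echo TRAINING)

echo "$with_dollar"
echo "$without_dollar"
```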
If we want to repeat a command several times, we can use loops. We particularly use the “for” loop.
for ..var.. in ..list..; do ..body..; done
for filename in "$@" # "$@" stands for all the filenames passed to the script.
do
    head -n 2 "$filename" | tail -n 1
    tail -n 1 "$filename"
done
You can use indentations if you want for better readability.
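The same loop pattern works with any list of files. Here is a sketch that loops over a wildcard instead; the two CSV files are made-up stand-ins.

```shell
# Scratch directory with two small CSV files (hypothetical data).
workdir=$(mktemp -d)
printf 'date,tooth\n2017-01-25,wisdom\n2017-03-01,molar\n' > "$workdir/spring.csv"
printf 'date,tooth\n2017-07-11,canine\n2017-08-30,incisor\n' > "$workdir/summer.csv"

# Loop over every CSV file; for each one print its first data row
# (line 2, just below the header).
first_rows=$(
  for filename in "$workdir"/*.csv
  do
    head -n 2 "$filename" | tail -n 1
  done
)
echo "$first_rows"

rm -rf "$workdir"
```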
We can use files to store commands. These types of files are called scripts.
Editing a Script:
You should know the following shortcuts to work with these scripts.
To open a script for editing, use the “nano” command followed by the filename.
Inside nano, we can use these shortcuts.
Ctrl + K: Delete a line.
Ctrl + U: Un-delete a line.
Ctrl + O: save the file (‘O’ for output)
Ctrl + X: exit the editor.
As I said earlier we can use these scripts to store some commands and reuse them as we please. For your understanding purposes, I am going to show you an example.
# inside header.sh
grep -v "Tooth" seasonal/spring.csv
press ctrl + o and press enter. # saves the contents.
press ctrl + x to exit the file.
bash header.sh # by using this we use the commands inside.
Passing Filenames to Script:
Earlier in the for loop example, I used the special variable “$@”. It is a placeholder for all the filenames passed to the script.
Let’s say we have “unique-dish.sh” which contains the following command in it.
sort "$@" | uniq
When we run
bash unique-dish.sh burger.csv
the “$@” inside the script gets replaced with the filename passed, i.e. burger.csv. We can give it more than one filename at once.
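Here is a sketch of the whole round trip: writing a script, then running it with two filenames. I am assuming the script is meant to list each dish once (I am inferring that from its name), so it uses sort followed by uniq. The input files south.txt and north.txt are made up.

```shell
# Scratch directory with two small input files (hypothetical data).
workdir=$(mktemp -d)
printf 'dosa\nidli\ndosa\n' > "$workdir/south.txt"
printf 'roti\nidli\n' > "$workdir/north.txt"

# Write a one-line script. The quoted 'EOF' stops the shell from
# expanding "$@" while the file is being written.
cat > "$workdir/unique-dish.sh" <<'EOF'
sort "$@" | uniq
EOF

# "$@" becomes both filenames, so the script sorts their combined lines
# and uniq drops the adjacent duplicates.
dishes=$(bash "$workdir/unique-dish.sh" "$workdir/south.txt" "$workdir/north.txt")
echo "$dishes"

rm -rf "$workdir"
```

Putting “$@” in double quotes keeps filenames that contain spaces in one piece.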
These are the Unix commands that I wanted to cover. Mind you, there are many more. I mentioned these because they are important when it comes to Data Science.
Most of the other commands are similar to those of the Windows command line, so you don’t need to worry about them.
The key takeaways are
- We learned some important commands.
- We learned some important techniques.
- We learned how to apply them with examples.
Published via Towards AI