Soar your Bet in Data Science Using Unix Cmds
Last Updated on July 28, 2021 by Editorial Team

Author(s): Karthik Bhandary

Data Science

If I asked you what you use the command line (terminal) for, you would probably say, “To run scripts, obviously! 🙄” That is true, but what else can we do with it? If you have been programming for a while, you already know where I am going. If you are new to coding, you may not have much of an idea yet, and that is completely natural.

If you don’t know the answer to the question above, here it is:

The command line gives you greater control of an OS or application, faster management of many operating systems, the ability to store scripts that automate regular tasks, and basic command-line knowledge that helps with troubleshooting, such as network connection issues.

Photo by Mr Cup / Fabien Barral on Unsplash

We are not going to talk about troubleshooting, but we will definitely talk about file management and controlling other tasks. If you are an aspiring Data Scientist or Data Analyst, I want you to know that the command line is going to be a very important, common, and powerful 💪 tool for managing files and modifying the data in them.

In this blog, I am not going to cover every single command available, only some of the important commands and techniques.

Commands

Photo by Karina Vorozheeva on Unsplash

cat:

This is used to view the contents of the file.

syntax: cat filename

Example: cat food/burger.csv

It will let you view the content of the burger.csv file.

Photo by K8 on Unsplash

less:

This is used to view a file one page at a time, and we can pass more than one file to it. Inside less, a few key commands help us move around.

syntax: less filename filename …

While viewing, press the space bar to page down.

  • :n goes to the next file.
  • :p goes to the previous file.
  • :q quits.

Sometimes we want to view just the first few or the last few lines of a file. We have commands to take care of that as well.

Photo by Andres Herrera on Unsplash

head:

This is used to look at the first few lines of a file (10 by default).

syntax: head filename

We can specify the number of lines with “-n” followed by the number of lines we want. For example:

head -n 3 filename

Photo by Jason Leung on Unsplash

tail:

It should be intuitive by now that this selects the last few lines of a file.

syntax: tail filename

It accepts the same “-n” flag as the head command.

tail -n 3 filename
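To see head and tail side by side, here is a minimal sketch. The file menu.csv and its contents are made up for illustration and are not part of the original examples.

```shell
# Hypothetical sample data: a header row plus five records.
workdir=$(mktemp -d)
printf 'dish,price\nburger,5\ndosa,3\nidli,2\nroti,1\nbiriyani,7\n' > "$workdir/menu.csv"

# First 3 lines: the header and the first two records.
first_rows=$(head -n 3 "$workdir/menu.csv")

# Last 2 lines of the same file.
last_rows=$(tail -n 2 "$workdir/menu.csv")

echo "$first_rows"
echo "$last_rows"
```

Note that head counts the header row as a line, so “-n 3” gives you the header plus two records.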

If you want to know what a command does, all you have to do is use the “man” command (short for manual).

syntax: man cmd_name

While the head and tail commands select rows, the “cut” command selects columns.

Photo by Brands&People on Unsplash

cut:

This is used to select data by column.

It has some flags that we can use.

  • -c: cuts by character position.
  • -b: extracts specific bytes.
  • -d: sets the field delimiter; cut uses tab by default but works with other delimiters through this option.
  • -f: cuts by field.

Example:

cut -f 2-5,8 -d " " filename.csv

Here we are selecting fields 2 to 5 and field 8, treating " " (a space) as the delimiter/separator.
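Here is a small sketch of cut on comma-separated data. The file menu.csv and its three columns are assumptions made up for this illustration.

```shell
# Hypothetical sample data with three comma-separated columns.
workdir=$(mktemp -d)
printf 'dish,price,city\nburger,5,Delhi\ndosa,3,Chennai\n' > "$workdir/menu.csv"

# -d ',' sets the delimiter to a comma; -f 1,3 keeps the first and third fields.
selected=$(cut -d ',' -f 1,3 "$workdir/menu.csv")

echo "$selected"
```

The selected fields come out joined by the same delimiter, so the result is still valid CSV with fewer columns.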

Consider the following situation: we run a command and it returns an error because we are in the wrong directory. We want to get to the right directory, but we do not want to retype the whole command. We can use “!” followed by the command name to rerun the most recent command that starts with that name.

Rerun(!):

Example:

head roti.csv # Throws an error because we are in the wrong directory.

cd food # we go to the correct directory

!head # reruns the above head command.

When we want to select the lines containing a specific value, we use the “grep” command.

Photo by S Migaj on Unsplash

grep:

It takes a piece of text followed by one or more filenames and prints all the lines in these files that contain the text.

Example:

grep 'Mamma mia!!' dosa.txt
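As a rough sketch of grep with more than one file, the example below searches two made-up files, south.txt and north.txt. Both names and their contents are assumptions for illustration.

```shell
# Hypothetical sample files.
workdir=$(mktemp -d)
printf 'masala dosa\nidli\n' > "$workdir/south.txt"
printf 'roti\nplain dosa\n' > "$workdir/north.txt"

# Search both files for the text "dosa".
# The subshell changes directory so the output shows short filenames.
matches=$(cd "$workdir" && grep 'dosa' south.txt north.txt)

echo "$matches"
```

When grep is given several files, it prefixes each matching line with the name of the file it came from.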

Sometimes we want to store the result of a command. We can do that easily.

Photo by Lars Kienle on Unsplash

Storing Data:

“>” redirects the output of a command into a file. It is used as follows.

syntax:

head -n 5 food/idli.csv > top.csv

This stores the first 5 lines of idli.csv in the top.csv file.
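One detail worth knowing is that “>” replaces whatever the target file already holds. The sketch below shows this with made-up data; the temporary directory and file contents are assumptions.

```shell
# Hypothetical sample data.
workdir=$(mktemp -d)
printf 'dish,price\nidli,2\ndosa,3\nroti,1\n' > "$workdir/idli.csv"

# Put some earlier content in the target file.
echo 'old contents' > "$workdir/top.csv"

# ">" overwrites top.csv with the first 2 lines of idli.csv.
head -n 2 "$workdir/idli.csv" > "$workdir/top.csv"

saved=$(cat "$workdir/top.csv")
echo "$saved"
```

After the redirect, the line “old contents” is gone and only the output of head remains.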

Photo by Michael Dziedzic on Unsplash

Combining Commands

This is very important because we use this method all the time when working with data in the terminal.

We can combine two or more commands with a pipeline, which uses the “|” (pipe) character, usually found above the Enter key on the keyboard. The output of the command on the left becomes the input of the command on the right.

Example:

head -n 5 food/biriyani.csv | tail -n 3 > top_bottom.csv

This is a very important concept that you should remember. It is one of the concepts that will make your life easier.
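To make it concrete what the head and tail pipeline selects, here is a minimal sketch with seven made-up rows. The row names are placeholders, not real data.

```shell
# Hypothetical file with seven rows.
workdir=$(mktemp -d)
printf '%s\n' row1 row2 row3 row4 row5 row6 row7 > "$workdir/biriyani.csv"

# head keeps rows 1-5, then tail keeps the last 3 of those: rows 3-5.
head -n 5 "$workdir/biriyani.csv" | tail -n 3 > "$workdir/top_bottom.csv"

middle_rows=$(cat "$workdir/top_bottom.csv")
echo "$middle_rows"
```

Chaining head and tail like this is a simple way to pull a slice out of the middle of a file.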

There are times when we want to count the characters, words, or lines in our data. For this, we are going to use the “wc” command.

Photo by Towfiqu barbhuiya on Unsplash

wc:

It is used to count the contents of a file (wc is short for “word count”). By using flags, you can specify what you want to count.

  • -c: bytes (the same as characters for plain ASCII text)
  • -m: characters
  • -w: words
  • -l: lines

Example:

grep "2017-07" seasonal/spring | wc -l

In the above example, I am selecting every line that contains "2017-07" and counting the number of matching lines.
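Here is a small sketch of counting with wc. The file spring.csv and its three records are made up. The extra `tr -d ' '` step is my own addition, because some versions of wc pad the number with spaces.

```shell
# Hypothetical sample data: three dated records.
workdir=$(mktemp -d)
printf '2017-07-01,dosa\n2017-08-02,roti\n2017-07-19,idli\n' > "$workdir/spring.csv"

# Count only the lines that contain "2017-07".
matches=$(grep '2017-07' "$workdir/spring.csv" | wc -l | tr -d ' ')

# Count every line in the file.
total_lines=$(wc -l < "$workdir/spring.csv" | tr -d ' ')

echo "$matches of $total_lines lines match"
```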

Until now, if we wanted to select more than one file, we typed each name out. We can avoid that by using the wildcard character “*”.

Photo by Quentin Rey on Unsplash

Wildcard (*):

Used to select more than one file at a time.

Example:

cut -d, -f 1 seasonal/*.csv

Here I am getting the first field, using a comma as the delimiter, from every .csv file in the seasonal directory.
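The sketch below shows the wildcard expanding to several files. The two files autumn.csv and spring.csv and their contents are assumptions for illustration.

```shell
# Hypothetical directory with two CSV files.
workdir=$(mktemp -d)
printf '2017-09-01,tooth\n2017-09-05,claw\n' > "$workdir/autumn.csv"
printf '2017-03-02,tooth\n' > "$workdir/spring.csv"

# The shell expands *.csv to every matching file, in sorted order,
# before cut runs; cut then reads the files one after another.
dates=$(cut -d, -f 1 "$workdir"/*.csv)

echo "$dates"
```

The expansion is done by the shell, not by cut, so the same pattern works with any command that accepts filenames.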

We can use “>” with a pipeline, but it should appear at the end. For example, suppose we put it in the middle:

cut -d, -f 2 seasonal/*.csv > teeth-only.txt | grep -v tooth

In the above example, all of the output from cut is stored in teeth-only.txt, so nothing is left for grep to read and it prints nothing. Instead, you can do it like this:

cut -d, -f 2 seasonal/*.csv | grep -v tooth > teeth-only.txt

This is how you should combine redirection with pipelines.
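Here is the recommended form, with the redirect at the end, run on made-up data. The file contents are assumptions; the output filename is borrowed from the example above.

```shell
# Hypothetical sample data: the second column holds the value we filter on.
workdir=$(mktemp -d)
printf '2017-09-01,tooth\n2017-09-05,claw\n' > "$workdir/autumn.csv"
printf '2017-03-02,tooth\n2017-03-07,horn\n' > "$workdir/spring.csv"

# cut extracts column 2, grep -v drops the lines containing "tooth",
# and the final ">" stores what is left.
cut -d, -f 2 "$workdir"/*.csv | grep -v tooth > "$workdir/teeth-only.txt"

filtered=$(cat "$workdir/teeth-only.txt")
echo "$filtered"
```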

If you run a command that seems to hang and you cannot run another command, press Ctrl + C to stop it.

Environment Variables:

They are used to store data. Some common variables are HOME and USER.

Use uppercase when defining a variable. It is the common convention.

You can print the variable using the following command.

echo $var_name

We use $ because it tells the shell that we want the value of the variable rather than the literal name. Without it, var_name would be treated as plain text, such as a filename.

We store data much as we do in any programming language, by using the “=” sign, but without any spaces around it.

Example:

TRAINING=seasonal/summer.csv
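The sketch below, assuming bash or a similar shell, shows an assignment and the difference the “$” makes. Assignments in the shell must not have spaces around “=”.

```shell
# No spaces around "=": with spaces, the shell would try to run
# a command named TRAINING instead of assigning a value.
TRAINING=seasonal/summer.csv

# With "$", the shell substitutes the stored value.
with_dollar=$(echo "$TRAINING")

# Without "$", echo just prints the literal text.
without_dollar=$(echo TRAINING)

echo "$with_dollar"
echo "$without_dollar"
```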

for loop:

If we want to repeat a command several times, for example once per file, we can use loops. We particularly use the “for” loop.

Syntax:

for ..var.. in ..list..; do ..body..; done

Example:

for filename in "$@" # "$@" expands to every filename passed to the script
do
    head -n 2 "$filename" | tail -n 1 # the second line of the file
    tail -n 1 "$filename" # the last line of the file
done

You can use indentations if you want for better readability.
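To try the loop outside a script, here is a sketch that loops over a wildcard instead of “$@”. The two CSV files and their contents are made up for illustration.

```shell
# Hypothetical sample files, each with a header and two records.
workdir=$(mktemp -d)
printf 'date,dish\n2017-07-01,dosa\n2017-07-09,idli\n' > "$workdir/spring.csv"
printf 'date,dish\n2017-08-02,roti\n2017-08-15,burger\n' > "$workdir/summer.csv"

# For each file, print its first record (line 2) and its last record.
summary=$(
  for filename in "$workdir"/*.csv
  do
    head -n 2 "$filename" | tail -n 1
    tail -n 1 "$filename"
  done
)

echo "$summary"
```

Quoting "$filename" keeps the loop working even when a filename contains spaces.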

We can use files to store commands. These types of files are called scripts.

Editing a Script:

You should know the following shortcuts to work with these scripts.

To open a script for editing, use the “nano” command.

nano filename

Inside we can use these shortcuts.

Ctrl + K: Delete a line.

Ctrl + U: Un-delete a line.

Ctrl + O: save the file (‘O’ for output)

Ctrl + X: exit the editor.

As I said earlier we can use these scripts to store some commands and reuse them as we please. For your understanding purposes, I am going to show you an example.

nano header.sh

# inside header.sh

grep -v "Tooth" seasonal/spring.csv

Press Ctrl + O and then Enter to save the contents.

Press Ctrl + X to exit the editor.

bash header.sh # runs the commands stored in the file.

Passing Filenames to Script:

Earlier, in the for loop example, I used the special variable “$@”. It stands for all of the arguments, here filenames, passed to the script.

Let’s say we have “unique-dish.sh” which contains the following command in it.

sort "$@" | uniq

When we run

bash unique-dish.sh food/burger.csv

the “$@” inside the file gets replaced with the filename passed, i.e. food/burger.csv. We can give it more than one filename at once.
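Here is a self-contained sketch of passing several filenames to a script. It assumes the script is meant to list each distinct line once, so it pipes sort into uniq. The data files are made up, and the script is written to a temporary directory.

```shell
# Hypothetical sample files with some repeated lines.
workdir=$(mktemp -d)
printf 'dosa\nidli\ndosa\n' > "$workdir/south.txt"
printf 'roti\nidli\n' > "$workdir/north.txt"

# Write the script; the quoted 'EOF' keeps "$@" literal
# instead of expanding it right now.
cat > "$workdir/unique-dish.sh" <<'EOF'
sort "$@" | uniq
EOF

# Both filenames are passed as arguments and take the place of "$@".
unique_dishes=$(bash "$workdir/unique-dish.sh" "$workdir/south.txt" "$workdir/north.txt")

echo "$unique_dishes"
```

uniq only removes repeated lines that sit next to each other, which is why the input is sorted first.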

These are the Unix commands that I wanted to cover. Mind you, there are many more. I mentioned these because they are important when it comes to Data Science.

Some of the other basic commands, such as cd and mkdir, also exist in the Windows command line, so you don’t need to worry about them.

CONCLUSION

The key takeaways are

  • We learned some important commands.
  • We learned some important techniques.
  • We learned how to apply them with examples.

I hope you found this blog helpful. If you liked this blog then I suggest you follow me on Medium and YouTube, for more content on productivity, self-improvement, Coding, and Tech.


Soar your Bet in Data Science Using Unix Cmds was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.

Published via Towards AI