Various Ways to Get Data on Google Colab

Last Updated on July 24, 2023 by Editorial Team

Author(s): Sanchit Tanwar

Originally published on Towards AI.

Google Colab is one of the best platforms for free GPU access, and one of the best places to experiment if you are starting out with deep learning. Apart from the free GPU, it comes preloaded with most of the libraries and frameworks you need for deep learning, so you can write code and run it right away; you most probably won't need to install anything. The one problem I face with Google Colab is that its storage is not persistent. The runtime stops after every 8–10 hours, and the hard drive is flushed when the server stops, so you need to download your data again every time. In this blog, I will go through the ways you can download a dataset on Google Colab, which depend on the source of your dataset, and how to save the dataset to your Drive if you are processing the data, so that you can use it again later. Let's start with the most popular dataset source, Kaggle.


1. Kaggle

Downloading a dataset from Kaggle is the simplest case, because Kaggle provides a CLI that downloads a dataset with a one-line command. These are the steps to follow:

1. Create a Kaggle account. 😅
2. Go to the My Account section.
3. Click the "Create New API Token" button. This downloads the kaggle.json file.
4. Run the code below. It will ask you to upload a file; upload the kaggle.json file you just downloaded.

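import os
from google.colab import files

files.upload()  # opens a file picker; choose kaggle.json
os.system("mkdir -p ~/.kaggle")  # the Kaggle CLI looks for credentials here
os.system("cp kaggle.json ~/.kaggle/")
os.system("chmod 600 ~/.kaggle/kaggle.json")  # keep the token readable only by you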

5. Copy the API command for the dataset you want to download.

You can find it on the dataset page for every dataset.

!kaggle competitions download -c aerial-cactus-identification

Put an exclamation mark before the command; Colab needs it to run Linux shell commands. This downloads the dataset so you can work on it. It will probably arrive as a zip file, so you will have to unzip it.
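The unzip step can also be done from Python. The following is a minimal sketch, assuming the archives were downloaded into the current working directory; the helper name `unzip_all` and the `data` output folder are hypothetical choices for illustration, not part of the Kaggle CLI.

```python
import zipfile
from pathlib import Path


def unzip_all(download_dir=".", extract_root="data"):
    """Extract every .zip file in download_dir into extract_root/<archive name>/.

    Returns the list of folders that were created or reused.
    """
    extracted = []
    for archive in sorted(Path(download_dir).glob("*.zip")):
        # One folder per archive, so files from different zips do not collide.
        target = Path(extract_root) / archive.stem
        target.mkdir(parents=True, exist_ok=True)
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(target)
        extracted.append(target)
    return extracted
```

Calling `unzip_all()` in a Colab cell after the download should leave each archive unpacked under `data/`. Nested zip files inside an archive are not handled by this sketch.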

2. Google Drive

Sometimes you will need to download a dataset or other files (such as a weights file) from Google Drive, which is very easy to do. We can use the google_drive_downloader module directly.

from google_drive_downloader import GoogleDriveDownloader as gdd

gdd.download_file_from_google_drive(file_id='1iytA1n2z4go3uVCwE__vIKouTKyIDjEq', dest_path='./data/mnist.zip', unzip=True)

You can get the file_id for a file in Google Drive from its link-sharing option. This API has one limitation: the file must be shared publicly on the web; otherwise, it will not be downloaded.
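Pulling the file_id out of a sharing link by hand is error-prone, so here is a small sketch that does it with a regular expression. It assumes the link has one of the two common shapes, `.../file/d/<id>/view` or `...?id=<id>`; the function name `extract_file_id` is hypothetical, and other link formats may need extra patterns.

```python
import re

# Assumed link shapes: ".../d/<id>/..." and "...?id=<id>" (or "&id=<id>").
_ID_PATTERNS = (
    re.compile(r"/d/([A-Za-z0-9_-]+)"),
    re.compile(r"[?&]id=([A-Za-z0-9_-]+)"),
)


def extract_file_id(share_link):
    """Return the Google Drive file id embedded in a sharing link."""
    for pattern in _ID_PATTERNS:
        match = pattern.search(share_link)
        if match:
            return match.group(1)
    raise ValueError(f"No file id found in link: {share_link!r}")
```

The returned string can be passed as the `file_id` argument of the download call shown earlier.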

3. For all other sources

There are different ways to do this, and the right one depends on the source. But there is one general solution that works fine for every source. Its only limitations arise when you want to download multiple files, or when the download link is dynamic and changes every time (for example, when you download something from Google Drive to your local system, the download link is new every time).

There is a Firefox extension for this called cliget. (I tried finding an equivalent for Chrome but couldn't; if you know of one, please let me know in the comments.)

Add this extension to your browser and go to the website you want to download data from. As soon as the download popup appears, the extension catches it and generates a curl command that downloads the same data from the CLI. You just need to copy that command and run it in the Google Colab runtime. Don't forget to add '!' before the command.

If your link is dynamic, you will have to do this every time, as the earlier command won't work again.

This is what a typical curl command looks like:

!curl --header 'Host: www.crcv.ucf.edu' --user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:73.0) Gecko/20100101 Firefox/73.0' --header 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8' --header 'Accept-Language: en-US,en;q=0.5' --referer 'https://www.crcv.ucf.edu/data/UCF101.php' --cookie 'sc_is_visitor_unique=rx8721945.1582018638.B9D513E5B1294F3FF5FA6112CC6A1234.; __utma=1.544635313.1582018638.1582018638.1582018638.1; __utmb=; __utmc=1; __utmz=1.1582018638.1.1.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); __utmt_ucfhb=1' --header 'Upgrade-Insecure-Requests: 1' 'https://www.crcv.ucf.edu/data/UCF101/UCF101.rar' --output 'UCF101.rar'

Change the value passed to '--output' if you want to change the file location.
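If you would rather stay in Python than paste a curl command, the same kind of request can be sketched with the standard library. This is only an illustration under assumptions: the helper name `download` is hypothetical, and which headers or cookies a given site actually requires is something you would copy from the generated curl command.

```python
import shutil
import urllib.request


def download(url, output, headers=None):
    """Stream the resource at url into the file named by output.

    headers is an optional dict of request headers, for example the
    Referer or Cookie values taken from a generated curl command.
    """
    request = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(request) as response, open(output, "wb") as out:
        # Copy in chunks so large files never have to fit in memory.
        shutil.copyfileobj(response, out)
    return output


# Example call mirroring the curl command above (not executed here):
# download(
#     "https://www.crcv.ucf.edu/data/UCF101/UCF101.rar",
#     "UCF101.rar",
#     headers={"Referer": "https://www.crcv.ucf.edu/data/UCF101.php"},
# )
```

As with curl, a dynamic link will still need to be refreshed each time; this sketch does not work around that.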

Google Colab provides built-in code to mount Google Drive in the current runtime. You just need to log in with your Google account and enter a key, which is generated automatically. Once the drive is mounted, you only need to use its path to save anything to the drive or read anything from it: files there become readable, and any file you write is automatically uploaded to the drive.
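Here is a minimal sketch of mounting the drive and saving a processed dataset to it. The `drive.mount` call is Colab's built-in helper; the function names `mount_drive` and `save_to_drive`, the `/content/drive` mount point, and the `MyDrive` folder name are assumptions for illustration and may differ in your setup.

```python
import shutil
from pathlib import Path


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return None elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # triggers the login / authorization step
    # Assumption: the user's files appear under the "MyDrive" folder.
    return Path(mount_point) / "MyDrive"


def save_to_drive(src_dir, drive_dir, name):
    """Zip src_dir and write the archive into drive_dir; return its path."""
    drive_dir = Path(drive_dir)
    drive_dir.mkdir(parents=True, exist_ok=True)
    # One archive is usually quicker to sync than many small files.
    archive = shutil.make_archive(str(drive_dir / name), "zip", root_dir=src_dir)
    return Path(archive)
```

In a Colab cell you might call `root = mount_drive()` and then `save_to_drive("data/processed", root / "datasets", "processed")`; both paths here are placeholders. In a later session, mounting the drive again and unzipping that archive restores the processed data without repeating the download.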


Published via Towards AI
