Practical Python: Introduction to DataFrame and Series in Pandas

Last Updated on July 17, 2023 by Editorial Team

Author(s): Peace Aisosa

Originally published on Towards AI.

PYTHON BEGINNER SERIES

Whether you're new to Python or looking to improve your skills, this practical guide will help you understand the basics and start working with these powerful tools.

Image by Gerd Altmann from Pixabay

Hey there! Are you interested in learning about data analysis using Python? Well, you're in the right place! Python has become one of the most sought-after programming languages for data analysis due to its wide range of powerful data processing and manipulation tools. Two of the most popular Python data manipulation tools are the Pandas DataFrame and Series.

In this article, we aim to introduce you to these two amazing tools and show you how to use them to perform practical data analysis. So, whether you're new to Python or have some experience, this article will provide you with a beginner-friendly guide to Pandas DataFrame and Series. Let's get started!

How To Install Pandas In Python

If you want to analyze data using Python, you must install a library called Pandas. There are two common ways to install Pandas: pip or Anaconda.

First, make sure you have Python installed on your computer. If you don't, download the latest version from the official Python website (https://www.python.org/downloads/) and follow the instructions to complete the installation.

Installing Pandas

To install Pandas using pip, follow these steps:

  1. Open your command prompt or terminal.
  2. Upgrade pip to the latest version by typing the following command: pip install --upgrade pip.
  3. Once pip is upgraded, type the following command: pip install pandas.

That's it! Pandas should now be installed on your computer.

Alternatively, we can install Pandas using Anaconda, a popular Python distribution that comes with many data science libraries pre-installed. Here's how to do it:

  • Open the Anaconda Prompt from the Start menu on Windows (on macOS or Linux, open a terminal).
  • Type the following command: conda install pandas.

That's it! Pandas should now be installed on your computer.

Note that the full Anaconda distribution already includes Pandas, because the key data science libraries are downloaded automatically during installation. The conda install command is mainly needed for minimal setups or to add Pandas to a new environment.

Now that we have Pandas installed, we can start using it to manipulate data in Python.
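A quick way to confirm that the installation worked is to import the library and print its version. This is a minimal sketch; the alias pd is the conventional one used in the rest of this article, and the variable name installed_version is only for illustration.

```python
import pandas as pd

# If the import succeeds, Pandas is installed. The version string shows
# which release pip or conda picked up.
installed_version = pd.__version__
print(installed_version)
```

If the import fails with a ModuleNotFoundError, the package was most likely installed into a different Python environment than the one running the script.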

But wait...

What is Pandas?

Pandas is a Python library that provides powerful tools for data analysis. In particular, it provides two important data structures: Series and DataFrame.

What is a Series in Pandas with Example?

A Series is like a list of items, where each item has a label. The label is like a name tag that helps you easily find and access the item. The items in a Series can be of the same type, like a list of numbers, or they can be of different types, like a list of strings and numbers.

For example, here's a simple Series that represents the number of toys that four children have:

import pandas as pd

# Create a Series holding the number of toys each of four children has
toy_count = pd.Series([5, 3, 10, 7], index=['Alice', 'Bob', 'Charlie', 'Dave'])

This code creates a pandas Series object named toy_count with the following values and index:

Alice       5
Bob         3
Charlie    10
Dave        7
dtype: int64

This Series has four items, one for each child, and each has a label representing the child's name. The items in the Series are all numbers, which means they are of the same type.

This means that toy_count is a one-dimensional labeled array with four elements. The elements are integers, each associated with a unique label or index. You can access data in a Series using its index labels. For example, to get the number of toys that Bob has, you can use the following code:

import pandas as pd

# Create a Series holding the number of toys each of four children has
toy_count = pd.Series([5, 3, 10, 7], index=['Alice', 'Bob', 'Charlie', 'Dave'])

# Print the number of toys Bob has
print(toy_count.loc['Bob'])

The code creates a Series of numbers and attaches a label (a name) to each one using the pd.Series() constructor. The print() function then outputs the value associated with the label 'Bob', which is 3:

3
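Label-based lookup is only one way to reach a value. As a small sketch of the alternatives, the same Series can also be read by position with .iloc, and .loc accepts a list of labels. The variable names below are chosen for illustration.

```python
import pandas as pd

toy_count = pd.Series([5, 3, 10, 7], index=['Alice', 'Bob', 'Charlie', 'Dave'])

# Label-based access: look up the value stored under the label 'Bob'
bob_by_label = toy_count.loc['Bob']

# Position-based access: 'Bob' is the second element, so position 1
bob_by_position = toy_count.iloc[1]

# Passing a list of labels returns a smaller Series
subset = toy_count.loc[['Alice', 'Dave']]

print(bob_by_label, bob_by_position)
print(subset)
```

Both lookups return the same value here; .loc is usually the safer choice when the labels carry meaning, because it keeps working if the rows are reordered.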

What is DataFrame in Pandas with Example?

A DataFrame is like a table of data, where each row represents one item or observation, and each column represents a different characteristic or feature of the items. For example, if you have a list of fruits and want to store information about their color, size, and taste, you can create a DataFrame where each row represents a fruit, and each column represents a characteristic.

In a DataFrame, the rows are labeled by index labels (like a label on a file folder), and column names label the columns. The values in each table cell represent the value of the corresponding characteristic for the corresponding item.

For example, here's a simple DataFrame that represents information about three fruits:

import pandas as pd

# DataFrame showing the color, size and taste of different fruits
fruit_data = pd.DataFrame({
    'color': ['red', 'green', 'yellow'],
    'size': ['small', 'medium', 'large'],
    'taste': ['sweet', 'sour', 'sweet']
}, index=['apple', 'lime', 'banana'])

This code creates a Pandas DataFrame object named "fruit_data" containing the following data:

         color    size  taste
apple      red   small  sweet
lime     green  medium   sour
banana  yellow   large  sweet

The DataFrame has three columns named "color", "size", and "taste", and three rows indexed by "apple", "lime", and "banana". The values in the DataFrame represent the color, size, and taste of three different fruits: an apple, a lime, and a banana.

You can access data in a DataFrame using its index and column labels. For example, to get the color of the banana, you can use the following code:

print(fruit_data.loc['banana', 'color'])

This prints yellow, because fruit_data.loc['banana', 'color'] retrieves the value in the 'color' column for the row with the index label 'banana'.
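The same labels can also select a whole column or a whole row rather than a single cell. The sketch below rebuilds the fruit table so that it runs on its own; the names colors and lime_row are illustrative.

```python
import pandas as pd

fruit_data = pd.DataFrame({
    'color': ['red', 'green', 'yellow'],
    'size': ['small', 'medium', 'large'],
    'taste': ['sweet', 'sour', 'sweet']
}, index=['apple', 'lime', 'banana'])

# Selecting one column returns a Series indexed by fruit name
colors = fruit_data['color']

# Selecting one row with .loc returns a Series indexed by column name
lime_row = fruit_data.loc['lime']

print(colors)
print(lime_row)
```

In both cases the result is a Series, which shows how the two data structures fit together: a DataFrame hands back Series whenever you slice it along one dimension.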

Importing Data Into Pandas

Importing data into Pandas is a crucial step in most data analysis projects. With Pandas, you can import data from various sources, such as CSV files, Excel files, SQL databases, JSON files, and web APIs.

There are various ways to import data into Pandas, but we'll focus on read_csv(), read_excel(), read_sql(), and web APIs.

1. Importing CSV Files

CSV (comma-separated values) files are one of the most commonly used file formats for storing and sharing data. Pandas provides the read_csv() function to read CSV files into a DataFrame. To use this function, specify the path to the CSV file.

import pandas as pd

df = pd.read_csv('data.csv')
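To try read_csv() without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is a sketch with made-up sample data; csv_text stands in for the contents of data.csv.

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical sample data)
csv_text = "name,age,city\nJohn,25,New York\nMary,30,Los Angeles\n"

# read_csv() accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df)
print(df.dtypes)
```

The first line of the text is used as the column names, and Pandas infers a type for each column, so age is read as integers while name and city are read as text.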

2. Importing Excel Files

Excel files are another popular file format used for storing and sharing data. Pandas provides the read_excel() function to read Excel files into a DataFrame. To use this function, specify the path to the Excel file and the name of the sheet you want to read. Reading .xlsx files also requires an Excel engine such as openpyxl to be installed.

import pandas as pd

df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

3. Importing Data from SQL Databases

Pandas also provides the read_sql() function to read data from a SQL database into a DataFrame. To use this function, you need to establish a connection to the database and specify the SQL query you want to execute.

import pandas as pd
import sqlite3

conn = sqlite3.connect('database.db')
query = 'SELECT * FROM my_table'

df = pd.read_sql(query, conn)
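The snippet above assumes that database.db and my_table already exist. As a self-contained sketch, the version below builds a small in-memory SQLite database first; the table contents are invented for illustration.

```python
import sqlite3
import pandas as pd

# In-memory database, so the example leaves no file behind
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE my_table (name TEXT, age INTEGER)')
conn.executemany(
    'INSERT INTO my_table (name, age) VALUES (?, ?)',
    [('John', 25), ('Mary', 30)],
)
conn.commit()

# Run the query and load the result set into a DataFrame
df = pd.read_sql('SELECT * FROM my_table', conn)
conn.close()

print(df)
```

The column names of the DataFrame come from the columns returned by the query, so a narrower SELECT produces a narrower DataFrame.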

4. Importing Data from Web APIs

Importing data from web APIs can be simplified using the requests library in Python. The requests library provides functions to send HTTP requests and receive HTTP responses from web APIs.

To import data from a web API using requests, follow these steps:

Import the requests library

import requests

Send an HTTP request to the API using the requests.get() function. This function takes the URL of the API as an argument.

response = requests.get('https://api.example.com/data')

Check the response's status code to ensure the request was successful. A status code of 200 indicates success.

if response.status_code == 200:
    data = response.json()
else:
    print('Error: Failed to retrieve data from API.')

Convert the response data to a Pandas DataFrame using the pd.DataFrame() function.

import pandas as pd

df = pd.DataFrame(data)

Following these steps, you can easily import data from a web API into a Pandas DataFrame using the requests library. This simplifies the process of working with web APIs and allows you to analyze and visualize data from a variety of sources quickly.
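The last step works best when the API returns a list of records. The sketch below skips the network call and uses a hard-coded payload, shaped like what response.json() might return, to show the conversion on its own. The field names are hypothetical.

```python
import json
import pandas as pd

# Hypothetical payload, shaped like what response.json() might return
# from an API that sends back a list of records
payload = json.loads("""
[
  {"id": 1, "name": "John", "city": "New York"},
  {"id": 2, "name": "Mary", "city": "Chicago"}
]
""")

# A list of dictionaries maps directly to rows and columns
df = pd.DataFrame(payload)

print(df)
```

Real APIs often wrap the records in an outer object, in which case you would pass only the part that holds the list to pd.DataFrame().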

Exporting Data From Pandas

Exporting data from pandas is straightforward using the to_*() methods. These methods save pandas objects, such as dataframes and series, to various file formats. Here's a list of some of the most commonly used to_*() methods in pandas:

  • to_csv(): saves data to a CSV file
  • to_excel(): saves data to an Excel file
  • to_json(): saves data to a JSON file
  • to_html(): saves data to an HTML file
  • to_sql(): saves data to a SQL database
  • to_pickle(): saves data to a pickle file

We'll focus on to_csv() and to_excel() methods.
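The other to_*() methods follow a similar pattern. As one brief sketch, to_json() returns the JSON text as a string when it is called without a file path, which makes it easy to inspect the result. The sample data is made up.

```python
import json
import pandas as pd

df = pd.DataFrame({'name': ['John', 'Mary'], 'age': [25, 30]})

# With no path argument, to_json() returns the JSON text as a string.
# orient='records' writes one JSON object per row.
json_text = df.to_json(orient='records')

# Parse the text back into Python objects to check the structure
records = json.loads(json_text)

print(json_text)
print(records)
```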

Exporting Data to CSV

CSV stands for Comma-Separated Values, a file format used to store tabular data. It's a simple format that can be opened in various applications, such as Excel, Google Sheets, and more. To export data to a CSV file using pandas, you can use the to_csv() method.

Here's an example:

import pandas as pd

# create a dataframe
data = {
    'name': ['John', 'Mary', 'Peter', 'Lisa'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# save the dataframe to a CSV file
df.to_csv('my_data.csv', index=False)

In this example, we first create a dataframe with some data. We then use the to_csv() method to save the dataframe to a file called my_data.csv. The index=False argument tells pandas not to include the row index in the exported file.
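To see what index=False changes, the sketch below writes the same DataFrame twice to a temporary file and reads it back each time. The temporary directory is used only so that the example cleans up after itself.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'Mary', 'Peter', 'Lisa'],
    'age': [25, 30, 35, 40],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'my_data.csv')

    # index=False: the file holds only the 'name' and 'age' columns
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

    # Default (index=True): the row labels are written as an extra,
    # unnamed first column
    df.to_csv(path)
    with_index = pd.read_csv(path)

print(list(without_index.columns))
print(list(with_index.columns))
```

When the index is written, reading the file back produces an extra column called 'Unnamed: 0', which is why index=False is a common choice for a default integer index.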

Exporting Data to Excel

Excel is a popular spreadsheet application used by many people. To export data to an Excel file using pandas, you can use the to_excel() method.

Here's an example:

import pandas as pd

# create a dataframe
data = {
    'name': ['John', 'Mary', 'Peter', 'Lisa'],
    'age': [25, 30, 35, 40],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# save the dataframe to an Excel file
df.to_excel('my_data.xlsx', index=False)

In this example, we first create a dataframe with some data. We then use the to_excel() method to save the dataframe to a file called my_data.xlsx. The index=False argument tells pandas not to include the row index in the exported file.

Creating Series and DataFrames

Creating Series and DataFrames in Pandas is simple, and both can be built from various sources, including Python lists, NumPy arrays, dictionaries, or files. Once created, Series and DataFrames can be accessed, modified, and transformed in many ways using the powerful Pandas functions and methods.

Creating a Series in Pandas

There are several ways to create a series in Pandas. Here are some of the most common methods:

Method 1: Creating a Series from a List

The easiest way to create a series is to pass a list to the pd.Series() function. The list will be used to create the series values, and the default index will be a sequence of integers starting from 0.

Here is an example of creating a simple Series using the pd.Series() function:

import pandas as pd

# create a Series of numbers
my_series = pd.Series([10, 20, 30, 40, 50])

print(my_series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

In this example, we created a series of five numbers using the pd.Series() function and passed a Python list [10, 20, 30, 40, 50] as an argument. The resulting Series has default integer index labels from 0 and data type int64.

You can also provide custom index labels and names for the Series using the index and name arguments, respectively:

# create a Series with custom index and name
my_series = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'], name='my_numbers')

print(my_series)

Output:

a    10
b    20
c    30
d    40
e    50
Name: my_numbers, dtype: int64

In this example, we created a Series with custom index labels ['a', 'b', 'c', 'd', 'e'] and name 'my_numbers' using the pd.Series() function.
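A NumPy array can be passed to pd.Series() in the same way as a list. The following is a minimal sketch with invented temperature readings; the index and name arguments work exactly as in the list example.

```python
import numpy as np
import pandas as pd

# A NumPy array of floats (hypothetical temperature readings)
temperatures = np.array([21.5, 23.0, 19.8])

# The Series keeps the array's data type; the index labels are optional
temp_series = pd.Series(temperatures, index=['Mon', 'Tue', 'Wed'], name='temperature')

print(temp_series)
```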

Method 2: Creating a Series from a Dictionary

In Python, a dictionary is a collection of key-value pairs. A Pandas Series is like a table with only one column, and each row has a label that can be used to identify it.

To create a Pandas Series from a dictionary, you can use the pandas.Series() function and pass in the dictionary as an argument. The dictionary's keys will be used as the labels for the rows in the Series, and the dictionary's values will be used as the data in the rows.

For example, if you have a dictionary that represents the number of products sold in a store, you can create a Pandas Series like this:

#Create a dictionary of sales 
sales = {'Product A': 100, 'Product B': 200, 'Product C': 150, 'Product D': 75}

To create a series from this dictionary, we can use the following code:

import pandas as pd

#Create a series from a dictionary

sales_series = pd.Series(sales)
print(sales_series)

This would output:

Product A    100
Product B    200
Product C    150
Product D     75
dtype: int64

Notice that the keys from the dictionary (Product A, Product B, etc.) are used as the index of the Series, and the values from the dictionary (100, 200, etc.) are used as the data. The result is a single column of four values, 100, 200, 150, and 75, whose row labels are Product A, Product B, Product C, and Product D.
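The index argument can be combined with a dictionary to pick out and reorder entries. In the sketch below, 'Product E' is a made-up label that does not appear in the dictionary, to show what happens when a requested label is missing.

```python
import pandas as pd

sales = {'Product A': 100, 'Product B': 200, 'Product C': 150, 'Product D': 75}

# Only the listed labels are kept, in the order given.
# 'Product E' is not in the dictionary, so its value is NaN.
selected = pd.Series(sales, index=['Product B', 'Product A', 'Product E'])

print(selected)
```

Because NaN is a floating-point value, the data type of the resulting Series is float64 rather than int64.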

Creating a DataFrame in Pandas

A DataFrame can be thought of as a collection of Series that share the same index, with each Series forming one column. There are several ways to create a DataFrame in Pandas.
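To make the link between the two structures concrete, here is a small sketch that builds a DataFrame directly from two Series. The names ages, cities, and people are chosen for illustration.

```python
import pandas as pd

ages = pd.Series([25, 28, 30], index=['John', 'Sarah', 'Michael'])
cities = pd.Series(['New York', 'Paris', 'London'], index=['John', 'Sarah', 'Michael'])

# Each dictionary key becomes a column name; each Series becomes that column
people = pd.DataFrame({'age': ages, 'city': cities})

# Pulling a column back out returns a Series again
age_column = people['age']

print(people)
print(type(age_column))
```

The Series are matched up by their index labels, so the rows of the DataFrame are labeled John, Sarah, and Michael.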

Method 1: Creating a DataFrame from a Dictionary

A DataFrame can be created from various data sources, including a dictionary. To create a DataFrame from a dictionary, the dictionary should have keys as column names and values as the corresponding column data. The following is an example of a dictionary that can be used to create a DataFrame:

data = {'name': ['John', 'Sarah', 'Michael', 'Jessica'],
        'age': [25, 28, 30, 24],
        'city': ['New York', 'Paris', 'London', 'Sydney']}

Here, the dictionary keys ('name', 'age', 'city') represent the column names, and the corresponding values represent the column data.

To create a DataFrame from this dictionary, we first need to import the Pandas library:

import pandas as pd

Then, we can use the pd.DataFrame() method to create a DataFrame from the dictionary:

#Create a dataframe named 'df'
df = pd.DataFrame(data)

This will create a DataFrame df with the following contents:

      name  age      city
0     John   25  New York
1    Sarah   28     Paris
2  Michael   30    London
3  Jessica   24    Sydney

The first column of the output shows the index, which Pandas generates automatically.

We can also specify the index labels by passing a list of values to the index parameter. For example:

data = {'name': ['John', 'Sarah', 'Michael', 'Jessica'],
        'age': [25, 28, 30, 24],
        'city': ['New York', 'Paris', 'London', 'Sydney']}
index = ['a', 'b', 'c', 'd']
df = pd.DataFrame(data, index=index)

This will create a DataFrame with the specified index labels:

      name  age      city
a     John   25  New York
b    Sarah   28     Paris
c  Michael   30    London
d  Jessica   24    Sydney

Creating a DataFrame from a dictionary in Python is a simple process using the Pandas library. By passing a dictionary as an argument to the pd.DataFrame() method, we can create a two-dimensional tabular data structure with the dictionary keys representing the column names and the values representing the corresponding column data.

Method 2: Creating a DataFrame from a List of Dictionaries

Creating a DataFrame from a list of dictionaries is a common way to convert raw data into a DataFrame. This method involves creating a list of dictionaries where each dictionary represents a row in the DataFrame, and the keys in the dictionary represent the column names.

Let's walk through an example to illustrate this process. Suppose we have the following data:

data = [{'Name': 'John', 'Age': 32, 'City': 'New York'},
        {'Name': 'Jane', 'Age': 27, 'City': 'Los Angeles'},
        {'Name': 'Mike', 'Age': 41, 'City': 'Chicago'},
        {'Name': 'Sarah', 'Age': 29, 'City': 'San Francisco'}]

In this example, we have a list of four dictionaries where each dictionary represents a person's information, including their name, age, and city.

To convert this data into a DataFrame, we can use the pandas.DataFrame() function and pass in our list of dictionaries as an argument. Here's how we can do it:

import pandas as pd

df = pd.DataFrame(data)
print(df)

When we run this code, we get the following output:

    Name  Age           City
0   John   32       New York
1   Jane   27    Los Angeles
2   Mike   41        Chicago
3  Sarah   29  San Francisco

As you can see, each dictionary in our list corresponds to a row in the DataFrame, and the keys in each dictionary correspond to the column names in the DataFrame. In current versions of Pandas, the columns appear in the order in which the keys occur in the dictionaries; older releases sorted them alphabetically.

We can also specify the order of the columns by passing in a list of column names as an argument to the pd.DataFrame() function. For example:

import pandas as pd

df = pd.DataFrame(data, columns=['Name', 'Age', 'City'])
print(df)

This will give us the following output:

    Name  Age           City
0   John   32       New York
1   Jane   27    Los Angeles
2   Mike   41        Chicago
3  Sarah   29  San Francisco
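Raw records are not always complete. As a brief sketch of how this method handles gaps, the second record below (hypothetical data) has no 'City' key, and Pandas fills the missing cell with NaN.

```python
import pandas as pd

# Hypothetical records: the second one has no 'City' key
records = [
    {'Name': 'John', 'Age': 32, 'City': 'New York'},
    {'Name': 'Jane', 'Age': 27},
]

df = pd.DataFrame(records)

# The missing value shows up as NaN in the 'City' column
print(df)
print(df['City'].isna().tolist())
```

The isna() check confirms which rows are missing a city, which is often the first step of data cleaning.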

Pandas is a valuable tool for any data scientist or analyst, and its extensive documentation and active community make it easy to learn and use. The DataFrame and Series are the two primary data structures in Pandas, which allow you to store, manipulate, and analyze data in various ways. By mastering these data structures and their associated methods, you can easily perform various data analysis tasks, such as data cleaning, manipulation, and visualization.

Published via Towards AI
