The Art Behind Data Understanding and Its Importance

Last Updated on April 16, 2025 by Editorial Team

Author(s): Karina Patel

Originally published on Towards AI.

What nature and technical skills should a person have to get a good understanding of data?

To understand data effectively, a person must develop specific skills and traits beyond technical expertise alone. Individuals with strong technical skills often still struggle to deliver insights, while those from diverse backgrounds, once they acquire the necessary technical knowledge, can excel in the field. My observation is that a person first needs to cultivate these essential traits to become a good Data Analyst or Data Scientist:

  • Curiosity: Choose or work in a domain that interests you. Curiosity will drive you to seek deeper, hidden insights.
  • Judgment & Critical Thinking: Being slightly judgmental (in an analytical sense) helps you determine where to start. An opinionated approach lets you form hypotheses and know what to look for in the data.
  • Healthy Discussion & Debate: Discussing and challenging your hypotheses can surface new perspectives and better analytical approaches, helping you optimize your work and deliver results more efficiently.
  • Open-mindedness: Keeping an open mind enhances Exploratory Data Analysis (EDA) and pattern recognition, allowing you to discover unexpected trends.
  • Reasoning & Interpretation: The ability to extract meaningful insights and construct logical explanations from data is crucial for making informed decisions.

The takeaway is that cultivating an analytical mindset is just as important as having strong technical skills to deliver high-quality insights. Technical skills serve as the tool to implement the vision and patterns you identify.

Importance of Data Understanding to Gain an Analytical Advantage

Data-driven decision-making relies on a strong understanding of data, which plays a crucial role in analysis. A deep grasp of data helps validate its authenticity, identify anomalies, and determine key fields for cleaning, transformation, and modeling. It also guides the selection of the right analytical approach, leading to more accurate decision-making. Additionally, a solid understanding of data enhances storytelling, which is essential for communicating insights to non-technical business stakeholders.

Now, let’s explore how to simplify data understanding for complex datasets. What should you focus on, and how can you define a feasible approach for analysis? Raw data can be overwhelming at times, but the right perspective can make it more manageable.

Step 1: Understanding the Data Backbone
What is a data backbone?

To understand data properly, start by identifying its source. Determine where the data originates, what software, server, or storage location is used, and how different platforms impact data quality. Each tool has its pros and cons, making it essential to assess data authenticity early. This foundational knowledge will also aid in the data-wrangling phase.

Step 2: Understanding Dataset Attributes

Gain a deep understanding of the dataset’s attributes, their relevance to the business, and how information is structured. Developing domain-specific knowledge will enhance your ability to model data effectively and align it with business requirements. This step is crucial for logic-building and ensuring meaningful analysis.

Step 3: Validate Data Consistency and Timeliness

  • Data Consistency: Ensure the data follows expected rules. For example, a person’s age cannot be negative, and a name should not contain numerical values. Which consistency checks to apply varies by domain and dataset.
  • Data Timeliness: Verify that the data is up-to-date and relevant for analysis. Outdated information can lead to inaccurate insights and poor decision-making.
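The two checks above can be sketched in pandas. This is a minimal illustration with a hypothetical customer table; the column names, rules, and one-year staleness cutoff are assumptions you would replace with your domain's own.

```python
import pandas as pd

# Hypothetical customer records; column names are illustrative.
df = pd.DataFrame({
    "name": ["Alice", "B0b", "Carol"],
    "age": [34, -5, 29],
    "updated_at": pd.to_datetime(["2025-03-01", "2023-01-15", "2025-04-01"]),
})

# Consistency: age must be non-negative, names must not contain digits.
bad_age = df[df["age"] < 0]
bad_name = df[df["name"].str.contains(r"\d")]

# Timeliness: flag records not refreshed within the last year
# (assumed cutoff, relative to an assumed "today").
cutoff = pd.Timestamp("2025-04-16") - pd.DateOffset(years=1)
stale = df[df["updated_at"] < cutoff]

print(len(bad_age), len(bad_name), len(stale))  # 1 1 1
```

Each filter returns the offending rows, so the same pattern can feed a data-quality report instead of a print statement.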

Step 4: Defining the Problem Statement

Identify the key problems you need to solve and the questions data should answer. Collaborate with stakeholders to gain business insights and align expectations. Clearly defining the problem statement is critical, as it sets the foundation for choosing the right analytical approach.

Step 5: Identifying and Handling Noisy Data

Understand what qualifies as noise: irrelevant, random, inconsistent, or erroneous data. This includes data entry errors, missing values, duplicates, and redundant information. Identifying and managing noise is crucial for applying statistical techniques, selecting meaningful features, and implementing machine learning models.

Example: Invalid codes, duplicate records, or negative transaction amounts may indicate errors in some datasets. However, in fintech data, negative amounts might be relevant as they can represent debits.
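A rough sketch of these noise checks in pandas, using a hypothetical transactions table. Note that, as in the fintech example, whether negative amounts count as noise is a domain decision, not a rule the code can make for you.

```python
import pandas as pd

# Hypothetical transactions; values chosen to exhibit each kind of noise.
df = pd.DataFrame({
    "txn_id": [1, 2, 2, 3, 4],
    "amount": [120.0, -30.0, -30.0, None, 55.0],
})

duplicates = df[df.duplicated()]        # exact duplicate rows (keeps first copy)
missing = df[df["amount"].isna()]       # missing values
negatives = df[df["amount"] < 0]        # errors in retail data, valid debits in fintech

print(len(duplicates), len(missing), len(negatives))  # 1 1 2
```

From here you would decide per column whether to drop, impute, or keep the flagged rows.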

Data Differentiation

Data can be classified into two types: qualitative and quantitative.

Qualitative Data
Qualitative data is non-numerical, usually textual information gathered from sources such as interview transcripts, group discussions, opinions, remarks, notebooks, maps, observations, and opinion-scale responses (agree, disagree, neutral, status).

The following are the two types of qualitative data:

Nominal Data
A type of data that cannot be ordered and carries no quantitative information, but can be grouped into categories.

Examples

  1. Demographic Information
    Gender, Nationality, Eye Color, Blood type, Religion, Movie genre, Employment status, personality type
  2. Geographic Data
    Country, City, Region, Postal Codes, Climate Zones, Landforms, Languages by Country
  3. Organizational Data
    Office locations, Job functions, Office Hierarchy Levels, Employee Status, Feedback types, Employee benefits, Work Locations, Training types, Asset Categories, Shift Patterns, Business Units, Meeting types
  4. Product/Service Data
    Electronic type, Food Categories, Vehicle types, Streaming Services, Banking Services, Tourism Services, Home Appliances, Books, Software types, Accessories
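Because nominal categories have no order, the usual numeric representation is one-hot encoding rather than an ordered scale. A minimal sketch with a hypothetical blood-type column:

```python
import pandas as pd

# Hypothetical nominal column: no ordering among blood types.
df = pd.DataFrame({"blood_type": ["A", "O", "B", "O"]})

# One indicator column per category; no category is "greater" than another.
encoded = pd.get_dummies(df["blood_type"], prefix="blood")
print(sorted(encoded.columns))  # ['blood_A', 'blood_B', 'blood_O']
```

Assigning arbitrary integers (A=1, O=2, ...) would invent an order the data does not have, which is exactly what one-hot encoding avoids.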

Ordinal Data
This type of data is qualitative and, unlike nominal data, has a meaningful order among its categories. However, the differences between values are not uniform or measurable. Ordinal data can even be numeric (such as ranks), but the numbers lack fixed arithmetic relationships.

Examples

  1. Survey & Feedback Data
    Customer Satisfaction Surveys can have different satisfaction levels. The difference between "Neutral" and "Satisfied" is not necessarily equal to the difference between "Dissatisfied" and "Neutral."
  2. Education Level Data
    Education levels have a natural ranking (High School < Bachelor’s < Master’s < PhD). However, the time required to complete each level may vary.
  3. Economic & Social Class Data
    In income categories like Low, Middle, and High Income, "Low Income" is less than "Middle Income," which is less than "High Income." However, the exact income difference between categories is not fixed.
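The satisfaction-survey example above can be represented directly in pandas with an ordered categorical type: comparisons respect the ranking even though no arithmetic on the gaps is implied. The level names here are assumed for illustration.

```python
import pandas as pd

# Hypothetical satisfaction levels; the order is meaningful, the gaps are not.
levels = ["Dissatisfied", "Neutral", "Satisfied"]
s = pd.Series(
    ["Neutral", "Satisfied", "Dissatisfied"],
    dtype=pd.CategoricalDtype(levels, ordered=True),
)

# Ordering comparisons work even though the values are not numeric.
print((s > "Neutral").tolist())  # [False, True, False]
print(s.cat.codes.tolist())      # [1, 2, 0]
```

The integer codes encode rank only; averaging or subtracting them would wrongly assume equal spacing between levels.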

Quantitative Data
It refers to numerical values representing measurable quantities and can be analyzed using mathematical and statistical techniques. It describes how much, how many, or how often something occurs, making it essential for objective analysis and decision-making.

The following are the two types of quantitative data:

Discrete Data
It consists of countable, distinct, and separate values that cannot be meaningfully subdivided. The values need not be whole numbers, but they come from a fixed set defined by the domain: a shoe size can be 7.5, for example, but not 7.76.

Examples

  1. Education Data
    Number of Students in the Classroom, Number of teachers, Number of Books, Number of Subjects, Number of times a student was absent.
  2. Business & Sales Data
    Number of Products sold per day, Number of customers visiting a store daily, Number of transactions made in a day, Number of employees in a company, Number of complaints received by customer service.
  3. Transportation Data
    Number of buses arriving at a stop per hour, Number of cars in a parking lot, Number of flights departing from an airport per day, Number of red lights a driver encounters in a trip.
  4. Healthcare & Medical Data
    Number of patients visiting a clinic daily, Number of surgeries performed in a hospital per week, Number of nurses in a hospital ward, Number of vaccines given at a health center per month.
  5. Finance & Banking Data
    Number of transactions in a bank account per month, Number of credit cards owned by a person, Number of ATMs in a city, Number of loans approved per day in a bank, Number of checks deposited in a branch per day.
  6. Social Media & Technology Data
    Number of likes on a social media post, Number of followers on an Instagram account, Number of messages sent in a chat group, Number of times a video was shared, Number of notifications received in a day.

Continuous Data
Continuous data is quantitative data that can take any value within a given range, whether whole numbers, decimals, or fractions. It represents measurable quantities that can always be subdivided into smaller units, so you can dig deeper and measure to whatever precision is meaningful for the analysis.

Examples

  1. Healthcare Data
    Height of a person, Weight of a person, Body Temperature, Blood pressure levels, Cholesterol level in blood, Sugar level in blood.
  2. Time & Speed Data
    Time taken to complete a 400-meter race, Time taken by an F1 racer to complete one lap, Time spent over a phone call, Reaction time in milliseconds.
  3. Temperature & Weather Data
    Daily Temperature in the city on different days, Humidity level percentage, wind speed in a storm, and Air pressure in the atmosphere.
  4. Finance & Banking Data
    Stock Market Price fluctuations, Interest Rate on a Bank Loan, Amount of money withdrawn from an ATM, Gold price per gram, and Total Revenue generated by a business.
  5. Geography & Environmental Data
    Depth of ocean at various points, Area of different locations, Volume of water in a lake, Average elevation of a mountain range, Height of a mountain peak.
  6. Music & Audio Data
    Duration of a song, Bitrate of an audio file, Frequency of different sound waves, Volume levels in decibels, Tempo of a song in beats per minute

Ask your data questions and see which category it falls into: is it qualitative or quantitative, nominal or ordinal, discrete or continuous? A clear understanding of these distinctions helps guide the right approach to data preprocessing and analysis.
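As a first rough cut at this classification, a column's dtype separates quantitative from qualitative fields; the finer splits (nominal vs. ordinal, discrete vs. continuous) still require domain judgment. A small sketch with an assumed mixed dataset:

```python
import pandas as pd

# Hypothetical mixed dataset.
df = pd.DataFrame({
    "city": ["NY", "LA"],          # qualitative (nominal)
    "num_orders": [3, 7],          # quantitative (discrete)
    "temperature": [21.5, 18.2],   # quantitative (continuous)
})

# dtype gives the qualitative/quantitative split automatically.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()
print(quantitative, qualitative)  # ['num_orders', 'temperature'] ['city']
```

Treat this only as a starting point: an integer ZIP code, for instance, would land in the quantitative bucket despite being nominal.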

Below are Python data profiling libraries that can help you better understand your dataset. These libraries provide statistical summaries and interactive visual representations for deeper insights.

Pandas Profiling

This is one of my go-to options for a quick review of a dataset. Pandas Profiling (the package has since been renamed ydata-profiling) helps you identify missing values, generate a correlation matrix, and analyze relationships between columns.

# pip install ydata-profiling
# (the pandas-profiling package has been renamed to ydata-profiling)

from ydata_profiling import ProfileReport
profile = ProfileReport(df)
profile.to_notebook_iframe()  # or profile.to_file("report.html")

Skimpy

Skimpy is a lightweight Python library that prints summary statistics and offers automated data-cleaning helpers. It is handy when you want a quick, statistics-focused overview in the console rather than a full profiling report.

# pip install skimpy

from skimpy import skim
skim(df)

SweetViz

SweetViz is an open-source Python library that generates high-density visualizations in an interactive HTML report. It helps streamline the Exploratory Data Analysis (EDA) process by providing insights quickly. The library also offers flexibility, allowing for comparisons between test and train datasets.

# pip install sweetviz

import sweetviz as sv

my_report = sv.analyze(df)
my_report.show_html()  # writes "SWEETVIZ_REPORT.html" by default

AutoViz

AutoViz provides quick insights into your dataset and streamlines the Exploratory Data Analysis (EDA) process. It also assesses data quality and offers flexibility with custom parameters for generating exploratory visualizations.

# pip install autoviz

from autoviz.AutoViz_Class import AutoViz_Class

AV = AutoViz_Class()
dft = AV.AutoViz("data.csv")  # loads the file and renders exploratory plots

Data understanding is an iterative process. As new insights emerge, you will have to revisit some of the steps.

Hope this helps and makes your data understanding process better! 😊

Stay tuned for more insightful approaches, findings, and detailed tutorials on data✨
