Essential Python Libraries for Data Science
Author(s): Raj Kumar
Originally published on Towards AI.

In Part 1, we built the foundations of the data pipeline. We loaded a real dataset, structured it using Pandas, selected relevant features, and performed numerical transformations using NumPy. By the end of Step 5 of the end-to-end workflow introduced in Part 1, the data had moved through its most critical early stages: cleaning, selection, and normalization. At that point, the dataset was technically ready for modeling.
However, production-grade data science does not move directly from normalized data to models.
Before any algorithm is introduced, there is an essential intermediate step that determines whether models will be stable, explainable, and trustworthy. That step is visualization and diagnostics.
This second part continues directly from where Part 1 ended, using the same variables and the same notebook state. No data is reloaded, no preprocessing is repeated, and no assumptions are reset. The normalized feature set created in Step 5 is now treated as the source of truth, exactly as it would be in a production data pipeline.
The goal of Part 2 is not to create charts for presentation. It is to understand data behavior before modeling. We will examine feature distributions, inspect relationships between variables, identify redundancy and correlation, and validate whether earlier transformations behaved as intended. These steps act as a quality gate between data preparation and machine learning.
By the end of this part, we will have clear answers to questions whose answers models silently assume but never verify. Are the distributions well-behaved? Are certain features encoding the same information? Are there patterns or anomalies that will affect downstream learning?
Only after these questions are answered does it make sense to move into classical machine learning, which is where Part 3 will begin.
Transition to Analysis
With the normalized dataset already prepared in Part 1, the next step is to inspect it visually, starting with basic distribution analysis.
End-to-End Example (Continued): Visualization and Diagnostics
In Part 1, we reached a stable, normalized dataset stored in X_normalized_df. At this point, the data is numerically prepared but still unexamined. Before introducing any machine learning model, we need to understand how this data behaves.
The following steps extend the same workflow, operating on the same dataset and variables, without reloading or reprocessing anything.
Matplotlib — Distribution Analysis
Once data has been cleaned, selected, and numerically transformed, the first diagnostic step is understanding how individual features behave in isolation. Summary statistics provide useful signals, but they rarely reveal the full shape of a distribution.
This is where distribution analysis becomes essential.
Matplotlib remains the default tool for this stage in many production environments. Not because it is visually impressive, but because it produces static, reproducible plots that are easy to review, archive, and audit. In regulated or long-lived systems, this reliability matters more than interactivity.
At this point in the workflow, the objective is straightforward: verify that normalization behaved as expected and inspect whether the feature distribution contains skewness, heavy tails, or unexpected concentration.
Step 6: Distribution Analysis Using Matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
plt.hist(X_normalized_df["mean radius"], bins=30)
plt.title("Distribution of Normalized Mean Radius")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

This plot answers several questions immediately. Is the distribution centered around zero after normalization? Does it resemble a roughly bell-shaped curve, or is it skewed? Are there extreme values that may influence model behavior?
These are not cosmetic concerns. Many machine learning algorithms implicitly assume well-behaved feature distributions. Catching deviations early reduces instability later.
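The same questions can be checked numerically alongside the histogram. The sketch below is a minimal, self-contained illustration: it uses a synthetic standardized column as a stand-in for `X_normalized_df["mean radius"]` (that variable lives in the Part 1 notebook state), and computes the center, spread, and skewness that the plot shows visually:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a normalized feature column from Part 1
rng = np.random.default_rng(42)
col = pd.Series(rng.normal(loc=0.0, scale=1.0, size=500), name="mean radius")

# Quantify what the histogram shows: center, spread, and asymmetry
center = col.mean()
spread = col.std()
skewness = col.skew()  # ~0 for a symmetric, roughly bell-shaped distribution

print(f"mean={center:.3f} std={spread:.3f} skew={skewness:.3f}")
```

A mean near zero and a skew near zero support the "centered and roughly symmetric" reading; a large skew value flags exactly the heavy tails the histogram is meant to reveal.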
Seaborn — Feature Relationships and Redundancy
While univariate distributions reveal individual behavior, real modeling problems emerge from relationships between features. Features may be correlated, redundant, or encode similar information under different names.
Seaborn is particularly effective at this stage because it layers statistical context on top of Matplotlib. It allows relationships to be inspected visually without requiring formal modeling.
Pairwise plots are often the first place where teams realize that multiple features move together almost perfectly.
Step 7: Feature Relationship Analysis Using Seaborn
import seaborn as sns
sns.pairplot(
    X_normalized_df[["mean radius", "mean texture", "mean area"]],
    diag_kind="kde",
)

These plots make linear relationships, clusters, and overlaps immediately visible. In many real-world datasets, this is where feature redundancy becomes obvious. Several variables may appear distinct conceptually but behave almost identically numerically.
Identifying this early simplifies downstream modeling, improves interpretability, and reduces instability in later stages.
Correlation Diagnostics
Pair plots provide intuition, but correlation analysis provides structure. It quantifies relationships that visual inspection suggests and highlights dependencies across the entire feature set.
Correlation diagnostics are not about blindly removing features. They are about making trade-offs explicit. Highly correlated features increase variance, complicate explanation, and can destabilize certain models if left unchecked.
A correlation heatmap provides a compact, system-level view of these relationships.
Step 8: Correlation Diagnostics
plt.figure(figsize=(6, 4))
sns.heatmap(
    X_normalized_df.corr(),
    cmap="coolwarm",
    annot=False,
)
plt.title("Feature Correlation Heatmap")
plt.show()

Blocks of high correlation often indicate groups of features encoding similar information. These insights directly influence feature selection strategies in the next phase.
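Those high-correlation blocks can also be extracted programmatically rather than read off the heatmap. This is a sketch using a small synthetic frame in place of `X_normalized_df`; the 0.9 threshold is an illustrative choice, not a fixed rule:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for X_normalized_df: "a" and "b" are nearly redundant
rng = np.random.default_rng(0)
a = rng.normal(size=300)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=300),  # almost a copy of "a"
    "c": rng.normal(size=300),                  # independent feature
})

# Keep only the upper triangle so each pair is reported once
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Pairs above an illustrative redundancy threshold of 0.9
redundant = [
    (row, col, upper.loc[row, col])
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > 0.9
]
print(redundant)
```

The resulting list of pairs is exactly the input a feature selection step needs: explicit candidates for consolidation, rather than an impression from a color map.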
Plotly — Interactive Exploration During Diagnostics
Static plots are ideal for validation and documentation. However, during exploratory diagnostics, analysts often need to interact with the data. Zooming into dense regions, inspecting outliers, and exploring nonlinear patterns are easier with interactive tools.
Plotly serves this purpose well. Even when its outputs are not included in final reports, interactive exploration often shapes modeling intuition and helps validate assumptions before training begins.
Step 9: Interactive Exploration Using Plotly
import plotly.express as px
fig = px.scatter(
    X_normalized_df,
    x="mean radius",
    y="mean area",
    title="Mean Radius vs Mean Area (Normalized)",
)
fig.show()

Interactive exploration helps confirm whether observed relationships hold across the full data range or are driven by a small number of extreme observations. This distinction matters when choosing models, evaluation strategies, and feature handling approaches.
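One quick numerical companion to this visual check is to compare Pearson correlation, which is sensitive to extreme values, with rank-based Spearman correlation, which is not. A large gap between the two suggests the relationship is driven by a handful of outliers. The sketch below uses synthetic data rather than the dataset from Part 1:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = rng.normal(size=200)  # no real relationship between x and y...
x[:3] += 10
y[:3] += 10               # ...except three extreme points

df = pd.DataFrame({"x": x, "y": y})
pearson = df["x"].corr(df["y"], method="pearson")
spearman = df["x"].corr(df["y"], method="spearman")

# A large Pearson-Spearman gap suggests the correlation is outlier-driven
print(f"pearson={pearson:.2f} spearman={spearman:.2f}")
```

Here Pearson reports a sizable correlation while Spearman stays near zero, which is precisely the "relationship held up by a few extreme observations" pattern the interactive scatter helps you spot.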
Diagnostic Summary
After completing visualization and diagnostics, the dataset should no longer feel abstract. Its structure, behavior, and limitations should be visible.
This stage is not about drawing conclusions. It is about recording observations that will guide modeling decisions.
Step 10: Diagnostic Summary Before Modeling
### Diagnostic Observations
- Some features show strong linear relationships
- Normalization centers distributions effectively
- Certain features may be redundant for modeling
These observations act as inputs to modeling, not results. They guide feature selection, algorithm choice, and evaluation strategy in the next part.
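The second observation, that normalization centers distributions effectively, can be verified with a couple of lines. The sketch below assumes z-score standardization (consistent with the centered-around-zero check in Step 6) and uses synthetic raw features in place of the Part 1 dataset:

```python
import numpy as np

# Synthetic raw features as a stand-in for the Part 1 dataset
rng = np.random.default_rng(7)
X = rng.normal(loc=[14.0, 19.0], scale=[3.5, 4.0], size=(400, 2))

# z-score standardization: subtract the column mean, divide by the column std
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# After standardization, each column should have mean ~0 and std ~1
means = X_norm.mean(axis=0)
stds = X_norm.std(axis=0)
print(means.round(6), stds.round(6))
```

Turning the bullet point into an assertion like this makes the diagnostic summary reproducible rather than anecdotal.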
Why This Stage Matters Before Modeling
Skipping visualization does not save time. It only defers understanding until failures appear during training or in production. Diagnostics act as a quality gate, ensuring that modeling decisions are informed rather than reactive.
Because this analysis operates on the same normalized dataset created in Part 1, continuity is preserved throughout the workflow. There are no hidden assumptions, no reprocessing, and no divergence between analysis and modeling.
This end-to-end continuation reflects how real data science systems are built. Each step builds on the previous one without restarting or reshaping the pipeline. By the end of Part 2, the dataset is not only prepared, but understood. That understanding significantly reduces the risk of unstable models, misleading metrics, and unexpected behavior as systems move closer to production.
Closing Thoughts
Visualization and diagnostics are not optional steps or presentation exercises. They are investigative practices that protect data science systems from silent failure. By inspecting distributions, relationships, and dependencies before modeling, teams surface risks early and make informed trade-offs instead of reactive fixes.
Because this work continues directly from the dataset prepared in Part 1, it reflects how real workflows evolve in practice. Nothing is reset, nothing is hidden, and assumptions are carried forward deliberately. That continuity is what turns exploratory analysis into production-ready work.
If this perspective resonates with how you build or evaluate data science systems, consider clapping, leaving a comment to share how you approach diagnostics in your own projects, or following the series as it moves from foundations into modeling and production patterns. Thanks!
Transition to Part 3
With feature behavior validated and relationships understood, the data is now ready for modeling.
In Part 3, we will continue in this same notebook and introduce classical machine learning using scikit-learn, focusing on pipelines, baseline models, and evaluation strategies grounded in the data we now understand.
Published via Towards AI
Note: Article content contains the views of the contributing authors and not Towards AI.