

Cookiecutter Data Science: A Standardized, Flexible Approach for Modern Data Projects

Author(s): Abinaya Subramaniam

Originally published on Towards AI.

In the ever-evolving world of data science, one of the biggest challenges isn’t the algorithms or tools; it’s project organization. Whether you are working solo or collaborating with a team, maintaining a clean, reproducible, and scalable project structure can make or break your workflow. Enter Cookiecutter Data Science (CCDS), a framework designed to provide a logical, flexible, and reasonably standardized structure for data science projects.

Cookie Cutter — Image by Author

What is Cookiecutter Data Science?

Cookiecutter Data Science is not just a template; it’s a philosophy for organizing your data projects. At its core, it’s a project skeleton that ensures your analysis is reproducible, maintainable, and easy for others to understand. By following CCDS conventions, you can:

  • Reduce confusion when revisiting old projects
  • Make collaboration easier with standardized structures
  • Focus on analysis and modeling rather than figuring out where files should go

Think of it as the Rails of data science: just as web developers use standard frameworks to save time and improve consistency, data scientists can use CCDS for structured workflows.

Why Use Cookiecutter Data Science?

Data science projects are messy by nature. We often explore data in unpredictable ways, experiment with new models, and iterate rapidly. Without a standardized structure, you may find yourself asking questions like:

  • Which notebook should I run first?
  • Where did the raw data come from?
  • Which file contains the final model predictions?

A well-defined structure solves these problems by providing:

  1. Clarity for collaborators: anyone joining the project can immediately understand the workflow.
  2. Reproducibility: helps you or others reproduce results months or years later.
  3. Separation of concerns: organizes raw data, processed data, models, notebooks, and reports in dedicated folders.
  4. Ease of scaling: makes it simpler to expand projects or integrate new datasets and models.

Explore more at the official website: https://cookiecutter-data-science.drivendata.org/

Getting Started with Cookiecutter Data Science

Installation

CCDS v2 requires Python 3.9+ and is best installed with pipx, which isolates the installation in its own environment:

pipx install cookiecutter-data-science

Alternatively, you can install via pip or, soon, conda.

Starting a New Project

Starting a new project is as simple as running:

ccds

You can optionally specify a template:

ccds https://github.com/drivendataorg/cookiecutter-data-science

The CLI will prompt for details like:

1. project_name

project_name (project_name):
This is the human-readable name of your project. It usually has spaces and capitalization.

Best Practice:

  • Use descriptive, clear names that explain the purpose.
  • Example: Sales Forecasting Analysis
  • Avoid overly short or vague names like “Project1” or “Analysis”.

2. repo_name

repo_name (churn_prediction):
This is the name of your Git repository and the project folder.

Best Practice:

  • Use snake_case (all lowercase with underscores) for readability in URLs and code.
  • Keep it short but descriptive.
  • Example: sales_forecasting_analysis
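If you want the repo name to follow mechanically from the project name, a tiny helper can do the conversion. This is an illustrative snippet, not part of the CCDS CLI:

```python
import re

def to_repo_name(project_name: str) -> str:
    """Convert a human-readable project name to a snake_case repo name.

    Illustrative helper, not something CCDS provides.
    """
    # Lowercase, then replace any run of non-alphanumeric characters with "_"
    slug = re.sub(r"[^a-z0-9]+", "_", project_name.lower())
    return slug.strip("_")

print(to_repo_name("Sales Forecasting Analysis"))  # sales_forecasting_analysis
```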

3. module_name

module_name (churn_prediction):
This is the Python module/package name where your source code lives.

  • This folder will contain scripts for processing data, modeling, and visualization.

Best Practice:

  • Use lowercase letters, no spaces.
  • Make it descriptive of the project domain.
  • Example: sales_forecasting
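A quick way to sanity-check a candidate module name is to ask Python itself whether it is a legal, lowercase identifier; again, this is an illustrative check rather than something the CCDS CLI runs for you:

```python
import keyword

def is_valid_module_name(name: str) -> bool:
    """Check that a proposed module name is importable, lowercase Python.

    Illustrative check, not part of the CCDS CLI.
    """
    return (
        name.isidentifier()              # letters, digits, underscores; no leading digit
        and not keyword.iskeyword(name)  # not a reserved word like "class"
        and name == name.lower()         # lowercase by convention (PEP 8)
    )

print(is_valid_module_name("sales_forecasting"))  # True
print(is_valid_module_name("Sales Forecasting"))  # False: space and capitals
```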

4. author_name

author_name:
The name of the person or organization responsible for the project.

  • This appears in metadata files and documentation.

Best Practice:

  • Use your full name or organization name
  • Example: Abinaya Subramaniam

5. description

description:
A brief summary of what the project does.

Best Practice:

  • Keep it concise but meaningful (1–2 sentences).
  • Example: Predict future sales for retail stores using historical data and machine learning models.

6. python_version_number

python_version_number (3.10):
The Python version used for this project.

Best Practice:

  • Use a recent, stable version (3.10 or 3.11).

7. dataset_storage

Select dataset_storage
1 - none
2 - azure
3 - s3
4 - gcs

Where the raw and processed datasets will be stored.

Best Practice:

  • For local/small projects, none is okay (store in data/raw folder).
  • For cloud-based projects, choose the appropriate storage (S3, Azure, GCS).

8. environment_manager

1 - virtualenv
2 - conda
3 - pipenv
...

How you will manage Python dependencies for the project.

Best Practice:

  • virtualenv is simple and works for most small to medium projects.
  • For data-heavy projects, conda is a better fit: it manages the Python version itself along with non-Python dependencies.

9. dependency_file

1 - requirements.txt
2 - pyproject.toml
3 - environment.yml
...

Which file will list all dependencies.

Best Practice:

  • For pip/virtualenv: requirements.txt
  • For conda: environment.yml
  • For more modern Python packaging: pyproject.toml
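For a pip/virtualenv project, for instance, requirements.txt is just a pinned list of dependencies. A minimal illustrative example (these package versions are placeholders, not what CCDS generates):

```text
pandas==2.2.2
numpy==1.26.4
matplotlib==3.9.0
scikit-learn==1.5.0
```

Anyone can then recreate the environment with pip install -r requirements.txt.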

10. pydata_packages

1 - none
2 - basic

Whether to include basic PyData packages (like pandas, numpy, matplotlib) in the initial setup.

Best Practice:

  • Choose basic for almost all projects unless you want a very minimal setup.
  • Example: 2 - basic

11. testing_framework

1 - none
2 - pytest
3 - unittest

Select the testing framework for automated testing.

Best Practice:

  • pytest is widely used in Python projects; it’s flexible and makes tests easy to write.
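As a taste of what lives in a tests/ folder, here is a minimal pytest-style test for a hypothetical feature helper; both the function and the expectation are invented for illustration:

```python
# tests/test_features.py -- a minimal pytest-style test (hypothetical helper)

def add_lag_feature(values, lag=1):
    """Return the series shifted by `lag`, padding the front with None."""
    return [None] * lag + values[:-lag]

def test_add_lag_feature():
    # Shifting by one: the first value has no predecessor
    assert add_lag_feature([1, 2, 3]) == [None, 1, 2]
```

Running pytest from the project root discovers and executes any test_*.py files automatically.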

12. linting_and_formatting

1 - ruff
2 - flake8+black+isort

Code quality and formatting tools to ensure readable and consistent code.

Best Practice:

  • flake8+black+isort is a long-established combo; ruff is a newer, much faster tool that can cover linting, import sorting, and formatting on its own.
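Either choice is configured in the project’s config files; if you pick ruff, a pyproject.toml section along these lines is typical (the values here are illustrative):

```toml
[tool.ruff]
line-length = 99

[tool.ruff.lint.isort]
known-first-party = ["sales_forecasting"]
```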

13. open_source_license

1 - No license file
2 - MIT
3 - BSD-3-Clause

Best Practice:

  • For open-source work, MIT or BSD-3-Clause are permissive, widely used choices.
  • For internal company projects, no license file is fine.

14. docs

1 - mkdocs
2 - none

Whether to include documentation setup using MkDocs.

Best Practice:

  • For professional projects: mkdocs is great for generating readable documentation.
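Choosing mkdocs gives you a docs/ setup driven by a small YAML file. A minimal sketch of such a configuration (the site name and pages are illustrative):

```yaml
site_name: Sales Forecasting Analysis
nav:
  - Home: index.md
  - Getting started: getting-started.md
```

Running mkdocs serve then previews the documentation site locally.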

15. include_code_scaffold

1 - Yes
2 - No

A code scaffold is a starter template for your project that comes with pre-built folders and scripts for common tasks in a data science workflow, like loading data, creating features, training models, making predictions, and visualizing results.

It saves time, enforces a clean and organized structure, and helps you follow best practices from the very beginning, so you can focus on analysis rather than setting up files from scratch.

Once completed, you’ll have a fully structured project ready to go.

Directory Structure

A typical Cookiecutter Data Science project has the following structure:

├── LICENSE
├── Makefile
├── README.md
├── data
│   ├── raw
│   ├── interim
│   ├── processed
│   └── external
├── docs
├── models
├── notebooks
├── pyproject.toml
├── references
├── reports
│   └── figures
├── requirements.txt
├── setup.cfg
└── <module_name>
    ├── __init__.py
    ├── config.py
    ├── dataset.py
    ├── features.py
    ├── modeling
    │   ├── __init__.py
    │   ├── train.py
    │   └── predict.py
    └── plots.py
  • data/ – Organizes your raw, processed, and intermediate datasets.

In a Cookiecutter Data Science project, the data/ folder is organized to keep datasets clean and manageable. The raw/ folder contains the original, unmodified data exactly as you received it, while the external/ folder stores data from third-party sources, like public datasets or vendor-provided files, that your project depends on.

As you work with the data, any transformed or intermediate datasets are saved in interim/, which are temporary files created during cleaning, feature engineering, or other preprocessing steps. The processed/ folder holds the final, cleaned, and ready-to-use datasets that are used for modeling or analysis, ensuring a clear separation between raw input, temporary work, and final outputs.

  • notebooks/ – Houses your Jupyter notebooks in a numbered and descriptive format for easy tracking.
  • models/ – Stores trained models and serialized predictions.
  • reports/ – Contains generated reports, figures, or dashboards.
  • <module_name>/ – The main source code for processing, feature engineering, modeling, and visualization.
Project Structure — Image by Author
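Inside the module, config.py conventionally pins down project paths once so every script resolves data the same way. A sketch in that spirit (the constant names are illustrative, mirroring the tree above):

```python
from pathlib import Path

# Resolve the project root relative to this file: <module_name>/config.py
PROJ_ROOT = Path(__file__).resolve().parents[1]

DATA_DIR = PROJ_ROOT / "data"
RAW_DATA_DIR = DATA_DIR / "raw"
INTERIM_DATA_DIR = DATA_DIR / "interim"
PROCESSED_DATA_DIR = DATA_DIR / "processed"
EXTERNAL_DATA_DIR = DATA_DIR / "external"

MODELS_DIR = PROJ_ROOT / "models"
REPORTS_DIR = PROJ_ROOT / "reports"
FIGURES_DIR = REPORTS_DIR / "figures"
```

Scripts elsewhere in the module can then write, say, RAW_DATA_DIR / "sales.csv" instead of hard-coding relative paths that break when run from another directory.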

Best Practices Encouraged by CCDS

  1. Reproducibility is key: using requirements.txt, pyproject.toml, and version control ensures others can replicate your results.
  2. Clear separation of concerns: keep raw data untouched, isolate processing scripts, and clearly separate modeling from reporting.
  3. Readable, maintainable code: a consistent structure encourages code that others (or future you) can follow.
  4. Flexibility: CCDS doesn’t impose rigid rules; you can adjust folder names, add modules, or use different packages as needed.
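The project’s Makefile is one place these practices show up concretely: common steps get named targets so anyone can rerun them. A hedged sketch (the target and module names are illustrative, not the exact generated file):

```makefile
requirements:   ## Install Python dependencies
	pip install -r requirements.txt

data:           ## Turn raw data into processed datasets
	python -m sales_forecasting.dataset

train: data     ## Train the model on the processed data
	python -m sales_forecasting.modeling.train
```

With this in place, make train both prepares the data and trains the model in one reproducible command.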

As Ralph Waldo Emerson famously said, “A foolish consistency is the hobgoblin of little minds.”

CCDS promotes consistency within your project, while still allowing flexibility for unique workflows.

Why CCDS is a Game-Changer
Imagine revisiting a project after six months. Without a structured framework like CCDS, you might encounter a maze of poorly named notebooks, raw data scattered across folders, and multiple conflicting scripts. CCDS solves this by providing a standardized organization that allows you to quickly identify where raw, processed, and external data are located, run notebooks in the correct order, and locate trained models and visualizations without guessing.

This structure not only saves time but also reduces stress and increases confidence in your analysis, making your data science projects far easier to manage and reproduce.

Conclusion
Cookiecutter Data Science is far more than a simple template: it’s a logical and flexible framework that turns messy, experimental projects into organized, reproducible, and scalable workflows. By adopting CCDS, you simplify your own work, make collaboration smoother, and ensure that even your future self can easily understand and build upon your projects.

Whether you are a solo analyst or part of a team working on large-scale data projects, CCDS provides a solid foundation for professional, well-structured data science work, helping you focus on insights and analysis rather than project chaos.
