Training Session 1: Project Scaffolding Standards for AI/ML Projects

Objective

By the end of this session, participants will understand the importance of standardized project structures, their advantages, and how to adapt directory structures for different project types (e.g., Python projects, Gen AI applications, production-ready ML pipelines).

1. Introduction: The Importance of Project Scaffolding Standards

Overview

Welcome to the first session of our training series! In this session, we will focus on one of the most fundamental aspects of AI/ML software development: establishing and adhering to standardized project structures. A strong foundation ensures successful collaboration, reproducibility, and scalability, setting the stage for high-quality code and professional workflows.


Why Standardized Project Structures Matter

In AI/ML projects, we work in a multidisciplinary environment that often includes:

  • Data Scientists: Focused on creating models and analyzing data.
  • ML Engineers: Responsible for scaling and deploying models.
  • Data Engineers: Handling data pipelines and storage.

Collaboration Across Roles

Each role brings unique challenges, making collaboration and consistency essential. A standardized directory structure acts as a common language, helping the entire team work efficiently.


Goals of Project Scaffolding

  1. Organization
    Avoid chaos by keeping files and folders structured, making navigation intuitive.

  2. Collaboration
    Ensure team members can work on the same project without confusion about file locations or naming conventions.

  3. Reproducibility
    Guarantee that others (or even your future self!) can reproduce experiments and workflows seamlessly.

  4. Scalability
    Design projects that can grow, transitioning from proof-of-concept to production without major restructuring.

Pro Tip

Begin every project with a scaffolding template that supports your team's needs. Customizing a standard template for specific workflows can save time and reduce errors.


Common Challenges Without Standardization

Without a standardized project scaffold:

  • Onboarding New Team Members:
    • Takes longer, as they need to learn the structure of every new project.
  • Debugging and Maintenance:
    • Becomes harder as team members struggle to locate key files or understand dependencies.
  • Collaboration Friction:
    • Mismatched expectations around file organization slow down progress.
  • Code Reusability:
    • Is hampered, as inconsistent structures make it harder to integrate components across projects.

Potential Pitfalls

A lack of standardization leads to disorganized codebases, delayed debugging, and higher onboarding time for new members.


What We’ll Cover Today

  1. The Importance of Standardization: Why a structured approach is critical for team projects.
  2. Advantages of a Standardized Structure: Benefits for organization, collaboration, reproducibility, and scalability.
  3. Types of Directory Structures: Differences between Python utilities, Gen AI applications, and production ML pipelines.
  4. Exploring Our Standard: Walkthrough of the directory structure we’ll use as a template.
  5. Interactive Activity: Hands-on exercise to restructure a disorganized project.

Interactive Activity

Be prepared to engage in a practical exercise where you’ll apply the principles of standardization to a real-world example. Bring your questions for a live Q&A at the end of the session!


Key Takeaways

  • Standardized project scaffolding is the foundation for professional, collaborative AI/ML development.
  • It enhances productivity, reduces confusion, and sets your project up for long-term success.
  • The skills you’ll learn today will improve how you work within teams and across roles.

You're Ready to Build Smarter

By implementing these principles, you'll streamline workflows, foster collaboration, and ensure your projects are built for scalability and reproducibility.

2. The Importance of Standardized Directory Structures

Introduction

A standardized directory structure is the backbone of any successful AI/ML project. It ensures that all team members, regardless of their role, can collaborate effectively, replicate experiments, and scale solutions for production. Let’s break this down by exploring the benefits in terms of collaboration, reproducibility, and scalability.


1. Collaboration

When multiple team members—data scientists, ML engineers, and data engineers—work on the same project, clarity is key. A standardized directory structure promotes seamless collaboration by:

  • Providing Clear File Locations:
    • Everyone knows where to find code, data, configurations, and documentation.
    • Reduces time wasted searching for files or asking teammates for clarification.

Practical Advice for File Locations

Use self-explanatory directory names such as data/, src/, docs/, and tests/ to make file locations intuitive and reduce friction.

  • Minimizing Miscommunication:
    • Clear separation of concerns (e.g., raw vs. processed data, scripts vs. tests) eliminates confusion over file ownership or purpose.
    • Agreed-upon standards reduce the risk of overwriting each other's work.

Avoid Overwriting Work

Always use version control systems like Git to prevent accidental overwrites, especially when multiple people are working on shared files.

  • Enabling Parallel Development:
    • Teams can work on different aspects of the project (e.g., data preprocessing, model training, and deployment) without stepping on each other's toes.

Example: Parallel Development

Imagine a team where one member preprocesses data in the data/ directory while another works on model training in src/models/. A standardized structure ensures no one disrupts another's workflow.


2. Reproducibility

Reproducibility is a cornerstone of AI/ML projects, especially when experiments need to be validated or results are handed off to other teams. Standardization supports reproducibility by:

  • Ensuring Consistent Experiment Replication:
    • Clearly organized data/ directories (e.g., raw, processed) ensure that the same inputs produce the same outputs.
    • Version-controlled configuration files (settings.yaml, config.cfg) lock down parameters for experiments.

Configuration Best Practices

Store all experiment parameters in a central config/ directory. Use YAML or JSON files for clarity and machine readability.
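
As a minimal sketch of this practice (assuming a hypothetical config/experiment.yaml file; the parameter names are illustrative), a single helper can load the parameters so every script reads the same values:

import yaml

def load_experiment_config(path: str = "config/experiment.yaml") -> dict:
    """Load experiment parameters from a YAML file in the config/ directory."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

# config/experiment.yaml might contain, for example:
#   learning_rate: 0.001
#   batch_size: 32
#   random_seed: 42
config = load_experiment_config()
print(config["learning_rate"], config["batch_size"])

Because every script reads the same version-controlled file, rerunning an experiment only requires checking out the matching configuration.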

  • Providing a Clear Workflow:

    • Scripts are logically separated (e.g., scripts/preprocess_data.py, scripts/train_model.py), ensuring that the process is documented and repeatable.
  • Facilitating Debugging:

    • When something breaks, team members can easily trace issues back to specific scripts or data folders.

Streamline Debugging

Use meaningful file and function names to make tracing errors easier (e.g., train_model.py instead of script1.py).


3. Scalability

Projects often start as small proofs of concept but need to evolve into production-ready systems. A standardized structure allows your project to scale by:

  • Future-Proofing for Growth:
    • Clear separation of modules (e.g., src/, tests/, docs/) ensures the project can handle new features without becoming unmanageable.

Plan for Growth

Avoid creating monolithic scripts like a single main.py file. Modularize early to make scaling manageable.

  • Simplifying Transitions to Production:

    • Production-ready components like Dockerfile, deployment scripts, and CI/CD workflows can be easily integrated without restructuring the entire project.
  • Accommodating Team Expansion:

    • A consistent structure ensures new team members can quickly onboard and contribute without disrupting existing workflows.

Smooth Onboarding

A well-structured project enables new team members to start contributing within days instead of weeks.


Practical Examples

  1. Collaboration:
    • Imagine two data scientists working on feature engineering and model training. If both use different folder names or scatter files across the project, merging their work becomes chaotic. A standardized structure ensures they can work independently while aligning with the broader project.

Example: Collaboration

A shared data/ directory for raw and processed data allows team members to independently work on preprocessing and modeling without overlap.

  2. Reproducibility:
    • A research project involving multiple experiments benefits greatly from a clear config/ directory where all experiment parameters are stored. Another team can pick up the same configuration and rerun the experiments without confusion.

Centralized Experiment Configurations

Store all experiment settings in the config/ directory to ensure easy replication of workflows.

  3. Scalability:
    • A POC chatbot developed in a single main.py script can grow into a production system with modules for retrievers/, generators/, and evaluation/. Standardization ensures this evolution is smooth.

Scale with Confidence

Begin with a directory structure that anticipates growth, even if the initial project is small.


Key Takeaways

  • Collaboration: Reduces confusion and improves teamwork by providing clarity.
  • Reproducibility: Ensures that experiments and workflows can be repeated and validated.
  • Scalability: Prepares the project for future growth, whether it’s transitioning to production or adding team members.

Ready to Succeed

By implementing a standardized directory structure, your project will be set up for success across all phases of development.


Next Up: Let’s explore how different types of projects (e.g., Python scripts, Gen AI applications, production ML pipelines) require variations in directory structures.

3. Common Pitfalls of Poor Project Structures

Introduction

While the benefits of a standardized project structure are clear, the consequences of neglecting this foundation can lead to significant challenges. Poor project organization not only hampers productivity but can also create long-term problems for collaboration, debugging, and reproducibility. In this section, we’ll highlight the key pitfalls of poor project structures.


1. Lack of Reproducibility

Reproducibility is essential in AI/ML projects, especially when results need to be validated, shared, or reproduced at a later date. Without a clear structure:

  • Experiment Replication Becomes Impossible:
    • Raw data might be overwritten or mixed with processed data, leading to inconsistent results.
    • Missing or hardcoded configurations make it difficult to replicate an experiment.

Risk of Data Loss

Overwriting raw data or failing to separate preprocessing steps can make experiments irreproducible and lead to permanent loss of critical workflows.

  • Code and Data Mismatches:
    • Scripts might be scattered across folders without clear versioning or alignment with datasets.
    • Dependencies might not be documented, causing issues when trying to rerun old workflows.

Example:

Replication Failure

A team member runs an experiment but doesn’t save the preprocessing script separately from the training script. Months later, when the experiment needs to be replicated, the preprocessing steps are lost, rendering the results unverifiable.


2. Increased Debugging Time

When projects lack clear organization, debugging becomes a frustrating and time-consuming process:

  • Difficult to Locate Issues:
    • Without a dedicated logs/ folder or clear script separation, tracing errors to their origin takes significantly longer.

Use a Logs Directory

Always include a logs/ folder to store detailed error and execution logs for easier debugging.
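
A lightweight way to do this is to point Python's standard logging module at the logs/ folder (a sketch; the pipeline.log file name is an assumption):

import logging
import os

def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to the console and to logs/pipeline.log."""
    os.makedirs(log_dir, exist_ok=True)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # Avoid adding duplicate handlers on repeated calls.
        file_handler = logging.FileHandler(os.path.join(log_dir, "pipeline.log"))
        formatter = logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s")
        file_handler.setFormatter(formatter)
        logger.addHandler(file_handler)
        logger.addHandler(logging.StreamHandler())
    return logger

logger = get_logger(__name__)
logger.info("Preprocessing started")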

  • Unclear Code Ownership:
    • If scripts are dumped in a single folder, it’s challenging to determine which script is responsible for what functionality.

Avoid Script Clutter

Organize scripts into meaningful categories (e.g., data_processing/, model_training/) to ensure code ownership and clarity.

  • Duplication and Confusion:
    • Duplicate or outdated scripts might exist, leading to confusion about which file to debug.

Example:

Debugging Delays

A model training script depends on data preprocessing, but due to an unclear structure, the preprocessing script is accidentally overwritten. Debugging the resulting error wastes hours of the team’s time.


3. Difficulties Onboarding Team Members

A poorly structured project can discourage new team members and significantly delay their productivity:

  • Longer Ramp-Up Time:
    • New contributors spend excessive time trying to understand where things are located and how they work.

Documentation Tip

Provide a README.md file that outlines the directory structure and includes instructions for setup and usage.

  • Knowledge Silos:
    • If only one person understands the project layout, they become a bottleneck for progress.

Silo Risk

Avoid creating knowledge silos by documenting workflows and maintaining a clear project structure.

  • Higher Error Risk:
    • New team members might inadvertently disrupt workflows by modifying the wrong files or folders.

Example:

Onboarding Struggles

A new hire joins a project but finds that scripts, data, and documentation are mixed in a single folder. They unintentionally delete critical files while trying to clean up the structure, delaying the project.


Key Takeaways

  • Lack of reproducibility undermines trust in results and creates unnecessary rework.
  • Increased debugging time leads to wasted effort and frustration.
  • Difficulties onboarding new members slow down team productivity and create bottlenecks.

Streamlined Projects Save Time

A well-organized project structure reduces debugging time, enhances reproducibility, and accelerates onboarding for new team members.


Next Up: Let’s explore how different types of AI/ML projects require tailored directory structures and how our standardized approach addresses these pitfalls.

4. Exploring Different Directory Structures

Introduction

In this section, we explore how directory structures adapt to different types of projects, beginning with simple Python projects, such as utility scripts or smaller-scale tools, and how to keep them maintainable, reusable, and professional. While these projects may start small, a well-organized data/ directory improves data handling, reproducibility, and scalability, and pairing it with a notebooks/ directory supports experimentation and development workflows for data professionals.


4.1 Python Project Structure for Simple Scripts and Utilities

When to Use This Structure

This structure is ideal for:

  • Small projects, such as data manipulation utilities or command-line tools.
  • Projects involving datasets, even small ones, that require preprocessing or feature engineering.
  • Initial exploration and experimentation using Jupyter notebooks.
  • Prototypes or proofs of concept before scaling into larger applications.

Use Case Highlight

This structure is particularly useful for projects that might grow beyond their initial scope, as it lays a solid foundation for scaling.


Here’s the recommended directory structure for simple projects, including a data/ section:

project-name/
├── project_name/                 <- Source code for the project.
│   ├── __init__.py               <- Marks the directory as a Python package.
│   ├── main.py                   <- Main script or entry point of the project.
│   └── utils.py                  <- Helper functions used by `main.py`.
├── data/                         <- Data used in the project.
│   ├── raw                       <- Original, immutable data dump.
│   ├── external                  <- Data from third-party sources.
│   ├── interim                   <- Intermediate data, partially processed.
│   ├── processed                 <- Fully processed data, ready for analysis.
│   └── features                  <- Engineered features ready for model training.
├── notebooks/                    <- Jupyter notebooks for exploration and experimentation.
│   ├── data_exploration.ipynb    <- Notebook for data exploration.
│   ├── prototyping.ipynb         <- Notebook for prototyping and initial analysis.
│   └── README.md                 <- Guidelines for using the notebooks.
├── tests/                        <- Unit and integration tests.
│   ├── __init__.py               <- Marks the directory as a package for testing.
│   ├── test_main.py              <- Tests for `main.py`.
│   └── test_utils.py             <- Tests for `utils.py`.
├── .gitignore                    <- Files and directories to ignore in Git.
├── pyproject.toml                <- Project configuration for dependencies and tools.
├── README.md                     <- Description and instructions for the project.
└── LICENSE                       <- License file for open-source projects.

Quick Setup

Use a template generator like cookiecutter to quickly initialize this structure and maintain consistency across projects.


Detailed Explanation

1. Data Directory (data/)

This directory organizes all data used in the project, ensuring a clear workflow from raw data to processed features. Standard subdirectories include:

  • raw/:

    • Contains the original data that is immutable and serves as the source of truth.
    • Example: .csv or .json files downloaded from a database or external source.
  • external/:

    • Holds data from third-party sources, such as APIs or shared datasets.
    • Example: Pre-trained embeddings, external .zip files.
  • interim/:

    • Stores intermediate data that has been partially processed.
    • Example: Data after initial cleaning or aggregation but before full preprocessing.
  • processed/:

    • Contains fully processed data that is ready for analysis or modeling.
    • Example: Cleaned and structured data in .parquet or .csv formats.
  • features/:

    • Holds engineered features for model training.
    • Example: Feature matrices in .npy or .pkl formats.

Avoid Data Overwrites

Never overwrite raw data. Always save intermediate transformations and processed outputs in separate directories to maintain reproducibility.


2. Integration with notebooks/

The data/ directory complements the notebooks/ directory:

  • Use data/raw/ for initial exploration in data_exploration.ipynb.
  • Save interim results in data/interim/ for reuse across notebooks and scripts.
  • Generate final datasets in data/processed/ and data/features/ for downstream tasks like modeling.

Efficient Workflow

Start by loading data from data/raw/ into a data_exploration.ipynb notebook. Save cleaned outputs to data/interim/ for use in a prototyping.ipynb notebook focused on feature engineering.


3. Integration with Source Code (project_name/)

The data/ directory interacts directly with source code:

  • utils.py: Functions for loading and saving data from specific directories.
  • main.py: Accesses processed data or features for the core functionality.

Example utility functions:

import os
import pandas as pd

def load_raw_data(file_name: str) -> pd.DataFrame:
    """Loads a raw data file from the `data/raw/` directory."""
    file_path = os.path.join("data", "raw", file_name)
    return pd.read_csv(file_path)

def save_processed_data(df: pd.DataFrame, file_name: str):
    """Saves a processed DataFrame to the `data/processed/` directory."""
    file_path = os.path.join("data", "processed", file_name)
    os.makedirs(os.path.dirname(file_path), exist_ok=True)  # Create data/processed/ if it does not exist yet.
    df.to_csv(file_path, index=False)

Reusable Functions

Centralize data loading and saving functions in utils.py to avoid redundancy and ensure consistent directory handling.
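
For instance, main.py can then stay focused on core logic and simply compose these helpers (a sketch that assumes the utility functions above and a hypothetical customers.csv file in data/raw/):

from project_name.utils import load_raw_data, save_processed_data

def run() -> None:
    """Load raw data, apply a simple transformation, and persist the result."""
    df = load_raw_data("customers.csv")              # Reads data/raw/customers.csv.
    df = df.dropna()                                 # Placeholder for real preprocessing logic.
    save_processed_data(df, "customers_clean.csv")   # Writes to data/processed/.

if __name__ == "__main__":
    run()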


Examples of Projects with data/ and notebooks/

1. Data Cleaning Tool

Use Case: Clean and preprocess raw customer data, with notebooks for exploration and a structured data workflow.

Directory structure:

data-cleaner/
├── data/
│   ├── raw/
│   ├── external/
│   ├── interim/
│   ├── processed/
│   └── features/
├── notebooks/
│   ├── data_exploration.ipynb
│   └── prototyping.ipynb
├── data_cleaner/
│   ├── __init__.py
│   ├── main.py
│   └── cleaning_utils.py

2. Feature Engineering Project

Use Case: Engineer features for predictive modeling, with raw data, interim transformations, and a focus on creating reusable features.

Directory structure:

feature-engineer/
├── data/
│   ├── raw/
│   ├── interim/
│   ├── processed/
│   └── features/
├── notebooks/
│   ├── feature_exploration.ipynb
│   └── prototyping.ipynb
├── feature_engineer/
│   ├── __init__.py
│   ├── main.py
│   └── feature_utils.py


Key Takeaways

  • The data/ directory introduces a clear workflow for handling raw, interim, processed, and engineered datasets.
  • Combined with notebooks/, this structure supports iterative experimentation and scalable development.
  • Even simple projects benefit from modular data organization, improving reproducibility and collaboration.

Foundation for Success

A structured directory ensures clarity, reproducibility, and scalability, enabling your project to grow seamlessly.


Next Up: Let’s explore how directory structures adapt for Gen AI applications and production-ready ML pipelines.

4.2 Exploring Standards for Organizing Python Source Code

Overview

Organizing Python source code effectively is crucial for maintainability, scalability, and collaboration. This section compares two commonly used directory structures: the Flat Directory Approach and the Nested src/ Directory Approach, highlighting their advantages, disadvantages, and recommended use cases.


1. Flat Directory Approach

Overview

The source code is located directly under the top-level directory, inside a folder named after the project.

project-name/
├── project_name/                 <- Source code for the project.
│   ├── __init__.py               <- Marks the directory as a Python package.
│   ├── main.py                   <- Main script or entry point of the project.
│   └── utils.py                  <- Helper functions used by `main.py`.
Advantages
  • Simplicity:
    • Easier to navigate and set up for small projects.
    • Ideal for beginners and small utility projects.

Beginner-Friendly Setup

The flat directory structure is perfect for those new to Python projects, as it requires minimal configuration.

  • No Additional Layers:
    • Directly accessible from the project root, making it easy to run scripts or modules without configuring paths.
  • Intuitive for Small Projects:
    • Reduces overhead when the project scope is limited, such as single-purpose utilities.
Disadvantages
  • Risk of Name Collisions:
    • The top-level package shares its name with the project directory, leading to potential import issues (e.g., importing project_name may conflict with the folder name).

Name Collision Risk

Avoid naming your package the same as the project directory to prevent import conflicts.

  • Scaling Limitations:
    • Mixing source code with top-level directories like data/, notebooks/, and tests/ can become unwieldy as the project grows.
When to Use
  • Best for:
    • Small, single-purpose projects that are unlikely to expand significantly.
    • Scripts or utilities where simplicity is a priority.

2. Nested src/ Directory Approach

Overview

The source code is placed inside a src/ directory, which contains the main package folder.

project-name/
├── src/project_name/             <- Source code for the project.
│   ├── __init__.py               <- Marks the directory as a Python package.
│   ├── main.py                   <- Main script or entry point of the project.
│   └── utils.py                  <- Helper functions used by `main.py`.
Advantages
  • Clear Separation:
    • Distinguishes source code from other directories, such as data/, tests/, and notebooks/, ensuring a cleaner and more professional project structure.
  • Avoids Import Conflicts:
    • Prevents Python from accidentally importing modules from the top-level directory, reducing the risk of name collisions.
  • Better for Larger Projects:
    • Encourages scalability and organization when the project involves multiple packages or modules.

Scalable and Professional

The src/ directory approach is ideal for large or production-level projects, offering a clean, scalable structure.

Disadvantages
  • Added Complexity:
    • Requires configuring the Python path to include src/ (e.g., setting PYTHONPATH or using tools like Poetry or pytest to handle paths).

Extra Configuration Needed

Ensure proper path management to avoid import errors when using a src/ directory.
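
With Poetry, for example, the package location can be declared in pyproject.toml so that imports resolve after poetry install (a minimal sketch; the project and package names are placeholders):

[tool.poetry]
name = "project-name"
version = "0.1.0"
description = "Example project using the src layout."
authors = ["Your Name <your.email@example.com>"]
# Tell Poetry that the importable package lives under src/.
packages = [{ include = "project_name", from = "src" }]

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"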

  • Overhead for Small Projects:
    • The additional layer might feel unnecessary for simple utilities or one-off scripts.
When to Use
  • Best for:
    • Medium to large-scale projects where clean separation of source code is critical.
    • Projects intended for production or distribution as Python packages.

Comparison of the Two Approaches

| Aspect | Flat Directory | Nested src/ Directory |
| --- | --- | --- |
| Complexity | Simple, beginner-friendly | Slightly more complex to set up |
| Scalability | Limited scalability | Highly scalable for large projects |
| Risk of Import Issues | Higher due to name collisions | Low, avoids conflicts |
| Use Case | Small utility projects | Medium to large projects |
| Professionalism | Perceived as less formal | Perceived as more professional |

Recommendations

Choosing the Right Structure

Select a directory structure based on your project's size, scope, and future growth potential.

Flat Directory Approach:
  • Best for:
    • Small projects with a narrow focus (e.g., data cleaning tools, simple CLI utilities).
    • Prototyping or proof-of-concept work.
  • Avoid if:
    • The project is expected to grow significantly.
    • The name of the package risks conflicting with the project directory.
Nested src/ Directory Approach:
  • Best for:
    • Production-level projects involving multiple components or packages.
    • Projects requiring clean, scalable structures.
  • Avoid if:
    • The project scope is very small, and the extra complexity is unnecessary.

Key Takeaways

Summary

  • The Flat Directory Approach is simple and quick for small projects but can lead to import conflicts and clutter in larger projects.
  • The Nested src/ Directory Approach provides better separation and scalability, making it the preferred choice for professional or production-level projects.
  • Align your directory structure with your project’s scope and long-term goals to ensure maintainability and growth.

4.3 Production-Ready ML Project Structure

Introduction

Building a production-ready ML project requires a structure that supports model development, deployment, scalability, and maintenance. This structure is optimized for collaboration across roles—data scientists, ML engineers, and data engineers—and is designed to accommodate CI/CD workflows, robust testing, and operationalized machine learning pipelines.


Use Cases

Production-ready ML project structures are ideal for:

  • End-to-End Machine Learning Pipelines:
    • Managing data ingestion, preprocessing, training, evaluation, and deployment.
  • Collaboration Between Teams:
    • Facilitating clear roles and responsibilities in teams with multiple contributors.
  • Scalable and Deployable Solutions:
    • Moving from experimental prototypes to fully operationalized systems in production.

Think Big

Adopt this structure if you aim to transition your project from experimentation to production with scalability and maintainability in mind.


Key Features of the Structure

1. Clear Separation of Data

The data/ directory enables reproducibility and clarity through organized subdirectories:

  • data/raw/: Contains unprocessed, immutable data dumps.
  • data/external/: Stores third-party or external datasets.
  • data/interim/: Holds intermediate results during preprocessing.
  • data/processed/: Contains clean datasets ready for modeling.
  • data/features/: Includes feature matrices generated during preprocessing.

Debugging Made Easier

By maintaining a clear separation of raw, interim, and processed data, you can quickly identify and debug issues in your data pipelines without impacting downstream tasks.


2. Modular Scripts

Scripts are divided based on functionality, promoting a clean and reusable workflow:

  • scripts/preprocess_data.py: Handles data cleaning, transformation, and feature engineering.
  • scripts/train_model.py: Contains the logic for model training.
  • scripts/evaluate_model.py: Includes evaluation metrics and model performance validation.
  • scripts/deploy_model.py: Manages deployment of the trained model to production.

Modular Design

Modular scripts ensure that updates to one part of the pipeline, such as preprocessing, do not inadvertently affect training or deployment.


3. Robust Testing Framework

A dedicated tests/ directory ensures that every component of the pipeline is thoroughly validated:

  • Unit Tests:
    • Validate individual functions, such as data preprocessing or model training utilities.
  • Integration Tests:
    • Test how different pipeline components interact with each other.
  • End-to-End Tests:
    • Simulate the full pipeline from raw data ingestion to model deployment.

Why Testing Matters

Skipping robust testing can lead to pipeline failures in production, resulting in significant downtime and wasted resources.
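
As an illustration, a unit test for a preprocessing step might look like the sketch below; the drop_missing_rows function is defined inline only to keep the example self-contained, whereas in a real project it would be imported from the source package:

# tests/test_preprocessing.py -- minimal pytest sketch for a data-cleaning step.
import pandas as pd

def drop_missing_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Toy preprocessing step; in practice this lives in the source package."""
    return df.dropna()

def test_drop_missing_rows_removes_all_nans():
    df = pd.DataFrame({"age": [25, None, 40], "income": [50_000, 60_000, None]})
    cleaned = drop_missing_rows(df)
    assert cleaned.isna().sum().sum() == 0  # No missing values remain.
    assert len(cleaned) == 1                # Only the fully populated row survives.

Integration and end-to-end tests follow the same pattern but exercise several pipeline stages together, and all of them run with poetry run pytest.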


4. CI/CD Workflows

Continuous Integration and Continuous Deployment (CI/CD) workflows automate quality checks and streamline deployment:

  • GitHub Actions:
    • Automate testing, linting, and formatting using tools like Black, Ruff, and Pytest.
  • Docker:
    • Containerize the pipeline for consistent execution across environments.
  • Deployment Pipelines:
    • Automate deployment to production environments using CI/CD tools like Jenkins or GitHub Actions.

GitHub Actions Workflow

Example CI/CD workflow for testing and deployment:

name: ML Pipeline CI/CD

on:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest
  deploy:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v3
      - name: Deploy with Docker
        run: docker build -t my-ml-project .


5. Inclusion of a Dockerfile

The Dockerfile ensures consistent execution across environments by containerizing the project:

# Dockerfile for Production-Ready ML Project
FROM python:3.8-slim

# Set up working directory
WORKDIR /app

# Install dependencies
COPY pyproject.toml poetry.lock ./
RUN pip install poetry && poetry install --no-dev

# Copy source code
COPY src/ /app/src/

# Entry point
CMD ["python", "src/main.py"]

Environment Consistency

Docker ensures that your pipeline behaves the same way locally, during testing, and in production environments.


Example Directory Structure

project-name/
├── data/
│   ├── raw/                       # Original, immutable data dump
│   ├── external/                  # Data from third-party sources
│   ├── interim/                   # Intermediate, partially processed data
│   ├── processed/                 # Fully processed data
│   └── features/                  # Engineered features for modeling
├── scripts/
│   ├── preprocess_data.py         # Data cleaning and transformation
│   ├── train_model.py             # Model training logic
│   ├── evaluate_model.py          # Evaluation metrics and validation
│   ├── deploy_model.py            # Deployment script
├── src/
│   ├── __init__.py                # Marks the directory as a Python package
│   ├── data/                      # Data loading and preprocessing
│   ├── models/                    # Model definitions and utilities
│   ├── evaluation/                # Evaluation metrics and methods
│   ├── utils.py                   # General utility functions
│   └── main.py                    # Entry point for the application
├── tests/
│   ├── __init__.py                # Marks the directory as a Python package
│   ├── test_preprocessing.py      # Unit tests for preprocessing
│   ├── test_training.py           # Unit tests for model training
│   └── test_end_to_end.py         # End-to-end pipeline tests
├── Dockerfile                     # Dockerfile for containerizing the project
├── pyproject.toml                 # Dependency management and configuration
├── poetry.lock                    # Lock file for dependencies
├── .github/
│   ├── workflows/
│   │   ├── test.yaml              # CI workflow for testing
│   │   ├── lint.yaml              # CI workflow for linting and formatting
│   │   └── deploy.yaml            # CI workflow for deployment
├── README.md                      # Overview and usage instructions
└── LICENSE                        # License for the project

Key Takeaways

Production-Ready Excellence

  • Data Organization: A clear separation of raw, processed, and feature data ensures reproducibility and clarity.
  • Script Modularity: Dividing the pipeline into preprocessing, training, evaluation, and deployment scripts simplifies maintenance and updates.
  • CI/CD Integration: Automated testing and deployment pipelines streamline production readiness and reliability.
  • Docker for Consistency: Containerization ensures consistent execution across environments.

This structure is the gold standard for end-to-end machine learning pipelines, ensuring scalability, maintainability, and production readiness.

4.4 Creating Python Packages for Specialized Fine-Tuning and Data Inclusion

Overview

In large Gen AI projects, modularizing components into Python packages improves organization, reusability, and maintainability. This section demonstrates how to structure a Python package for fine-tuning processes and embedding or dynamically accessing data.


When to Create a Python Package

Python packages are beneficial when:

  1. Avoiding Overcrowding:
    • Separating fine-tuning workflows or specialized functionality keeps the main project clean.
  2. Reusability Across Projects:
    • A fine-tuning package can be easily imported into multiple Gen AI or ML projects.
  3. Distributing Data or Models:
    • Including datasets, trained models, or configuration files ensures accessibility and version control.

Start with a Package

If your workflow has the potential to grow or be reused, start by organizing it into a Python package for better scalability and collaboration.


Here’s a directory structure for a Python package designed for fine-tuning models and embedding data:

fine_tune_package/
├── src/
│   └── fine_tune_package/
│       ├── __init__.py              # Marks the directory as a Python package.
│       ├── datasets.py              # Code for accessing included data.
│       ├── fine_tune.py             # Core functionality for fine-tuning.
│       ├── utils.py                 # Helper functions for the package.
│       ├── data/                    # Directory containing embedded data.
│       │   ├── __init__.py          # Marks the directory as a subpackage.
│       │   └── model_config.yaml    # Example model configuration file.
│       │   └── vocab.txt            # Vocabulary or tokenizer data.
│       └── models/                  # Directory for pre-trained or fine-tuned models.
│           ├── __init__.py          # Marks the directory as a subpackage.
│           └── fine_tuned_model.bin # Binary file of the fine-tuned model.
├── tests/
│   ├── __init__.py                  # Marks the directory as a package for tests.
│   ├── test_fine_tune.py            # Tests for fine-tuning functionality.
│   └── test_datasets.py             # Tests for dataset access.
├── pyproject.toml                   # Configuration for package building and dependencies.
├── README.md                        # Package description and usage examples.
└── LICENSE                          # License for the package.

Including Data in a Package

Use Cases
  1. Required for Functionality:
    • Data files like tokenizers or model configurations are essential for package operation.
  2. Example Data:
    • Including sample datasets to demonstrate package functionality.
  3. Reproducibility:
    • Bundling data ensures code and data are synchronized and version-controlled.

Embedded vs Downloadable Data

  • Embed small, essential files directly in the package.
  • Provide scripts for downloading large, optional files like pre-trained models.

Embedding Data Using importlib.resources

Step 1: Add Data to the Package

Place required data files (e.g., model_config.yaml) in the data/ subpackage.

fine_tune_package/
├── src/
│   └── fine_tune_package/
│       ├── data/
│       │   ├── __init__.py
│       │   └── model_config.yaml
Step 2: Create a Helper Function to Access the Data

Use importlib.resources to access embedded files.

from importlib import resources
import yaml

def get_model_config():
    """Get the model configuration file as a dictionary."""
    with resources.path("fine_tune_package.data", "model_config.yaml") as f:
        with open(f, "r") as file:
            config = yaml.safe_load(file)
    return config
Step 3: Access the Data in Your Code

Use the helper function to access the configuration file.

from fine_tune_package.datasets import get_model_config

config = get_model_config()
print(config)

Reusable Data Access

With importlib.resources, accessing embedded files becomes consistent and efficient across multiple environments.


Downloading Large Data Files

For large datasets or models, include scripts for dynamic downloading.

import os
import requests

def download_model(destination="models/fine_tuned_model.bin"):
    """Download a fine-tuned model file from an external source."""
    url = "https://example.com/fine_tuned_model.bin"
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    response = requests.get(url, stream=True, timeout=60)  # Timeout guards against a hanging download.
    response.raise_for_status()  # Fail fast on HTTP errors instead of writing an error page to disk.
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Model downloaded to {destination}")

Avoid Embedding Large Files

Large files can increase package size unnecessarily. Use dynamic download scripts for optional, large resources.


Building and Installing the Package

Step 1: Update pyproject.toml

Configure the package for the src layout and specify dependencies.

[tool.poetry]
name = "fine-tune-package"
version = "0.1.0"
description = "A package for fine-tuning Gen AI models."
authors = ["Your Name <your.email@example.com>"]
license = "MIT"

[tool.poetry.dependencies]
python = "^3.8"
requests = "^2.28"
pyyaml = "^6.0"

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
Step 2: Install the Package Locally

Install the package in editable mode for development:

poetry install

Editable Installation

Use editable mode to test changes locally without rebuilding the package.


Key Benefits of This Approach

  1. Reusability:
    • Install and reuse the package across multiple projects.
  2. Data Accessibility:
    • Embedded data or download scripts ensure smooth workflows.
  3. Scalability:
    • Modular packages reduce code clutter and improve maintainability.

Key Takeaways

Enhance Reusability and Professionalism

  • Modularize specialized workflows into Python packages for better organization.
  • Embed essential data using importlib.resources for easy access.
  • Use download scripts for large optional resources to keep packages lightweight.
  • A well-structured Python package boosts reusability, maintainability, and scalability in Gen AI projects.

5. Overview of the Standard Directory Structure

Purpose of the Standard Directory Structure

This section provides a comprehensive overview of each directory and file in the standard project structure. It explains their purpose, contents, and relevance for collaboration and maintaining a professional development workflow.

├── .devcontainer                  <- Directory for Visual Studio Code Dev Container configuration.
│   └── devcontainer.json          <- Configuration file for defining the development container.
├── .github                        <- Directory for GitHub-specific configuration and metadata.
│   ├── CODEOWNERS                 <- File to define code owners for the repository.
│   ├── CONTRIBUTING.md            <- Guidelines for contributing to the project.
│   └── pull_request_template.md   <- Template for pull requests to standardize and improve PR quality.
├── .vscode                        <- Directory for Visual Studio Code-specific configuration files.
│   ├── cspell.json                <- Configuration file for the Code Spell Checker extension.
│   ├── dictionaries               <- Directory for custom dictionary files.
│   │   └── data-science-en.txt    <- Custom dictionary for data science terminology.
│   ├── extensions.json            <- Recommended extensions for the project.
│   └── settings.json              <- Workspace-specific settings for Visual Studio Code.
├── config                         <- Configuration files for the project.
├── data                           <- Data for the project, divided into different stages of data processing.
│   ├── raw                        <- Original, immutable data dump.
│   ├── external                   <- Data from third-party sources.
│   ├── interim                    <- Intermediate data, partially processed.
│   ├── processed                  <- Fully processed data, ready for analysis.
│   └── features                   <- Engineered features ready for model training.
├── docs                           <- Documentation for the project.
│   ├── api-reference.md           <- API reference documentation.
│   ├── explanation.md             <- Detailed explanations and conceptual documentation.
│   ├── how-to-guides.md           <- Step-by-step guides on performing common tasks.
│   ├── index.md                   <- The main documentation index page.
│   └── tutorials.md               <- Tutorials related to the project.
├── log                            <- Logs generated by the project.
├── models                         <- Machine learning models, scripts, and other related artifacts.
├── notebooks                      <- Jupyter notebooks for experiments, examples, or data analysis.
├── scripts                        <- Directory for project-specific scripts and utilities.
│   └── hooks                      <- Directory for custom git hooks and other automation scripts.
│       ├── branch-name-check.sh   <- Hook script for checking branch names.
│       ├── commit-msg-check.sh    <- Hook script for checking commit messages.
│       ├── filename-check.sh      <- Hook script for checking file names.
│       ├── generate_docs.sh       <- Script for generating documentation.
│       └── restricted-file-check.sh <- Hook script for checking restricted files.
├── src                            <- Source code for the project.
│   └── collaborativeaitoy         <- Main project module.
│       ├── __init__.py            <- Initializes the Python package.
│       ├── main.py                <- Entry point for the application.
│       ├── app.py                 <- Main application logic.
│       └── utils.py               <- Utility functions.
├── tests                          <- Directory for all project tests.
│   ├── integration                <- Integration tests.
│   └── spec                       <- Specification tests (unit tests).
├── .gitignore                     <- Specifies intentionally untracked files to ignore.
├── .pre-commit-config.yaml        <- Configuration for pre-commit hooks.
├── Dockerfile                     <- Dockerfile for containerizing the application.
├── Makefile                       <- Makefile with commands like `make data` or `make train`.
├── mkdocs.yml                     <- Configuration file for MkDocs, a static site generator for project documentation.
├── pyproject.toml                 <- Configuration file for Python projects which includes dependencies and package information.
├── README.md                      <- The top-level README for developers using this project.
└── .env                           <- Environment variables file (hidden dotfile; keep out of version control).

1. .devcontainer

Visual Studio Code Dev Containers

Purpose: Defines a consistent development environment in containers.
Example File: devcontainer.json.
- Collaboration Benefit: Reduces "it works on my machine" issues by standardizing the development setup.


2. .github

Automating Repository Standards

Purpose: Houses repository management and automation files.
Key Files:
  • CODEOWNERS: Assigns code ownership for specific files or directories.
  • CONTRIBUTING.md: Guides contributions to the project.
  • pull_request_template.md: Standardizes pull request formats.
- Collaboration Benefit: Ensures repository standards for all contributors.


3. .vscode

VS Code-Specific Configurations

Purpose: Enhances the development experience with workspace-specific settings.
Key Files:
  • settings.json: Project-specific settings.
  • extensions.json: Recommended extensions.
  • cspell.json: Spell-checker configuration for technical terms.
- Collaboration Benefit: Aligns all contributors with consistent editor configurations.


4. config

Centralized Configurations

Purpose: Stores application settings and environment configurations.
- Collaboration Benefit: Avoids hardcoding settings and improves maintainability.


5. data

Organized Data Processing

Purpose: Organizes data by processing stages.
Structure:
  • raw/: Immutable data.
  • processed/: Data ready for analysis.
- Collaboration Benefit: Ensures reproducibility and clarity in data workflows.


6. docs

Comprehensive Documentation

Purpose: Provides user guides, tutorials, and API references.
Key Files:
  • api-reference.md
  • how-to-guides.md
- Collaboration Benefit: Onboards team members quickly and improves transparency.


7. log

Project Logs

Purpose: Tracks runtime behavior and debugging information.
- Collaboration Benefit: Helps identify and resolve runtime issues.


8. models

Centralized Model Storage

Purpose: Stores pre-trained or fine-tuned models.
- Collaboration Benefit: Ensures consistency in model usage across the team.


9. notebooks

Prototyping Space

Purpose: Allows for exploratory work before integration into the main codebase.
- Collaboration Benefit: Serves as a shared workspace for experiments.


10. scripts

Reusable Automation Scripts

Purpose: Automates repetitive tasks and project utilities.
- Collaboration Benefit: Standardizes task execution and reduces manual effort.


11. src

Core Project Code

Purpose: Encapsulates the primary functionality of the project.
- Collaboration Benefit: Organizes code for scalability and maintainability.


12. tests

Code Reliability

Purpose: Validates code functionality through unit and integration tests.
- Collaboration Benefit: Prevents regressions and ensures quality in changes.


13. Key Root-Level Files

| File | Purpose | Relevance |
| --- | --- | --- |
| .gitignore | Excludes unnecessary files from version control. | Keeps the repository clean. |
| .pre-commit-config.yaml | Automates code checks before commits. | Enforces coding standards. |
| Dockerfile | Containerizes the application. | Ensures consistent deployment. |
| Makefile | Automates common tasks. | Standardizes workflows for team members. |
| mkdocs.yml | Configures project documentation. | Simplifies and standardizes documentation generation. |
| pyproject.toml | Manages dependencies and builds. | Centralizes project configuration. |
| README.md | Provides project overview and setup instructions. | First point of contact for new team members or external collaborators. |
| .env | Stores environment-specific variables. | Improves security by keeping sensitive data out of the codebase. |

Conclusion

Collaborative Excellence

This standard directory structure enhances collaboration, scalability, and maintainability in AI/ML projects. By providing clear organization and purpose for each component, teams can efficiently navigate, contribute, and extend the project.

6. Automating Standards with Cookiecutter

Overview of Cookiecutter for Project Automation

Automating project scaffolding ensures consistency across projects and saves time when starting new ones. Cookiecutter is a tool that simplifies this process by generating standardized project structures based on templates. This section introduces Cookiecutter, demonstrates its usage, and highlights its benefits for collaboration and reproducibility.


What is Cookiecutter?

About Cookiecutter

Cookiecutter is a command-line utility for creating projects from predefined templates. It allows teams to:

  • Enforce consistent directory structures and configurations.
  • Automate repetitive setup tasks.
  • Customize templates for specific organizational or project needs.


Why Use Cookiecutter?

Benefits of Cookiecutter

  1. Standardization: Every project follows the same structure, enhancing maintainability.
  2. Time-Saving: Automates creating directories, boilerplate files, and configurations.
  3. Customizability: Allows user input (e.g., project names, author details) to tailor templates.
  4. Scalability: Reuse templates across teams for various projects, such as ML pipelines or Gen AI applications.

Step-by-Step Demonstration

Step 1: Install Cookiecutter

pip install cookiecutter

Dependency Management

Ensure pip is up-to-date to avoid installation issues.


Step 2: Choose or Create a Template

Example Template: ML Project Structure
cookiecutter-ml-template/
├── {{cookiecutter.project_slug}}/
│   ├── data/
│   │   ├── raw/
│   │   ├── processed/
│   ├── notebooks/
│   │   └── example_notebook.ipynb
│   ├── src/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   └── utils.py
│   ├── tests/
│   │   ├── __init__.py
│   │   └── test_main.py
│   ├── .gitignore
│   ├── README.md
│   ├── pyproject.toml
│   └── LICENSE
├── cookiecutter.json
└── README.md

Step 3: Configure the Template

Define prompts and their default values in cookiecutter.json:

{
    "project_name": "My ML Project",
    "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
    "author_name": "Your Name",
    "description": "A machine learning project template.",
    "license": "MIT"
}


Step 4: Generate a New Project

Run Cookiecutter with the template:

cookiecutter https://github.com/your-org/cookiecutter-ml-template

Example input prompts:

project_name [My ML Project]: Awesome AI Project
author_name [Your Name]: Jane Doe
description [A machine learning project template.]: Fine-tuning Gen AI Models
license [MIT]: Apache-2.0

Generated structure:

awesome_ai_project/
├── data/
├── notebooks/
├── src/
├── tests/
├── README.md
├── pyproject.toml
├── LICENSE
└── .gitignore


Step 5: Customize and Use the Project

Navigate to the generated directory:

cd awesome_ai_project
poetry install

Next Steps

Start coding within a pre-configured environment, saving time and reducing setup errors.


Advanced Features

Hooks

Post-Generation Automation

Run custom scripts (e.g., initialize a Git repository, install dependencies) automatically after project creation.
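
Cookiecutter looks for a hooks/post_gen_project.py script inside the template and runs it in the freshly generated project directory. A minimal sketch that initializes a Git repository might look like this:

# hooks/post_gen_project.py -- runs automatically after the project is generated.
import subprocess

def init_git_repository() -> None:
    """Initialize a Git repository and create the first commit in the new project."""
    subprocess.run(["git", "init"], check=True)
    subprocess.run(["git", "add", "."], check=True)
    subprocess.run(["git", "commit", "-m", "Initial commit from template"], check=True)

if __name__ == "__main__":
    init_git_repository()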


Private Templates

Store templates in private GitHub repositories to maintain confidentiality and customize for internal use.


Integration with CI/CD

Automate project generation as part of a CI/CD pipeline to maintain consistency across deployments.


Key Takeaways

  • Automation and Standardization: Cookiecutter ensures consistent and professional project setups.
  • Scalability: Templates can evolve with team standards and project complexity.
  • Reusability: Save time and reduce errors by leveraging pre-built templates.

Next Steps: Explore existing templates or customize one for your team’s ML or Gen AI projects.


7. Advanced Topics and Future Considerations


How the Structure Supports CI/CD Workflows

1. Automated Testing and Quality Assurance

Integrating with CI/CD Tools

  • Tools like GitHub Actions and Jenkins can utilize the tests/ directory for automated testing.
  • Pre-commit hooks configured in .pre-commit-config.yaml enforce code quality before commits.
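
A minimal .pre-commit-config.yaml along these lines wires formatting and linting into every commit (the revisions shown are placeholders; pin them to the versions your team standardizes on):

repos:
  - repo: https://github.com/psf/black
    rev: 24.3.0          # Placeholder version; pin to your team's standard.
    hooks:
      - id: black
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.4.4          # Placeholder version.
    hooks:
      - id: ruff

Install the hooks once per clone with pre-commit install.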

Example Workflow Configuration:

name: CI Workflow

on:
  push:
    branches:
      - main

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install Dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run Tests
        run: poetry run pytest


2. Deployment Pipelines

Using Docker for Deployment

  • The Dockerfile ensures consistent execution environments across development and production.
  • CI/CD pipelines can automate building and pushing Docker images.
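
As one sketch of such a pipeline step, an additional GitHub Actions job could build and push the image after tests pass (the secret names and image tag below are placeholders, using the published docker/login-action and docker/build-push-action actions):

  build-and-push:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v3
      - name: Log in to the container registry
        uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}
      - name: Build and push the image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: my-org/my-ml-project:latest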

Transitioning from POCs to Production

1. Modular Design

Separation of Concerns

Organizing code into reusable modules (src/) and structuring data (data/) by processing stages ensures scalability and maintainability.


2. Collaboration Readiness

Documentation in docs/ and README.md enables seamless handoffs between teams and stakeholders.


Documentation Tools: MkDocs

What is MkDocs?

MkDocs converts Markdown files (e.g., in docs/) into professional static documentation sites.


Benefits of MkDocs

  • Centralized Documentation: Consolidates all guides and references.
  • Ease of Use: Markdown-based editing simplifies updates.
  • Live Preview: View changes locally during development.
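
A minimal mkdocs.yml covering the docs/ files listed earlier might look like the sketch below; mkdocs serve then provides the live local preview and mkdocs build generates the static site:

site_name: Project Documentation
nav:
  - Home: index.md
  - Tutorials: tutorials.md
  - How-To Guides: how-to-guides.md
  - API Reference: api-reference.md
  - Explanation: explanation.md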

Key Takeaways

  • Automate testing, deployment, and documentation workflows for robust CI/CD integration.
  • Modularize code and document comprehensively for a seamless transition from POCs to production.
  • Leverage MkDocs for high-quality, accessible project documentation.

Next Steps: Implement CI/CD pipelines and explore MkDocs for your project documentation.