Training Session 1: Project Scaffolding Standards for AI/ML Projects¶
Objective¶
By the end of this session, participants will understand the importance of standardized project structures, their advantages, and how to adapt directory structures for different project types (e.g., Python projects, Gen AI applications, production-ready ML pipelines).
1. Introduction: The Importance of Project Scaffolding Standards¶
Overview
Welcome to the first session of our training series! In this session, we will focus on one of the most fundamental aspects of AI/ML software development: establishing and adhering to standardized project structures. A strong foundation ensures successful collaboration, reproducibility, and scalability, setting the stage for high-quality code and professional workflows.
Why Standardized Project Structures Matter¶
In AI/ML projects, we work in a multidisciplinary environment that often includes:
- Data Scientists: Focused on creating models and analyzing data.
- ML Engineers: Responsible for scaling and deploying models.
- Data Engineers: Handling data pipelines and storage.
Collaboration Across Roles
Each role brings unique challenges, making collaboration and consistency essential. A standardized directory structure acts as a common language, helping the entire team work efficiently.
Goals of Project Scaffolding¶
- Organization: Avoid chaos by keeping files and folders structured, making navigation intuitive.
- Collaboration: Ensure team members can work on the same project without confusion about file locations or naming conventions.
- Reproducibility: Guarantee that others (or even your future self!) can reproduce experiments and workflows seamlessly.
- Scalability: Design projects that can grow, transitioning from proof-of-concept to production without major restructuring.
Pro Tip
Begin every project with a scaffolding template that supports your team's needs. Customizing a standard template for specific workflows can save time and reduce errors.
Common Challenges Without Standardization¶
Without a standardized project scaffold:
- Onboarding New Team Members:
- Takes longer, as they need to learn the structure of every new project.
- Debugging and Maintenance:
- Becomes harder as team members struggle to locate key files or understand dependencies.
- Collaboration Friction:
- Mismatched expectations around file organization slow down progress.
- Code Reusability:
- Is hampered, as inconsistent structures make it harder to integrate components across projects.
Potential Pitfalls
A lack of standardization leads to disorganized codebases, delayed debugging, and longer onboarding times for new members.
What We’ll Cover Today¶
- The Importance of Standardization: Why a structured approach is critical for team projects.
- Advantages of a Standardized Structure: Benefits for organization, collaboration, reproducibility, and scalability.
- Types of Directory Structures: Differences between Python utilities, Gen AI applications, and production ML pipelines.
- Exploring Our Standard: Walkthrough of the directory structure we’ll use as a template.
- Interactive Activity: Hands-on exercise to restructure a disorganized project.
Interactive Activity
Be prepared to engage in a practical exercise where you’ll apply the principles of standardization to a real-world example. Bring your questions for a live Q&A at the end of the session!
Key Takeaways¶
- Standardized project scaffolding is the foundation for professional, collaborative AI/ML development.
- It enhances productivity, reduces confusion, and sets your project up for long-term success.
- The skills you’ll learn today will improve how you work within teams and across roles.
You're Ready to Build Smarter
By implementing these principles, you'll streamline workflows, foster collaboration, and ensure your projects are built for scalability and reproducibility.
2. The Importance of Standardized Directory Structures¶
Introduction
A standardized directory structure is the backbone of any successful AI/ML project. It ensures that all team members, regardless of their role, can collaborate effectively, replicate experiments, and scale solutions for production. Let’s break this down by exploring the benefits in terms of collaboration, reproducibility, and scalability.
1. Collaboration¶
When multiple team members—data scientists, ML engineers, and data engineers—work on the same project, clarity is key. A standardized directory structure promotes seamless collaboration by:
- Providing Clear File Locations:
- Everyone knows where to find code, data, configurations, and documentation.
- Reduces time wasted searching for files or asking teammates for clarification.
Practical Advice for File Locations
Use self-explanatory directory names such as `data/`, `src/`, `docs/`, and `tests/` to make file locations intuitive and reduce friction.
- Minimizing Miscommunication:
- Clear separation of concerns (e.g., raw vs. processed data, scripts vs. tests) eliminates confusion over file ownership or purpose.
- Agreed-upon standards reduce the risk of overwriting each other's work.
Avoid Overwriting Work
Always use version control systems like Git to prevent accidental overwrites, especially when multiple people are working on shared files.
- Enabling Parallel Development:
- Teams can work on different aspects of the project (e.g., data preprocessing, model training, and deployment) without stepping on each other's toes.
Example: Parallel Development
Imagine a team where one member preprocesses data in the `data/` directory while another works on model training in `src/models/`. A standardized structure ensures no one disrupts another's workflow.
2. Reproducibility¶
Reproducibility is a cornerstone of AI/ML projects, especially when experiments need to be validated or results are handed off to other teams. Standardization supports reproducibility by:
- Ensuring Consistent Experiment Replication:
- Clearly organized `data/` directories (e.g., `raw`, `processed`) ensure that the same inputs produce the same outputs.
- Version-controlled configuration files (`settings.yaml`, `config.cfg`) lock down parameters for experiments.
Configuration Best Practices
Store all experiment parameters in a central `config/` directory. Use YAML or JSON files for clarity and machine readability; a short example config follows this list.
- Providing a Clear Workflow:
- Scripts are logically separated (e.g., `scripts/preprocess_data.py`, `scripts/train_model.py`), ensuring that the process is documented and repeatable.
- Facilitating Debugging:
- When something breaks, team members can easily trace issues back to specific scripts or data folders.
Streamline Debugging
Use meaningful file and function names to make tracing errors easier (e.g., `train_model.py` instead of `script1.py`).
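As suggested in the Configuration Best Practices tip above, a single experiment can be captured in one YAML file. The sketch below is illustrative; the file name and every key are placeholders to adapt to your project:

# config/experiment.yaml -- illustrative example, adjust keys to your project
data:
  train_path: data/processed/train.csv
  test_size: 0.2
model:
  type: random_forest
  n_estimators: 200
  random_state: 42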
3. Scalability¶
Projects often start as small proofs of concept but need to evolve into production-ready systems. A standardized structure allows your project to scale by:
- Future-Proofing for Growth:
- Clear separation of modules (e.g., `src/`, `tests/`, `docs/`) ensures the project can handle new features without becoming unmanageable.
Plan for Growth
Avoid creating monolithic scripts like a single `main.py` file. Modularize early to make scaling manageable.
- Simplifying Transitions to Production:
- Production-ready components like a `Dockerfile`, deployment scripts, and CI/CD workflows can be easily integrated without restructuring the entire project.
- Accommodating Team Expansion:
- A consistent structure ensures new team members can quickly onboard and contribute without disrupting existing workflows.
Smooth Onboarding
A well-structured project enables new team members to start contributing within days instead of weeks.
Practical Examples¶
- Collaboration:
- Imagine two data scientists working on feature engineering and model training. If both use different folder names or scatter files across the project, merging their work becomes chaotic. A standardized structure ensures they can work independently while aligning with the broader project.
Example: Collaboration
A shared `data/` directory for raw and processed data allows team members to independently work on preprocessing and modeling without overlap.
- Reproducibility:
- A research project involving multiple experiments benefits greatly from a clear `config/` directory where all experiment parameters are stored. Another team can pick up the same configuration and rerun the experiments without confusion.
Centralized Experiment Configurations
Store all experiment settings in the `config/` directory to ensure easy replication of workflows.
- Scalability:
- A POC chatbot developed in a single `main.py` script can grow into a production system with modules for `retrievers/`, `generators/`, and `evaluation/`. Standardization ensures this evolution is smooth.
Scale with Confidence
Begin with a directory structure that anticipates growth, even if the initial project is small.
Key Takeaways¶
- Collaboration: Reduces confusion and improves teamwork by providing clarity.
- Reproducibility: Ensures that experiments and workflows can be repeated and validated.
- Scalability: Prepares the project for future growth, whether it’s transitioning to production or adding team members.
Ready to Succeed
By implementing a standardized directory structure, your project will be set up for success across all phases of development.
Next Up: Let’s explore how different types of projects (e.g., Python scripts, Gen AI applications, production ML pipelines) require variations in directory structures.
3. Common Pitfalls of Poor Project Structures¶
Introduction
While the benefits of a standardized project structure are clear, the consequences of neglecting this foundation can lead to significant challenges. Poor project organization not only hampers productivity but can also create long-term problems for collaboration, debugging, and reproducibility. In this section, we’ll highlight the key pitfalls of poor project structures.
1. Lack of Reproducibility¶
Reproducibility is essential in AI/ML projects, especially when results need to be validated, shared, or reproduced at a later date. Without a clear structure:
- Experiment Replication Becomes Impossible:
- Raw data might be overwritten or mixed with processed data, leading to inconsistent results.
- Missing or hardcoded configurations make it difficult to replicate an experiment.
Risk of Data Loss
Overwriting raw data or failing to separate preprocessing steps can make experiments irreproducible and lead to permanent loss of critical workflows.
- Code and Data Mismatches:
- Scripts might be scattered across folders without clear versioning or alignment with datasets.
- Dependencies might not be documented, causing issues when trying to rerun old workflows.
Example: Replication Failure
A team member runs an experiment but doesn’t save the preprocessing script separately from the training script. Months later, when the experiment needs to be replicated, the preprocessing steps are lost, rendering the results unverifiable.
2. Increased Debugging Time¶
When projects lack clear organization, debugging becomes a frustrating and time-consuming process:
- Difficult to Locate Issues:
- Without a dedicated `logs/` folder or clear script separation, tracing errors to their origin takes significantly longer.
Use a Logs Directory
Always include a `logs/` folder to store detailed error and execution logs for easier debugging; a short logging sketch follows this list.
- Unclear Code Ownership:
- If scripts are dumped in a single folder, it’s challenging to determine which script is responsible for what functionality.
Avoid Script Clutter
Organize scripts into meaningful categories (e.g., `data_processing/`, `model_training/`) to ensure code ownership and clarity.
- Duplication and Confusion:
- Duplicate or outdated scripts might exist, leading to confusion about which file to debug.
Example: Debugging Delays
A model training script depends on data preprocessing, but due to an unclear structure, the preprocessing script is accidentally overwritten. Debugging the resulting error wastes hours of the team’s time.
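The Use a Logs Directory tip above can be implemented with Python's standard logging module. This is a minimal sketch; the log file name and message are illustrative:

import logging
from pathlib import Path

# Write run logs to the project's logs/ directory (created if missing).
Path("logs").mkdir(exist_ok=True)
logging.basicConfig(
    filename="logs/pipeline.log",  # hypothetical log file name
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s - %(message)s",
)
logger = logging.getLogger(__name__)
logger.info("Preprocessing started")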
3. Difficulties Onboarding Team Members¶
A poorly structured project can discourage new team members and significantly delay their productivity:
- Longer Ramp-Up Time:
- New contributors spend excessive time trying to understand where things are located and how they work.
Documentation Tip
Provide a `README.md` file that outlines the directory structure and includes instructions for setup and usage.
- Knowledge Silos:
- If only one person understands the project layout, they become a bottleneck for progress.
Silo Risk
Avoid creating knowledge silos by documenting workflows and maintaining a clear project structure.
- Higher Error Risk:
- New team members might inadvertently disrupt workflows by modifying the wrong files or folders.
Example: Onboarding Struggles
A new hire joins a project but finds that scripts, data, and documentation are mixed in a single folder. They unintentionally delete critical files while trying to clean up the structure, delaying the project.
Key Takeaways¶
- Lack of reproducibility undermines trust in results and creates unnecessary rework.
- Increased debugging time leads to wasted effort and frustration.
- Difficulties onboarding new members slow down team productivity and create bottlenecks.
Streamlined Projects Save Time
A well-organized project structure reduces debugging time, enhances reproducibility, and accelerates onboarding for new team members.
Next Up: Let’s explore how different types of AI/ML projects require tailored directory structures and how our standardized approach addresses these pitfalls.
4. Exploring Different Directory Structures¶
Introduction
In this section, we focus on how to structure simple Python projects, such as utility scripts or smaller-scale tools, to ensure they remain maintainable, reusable, and professional. While these projects may start small, including a well-organized `data/` directory enhances data handling, reproducibility, and scalability. Adding this alongside `notebooks/` supports experimentation and development workflows for data professionals.
4.1 Python Project Structure for Simple Scripts and Utilities¶
When to Use This Structure¶
This structure is ideal for:
- Small projects, such as data manipulation utilities or command-line tools.
- Projects involving datasets, even small ones, that require preprocessing or feature engineering.
- Initial exploration and experimentation using Jupyter notebooks.
- Prototypes or proofs of concept before scaling into larger applications.
Use Case Highlight
This structure is particularly useful for projects that might grow beyond their initial scope, as it lays a solid foundation for scaling.
Recommended Directory Structure¶
Here’s the recommended directory structure for simple projects, including a `data/` section:
project-name/
├── project_name/ <- Source code for the project.
│ ├── __init__.py <- Marks the directory as a Python package.
│ ├── main.py <- Main script or entry point of the project.
│ └── utils.py <- Helper functions used by `main.py`.
├── data/ <- Data used in the project.
│ ├── raw <- Original, immutable data dump.
│ ├── external <- Data from third-party sources.
│ ├── interim <- Intermediate data, partially processed.
│ ├── processed <- Fully processed data, ready for analysis.
│ └── features <- Engineered features ready for model training.
├── notebooks/ <- Jupyter notebooks for exploration and experimentation.
│ ├── data_exploration.ipynb <- Notebook for data exploration.
│ ├── prototyping.ipynb <- Notebook for prototyping and initial analysis.
│ └── README.md <- Guidelines for using the notebooks.
├── tests/ <- Unit and integration tests.
│ ├── __init__.py <- Marks the directory as a package for testing.
│ ├── test_main.py <- Tests for `main.py`.
│ └── test_utils.py <- Tests for `utils.py`.
├── .gitignore <- Files and directories to ignore in Git.
├── pyproject.toml <- Project configuration for dependencies and tools.
├── README.md <- Description and instructions for the project.
└── LICENSE <- License file for open-source projects.
Quick Setup
Use a template generator like `cookiecutter` to quickly initialize this structure and maintain consistency across projects.
Detailed Explanation¶
1. Data Directory (`data/`)¶
This directory organizes all data used in the project, ensuring a clear workflow from raw data to processed features. Standard subdirectories include:
- `raw/`:
- Contains the original data that is immutable and serves as the source of truth.
- Example: `.csv` or `.json` files downloaded from a database or external source.
- `external/`:
- Holds data from third-party sources, such as APIs or shared datasets.
- Example: Pre-trained embeddings, external `.zip` files.
- `interim/`:
- Stores intermediate data that has been partially processed.
- Example: Data after initial cleaning or aggregation but before full preprocessing.
- `processed/`:
- Contains fully processed data that is ready for analysis or modeling.
- Example: Cleaned and structured data in `.parquet` or `.csv` formats.
- `features/`:
- Holds engineered features for model training.
- Example: Feature matrices in `.npy` or `.pkl` formats.
Avoid Data Overwrites
Never overwrite raw data. Always save intermediate transformations and processed outputs in separate directories to maintain reproducibility.
2. Integration with `notebooks/`¶
The `data/` directory complements the `notebooks/` directory:
- Use `data/raw/` for initial exploration in `data_exploration.ipynb`.
- Save interim results in `data/interim/` for reuse across notebooks and scripts.
- Generate final datasets in `data/processed/` and `data/features/` for downstream tasks like modeling.
Efficient Workflow
Start by loading data from `data/raw/` into a `data_exploration.ipynb` notebook. Save cleaned outputs to `data/interim/` for use in a `prototyping.ipynb` notebook focused on feature engineering.
3. Integration with Source Code (`project_name/`)¶
The `data/` directory interacts directly with source code:
- `utils.py`: Functions for loading and saving data from specific directories.
- `main.py`: Accesses processed data or features for the core functionality.
Example utility functions:
import os
import pandas as pd

def load_raw_data(file_name: str) -> pd.DataFrame:
    """Loads a raw data file from the `data/raw/` directory."""
    file_path = os.path.join("data", "raw", file_name)
    return pd.read_csv(file_path)

def save_processed_data(df: pd.DataFrame, file_name: str):
    """Saves a processed DataFrame to the `data/processed/` directory."""
    file_path = os.path.join("data", "processed", file_name)
    df.to_csv(file_path, index=False)
Reusable Functions
Centralize data loading and saving functions in `utils.py` to avoid redundancy and ensure consistent directory handling.
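A possible way to call these helpers from `main.py`, assuming the functions live in `project_name/utils.py` as in the tree above and that a `customers.csv` file exists in `data/raw/` (both names are illustrative):

from project_name.utils import load_raw_data, save_processed_data

raw_df = load_raw_data("customers.csv")                  # reads data/raw/customers.csv
cleaned_df = raw_df.dropna()                             # stand-in for real preprocessing
save_processed_data(cleaned_df, "customers_clean.csv")   # writes data/processed/customers_clean.csv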
Examples of Projects with `data/` and `notebooks/`¶
1. Data Cleaning Tool¶
Use Case: Clean and preprocess raw customer data, with notebooks for exploration and a structured data workflow.
Directory structure:
data-cleaner/
├── data/
│ ├── raw/
│ ├── external/
│ ├── interim/
│ ├── processed/
│ └── features/
├── notebooks/
│ ├── data_exploration.ipynb
│ └── prototyping.ipynb
├── data_cleaner/
│ ├── __init__.py
│ ├── main.py
│ └── cleaning_utils.py
2. Feature Engineering Project¶
Use Case: Engineer features for predictive modeling, with raw data, interim transformations, and a focus on creating reusable features.
Directory structure:
feature-engineer/
├── data/
│ ├── raw/
│ ├── interim/
│ ├── processed/
│ └── features/
├── notebooks/
│ ├── feature_exploration.ipynb
│ └── prototyping.ipynb
├── feature_engineer/
│ ├── __init__.py
│ ├── main.py
│ └── feature_utils.py
Key Takeaways¶
- The `data/` directory introduces a clear workflow for handling raw, interim, processed, and engineered datasets.
- Combined with `notebooks/`, this structure supports iterative experimentation and scalable development.
- Even simple projects benefit from modular data organization, improving reproducibility and collaboration.
Foundation for Success
A structured directory ensures clarity, reproducibility, and scalability, enabling your project to grow seamlessly.
Next Up: Let’s explore how directory structures adapt for Gen AI applications and production-ready ML pipelines.
4.2 Exploring Standards for Organizing Python Source Code¶
Overview
Organizing Python source code effectively is crucial for maintainability, scalability, and collaboration. This section compares two commonly used directory structures: the Flat Directory Approach and the Nested `src/` Directory Approach, highlighting their advantages, disadvantages, and recommended use cases.
1. Flat Directory Approach¶
Overview¶
The source code is located directly under the top-level directory, inside a folder named after the project.
project-name/
├── project_name/ <- Source code for the project.
│ ├── __init__.py <- Marks the directory as a Python package.
│ ├── main.py <- Main script or entry point of the project.
│ └── utils.py <- Helper functions used by `main.py`.
Advantages¶
- Simplicity:
- Easier to navigate and set up for small projects.
- Ideal for beginners and small utility projects.
Beginner-Friendly Setup
The flat directory structure is perfect for those new to Python projects, as it requires minimal configuration.
- No Additional Layers:
- Directly accessible from the project root, making it easy to run scripts or modules without configuring paths.
- Intuitive for Small Projects:
- Reduces overhead when the project scope is limited, such as single-purpose utilities.
Disadvantages¶
- Risk of Name Collisions:
- The top-level package shares its name with the project directory, leading to potential import issues (e.g., importing `project_name` may conflict with the folder name).
Name Collision Risk
Avoid naming your package the same as the project directory to prevent import conflicts.
- Scaling Limitations:
- Mixing source code with top-level directories like `data/`, `notebooks/`, and `tests/` can become unwieldy as the project grows.
When to Use¶
- Best for:
- Small, single-purpose projects that are unlikely to expand significantly.
- Scripts or utilities where simplicity is a priority.
2. Nested `src/` Directory Approach¶
Overview¶
The source code is placed inside a `src/` directory, which contains the main package folder.
project-name/
├── src/project_name/ <- Source code for the project.
│ ├── __init__.py <- Marks the directory as a Python package.
│ ├── main.py <- Main script or entry point of the project.
│ └── utils.py <- Helper functions used by `main.py`.
Advantages¶
- Clear Separation:
- Distinguishes source code from other directories, such as `data/`, `tests/`, and `notebooks/`, ensuring a cleaner and more professional project structure.
- Avoids Import Conflicts:
- Prevents Python from accidentally importing modules from the top-level directory, reducing the risk of name collisions.
- Better for Larger Projects:
- Encourages scalability and organization when the project involves multiple packages or modules.
Scalable and Professional
The `src/` directory approach is ideal for large or production-level projects, offering a clean, scalable structure.
Disadvantages¶
- Added Complexity:
- Requires configuring the Python path to include `src/` (e.g., setting `PYTHONPATH` or using tools like Poetry or `pytest` to handle paths; see the configuration sketch after this list).
Extra Configuration Needed
Ensure proper path management to avoid import errors when using a `src/` directory.
- Overhead for Small Projects:
- The additional layer might feel unnecessary for simple utilities or one-off scripts.
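One common way to handle the path configuration mentioned above is to declare the src layout in `pyproject.toml`. The fragments below are a sketch; `project_name` is a placeholder, Poetry's `packages` field tells the build where the package lives, and pytest's `pythonpath` option (pytest 7+) lets tests import it without an install:

# pyproject.toml (fragments) -- illustrative settings for a src/ layout
[tool.poetry]
packages = [{ include = "project_name", from = "src" }]

[tool.pytest.ini_options]
pythonpath = ["src"]  # allows `import project_name` in tests without installing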
When to Use¶
- Best for:
- Medium to large-scale projects where clean separation of source code is critical.
- Projects intended for production or distribution as Python packages.
Comparison of the Two Approaches¶
| Aspect | Flat Directory | Nested src/ Directory |
|---|---|---|
| Complexity | Simple, beginner-friendly | Slightly more complex to set up |
| Scalability | Limited scalability | Highly scalable for large projects |
| Risk of Import Issues | Higher due to name collisions | Low, avoids conflicts |
| Use Case | Small utility projects | Medium to large projects |
| Professionalism | Perceived as less formal | Perceived as more professional |
Recommendations¶
Choosing the Right Structure
Select a directory structure based on your project's size, scope, and future growth potential.
Flat Directory Approach:¶
- Best for:
- Small projects with a narrow focus (e.g., data cleaning tools, simple CLI utilities).
- Prototyping or proof-of-concept work.
- Avoid if:
- The project is expected to grow significantly.
- The name of the package risks conflicting with the project directory.
Nested `src/` Directory Approach:¶
- Best for:
- Production-level projects involving multiple components or packages.
- Projects requiring clean, scalable structures.
- Avoid if:
- The project scope is very small, and the extra complexity is unnecessary.
Key Takeaways¶
Summary
- The Flat Directory Approach is simple and quick for small projects but can lead to import conflicts and clutter in larger projects.
- The Nested `src/` Directory Approach provides better separation and scalability, making it the preferred choice for professional or production-level projects.
- Align your directory structure with your project’s scope and long-term goals to ensure maintainability and growth.
4.3 Production-Ready ML Project Structure¶
Introduction
Building a production-ready ML project requires a structure that supports model development, deployment, scalability, and maintenance. This structure is optimized for collaboration across roles—data scientists, ML engineers, and data engineers—and is designed to accommodate CI/CD workflows, robust testing, and operationalized machine learning pipelines.
Use Cases¶
Production-ready ML project structures are ideal for:
- End-to-End Machine Learning Pipelines:
- Managing data ingestion, preprocessing, training, evaluation, and deployment.
- Collaboration Between Teams:
- Facilitating clear roles and responsibilities in teams with multiple contributors.
- Scalable and Deployable Solutions:
- Moving from experimental prototypes to fully operationalized systems in production.
Think Big
Adopt this structure if you aim to transition your project from experimentation to production with scalability and maintainability in mind.
Key Features of the Structure¶
1. Clear Separation of Data¶
The `data/` directory enables reproducibility and clarity through organized subdirectories:
- `data/raw/`: Contains unprocessed, immutable data dumps.
- `data/external/`: Stores third-party or external datasets.
- `data/interim/`: Holds intermediate results during preprocessing.
- `data/processed/`: Contains clean datasets ready for modeling.
- `data/features/`: Includes feature matrices generated during preprocessing.
Debugging Made Easier
By maintaining a clear separation of raw, interim, and processed data, you can quickly identify and debug issues in your data pipelines without impacting downstream tasks.
2. Modular Scripts¶
Scripts are divided based on functionality, promoting a clean and reusable workflow:
- `scripts/preprocess_data.py`: Handles data cleaning, transformation, and feature engineering.
- `scripts/train_model.py`: Contains the logic for model training.
- `scripts/evaluate_model.py`: Includes evaluation metrics and model performance validation.
- `scripts/deploy_model.py`: Manages deployment of the trained model to production.
Modular Design
Modular scripts ensure that updates to one part of the pipeline, such as preprocessing, do not inadvertently affect training or deployment.
3. Robust Testing Framework¶
A dedicated tests/
directory ensures that every component of the pipeline is thoroughly validated:
- Unit Tests:
- Validate individual functions, such as data preprocessing or model training utilities.
- Integration Tests:
- Test how different pipeline components interact with each other.
- End-to-End Tests:
- Simulate the full pipeline from raw data ingestion to model deployment.
Why Testing Matters
Skipping robust testing can lead to pipeline failures in production, resulting in significant downtime and wasted resources.
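As an illustration, a unit test under tests/ could look like the sketch below. The clean_column_names helper is hypothetical and is inlined here only to keep the example self-contained; in a real project it would be imported from the source package:

# tests/test_preprocessing.py -- minimal sketch of a pytest unit test
import pandas as pd

def clean_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalizes column names."""
    return df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))

def test_clean_column_names():
    df = pd.DataFrame({" Customer ID ": [1], "Total Spend": [9.5]})
    cleaned = clean_column_names(df)
    assert list(cleaned.columns) == ["customer_id", "total_spend"]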
4. CI/CD Workflows¶
Continuous Integration and Continuous Deployment (CI/CD) workflows automate quality checks and streamline deployment:
- GitHub Actions:
- Automate testing, linting, and formatting using tools like Black, Ruff, and Pytest.
- Docker:
- Containerize the pipeline for consistent execution across environments.
- Deployment Pipelines:
- Automate deployment to production environments using CI/CD tools like Jenkins or GitHub Actions.
GitHub Actions Workflow
Example CI/CD workflow for testing and deployment:
name: ML Pipeline CI/CD
on:
  push:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.8"
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install poetry
          poetry install
      - name: Run tests
        run: poetry run pytest
  deploy:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v3
      - name: Deploy with Docker
        run: docker build -t my-ml-project .
5. Inclusion of a `Dockerfile`¶
The `Dockerfile` ensures consistent execution across environments by containerizing the project:
# Dockerfile for Production-Ready ML Project
FROM python:3.8-slim
# Set up working directory
WORKDIR /app
# Install dependencies
COPY pyproject.toml poetry.lock ./
RUN pip install poetry && poetry install --no-dev
# Copy source code
COPY src/ /app/src/
# Entry point
CMD ["python", "src/main.py"]
Environment Consistency
Docker ensures that your pipeline behaves the same way locally, during testing, and in production environments.
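For example, the image can be built and run locally with the standard Docker CLI; the tag matches the my-ml-project name used in the workflow above and is otherwise arbitrary:

docker build -t my-ml-project .   # build the image from the Dockerfile
docker run --rm my-ml-project     # run the pipeline entry point in a container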
Example Directory Structure¶
project-name/
├── data/
│ ├── raw/ # Original, immutable data dump
│ ├── external/ # Data from third-party sources
│ ├── interim/ # Intermediate, partially processed data
│ ├── processed/ # Fully processed data
│ └── features/ # Engineered features for modeling
├── scripts/
│ ├── preprocess_data.py # Data cleaning and transformation
│ ├── train_model.py # Model training logic
│ ├── evaluate_model.py # Evaluation metrics and validation
│ ├── deploy_model.py # Deployment script
├── src/
│ ├── __init__.py # Marks the directory as a Python package
│ ├── data/ # Data loading and preprocessing
│ ├── models/ # Model definitions and utilities
│ ├── evaluation/ # Evaluation metrics and methods
│ ├── utils.py # General utility functions
│ └── main.py # Entry point for the application
├── tests/
│ ├── __init__.py # Marks the directory as a Python package
│ ├── test_preprocessing.py # Unit tests for preprocessing
│ ├── test_training.py # Unit tests for model training
│ └── test_end_to_end.py # End-to-end pipeline tests
├── Dockerfile # Dockerfile for containerizing the project
├── pyproject.toml # Dependency management and configuration
├── poetry.lock # Lock file for dependencies
├── .github/
│ ├── workflows/
│ │ ├── test.yaml # CI workflow for testing
│ │ ├── lint.yaml # CI workflow for linting and formatting
│ │ └── deploy.yaml # CI workflow for deployment
├── README.md # Overview and usage instructions
└── LICENSE # License for the project
Key Takeaways¶
Production-Ready Excellence
- Data Organization: A clear separation of raw, processed, and feature data ensures reproducibility and clarity.
- Script Modularity: Dividing the pipeline into preprocessing, training, evaluation, and deployment scripts simplifies maintenance and updates.
- CI/CD Integration: Automated testing and deployment pipelines streamline production readiness and reliability.
- Docker for Consistency: Containerization ensures consistent execution across environments.
This structure is the gold standard for end-to-end machine learning pipelines, ensuring scalability, maintainability, and production readiness.
4.4 Creating Python Packages for Specialized Fine-Tuning and Data Inclusion¶
Overview
In large Gen AI projects, modularizing components into Python packages improves organization, reusability, and maintainability. This section demonstrates how to structure a Python package for fine-tuning processes and embedding or dynamically accessing data.
When to Create a Python Package¶
Python packages are beneficial when:
- Avoiding Overcrowding:
- Separating fine-tuning workflows or specialized functionality keeps the main project clean.
- Reusability Across Projects:
- A fine-tuning package can be easily imported into multiple Gen AI or ML projects.
- Distributing Data or Models:
- Including datasets, trained models, or configuration files ensures accessibility and version control.
Start with a Package
If your workflow has the potential to grow or be reused, start by organizing it into a Python package for better scalability and collaboration.
Recommended Directory Structure for a Python Package with Data¶
Here’s a directory structure for a Python package designed for fine-tuning models and embedding data:
fine_tune_package/
├── src/
│ └── fine_tune_package/
│ ├── __init__.py # Marks the directory as a Python package.
│ ├── datasets.py # Code for accessing included data.
│ ├── fine_tune.py # Core functionality for fine-tuning.
│ ├── utils.py # Helper functions for the package.
│ ├── data/ # Directory containing embedded data.
│ │ ├── __init__.py # Marks the directory as a subpackage.
│ │ ├── model_config.yaml # Example model configuration file.
│ │ └── vocab.txt # Vocabulary or tokenizer data.
│ └── models/ # Directory for pre-trained or fine-tuned models.
│ ├── __init__.py # Marks the directory as a subpackage.
│ └── fine_tuned_model.bin # Binary file of the fine-tuned model.
├── tests/
│ ├── __init__.py # Marks the directory as a package for tests.
│ ├── test_fine_tune.py # Tests for fine-tuning functionality.
│ └── test_datasets.py # Tests for dataset access.
├── pyproject.toml # Configuration for package building and dependencies.
├── README.md # Package description and usage examples.
└── LICENSE # License for the package.
Including Data in a Package¶
Use Cases¶
- Required for Functionality:
- Data files like tokenizers or model configurations are essential for package operation.
- Example Data:
- Including sample datasets to demonstrate package functionality.
- Reproducibility:
- Bundling data ensures code and data are synchronized and version-controlled.
Embedded vs Downloadable Data
- Embed small, essential files directly in the package.
- Provide scripts for downloading large, optional files like pre-trained models.
Embedding Data Using `importlib.resources`¶
Step 1: Add Data to the Package¶
Place required data files (e.g., `model_config.yaml`) in the `data/` subpackage.
fine_tune_package/
├── src/
│ └── fine_tune_package/
│ ├── data/
│ │ ├── __init__.py
│ │ └── model_config.yaml
Step 2: Create a Helper Function to Access the Data¶
Use `importlib.resources` to access embedded files.
from importlib import resources
import yaml

def get_model_config():
    """Get the model configuration file as a dictionary."""
    with resources.path("fine_tune_package.data", "model_config.yaml") as f:
        with open(f, "r") as file:
            config = yaml.safe_load(file)
    return config
Step 3: Access the Data in Your Code¶
Use the helper function to access the configuration file.
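For example, a minimal sketch assuming the package is installed, the helper lives in datasets.py as suggested by the layout above, and the YAML file defines a (hypothetical) model_name key:

from fine_tune_package.datasets import get_model_config  # assumed module location

config = get_model_config()
print(config.get("model_name", "not set"))  # "model_name" is a hypothetical key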
Reusable Data Access
With `importlib.resources`, accessing embedded files becomes consistent and efficient across multiple environments.
Downloading Large Data Files¶
For large datasets or models, include scripts for dynamic downloading.
import os
import requests

def download_model(destination="models/fine_tuned_model.bin"):
    """Download a fine-tuned model file from an external source."""
    url = "https://example.com/fine_tuned_model.bin"
    os.makedirs(os.path.dirname(destination), exist_ok=True)
    response = requests.get(url, stream=True)
    with open(destination, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print(f"Model downloaded to {destination}")
Avoid Embedding Large Files
Large files can increase package size unnecessarily. Use dynamic download scripts for optional, large resources.
Building and Installing the Package¶
Step 1: Update `pyproject.toml`¶
Configure the package for the `src` layout and specify dependencies.
[tool.poetry]
name = "fine-tune-package"
version = "0.1.0"
description = "A package for fine-tuning Gen AI models."
authors = ["Your Name <your.email@example.com>"]
license = "MIT"
[tool.poetry.dependencies]
python = "^3.8"
requests = "^2.28"
pyyaml = "^6.0"
[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
Step 2: Install the Package Locally¶
Install the package in editable mode for development:
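A minimal sketch of the commands this step refers to, run from the package root:

pip install -e .    # editable install with pip
# or, when developing with Poetry:
poetry install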
Editable Installation
Use editable mode to test changes locally without rebuilding the package.
Key Benefits of This Approach¶
- Reusability:
- Install and reuse the package across multiple projects.
- Data Accessibility:
- Embedded data or download scripts ensure smooth workflows.
- Scalability:
- Modular packages reduce code clutter and improve maintainability.
Key Takeaways¶
Enhance Reusability and Professionalism
- Modularize specialized workflows into Python packages for better organization.
- Embed essential data using `importlib.resources` for easy access.
- Use download scripts for large optional resources to keep packages lightweight.
- A well-structured Python package boosts reusability, maintainability, and scalability in Gen AI projects.
5. Overview of the Standard Directory Structure¶
Purpose of the Standard Directory Structure
This section provides a comprehensive overview of each directory and file in the standard project structure. It explains their purpose, contents, and relevance for collaboration and maintaining a professional development workflow.
├── .devcontainer <- Directory for Visual Studio Code Dev Container configuration.
│ └── devcontainer.json <- Configuration file for defining the development container.
├── .github <- Directory for GitHub-specific configuration and metadata.
│ ├── CODEOWNERS <- File to define code owners for the repository.
│ ├── CONTRIBUTING.md <- Guidelines for contributing to the project.
│ └── pull_request_template.md <- Template for pull requests to standardize and improve PR quality.
├── .vscode <- Directory for Visual Studio Code-specific configuration files.
│ ├── cspell.json <- Configuration file for the Code Spell Checker extension.
│ ├── dictionaries <- Directory for custom dictionary files.
│ │ └── data-science-en.txt <- Custom dictionary for data science terminology.
│ ├── extensions.json <- Recommended extensions for the project.
│ └── settings.json <- Workspace-specific settings for Visual Studio Code.
├── config <- Configuration files for the project.
├── data <- Data for the project, divided into different stages of data processing.
│ ├── raw <- Original, immutable data dump.
│ ├── external <- Data from third-party sources.
│ ├── interim <- Intermediate data, partially processed.
│ ├── processed <- Fully processed data, ready for analysis.
│ └── features <- Engineered features ready for model training.
├── docs <- Documentation for the project.
│ ├── api-reference.md <- API reference documentation.
│ ├── explanation.md <- Detailed explanations and conceptual documentation.
│ ├── how-to-guides.md <- Step-by-step guides on performing common tasks.
│ ├── index.md <- The main documentation index page.
│ └── tutorials.md <- Tutorials related to the project.
├── log <- Logs generated by the project.
├── models <- Machine learning models, scripts, and other related artifacts.
├── notebooks <- Jupyter notebooks for experiments, examples, or data analysis.
├── scripts <- Directory for project-specific scripts and utilities.
│ └── hooks <- Directory for custom git hooks and other automation scripts.
│ ├── branch-name-check.sh <- Hook script for checking branch names.
│ ├── commit-msg-check.sh <- Hook script for checking commit messages.
│ ├── filename-check.sh <- Hook script for checking file names.
│ ├── generate_docs.sh <- Script for generating documentation.
│ └── restricted-file-check.sh <- Hook script for checking restricted files.
├── src <- Source code for the project.
│ └── collaborativeaitoy <- Main project module.
│ ├── __init__.py <- Initializes the Python package.
│ ├── main.py <- Entry point for the application.
│ ├── app.py <- Main application logic.
│ └── utils.py <- Utility functions.
├── tests <- Directory for all project tests.
│ ├── integration <- Integration tests.
│ └── spec <- Specification tests (unit tests).
├── .gitignore <- Specifies intentionally untracked files to ignore.
├── .pre-commit-config.yaml <- Configuration for pre-commit hooks.
├── Dockerfile <- Dockerfile for containerizing the application.
├── Makefile <- Makefile with commands like `make data` or `make train`.
├── mkdocs.yml <- Configuration file for MkDocs, a static site generator for project documentation.
├── pyproject.toml <- Configuration file for Python projects which includes dependencies and package information.
├── README.md <- The top-level README for developers using this project.
└── .env <- Environment variables configuration file (not visible).
1. `.devcontainer`¶
Visual Studio Code Dev Containers
Purpose: Defines a consistent development environment in containers.
Example File: `devcontainer.json`.
- Collaboration Benefit: Reduces "it works on my machine" issues by standardizing the development setup.
2. `.github`¶
Automating Repository Standards
Purpose: Maintains repository management and automation.
Key Files:
- `CODEOWNERS`: Assigns code ownership for specific files or directories.
- `CONTRIBUTING.md`: Guides contributions to the project.
- `pull_request_template.md`: Standardizes pull request formats.
- Collaboration Benefit: Ensures repository standards for all contributors.
3. `.vscode`¶
VS Code-Specific Configurations
Purpose: Enhances the development experience with workspace-specific settings.
Key Files:
- `settings.json`: Project-specific settings.
- `extensions.json`: Recommended extensions.
- `cspell.json`: Spell-checker configuration for technical terms.
- Collaboration Benefit: Aligns all contributors with consistent editor configurations.
4. `config`¶
Centralized Configurations
Purpose: Stores application settings and environment configurations.
- Collaboration Benefit: Avoids hardcoding settings and improves maintainability.
5. `data`¶
Organized Data Processing
Purpose: Organizes data by processing stages.
Structure:
- `raw/`: Immutable data.
- `processed/`: Data ready for analysis.
- Collaboration Benefit: Ensures reproducibility and clarity in data workflows.
6. `docs`¶
Comprehensive Documentation
Purpose: Provides user guides, tutorials, and API references.
Key Files:
- api-reference.md
- how-to-guides.md
- Collaboration Benefit: Onboards team members quickly and improves transparency.
7. `log`¶
Project Logs
Purpose: Tracks runtime behavior and debugging information.
- Collaboration Benefit: Helps identify and resolve runtime issues.
8. `models`¶
Centralized Model Storage
Purpose: Stores pre-trained or fine-tuned models.
- Collaboration Benefit: Ensures consistency in model usage across the team.
9. `notebooks`¶
Prototyping Space
Purpose: Allows for exploratory work before integration into the main codebase.
- Collaboration Benefit: Serves as a shared workspace for experiments.
10. `scripts`¶
Reusable Automation Scripts
Purpose: Automates repetitive tasks and project utilities.
- Collaboration Benefit: Standardizes task execution and reduces manual effort.
11. `src`¶
Core Project Code
Purpose: Encapsulates the primary functionality of the project.
- Collaboration Benefit: Organizes code for scalability and maintainability.
12. `tests`¶
Code Reliability
Purpose: Validates code functionality through unit and integration tests.
- Collaboration Benefit: Prevents regressions and ensures quality in changes.
13. Key Root-Level Files¶
| File | Purpose | Relevance |
|---|---|---|
| `.gitignore` | Excludes unnecessary files from version control. | Keeps the repository clean. |
| `.pre-commit-config.yaml` | Automates code checks before commits. | Enforces coding standards. |
| `Dockerfile` | Containerizes the application. | Ensures consistent deployment. |
| `Makefile` | Automates common tasks. | Standardizes workflows for team members. |
| `mkdocs.yml` | Configures project documentation. | Simplifies and standardizes documentation generation. |
| `pyproject.toml` | Manages dependencies and builds. | Centralizes project configuration. |
| `README.md` | Provides project overview and setup instructions. | First point of contact for new team members or external collaborators. |
| `.env` | Stores environment-specific variables. | Improves security by keeping sensitive data out of the codebase. |
Conclusion¶
Collaborative Excellence
This standard directory structure enhances collaboration, scalability, and maintainability in AI/ML projects. By providing clear organization and purpose for each component, teams can efficiently navigate, contribute, and extend the project.
6. Automating Standards with Cookiecutter¶
Overview of Cookiecutter for Project Automation
Automating project scaffolding ensures consistency across projects and saves time when starting new ones. Cookiecutter is a tool that simplifies this process by generating standardized project structures based on templates. This section introduces Cookiecutter, demonstrates its usage, and highlights its benefits for collaboration and reproducibility.
What is Cookiecutter?¶
About Cookiecutter
Cookiecutter is a command-line utility for creating projects from predefined templates. It allows teams to:
- Enforce consistent directory structures and configurations.
- Automate repetitive setup tasks.
- Customize templates for specific organizational or project needs.
Why Use Cookiecutter?¶
Benefits of Cookiecutter
- Standardization: Every project follows the same structure, enhancing maintainability.
- Time-Saving: Automates creating directories, boilerplate files, and configurations.
- Customizability: Allows user input (e.g., project names, author details) to tailor templates.
- Scalability: Reuse templates across teams for various projects, such as ML pipelines or Gen AI applications.
Step-by-Step Demonstration¶
Step 1: Install Cookiecutter¶
Dependency Management
Ensure `pip` is up-to-date to avoid installation issues.
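A minimal sketch of the installation commands:

python -m pip install --upgrade pip   # keep pip current, as advised above
pip install cookiecutter              # install the Cookiecutter CLI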
Step 2: Choose or Create a Template¶
Example Template: ML Project Structure¶
cookiecutter-ml-template/
├── {{cookiecutter.project_slug}}/
│ ├── data/
│ │ ├── raw/
│ │ ├── processed/
│ ├── notebooks/
│ │ └── example_notebook.ipynb
│ ├── src/
│ │ ├── __init__.py
│ │ ├── main.py
│ │ └── utils.py
│ ├── tests/
│ │ ├── __init__.py
│ │ └── test_main.py
│ ├── .gitignore
│ ├── README.md
│ ├── pyproject.toml
│ └── LICENSE
├── cookiecutter.json
└── README.md
Step 3: Configure the Template¶
Define prompts and their default values in `cookiecutter.json`:
{
  "project_name": "My ML Project",
  "project_slug": "{{ cookiecutter.project_name.lower().replace(' ', '_') }}",
  "author_name": "Your Name",
  "description": "A machine learning project template.",
  "license": "MIT"
}
Step 4: Generate a New Project¶
Run Cookiecutter with the template:
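For example, pointing Cookiecutter at the template; the repository URL is a placeholder, and a local path to the template works as well:

cookiecutter https://github.com/your-org/cookiecutter-ml-template
# or, from a local copy of the template:
cookiecutter ./cookiecutter-ml-template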
Example input prompts:
project_name [My ML Project]: Awesome AI Project
author_name [Your Name]: Jane Doe
description [A machine learning project template.]: Fine-tuning Gen AI Models
license [MIT]: Apache-2.0
Generated structure:
awesome_ai_project/
├── data/
├── notebooks/
├── src/
├── tests/
├── README.md
├── pyproject.toml
├── LICENSE
└── .gitignore
Step 5: Customize and Use the Project¶
Navigate to the generated directory:
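For example, using the project generated from the prompts above:

cd awesome_ai_project
git init   # optional: put the new project under version control right away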
Next Steps
Start coding within a pre-configured environment, saving time and reducing setup errors.
Advanced Features¶
Hooks¶
Post-Generation Automation
Run custom scripts (e.g., initialize a Git repository, install dependencies) automatically after project creation.
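As a sketch, a template author might add a hooks/post_gen_project.py file like the one below; Cookiecutter runs it inside the newly generated project directory, and it assumes Git is available on the machine:

# hooks/post_gen_project.py -- runs in the generated project directory
import subprocess

subprocess.run(["git", "init"], check=True)       # initialize a Git repository
subprocess.run(["git", "add", "-A"], check=True)  # stage the generated files
print("Post-generation hook finished: Git repository initialized.")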
Private Templates¶
Store templates in private GitHub repositories to maintain confidentiality and customize for internal use.
Integration with CI/CD¶
Automate project generation as part of a CI/CD pipeline to maintain consistency across deployments.
Key Takeaways¶
- Automation and Standardization: Cookiecutter ensures consistent and professional project setups.
- Scalability: Templates can evolve with team standards and project complexity.
- Reusability: Save time and reduce errors by leveraging pre-built templates.
Next Steps: Explore existing templates or customize one for your team’s ML or Gen AI projects.
7. Advanced Topics and Future Considerations¶
How the Structure Supports CI/CD Workflows¶
1. Automated Testing and Quality Assurance¶
Integrating with CI/CD Tools
- Tools like GitHub Actions and Jenkins can utilize the `tests/` directory for automated testing.
- Pre-commit hooks configured in `.pre-commit-config.yaml` enforce code quality before commits.
Example Workflow Configuration:
name: CI Workflow
on:
  push:
    branches:
      - main
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'
      - name: Install Dependencies
        run: |
          pip install poetry
          poetry install
      - name: Run Tests
        run: poetry run pytest
2. Deployment Pipelines¶
Using Docker for Deployment
- The `Dockerfile` ensures consistent execution environments across development and production.
- CI/CD pipelines can automate building and pushing Docker images.
Transitioning from POCs to Production¶
1. Modular Design¶
Separation of Concerns
Organizing code into reusable modules (`src/`) and structuring data (`data/`) by processing stages ensures scalability and maintainability.
2. Collaboration Readiness¶
Documentation in `docs/` and `README.md` enables seamless handoffs between teams and stakeholders.
Documentation Tools: MkDocs¶
What is MkDocs?¶
MkDocs converts Markdown files (e.g., in `docs/`) into professional static documentation sites.
Benefits of MkDocs¶
- Centralized Documentation: Consolidates all guides and references.
- Ease of Use: Markdown-based editing simplifies updates.
- Live Preview: View changes locally during development.
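As a minimal illustration, an mkdocs.yml for the docs/ files listed earlier might look like the sketch below; the theme entry is optional and assumes the mkdocs-material package is installed. Run `mkdocs serve` for the live preview and `mkdocs build` to generate the static site.

site_name: My ML Project Docs
nav:
  - Home: index.md
  - Tutorials: tutorials.md
  - How-To Guides: how-to-guides.md
  - API Reference: api-reference.md
  - Explanation: explanation.md
theme:
  name: material  # optional third-party theme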
Key Takeaways¶
- Automate testing, deployment, and documentation workflows for robust CI/CD integration.
- Modularize code and document comprehensively for a seamless transition from POCs to production.
- Leverage MkDocs for high-quality, accessible project documentation.
Next Steps: Implement CI/CD pipelines and explore MkDocs for your project documentation.