Skip to content

How-To Guide: Setting Up and Using Data Version Control with DVC

Setting Up DVC in Your Project

Step 1: Initialize DVC

Initialization

  1. Open your terminal.
  2. Navigate to your project repository.
  3. Run the following command:
    dvc init
    
    This initializes DVC in your repository, creating a .dvc directory.

Step 2: Set Up Remote Storage

Choose Your Storage Backend

DVC supports various storage backends. Depending on your choice (Azure, AWS, local), follow the appropriate steps below.

For Azure Blob Storage

  1. Run the commands:
    dvc remote add -d myremote azure://mycontainer/path
    dvc remote modify myremote connection_string 'myconnectionstring'
    

For AWS S3

  1. Configure AWS and run:
    dvc remote add -d myremote s3://mybucket/path
    

For Local Storage

  1. Set up local storage:
    dvc remote add -d myremote /path/to/local/storage
    

Step 3: Add Data to DVC

Tracking Data

  1. Use the dvc add command to track files or directories:
    dvc add data/dataset.csv
    

Step 4: Commit Changes to Version Control

  1. Commit the changes to both DVC and Git:
    git add data/dataset.csv.dvc data/.gitignore
    git commit -m "Add dataset with DVC"
    

Step 5: Push Data to Remote Storage

  1. Push your data to the remote storage:
    dvc push
    

Updating Data with DVC

Local Data Updates

Updating Local Data

  1. Modify your dataset file (e.g., dataset.csv).
  2. Run dvc add dataset.csv to update the .dvc file.
  3. Commit and push the changes:
    git add dataset.csv.dvc
    git commit -m "Update dataset.csv"
    dvc push
    

Cloud Data Updates

Cloud Updates

  1. Synchronize with the cloud using dvc pull.
  2. After detecting changes, pull the updated data.
  3. Commit and push any local changes:
    dvc add dataset.csv
    git add dataset.csv.dvc
    git commit -m "Update dataset.csv with cloud changes"
    dvc push
    

Conclusion

By following these steps, you can effectively utilize DVC for data version control, ensuring the consistency and reproducibility of your data science and machine learning projects.