Building a Reproducible ML Pipeline with Datasets.do and PyTorch

"Which version of the data was this model trained on?"

If you're in machine learning, you've likely asked, or been asked, this question. It's the loose thread that can unravel an entire project, leading to a frustrating lack of reproducibility. When datasets are just a collection of files in a cloud bucket—data_final.csv, data_final_v2.csv, data_final_v2_fixed.csv—chaos is inevitable.

What if you could manage your data with the same rigor and clarity as you manage your code?

This is the core promise of Datasets.do: to treat your datasets as code. By programmatically defining, versioning, and managing your AI training data, you can build truly end-to-end reproducible ML pipelines.

In this post, we'll walk you through how to seamlessly integrate Datasets.do with PyTorch to create a robust and repeatable training workflow. Let's banish data chaos for good.

The Problem: Data as an Afterthought

In modern software development, Git provides a source of truth for our code. We can track every change, collaborate with confidence, and revert to any previous state. Yet, we rarely afford our data the same courtesy.

This leads to common pain points:

No Version History: It's difficult to track how a dataset has evolved—which annotations were added, which items were cleaned, or how it was split.
Collaboration Friction: Teams struggle to stay in sync, often working from different or outdated copies of the data.
Reproducibility Nightmare: Re-running an old experiment becomes a forensic task of hunting down the exact state of the data used months ago.

Enter Datasets.do: Your Datasets, as Code

Datasets.do solves this by providing a simple, powerful API to manage your data programmatically. It decouples the data definition (the schema, metadata, and version) from the underlying storage, giving you a centralized control plane for your most critical asset.

Here’s how you define and register a new dataset. Notice how it's declarative, versioned, and schema-enforced right from the start.

import { Dataset } from 'datasets.do';

// Define and register a new dataset schema
const imageCaptions = await Dataset.create({
  name: 'image-caption-pairs-v2',
  description: '1M image-caption pairs for model training.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' }
  }
});

// Add new records via the API
await imageCaptions.addRecords([
  { imageUrl: 'https://cdn.do/img-1.jpg', caption: 'A photo of a cat on a couch.' },
  { imageUrl: 'https://cdn.do/img-2.jpg', caption: 'A photo of a boat on the water.' }
]);

Managing datasets this way enables versioning, collaboration, and the traceability needed for serious model development.
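
To make "schema-enforced" concrete, here is a minimal Python sketch of the kind of check that a schema like the one above implies. It is a local stand-in, not the Datasets.do SDK: the SCHEMA dict simply mirrors the fields from the example, and validate_record is a hypothetical helper introduced for illustration.

```python
# Local illustration only: SCHEMA mirrors the example above, and
# validate_record is a hypothetical helper, not part of any SDK.
SCHEMA = {
    "imageUrl": {"type": "string", "required": True},
    "caption": {"type": "string", "required": True},
    "source": {"type": "string"},
}

# Map schema type names to Python types (only "string" is needed here).
PYTHON_TYPES = {"string": str}


def validate_record(record: dict, schema: dict = SCHEMA) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, rules in schema.items():
        if field not in record:
            if rules.get("required", False):
                problems.append(f"missing required field: {field}")
            continue
        expected = PYTHON_TYPES[rules["type"]]
        if not isinstance(record[field], expected):
            problems.append(f"{field} should be of type {rules['type']}")
    # Fields outside the schema are flagged rather than silently accepted.
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems
```

Running a check like this before upload catches malformed records early, whatever validation the service performs on its side.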

Integrating Datasets.do with PyTorch: A Step-by-Step Guide

Now, let's get practical. Here’s how you integrate a versioned dataset from Datasets.do directly into a PyTorch training pipeline.

Step 1: Ensure Your Data is in Datasets.do

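First, make sure you have defined and populated your dataset using the Datasets.do API, as shown in the code block above. The key is the unique, versioned name (image-caption-pairs-v2). This name is your reproducible pointer to a specific version of the data, and it only stays reproducible if that version is treated as immutable: once a version has been used for training, publish changes under a new name rather than adding records to the old one.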

Step 2: Create a Custom PyTorch Dataset

The integration hinges on a custom class that inherits from torch.utils.data.Dataset. This class serves as a bridge, fetching data on the fly from the Datasets.do API.

This approach avoids downloading the full dataset up front: the Dataset class retrieves only the records needed for the current batch. The trade-off is latency. In the code below, every sample costs one API call plus one image download, so throughput depends on DataLoader workers, caching, or batched reads.

import torch
from torch.utils.data import Dataset
from datasets_do import Dataset as DatasetsDoAPI # Alias to avoid name clash
from PIL import Image
import requests
from io import BytesIO

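# Toy character-level tokenizer for demonstration only.
# It returns variable-length tensors, so batching needs padding, truncation, or a custom collate_fn.
def tokenize(caption):
    return torch.tensor([ord(c) for c in caption], dtype=torch.long)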

class DatasetsDoImageCaption(Dataset):
    """
    A custom PyTorch Dataset to interface with Datasets.do.
    """
    def __init__(self, dataset_name: str, transform=None):
        """
        Initializes the Dataset by fetching metadata from Datasets.do.

        Args:
            dataset_name (str): The unique name of the dataset (e.g., 'image-caption-pairs-v2').
            transform: PyTorch transforms to be applied to the images.
        """
        self.transform = transform
        
        # Get a handle to the versioned dataset
        self.dataset_api = DatasetsDoAPI.get(dataset_name)
        
        # Get the total number of records for __len__
        self.record_count = self.dataset_api.count() # Fetches total record count

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return self.record_count

    def __getitem__(self, idx):
        """
        Fetches a single data point from Datasets.do by index.
        """
        # Fetch a single record's data from the API
        record = self.dataset_api.get_record_by_index(idx) # Fetches one item

        # 1. Load the image from its URL (fail fast on hangs and HTTP errors)
        response = requests.get(record['imageUrl'], timeout=10)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert("RGB")

        if self.transform:
            image = self.transform(image)

        # 2. Get and tokenize the caption
        caption = record['caption']
        tokenized_caption = tokenize(caption) # Your tokenizer logic here

        return image, tokenized_caption

Note: DatasetsDoAPI.get(), .count(), and .get_record_by_index() are illustrative methods, not documented calls. The actual SDK will provide similar functionality for accessing your versioned data, so check its documentation for the real method names.
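
Fetching one record per request can dominate step time. One common mitigation is to read records in pages and keep recently used pages in memory. The sketch below assumes a hypothetical fetch_page(page_index, page_size) callable that returns a list of records. Nothing above documents such a call, so treat it as a placeholder for whatever bulk-read method the SDK offers.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class PagedRecordCache:
    """Serve records by index while fetching them one page at a time.

    fetch_page is a hypothetical callable (page_index, page_size) -> list
    of records; swap in the SDK's real bulk-read call if one exists.
    """

    def __init__(self, fetch_page: Callable[[int, int], List[Dict]],
                 page_size: int = 256, max_pages: int = 8):
        self.fetch_page = fetch_page
        self.page_size = page_size
        self.max_pages = max_pages
        self._pages = OrderedDict()  # page_index -> records, in LRU order

    def get(self, idx: int) -> Dict:
        page_index, offset = divmod(idx, self.page_size)
        if page_index in self._pages:
            self._pages.move_to_end(page_index)  # mark as recently used
        else:
            self._pages[page_index] = self.fetch_page(page_index, self.page_size)
            if len(self._pages) > self.max_pages:
                self._pages.popitem(last=False)  # evict least recently used
        return self._pages[page_index][offset]
```

Inside __getitem__, a call such as cache.get(idx) would then replace the per-record request. Two caveats: fully random shuffling lowers the hit rate, so a cache like this helps most with sequential or chunk-wise access, and each DataLoader worker process holds its own copy of the cache.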

Step 3: Wire It Up with a DataLoader

With the custom Dataset class defined, it plugs straight into the standard PyTorch workflow. PyTorch's DataLoader handles batching, shuffling, and parallel loading across worker processes.

from torch.utils.data import DataLoader
from torchvision import transforms

# Define standard image transformations
image_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Instantiate our custom dataset, pointing to a *specific* version
# This string is the SINGLE source of truth for your data!
DATASET_VERSION = 'image-caption-pairs-v2' 
training_data = DatasetsDoImageCaption(
    dataset_name=DATASET_VERSION,
    transform=image_transform
)

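# Create the PyTorch DataLoader.
# The default collate function stacks samples, so every caption tensor in a batch
# must have the same length (pad or truncate in tokenize, or pass a collate_fn).
train_dataloader = DataLoader(training_data, batch_size=32, shuffle=True, num_workers=4)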
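
The toy tokenizer from Step 2 returns caption tensors of different lengths, which the DataLoader's default collate function cannot stack into a batch. A minimal sketch of one way around this is a custom collate function that pads captions to the longest one in the batch. The names pad_collate and PAD_ID are assumptions made for this example, and a padding value of 0 is only safe because the character codes produced by that tokenizer are never 0 for ordinary text.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # Assumed padding value; pick one your tokenizer never emits


def pad_collate(batch):
    """Collate (image, caption) pairs, padding captions to a common length."""
    images, captions = zip(*batch)
    # Images share a shape after the transforms, so they can be stacked.
    image_batch = torch.stack(images)
    # Captions vary in length; pad on the right up to the longest in the batch.
    caption_batch = pad_sequence(list(captions), batch_first=True,
                                 padding_value=PAD_ID)
    return image_batch, caption_batch
```

Passing collate_fn=pad_collate to the DataLoader call keeps the (images, captions) batch structure unchanged, so downstream code that unpacks two values still works. A real loss function would also need to ignore the padded positions.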

Step 4: Run Your Reproducible Training Loop

Your training loop now looks exactly like any other PyTorch training loop. The underlying complexity of data fetching and versioning has been cleanly abstracted away.

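# A standard PyTorch training loop
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 10  # Example value; tune for your task

model = ... # Your PyTorch model (move it to `device` before training)
optimizer = ... # Your optimizer
criterion = ... # Your loss function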

print(f"Starting training on dataset: {DATASET_VERSION}")

for epoch in range(num_epochs):
    for images, captions in train_dataloader:
        # Move data to the appropriate device (CPU/GPU)
        images, captions = images.to(device), captions.to(device)

        # --- Your model training logic ---
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, captions)
        loss.backward()
        optimizer.step()
        # --- End of training logic ---

    # loss holds the value from the final batch of the epoch, not an average
    print(f"Epoch {epoch+1} complete. Last batch loss: {loss.item():.4f}")

The practical payoff: to test a new version of the data, say image-caption-pairs-v3 with more cleaning, you only need to change one line of code:

DATASET_VERSION = 'image-caption-pairs-v3'

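Re-run your script, and the entire pipeline pulls from the new dataset version. Because the version name lives in the code, every run can be traced back to the data it used. For run-to-run repeatability, also fix the random seeds that drive shuffling and weight initialization.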
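
Pinning the dataset version fixes what the model sees, but not the order in which it sees it. A minimal sketch of seeding the shuffle follows. It uses the DataLoader's generator argument, and the helper names make_seeded_loader and seed_everything are made up for this example.

```python
import random

import torch
from torch.utils.data import DataLoader


def make_seeded_loader(dataset, seed: int, batch_size: int = 32) -> DataLoader:
    """Build a DataLoader whose shuffle order is determined by `seed`."""
    generator = torch.Generator()
    generator.manual_seed(seed)
    return DataLoader(dataset, batch_size=batch_size, shuffle=True,
                      generator=generator)


def seed_everything(seed: int) -> None:
    """Seed the Python and PyTorch global RNGs used by most training code."""
    random.seed(seed)
    torch.manual_seed(seed)
```

Two loaders built with the same seed over the same data yield batches in the same order. This is not a full determinism guarantee: some GPU operations are nondeterministic by default, and random augmentations inside worker processes need their own seeding.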

The Payoff: Certainty and Speed

By integrating Datasets.do with PyTorch, you gain powerful advantages:

Reproducibility: Tie a model version to a code commit and a data version. Together with fixed random seeds, this lets you recreate the conditions of any experiment.
Built-in Versioning: Create immutable versions of your datasets effortlessly. Track changes, annotations, and splits over time, just like code branches.
Scalability by Design: The API-first approach handles datasets of any size, as you only query and load the data you need, when you need it.
Streamlined Collaboration: Your entire team works from a single source of truth, referencing datasets by a clear, unambiguous name.
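
One lightweight way to tie a model to a code commit and a data version is to write a small manifest next to each checkpoint. The sketch below is one possible layout, not a Datasets.do feature: the field names and the helpers build_run_manifest and save_run_manifest are assumptions, and the commit string is supplied by the caller.

```python
import hashlib
import json
from datetime import datetime, timezone


def build_run_manifest(dataset_version: str, code_commit: str, seed: int,
                       hyperparams: dict) -> dict:
    """Collect the identifiers needed to repeat a training run."""
    manifest = {
        "dataset_version": dataset_version,
        "code_commit": code_commit,
        "seed": seed,
        "hyperparams": hyperparams,
    }
    # Stable fingerprint of the configuration (computed before the timestamp
    # is added, so identical configurations hash identically).
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["config_hash"] = hashlib.sha256(canonical).hexdigest()
    manifest["created_at"] = datetime.now(timezone.utc).isoformat()
    return manifest


def save_run_manifest(manifest: dict, path: str) -> None:
    """Write the manifest as JSON, e.g. alongside the model checkpoint."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
```

Two runs with the same dataset version, commit, seed, and hyperparameters produce the same config_hash, which makes it easy to spot when a supposedly repeated experiment actually differs.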

Stop wrestling with files and start delivering high-quality, reliable training data.

Ready to build your own reproducible ML pipeline? Visit Datasets.do to get started and manage your data as code.
