"Which version of the data was this model trained on?"
If you're in machine learning, you've likely asked, or been asked, this question. It's the loose thread that can unravel an entire project, leading to a frustrating lack of reproducibility. When datasets are just a collection of files in a cloud bucket—data_final.csv, data_final_v2.csv, data_final_v2_fixed.csv—chaos is inevitable.
What if you could manage your data with the same rigor and clarity as you manage your code?
This is the core promise of Datasets.do: to treat your datasets as code. By programmatically defining, versioning, and managing your AI training data, you can build truly end-to-end reproducible ML pipelines.
In this post, we'll walk you through how to seamlessly integrate Datasets.do with PyTorch to create a robust and repeatable training workflow. Let's banish data chaos for good.
In modern software development, Git provides a source of truth for our code. We can track every change, collaborate with confidence, and revert to any previous state. Yet, we rarely afford our data the same courtesy.
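This leads to common pain points: nobody can say with certainty which file a model was trained on, near-duplicate copies such as data_final_v2_fixed.csv pile up in storage, and earlier results become impossible to reproduce.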
Datasets.do solves this by providing a simple, powerful API to manage your data programmatically. It decouples the data definition (the schema, metadata, and version) from the underlying storage, giving you a centralized control plane for your most critical asset.
Here’s how you define and register a new dataset. Notice how it's declarative, versioned, and schema-enforced right from the start.
import { Dataset } from 'datasets.do';

// Define and register a new dataset schema
const imageCaptions = await Dataset.create({
  name: 'image-caption-pairs-v2',
  description: '1M image-caption pairs for model training.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' }
  }
});

// Add new records via the API
await imageCaptions.addRecords([
  { imageUrl: 'https://cdn.do/img-1.jpg', caption: 'A photo of a cat on a couch.' },
  { imageUrl: 'https://cdn.do/img-2.jpg', caption: 'A photo of a boat on the water.' }
]);
Managing datasets this way enables versioning, collaboration, and the traceability needed for serious model development.
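To make "schema-enforced" concrete, here is a minimal Python sketch of the kind of check the schema above implies. It is not the Datasets.do implementation: SCHEMA mirrors the fields from the example, while validate_record and the PY_TYPES mapping are hypothetical names used only for illustration.

```python
# Hypothetical client-side check; the real service may validate differently.
SCHEMA = {
    "imageUrl": {"type": "string", "required": True},
    "caption": {"type": "string", "required": True},
    "source": {"type": "string"},
}

# Assumed mapping from schema type names to Python types
PY_TYPES = {"string": str}


def validate_record(record, schema=SCHEMA):
    """Return a list of human-readable problems; an empty list means the record is valid."""
    problems = []
    for field, rule in schema.items():
        if field not in record:
            if rule.get("required", False):
                problems.append(f"missing required field: {field}")
            continue
        if not isinstance(record[field], PY_TYPES[rule["type"]]):
            problems.append(f"{field} should be {rule['type']}")
    # Flag fields the schema does not know about
    for field in record:
        if field not in schema:
            problems.append(f"unexpected field: {field}")
    return problems


print(validate_record({"caption": "A photo of a cat on a couch."}))
# ['missing required field: imageUrl']
```

Running a check like this before calling addRecords would surface bad rows early, although the service's own validation remains the authority.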
Now, let's get practical. Here’s how you integrate a versioned dataset from Datasets.do directly into a PyTorch training pipeline.
First, make sure you have defined and populated your dataset using the Datasets.do API, as shown in the code block above. The key is the unique, versioned name (image-caption-pairs-v2). This name is your reproducible pointer to a specific, immutable version of the data.
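Since the version lives in the name, a small guard can fail fast when someone passes an unversioned name. The sketch below assumes the "-v<number>" suffix convention seen in the example name; the function split_versioned_name is hypothetical and not part of any SDK.

```python
import re

# Assumed convention: names end in "-v<integer>", as in 'image-caption-pairs-v2'
_VERSIONED_NAME = re.compile(r"^(?P<base>.+)-v(?P<version>\d+)$")


def split_versioned_name(name):
    """Split 'image-caption-pairs-v2' into ('image-caption-pairs', 2).

    Raises ValueError if the name carries no version suffix.
    """
    match = _VERSIONED_NAME.match(name)
    if match is None:
        raise ValueError(f"dataset name has no version suffix: {name!r}")
    return match.group("base"), int(match.group("version"))


print(split_versioned_name("image-caption-pairs-v2"))  # ('image-caption-pairs', 2)
```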
The integration hinges on a custom class that inherits from torch.utils.data.Dataset. This class acts as a bridge, fetching data on the fly from the Datasets.do API.
This approach avoids a bulk download: you don't need to copy terabytes of data locally, because the Dataset class retrieves only the records a given training step asks for. The trade-off is one API call and one image download per sample, so throughput depends on network latency and on how many DataLoader workers you run.
import torch
from torch.utils.data import Dataset
from datasets_do import Dataset as DatasetsDoAPI # Alias to avoid name clash
from PIL import Image
import requests
from io import BytesIO
# Fictional tokenizer for demonstration: one integer per character.
# Captions are truncated or zero-padded to max_len so that the default
# DataLoader collation can stack them into a single batch tensor.
def tokenize(caption, max_len=64):
    ids = [ord(c) for c in caption][:max_len]
    ids += [0] * (max_len - len(ids))
    return torch.tensor(ids)


class DatasetsDoImageCaption(Dataset):
    """
    A custom PyTorch Dataset to interface with Datasets.do.
    """

    def __init__(self, dataset_name: str, transform=None):
        """
        Initializes the Dataset by fetching metadata from Datasets.do.

        Args:
            dataset_name (str): The unique name of the dataset (e.g., 'image-caption-pairs-v2').
            transform: PyTorch transforms to be applied to the images.
        """
        self.transform = transform
        # Get a handle to the versioned dataset
        self.dataset_api = DatasetsDoAPI.get(dataset_name)
        # Get the total number of records for __len__
        self.record_count = self.dataset_api.count()  # Fetches total record count

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return self.record_count

    def __getitem__(self, idx):
        """
        Fetches a single data point from Datasets.do by index.
        """
        # Fetch a single record's data from the API
        record = self.dataset_api.get_record_by_index(idx)  # Fetches one item

        # 1. Load the image from its URL
        response = requests.get(record['imageUrl'], timeout=10)
        response.raise_for_status()  # Fail loudly rather than decode an error page as an image
        image = Image.open(BytesIO(response.content)).convert("RGB")
        if self.transform:
            image = self.transform(image)

        # 2. Get and tokenize the caption
        caption = record['caption']
        tokenized_caption = tokenize(caption)  # Your tokenizer logic here

        return image, tokenized_caption
Note: DatasetsDoAPI.get(), .count(), and .get_record_by_index() are illustrative methods. The actual SDK will provide similar functionality for accessing your versioned data.
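Because those method names are illustrative, it can help to code against a small stand-in while prototyping. The sketch below is a hypothetical in-memory client exposing the same three calls (get, count, get_record_by_index). It is not the Datasets.do SDK, and InMemoryDatasetClient and its register helper are invented for this example.

```python
class InMemoryDatasetClient:
    """Hypothetical test double mirroring the illustrative get/count/get_record_by_index calls."""

    _registry = {}  # maps dataset name -> client instance holding its records

    def __init__(self, name, records):
        self.name = name
        self._records = list(records)

    @classmethod
    def register(cls, name, records):
        """Store records under a versioned name so get() can find them later."""
        cls._registry[name] = cls(name, records)

    @classmethod
    def get(cls, name):
        """Return the handle for a name; raises KeyError if it was never registered."""
        return cls._registry[name]

    def count(self):
        return len(self._records)

    def get_record_by_index(self, idx):
        if not 0 <= idx < len(self._records):
            raise IndexError(f"record index out of range: {idx}")
        # Return a copy so callers cannot mutate the stored version
        return dict(self._records[idx])


InMemoryDatasetClient.register("image-caption-pairs-v2", [
    {"imageUrl": "https://cdn.do/img-1.jpg", "caption": "A photo of a cat on a couch."},
    {"imageUrl": "https://cdn.do/img-2.jpg", "caption": "A photo of a boat on the water."},
])
handle = InMemoryDatasetClient.get("image-caption-pairs-v2")
print(handle.count())  # 2
```

Swapping a class like this in for DatasetsDoAPI lets the length and record-lookup paths of the Dataset class run without calling the API. The image download in __getitem__ would still need a reachable URL or a stub of its own.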
With our custom Dataset class defined, plugging it into the standard PyTorch workflow is effortless. PyTorch's DataLoader will handle all the heavy lifting of batching, shuffling, and parallel data loading for you.
from torch.utils.data import DataLoader
from torchvision import transforms
# Define standard image transformations
image_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Instantiate our custom dataset, pointing to a *specific* version
# This string is the SINGLE source of truth for your data!
DATASET_VERSION = 'image-caption-pairs-v2'
training_data = DatasetsDoImageCaption(
    dataset_name=DATASET_VERSION,
    transform=image_transform
)

# Create the PyTorch DataLoader
train_dataloader = DataLoader(training_data, batch_size=32, shuffle=True, num_workers=4)
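Conceptually, shuffle=True with batch_size=32 means: permute the sample indices once per epoch, then slice the permutation into chunks. The pure-Python sketch below illustrates that idea only. It is not how PyTorch implements DataLoader, and batched_indices is a hypothetical helper. It also shows why a fixed seed matters for reproducibility: the same seed yields the same batch order.

```python
import random


def batched_indices(num_samples, batch_size, seed):
    """Return shuffled index batches; the same seed always gives the same order."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


batches = batched_indices(num_samples=100, batch_size=32, seed=0)
print([len(b) for b in batches])  # [32, 32, 32, 4]
```

In real PyTorch code, the analogous step is passing a seeded torch.Generator to the DataLoader through its generator argument.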
Your training loop now looks exactly like any other PyTorch training loop. The underlying complexity of data fetching and versioning has been cleanly abstracted away.
# A standard PyTorch training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
num_epochs = 10  # Example value; choose what suits your task

model = ...      # Your PyTorch model (move it to `device` as well)
optimizer = ...  # Your optimizer
criterion = ...  # Your loss function

print(f"Starting training on dataset: {DATASET_VERSION}")
for epoch in range(num_epochs):
    for images, captions in train_dataloader:
        # Move data to the appropriate device (CPU/GPU)
        images, captions = images.to(device), captions.to(device)

        # --- Your model training logic ---
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, captions)
        loss.backward()
        optimizer.step()
        # --- End of training logic ---

    # Reports the loss of the last batch in the epoch
    print(f"Epoch {epoch+1} complete. Loss: {loss.item()}")
The real beauty of this system? If you need to test a new version of the data—say, image-caption-pairs-v3 with more cleaning—you only need to change one line of code:
DATASET_VERSION = 'image-caption-pairs-v3'
Re-run your script, and your entire pipeline pulls from the new dataset. Because the version is named explicitly in the code, anyone can see exactly which data an experiment used and repeat it.
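One way to keep the link between a model and its data explicit is to write the dataset name next to every checkpoint. The following is a minimal sketch using only the standard library. The manifest layout, the write_run_manifest name, and the file name run_manifest.json are assumptions for illustration, not a Datasets.do feature.

```python
import json
import tempfile
from pathlib import Path


def write_run_manifest(run_dir, dataset_version, seed, hyperparams):
    """Write a small JSON file recording which data and settings produced a run."""
    manifest = {
        "dataset_version": dataset_version,
        "seed": seed,
        "hyperparams": hyperparams,
    }
    path = Path(run_dir) / "run_manifest.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys keeps the file stable across runs, which makes diffs meaningful
    path.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return path


run_dir = tempfile.mkdtemp()  # stand-in for your real output directory
manifest_path = write_run_manifest(
    run_dir,
    dataset_version="image-caption-pairs-v3",
    seed=0,
    hyperparams={"batch_size": 32},
)
print(json.loads(manifest_path.read_text())["dataset_version"])  # image-caption-pairs-v3
```

Anyone who later asks which data a checkpoint was trained on can read the answer from the manifest instead of guessing.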
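By integrating Datasets.do with PyTorch, you gain powerful advantages: every training run names the exact dataset version it used, data is fetched on demand rather than downloaded in bulk, and switching versions is a one-line change.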
Stop wrestling with files and start delivering high-quality, reliable training data.
Ready to build your own reproducible ML pipeline? Visit Datasets.do to get started and manage your data as code.