If you're building AI models, you know the truth: data isn't always neat and tidy. While tabular data fits nicely into CSVs and databases, the world of modern AI is dominated by unstructured data—images, audio clips, video streams, and vast collections of text. Managing this data is often a chaotic process of juggling folders, inconsistent filenames, and separate annotation files, leading to what we call "pipeline debt."
What if you could manage this complex, unstructured data with the same discipline, version control, and clarity you use for your application code? This is the core idea behind treating your datasets as code, a paradigm shift that turns data chaos into a reproducible, scalable workflow.
For many teams, the "dataset" is just a loose collection of files on a shared drive or in a cloud storage bucket. This approach seems simple at first, but it quickly breaks down, introducing significant friction into the MLOps lifecycle.
The "data as code" philosophy solves these problems by abstracting the dataset away from the physical files. Instead of manipulating folders, you interact with your data through a programmatic interface.
This is precisely what we built at Datasets.do. We provide a simple, powerful API to define, version, and manage your AI training data programmatically.
Let's see how this works in practice. Imagine you're building an image captioning model.
First, you define the "shape" of your data. You're not dealing with raw files anymore; you're creating a structured definition for each record in your dataset.
```typescript
import { Dataset } from 'datasets.do';

// Define and register a new dataset schema
const imageCaptions = await Dataset.create({
  name: 'image-caption-pairs-v2',
  description: '1M image-caption pairs for model training.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' }
  }
});
```
Here, we’ve established a contract: every record in this dataset will have an imageUrl and a caption. This schema lives with your dataset, providing a single source of truth for its structure.
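To make the contract concrete, here is a minimal, standalone sketch of the kind of check such a schema implies — plain TypeScript for illustration only, not the Datasets.do internals. The `CaptionRecord` type and `validateRecord` function are hypothetical names invented for this example:

```typescript
// Hypothetical record type mirroring the schema defined above.
type CaptionRecord = { imageUrl?: string; caption?: string; source?: string };

// Sketch of what "required: true" implies: every required field
// must be present and be a string. Returns a list of violations.
function validateRecord(record: CaptionRecord): string[] {
  const errors: string[] = [];
  for (const field of ['imageUrl', 'caption'] as const) {
    if (typeof record[field] !== 'string') {
      errors.push(`missing required field: ${field}`);
    }
  }
  return errors;
}
```

A record with both required fields passes; one missing its caption produces a violation. The point is that the contract is checkable at insert time, rather than discovered mid-training.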
With the schema in place, adding data is as simple as an API call. You can write a simple script to read from your existing file structure, S3 bucket, or database and populate your versioned dataset.
```typescript
// Add new records via the API
await imageCaptions.addRecords([
  { imageUrl: 's3://my-bucket/img-1.jpg', caption: 'A photo of a cat on a couch.' },
  { imageUrl: 's3://my-bucket/img-2.jpg', caption: 'A photo of a boat on the water.' }
]);
```
Notice we're pointing to the image location (imageUrl), not uploading the image itself into a database. Datasets.do decouples the metadata from the storage, allowing you to manage terabytes of data with a lightweight, programmatic interface.
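A populate script like the one described above typically batches its records rather than sending millions of rows in one request. Here is a small, self-contained chunking helper in plain TypeScript (independent of the Datasets.do client) showing that pattern:

```typescript
// Split a large list of records into fixed-size batches, so each
// API request stays bounded no matter how big the source data is.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Inside an async populate script, you might then loop over `chunk(records, 500)` and pass each batch to a call like `addRecords` — the batch size is an arbitrary choice here, tuned to your record size and API limits.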
This is where the magic happens. Let's say your team finishes a round of annotation cleaning. Instead of overwriting the old data, you create a new, immutable version.
```typescript
// Imagine you've run a script to update thousands of records.
// Now, lock it in as a new version for absolute reproducibility.
const newVersion = await imageCaptions.createVersion('v2.1');
```
Now, your training script can target 'image-caption-pairs-v2.1' specifically. Your previous experiment that used v2.0 remains 100% reproducible. You have a complete, auditable history of how your data has evolved—just like a Git history for your code.
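The key property is immutability: once a version is cut, later edits can never change what that label meant. The following is a conceptual sketch of that idea in plain TypeScript — a toy in-memory store, not the Datasets.do implementation:

```typescript
// Toy model of immutable dataset versioning: each version is a
// frozen snapshot of the working set at the moment it was created.
class VersionedDataset<T> {
  private working: T[] = [];
  private versions = new Map<string, readonly T[]>();

  add(record: T): void {
    this.working.push(record);
  }

  // Freeze a copy of the current records under a version label.
  // Future add() calls cannot affect this snapshot.
  createVersion(label: string): readonly T[] {
    const snapshot = Object.freeze([...this.working]);
    this.versions.set(label, snapshot);
    return snapshot;
  }

  get(label: string): readonly T[] | undefined {
    return this.versions.get(label);
  }
}
```

Because the snapshot is copied and frozen, asking for 'v2.0' next year returns exactly what it contained when it was cut — the same guarantee a Git tag gives your code.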
When it's time to train, your script doesn't need complex logic to find and parse files. It simply asks the Datasets.do API for the data it needs.
```python
# In your Python training script
from datasets_do import Client

client = Client(api_key="...")
dataset = client.get_dataset('image-caption-pairs-v2.1')

for record in dataset.stream_records():
    # record = {'imageUrl': '...', 'caption': '...'}
    # ...your training logic here...
    ...
```
This approach ensures that every training run uses the exact same data, specified by a single version string. It’s simple, scalable, and eliminates an entire class of experiment-ruining bugs.
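Under the hood, streaming access usually means pulling records page by page so a million-record dataset never has to fit in memory at once. Here is a generic sketch of that pattern in TypeScript, with a stub `fetchPage` callback standing in for paginated API calls (the names are illustrative, not the Datasets.do client API):

```typescript
// Lazily yield records from a paginated source. fetchPage(offset)
// returns the next page of records, or an empty array when done.
function* streamRecords<T>(fetchPage: (offset: number) => T[]): Generator<T> {
  let offset = 0;
  while (true) {
    const page = fetchPage(offset);
    if (page.length === 0) return;
    yield* page;
    offset += page.length;
  }
}
```

A consumer iterates with an ordinary loop and only ever holds one page of records in memory, which is what makes the same loop work for ten records or ten million.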
Shifting from managing files to managing datasets programmatically is a fundamental step toward mature, reliable MLOps. It transforms your data from a liability—a chaotic collection of assets—into a versioned, traceable, and collaborative part of your development lifecycle.
By treating your AI training data as code, you can:

- Enforce a consistent schema, so every record has the fields your pipeline expects
- Pin experiments to immutable dataset versions, keeping past training runs reproducible
- Keep a complete, auditable history of how your data has evolved
- Manage terabytes of assets through a lightweight metadata API instead of ad-hoc folder conventions
Your models are only as good as the data you train them on. It's time to give your data the structure and discipline it deserves.