If you're building AI models, you know the truth: data isn't always neat and tidy. While tabular data fits nicely into CSVs and databases, the world of modern AI is dominated by unstructured data—images, audio clips, video streams, and vast collections of text. Managing this data is often a chaotic process of juggling folders, inconsistent filenames, and separate annotation files, leading to what we call "pipeline debt."
What if you could manage this complex, unstructured data with the same discipline, version control, and clarity you use for your application code? This is the core idea behind treating your datasets as code, a paradigm shift that turns data chaos into a reproducible, scalable workflow.
For many teams, the "dataset" is just a loose collection of files on a shared drive or in a cloud storage bucket. This approach seems simple at first, but it quickly breaks down, introducing significant friction into the MLOps lifecycle.
The "data as code" philosophy solves these problems by abstracting the dataset away from the physical files. Instead of manipulating folders, you interact with your data through a programmatic interface.
This is precisely what we built at Datasets.do. We provide a simple, powerful API to define, version, and manage your AI training data programmatically.
Let's see how this works in practice. Imagine you're building an image captioning model.
First, you define the "shape" of your data. You're not dealing with raw files anymore; you're creating a structured definition for each record in your dataset.
```typescript
import { Dataset } from 'datasets.do';

// Define and register a new dataset schema
const imageCaptions = await Dataset.create({
  name: 'image-caption-pairs-v2',
  description: '1M image-caption pairs for model training.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' }
  }
});
```
Here, we’ve established a contract: every record in this dataset will have an imageUrl and a caption. This schema lives with your dataset, providing a single source of truth for its structure.
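To make the contract concrete, here is a minimal, standalone sketch of the kind of check such a schema implies — plain TypeScript for illustration only, not the Datasets.do internals. The `CaptionRecord` type and `validateRecord` function are hypothetical names invented for this example:

```typescript
// Hypothetical record type mirroring the schema defined above.
type CaptionRecord = { imageUrl?: string; caption?: string; source?: string };

// Sketch of what "required: true" implies: every required field
// must be present and be a string. Returns a list of violations.
function validateRecord(record: CaptionRecord): string[] {
  const errors: string[] = [];
  for (const field of ['imageUrl', 'caption'] as const) {
    if (typeof record[field] !== 'string') {
      errors.push(`missing required field: ${field}`);
    }
  }
  return errors;
}
```

A record with both required fields passes; one missing its caption produces a violation. The point is that the contract is checkable at insert time, rather than discovered mid-training.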
With the schema in place, adding data is as simple as an API call. You can write a simple script to read from your existing file structure, S3 bucket, or database and populate your versioned dataset.
```typescript
// Add new records via the API
await imageCaptions.addRecords([
  { imageUrl: 's3://my-bucket/img-1.jpg', caption: 'A photo of a cat on a couch.' },
  { imageUrl: 's3://my-bucket/img-2.jpg', caption: 'A photo of a boat on the water.' }
]);
```
Notice we're pointing to the image location (imageUrl), not uploading the image itself into a database. Datasets.do decouples the metadata from the storage, allowing you to manage terabytes of data with a lightweight, programmatic interface.
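A populate script like the one described above typically batches its records rather than sending millions of rows in one request. Here is a small, self-contained chunking helper in plain TypeScript (independent of the Datasets.do client) showing that pattern:

```typescript
// Split a large list of records into fixed-size batches, so each
// API request stays bounded no matter how big the source data is.
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Inside an async populate script, you might then loop over `chunk(records, 500)` and pass each batch to a call like `addRecords` — the batch size is an arbitrary choice here, tuned to your record size and API limits.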
This is where the magic happens. Let's say your team finishes a round of annotation cleaning. Instead of overwriting the old data, you create a new, immutable version.
```typescript
// Imagine you've run a script to update thousands of records.
// Now, lock it in as a new version for absolute reproducibility.
const newVersion = await imageCaptions.createVersion('v2.1');
```
Now, your training script can target 'image-caption-pairs-v2.1' specifically. Your previous experiment that used v2.0 remains 100% reproducible. You have a complete, auditable history of how your data has evolved—just like a Git history for your code.
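The key property is immutability: once a version is cut, later edits can never change what that label meant. The following is a conceptual sketch of that idea in plain TypeScript — a toy in-memory store, not the Datasets.do implementation:

```typescript
// Toy model of immutable dataset versioning: each version is a
// frozen snapshot of the working set at the moment it was created.
class VersionedDataset<T> {
  private working: T[] = [];
  private versions = new Map<string, readonly T[]>();

  add(record: T): void {
    this.working.push(record);
  }

  // Freeze a copy of the current records under a version label.
  // Future add() calls cannot affect this snapshot.
  createVersion(label: string): readonly T[] {
    const snapshot = Object.freeze([...this.working]);
    this.versions.set(label, snapshot);
    return snapshot;
  }

  get(label: string): readonly T[] | undefined {
    return this.versions.get(label);
  }
}
```

Because the snapshot is copied and frozen, asking for 'v2.0' next year returns exactly what it contained when it was cut — the same guarantee a Git tag gives your code.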
When it's time to train, your script doesn't need complex logic to find and parse files. It simply asks the Datasets.do API for the data it needs.
```python
# In your Python training script
from datasets_do import Client

client = Client(api_key="...")
dataset = client.get_dataset('image-caption-pairs-v2.1')

for record in dataset.stream_records():
    # record = {'imageUrl': '...', 'caption': '...'}
    # ...your training logic here...
    ...
```
This approach ensures that every training run uses the exact same data, specified by a single version string. It’s simple, scalable, and eliminates an entire class of experiment-ruining bugs.
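Under the hood, streaming access usually means pulling records page by page so a million-record dataset never has to fit in memory at once. Here is a generic sketch of that pattern in TypeScript, with a stub `fetchPage` callback standing in for paginated API calls (the names are illustrative, not the Datasets.do client API):

```typescript
// Lazily yield records from a paginated source. fetchPage(offset)
// returns the next page of records, or an empty array when done.
function* streamRecords<T>(fetchPage: (offset: number) => T[]): Generator<T> {
  let offset = 0;
  while (true) {
    const page = fetchPage(offset);
    if (page.length === 0) return;
    yield* page;
    offset += page.length;
  }
}
```

A consumer iterates with an ordinary loop and only ever holds one page of records in memory, which is what makes the same loop work for ten records or ten million.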
Shifting from managing files to managing datasets programmatically is a fundamental step toward mature, reliable MLOps. It transforms your data from a liability—a chaotic collection of assets—into a versioned, traceable, and collaborative part of your development lifecycle.
By treating your AI training data as code, you can:

- Enforce a consistent schema, so every record has the fields your pipeline expects
- Pin experiments to immutable dataset versions, keeping past training runs reproducible
- Keep a complete, auditable history of how your data has evolved
- Manage terabytes of assets through a lightweight metadata API instead of ad-hoc folder conventions
Your models are only as good as the data you train them on. It's time to give your data the structure and discipline it deserves.