Managing AI training data can feel chaotic. You start with a handful of CSVs, but soon you're drowning in a sea of versioned folders, inconsistent naming conventions (data_final_v2_revised.zip, anyone?), and a creeping fear that you can't reproduce last month's "best" model. There has to be a better way.
The solution is to treat your data like you treat your application code: with structure, version control, and programmatic access. This is the core principle of data as code, and it’s what we built Datasets.do to enable.
In this guide, we'll walk you through creating your first programmatic dataset. You'll see how a simple, API-driven workflow can replace messy folders and transform your dataset management from a liability into a reliable, reproducible asset.
We're going to create a simple dataset for training an image captioning model. Our dataset will contain pairs of image URLs and their corresponding text captions. By the end, you'll be able to define, populate, and version this dataset entirely through code.
Prerequisites:
- A Datasets.do account and API key
- Node.js installed, with the Datasets.do SDK added to your project (e.g. via npm)
Before you add a single piece of data, you need a blueprint. A schema defines the structure of your records, ensuring every piece of data you add is consistent and valid. Think of it as an interface in TypeScript or a table definition in a database. It's the foundation of high-quality AI training data.
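To make the interface analogy concrete, here is the record structure we're about to register, expressed as a plain TypeScript interface. The interface name is ours for illustration; Datasets.do doesn't require you to write this.

```typescript
// The dataset's record shape as a TypeScript interface.
// Required schema fields map to required properties;
// optional fields map to optional (?) properties.
interface ImageCaptionRecord {
  imageUrl: string;   // required
  caption: string;    // required
  source?: string;    // optional provenance tag
}

// Every record you add must satisfy this shape:
const example: ImageCaptionRecord = {
  imageUrl: 'https://cdn.do/img-1.jpg',
  caption: 'A close-up photo of a cat relaxing on a couch.'
};
```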
Let's use the Datasets.do SDK to create a new dataset with a clearly defined schema.
```typescript
import { Dataset } from 'datasets.do';

// Initialize with your API key
// Dataset.init('YOUR_API_KEY');

// Define and register a new dataset schema
const imageCaptions = await Dataset.create({
  name: 'image-caption-pairs-v1',
  description: 'Initial set of 1M image-caption pairs for model training.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' }
  }
});

console.log(`Dataset "${imageCaptions.name}" created successfully!`);
```
Let's break this down:
- name: a unique, versioned identifier for the dataset. Baking the version into the name (-v1) lets later versions live alongside it.
- description: human-readable context for anyone browsing your datasets.
- schema: the field definitions. imageUrl and caption are required strings; source is an optional provenance tag.
You've just registered your dataset's structure with the Datasets.do API. You now have a stable endpoint and a validated structure for adding records.
With the schema in place, it’s time to add some data. Forget manual uploads or complex ETL scripts. You can add records directly via the API. This is perfect for integrating into your existing data processing pipelines, web scrapers, or annotation workflows.
```typescript
// Assuming 'imageCaptions' is the dataset object from Step 1
await imageCaptions.addRecords([
  {
    imageUrl: 'https://cdn.do/img-1.jpg',
    caption: 'A close-up photo of a cat relaxing on a couch.',
    source: 'internal-collection-a'
  },
  {
    imageUrl: 'https://cdn.do/img-2.jpg',
    caption: 'A wide shot of a sailboat on calm water at sunset.'
  }
]);

console.log("Successfully added new records!");
```
Here, we're calling the .addRecords() method on our dataset instance. Datasets.do validates these records against your schema before ingesting them. If a record is missing a required field (like caption), the API call will fail, protecting your dataset from inconsistent data.
This API-first approach is key to effective dataset management. You're not just storing data; you're programmatically controlling and validating it at the point of entry.
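The validation itself happens server-side, but conceptually it behaves like the sketch below. The validateRecord helper is ours, not part of the SDK; it shows the kind of required-field and type checking that happens at the point of entry.

```typescript
// A conceptual sketch of schema validation at the point of entry.
// This mirrors what the service does server-side; it is not SDK code.
type FieldSpec = { type: 'string' | 'number'; required?: boolean };
type Schema = Record<string, FieldSpec>;

function validateRecord(schema: Schema, record: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const [field, spec] of Object.entries(schema)) {
    const value = record[field];
    if (value === undefined) {
      if (spec.required) errors.push(`Missing required field: ${field}`);
      continue; // optional fields may simply be absent
    }
    if (typeof value !== spec.type) {
      errors.push(`Field ${field} should be a ${spec.type}`);
    }
  }
  return errors;
}

const schema: Schema = {
  imageUrl: { type: 'string', required: true },
  caption: { type: 'string', required: true },
  source: { type: 'string' }
};

// A record missing its required caption is rejected:
const errors = validateRecord(schema, { imageUrl: 'https://cdn.do/img-3.jpg' });
console.log(errors); // ['Missing required field: caption']
```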
Now for the most important part: using the data. To train a model, your script needs to access the dataset. Instead of pointing to a local file path, you'll fetch the data from the same programmatic interface.
```typescript
// In your separate model training script...
import { Dataset } from 'datasets.do';

async function trainModel() {
  // Fetch the dataset by its versioned name
  const trainingData = await Dataset.get('image-caption-pairs-v1');
  console.log(`Starting training with dataset: ${trainingData.name}`);

  // Efficiently stream records to handle large datasets
  for await (const record of trainingData.streamRecords()) {
    // Your training logic here:
    // e.g., download record.imageUrl, preprocess it with record.caption,
    // and feed it into your model.
    console.log(`Processing record: ${record.caption}`);
  }

  console.log("Training complete.");
}

trainModel();
```
Notice how clean this is. Your training script asks for image-caption-pairs-v1 by name. It doesn't need to know where the data is stored or how it's formatted on disk. By using streamRecords(), you can process terabytes of data without needing to load the entire dataset into memory, making your workflow scalable.
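Because streamRecords() yields records one at a time as an async iterable, common training patterns compose naturally on top of it. For example, most training loops consume data in batches. The batched helper below is ours, not part of the SDK, and it's exercised with a stand-in async generator so the sketch is self-contained; in a real script you would pass trainingData.streamRecords() instead.

```typescript
// Group an async stream of records into fixed-size batches.
// Works on any async iterable, including streamRecords().
async function* batched<T>(stream: AsyncIterable<T>, size: number): AsyncGenerator<T[]> {
  let batch: T[] = [];
  for await (const item of stream) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}

// Stand-in for trainingData.streamRecords(), for illustration only:
async function* fakeRecords() {
  for (let i = 1; i <= 5; i++) {
    yield { imageUrl: `https://cdn.do/img-${i}.jpg`, caption: `caption ${i}` };
  }
}

(async () => {
  for await (const batch of batched(fakeRecords(), 2)) {
    console.log(`Training step on ${batch.length} records`);
  }
  // Logs batches of 2, 2, and 1 records.
})();
```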
Your model is performing well, but you realize you need to add a quality score to each image-caption pair. How do you do this without breaking the model you just trained? You create a new version.
Data versioning allows you to evolve your dataset while preserving its history, ensuring perfect reproducibility.
```typescript
// In a data curation or migration script...
import { Dataset } from 'datasets.do';

// Create a new version with an updated schema
const imageCaptionsV2 = await Dataset.create({
  name: 'image-caption-pairs-v2',
  description: 'Adds a quality score to the image-caption pairs.',
  schema: {
    imageUrl: { type: 'string', required: true },
    caption: { type: 'string', required: true },
    source: { type: 'string' },
    qualityScore: { type: 'number', required: true } // Our new field!
  }
});

// You can now write a script to migrate data from v1 to v2,
// adding the new `qualityScore` field along the way.
```
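That migration script might look like the sketch below. The streamRecords() and addRecords() calls mirror the SDK usage earlier in this guide; the scoreCaption() heuristic is entirely ours, a placeholder for however you actually compute quality (a model, human review, etc.), and the declare lines are stand-ins for values the real script gets from the SDK.

```typescript
// Stand-in declarations so this sketch type-checks on its own;
// in a real script these come from the SDK and the snippet above.
declare const Dataset: {
  get(name: string): Promise<{
    streamRecords(): AsyncIterable<{ imageUrl: string; caption: string; source?: string }>;
  }>;
};
declare const imageCaptionsV2: {
  addRecords(records: object[]): Promise<void>;
};

// Hypothetical placeholder: score a caption by how descriptive it is.
// Replace with a real quality model or human review scores.
function scoreCaption(caption: string): number {
  const words = caption.trim().split(/\s+/).length;
  return Math.min(1, words / 10); // crude: longer captions score higher, capped at 1
}

// Stream every v1 record and re-ingest it into v2 with the new field.
async function migrate() {
  const v1 = await Dataset.get('image-caption-pairs-v1');
  for await (const record of v1.streamRecords()) {
    await imageCaptionsV2.addRecords([
      { ...record, qualityScore: scoreCaption(record.caption) }
    ]);
  }
}
```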
Now, your old training script can continue to pull image-caption-pairs-v1 and work perfectly. Your new experiments can target image-caption-pairs-v2 to take advantage of the quality score. This parallel, non-destructive approach is a cornerstone of the data as code philosophy. It makes your experiments traceable, safe, and easy to reproduce.
You've just seen how to define, populate, access, and version an AI training dataset entirely through code. The chaos of file management is replaced by the clarity and control of an API.
By treating your datasets as code with Datasets.do, you gain:
- Reproducibility: versioned dataset names mean any experiment can be rerun against exactly the data it saw.
- Data quality: schema validation rejects inconsistent records at the point of entry.
- Scalability: streaming access handles datasets far larger than memory.
- Integration: an API-first workflow that plugs into your existing pipelines and training scripts.
Ready to stop wrestling with files and supercharge your AI models? Get started with Datasets.do today and build your next great model on a foundation of reliable, versioned data.