
# Data preparation

There are two ways to prepare data for training with distil labs:

- **Trace processing** — use this if you already have production logs from an LLM-powered application. Upload your traces and our pipeline handles the rest.
- **Minimal dataset** — use this if you don’t have production traces but can provide a small set of labeled examples for your task.

## Trace processing

If you have production traces (logs of real interactions with an LLM), you can upload them and our pipeline will automatically filter, relabel, and convert them into training and test data. This is the fastest way to get started if you already have an LLM-powered application in production.

Your traces directory needs three files:

| File | Format | Description |
| --- | --- | --- |
| `traces.jsonl` | JSONL | Production traces in the OpenAI messages format |
| `job_description.json` | JSON | Task description defining what the model should do |
| `config.yaml` | YAML | Training config with `trace_processing` parameters |
```bash
distil model upload-traces <model-id> --data ./traces
```
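Each line of `traces.jsonl` is one JSON object in the OpenAI messages format. A minimal sketch of a single trace follows; the roles and content shown are illustrative, not a required schema:

```json
{"messages": [{"role": "system", "content": "You are a support assistant."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "Go to Settings > Security and click Reset password."}]}
```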

Learn more about trace processing →

## Minimal dataset

If you don’t have production traces, prepare a small structured dataset with labeled examples. You only need a few dozen high-quality examples that capture the essence of your task.

Your data directory needs the following files:

| File | Format | Required | Description |
| --- | --- | --- | --- |
| `job_description.json` | JSON | Yes | Task description defining what the model should do |
| `train.csv` | CSV or JSONL | Yes | 20+ labeled (question, answer) pairs |
| `test.csv` | CSV or JSONL | Yes | Held-out evaluation set |
| `config.yaml` | YAML | Yes | Training hyperparameters |
| `unstructured.csv` | CSV or JSONL | No | Domain-relevant text for synthetic data generation |
```bash
distil model upload-data <model-id> --data ./data
```
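As a sketch, `train.csv` holds one labeled pair per row. The `question` and `answer` column names here are an assumption based on the (question, answer) pairing above; check the task-specific guides linked below for the exact headers your task type expects:

```csv
question,answer
"What is the return window for online orders?","30 days from delivery"
"Do you ship internationally?","Yes, to over 40 countries"
```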

For detailed formatting and structure requirements per task type, refer to: