Datasets

Datasets are files used during fine-tuning that contain example input-output pairs that you want your fine-tuned model to learn. The examples in your dataset should mimic how you expect to use the model in production.

For example, if you want to train your model to write SQL in response to natural language inputs, your dataset should include many examples of natural language inputs paired with SQL outputs.

Formatting datasets

Datasets must be JSONL files that contain examples in either chat-ml or prompt-completion format.

JSONL (short for JSON lines) is a file format where each line contains a single JSON item.

Chat-ml format is a common schema typically used for conversational use cases. Each example is an array of messages, where each message has a role ("user", "assistant", or "system") and a content field.

Here's an example of a JSONL file that uses chat-ml format for a silly chatbot that sneezes.

{"messages": [{"role": "system", "content": "You are a factual chatbot that sneezes in between responses."}, {"role": "user", "content": "What is the largest ocean?"}, {"role": "assistant", "content": "The Pacific Ocean is the... achoo!... largest ocean, covering more than 30% of the Earth's surface."}]}
{"messages": [{"role": "system", "content": "You are a factual chatbot that sneezes in between responses."}, {"role": "user", "content": "What causes rainbows?"}, {"role": "assistant", "content": "Rainbows are caused by the... achoo!... refraction and dispersion of light in water droplets, leading to a spectrum of light in the sky."}]}
{"messages": [{"role": "system", "content": "You are a factual chatbot that sneezes in between responses."}, {"role": "user", "content": "How many planets are in the Solar System?"}, {"role": "assistant", "content": "There are eight planets in the... achoo!... Solar System since Pluto was reclassified as a dwarf planet."}]}

Prompt-completion is another common format where each sample contains a prompt and a completion field. Here is the previous example written in prompt-completion format:

{"prompt": "You are a factual chatbot that sneezes in between responses. A user asks, 'What is the largest ocean?'", "completion": "The Pacific Ocean is the... achoo!... largest ocean, covering more than 30% of the Earth's surface."}
{"prompt": "You are a factual chatbot that sneezes in between responses. A user asks, 'What causes rainbows?'","completion": "Rainbows are caused by the... achoo!... refraction and dispersion of light in water droplets, leading to a spectrum of light in the sky."}
{"prompt": "You are a factual chatbot that sneezes in between responses. A user asks, 'How many planets are in the Solar System?'", "completion": "There are eight planets in the... achoo!... Solar System since Pluto was reclassified as a dwarf planet."}

Managing datasets

You can view all of your datasets on the Data page by clicking "Data" in the left-hand sidebar. There are a number of features related to datasets:

Uploading a dataset

You can upload a dataset from the Data page by clicking the Upload button, or from the Create Fine-tune page.

Automatic validation

We automatically validate your dataset to make sure it is in the correct format. A dataset status of "Ready" indicates that your dataset has passed validation and is ready for fine-tuning. "Processing" means that validation is still in progress, and "Failed" means that your dataset was not in the correct format. Validation typically happens in a few seconds.

If your dataset status is "Failed" but your syntax looks correct, check for special characters or encoding issues that could cause parsing errors.
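A quick local check before uploading can surface these problems early. The sketch below is not Forefront's actual validator, just a rough Python check that each line is valid UTF-8, parses as JSON, and contains either a "messages" array or a "prompt"/"completion" pair:

import json

def check_jsonl(path):
    with open(path, "rb") as f:
        for i, raw in enumerate(f, start=1):
            try:
                line = raw.decode("utf-8")   # surfaces encoding issues
            except UnicodeDecodeError as e:
                print(f"line {i}: encoding error: {e}")
                continue
            if not line.strip():
                continue                     # skip blank lines
            try:
                record = json.loads(line)    # surfaces JSON syntax errors
            except json.JSONDecodeError as e:
                print(f"line {i}: invalid JSON: {e}")
                continue
            has_chat_ml = "messages" in record
            has_prompt_completion = "prompt" in record and "completion" in record
            if not (has_chat_ml or has_prompt_completion):
                print(f"line {i}: missing 'messages' or 'prompt'/'completion' keys")

check_jsonl("sneezy_chatbot.jsonl")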

Dataset inspection

You can easily view the samples in your dataset through the UI by clicking the Inspect Dataset button in the top-right corner of the page. You can use the arrows to click through examples or the keyboard shortcuts: j for left, k for right.

Download a dataset

To download your dataset, click the options drop-down in the top-right corner of the page and click the Download button. The download will start automatically in your browser.

Delete a dataset

To delete a dataset, click the options drop-down in the top-right corner and click Delete Dataset.

Dataset analytics

As part of validation, Forefront calculates token analytics for your dataset samples at the aggregate and sample level. These can be easily viewed by clicking into a dataset.

Aggregate analytics include total tokens and token counts by role. If using chat-ml syntax, these are user tokens, assistant tokens, and system tokens; if using prompt-completion syntax, these are prompt tokens and completion tokens.
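The exact tokenizer depends on the base model you fine-tune, but you can approximate these aggregate counts yourself. The sketch below assumes a chat-ml dataset and uses the open-source tiktoken library as a stand-in tokenizer:

import json
from collections import Counter

import tiktoken  # pip install tiktoken; a stand-in for the base model's tokenizer

enc = tiktoken.get_encoding("cl100k_base")
totals = Counter()

with open("sneezy_chatbot.jsonl", encoding="utf-8") as f:
    for line in f:
        for message in json.loads(line)["messages"]:
            totals[message["role"]] += len(enc.encode(message["content"]))

print(dict(totals), "total:", sum(totals.values()))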

Sample-level analytics provide a bird's-eye view of the diversity of samples in your dataset.

Tokens per sample

This is an area chart showing the distribution of token counts in your dataset. In the example above, we can see that the dataset contains 20 samples where "user" messages contain between 300 and 400 tokens. We can also see that, overall, the user messages in the dataset tend to skew shorter, while the assistant messages are much longer and more varied in length. The blue line represents the system messages, which are all very short (less than 100 tokens).

Unique tokens per sample

This chart is similar to the previous one, and shows the count of unique tokens in each sample. Low unique token counts could indicate short examples, repetitive language, or both. This is not necessarily good or bad; it depends on the use case.

Token type ratios

The token type ratio (TTR) is a measure of lexical diversity in your examples. It is defined as the number of unique tokens divided by the total tokens in each sample. Again, a low or high TTR is not necessarily good or bad; it depends on your use case.
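To get a feel for how these sample-level metrics are derived, here is a rough sketch (again using tiktoken as a stand-in tokenizer) that prints total tokens, unique tokens, and TTR for each sample in a chat-ml dataset:

import json

import tiktoken  # stand-in tokenizer; the real counts depend on the base model

enc = tiktoken.get_encoding("cl100k_base")

with open("sneezy_chatbot.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        text = " ".join(m["content"] for m in json.loads(line)["messages"])
        tokens = enc.encode(text)
        unique = len(set(tokens))
        ttr = unique / len(tokens) if tokens else 0.0
        print(f"sample {i}: {len(tokens)} tokens, {unique} unique, TTR={ttr:.2f}")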

Note on interpreting charts

The chart analytics should be used as a tool for understanding model performance. Consider a scenario where a fine-tuned model was trained on the dataset above.

Most samples in this dataset contain user messages between 0-500 tokens. We might expect the model to perform well on a request with similar user token counts because it has "seen" this data during training.

Similarly, we might expect the model to perform poorly on requests where user token counts are very high, because the model has not seen data like this during training. If this is the case, a simple way to improve performance would be to perform additional fine-tuning using examples with high user token counts.
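For instance, you could assemble that follow-up dataset by filtering your raw examples on user token count before uploading. A rough sketch, assuming a chat-ml dataset, an illustrative 500-token cut-off, and tiktoken as the stand-in tokenizer:

import json

import tiktoken  # stand-in tokenizer

enc = tiktoken.get_encoding("cl100k_base")
THRESHOLD = 500  # illustrative cut-off for "high" user token counts

with open("raw_examples.jsonl", encoding="utf-8") as src, \
     open("long_input_examples.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        record = json.loads(line)
        user_tokens = sum(len(enc.encode(m["content"]))
                          for m in record["messages"] if m["role"] == "user")
        if user_tokens >= THRESHOLD:
            dst.write(json.dumps(record) + "\n")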

In general, you can expect a model to perform well on examples it has seen before, and poorly on examples it has not seen. The charts are useful for helping to identify what your overall data looks like.
