data-simulator is a lightweight Python library for generating synthetic datasets from your own corpus. It is intended for testing, evaluating, or fine-tuning LLM applications.
Real documents contain a mix of useful and irrelevant content. When generating synthetic data, this leads to:
- Queries that real users would never ask
- Test sets that don't reflect actual usage
- Wasted effort optimizing for the wrong things
Data Simulator filters out low-quality content first, then generates realistic queries and answers that match how your system will actually be used.
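To illustrate the filter-first idea, here is a minimal sketch of dropping low-value chunks before any queries are generated. It is not the library's implementation: the `Chunk` class, `looks_informative`, and `filter_chunks` are hypothetical names, and the word-count heuristic is only a stand-in for whatever quality check Data Simulator actually applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    """Hypothetical container for one piece of a document."""
    id: str
    text: str


def looks_informative(chunk: Chunk, min_words: int = 8, min_alpha_ratio: float = 0.6) -> bool:
    """Assumed heuristic: keep chunks that are long enough and mostly made of words.

    This stands in for a real quality filter; it drops page footers,
    bare number tables, and similar residue.
    """
    words = chunk.text.split()
    if len(words) < min_words:
        return False
    alphabetic = sum(1 for w in words if any(c.isalpha() for c in w))
    return alphabetic / len(words) >= min_alpha_ratio


def filter_chunks(chunks: List[Chunk]) -> List[Chunk]:
    """Return only the chunks worth generating queries from."""
    return [c for c in chunks if looks_informative(c)]


chunks = [
    Chunk("chunk_1", "Page 12 of 98"),
    Chunk(
        "chunk_2",
        "Nike reported annual revenue of $44.5 billion for fiscal year 2022, "
        "an increase of 5% compared to the previous year.",
    ),
    Chunk("chunk_3", "1 2 3 4 5 6 7 8 9 10 11 12"),
]

kept = filter_chunks(chunks)
print([c.id for c in kept])  # only chunk_2 survives the filter
```

Only the chunks that pass the filter would then be handed to the query and answer generation step, so the page footer and the bare number table never produce test questions.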
Install from PyPI:
pip install llm-data-simulator
Or install it locally:
git clone https://github.com/langwatch/data-simulator.git
cd data-simulator
pip install -e .
Run the built-in test script:
python test.py

Example usage:
import os

from dotenv import load_dotenv

from data_simulator import DataSimulator
from data_simulator.utils import display_results

# Load environment variables (including OPENAI_API_KEY) from a local .env file
load_dotenv()
generator = DataSimulator(api_key=os.getenv("OPENAI_API_KEY"))

# context describes the assistant's role; example_queries is a
# newline-separated list of questions that real users might ask
results = generator.generate_from_docs(
    file_paths=["test_data/nike_10k.pdf"],
    context="You're a financial support assistant for Nike, helping a financial analyst decide whether to invest in the stock.",
    example_queries="how much revenue did nike make last year\nwhat risks does nike face\nwhat are nike's top 3 priorities"
)
display_results(results)

Example output:
{
  "id": "chunk_42",
  "document": "Nike reported annual revenue of $44.5 billion for fiscal year 2022, an increase of 5% compared to the previous year.",
  "query": "What was Nike's revenue growth in 2022?",
  "answer": "Nike's revenue grew by 5% in fiscal year 2022, reaching $44.5 billion."
}

The project follows a modular, object-oriented design:
- simulator.py: Contains the main DataSimulator class that orchestrates the data generation process
- llm.py: Houses the LLMProcessor class that handles all LLM-related operations
- document_processor.py: Provides the DocumentProcessor class for loading and chunking documents
- prompts.py: Stores all prompt templates used for LLM interactions
- utils.py: Contains utility functions like display_results for formatting output
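As an illustration of the chunking step that document_processor.py is described as handling, the sketch below splits text into overlapping word windows. It is an assumption about how such a step could work, not the DocumentProcessor code: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    chunk boundary still appears whole in at least one chunk.
    (Hypothetical helper; not part of the data_simulator API.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for piece in chunk_text(sample, chunk_size=4, overlap=1):
    print(piece)
```

Overlap trades some duplicated text for context continuity; larger chunks give the model more context per query but make the resulting question less specific to one passage.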
MIT License