We build bespoke evaluation benchmarks for leading AI labs.
Six dataset primitives that compose the modern post-training stack. Mix and match — or hand us the failure mode and we'll figure out the right blend.
RLHF & preference data
Preference pairs and reward signals labelled by domain experts — not crowdsourced annotators.
Reinforcement Learning from Human Feedback only works when the humans giving feedback know what "good" actually looks like in the domain. Our annotators rank, rate, and arbitrate model outputs against project-specific rubrics designed with your research team.
Every preference pair carries per-rubric scores, an arbitration trace when reviewers disagreed, and reviewer reliability metadata so your training pipeline can weight by signal strength.
- ─Pairwise and listwise preference labelling
- ─Process reward modeling traces
- ─Rubric design with your research leads
- ─Inter-rater reliability + arbitration trails
Instruction-tuning data
Diverse, high-quality prompt-response pairs across domains, written by working specialists.
Foundation models trained on expert reasoning consistently outperform those trained on outputs alone. Our SFT data captures the intermediate steps a senior practitioner takes — the disambiguating questions, the working, the citations.
We deliver in your preferred schema: messages-array, reasoning-trace, or tool-augmented, with full provenance per example.
- ─Single-turn and multi-turn dialogues
- ─Reasoning-trace and chain-of-thought formats
- ─Tool-augmented and function-calling examples
- ─Refusal and safety behaviors when needed
Code training data
Annotated code, debugging traces, and graded programming challenges built by senior engineers.
Code data is bottlenecked on the depth of explanation, not the volume of snippets. Our Code Data Specialists write clean, well-documented solutions across languages and annotate them with the reasoning behind each design choice.
From algorithmic interview-style problems to multi-file refactors and code-review scenarios — every example is rubric-graded for correctness, idiomaticity, and clarity.
- ─Algorithmic problems with test cases and edge cases
- ─Code-review and debugging examples
- ─Multi-language coverage (Python, JS, and beyond)
- ─Performance benchmarks where it matters
STEM domain annotation
Expert-level Q&A and multi-step problem solving across math, physics, chemistry, biology, and CS.
Our Domain Expert Annotators are MS/PhD-level subject matter experts who create and validate specialized training data in their field. Multi-step problems come with detailed explanations and verifiable solutions, not just final answers.
We collaborate with your team to define what "correct reasoning" looks like in each domain — and we hold every example to that bar.
- ─Math olympiad-style and competition problems
- ─Physics, chemistry, biology Q&A with verifiable answers
- ─Multi-step problem-solving with detailed traces
- ─Technical-accuracy review of existing datasets
Safety & alignment data
Red-teaming prompts, refusal training data, and alignment benchmarks for testing safety boundaries.
Safety datasets need people who think adversarially about misuse without losing sight of legitimate use. Our Safety & Alignment Specialists develop red-team prompts, flag problematic outputs, and design refusal behaviors that don't over-refuse.
We collaborate with researchers on alignment methodology, then turn the methodology into datasets you can train and evaluate against.
- ─Adversarial prompts and red-teaming sets
- ─Harmful-content evaluation and grading
- ─Refusal training data calibrated against helpfulness
- ─Alignment-benchmark design
Custom evaluation benchmarks
Closed benchmarks built around the specific failure modes your model still exhibits.
We build private evaluation suites tailored to your roadmap. You give us a model and a definition of "good" — we design the eval set, the grading framework, and a versioning scheme so you can compare runs over time.
Where useful, we publish open variants on /benchmarks. The proprietary version stays yours.
- ─Task design with explicit acceptance criteria
- ─Deterministic graders + LLM-judge ensembles
- ─Versioned eval sets that travel with your models
- ─Optional public-facing benchmark releases