Awesome LLM Training Data

A curated list of tools, papers, datasets, and best practices for LLM training data quality, annotation, preference data, synthetic data, data governance, and evaluation.

Built for practitioners who need to choose, audit, or govern LLM datasets. This is not a general AI bookmark dump.

Use this repo to:

Find public tools for cleaning, deduplication, inspection, annotation, preference data, synthetic data, and RAG evaluation.
Compare dataset-quality and governance references before building internal data pipelines.
Locate public financial-domain LLM benchmarks and datasets without mixing in private or proprietary material.

English first. Complete Chinese version: README.zh-CN.md.

Disclaimer: This repository does not contain private company data, real user data, or proprietary workflows.

Why Training Data Quality Deserves First-Class Engineering

LLM behavior is shaped as much by data decisions as by model architecture. Data collection, filtering, annotation, preference modeling, evaluation design, privacy review, and governance all create production risk when treated as ad hoc work. Training data quality deserves first-class engineering because it determines reproducibility, safety, domain reliability, evaluation validity, and whether teams can explain what changed when model behavior changes.

Scope

This list focuses on public resources that help teams engineer, evaluate, document, or govern LLM training and evaluation data. It does not rank vendors, recommend private datasets, publish internal playbooks, or treat popularity as proof of quality.

Financial and Regulated-Domain Note

Financial-domain resources should be public, reproducible, and useful for evaluation or data engineering. This list avoids investment advice, trading signals, private business data, and claims that a benchmark or dataset proves production readiness.

Quality Bar

No fake links.
No private or proprietary resources.
No low-quality SEO content.
Prefer active and reproducible resources.
Prefer resources useful to real LLM data teams.
Prefer primary sources, official repositories, dataset cards, papers, and standards.
Include a resource only when its relevance to LLM data engineering is clear.
Include access, license, or usage constraints in the description when they are material.

The repository also runs a lightweight resource audit to check resource format, allowed tags, placeholder links, duplicate-link risk, and English/Chinese resource-count consistency.
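As a rough illustration of what such an audit might check, here is a minimal sketch. It is not the repository's actual audit tool: the entry pattern, the `ALLOWED_TAGS` set (collected from the tags visible in this list), the `PLACEHOLDER_MARKERS` strings, and the `audit_entries` function are all assumptions made for this example, and duplicate names stand in for duplicate links.

```python
import re

# Assumption: the allowed tags are the ones that appear in this list.
ALLOWED_TAGS = {"benchmark", "tool", "dataset", "paper", "report", "platform", "governance"}

# Assumption: strings that suggest an unfinished or placeholder entry.
PLACEHOLDER_MARKERS = ("example.com", "TODO")

# Assumed entry shape: "Name - Tag: [tag] - Description"
ENTRY_RE = re.compile(r"^(?P<name>.+?) - Tag: \[(?P<tag>[a-z-]+)\] - (?P<desc>.+)$")


def audit_entries(lines):
    """Return (line_number, problem) tuples for entries that fail a check."""
    problems = []
    first_seen = {}  # lowercased name -> line number of first occurrence
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if any(marker in text for marker in PLACEHOLDER_MARKERS):
            problems.append((number, "placeholder text"))
        match = ENTRY_RE.match(text)
        if match is None:
            problems.append((number, "bad format"))
            continue
        if match["tag"] not in ALLOWED_TAGS:
            problems.append((number, f"unknown tag: {match['tag']}"))
        key = match["name"].lower()
        if key in first_seen:
            # Duplicate names are used here as a proxy for duplicate-link risk.
            problems.append((number, f"duplicate name: {match['name']}"))
        else:
            first_seen[key] = number
    return problems
```

Running a check like this per section, rather than over the whole file, would avoid flagging resources that are intentionally listed under more than one heading.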

Contents

Scope
Financial and Regulated-Domain Note
Start Here
Training Data Quality
Data Cleaning and Deduplication
Dataset Inspection Tools
Annotation Platforms
Annotation Quality and Agreement
Human Preference Data
RLHF / DPO / RLAIF Data
Synthetic Data
RAG Evaluation Data
Agent Evaluation and Trajectory Data
Financial-domain LLM Data
Practitioner Guides
Data Governance
Privacy and Compliance
Papers
Open-source Tools
Reports and Playbooks
Contributing
License

Start Here

DataPerf - Tag: [benchmark] - MLCommons benchmark suite focused on measuring the impact of data quality and data-centric ML work.
DataComp-LM - Tag: [benchmark] - A data-centric benchmark for studying how language-model pretraining data choices affect downstream results.
Hugging Face Datasets - Tag: [tool] - Core library and documentation for loading, processing, sharing, and versioning datasets.
The Pile - Tag: [paper] - Paper describing a large open text corpus and practical dataset composition decisions.
Data-Centric AI Resources - Tag: [report] - A curated list of data-centric AI papers, tools, and benchmarks.

Training Data Quality

Data-Juicer - Tag: [tool] - Open-source toolkit for analyzing, filtering, and processing large multimodal and text datasets.
FineWeb - Tag: [dataset] - A large open web dataset with transparent processing choices for LLM pretraining research.
FineWeb-Edu - Tag: [dataset] - A filtered educational subset of FineWeb useful for studying quality-oriented corpus selection.
Dolma - Tag: [dataset] - An open corpus from AI2 designed to support reproducible language-model pretraining research.
RefinedWeb - Tag: [paper] - Paper describing web-scale filtering and corpus construction choices behind the RefinedWeb dataset.
DataComp-LM Paper - Tag: [paper] - Paper framing LLM pretraining data selection as a controlled benchmark problem.

Data Cleaning and Deduplication

DataTrove - Tag: [tool] - Hugging Face processing library for large-scale web data extraction, filtering, and deduplication.
text-dedup - Tag: [tool] - Toolkit for exact, near, and semantic deduplication of text datasets.
datasketch - Tag: [tool] - Python library for MinHash, LSH, and other probabilistic data structures often used in near-dedup pipelines.
Trafilatura - Tag: [tool] - Web text extraction library useful for turning HTML pages into cleaner text before dataset filtering.
jusText - Tag: [tool] - Boilerplate-removal library for extracting main textual content from web pages.
tiktoken - Tag: [tool] - Fast tokenizer library useful for estimating token distributions, truncation behavior, and corpus size.
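To make the near-dedup idea behind several of these tools concrete, the following is a minimal MinHash sketch using only the standard library. It does not use the API of datasketch or text-dedup; the function names, the 5-character shingle size, and the 64 hash functions are illustrative assumptions, and a production pipeline would add LSH banding instead of comparing every pair of signatures.

```python
import hashlib


def shingles(text, k=5):
    """Return the set of character k-grams of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= k:
        return {normalized}
    return {normalized[i:i + k] for i in range(len(normalized) - k + 1)}


def _hash64(item, seed):
    """64-bit hash of a string, salted with the seed to emulate a family of hash functions."""
    digest = hashlib.blake2b(
        item.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_perm=64):
    """For each seeded hash function, keep the minimum hash value over all items."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard similarity."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "The quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"
similarity = estimated_jaccard(
    minhash_signature(shingles(doc_a)), minhash_signature(shingles(doc_b))
)
```

Here `similarity` should be high because the two documents differ only in their last word; the threshold at which a pair counts as a near-duplicate is a pipeline decision, not something this sketch prescribes.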

Dataset Inspection Tools

Lilac - Tag: [tool] - Dataset exploration tool for clustering, searching, labeling, and inspecting large text datasets.
Renumics Spotlight - Tag: [tool] - Interactive tool for exploring embeddings, metadata, labels, and dataset slices.
FiftyOne - Tag: [tool] - Dataset visualization and curation platform especially useful for multimodal and vision-language data.
Cleanlab - Tag: [tool] - Library for finding label issues, outliers, and data quality problems in ML datasets.
whylogs - Tag: [tool] - Data profiling library for tracking dataset statistics and drift over time.
Evidently - Tag: [tool] - Open-source evaluation and monitoring toolkit for data and model quality reports.

Annotation Platforms

Label Studio - Tag: [platform] - Open-source data labeling platform supporting text, image, audio, and multimodal workflows.
Argilla - Tag: [platform] - Open-source platform for human and LLM feedback workflows, dataset curation, and preference data.
Doccano - Tag: [platform] - Open-source annotation tool for text classification, sequence labeling, and sequence-to-sequence tasks.
INCEpTION - Tag: [platform] - Semantic annotation platform with support for knowledge-oriented and NLP annotation projects.
Label Sleuth - Tag: [platform] - Open-source no-code text classification labeling tool with active learning workflows.

Annotation Quality and Agreement

Cleanlab - Tag: [tool] - Useful for surfacing likely label errors and prioritizing review work in annotated datasets.
NLTK Agreement - Tag: [tool] - NLTK module for calculating inter-annotator agreement measures.
scikit-learn Cohen Kappa - Tag: [tool] - Reference implementation for Cohen's kappa, a common pairwise agreement metric.
statsmodels Fleiss Kappa - Tag: [tool] - Implementation of Fleiss' kappa for agreement across multiple annotators.
fast-krippendorff - Tag: [tool] - Fast implementation of Krippendorff's alpha for measuring annotation reliability.
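For readers new to agreement metrics, here is a small from-scratch sketch of Cohen's kappa for two annotators, written with the standard library only. It is meant to show the arithmetic (observed agreement corrected for chance agreement), not to replace the library implementations listed above; the function name and the sample labels are made up for illustration.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)


annotator_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
annotator_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohen_kappa(annotator_a, annotator_b)  # observed 0.8, expected 0.5, kappa about 0.6
```

In this example the annotators agree on 8 of 10 items, but half of that agreement would be expected by chance given balanced labels, which is why kappa is about 0.6 rather than 0.8.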

Human Preference Data

Anthropic HH-RLHF - Tag: [dataset] - Human preference dataset for helpful and harmless assistant behavior research.
OpenAI summarize_from_feedback - Tag: [dataset] - Human feedback data for training and evaluating summarization preference models.
OpenAI WebGPT Comparisons - Tag: [dataset] - Comparison data collected for web-browsing question-answering model research.
Stanford Human Preferences - Tag: [dataset] - Preference dataset built from naturally occurring Reddit question-answer interactions.
Chatbot Arena Conversations - Tag: [dataset] - Public conversation data from Chatbot Arena useful for studying comparative human judgments.
RewardBench - Tag: [benchmark] - Benchmark for evaluating reward models used in preference optimization pipelines.

RLHF / DPO / RLAIF Data

UltraFeedback - Tag: [dataset] - Large-scale AI feedback dataset commonly used for instruction tuning and preference optimization.
Argilla UltraFeedback Binarized Preferences - Tag: [dataset] - Processed preference-pair version of UltraFeedback for DPO-style training.
TRL - Tag: [tool] - Training library for SFT, reward modeling, PPO, DPO, and related preference optimization workflows.
OpenRLHF - Tag: [tool] - Open-source RLHF framework covering reward modeling and alignment training pipelines.
Direct Preference Optimization - Tag: [paper] - Paper introducing DPO, a widely used method for training directly from preference pairs.
Constitutional AI - Tag: [paper] - Paper introducing a framework for using principles and AI feedback to reduce reliance on direct human labels.

Synthetic Data

Self-Instruct - Tag: [tool] - Repository for generating instruction-following data from language models with bootstrapped prompts.
distilabel - Tag: [tool] - Framework for building synthetic data and AI feedback pipelines with reproducible workflows.
DSPy - Tag: [tool] - Programming framework that can optimize prompts and generate training/evaluation data for LM systems.
Awesome Synthetic Datasets - Tag: [report] - Curated list of synthetic datasets and generation resources across text and multimodal tasks.
Self-Instruct Paper - Tag: [paper] - Paper describing bootstrapped instruction generation for aligning language models.
WizardLM / Evol-Instruct - Tag: [paper] - Paper describing evol-instruct style generation for creating complex instruction data.

RAG Evaluation Data

Ragas - Tag: [tool] - Evaluation framework for RAG systems with metrics for retrieval and generation quality.
DeepEval - Tag: [tool] - Open-source evaluation framework that supports RAG, LLM, and agent evaluation workflows.
BEIR - Tag: [benchmark] - Retrieval benchmark suite often used to evaluate document ranking and search components.
KILT - Tag: [benchmark] - Knowledge-intensive language task benchmark connecting tasks to provenance-bearing corpora.
HotpotQA - Tag: [dataset] - Multi-hop question-answering dataset useful for retrieval and evidence-chain evaluation.
Natural Questions - Tag: [dataset] - Open-domain question-answering dataset frequently used in retrieval and QA evaluation.

Agent Evaluation and Trajectory Data

Harbor - Tag: [tool] - Framework for running agent evaluations, collecting trajectories, and creating RL environments in sandboxed settings.
Claw-Eval - Tag: [benchmark] - Autonomous-agent evaluation suite emphasizing trajectory-aware grading, safety assessment, and repeated-trial robustness.
Terminal-Bench - Tag: [benchmark] - Benchmark for evaluating agents on terminal-based tasks with executable environments and verifiers.
SWE-Bench - Tag: [benchmark] - Software engineering benchmark for evaluating agents on real GitHub issue resolution tasks.
WebArena - Tag: [benchmark] - Web-based agent benchmark for evaluating interactive task completion in simulated websites.
OSWorld - Tag: [benchmark] - Computer-use benchmark for evaluating multimodal agents in desktop operating-system environments.

Financial-domain LLM Data

FinEval - Tag: [benchmark] - Chinese financial-domain benchmark for evaluating LLM financial knowledge and safety.
PIXIU / FinBen - Tag: [benchmark] - Financial LLM benchmark and framework covering multiple financial tasks and datasets.
FinGPT - Tag: [tool] - Open-source project for financial LLM research, data pipelines, and domain adaptation.
FinNLP - Tag: [tool] - Financial NLP toolkit for collecting and processing finance-related text data.
FinanceBench - Tag: [benchmark] - Benchmark for evaluating LLM performance on financial question-answering tasks grounded in public filings.
FinQA - Tag: [dataset] - Dataset for numerical reasoning over financial reports.
TAT-QA - Tag: [dataset] - Table-and-text question-answering dataset built around hybrid reasoning over financial reports.

Practitioner Guides

Claw-style Agent Evaluation Notes - Notes on trajectory-aware grading, repeated trials, safety evidence, and how Harbor maps to this evaluation pattern.
Harbor Repeated-trial Metric Example - A small metric.py example for reporting mean reward, pass@k, pass^k, and missing-evidence rate.
LLM Training Data Operating Model - A practical operating loop for source review, profiling, filtering, annotation, evaluation, release, and governance.
LLM Training Data Quality Rubric - A practical checklist for reviewing public LLM training, tuning, preference, synthetic, or evaluation datasets.
Financial-domain LLM Evaluation Checklist - A governance-oriented checklist for financial-domain LLM evaluation without private data or investment claims.
Annotation Quality and Adjudication Guide - A practical guide for calibration, agreement, adjudication, reviewer drift, and preference-data annotation quality.
Preference Data Quality Checklist - A review checklist for human preference data, AI feedback data, and DPO/RLHF dataset suitability.
Financial-domain Benchmark Inclusion Criteria - Conservative criteria for including financial-domain LLM benchmarks and datasets.
Upstream Contribution Shortlist - A conservative plan for useful future contributions to public LLM data-quality projects.
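The repeated-trial metrics named in the guides above can be sketched as follows. This is not the repository's metric.py and not Harbor's API: the `Trial` fields and function names are hypothetical, pass@k is assumed to be the usual unbiased estimator (at least one of k sampled trials passes), and pass^k is assumed to mean that all k sampled trials pass.

```python
from dataclasses import dataclass
from math import comb


@dataclass
class Trial:
    """One run of an agent on a task (hypothetical record shape)."""
    reward: float        # graded score for the run
    passed: bool         # whether the run met the task's pass criterion
    has_evidence: bool   # whether the run left a trajectory or log to grade against


def pass_at_k(n, c, k):
    """Chance that at least one of k trials, drawn without replacement from n, passed."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Chance that all k trials, drawn without replacement from n, passed (pass^k)."""
    if not 0 < k <= n:
        raise ValueError("k must satisfy 0 < k <= n")
    return comb(c, k) / comb(n, k)


def summarize(trials, k):
    """Aggregate repeated trials of a single task into the four reported metrics."""
    n = len(trials)
    c = sum(t.passed for t in trials)
    return {
        "mean_reward": sum(t.reward for t in trials) / n,
        "pass_at_k": pass_at_k(n, c, k),
        "pass_hat_k": pass_hat_k(n, c, k),
        "missing_evidence_rate": sum(not t.has_evidence for t in trials) / n,
    }
```

Reporting pass@k and pass^k side by side separates "can succeed at least once" from "succeeds reliably", and the missing-evidence rate flags runs whose outcome cannot be audited from a recorded trajectory.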

Data Governance

Datasheets for Datasets - Tag: [paper] - Foundational paper proposing structured documentation for dataset motivation, composition, collection, and maintenance.
Data Cards Playbook - Tag: [report] - Practical guide for documenting datasets in a consistent and responsible way.
Croissant - Tag: [governance] - MLCommons metadata format for machine learning datasets.
DataHub - Tag: [tool] - Open-source metadata platform for dataset discovery, lineage, and governance.
OpenMetadata - Tag: [tool] - Open-source metadata and governance platform for data assets.
DVC - Tag: [tool] - Data versioning and pipeline tool useful for reproducible dataset releases.

Privacy and Compliance

Microsoft Presidio - Tag: [tool] - Open-source framework for detecting and anonymizing personally identifiable information.
scrubadub - Tag: [tool] - Python library for removing personally identifiable information from free text.
LLM Guard - Tag: [tool] - Toolkit for input/output scanning, including sensitive-data and prompt-risk checks.
Google Differential Privacy - Tag: [tool] - Libraries and tools for building differentially private data analysis workflows.
NIST AI Risk Management Framework - Tag: [governance] - Risk management framework relevant to AI system governance and documentation.
OWASP Top 10 for LLM Applications - Tag: [governance] - Security and risk reference for LLM application development and deployment.

Papers

Training Language Models to Follow Instructions with Human Feedback - Tag: [paper] - InstructGPT paper connecting supervised data, preference data, and RLHF.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback - Tag: [paper] - Paper describing preference data and RLHF training for assistant behavior.
Deduplicating Training Data Makes Language Models Better - Tag: [paper] - Study showing why training-data duplication matters for language model behavior.
LIMA: Less Is More for Alignment - Tag: [paper] - Paper studying how small, carefully curated supervised datasets can affect alignment behavior.
The BigScience ROOTS Corpus - Tag: [paper] - Documentation of the multilingual corpus used to train BLOOM, including governance and sourcing details.
Direct Preference Optimization - Tag: [paper] - Preference optimization paper focused on training from pairwise preference data.

Open-source Tools

Hugging Face Datasets - Tag: [tool] - Library for dataset loading, transformation, streaming, and sharing.
DataTrove - Tag: [tool] - Large-scale text data processing framework for LLM corpus preparation.
Data-Juicer - Tag: [tool] - Data processing and quality-analysis toolkit for LLM and multimodal data.
Dolma Toolkit - Tag: [tool] - AI2 toolkit for building and analyzing large pretraining corpora.
Label Studio - Tag: [platform] - General-purpose open-source annotation platform for multimodal labeling workflows.
Argilla - Tag: [platform] - Feedback and annotation platform for LLM data workflows.
Ragas - Tag: [tool] - RAG evaluation library for retrieval and generation metrics.
TRL - Tag: [tool] - Preference optimization and alignment training library.

Reports and Playbooks

Data Cards Playbook - Tag: [report] - Practical playbook for transparent dataset documentation.
HELM - Tag: [benchmark] - Holistic evaluation framework and reports for language model evaluation.
NIST AI RMF Playbook - Tag: [report] - Operational playbook for applying the NIST AI Risk Management Framework.
FineWeb Blog - Tag: [report] - Hugging Face write-up explaining the design and filtering choices behind FineWeb.
The Turing Way: Research Data Management - Tag: [report] - Practical guide to reproducible research data management.

Contributing

Contributions are welcome if they meet the Quality Bar. Please read CONTRIBUTING.md before opening an issue or pull request.

License

This repository is licensed under CC BY 4.0. Linked third-party resources keep their own licenses and terms.
