Alfonsobang/awesome-llm-training-data

Awesome LLM Training Data

A curated list of tools, papers, datasets, and best practices for LLM training data quality, annotation, preference data, synthetic data, data governance, and evaluation.

Built for practitioners who need to choose, audit, or govern LLM datasets. This is not a general AI bookmark dump.

Use this repo to:

  • Find public tools for cleaning, deduplication, inspection, annotation, preference data, synthetic data, and RAG evaluation.
  • Compare dataset-quality and governance references before building internal data pipelines.
  • Locate public financial-domain LLM benchmarks and datasets without mixing in private or proprietary material.

English first. Complete Chinese version: README.zh-CN.md.

Disclaimer: This repository does not contain private company data, real user data, or proprietary workflows.

Why Training Data Quality Deserves First-Class Engineering

LLM behavior is shaped as much by data decisions as by model architecture. Data collection, filtering, annotation, preference modeling, evaluation design, privacy review, and governance all create production risk when treated as ad hoc work. Training data quality deserves first-class engineering because it determines reproducibility, safety, domain reliability, evaluation validity, and whether teams can explain what changed when model behavior changes.

Scope

This list focuses on public resources that help teams engineer, evaluate, document, or govern LLM training and evaluation data. It does not rank vendors, recommend private datasets, publish internal playbooks, or treat popularity as proof of quality.

Financial and Regulated-Domain Note

Financial-domain resources should be public, reproducible, and useful for evaluation or data engineering. This list avoids investment advice, trading signals, private business data, and claims that a benchmark or dataset proves production readiness.

Quality Bar

  • No fake links.
  • No private or proprietary resources.
  • No low-quality SEO content.
  • Prefer active and reproducible resources.
  • Prefer resources useful to real LLM data teams.
  • Prefer primary sources, official repositories, dataset cards, papers, and standards.
  • Include a resource only when its relevance to LLM data engineering is clear.
  • Include access, license, or usage constraints in the description when they are material.

The repository also runs a lightweight resource audit to check resource format, allowed tags, placeholder links, duplicate-link risk, and English/Chinese resource-count consistency.
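
As a rough illustration of what such an audit might check, the sketch below validates entries shaped like the ones in this list (`Name - Tag: [tag] - Description.`). It is not the repository's actual audit script: the function name `audit_entries`, the regular expression, and the allowed-tag set (taken from the tags that appear in this list) are assumptions for illustration. Repeated names stand in for duplicate-link risk because the sketch works on plain text without URLs.

```python
import re
from collections import Counter

# Assumption: the allowed set is simply the tags that appear in this list.
ALLOWED_TAGS = {"benchmark", "dataset", "governance", "paper", "platform", "report", "tool"}

# Expected entry shape: "Name - Tag: [tag] - Description."
ENTRY_RE = re.compile(r"^(?P<name>.+?) - Tag: \[(?P<tag>[a-z]+)\] - (?P<desc>.+)$")


def audit_entries(lines):
    """Return (line_number, problem) tuples for the given resource entries."""
    problems = []
    seen = Counter()
    for number, raw in enumerate(lines, start=1):
        # Drop indentation and a leading bullet marker, if any.
        entry = raw.strip().lstrip("•-* ").strip()
        if not entry:
            continue
        match = ENTRY_RE.match(entry)
        if match is None:
            problems.append((number, "bad format"))
            continue
        if match["tag"] not in ALLOWED_TAGS:
            problems.append((number, f"unknown tag: {match['tag']}"))
        if not match["desc"].endswith("."):
            problems.append((number, "description should end with a period"))
        seen[match["name"]] += 1
        if seen[match["name"]] > 1:
            problems.append((number, f"duplicate entry: {match['name']}"))
    return problems


if __name__ == "__main__":
    sample = [
        "  • DataTrove - Tag: [tool] - Processing library.",
        "  • Mystery - Tag: [vendor] - Unknown tag.",
    ]
    for number, problem in audit_entries(sample):
        print(f"line {number}: {problem}")
```

Run per section, a check like this flags entries that drift from the list format before they are merged.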

Start Here

  • DataPerf - Tag: [benchmark] - MLCommons benchmark suite focused on measuring the impact of data quality and data-centric ML work.
  • DataComp-LM - Tag: [benchmark] - A data-centric benchmark for studying how language-model pretraining data choices affect downstream results.
  • Hugging Face Datasets - Tag: [tool] - Core library and documentation for loading, processing, sharing, and versioning datasets.
  • The Pile - Tag: [paper] - Paper describing a large open text corpus and practical dataset composition decisions.
  • Data-Centric AI Resources - Tag: [report] - A curated list of data-centric AI papers, tools, and benchmarks.

Training Data Quality

  • Data-Juicer - Tag: [tool] - Open-source toolkit for analyzing, filtering, and processing large multimodal and text datasets.
  • FineWeb - Tag: [dataset] - A large open web dataset with transparent processing choices for LLM pretraining research.
  • FineWeb-Edu - Tag: [dataset] - A filtered educational subset of FineWeb useful for studying quality-oriented corpus selection.
  • Dolma - Tag: [dataset] - An open corpus from AI2 designed to support reproducible language-model pretraining research.
  • RefinedWeb - Tag: [paper] - Paper describing web-scale filtering and corpus construction choices behind the RefinedWeb dataset.
  • DataComp-LM Paper - Tag: [paper] - Paper framing LLM pretraining data selection as a controlled benchmark problem.

Data Cleaning and Deduplication

  • DataTrove - Tag: [tool] - Hugging Face processing library for large-scale web data extraction, filtering, and deduplication.
  • text-dedup - Tag: [tool] - Toolkit for exact, near, and semantic deduplication of text datasets.
  • datasketch - Tag: [tool] - Python library for MinHash, LSH, and other probabilistic data structures often used in near-dedup pipelines.
  • Trafilatura - Tag: [tool] - Web text extraction library useful for turning HTML pages into cleaner text before dataset filtering.
  • jusText - Tag: [tool] - Boilerplate-removal library for extracting main textual content from web pages.
  • tiktoken - Tag: [tool] - Fast tokenizer library useful for estimating token distributions, truncation behavior, and corpus size.
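
For readers new to near-deduplication, the following is a minimal standard-library sketch of the MinHash idea that several of the tools above build on. It does not use the API of any listed library; the helper names (`shingles`, `minhash_signature`, `estimated_jaccard`) and the parameter choices (3-word shingles, 64 hash functions) are assumptions for illustration, and a production pipeline would typically add LSH bucketing on top.

```python
import hashlib


def shingles(text, size=3):
    """Return the set of lowercase word n-grams (shingles) in a text."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def minhash_signature(items, num_perm=64):
    """Approximate `num_perm` random permutations with salted hashes,
    keeping the minimum hash value under each one."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
    sig_a = minhash_signature(shingles(doc_a))
    sig_b = minhash_signature(shingles(doc_b))
    print(f"estimated similarity: {estimated_jaccard(sig_a, sig_b):.2f}")
```

Pairs whose estimated similarity exceeds a chosen threshold are candidates for removal. The threshold and shingle size are corpus-dependent and should be tuned on a labeled sample.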

Dataset Inspection Tools

  • Lilac - Tag: [tool] - Dataset exploration tool for clustering, searching, labeling, and inspecting large text datasets.
  • Renumics Spotlight - Tag: [tool] - Interactive tool for exploring embeddings, metadata, labels, and dataset slices.
  • FiftyOne - Tag: [tool] - Dataset visualization and curation platform especially useful for multimodal and vision-language data.
  • Cleanlab - Tag: [tool] - Library for finding label issues, outliers, and data quality problems in ML datasets.
  • whylogs - Tag: [tool] - Data profiling library for tracking dataset statistics and drift over time.
  • Evidently - Tag: [tool] - Open-source evaluation and monitoring toolkit for data and model quality reports.

Annotation Platforms

  • Label Studio - Tag: [platform] - Open-source data labeling platform supporting text, image, audio, and multimodal workflows.
  • Argilla - Tag: [platform] - Open-source platform for human and LLM feedback workflows, dataset curation, and preference data.
  • Doccano - Tag: [platform] - Open-source annotation tool for text classification, sequence labeling, and sequence-to-sequence tasks.
  • INCEpTION - Tag: [platform] - Semantic annotation platform with support for knowledge-oriented and NLP annotation projects.
  • Label Sleuth - Tag: [platform] - Open-source no-code text classification labeling tool with active learning workflows.

Annotation Quality and Agreement

  • Cleanlab - Tag: [tool] - Useful for surfacing likely label errors and prioritizing review work in annotated datasets.
  • NLTK Agreement - Tag: [tool] - NLTK module for calculating inter-annotator agreement measures.
  • scikit-learn Cohen Kappa - Tag: [tool] - Reference implementation for Cohen's kappa, a common pairwise agreement metric.
  • statsmodels Fleiss Kappa - Tag: [tool] - Implementation of Fleiss' kappa for agreement across multiple annotators.
  • fast-krippendorff - Tag: [tool] - Fast implementation of Krippendorff's alpha for measuring annotation reliability.
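
As a worked example of what these agreement metrics compute, the sketch below implements Cohen's kappa for two annotators from its definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label frequencies. The function name and the sample labels are hypothetical; in practice one of the listed library implementations is the better choice.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of per-annotator label frequencies, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined
        # here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "pos"]
    annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
    # Observed agreement is 0.80 and chance agreement is 0.52,
    # so kappa = 0.28 / 0.48, roughly 0.58.
    print(f"kappa = {cohen_kappa(annotator_a, annotator_b):.2f}")
```

The example shows why raw agreement overstates reliability: 80% raw agreement drops to a kappa of about 0.58 once chance agreement on a two-label task is discounted.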

Human Preference Data

  • Anthropic HH-RLHF - Tag: [dataset] - Human preference dataset for helpful and harmless assistant behavior research.
  • OpenAI summarize_from_feedback - Tag: [dataset] - Human feedback data for training and evaluating summarization preference models.
  • OpenAI WebGPT Comparisons - Tag: [dataset] - Comparison data collected for web-browsing question-answering model research.
  • Stanford Human Preferences - Tag: [dataset] - Preference dataset built from naturally occurring Reddit question-answer interactions.
  • Chatbot Arena Conversations - Tag: [dataset] - Public conversation data from Chatbot Arena useful for studying comparative human judgments.
  • RewardBench - Tag: [benchmark] - Benchmark for evaluating reward models used in preference optimization pipelines.

RLHF / DPO / RLAIF Data

  • UltraFeedback - Tag: [dataset] - Large-scale AI feedback dataset commonly used for instruction tuning and preference optimization.
  • Argilla UltraFeedback Binarized Preferences - Tag: [dataset] - Processed preference-pair version of UltraFeedback for DPO-style training.
  • TRL - Tag: [tool] - Training library for SFT, reward modeling, PPO, DPO, and related preference optimization workflows.
  • OpenRLHF - Tag: [tool] - Open-source RLHF framework covering reward modeling and alignment training pipelines.
  • Direct Preference Optimization - Tag: [paper] - Paper introducing DPO, a widely used method for training directly from preference pairs.
  • Constitutional AI - Tag: [paper] - Paper introducing a framework for using principles and AI feedback to reduce reliance on direct human labels.

Synthetic Data

  • Self-Instruct - Tag: [tool] - Repository for generating instruction-following data from language models with bootstrapped prompts.
  • distilabel - Tag: [tool] - Framework for building synthetic data and AI feedback pipelines with reproducible workflows.
  • DSPy - Tag: [tool] - Programming framework that can optimize prompts and generate training/evaluation data for LM systems.
  • Awesome Synthetic Datasets - Tag: [report] - Curated list of synthetic datasets and generation resources across text and multimodal tasks.
  • Self-Instruct Paper - Tag: [paper] - Paper describing bootstrapped instruction generation for aligning language models.
  • WizardLM / Evol-Instruct - Tag: [paper] - Paper describing evol-instruct style generation for creating complex instruction data.

RAG Evaluation Data

  • Ragas - Tag: [tool] - Evaluation framework for RAG systems with metrics for retrieval and generation quality.
  • DeepEval - Tag: [tool] - Open-source evaluation framework that supports RAG, LLM, and agent evaluation workflows.
  • BEIR - Tag: [benchmark] - Retrieval benchmark suite often used to evaluate document ranking and search components.
  • KILT - Tag: [benchmark] - Knowledge-intensive language task benchmark connecting tasks to provenance-bearing corpora.
  • HotpotQA - Tag: [dataset] - Multi-hop question-answering dataset useful for retrieval and evidence-chain evaluation.
  • Natural Questions - Tag: [dataset] - Open-domain question-answering dataset frequently used in retrieval and QA evaluation.

Agent Evaluation and Trajectory Data

  • Harbor - Tag: [tool] - Framework for running agent evaluations, collecting trajectories, and creating RL environments in sandboxed settings.
  • Claw-Eval - Tag: [benchmark] - Autonomous-agent evaluation suite emphasizing trajectory-aware grading, safety assessment, and repeated-trial robustness.
  • Terminal-Bench - Tag: [benchmark] - Benchmark for evaluating agents on terminal-based tasks with executable environments and verifiers.
  • SWE-Bench - Tag: [benchmark] - Software engineering benchmark for evaluating agents on real GitHub issue resolution tasks.
  • WebArena - Tag: [benchmark] - Web-based agent benchmark for evaluating interactive task completion in simulated websites.
  • OSWorld - Tag: [benchmark] - Computer-use benchmark for evaluating multimodal agents in desktop operating-system environments.

Financial-Domain LLM Data

  • FinEval - Tag: [benchmark] - Chinese financial-domain benchmark for evaluating LLM financial knowledge and safety.
  • PIXIU / FinBen - Tag: [benchmark] - Financial LLM benchmark and framework covering multiple financial tasks and datasets.
  • FinGPT - Tag: [tool] - Open-source project for financial LLM research, data pipelines, and domain adaptation.
  • FinNLP - Tag: [tool] - Financial NLP toolkit for collecting and processing finance-related text data.
  • FinanceBench - Tag: [benchmark] - Benchmark for evaluating LLM performance on financial question-answering tasks grounded in public filings.
  • FinQA - Tag: [dataset] - Dataset for numerical reasoning over financial reports.
  • TAT-QA - Tag: [dataset] - Table-and-text question-answering dataset built around hybrid reasoning over financial reports.

Practitioner Guides

Data Governance

  • Datasheets for Datasets - Tag: [paper] - Foundational paper proposing structured documentation for dataset motivation, composition, collection, and maintenance.
  • Data Cards Playbook - Tag: [report] - Practical guide for documenting datasets in a consistent and responsible way.
  • Croissant - Tag: [governance] - MLCommons metadata format for machine learning datasets.
  • DataHub - Tag: [tool] - Open-source metadata platform for dataset discovery, lineage, and governance.
  • OpenMetadata - Tag: [tool] - Open-source metadata and governance platform for data assets.
  • DVC - Tag: [tool] - Data versioning and pipeline tool useful for reproducible dataset releases.

Privacy and Compliance

  • Microsoft Presidio - Tag: [tool] - Open-source framework for detecting and anonymizing personally identifiable information.
  • scrubadub - Tag: [tool] - Python library for removing personally identifiable information from free text.
  • LLM Guard - Tag: [tool] - Toolkit for input/output scanning, including sensitive-data and prompt-risk checks.
  • Google Differential Privacy - Tag: [tool] - Libraries and tools for building differentially private data analysis workflows.
  • NIST AI Risk Management Framework - Tag: [governance] - Risk management framework relevant to AI system governance and documentation.
  • OWASP Top 10 for LLM Applications - Tag: [governance] - Security and risk reference for LLM application development and deployment.

Papers

Open-Source Tools

  • Hugging Face Datasets - Tag: [tool] - Library for dataset loading, transformation, streaming, and sharing.
  • DataTrove - Tag: [tool] - Large-scale text data processing framework for LLM corpus preparation.
  • Data-Juicer - Tag: [tool] - Data processing and quality-analysis toolkit for LLM and multimodal data.
  • Dolma Toolkit - Tag: [tool] - AI2 toolkit for building and analyzing large pretraining corpora.
  • Label Studio - Tag: [platform] - General-purpose open-source annotation platform for multimodal labeling workflows.
  • Argilla - Tag: [platform] - Feedback and annotation platform for LLM data workflows.
  • Ragas - Tag: [tool] - RAG evaluation library for retrieval and generation metrics.
  • TRL - Tag: [tool] - Preference optimization and alignment training library.

Reports and Playbooks

  • Data Cards Playbook - Tag: [report] - Practical playbook for transparent dataset documentation.
  • HELM - Tag: [benchmark] - Holistic evaluation framework and reports for language model evaluation.
  • NIST AI RMF Playbook - Tag: [report] - Operational playbook for applying the NIST AI Risk Management Framework.
  • FineWeb Blog - Tag: [report] - Hugging Face write-up explaining the design and filtering choices behind FineWeb.
  • The Turing Way: Research Data Management - Tag: [report] - Practical guide to reproducible research data management.

Contributing

Contributions are welcome if they meet the Quality Bar. Please read CONTRIBUTING.md before opening an issue or pull request.

License

This repository is licensed under CC BY 4.0. Linked third-party resources keep their own licenses and terms.
