A curated list of tools, papers, datasets, and best practices for LLM training data quality, annotation, preference data, synthetic data, data governance, and evaluation.
Built for practitioners who need to choose, audit, or govern LLM datasets. This is not a general AI bookmark dump.
Use this repo to:
- Find public tools for cleaning, deduplication, inspection, annotation, preference data, synthetic data, and RAG evaluation.
- Compare dataset-quality and governance references before building internal data pipelines.
- Locate public financial-domain LLM benchmarks and datasets without mixing in private or proprietary material.
English first. Complete Chinese version: README.zh-CN.md.
Disclaimer: This repository does not contain private company data, real user data, or proprietary workflows.
LLM behavior is shaped as much by data decisions as by model architecture. Data collection, filtering, annotation, preference modeling, evaluation design, privacy review, and governance all create production risk when treated as ad hoc work. Training data quality deserves first-class engineering because it determines reproducibility, safety, domain reliability, evaluation validity, and whether teams can explain what changed when model behavior changes.
This list focuses on public resources that help teams engineer, evaluate, document, or govern LLM training and evaluation data. It does not rank vendors, recommend private datasets, publish internal playbooks, or treat popularity as proof of quality.
Financial-domain resources should be public, reproducible, and useful for evaluation or data engineering. This list avoids investment advice, trading signals, private business data, and claims that a benchmark or dataset proves production readiness.
- No fake links.
- No private or proprietary resources.
- No low-quality SEO content.
- Prefer active and reproducible resources.
- Prefer resources useful to real LLM data teams.
- Prefer primary sources, official repositories, dataset cards, papers, and standards.
- Include a resource only when its relevance to LLM data engineering is clear.
- Include access, license, or usage constraints in the description when they are material.
The repository also runs a lightweight resource audit to check resource format, allowed tags, placeholder links, duplicate-link risk, and English/Chinese resource-count consistency.
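As a rough illustration of the checks named above, the sketch below audits resource lines for format, allowed tags, placeholder links, and possible duplicate links. It is not the repository's actual audit script. The entry shape `- [Name](url) - Tag: [tag] - Description`, the tag vocabulary (inferred from the tags used in this list), the placeholder markers, and the function name `audit_entries` are all assumptions made for the example.

```python
import re
from collections import Counter

# Assumed tag vocabulary, inferred from the tags that appear in this list.
ALLOWED_TAGS = {"tool", "paper", "dataset", "benchmark", "platform", "report", "governance"}

# Assumed entry shape: "- [Name](url) - Tag: [tag] - Description"
ENTRY_RE = re.compile(
    r"^- \[(?P<name>[^\]]+)\]\((?P<url>[^)\s]+)\) - Tag: \[(?P<tag>[a-z]+)\] - (?P<desc>.+)$"
)

# Hypothetical markers that a real link should not contain.
PLACEHOLDER_HINTS = ("example.com", "TODO", "your-link-here")


def audit_entries(lines):
    """Return (line_number, message) pairs for every problem found."""
    problems = []
    seen = Counter()
    for number, line in enumerate(lines, start=1):
        match = ENTRY_RE.match(line.strip())
        if match is None:
            problems.append((number, "entry does not match the expected format"))
            continue
        url = match["url"]
        if match["tag"] not in ALLOWED_TAGS:
            problems.append((number, f"tag not allowed: {match['tag']}"))
        if any(hint in url for hint in PLACEHOLDER_HINTS):
            problems.append((number, f"placeholder link: {url}"))
        # Light normalization so a trailing slash or letter case does not hide a repeat.
        key = url.rstrip("/").lower()
        seen[key] += 1
        if seen[key] > 1:
            problems.append((number, f"possible duplicate link: {url}"))
    return problems
```

A duplicate is reported only as a possibility, since the same resource can legitimately appear in more than one section with a different description.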
- Scope
- Financial and Regulated-Domain Note
- Start Here
- Training Data Quality
- Data Cleaning and Deduplication
- Dataset Inspection Tools
- Annotation Platforms
- Annotation Quality and Agreement
- Human Preference Data
- RLHF / DPO / RLAIF Data
- Synthetic Data
- RAG Evaluation Data
- Agent Evaluation and Trajectory Data
- Financial-domain LLM Data
- Practitioner Guides
- Data Governance
- Privacy and Compliance
- Papers
- Open-source Tools
- Reports and Playbooks
- Contributing
- License
- DataPerf - Tag: [benchmark] - MLCommons benchmark suite focused on measuring the impact of data quality and data-centric ML work.
- DataComp-LM - Tag: [benchmark] - A data-centric benchmark for studying how language-model pretraining data choices affect downstream results.
- Hugging Face Datasets - Tag: [tool] - Core library and documentation for loading, processing, sharing, and versioning datasets.
- The Pile - Tag: [paper] - Paper describing a large open text corpus and practical dataset composition decisions.
- Data-Centric AI Resources - Tag: [report] - A curated list of data-centric AI papers, tools, and benchmarks.
- Data-Juicer - Tag: [tool] - Open-source toolkit for analyzing, filtering, and processing large multimodal and text datasets.
- FineWeb - Tag: [dataset] - A large open web dataset with transparent processing choices for LLM pretraining research.
- FineWeb-Edu - Tag: [dataset] - A filtered educational subset of FineWeb useful for studying quality-oriented corpus selection.
- Dolma - Tag: [dataset] - An open corpus from AI2 designed to support reproducible language-model pretraining research.
- RefinedWeb - Tag: [paper] - Paper describing web-scale filtering and corpus construction choices behind the RefinedWeb dataset.
- DataComp-LM Paper - Tag: [paper] - Paper framing LLM pretraining data selection as a controlled benchmark problem.
- DataTrove - Tag: [tool] - Hugging Face processing library for large-scale web data extraction, filtering, and deduplication.
- text-dedup - Tag: [tool] - Toolkit for exact, near, and semantic deduplication of text datasets.
- datasketch - Tag: [tool] - Python library for MinHash, LSH, and other probabilistic data structures often used in near-dedup pipelines.
- Trafilatura - Tag: [tool] - Web text extraction library useful for turning HTML pages into cleaner text before dataset filtering.
- jusText - Tag: [tool] - Boilerplate-removal library for extracting main textual content from web pages.
- tiktoken - Tag: [tool] - Fast tokenizer library useful for estimating token distributions, truncation behavior, and corpus size.
- Lilac - Tag: [tool] - Dataset exploration tool for clustering, searching, labeling, and inspecting large text datasets.
- Renumics Spotlight - Tag: [tool] - Interactive tool for exploring embeddings, metadata, labels, and dataset slices.
- FiftyOne - Tag: [tool] - Dataset visualization and curation platform especially useful for multimodal and vision-language data.
- Cleanlab - Tag: [tool] - Library for finding label issues, outliers, and data quality problems in ML datasets.
- whylogs - Tag: [tool] - Data profiling library for tracking dataset statistics and drift over time.
- Evidently - Tag: [tool] - Open-source evaluation and monitoring toolkit for data and model quality reports.
- Label Studio - Tag: [platform] - Open-source data labeling platform supporting text, image, audio, and multimodal workflows.
- Argilla - Tag: [platform] - Open-source platform for human and LLM feedback workflows, dataset curation, and preference data.
- Doccano - Tag: [platform] - Open-source annotation tool for text classification, sequence labeling, and sequence-to-sequence tasks.
- INCEpTION - Tag: [platform] - Semantic annotation platform with support for knowledge-oriented and NLP annotation projects.
- Label Sleuth - Tag: [platform] - Open-source no-code text classification labeling tool with active learning workflows.
- Cleanlab - Tag: [tool] - Useful for surfacing likely label errors and prioritizing review work in annotated datasets.
- NLTK Agreement - Tag: [tool] - NLTK module for calculating inter-annotator agreement measures.
- scikit-learn Cohen Kappa - Tag: [tool] - Reference implementation for Cohen's kappa, a common pairwise agreement metric.
- statsmodels Fleiss Kappa - Tag: [tool] - Implementation of Fleiss' kappa for agreement across multiple annotators.
- fast-krippendorff - Tag: [tool] - Fast implementation of Krippendorff's alpha for measuring annotation reliability.
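
For intuition about what the agreement tools above compute, here is a minimal plain-Python sketch of Cohen's kappa for two annotators, using the standard definition kappa = (p_o - p_e) / (1 - p_e). The function name `cohen_kappa` is chosen for this example only; for real work, prefer the library implementations listed above.

```python
import math
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over labels of the product of each annotator's label rates.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is undefined.
        return math.nan
    return (observed - expected) / (1.0 - expected)
```

Raw percent agreement can look high when one label dominates, which is why the chance-corrected value is usually reported alongside it.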
- Anthropic HH-RLHF - Tag: [dataset] - Human preference dataset for helpful and harmless assistant behavior research.
- OpenAI summarize_from_feedback - Tag: [dataset] - Human feedback data for training and evaluating summarization preference models.
- OpenAI WebGPT Comparisons - Tag: [dataset] - Comparison data collected for web-browsing question-answering model research.
- Stanford Human Preferences - Tag: [dataset] - Preference dataset built from naturally occurring Reddit question-answer interactions.
- Chatbot Arena Conversations - Tag: [dataset] - Public conversation data from Chatbot Arena useful for studying comparative human judgments.
- RewardBench - Tag: [benchmark] - Benchmark for evaluating reward models used in preference optimization pipelines.
- UltraFeedback - Tag: [dataset] - Large-scale AI feedback dataset commonly used for instruction tuning and preference optimization.
- Argilla UltraFeedback Binarized Preferences - Tag: [dataset] - Processed preference-pair version of UltraFeedback for DPO-style training.
- TRL - Tag: [tool] - Training library for SFT, reward modeling, PPO, DPO, and related preference optimization workflows.
- OpenRLHF - Tag: [tool] - Open-source RLHF framework covering reward modeling and alignment training pipelines.
- Direct Preference Optimization - Tag: [paper] - Paper introducing DPO, a widely used method for training directly from preference pairs.
- Constitutional AI - Tag: [paper] - Paper introducing a framework for using principles and AI feedback to reduce reliance on direct human labels.
- Self-Instruct - Tag: [tool] - Repository for generating instruction-following data from language models with bootstrapped prompts.
- distilabel - Tag: [tool] - Framework for building synthetic data and AI feedback pipelines with reproducible workflows.
- DSPy - Tag: [tool] - Programming framework that can optimize prompts and generate training/evaluation data for LM systems.
- Awesome Synthetic Datasets - Tag: [report] - Curated list of synthetic datasets and generation resources across text and multimodal tasks.
- Self-Instruct Paper - Tag: [paper] - Paper describing bootstrapped instruction generation for aligning language models.
- WizardLM / Evol-Instruct - Tag: [paper] - Paper describing evol-instruct style generation for creating complex instruction data.
- Ragas - Tag: [tool] - Evaluation framework for RAG systems with metrics for retrieval and generation quality.
- DeepEval - Tag: [tool] - Open-source evaluation framework that supports RAG, LLM, and agent evaluation workflows.
- BEIR - Tag: [benchmark] - Retrieval benchmark suite often used to evaluate document ranking and search components.
- KILT - Tag: [benchmark] - Knowledge-intensive language task benchmark connecting tasks to provenance-bearing corpora.
- HotpotQA - Tag: [dataset] - Multi-hop question-answering dataset useful for retrieval and evidence-chain evaluation.
- Natural Questions - Tag: [dataset] - Open-domain question-answering dataset frequently used in retrieval and QA evaluation.
- Harbor - Tag: [tool] - Framework for running agent evaluations, collecting trajectories, and creating RL environments in sandboxed settings.
- Claw-Eval - Tag: [benchmark] - Autonomous-agent evaluation suite emphasizing trajectory-aware grading, safety assessment, and repeated-trial robustness.
- Terminal-Bench - Tag: [benchmark] - Benchmark for evaluating agents on terminal-based tasks with executable environments and verifiers.
- SWE-Bench - Tag: [benchmark] - Software engineering benchmark for evaluating agents on real GitHub issue resolution tasks.
- WebArena - Tag: [benchmark] - Web-based agent benchmark for evaluating interactive task completion in simulated websites.
- OSWorld - Tag: [benchmark] - Computer-use benchmark for evaluating multimodal agents in desktop operating-system environments.
- FinEval - Tag: [benchmark] - Chinese financial-domain benchmark for evaluating LLM financial knowledge and safety.
- PIXIU / FinBen - Tag: [benchmark] - Financial LLM benchmark and framework covering multiple financial tasks and datasets.
- FinGPT - Tag: [tool] - Open-source project for financial LLM research, data pipelines, and domain adaptation.
- FinNLP - Tag: [tool] - Financial NLP toolkit for collecting and processing finance-related text data.
- FinanceBench - Tag: [benchmark] - Benchmark for evaluating LLM performance on financial question-answering tasks grounded in public filings.
- FinQA - Tag: [dataset] - Dataset for numerical reasoning over financial reports.
- TAT-QA - Tag: [dataset] - Table-and-text question-answering dataset built around hybrid reasoning over financial reports.
- Claw-style Agent Evaluation Notes - Notes on trajectory-aware grading, repeated trials, safety evidence, and how Harbor maps to this evaluation pattern.
- Harbor Repeated-trial Metric Example - A small `metric.py` example for reporting mean reward, pass@k, Pass^k, and missing-evidence rate.
- LLM Training Data Operating Model - A practical operating loop for source review, profiling, filtering, annotation, evaluation, release, and governance.
- LLM Training Data Quality Rubric - A practical checklist for reviewing public LLM training, tuning, preference, synthetic, or evaluation datasets.
- Financial-domain LLM Evaluation Checklist - A governance-oriented checklist for financial-domain LLM evaluation without private data or investment claims.
- Annotation Quality and Adjudication Guide - A practical guide for calibration, agreement, adjudication, reviewer drift, and preference-data annotation quality.
- Preference Data Quality Checklist - A review checklist for human preference data, AI feedback data, and DPO/RLHF dataset suitability.
- Financial-domain Benchmark Inclusion Criteria - Conservative criteria for including financial-domain LLM benchmarks and datasets.
- Upstream Contribution Shortlist - A conservative plan for useful future contributions to public LLM data-quality projects.
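
The repeated-trial metrics named in the Harbor example entry above can be sketched as follows. This is not the contents of that `metric.py`; it is a minimal illustration that assumes each trial is a dict with hypothetical keys `reward`, `passed`, and `has_evidence`. It uses the common combinatorial estimators: pass@k as the chance that at least one of k sampled trials passes, and Pass^k as the chance that all k sampled trials pass.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k sampled trials passes), given c passes in n trials."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Estimate P(all k sampled trials pass), given c passes in n trials."""
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")
    return comb(c, k) / comb(n, k)


def summarize_trials(trials, k):
    """Summarize repeated trials of one task; the record keys are assumptions."""
    n = len(trials)
    if n == 0:
        raise ValueError("need at least one trial")
    passes = sum(1 for t in trials if t["passed"])
    return {
        "mean_reward": sum(t["reward"] for t in trials) / n,
        "pass_at_k": pass_at_k(n, passes, k),
        "pass_hat_k": pass_hat_k(n, passes, k),
        # Share of trials that did not record the evidence needed to grade them.
        "missing_evidence_rate": sum(1 for t in trials if not t["has_evidence"]) / n,
    }
```

Reporting both values matters because pass@k rewards an agent that succeeds occasionally, while Pass^k rewards one that succeeds consistently across repeated trials.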
- Datasheets for Datasets - Tag: [paper] - Foundational paper proposing structured documentation for dataset motivation, composition, collection, and maintenance.
- Data Cards Playbook - Tag: [report] - Practical guide for documenting datasets in a consistent and responsible way.
- Croissant - Tag: [governance] - MLCommons metadata format for machine learning datasets.
- DataHub - Tag: [tool] - Open-source metadata platform for dataset discovery, lineage, and governance.
- OpenMetadata - Tag: [tool] - Open-source metadata and governance platform for data assets.
- DVC - Tag: [tool] - Data versioning and pipeline tool useful for reproducible dataset releases.
- Microsoft Presidio - Tag: [tool] - Open-source framework for detecting and anonymizing personally identifiable information.
- scrubadub - Tag: [tool] - Python library for removing personally identifiable information from free text.
- LLM Guard - Tag: [tool] - Toolkit for input/output scanning, including sensitive-data and prompt-risk checks.
- Google Differential Privacy - Tag: [tool] - Libraries and tools for building differentially private data analysis workflows.
- NIST AI Risk Management Framework - Tag: [governance] - Risk management framework relevant to AI system governance and documentation.
- OWASP Top 10 for LLM Applications - Tag: [governance] - Security and risk reference for LLM application development and deployment.
- Training Language Models to Follow Instructions with Human Feedback - Tag: [paper] - InstructGPT paper connecting supervised data, preference data, and RLHF.
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback - Tag: [paper] - Paper describing preference data and RLHF training for assistant behavior.
- Deduplicating Training Data Makes Language Models Better - Tag: [paper] - Study showing why training-data duplication matters for language model behavior.
- LIMA: Less Is More for Alignment - Tag: [paper] - Paper studying how small, carefully curated supervised datasets can affect alignment behavior.
- The BigScience ROOTS Corpus - Tag: [paper] - Documentation of the multilingual corpus used to train BLOOM, including governance and sourcing details.
- Direct Preference Optimization - Tag: [paper] - Preference optimization paper focused on training from pairwise preference data.
- Hugging Face Datasets - Tag: [tool] - Library for dataset loading, transformation, streaming, and sharing.
- DataTrove - Tag: [tool] - Large-scale text data processing framework for LLM corpus preparation.
- Data-Juicer - Tag: [tool] - Data processing and quality-analysis toolkit for LLM and multimodal data.
- Dolma Toolkit - Tag: [tool] - AI2 toolkit for building and analyzing large pretraining corpora.
- Label Studio - Tag: [platform] - General-purpose open-source annotation platform for multimodal labeling workflows.
- Argilla - Tag: [platform] - Feedback and annotation platform for LLM data workflows.
- Ragas - Tag: [tool] - RAG evaluation library for retrieval and generation metrics.
- TRL - Tag: [tool] - Preference optimization and alignment training library.
- Data Cards Playbook - Tag: [report] - Practical playbook for transparent dataset documentation.
- HELM - Tag: [benchmark] - Holistic evaluation framework and reports for language model evaluation.
- NIST AI RMF Playbook - Tag: [report] - Operational playbook for applying the NIST AI Risk Management Framework.
- FineWeb Blog - Tag: [report] - Hugging Face write-up explaining the design and filtering choices behind FineWeb.
- The Turing Way: Research Data Management - Tag: [report] - Practical guide to reproducible research data management.
Contributions are welcome if they meet the Quality Bar. Please read CONTRIBUTING.md before opening an issue or pull request.
This repository is licensed under CC BY 4.0. Linked third-party resources keep their own licenses and terms.