10 changes: 10 additions & 0 deletions docs/.vuepress/notes/en/guide.ts
@@ -128,5 +128,15 @@ export const Guide: ThemeNote = defineNoteConfig({
"web_collection"
]
},
{
text: "DataFlow Skills",
collapsed: false,
icon: 'material-symbols:auto-awesome',
prefix: 'skills',
items: [
"generating_dataflow_pipeline",
"core_text"
]
},
],
})
10 changes: 10 additions & 0 deletions docs/.vuepress/notes/zh/guide.ts
@@ -127,6 +127,16 @@ export const Guide: ThemeNote = defineNoteConfig({
"web_collection"
]
},
{
text: "DataFlow Skills",
collapsed: false,
icon: 'material-symbols:auto-awesome',
prefix: 'skills',
items: [
"generating_dataflow_pipeline",
"core_text"
]
},
// {
// text: '写作',
// icon: 'fluent-mdl2:edit-create',
63 changes: 63 additions & 0 deletions docs/en/notes/guide/skills/core_text.md
@@ -0,0 +1,63 @@
---
title: Core Text Operators
icon: material-symbols:extension
createTime: 2026/04/08 22:45:39
permalink: /en/guide/skills/core_text/
---

# Core Text Operator Reference

Extended operator reference for the [Generating DataFlow Pipeline](./generating_dataflow_pipeline.md) skill. When the 6 core primitives don't cover your task, consult the detailed per-operator documentation here.

## Available Operators

### Generate (`core_text/generate/`)

| Operator | Description |
|----------|-------------|
| `prompted-generator` | Basic single-field LLM generation |
| `format-str-prompted-generator` | Multi-field template-based generation |
| `chunked-prompted-generator` | Long document chunk-by-chunk processing |
| `embedding-generator` | Text vectorization using embedding APIs |
| `retrieval-generator` | Async RAG generation using LightRAG |
| `bench-answer-generator` | Benchmark answer generation with evaluation type variants |
| `text2multihopqa-generator` | Multi-hop QA pair construction from text |
| `random-domain-knowledge-row-generator` | Domain-specific row generation from seed data |

### Filter (`core_text/filter/`)

| Operator | Description |
|----------|-------------|
| `prompted-filter` | LLM-based quality scoring and filtering |
| `general-filter` | Rule-based deterministic filtering |
| `kcentergreedy-filter` | Diversity-based filtering using k-Center Greedy |

### Refine (`core_text/refine/`)

| Operator | Description |
|----------|-------------|
| `prompted-refiner` | LLM-based text rewriting and refinement |
| `pandas-operator` | Custom pandas DataFrame operations |

### Eval (`core_text/eval/`)

| Operator | Description |
|----------|-------------|
| `prompted-evaluator` | LLM-based scoring and evaluation |
| `bench-dataset-evaluator` | Benchmark dataset evaluation |
| `bench-dataset-evaluator-question` | Benchmark question-level evaluation |
| `text2qa-sample-evaluator` | QA sample quality evaluation |
| `unified-bench-dataset-evaluator` | Unified benchmark evaluation across formats |

## Directory Structure

Each operator folder follows the same layout:

```
<operator-name>/
├── SKILL.md # English documentation: use cases, imports, parameters, run() examples
├── SKILL_zh.md # Chinese documentation
└── examples/
├── good.md # Correct usage with a simple single-operator pipeline, sample input and output
└── bad.md # Common mistakes and anti-patterns
```
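The layout above can be verified with a short script before registering a new operator. This is a minimal sketch using only the standard library; the function name is illustrative, and `SKILL_zh.md` is treated as optional, as in the "Adding a New Operator" guide:

```python
from pathlib import Path

# Files every operator folder must contain (SKILL_zh.md is optional).
REQUIRED = ["SKILL.md", "examples/good.md", "examples/bad.md"]

def missing_skill_files(operator_dir):
    """Return the required files (per the layout above) that are absent."""
    root = Path(operator_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty return value means the folder satisfies the required layout.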
153 changes: 153 additions & 0 deletions docs/en/notes/guide/skills/generating_dataflow_pipeline.md
@@ -0,0 +1,153 @@
---
title: Generating DataFlow Pipeline
icon: carbon:flow
createTime: 2026/04/08 22:45:39
permalink: /en/guide/skills/generating_dataflow_pipeline/
---

# Generating DataFlow Pipeline

<video src="https://github.com/user-attachments/assets/ca1fefbf-9bf7-469f-b856-b201952fb99b" controls style="width:100%; max-width:800px;"></video>

## What It Does

A reasoning-guided pipeline planner for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Given a **target** (what the pipeline should achieve) and a **sample JSONL file** (1–5 representative rows), it analyzes the data, selects operators, validates field dependencies, and generates a complete, runnable DataFlow pipeline in Python.

## Quick Start

### 1. Add the Skill

Clone the repository and copy the skill directories into your Claude Code skills folder:

```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/generating-dataflow-pipeline .claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text .claude/skills/core_text

# Or personal-level (all your projects)
cp -r DataFlow-Skills/generating-dataflow-pipeline ~/.claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text ~/.claude/skills/core_text
```

Claude Code discovers skills from `.claude/skills/<skill-name>/SKILL.md`. The `name` field in `SKILL.md` frontmatter becomes the `/slash-command`. For more details, see the [official skills documentation](https://docs.anthropic.com/en/docs/claude-code/skills).

### 2. Prepare Your Data

Create a JSONL file (one JSON object per line) with 1–5 representative rows:

```jsonl
{"product_name": "Laptop", "category": "Electronics"}
{"product_name": "Coffee Maker", "category": "Appliances"}
```

### 3. Run the Skill

In Claude Code, invoke `/generating-dataflow-pipeline` and describe your target:

```
/generating-dataflow-pipeline
Target: Generate product descriptions and filter high-quality ones
Sample file: ./data/products.jsonl
Expected outputs: generated_description, quality_score
```

### 4. Review the Output

The skill returns a two-stage result:

1. **Intermediate Operator Decision** — JSON with operator chain, field flow, and reasoning
2. **Complete 5-Section Response**:
- Field Mapping — which fields exist vs. need to be generated
- Ordered Operator List — operators in execution order with justification
- Reasoning Summary — why this design satisfies the target
- Complete Pipeline Code — full executable Python following standard structure
- Adjustable Parameters / Caveats — tunable knobs and debugging tips

## Six Core Operators

| Operator | Purpose | LLM? |
|----------|---------|------|
| `PromptedGenerator` | Single-field LLM generation | Yes |
| `FormatStrPromptedGenerator` | Multi-field template-based generation | Yes |
| `Text2MultiHopQAGenerator` | Multi-hop QA pair construction from text | Yes |
| `PromptedFilter` | LLM-based quality scoring & filtering | Yes |
| `GeneralFilter` | Rule-based deterministic filtering | No |
| **KBC Trio** (3 operators, always used together in a fixed order) | File/URL → Markdown → chunks → clean text | Partial |

## Generated Pipeline Structure

All generated pipelines follow the same standard structure:

```python
from dataflow.operators.core_text import PromptedGenerator, PromptedFilter
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage

class MyPipeline:
def __init__(self):
self.storage = FileStorage(
first_entry_file_name="./data/input.jsonl", # User-provided path
cache_path="./cache",
file_name_prefix="step",
cache_type="jsonl"
)
self.llm_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/chat/completions",
model_name="gpt-4o",
max_workers=10
)
# Operator instances ...

def forward(self):
# Sequential operator.run() calls, each with storage.step()
...

if __name__ == "__main__":
pipeline = MyPipeline()
pipeline.forward()
```

Key rules:

- `first_entry_file_name` is set to the exact user-provided JSONL path
- Each `operator.run()` call uses `storage=self.storage.step()` for checkpointing
- Fields propagate forward: a field must exist in the sample or be output by a prior step before it can be consumed
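The field-propagation rule can be checked mechanically before anything runs. A minimal sketch in plain Python, independent of the DataFlow library (the operator names and field lists below are illustrative, borrowed from the Quick Start example):

```python
def check_field_flow(sample_fields, steps):
    """Verify each step consumes only fields present in the sample or
    produced by an earlier step. `steps` is a list of
    (name, input_fields, output_fields) tuples in execution order."""
    available = set(sample_fields)
    for name, inputs, outputs in steps:
        missing = set(inputs) - available
        if missing:
            raise ValueError(f"{name}: missing input fields {sorted(missing)}")
        available |= set(outputs)
    return available

# Hypothetical two-step plan over the product sample from the Quick Start.
plan = [
    ("PromptedGenerator", ["product_name"], ["generated_description"]),
    ("PromptedFilter", ["generated_description"], ["quality_score"]),
]
final_fields = check_field_flow(["product_name", "category"], plan)
```

If a step references a field that no earlier step produces, the check fails with the offending operator's name, which mirrors the validation the skill performs during planning.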

## Extended Operators

Beyond the 6 core primitives, DataFlow provides additional operators for generation, filtering, refinement, and evaluation. See the [Core Text Operator Reference](./core_text.md) for the full list.

## Adding a New Operator

Prerequisite: the new operator's skill definition already exists (with `SKILL.md`, `examples/good.md`, `examples/bad.md`, etc.).

### As an Extended Operator

Two steps are required:

**Step 1.** Create an operator directory with its skill definition under any appropriate location (e.g., `core_text/<category>/`, or a separate skill package):

```
<skill-directory>/<your-operator-name>/
├── SKILL.md # API reference (constructor, run() signature, execution logic, constraints)
├── SKILL_zh.md # Chinese translation (optional)
└── examples/
├── good.md # Best-practice example
└── bad.md # Common mistakes
```

**Step 2.** Register the operator in `SKILL.md`'s **Extended Operator Reference** section. Add a row to the corresponding category table (Generate / Filter / Refine / Eval) with the operator name, subdirectory path, and description. Without this entry, the pipeline generator will not know the operator exists.

### Promoting to a Core Primitive (Optional)

If the operator is used frequently enough to warrant priority selection, promote it by modifying `SKILL.md`:

1. Add to the **Preferred Operator Strategy** core primitives list
2. Add a decision table row in **Operator Selection Priority Rule** (when to use / when not to use)
3. Add full constructor and `run()` signatures in **Operator Parameter Signature Rule**
4. Add the import path in **Correct Import Paths**
5. Add input pattern matching in **Input File Content Analysis Rule** if it handles a new data type
6. Update or remove the entry from the **Extended Operator Reference** table to avoid duplication
7. Add a complete example in `examples/` (recommended)
63 changes: 63 additions & 0 deletions docs/zh/notes/guide/skills/core_text.md
@@ -0,0 +1,63 @@
---
title: core_text
icon: material-symbols:extension
createTime: 2026/04/08 22:45:39
permalink: /zh/guide/de8oculw/
---

# Core Text 扩展算子参考

[DataFlow Pipeline生成](./generating_dataflow_pipeline.md) 的扩展算子参考库。当 6 个核心算子不能满足需求时,可查阅这里的逐算子详细文档。

## 可用算子

### Generate (`core_text/generate/`)

| 算子 | 说明 |
|------|------|
| `prompted-generator` | 最基础的单字段 LLM 生成 |
| `format-str-prompted-generator` | 多字段模板式生成 |
| `chunked-prompted-generator` | 长文本分块逐段处理 |
| `embedding-generator` | 调用 Embedding API 生成文本向量 |
| `retrieval-generator` | 基于 LightRAG 的异步 RAG 生成 |
| `bench-answer-generator` | Benchmark 答案生成,支持多种评估类型 |
| `text2multihopqa-generator` | 从文本构建多跳问答对 |
| `random-domain-knowledge-row-generator` | 基于种子数据的领域知识行生成 |

### Filter (`core_text/filter/`)

| 算子 | 说明 |
|------|------|
| `prompted-filter` | 基于 LLM 的质量评分与过滤 |
| `general-filter` | 基于规则的确定性过滤 |
| `kcentergreedy-filter` | 基于 k-Center Greedy 的多样性过滤 |

### Refine (`core_text/refine/`)

| 算子 | 说明 |
|------|------|
| `prompted-refiner` | 基于 LLM 的文本改写与精炼 |
| `pandas-operator` | 自定义 pandas DataFrame 操作 |

### Eval (`core_text/eval/`)

| 算子 | 说明 |
|------|------|
| `prompted-evaluator` | 基于 LLM 的打分评估 |
| `bench-dataset-evaluator` | Benchmark 数据集评估 |
| `bench-dataset-evaluator-question` | Benchmark 问题级评估 |
| `text2qa-sample-evaluator` | 问答样本质量评估 |
| `unified-bench-dataset-evaluator` | 跨格式统一 Benchmark 评估 |

## 目录结构

每个算子文件夹遵循统一布局:

```
<算子名称>/
├── SKILL.md # 英文文档:使用场景、导入方式、参数说明、run() 示例
├── SKILL_zh.md # 中文文档
└── examples/
├── good.md # 正确用法示例,含单一算子组成的简单 Pipeline、样例输入及输出
└── bad.md # 常见错误与反模式
```