10 changes: 10 additions & 0 deletions docs/.vuepress/notes/en/guide.ts
@@ -128,5 +128,15 @@ export const Guide: ThemeNote = defineNoteConfig({
"web_collection"
]
},
{
text: "DataFlow Skills",
collapsed: false,
icon: 'material-symbols:auto-awesome',
prefix: 'skills',
items: [
"generating_dataflow_pipeline",
"core_text"
]
},
],
})
10 changes: 10 additions & 0 deletions docs/.vuepress/notes/zh/guide.ts
@@ -127,6 +127,16 @@ export const Guide: ThemeNote = defineNoteConfig({
"web_collection"
]
},
{
text: "DataFlow Skills",
collapsed: false,
icon: 'material-symbols:auto-awesome',
prefix: 'skills',
items: [
"generating_dataflow_pipeline",
"core_text"
]
},
// {
// text: '写作',
// icon: 'fluent-mdl2:edit-create',
63 changes: 63 additions & 0 deletions docs/en/notes/guide/skills/core_text.md
@@ -0,0 +1,63 @@
---
title: Core Text Operators
icon: material-symbols:extension
createTime: 2026/04/08 22:45:39
permalink: /en/guide/skills/core_text/
---

# Core Text Operator Reference

Extended operator reference for the [Generating DataFlow Pipeline](./generating_dataflow_pipeline.md) skill. When the 6 core primitives don't cover your task, consult the detailed per-operator documentation here.

## Available Operators

### Generate (`core_text/generate/`)

| Operator | Description |
|----------|-------------|
| `prompted-generator` | Basic single-field LLM generation |
| `format-str-prompted-generator` | Multi-field template-based generation |
| `chunked-prompted-generator` | Long document chunk-by-chunk processing |
| `embedding-generator` | Text vectorization using embedding APIs |
| `retrieval-generator` | Async RAG generation using LightRAG |
| `bench-answer-generator` | Benchmark answer generation with evaluation type variants |
| `text2multihopqa-generator` | Multi-hop QA pair construction from text |
| `random-domain-knowledge-row-generator` | Domain-specific row generation from seed data |

### Filter (`core_text/filter/`)

| Operator | Description |
|----------|-------------|
| `prompted-filter` | LLM-based quality scoring and filtering |
| `general-filter` | Rule-based deterministic filtering |
| `kcentergreedy-filter` | Diversity-based filtering using k-Center Greedy |

### Refine (`core_text/refine/`)

| Operator | Description |
|----------|-------------|
| `prompted-refiner` | LLM-based text rewriting and refinement |
| `pandas-operator` | Custom pandas DataFrame operations |

### Eval (`core_text/eval/`)

| Operator | Description |
|----------|-------------|
| `prompted-evaluator` | LLM-based scoring and evaluation |
| `bench-dataset-evaluator` | Benchmark dataset evaluation |
| `bench-dataset-evaluator-question` | Benchmark question-level evaluation |
| `text2qa-sample-evaluator` | QA sample quality evaluation |
| `unified-bench-dataset-evaluator` | Unified benchmark evaluation across formats |

## Directory Structure

Each operator folder follows the same layout:

```
<operator-name>/
├── SKILL.md # English documentation: use cases, imports, parameters, run() examples
├── SKILL_zh.md # Chinese documentation
└── examples/
├── good.md # Correct usage with a simple single-operator pipeline, sample input and output
└── bad.md # Common mistakes and anti-patterns
```
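The layout above can be verified with a short script before registering a new operator. This is a minimal sketch using only the standard library; the function name is illustrative, and `SKILL_zh.md` is treated as optional, as in the "Adding a New Operator" guide:

```python
from pathlib import Path

# Files every operator folder must contain (SKILL_zh.md is optional).
REQUIRED = ["SKILL.md", "examples/good.md", "examples/bad.md"]

def missing_skill_files(operator_dir):
    """Return the required files (per the layout above) that are absent."""
    root = Path(operator_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]
```

An empty return value means the folder satisfies the required layout.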
153 changes: 153 additions & 0 deletions docs/en/notes/guide/skills/generating_dataflow_pipeline.md
@@ -0,0 +1,153 @@
---
title: Generating DataFlow Pipeline
icon: carbon:flow
createTime: 2026/04/08 22:45:39
permalink: /en/guide/skills/generating_dataflow_pipeline/
---

# Generating DataFlow Pipeline

<video src="https://github.com/user-attachments/assets/ca1fefbf-9bf7-469f-b856-b201952fb99b" controls style="width:100%; max-width:800px;"></video>

## What It Does

A reasoning-guided pipeline planner for [Claude Code](https://docs.anthropic.com/en/docs/claude-code). Given a **target** (what the pipeline should achieve) and a **sample JSONL file** (1–5 representative rows), it analyzes the data, selects operators, validates field dependencies, and generates a complete, runnable DataFlow pipeline in Python.

## Quick Start

### 1. Add the Skill

Clone the repository and copy the skill directories into your Claude Code skills folder:

```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/generating-dataflow-pipeline .claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text .claude/skills/core_text

# Or personal-level (all your projects)
cp -r DataFlow-Skills/generating-dataflow-pipeline ~/.claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text ~/.claude/skills/core_text
```

Claude Code discovers skills from `.claude/skills/<skill-name>/SKILL.md`. The `name` field in `SKILL.md` frontmatter becomes the `/slash-command`. For more details, see the [official skills documentation](https://docs.anthropic.com/en/docs/claude-code/skills).

### 2. Prepare Your Data

Create a JSONL file (one JSON object per line) with 1–5 representative rows:

```jsonl
{"product_name": "Laptop", "category": "Electronics"}
{"product_name": "Coffee Maker", "category": "Appliances"}
```

### 3. Run the Skill

In Claude Code, invoke `/generating-dataflow-pipeline` and describe your target:

```
/generating-dataflow-pipeline
Target: Generate product descriptions and filter high-quality ones
Sample file: ./data/products.jsonl
Expected outputs: generated_description, quality_score
```

### 4. Review the Output

The skill returns a two-stage result:

1. **Intermediate Operator Decision** — JSON with operator chain, field flow, and reasoning
2. **Complete 5-Section Response**:
- Field Mapping — which fields exist vs. need to be generated
- Ordered Operator List — operators in execution order with justification
- Reasoning Summary — why this design satisfies the target
- Complete Pipeline Code — full executable Python following standard structure
- Adjustable Parameters / Caveats — tunable knobs and debugging tips

## Six Core Operators

| Operator | Purpose | LLM? |
|----------|---------|------|
| `PromptedGenerator` | Single-field LLM generation | Yes |
| `FormatStrPromptedGenerator` | Multi-field template-based generation | Yes |
| `Text2MultiHopQAGenerator` | Multi-hop QA pair construction from text | Yes |
| `PromptedFilter` | LLM-based quality scoring & filtering | Yes |
| `GeneralFilter` | Rule-based deterministic filtering | No |
| **KBC Trio** (3 operators, always used together in a fixed order) | File/URL → Markdown → chunks → clean text | Partial |

## Generated Pipeline Structure

All generated pipelines follow the same standard structure:

```python
from dataflow.operators.core_text import PromptedGenerator, PromptedFilter
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage

class MyPipeline:
def __init__(self):
self.storage = FileStorage(
first_entry_file_name="./data/input.jsonl", # User-provided path
cache_path="./cache",
file_name_prefix="step",
cache_type="jsonl"
)
self.llm_serving = APILLMServing_request(
api_url="https://api.openai.com/v1/chat/completions",
model_name="gpt-4o",
max_workers=10
)
# Operator instances ...

def forward(self):
# Sequential operator.run() calls, each with storage.step()
...

if __name__ == "__main__":
pipeline = MyPipeline()
pipeline.forward()
```

Key rules:

- `first_entry_file_name` is set to the exact user-provided JSONL path
- Each `operator.run()` call uses `storage=self.storage.step()` for checkpointing
- Fields propagate forward: a field must exist in the sample or be output by a prior step before it can be consumed
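The field-propagation rule can be checked mechanically before anything runs. A minimal sketch in plain Python, independent of the DataFlow library (the operator names and field lists below are illustrative, borrowed from the Quick Start example):

```python
def check_field_flow(sample_fields, steps):
    """Verify each step consumes only fields present in the sample or
    produced by an earlier step. `steps` is a list of
    (name, input_fields, output_fields) tuples in execution order."""
    available = set(sample_fields)
    for name, inputs, outputs in steps:
        missing = set(inputs) - available
        if missing:
            raise ValueError(f"{name}: missing input fields {sorted(missing)}")
        available |= set(outputs)
    return available

# Hypothetical two-step plan over the product sample from the Quick Start.
plan = [
    ("PromptedGenerator", ["product_name"], ["generated_description"]),
    ("PromptedFilter", ["generated_description"], ["quality_score"]),
]
final_fields = check_field_flow(["product_name", "category"], plan)
```

If a step references a field that no earlier step produces, the check fails with the offending operator's name, which mirrors the validation the skill performs during planning.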

## Extended Operators

Beyond the 6 core primitives, DataFlow provides additional operators for generation, filtering, refinement, and evaluation. See the [Core Text Operator Reference](./core_text.md) for the full list.

## Adding a New Operator

Prerequisite: the new operator's skill definition already exists (with `SKILL.md`, `examples/good.md`, `examples/bad.md`, etc.).

### As an Extended Operator

Two steps are required:

**Step 1.** Create an operator directory with its skill definition under any appropriate location (e.g., `core_text/<category>/`, or a separate skill package):

```
<skill-directory>/<your-operator-name>/
├── SKILL.md # API reference (constructor, run() signature, execution logic, constraints)
├── SKILL_zh.md # Chinese translation (optional)
└── examples/
├── good.md # Best-practice example
└── bad.md # Common mistakes
```

**Step 2.** Register the operator in `SKILL.md`'s **Extended Operator Reference** section. Add a row to the corresponding category table (Generate / Filter / Refine / Eval) with the operator name, subdirectory path, and description. Without this entry, the pipeline generator will not know the operator exists.

### Promoting to a Core Primitive (Optional)

If the operator is used frequently enough to warrant priority selection, promote it by modifying `SKILL.md`:

1. Add to the **Preferred Operator Strategy** core primitives list
2. Add a decision table row in **Operator Selection Priority Rule** (when to use / when not to use)
3. Add full constructor and `run()` signatures in **Operator Parameter Signature Rule**
4. Add the import path in **Correct Import Paths**
5. Add input pattern matching in **Input File Content Analysis Rule** if it handles a new data type
6. Update or remove the entry from the **Extended Operator Reference** table to avoid duplication
7. Add a complete example in `examples/` (recommended)
63 changes: 63 additions & 0 deletions docs/zh/notes/guide/skills/core_text.md
@@ -0,0 +1,63 @@
---
title: core_text
icon: material-symbols:extension
createTime: 2026/04/08 22:45:39
permalink: /zh/guide/de8oculw/
---

# Core Text 扩展算子参考

[DataFlow Pipeline生成](./generating_dataflow_pipeline.md) 的扩展算子参考库。当 6 个核心算子不能满足需求时,可查阅这里的逐算子详细文档。

## 可用算子

### Generate (`core_text/generate/`)

| 算子 | 说明 |
|------|------|
| `prompted-generator` | 最基础的单字段 LLM 生成 |
| `format-str-prompted-generator` | 多字段模板式生成 |
| `chunked-prompted-generator` | 长文本分块逐段处理 |
| `embedding-generator` | 调用 Embedding API 生成文本向量 |
| `retrieval-generator` | 基于 LightRAG 的异步 RAG 生成 |
| `bench-answer-generator` | Benchmark 答案生成,支持多种评估类型 |
| `text2multihopqa-generator` | 从文本构建多跳问答对 |
| `random-domain-knowledge-row-generator` | 基于种子数据的领域知识行生成 |

### Filter (`core_text/filter/`)

| 算子 | 说明 |
|------|------|
| `prompted-filter` | 基于 LLM 的质量评分与过滤 |
| `general-filter` | 基于规则的确定性过滤 |
| `kcentergreedy-filter` | 基于 k-Center Greedy 的多样性过滤 |

### Refine (`core_text/refine/`)

| 算子 | 说明 |
|------|------|
| `prompted-refiner` | 基于 LLM 的文本改写与精炼 |
| `pandas-operator` | 自定义 pandas DataFrame 操作 |

### Eval (`core_text/eval/`)

| 算子 | 说明 |
|------|------|
| `prompted-evaluator` | 基于 LLM 的打分评估 |
| `bench-dataset-evaluator` | Benchmark 数据集评估 |
| `bench-dataset-evaluator-question` | Benchmark 问题级评估 |
| `text2qa-sample-evaluator` | 问答样本质量评估 |
| `unified-bench-dataset-evaluator` | 跨格式统一 Benchmark 评估 |

## 目录结构

每个算子文件夹遵循统一布局:

```
<算子名称>/
├── SKILL.md # 英文文档:使用场景、导入方式、参数说明、run() 示例
├── SKILL_zh.md # 中文文档
└── examples/
├── good.md # 正确用法示例,含单一算子组成的简单 Pipeline、样例输入及输出
└── bad.md # 常见错误与反模式
```