From cea55c4d679bee3e527546cb54a497798e201492 Mon Sep 17 00:00:00 2001
From: Jaerong Ahn
Date: Fri, 10 Apr 2026 17:34:35 -0500
Subject: [PATCH] feat: add MedTagger rule authoring guidance

---
 docs/ruleset-authoring-guide.md | 275 ++++++++++++++++++++++++++++++++
 skills/medtagger.md             | 275 ++++++++++++++++++++++++++++++++
 2 files changed, 550 insertions(+)
 create mode 100644 docs/ruleset-authoring-guide.md
 create mode 100644 skills/medtagger.md

diff --git a/docs/ruleset-authoring-guide.md b/docs/ruleset-authoring-guide.md
new file mode 100644
index 0000000..80db393
--- /dev/null
+++ b/docs/ruleset-authoring-guide.md
@@ -0,0 +1,275 @@
+# MedTagger Rule Authoring Guide
+
+## Overview
+
+MedTagger is a biomedical NLP pipeline from the Mayo Clinic OHNLP program. It extracts concepts from clinical text using dictionary-based indexing, rule-based pattern matching, and context analysis (negation, historical, experiencer).
+
+This guide covers how to create, edit, and optimize regex patterns, dictionaries, normalization mappings, match rules, and context rules for MedTagger extraction tasks.
+
+## When to Use This Guide
+
+Use this guide when:
+- Creating a new extraction domain (e.g., social determinants, medications, procedures)
+- Editing existing match rules, regexp files, or normalization mappings
+- Optimizing slow regex patterns that cause performance issues
+- Debugging extraction accuracy (e.g., false positives, missing concepts)
+- Adding context rules for negation, historical, or experiencer attributes
+- Reviewing rules for correctness before deployment
+
+## File Format Conventions
+
+MedTagger rules are organized into four file types that work together. Each serves a distinct purpose.
+
+### Regexp Files (Dictionary Patterns)
+
+Regexp files contain one **single regex per line**. These are variant spellings, synonyms, and related terms for a concept.
+
+```
+// Good - one pattern per line
+fever
+high temperature
+febrile
+elevated temperature
+
+// Bad - multiple patterns on one line
+fever|high temperature|febrile
+```
+
+Lines starting with `//` are comments. All lines are joined with `|` into one alternation during loading.
+
+**Key principle:** Do NOT put `\b` (word boundaries) inside regexp entries — word boundaries belong in the rule file only.
+
+### Normalization Files (Code Mapping)
+
+Normalization files map surface forms to output codes. Format is tab-separated key-value pairs.
+
+```
+physical activity	Physical_Activity
+sedentary lifestyle	Physical_Inactivity
+exercise	Physical_Activity
+```
+
+The LLM generates normalized codes — typically OMOP concept IDs or domain-specific codes like `POSITIVE`/`NEGATIVE`.
+
+### Match Rules (Extraction Logic)
+
+Match rules define how to extract and output concepts. Format:
+
+```
+RULENAME="...",REGEXP="...",LOCATION="...",NORM="..."
+```
+
+**RULENAME** — Prefix `cm_` produces `ConceptMention` annotations; others produce generic `Match` annotations.
+
+**REGEXP** — The extraction pattern. Use `%reKEY` to reference a regexp file. Spaces become `[\s]+` in the compiled pattern.
+
+**LOCATION** — Constraint on where the match can occur:
+- `"NA"` — no constraint (most common)
+- `"UC"` — uppercase text only
+- `"SEC:segmentID~"` — specific section only
+
+**NORM** — The output value:
+- Literal string: `NORM="434173"`
+- Captured group: `NORM="group(1)"`
+- Normalized: `NORM="%normKEY(group(1))"`
+- Case transforms: `%LC%` (lowercase), `%UC%` (uppercase)
+- Exclusion: `REMOVE` (exclude subsumed matches)
+
+**Example rules:**
+```
+RULENAME="cm_fever",REGEXP="(?i)\b%reFEVER\b",LOCATION="NA",NORM="434173"
+RULENAME="cm_physicalactivity",REGEXP="(?i)\b%rephysicalactivity\b",LOCATION="NA",NORM="%normphysicalactivity(group(1))"
+```
+
+### Context Rules (Negation and Assertion)
+
+Context rules handle negation, historical status, and experiencer (who the statement is about). Format:
+
+```
+phrase~|~position~|~type~|~priority
+```
+
+**phrase** — Literal lowercase text, OR `regex:` for raw regex
+
+**position** — Where the trigger sits relative to the concept:
+- `pre` — trigger is to the left, affects text to the right
+- `post` — trigger is to the right, affects text to the left
+- `termin` — stops context propagation
+- `pseudo` — exclusion zone (concept inside is not affected)
+
+**type** — The context type:
+- `neg` — negated
+- `poss` — possible
+- `hypo` — hypothetical
+- `hist` — historical
+- `exp` — experiencer (first person)
+- `histexp` — historical experiencer
+- `hypoexp` — hypothetical experiencer
+- `pos` — positive/affirmed
+
+**priority** — Integer (1 = lower, 2 = higher). Higher priority overwrites lower.
+
+**Examples:**
+```
+no evidence of~|~pre~|~neg~|~1
+history~|~pre~|~hist~|~1
+family history~|~pre~|~hist~|~2
+regex:\bdenies?\b~|~pre~|~neg~|~2
+```
+
+**Important:** Context rule phrases (non-`regex:` lines) only match **whole words separated by whitespace**. `history and` will NOT match inside `social history and`. Use the `regex:` prefix for substring matching.
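The four-field context-rule format described above can be checked mechanically before deployment. The following is a minimal Python sketch, not MedTagger's own (Java) loader: `ContextRule` and `parse_context_rule` are hypothetical names introduced here, and the allowed positions and types are simply the ones listed in this guide.

```python
from dataclasses import dataclass

# Allowed values, copied from the lists in this guide (not read from MedTagger).
POSITIONS = {"pre", "post", "termin", "pseudo"}
TYPES = {"neg", "poss", "hypo", "hist", "exp", "histexp", "hypoexp", "pos"}


@dataclass(frozen=True)
class ContextRule:
    phrase: str      # literal phrase, or the raw regex when is_regex is True
    is_regex: bool   # True when the line used the "regex:" prefix
    position: str
    ctx_type: str
    priority: int


def parse_context_rule(line: str) -> ContextRule:
    """Split one 'phrase~|~position~|~type~|~priority' line and validate it."""
    parts = line.strip().split("~|~")
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    phrase, position, ctx_type, priority = parts
    if position not in POSITIONS:
        raise ValueError(f"unknown position {position!r}")
    if ctx_type not in TYPES:
        raise ValueError(f"unknown type {ctx_type!r}")
    is_regex = phrase.startswith("regex:")
    if is_regex:
        phrase = phrase[len("regex:"):]
    return ContextRule(phrase, is_regex, position, ctx_type, int(priority))


for example in ("family history~|~pre~|~hist~|~2",
                r"regex:\bdenies?\b~|~pre~|~neg~|~2"):
    print(parse_context_rule(example))
```

A check of this kind reports a misspelled position or type at authoring time instead of leaving it to surface when the pipeline loads the file.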
+
+## Regex Performance Best Practices
+
+Performance issues in MedTagger stem from regex backtracking. Slow patterns compound because the matcher iterates every compiled pattern against every sentence.
+
+### Severity Ratings
+
+| Severity | Pattern Type | Impact |
+|----------|-------------|--------|
+| RED | Variable-width lookbehind `(?<=.{0,N})` | ~3x slower in Java |
+| RED | Nested quantifiers `(a+)+b` | Exponential on no-match strings |
+| RED | Multi-variable lookbehind | Exponential in Java |
+| YELLOW | Greedy `.*` in prefix.*suffix | 11-12x slower than lazy |
+| YELLOW | 100+ term alternation | ~10x slower than 5-term |
+| YELLOW | Bridge patterns `.{0,50}` | 1.4-1.5x overhead |
+| GREEN | Anchors `^`, `$`, `\b` | Baseline fast |
+| GREEN | Bounded `{0,N}` with N <= 8 | Fast |
+| NEUTRAL | Possessive quantifiers | No measurable benefit |
+
+### AVOID: Variable-Width Lookbehinds
+
+This is the **#1 performance killer**:
+
+```
+// SLOW: (?<=.{0,10}pain)
+(?<=.{0,10}pain)
+
+// FIX: use fixed-width or restructure as forward match
+(?<=pain)
+```
+
+Variable-width lookbehinds exhaust the regex engine trying all substring widths. Replace with fixed-width alternatives or restructure as forward-matching patterns.
+
+### AVOID: Nested Quantifiers
+
+`(\s+\S+){0,N}` with high N and nested alternation causes catastrophic backtracking:
+
+```
+// SLOW: nested quantifiers with large bounds
+(problem|unable|difficulty) (\s+\S+){0,8}(speaking|responding|following)
+
+// BETTER: reduce bounds to {0,3}
+(problem|unable|difficulty) (\s+\S+){0,3}(speaking|responding)
+```
+
+### CAUTION: Greedy vs Lazy `.*`
+
+```
+// SLOW: greedy scans to end, then backtracks
+Patient(?:.*)appetite
+
+// FAST: lazy finds first match directly
+Patient(?:.*?)appetite
+```
+
+### CAUTION: Large Alternations
+
+A 100-term alternation is ~10x slower than a 5-term one. Split for readability, but note that Java optimizes a single large alternation better than multiple separate operations.
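One caveat to the greedy-versus-lazy caution above: the two forms are not always interchangeable, because they can match different spans when the suffix occurs more than once in a sentence. The sketch below uses Python's `re` module purely for illustration (MedTagger itself runs on Java's regex engine, where greedy and lazy quantifiers follow the same matching semantics); the sample sentence is made up.

```python
import re

# Made-up sentence in which the suffix "appetite" occurs twice.
text = "patient reports poor appetite today; appetite improved by evening"

greedy = re.search(r"patient(?:.*)appetite", text)
lazy = re.search(r"patient(?:.*?)appetite", text)

greedy_span = greedy.group(0)  # runs to the LAST "appetite"
lazy_span = lazy.group(0)      # stops at the FIRST "appetite"

print(greedy_span)  # patient reports poor appetite today; appetite
print(lazy_span)    # patient reports poor appetite
```

When switching a rule from `.*` to `.*?` for speed, it is worth confirming that the shorter span is the one the rule is meant to extract.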
+
+### Use Bounded Repetition
+
+`{0,5}` is a common sweet spot. `{0,3}` is safer. Avoid anything above `{0,6}` without benchmarking.
+
+## Common Mistakes
+
+### Word Boundary Placement
+
+Do NOT put `\b` inside regexp dictionary entries. The `\b` belongs in the rule file around the `%reKEY` placeholder.
+
+```
+// Rule: \b%reKEY\b — word boundary in the rule
+RULENAME="cm_fever",REGEXP="(?i)\b%reFEVER\b",LOCATION="NA",NORM="434173"
+
+// Dictionary: no \b needed inside
+fever
+febrile
+```
+
+### Case Sensitivity
+
+All matching is case-insensitive. The sentence text is lowercased before matching. Do not write case-sensitive regex patterns — they will never match.
+
+### Trailing Pipes in Regexp Files
+
+Lines ending with `|` create an empty alternation branch. Remove blank lines and trailing `|`.
+
+```
+// Bad: trailing pipe creates empty match
+fever|
+high temperature|
+
+// Good: no trailing pipe
+fever
+high temperature
+```
+
+### Hyphenation
+
+The default tokenizer does NOT convert hyphens to spaces. `breast-cancer` will not match `breast cancer`. Include variants in your regexp file if needed.
+
+### Context Checks Only the Start of a Concept
+
+Context status is applied based only on the **first character** of a concept mention. A long concept spanning from an affirmed zone into a negated zone will be labeled based on the start position only.
+
+### Resource Manifest
+
+If your rule references `%reFOO` or `%normFOO`, the corresponding file must be listed in your resource manifest. Missing entries cause a fatal startup error.
+
+## Principles for Generating Good Rules
+
+When authoring MedTagger rules, follow these principles:
+
+1. **Split long alternations for readability** — max ~10-15 terms per line
+2. **Use `{0,3}` or `{0,4}` bounds** on `(\s+\S+)` repetition, never `{0,8}` or higher
+3. **Omit `\b` inside regexp dictionary entries** — it belongs in the rule only
+4. **Omit inline `(?i)` inside regexp dictionary entries** — matching is always case-insensitive
+5. **Comment clearly** what each regexp file is for using `//` lines
+6. **Use `REMOVE` norm** for exclusion/boilerplate patterns
+7. **Use priority 2 context rules** for specific overrides
+8. **Prefer simple dictionary entries** over complex multi-clause regex
+9. **Never use variable-width lookbehinds** — restructure as forward-matching patterns
+10. **Use lazy `.*?` instead of greedy `.*`** in patterns with both prefix and suffix
+11. **Do not use possessive quantifiers** or atomic groups for performance
+12. **Use context rules for negation/assertion** — not complex regex in dictionary entries
+13. **Prefer literal phrases** over `regex:` in context rules — literal uses fast Aho-Corasick trie matching
+
+## Performance Benchmark Reference
+
+Based on testing with Java's regex engine on texts of varying lengths:
+
+| Pattern Type | 100 chars | 1KB | 10KB | 100KB |
+|-------------|-----------|-----|------|-------|
+| Fixed-width lookbehind | 0.1ms | 0.3ms | 3ms | 30ms |
+| Variable-width lookbehind | 0.2ms | 1ms | 10ms | 100ms |
+| Lazy prefix.*suffix | 0.1ms | 0.5ms | 5ms | 50ms |
+| Greedy prefix.*suffix | 0.5ms | 5ms | 50ms | 500ms |
+| Bounded `{0,3}` | 0.1ms | 0.2ms | 2ms | 20ms |
+| Bounded `{0,8}` | 0.2ms | 0.5ms | 5ms | 50ms |
+
+**Note:** Java's regex engine handles many catastrophic patterns better than PCRE/Python, but variable-width lookbehinds and nested quantifiers can still cause significant slowdowns.
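The figures in the table above come from Java's regex engine. To compare candidate patterns on your own text, a rough timing helper along the following lines may be useful. It is a sketch in Python using the standard `re` and `timeit` modules, so absolute numbers will not match Java's; `time_pattern` and the sample text are hypothetical.

```python
import re
import timeit


def time_pattern(pattern: str, text: str, repeats: int = 5, number: int = 20) -> float:
    """Return the best observed time, in seconds, for one search over `text`."""
    compiled = re.compile(pattern)
    runs = timeit.repeat(lambda: compiled.search(text), repeat=repeats, number=number)
    # Taking the minimum reduces noise from other processes on the machine.
    return min(runs) / number


# Made-up sample: one long lowercase "sentence" with the suffix at the end.
sample = "patient " + "reports mild nausea " * 200 + "and poor appetite"

for candidate in (r"patient.*appetite", r"patient.*?appetite"):
    seconds = time_pattern(candidate, sample)
    print(f"{candidate!r}: {seconds * 1e6:.1f} microseconds per search")
```

Only relative differences between patterns on the same text are meaningful here; any decision about a rule should be confirmed against the Java pipeline itself.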
+
+## Quick Reference
+
+**Regexp file:** One pattern per line, no `\b`, joined with `|`
+
+**Normalization file:** Tab-separated `surface form[TAB]code`
+
+**Match rule:** `RULENAME="...",REGEXP="...",LOCATION="...",NORM="..."`
+
+**Context rule:** `phrase~|~position~|~type~|~priority`
+
+**Performance priority:**
+- RED: Variable-width lookbehind, nested quantifiers
+- YELLOW: Greedy `.*`, large alternations, bridge patterns
+- GREEN: Anchors, bounded `{0,N}` with N <= 8
diff --git a/skills/medtagger.md b/skills/medtagger.md
new file mode 100644
index 0000000..80db393
--- /dev/null
+++ b/skills/medtagger.md
@@ -0,0 +1,275 @@
+# MedTagger Rule Authoring Guide
+
+## Overview
+
+MedTagger is a biomedical NLP pipeline from the Mayo Clinic OHNLP program. It extracts concepts from clinical text using dictionary-based indexing, rule-based pattern matching, and context analysis (negation, historical, experiencer).
+
+This guide covers how to create, edit, and optimize regex patterns, dictionaries, normalization mappings, match rules, and context rules for MedTagger extraction tasks.
+
+## When to Use This Guide
+
+Use this guide when:
+- Creating a new extraction domain (e.g., social determinants, medications, procedures)
+- Editing existing match rules, regexp files, or normalization mappings
+- Optimizing slow regex patterns that cause performance issues
+- Debugging extraction accuracy (e.g., false positives, missing concepts)
+- Adding context rules for negation, historical, or experiencer attributes
+- Reviewing rules for correctness before deployment
+
+## File Format Conventions
+
+MedTagger rules are organized into four file types that work together. Each serves a distinct purpose.
+
+### Regexp Files (Dictionary Patterns)
+
+Regexp files contain one **single regex per line**. These are variant spellings, synonyms, and related terms for a concept.
+
+```
+// Good - one pattern per line
+fever
+high temperature
+febrile
+elevated temperature
+
+// Bad - multiple patterns on one line
+fever|high temperature|febrile
+```
+
+Lines starting with `//` are comments. All lines are joined with `|` into one alternation during loading.
+
+**Key principle:** Do NOT put `\b` (word boundaries) inside regexp entries — word boundaries belong in the rule file only.
+
+### Normalization Files (Code Mapping)
+
+Normalization files map surface forms to output codes. Format is tab-separated key-value pairs.
+
+```
+physical activity	Physical_Activity
+sedentary lifestyle	Physical_Inactivity
+exercise	Physical_Activity
+```
+
+The LLM generates normalized codes — typically OMOP concept IDs or domain-specific codes like `POSITIVE`/`NEGATIVE`.
+
+### Match Rules (Extraction Logic)
+
+Match rules define how to extract and output concepts. Format:
+
+```
+RULENAME="...",REGEXP="...",LOCATION="...",NORM="..."
+```
+
+**RULENAME** — Prefix `cm_` produces `ConceptMention` annotations; others produce generic `Match` annotations.
+
+**REGEXP** — The extraction pattern. Use `%reKEY` to reference a regexp file. Spaces become `[\s]+` in the compiled pattern.
+
+**LOCATION** — Constraint on where the match can occur:
+- `"NA"` — no constraint (most common)
+- `"UC"` — uppercase text only
+- `"SEC:segmentID~"` — specific section only
+
+**NORM** — The output value:
+- Literal string: `NORM="434173"`
+- Captured group: `NORM="group(1)"`
+- Normalized: `NORM="%normKEY(group(1))"`
+- Case transforms: `%LC%` (lowercase), `%UC%` (uppercase)
+- Exclusion: `REMOVE` (exclude subsumed matches)
+
+**Example rules:**
+```
+RULENAME="cm_fever",REGEXP="(?i)\b%reFEVER\b",LOCATION="NA",NORM="434173"
+RULENAME="cm_physicalactivity",REGEXP="(?i)\b%rephysicalactivity\b",LOCATION="NA",NORM="%normphysicalactivity(group(1))"
+```
+
+### Context Rules (Negation and Assertion)
+
+Context rules handle negation, historical status, and experiencer (who the statement is about). Format:
+
+```
+phrase~|~position~|~type~|~priority
+```
+
+**phrase** — Literal lowercase text, OR `regex:` for raw regex
+
+**position** — Where the trigger sits relative to the concept:
+- `pre` — trigger is to the left, affects text to the right
+- `post` — trigger is to the right, affects text to the left
+- `termin` — stops context propagation
+- `pseudo` — exclusion zone (concept inside is not affected)
+
+**type** — The context type:
+- `neg` — negated
+- `poss` — possible
+- `hypo` — hypothetical
+- `hist` — historical
+- `exp` — experiencer (first person)
+- `histexp` — historical experiencer
+- `hypoexp` — hypothetical experiencer
+- `pos` — positive/affirmed
+
+**priority** — Integer (1 = lower, 2 = higher). Higher priority overwrites lower.
+
+**Examples:**
+```
+no evidence of~|~pre~|~neg~|~1
+history~|~pre~|~hist~|~1
+family history~|~pre~|~hist~|~2
+regex:\bdenies?\b~|~pre~|~neg~|~2
+```
+
+**Important:** Context rule phrases (non-`regex:` lines) only match **whole words separated by whitespace**. `history and` will NOT match inside `social history and`. Use the `regex:` prefix for substring matching.
+
+## Regex Performance Best Practices
+
+Performance issues in MedTagger stem from regex backtracking. Slow patterns compound because the matcher iterates every compiled pattern against every sentence.
+
+### Severity Ratings
+
+| Severity | Pattern Type | Impact |
+|----------|-------------|--------|
+| RED | Variable-width lookbehind `(?<=.{0,N})` | ~3x slower in Java |
+| RED | Nested quantifiers `(a+)+b` | Exponential on no-match strings |
+| RED | Multi-variable lookbehind | Exponential in Java |
+| YELLOW | Greedy `.*` in prefix.*suffix | 11-12x slower than lazy |
+| YELLOW | 100+ term alternation | ~10x slower than 5-term |
+| YELLOW | Bridge patterns `.{0,50}` | 1.4-1.5x overhead |
+| GREEN | Anchors `^`, `$`, `\b` | Baseline fast |
+| GREEN | Bounded `{0,N}` with N <= 8 | Fast |
+| NEUTRAL | Possessive quantifiers | No measurable benefit |
+
+### AVOID: Variable-Width Lookbehinds
+
+This is the **#1 performance killer**:
+
+```
+// SLOW: (?<=.{0,10}pain)
+(?<=.{0,10}pain)
+
+// FIX: use fixed-width or restructure as forward match
+(?<=pain)
+```
+
+Variable-width lookbehinds exhaust the regex engine trying all substring widths. Replace with fixed-width alternatives or restructure as forward-matching patterns.
+
+### AVOID: Nested Quantifiers
+
+`(\s+\S+){0,N}` with high N and nested alternation causes catastrophic backtracking:
+
+```
+// SLOW: nested quantifiers with large bounds
+(problem|unable|difficulty) (\s+\S+){0,8}(speaking|responding|following)
+
+// BETTER: reduce bounds to {0,3}
+(problem|unable|difficulty) (\s+\S+){0,3}(speaking|responding)
+```
+
+### CAUTION: Greedy vs Lazy `.*`
+
+```
+// SLOW: greedy scans to end, then backtracks
+Patient(?:.*)appetite
+
+// FAST: lazy finds first match directly
+Patient(?:.*?)appetite
+```
+
+### CAUTION: Large Alternations
+
+A 100-term alternation is ~10x slower than a 5-term one. Split for readability, but note that Java optimizes a single large alternation better than multiple separate operations.
+
+### Use Bounded Repetition
+
+`{0,5}` is a common sweet spot. `{0,3}` is safer. Avoid anything above `{0,6}` without benchmarking.
+
+## Common Mistakes
+
+### Word Boundary Placement
+
+Do NOT put `\b` inside regexp dictionary entries. The `\b` belongs in the rule file around the `%reKEY` placeholder.
+
+```
+// Rule: \b%reKEY\b — word boundary in the rule
+RULENAME="cm_fever",REGEXP="(?i)\b%reFEVER\b",LOCATION="NA",NORM="434173"
+
+// Dictionary: no \b needed inside
+fever
+febrile
+```
+
+### Case Sensitivity
+
+All matching is case-insensitive. The sentence text is lowercased before matching. Do not write case-sensitive regex patterns — they will never match.
+
+### Trailing Pipes in Regexp Files
+
+Lines ending with `|` create an empty alternation branch. Remove blank lines and trailing `|`.
+
+```
+// Bad: trailing pipe creates empty match
+fever|
+high temperature|
+
+// Good: no trailing pipe
+fever
+high temperature
+```
+
+### Hyphenation
+
+The default tokenizer does NOT convert hyphens to spaces. `breast-cancer` will not match `breast cancer`. Include variants in your regexp file if needed.
+
+### Context Checks Only the Start of a Concept
+
+Context status is applied based only on the **first character** of a concept mention. A long concept spanning from an affirmed zone into a negated zone will be labeled based on the start position only.
+
+### Resource Manifest
+
+If your rule references `%reFOO` or `%normFOO`, the corresponding file must be listed in your resource manifest. Missing entries cause a fatal startup error.
+
+## Principles for Generating Good Rules
+
+When authoring MedTagger rules, follow these principles:
+
+1. **Split long alternations for readability** — max ~10-15 terms per line
+2. **Use `{0,3}` or `{0,4}` bounds** on `(\s+\S+)` repetition, never `{0,8}` or higher
+3. **Omit `\b` inside regexp dictionary entries** — it belongs in the rule only
+4. **Omit inline `(?i)` inside regexp dictionary entries** — matching is always case-insensitive
+5. **Comment clearly** what each regexp file is for using `//` lines
+6. **Use `REMOVE` norm** for exclusion/boilerplate patterns
+7. **Use priority 2 context rules** for specific overrides
+8. **Prefer simple dictionary entries** over complex multi-clause regex
+9. **Never use variable-width lookbehinds** — restructure as forward-matching patterns
+10. **Use lazy `.*?` instead of greedy `.*`** in patterns with both prefix and suffix
+11. **Do not use possessive quantifiers** or atomic groups for performance
+12. **Use context rules for negation/assertion** — not complex regex in dictionary entries
+13. **Prefer literal phrases** over `regex:` in context rules — literal uses fast Aho-Corasick trie matching
+
+## Performance Benchmark Reference
+
+Based on testing with Java's regex engine on texts of varying lengths:
+
+| Pattern Type | 100 chars | 1KB | 10KB | 100KB |
+|-------------|-----------|-----|------|-------|
+| Fixed-width lookbehind | 0.1ms | 0.3ms | 3ms | 30ms |
+| Variable-width lookbehind | 0.2ms | 1ms | 10ms | 100ms |
+| Lazy prefix.*suffix | 0.1ms | 0.5ms | 5ms | 50ms |
+| Greedy prefix.*suffix | 0.5ms | 5ms | 50ms | 500ms |
+| Bounded `{0,3}` | 0.1ms | 0.2ms | 2ms | 20ms |
+| Bounded `{0,8}` | 0.2ms | 0.5ms | 5ms | 50ms |
+
+**Note:** Java's regex engine handles many catastrophic patterns better than PCRE/Python, but variable-width lookbehinds and nested quantifiers can still cause significant slowdowns.
+
+## Quick Reference
+
+**Regexp file:** One pattern per line, no `\b`, joined with `|`
+
+**Normalization file:** Tab-separated `surface form[TAB]code`
+
+**Match rule:** `RULENAME="...",REGEXP="...",LOCATION="...",NORM="..."`
+
+**Context rule:** `phrase~|~position~|~type~|~priority`
+
+**Performance priority:**
+- RED: Variable-width lookbehind, nested quantifiers
+- YELLOW: Greedy `.*`, large alternations, bridge patterns
+- GREEN: Anchors, bounded `{0,N}` with N <= 8
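The regexp-file conventions summarized above (comment lines, one entry per line, no `\b`, no trailing `|`) lend themselves to a quick pre-flight check. The sketch below is a hypothetical Python helper, not part of MedTagger: it joins entries with `|`, as this guide describes the loader doing, and reports lines that break the conventions. The name `lint_regexp_lines` and the problem messages are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def lint_regexp_lines(lines: Iterable[str]) -> Tuple[str, List[Tuple[int, str]]]:
    """Join dictionary entries with '|' and collect (line number, problem) pairs."""
    entries: List[str] = []
    problems: List[Tuple[int, str]] = []
    for number, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\r\n")
        if entry.startswith("//"):
            continue  # comment line, skipped
        if not entry.strip():
            problems.append((number, "blank line"))
            continue
        if entry.endswith("|"):
            problems.append((number, "trailing pipe"))
        if "\\b" in entry:
            problems.append((number, "word boundary inside entry"))
        if "(?i)" in entry:
            problems.append((number, "inline (?i) inside entry"))
        entries.append(entry)  # entries are kept as written; problems are only reported
    return "|".join(entries), problems


alternation, problems = lint_regexp_lines(
    ["// fever terms", "fever|", r"\bfebrile", "high temperature"]
)
print(alternation)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

In the sample input, the trailing pipe on `fever|` produces an empty branch (`||`) in the joined alternation, which is the failure mode described under "Trailing Pipes in Regexp Files".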