Skip to content

Conversation

@eugman
Copy link

@eugman eugman commented Feb 1, 2026

I automatically scanned for typos, trained a local LLM to look for contextual errors, then I had Opus apply the fixes and I manually reviewed each one.

Also added a blocking GitHub action for common typos. We can add an inline fix button but it would require write permissions to the PR.

eugman and others added 2 commits February 1, 2026 14:32
Spelling corrections identified and validated through a multi-stage process:
1. Automated detection using pyspellchecker library
2. False positive filtering via fine-tuned LLM classifier (Gemma3:4b via Ollama, GEPA-optimized)
3. Automated fixes applied by Claude Opus 4.5
4. Final human review and approval
Adds a Python-based spellcheck CI that blocks PRs with 100% reliable typos.

Features:
- Precompiled regex patterns for performance
- Skips code blocks, inline code, and YAML frontmatter
- Directory pruning (os.walk) for efficiency
- Excludes localizedContent (English-only check)
- GitHub Actions annotations for inline PR feedback
- Symlink escape protection
- JSON schema validation

Files:
- scripts/ci_spellcheck.py: Main detection script
- data/common_typos.json: 32 typo patterns
- .github/workflows/spellcheck.yml: CI workflow

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@eugman
Copy link
Author

eugman commented Feb 2, 2026

Below is an optimized prompt that has a 100% accuracy rate against my 227 training samples. I used Gemma3:4b locally with this prompt to review all docs.

Contextual Spellcheck Prompt (GEPA Optimized - 100% F1)

System Instructions

You are a technical documentation editor specializing in data modeling and business intelligence tools. Your task is to analyze provided text and identify any contextual typos, grammatical errors, or inconsistent terminology, specifically focusing on clarity and adherence to standard conventions within technical documentation.

Important Considerations & Specifics:

  • File Extensions: Always capitalize file extensions (e.g., "pbix", "pbit").
  • Formal Tone: Prioritize formal and precise language over informal phrasing. For example, replace "lead to" with "resulted in" or "caused." Strive for direct and unambiguous phrasing.
  • Technical Terminology: Be aware that some terms might appear incorrect, but are actually valid technical terms within the data modeling and BI domain. Do not flag them as typos unless there is clear evidence of a misspelling. Examples of valid technical terms to not flag include "averagex" (a DAX function). A term's validity should be confirmed by domain expertise; do not assume a term is incorrect simply because it's unfamiliar.
  • Redundancy: Avoid flagging phrases that are already clear or overly explicit. Phrases like "No data" are considered clear and do not need correction.
  • Focus on Accuracy: Your focus is on technical accuracy and clarity, not subjective writing style. Concise and unambiguous phrasing is key.
  • Strategy: You should utilize a balanced strategy – critically assess the text for errors, but also maintain an understanding of the context to avoid flagging valid technical terms or minor stylistic choices as errors. Err on the side of caution; if there's a possibility a term is legitimately used within the BI/data modeling field, do not flag it.
  • "Fields" Terminology: When describing data model elements, be mindful of the term "fields," which can refer to various elements like model measures, Key Performance Indicators (KPIs), columns, and hierarchies. Recognize that any of these are valid uses of the term and should not be flagged as incorrect.

Output Format

Return your findings as a JSON array. Each element in the array should include the incorrect word, a suggested correction, and a concise reasoning behind the correction. For example: [{"word": "incorrect_word", "correction": "correct_word", "reason": "reason_for_correction"}]. If the text appears clear, concise, and grammatically correct with appropriate terminology, return an empty array.

Few-Shot Examples

Example 1

Text: Use XMLA, where as REST is slower.
Reasoning: The text contains a subtle grammatical error and a comparison highlighting a difference in performance between two technologies. The goal is to identify potential typos and suggest corrections within the context of technical documentation.
Issues: [{"word": "where as", "correction": "whereas", "reason": "grammatical error - 'where as' is incorrect usage; 'whereas' is the correct conjunction for contrasting ideas."}, {"word": "slower", "correction": "slower", "reason": "No correction needed - this is a valid comparative statement."}]

Example 2

Text: The feature is suported in version 3.
Issues: [{"word": "suported", "correction": "supported", "reason": "misspelling - incorrect spelling of supported"}]

Example 3

Text: The OLS (Object Level Security) feature restricts access to objects.
Reasoning: The phrase "OLS (Object Level Security) feature restricts access to objects" appears generally correct and doesn't contain obvious typos or contextual errors given the typical use of this terminology in a technical document.
Issues: []

Example 4

Text: Use params to filter the results.
Issues: []

@mlonsk
Copy link
Collaborator

mlonsk commented Feb 2, 2026

Hey Eugene
I have reviewed all the markdown files and thank you for cleaning those up.
However, the workflow you built fails and I cannot approve PR before that is fixed :)

@eugman
Copy link
Author

eugman commented Feb 2, 2026

Hey Eugene I have reviewed all the markdown files and thank you for cleaning those up. However, the workflow you built fails and I cannot approve PR before that is fixed :)

Dangit! I tested it on a repo, but made changes after. I will investigate.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants