Medical Domain Embeddings Adapter#45
Conversation
…ation - Added `medicalNormalized` to `EmbeddingsAdapter` interface. - Created `MedicalTokenizer` for Spanish and English abbreviation expansion. - Implemented `MedicalEmbeddingsAdapter` as a decorator for other adapters. - Updated all existing adapters to implement the new interface method. - Added comprehensive unit tests for the new components.
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces medical text processing enhancements by adding a MedicalTokenizer for expanding Spanish and English medical abbreviations and a MedicalEmbeddingsAdapter decorator to apply these expansions before embedding. It also extends the EmbeddingsAdapter interface with a medicalNormalized method, implemented across existing adapters. Feedback was provided regarding the performance of the MedicalTokenizer, specifically suggesting the caching of regular expressions and sorted keys to avoid redundant computations during text processing.
| String expandAbbreviations(String text) { | ||
| if (text.isEmpty) return text; | ||
|
|
||
| String expandedText = text; | ||
|
|
||
| // Sort keys by length descending to avoid partial matches (e.g., 'ta' in 'tac') | ||
| final sortedKeys = _abbreviations.keys.toList() | ||
| ..sort((a, b) => b.length.compareTo(a.length)); | ||
|
|
||
| for (final key in sortedKeys) { | ||
| // Use regex with word boundaries to avoid matching inside words | ||
| // e.g. "TA" should match but "taza" should not. | ||
| // We handle Tª specifically as it has a special character. | ||
| final escapedKey = RegExp.escape(key); | ||
| final regex = RegExp('\\b$escapedKey\\b', caseSensitive: false); | ||
|
|
||
| // Special case for Tª since \b might not work as expected with ª | ||
| if (key == 'tª') { | ||
| expandedText = expandedText.replaceAll( | ||
| RegExp(r'Tª', caseSensitive: false), _abbreviations[key]!); | ||
| } else { | ||
| expandedText = expandedText.replaceAllMapped(regex, (match) { | ||
| return _abbreviations[key]!; | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| return expandedText; | ||
| } |
There was a problem hiding this comment.
The expandAbbreviations method is inefficient because it re-calculates the sorted keys and re-compiles multiple RegExp objects on every call. This can lead to performance degradation when processing large texts or when called frequently.
Consider pre-calculating the sorted keys and caching the compiled regular expressions as static members of the class. This avoids redundant work and improves the overall performance of the tokenizer.
static final List<String> _sortedKeys = _abbreviations.keys.toList()
..sort((a, b) => b.length.compareTo(a.length));
static final Map<String, RegExp> _regexCache = {
for (final key in _sortedKeys)
key: key == 'tª'
? RegExp(r'tª', caseSensitive: false)
: RegExp(r'\b' + RegExp.escape(key) + r'\b', caseSensitive: false)
};
String expandAbbreviations(String text) {
if (text.isEmpty) return text;
String expandedText = text;
for (final key in _sortedKeys) {
expandedText = expandedText.replaceAll(_regexCache[key]!, _abbreviations[key]!);
}
return expandedText;
}|
SCOPE ISSUE: MedicalTokenizer tiene abreviaturas médicas hardcoded en español (TA, FC, DM, etc.). Esto viola el principio de package genérico. SOLUCIÓN REQUERIDA:
`dart static TextNormalizationConfig empty() => TextNormalizationConfig({}); String normalize(String text) { ... }
`dart // Usa config.normalizationConfig.abbreviationMap, sin hardcodear nada Las abreviaturas médicas específicas para España/Latam se configurarán en OrionHealth, no en este package. Por favor: elimina lo hardcoded y hazlo genérico. |
|
SCOPE VIOLATION: MedicalTokenizer tiene abreviaciones medicas hardcoded (TA, FC, DM). Por favor crea nuevo PR con TextNormalizationConfig generico y pluggable. |
This PR introduces the
MedicalEmbeddingsAdapterandMedicalTokenizerto improve the quality of embeddings for medical text in both Spanish and English.The
MedicalTokenizerhandles the expansion of common medical abbreviations (e.g., "TA" to "tensión arterial", "BP" to "blood pressure") using a regex-based approach with word boundaries to avoid false positives.The
MedicalEmbeddingsAdapteris implemented as a decorator that pre-processes text before passing it to an underlyingEmbeddingsAdapter.The
EmbeddingsAdapterinterface has been updated with amedicalNormalizedmethod, and all existing implementations (GeminiEmbeddingsAdapter,OnDeviceEmbeddingsAdapter,FallbackEmbeddingsAdapter, andCodeEmbeddingsAdapter) have been updated accordingly.Unit tests have been added to verify the abbreviation expansion and the integration with embedding adapters.
Fixes #38
PR created automatically by Jules for task 14484991894278155486 started by @iberi22