
Medical Domain Embeddings Adapter #45

Closed. iberi22 wants to merge 1 commit into main from feat/medical-embeddings-adapter-14484991894278155486

iberi22 (Owner) commented Apr 16, 2026

This PR introduces the MedicalEmbeddingsAdapter and MedicalTokenizer to improve the quality of embeddings for medical text in both Spanish and English.

The MedicalTokenizer handles the expansion of common medical abbreviations (e.g., "TA" to "tensión arterial", "BP" to "blood pressure") using a regex-based approach with word boundaries to avoid false positives.
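As a rough illustration of the word-boundary behaviour described above (a sketch, not the code in this PR; the sample strings are made up), the snippet below shows `\b` accepting a standalone "TA", rejecting "taza", and failing on "Tª". Dart's `\b` only treats ASCII letters, digits, and `_` as word characters, so there is no boundary between `ª` and a following space, which is why an abbreviation ending in `ª` needs separate handling.

```dart
void main() {
  // Word-boundary pattern of the kind described for plain ASCII abbreviations.
  final ta = RegExp(r'\bTA\b', caseSensitive: false);
  print(ta.hasMatch('TA 120/80')); // true: "TA" stands alone
  print(ta.hasMatch('taza')); // false: "ta" is followed by a word character

  // The same pattern shape fails for "Tª": "ª" is not an ASCII word
  // character, so there is no \b between "ª" and the space after it.
  final temp = RegExp(r'\bTª\b', caseSensitive: false);
  print(temp.hasMatch('Tª 38')); // false

  // Dropping the trailing \b restores the match for this key.
  final tempLoose = RegExp(r'\bTª', caseSensitive: false);
  print(tempLoose.hasMatch('Tª 38')); // true
}
```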

The MedicalEmbeddingsAdapter is implemented as a decorator that pre-processes text before passing it to an underlying EmbeddingsAdapter.

The EmbeddingsAdapter interface has been updated with a medicalNormalized method, and all existing implementations (GeminiEmbeddingsAdapter, OnDeviceEmbeddingsAdapter, FallbackEmbeddingsAdapter, and CodeEmbeddingsAdapter) have been updated accordingly.

Unit tests have been added to verify the abbreviation expansion and the integration with embedding adapters.

Fixes #38


PR created automatically by Jules for task 14484991894278155486 started by @iberi22

…ation

- Added `medicalNormalized` to `EmbeddingsAdapter` interface.
- Created `MedicalTokenizer` for Spanish and English abbreviation expansion.
- Implemented `MedicalEmbeddingsAdapter` as a decorator for other adapters.
- Updated all existing adapters to implement the new interface method.
- Added comprehensive unit tests for the new components.

coderabbitai (Bot) commented Apr 16, 2026

Review skipped: draft detected.


gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces medical text processing enhancements by adding a MedicalTokenizer for expanding Spanish and English medical abbreviations and a MedicalEmbeddingsAdapter decorator to apply these expansions before embedding. It also extends the EmbeddingsAdapter interface with a medicalNormalized method, implemented across existing adapters. Feedback was provided regarding the performance of the MedicalTokenizer, specifically suggesting the caching of regular expressions and sorted keys to avoid redundant computations during text processing.

Comment on lines +48 to +76
  String expandAbbreviations(String text) {
    if (text.isEmpty) return text;

    String expandedText = text;

    // Sort keys by length descending to avoid partial matches (e.g., 'ta' in 'tac')
    final sortedKeys = _abbreviations.keys.toList()
      ..sort((a, b) => b.length.compareTo(a.length));

    for (final key in sortedKeys) {
      // Use regex with word boundaries to avoid matching inside words
      // e.g. "TA" should match but "taza" should not.
      // We handle Tª specifically as it has a special character.
      final escapedKey = RegExp.escape(key);
      final regex = RegExp('\\b$escapedKey\\b', caseSensitive: false);

      // Special case for Tª since \b might not work as expected with ª
      if (key == 'tª') {
        expandedText = expandedText.replaceAll(
            RegExp(r'Tª', caseSensitive: false), _abbreviations[key]!);
      } else {
        expandedText = expandedText.replaceAllMapped(regex, (match) {
          return _abbreviations[key]!;
        });
      }
    }

    return expandedText;
  }

medium

The expandAbbreviations method is inefficient because it re-calculates the sorted keys and re-compiles multiple RegExp objects on every call. This can lead to performance degradation when processing large texts or when called frequently.

Consider pre-calculating the sorted keys and caching the compiled regular expressions as static members of the class. This avoids redundant work and improves the overall performance of the tokenizer.

  static final List<String> _sortedKeys = _abbreviations.keys.toList()
    ..sort((a, b) => b.length.compareTo(a.length));

  static final Map<String, RegExp> _regexCache = {
    for (final key in _sortedKeys)
      key: key == 'tª'
          ? RegExp(r'tª', caseSensitive: false)
          : RegExp(r'\b' + RegExp.escape(key) + r'\b', caseSensitive: false)
  };

  String expandAbbreviations(String text) {
    if (text.isEmpty) return text;

    String expandedText = text;

    for (final key in _sortedKeys) {
      expandedText = expandedText.replaceAll(_regexCache[key]!, _abbreviations[key]!);
    }

    return expandedText;
  }

iberi22 (Owner, Author) commented Apr 16, 2026

SCOPE ISSUE: MedicalTokenizer has hardcoded Spanish medical abbreviations (TA, FC, DM, etc.). This violates the principle that this package stays generic.

REQUIRED SOLUTION:

  1. Delete lib/src/utils/medical_tokenizer.dart
  2. Delete lib/src/embeddings/medical_embeddings_adapter.dart
  3. Create a PLUGGABLE system in lib/src/utils/normalization_config.dart:

```dart
// Generic normalization config - NO hardcoded values
class TextNormalizationConfig {
  final Map<String, String> abbreviationMap; // injected from outside

  TextNormalizationConfig(this.abbreviationMap);

  static TextNormalizationConfig empty() => TextNormalizationConfig({});

  String normalize(String text) { ... }
}
```

  4. In lib/src/embeddings/, create a generic adapter that uses the config:

```dart
class NormalizingEmbeddingsAdapter implements EmbeddingsAdapter {
  final EmbeddingsAdapter inner;
  final TextNormalizationConfig config;

  // Uses config.abbreviationMap, with nothing hardcoded
}
```

The specific medical abbreviations for Spain/Latam will be configured in OrionHealth, not in this package.

Please remove the hardcoded values and make it generic.
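
One possible shape for the pluggable design requested above, offered as a hedged sketch rather than the package's actual code. The `embed` method on `EmbeddingsAdapter` is an assumption (the real interface is not shown in this thread), and `RecordingAdapter` is a hypothetical stand-in used only for the demo. Instead of one `\b` regex per key, the sketch compiles a single case-insensitive pattern once and uses Unicode-aware lookarounds, so keys such as "Tª" need no special case. Replacing in a single pass also means expanded text is never re-scanned, which differs from the sequential loop in the original PR.

```dart
/// ASSUMPTION: stand-in for the package's interface; the real
/// EmbeddingsAdapter method names and signatures may differ.
abstract class EmbeddingsAdapter {
  Future<List<double>> embed(String text);
}

/// Generic normalization config: every abbreviation is injected by the caller.
class TextNormalizationConfig {
  TextNormalizationConfig(Map<String, String> abbreviations)
      : abbreviationMap = {
          // Keys are stored lower-cased so lookups are case-insensitive.
          for (final e in abbreviations.entries) e.key.toLowerCase(): e.value,
        };

  static TextNormalizationConfig empty() => TextNormalizationConfig(const {});

  final Map<String, String> abbreviationMap;

  // Compiled lazily on first use, then reused for every call.
  late final RegExp? _pattern = _buildPattern();

  RegExp? _buildPattern() {
    if (abbreviationMap.isEmpty) return null;
    // Longest keys first so a longer key wins over a shorter prefix of it.
    final keys = abbreviationMap.keys.toList()
      ..sort((a, b) => b.length.compareTo(a.length));
    final alternation = keys.map(RegExp.escape).join('|');
    // Lookarounds reject matches that touch a letter or digit on either side.
    return RegExp(
      '(?<![\\p{L}\\p{N}])(?:$alternation)(?![\\p{L}\\p{N}])',
      caseSensitive: false,
      unicode: true,
    );
  }

  String normalize(String text) {
    final pattern = _pattern;
    if (pattern == null || text.isEmpty) return text;
    return text.replaceAllMapped(
      pattern,
      // Fall back to the matched text if case folding produced an unknown key.
      (m) => abbreviationMap[m[0]!.toLowerCase()] ?? m[0]!,
    );
  }
}

/// Decorator: normalizes text, then delegates to the wrapped adapter.
class NormalizingEmbeddingsAdapter implements EmbeddingsAdapter {
  NormalizingEmbeddingsAdapter(this.inner, this.config);

  final EmbeddingsAdapter inner;
  final TextNormalizationConfig config;

  @override
  Future<List<double>> embed(String text) =>
      inner.embed(config.normalize(text));
}

/// Hypothetical adapter for the demo: records the text it was asked to embed.
class RecordingAdapter implements EmbeddingsAdapter {
  String? lastText;

  @override
  Future<List<double>> embed(String text) async {
    lastText = text;
    return const [0.0];
  }
}

Future<void> main() async {
  // The consuming app (for example OrionHealth) would supply its own map.
  final config = TextNormalizationConfig({
    'TA': 'tensión arterial',
    'Tª': 'temperatura',
  });
  final recorder = RecordingAdapter();
  final adapter = NormalizingEmbeddingsAdapter(recorder, config);

  await adapter.embed('TA 120/80, Tª 38, taza');
  print(recorder.lastText);
  // tensión arterial 120/80, temperatura 38, taza
}
```

With this layout the package ships no domain vocabulary: `TextNormalizationConfig.empty()` makes the adapter a pass-through, and each application injects its own abbreviation map.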

iberi22 (Owner, Author) commented Apr 17, 2026

SCOPE VIOLATION: MedicalTokenizer has hardcoded medical abbreviations (TA, FC, DM). Please create a new PR with a generic, pluggable TextNormalizationConfig.

iberi22 closed this Apr 17, 2026