DeID improvements #541

tomolopolis · 2025-05-22T13:40:52Z

Small improvements for fine-tuning and using the de-id model in a pipeline:

Train / test splits can be performed outside of the train method. DeIdModel.train has been changed to support this.
Locale-specific regexes can be useful in a pipeline as an alternative to collecting annotations and fine-tuning the underlying model. The changes allow arbitrary patterns to be matched and mapped to CDB CUIs, then merged with the model predictions. The eval code also doesn't use the tokenizer split, so it is fully representative of what was annotated.
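
To illustrate the regex idea described above, here is a minimal standalone sketch, not the code from this PR. It assumes patterns are supplied as a mapping from regex string to CUI, and that regex matches are only added where they do not overlap a model prediction. The names `regex_spans` and `merge_spans`, the span dictionary layout, and the `NAME` / `POSTCODE` CUIs are all hypothetical and chosen for illustration only.

```python
import re
from typing import Dict, List

# Hypothetical span layout: character offsets, a CUI, and where the span came from.
Span = Dict[str, object]


def regex_spans(text: str, pattern_to_cui: Dict[str, str]) -> List[Span]:
    """Find every match of every pattern and label it with the mapped CUI."""
    spans: List[Span] = []
    for pattern, cui in pattern_to_cui.items():
        for match in re.finditer(pattern, text):
            spans.append({"start": match.start(), "end": match.end(),
                          "cui": cui, "source": "regex"})
    return spans


def merge_spans(model_spans: List[Span], extra_spans: List[Span]) -> List[Span]:
    """Keep all model spans; add a regex span only if it overlaps nothing kept so far."""
    merged = list(model_spans)
    for cand in extra_spans:
        overlaps = any(cand["start"] < kept["end"] and kept["start"] < cand["end"]
                       for kept in merged)
        if not overlaps:
            merged.append(cand)
    return sorted(merged, key=lambda s: s["start"])


# Simplified UK postcode pattern, for illustration only (assumption, not from the PR).
UK_POSTCODE = r"\b[A-Z]{1,2}\d[A-Z\d]? \d[A-Z]{2}\b"

text = "Seen by Dr Smith, postcode SW1A 1AA."
name_start = text.index("Smith")
# Stand-in for a model prediction; a real pipeline would take these from the de-id model.
model_preds = [{"start": name_start, "end": name_start + len("Smith"),
                "cui": "NAME", "source": "model"}]

merged = merge_spans(model_preds, regex_spans(text, {UK_POSTCODE: "POSTCODE"}))
for span in merged:
    print(span["cui"], repr(text[span["start"]:span["end"]]), span["source"])
```

Giving model predictions priority on overlap is one possible design choice; a pipeline could equally prefer the regex match or the longer span.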

mart-r

A few comments that I think should be addressed (i.e. the duplicate/typo'd key and the remaining old docstring); the rest is more optional, I'd say.
Though I do feel that the 2 comments regarding cui2preferred_name, as well as the one regarding using filter, make sense to implement as well.

medcat/utils/ner/deid.py

medcat/utils/ner/metrics.py

mart-r

Some tests as well - great!
All good on my side.

Tom Searle and others added 13 commits May 22, 2025 14:31

CU-86995ddec: Extra args on the train to pass in dataset to be split,…

fed0e3c

… or train / test files

CU-86995ddvj: Add in post-processing funcs for a de-id pipeline

d8e6151

Merge branch 'master' into deid_train_eval

5ab846d

CU-8698jzjj3: fix prev merge

add75ac

CU-8698jzjj3: fix mypy errors

14dae8c

CU-8698jzjj3: remove example

42e69e6

CU-8698jzjj3: fix mypy errors

da57c74

Merge branch 'master' into deid_train_eval

5980adb

CU-8698jzjj3: flake8 fixes

32265f7

CU-8698jzjj3: flake8 fixes

13ef58c

CU-8698jzjj3: darglint fixes

56121d5

CU-8698jzjj3: darglint fix

651ad88

CU-8698jzjj3: datasets splits fix and extra test for extra named arg

5dae170

mart-r suggested changes Jun 2, 2025

Tom Searle added 2 commits June 3, 2025 12:31

CU-8698jzjj3: Add tests and respond to comments

7d61c72

CU-8698jzjj3: Fix tests

d2ba527

mart-r approved these changes Jun 4, 2025

mart-r mentioned this pull request Jun 4, 2025

CU-86999tnz7 resync with v1 CogStack/MedCAT2#74

Merged

10 tasks

tomolopolis merged commit a7661ef into master Jun 4, 2025
8 checks passed

alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025

Merge pull request CogStack/MedCAT#541 from CogStack/deid_train_eval

d255984

DeID improvements
