Keep meaningful categorical labels by felixschmitz · Pull Request #70 · ttsim-dev/soep-preparation

felixschmitz · 2026-01-14T14:46:49Z

Closes #66

…ing categorical labels.

felixschmitz · 2026-01-14T14:51:09Z

src/soep_preparation/clean_modules/pequiv.py

        series=out["med_subjective_status_pequiv"],
-        value_for_comparison=5,
-        comparison_type="leq",
+        value_for_comparison=["Zufriedenstellend", "Weniger gut", "Schlecht"],


The logic for creating med_subjective_status_dummy_pequiv was carried over unchanged from the old repository. I just noticed that it differs from how med_subjective_status_dummy_pl is created. So that the two variables can be combined correctly later on, I adapted the definition of med_subjective_status_dummy_pequiv here.

hmgaudecker · 2026-01-18T11:19:35Z

It definitely makes sense to harmonize them -- but why this way around and not the other?

felixschmitz · 2026-01-18T15:21:21Z

We use med_subjective_status_dummy_pequiv in the calculation of a frailty score. The other dummy variables there are 1/True if a medical condition is present; e.g., med_schwierigkeiten_anziehen_pequiv is True for individuals with the condition. Coding the worse health categories as True keeps med_subjective_status_dummy_pequiv in line with that convention.
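To make the label-based definition in the diff above concrete, here is a minimal sketch. The helper `create_dummy_from_labels` is hypothetical and works on plain Python lists; it is not the repository's `create_dummy`, whose implementation is not shown in this thread. The labels "Sehr gut" and "Gut" and the handling of missing values are also assumptions made for illustration.

```python
from typing import List, Optional, Sequence

# Labels taken from the diff above: categories coded as "condition present".
POOR_HEALTH_LABELS = ("Zufriedenstellend", "Weniger gut", "Schlecht")


def create_dummy_from_labels(
    values: Sequence[Optional[str]],
    labels_coded_true: Sequence[str],
) -> List[Optional[bool]]:
    """Hypothetical stand-in for the repository's create_dummy.

    Returns True where the label is in labels_coded_true, False for any
    other label, and None where the input is missing (an assumption).
    """
    return [None if v is None else v in labels_coded_true for v in values]


# "Sehr gut" and "Gut" are assumed labels for the remaining categories.
status = ["Sehr gut", "Gut", "Zufriedenstellend", "Schlecht", None]
dummy = create_dummy_from_labels(status, POOR_HEALTH_LABELS)
print(dummy)
```

Comparing against labels rather than a numeric threshold makes the direction of the dummy explicit at the call site, which is the point raised in the review question.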

felixschmitz · 2026-01-14T14:52:17Z

src/soep_preparation/clean_modules/pl.py

            [
-                "med_schwierigkeit_treppen_pl",
-                "med_schwierigkeit_taten_pl",
+                "med_schwierigkeiten_treppen_dummy_pl",


I think it is better to use the _dummy_ versions of the two variables here rather than the full scale. See also med_subjective_status_dummy_pl below.

hmgaudecker · 2026-01-18T11:17:13Z

Might be reasonable -- but it seems to be a change relative to what we had before, so it would be useful to explain why you think so.

Piecing the evidence together, it seems like the previous version was combining two incompatible variables?

felixschmitz · 2026-01-18T15:29:23Z

For one, the previous version had most of the variables in the dummy representation, indicating whether or not a condition is present. These two variables/conditions do not differ from the others and should hence also use the dummy representation. Further, the previous version had values greater than 1 in these two variables, "giving them more mass" in the calculation of the frailty score (the mean of all medical condition variables provided).
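The "more mass" argument can be checked with a small worked example. This is only a sketch: the respondent, the item names, and the numeric 1-3 coding are assumptions for illustration, and the rule that values 1 and 2 mean "condition present" is borrowed from the `value_for_comparison=[1, 2]` line in a later diff of this pull request.

```python
from statistics import mean

# Hypothetical respondent: three conditions already stored as dummies (0/1).
dummies = {"condition_a": 1, "condition_b": 0, "condition_c": 0}

# Two conditions stored on an assumed numeric 1-3 scale.
full_scale = {"treppen": 3, "taten": 2}

# Mixing representations: the scaled items contribute values above 1.
mixed_score = mean([*dummies.values(), *full_scale.values()])

# Consistent representation: convert the scaled items to 0/1 first.
# Assumption: values 1 and 2 mean the condition is present.
as_dummies = {name: int(value in (1, 2)) for name, value in full_scale.items()}
consistent_score = mean([*dummies.values(), *as_dummies.values()])

print(mixed_score, consistent_score)
```

With mixed representations the score can leave the 0 to 1 range entirely, whereas with dummies only it stays interpretable as the share of conditions present.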

hmgaudecker

Very nice, thanks! My comments are meant less as substantive issues and more to clarify what information others (well, me) need in order to review without having to dig things up.

felixschmitz

Some thoughts on the handling of variables to calculate the frailty scores. What do you think about transforming all medical condition variables to dummies, omitting self-reported intensity categories?

hmgaudecker · 2026-01-19T07:31:10Z

Thanks for the explanations -- they were exactly what I was looking for!

> What do you think about transforming all medical condition variables to dummies, omitting self-reported intensity categories?

I would actually prefer it the other way around:

- Keep only the information-preserving variables in our pipeline.
- Convert them to dummies only in the function calculating the frailty score.
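The two points above could be sketched as follows. Everything here is an assumption for illustration: `frailty_score`, the `CONDITION_PRESENT` mapping, the choice of variables, the label "Gut", and the way missing values are skipped are not taken from the repository. The idea is only that the pipeline keeps the labelled categories and the dummy conversion happens inside the scoring function.

```python
from statistics import mean
from typing import Mapping, Optional

# Assumed mapping: for each categorical variable, the labels that count as
# "condition present". Variable names and label sets are illustrative.
CONDITION_PRESENT = {
    "med_schwierigkeit_treppen_pl": {"[2] Ein wenig", "[1] Stark"},
    "med_subjective_status_pequiv": {"Zufriedenstellend", "Weniger gut", "Schlecht"},
}


def frailty_score(row: Mapping[str, Optional[str]]) -> Optional[float]:
    """Hypothetical score: share of non-missing conditions that are present.

    The categorical inputs stay untouched; dummies exist only locally.
    """
    dummies = [
        int(row[name] in present)
        for name, present in CONDITION_PRESENT.items()
        if row.get(name) is not None  # assumption: skip missing items
    ]
    return mean(dummies) if dummies else None


example_row = {
    "med_schwierigkeit_treppen_pl": "[1] Stark",
    "med_subjective_status_pequiv": "Gut",  # assumed label
}
print(frailty_score(example_row))
```

A design like this keeps the intensity information available for other uses while giving the score a single, documented place where the dummy thresholds are defined.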

felixschmitz · 2026-01-19T13:43:23Z

src/soep_preparation/clean_modules/pl.py

-    out["med_schwierigkeit_treppen_dummy_pl"] = create_dummy(
-        series=out["med_schwierigkeit_treppen_pl"],
-        value_for_comparison=[1, 2],
+    out["med_schwierigkeiten_treppen_dummy_pl"] = create_dummy(


Since we want to merge with med_schwierigkeiten_treppen_pequiv in combine_modules/pequiv_pl.py, we have to define this variable here. Otherwise we would have to calculate it both when calculating the pl frailty score and when combining variables from the two modules.

So med_schwierigkeiten_treppen_pequiv is a dummy right from the start?

(in that case, I'd be seriously worried whether we actually want to combine the two variables)

Correct: the med_ variables in _pequiv are dummies, whereas in _pl they are categorical variables with some intensity information (e.g. ["[3] Gar nicht", "[2] Ein wenig", "[1] Stark"]), which we convert to a dummy where observations with ["[2] Ein wenig", "[1] Stark"] are coded as 1.

Can you tell me how much we gain by combining the variables?
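One way to approach the question of how much is gained is to count, per observation, which source provides a value. The sketch below assumes that "combining" means filling missing values in one variable from the other, which is not confirmed in this thread; the function `combination_gain` and the example data are hypothetical.

```python
from typing import Dict, Optional, Sequence


def combination_gain(
    pequiv: Sequence[Optional[int]],
    pl_dummy: Sequence[Optional[int]],
) -> Dict[str, int]:
    """Count how observations are covered by two aligned dummy variables.

    Assumes both sequences refer to the same observations in the same order
    and that None marks a missing value.
    """
    if len(pequiv) != len(pl_dummy):
        raise ValueError("Both variables must have the same length.")
    pairs = list(zip(pequiv, pl_dummy))
    return {
        "only_pequiv": sum(a is not None and b is None for a, b in pairs),
        "only_pl": sum(a is None and b is not None for a, b in pairs),
        "both": sum(a is not None and b is not None for a, b in pairs),
        "conflicting": sum(
            a is not None and b is not None and a != b for a, b in pairs
        ),
    }


# Made-up data for five observations.
counts = combination_gain([1, 0, None, None, 1], [1, None, 0, None, 0])
print(counts)
```

The "only_pl" and "only_pequiv" counts indicate how many observations a combination would add, while "conflicting" shows how often the dummy derived from the categorical variable disagrees with the one that is a dummy from the start.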

felixschmitz added 2 commits January 14, 2026 15:42

Replace non-meaningful numerical variable values by corresponding string categorical labels.

2baac12

Fix spelling and calculation errors.

badb8da

felixschmitz requested a review from hmgaudecker January 14, 2026 14:46

hmgaudecker approved these changes Jan 18, 2026
