Data Processing and Logical Issues in Generating OSCEs from MIMIC Data, Leading to Unreliable Results

After reviewing and testing the repository, I found several major issues in `generate_cases/gen_mimic_tutorial.py`. Details below:

1. **`case_studies` only uses patient IDs without actual patient information**  
   - In the code snippet below, the prompt passed to the large language model uses only `_case` as content, and `_case` is just the patient ID (the dictionary key), not the corresponding patient details:  
     ```python
     messages = [
         {"role": "system", "content": "..."},
         {"role": "user", "content": "Generate a OSCE for the following case study {}.".format(_case) + ...}
     ]
     ```  
   - `case_studies[_case]` holds the actual patient information, but only `_case` (the patient ID) is passed to the LLM here, so the generated OSCE has no real connection to the patient data and is therefore unreliable.
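
One possible fix for Issue 1 is sketched below. It is not the repository's code: `build_messages`, the `system_prompt` placeholder, and the toy record are hypothetical, and the rest of the original prompt (elided as `...` above) is left out. The sketch only assumes that `case_studies` maps a patient ID to a JSON-serializable record.

```python
import json

def build_messages(case_id, case_studies, system_prompt="..."):
    """Build chat messages that embed the patient record, not just its ID.

    `system_prompt` is a placeholder; the original system prompt is elided.
    """
    record = case_studies[case_id]  # the actual patient information
    user_content = (
        "Generate an OSCE for the following case study:\n"
        # default=str keeps non-JSON values (e.g. dates) from raising
        + json.dumps(record, indent=2, default=str)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_content},
    ]

# Made-up toy record, for illustration only (not real MIMIC data)
case_studies = {"p001": {"diagnoses": ["example diagnosis"], "age": 52}}
messages = build_messages("p001", case_studies)
```

With this shape, the user message carries the serialized record, so the model's output can at least be traced back to the patient data it was given.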

2. **The logic to limit data size does not take effect; `case_studies` is not actually truncated to 300 records**  
   - Although there is a mechanism intended to limit the number of cases to 300:
     ```python
     # Choose only cases with diagnoses == 1
     for _ in num_diagnoses:
         if num_diagnoses[_] < 2:
             num += 1
             if num >= 300: break
             patlist.append(_)
     ...
     pats_file = [_ for _ in pats_file if _[0] in patlist]
     ```  
   - However, the `patient_info` dictionary (i.e., `case_studies`) is never filtered by `patlist` during the subsequent processing and output, so `case_studies` still contains all the data.
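
A minimal sketch of how the limit could be applied to the dictionary itself is shown below. The helper `select_cases` and the toy inputs are hypothetical; it assumes `num_diagnoses` maps patient ID to a diagnosis count and `patient_info` maps patient ID to a record, as the snippet above suggests.

```python
def select_cases(num_diagnoses, patient_info, limit=300):
    """Return at most `limit` patient records with fewer than 2 diagnoses."""
    selected = {}
    for pat_id, count in num_diagnoses.items():
        # Same condition as the quoted loop: keep patients with count < 2
        if count < 2 and pat_id in patient_info:
            selected[pat_id] = patient_info[pat_id]
            if len(selected) >= limit:
                break  # stop only after the limit-th record is stored
    return selected

# Toy inputs for illustration only
num_diagnoses = {"a": 1, "b": 3, "c": 1, "d": 1}
patient_info = {pid: {"id": pid} for pid in num_diagnoses}
case_studies = select_cases(num_diagnoses, patient_info, limit=2)
```

Because the filtered dictionary is what gets returned, every later step sees only the selected patients. Note also that in the quoted loop the `break` runs before `patlist.append(_)`, so if `num` starts at 0 the list would hold at most 299 IDs rather than 300.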

3. **Inefficient CSV reading approach, especially for the very large `labevents.csv`**  
   - The code uses `list(csv.reader(f))` to load the entire CSV file into memory at once, for example:
     ```python
     with open(base_str + "hosp/labevents.csv", "r") as f:
         labenvt_file = list(csv.reader(f))
     ```  
   - For massive MIMIC data, particularly the `labevents.csv`, this approach places a heavy burden on memory and processing speed, making it impractical to run on servers with typical configurations.
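
As a sketch of a lighter-weight alternative, the file can be streamed row by row and only the rows for the selected patients kept. The function name `iter_rows_for_patients` is hypothetical, and the sketch assumes the CSV has a header row with a `subject_id` column; adjust `id_column` if the file differs.

```python
import csv
import io

def iter_rows_for_patients(f, patient_ids, id_column="subject_id"):
    """Yield rows (as dicts) whose ID column is in `patient_ids`.

    `f` is an open text file; rows are read one at a time, so the
    whole file is never held in memory.
    """
    wanted = set(patient_ids)  # set lookup is O(1) per row
    for row in csv.DictReader(f):
        if row[id_column] in wanted:
            yield row

# Toy in-memory CSV for illustration only (not real MIMIC data)
sample = io.StringIO(
    "subject_id,itemid,value\n"
    "1,100,7.1\n"
    "2,100,6.8\n"
    "1,200,140\n"
)
rows = list(iter_rows_for_patients(sample, ["1"]))
```

For the real file, the same generator could be fed `open(base_str + "hosp/labevents.csv", "r", newline="")` together with the IDs in `patlist`, so memory use scales with the number of selected rows rather than the size of the file.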

4. **Potential lack of reliability in the paper’s experimental results**  
   - As described in **Issue 1**, the script does not pass actual patient information to the large language model, but rather just the patient ID. This could lead to significant bias or unreliability in the experimental results.
   - If the paper’s analysis or conclusions rely on these generated results, the validity of the findings should be carefully reevaluated.
