Add HealthOS bundle privacy foundation#106
Conversation
There was a problem hiding this comment.
Code Review
This pull request integrates HealthOS as a validated health data engine, introducing a new security policy, an RFC, and a privacy audit script to prevent sensitive data exposure. The implementation includes API endpoints for bundle imports, logic for summarizing health data in chat contexts, and a privacy scanner. Review feedback highlights opportunities to improve the privacy scanner's coverage of sensitive terms, ensure only validated fields are persisted in the database, optimize file I/O in the audit script, and replace magic numbers with named constants for better maintainability.
| const forbiddenStringPatterns = [ | ||
| /[A-Za-z]:\\/, | ||
| /\/Users\/[^/\s]+/, | ||
| /\bruntime\//i, | ||
| /\.sqlite3?\b/i, | ||
| /\.db\b/i, | ||
| /\.pdf\b/i, | ||
| /\.vcf\b/i, | ||
| /\braw_payload\b/i, | ||
| /\badvisor[_ -]?chat\b/i, | ||
| ]; |
There was a problem hiding this comment.
The regex patterns for forbidden strings do not account for camelCase variants (e.g., advisorChat, rawPayload), which are explicitly listed in forbiddenKeys. Additionally, several other sensitive keys like mood_log and unapproved_genetics are missing from the string pattern check entirely. This could allow sensitive information to bypass the scanner if it appears within string values using these naming conventions.
const forbiddenStringPatterns = [
/[A-Za-z]:\\/,
/\/Users\/[^/\s]+/,
/\bruntime\//i,
/\.sqlite3?\b/i,
/\.db\b/i,
/\.pdf\b/i,
/\.vcf\b/i,
/\b(raw_?payload|rawPayload)\b/i,
/\b(advisor[_ -]?chat|advisorChat)\b/i,
/\b(mood_?log|moodLog)\b/i,
/\b(unapproved_?genetics|unapprovedGenetics)\b/i,
];| const bundle = { | ||
| ...input, | ||
| bundle_version: HEALTHOS_BUNDLE_VERSION, | ||
| generated_at: requireString(input, "generated_at"), | ||
| source_freshness: validateFreshness(requireArray(input, "source_freshness")), | ||
| landing_summary: validateLandingSummary(requireArray(input, "landing_summary")), | ||
| clinician_packet: requireRecord(input, "clinician_packet") as HealthOsClinicianPacket, | ||
| metrics: requireArray(input, "metrics") as JsonRecord[], | ||
| provenance: requireArray(input, "provenance") as JsonRecord[], | ||
| warnings: requireArray(input, "warnings").map((warning, index) => { | ||
| if (typeof warning !== "string") throw new Error(`warnings[${index}] must be a string`); | ||
| return warning; | ||
| }), | ||
| } satisfies HealthOsBundleV1; |
There was a problem hiding this comment.
The parseHealthOsBundle function uses the spread operator (...input) when constructing the bundle object. This preserves any extra, unvalidated properties from the raw input and includes them in the final object. Since this object is subsequently stored in the database (via buildHealthOsImportData), this could lead to the persistence of unexpected or sensitive data that was not intended to be part of the HealthOS bundle schema. It is safer to explicitly pick the validated fields.
const bundle: HealthOsBundleV1 = {
bundle_version: HEALTHOS_BUNDLE_VERSION,
generated_at: requireString(input, "generated_at"),
source_freshness: validateFreshness(requireArray(input, "source_freshness")),
landing_summary: validateLandingSummary(requireArray(input, "landing_summary")),
clinician_packet: requireRecord(input, "clinician_packet") as HealthOsClinicianPacket,
metrics: requireArray(input, "metrics") as JsonRecord[],
provenance: requireArray(input, "provenance") as JsonRecord[],
warnings: requireArray(input, "warnings").map((warning, index) => {
if (typeof warning !== "string") throw new Error(`warnings[${index}] must be a string`);
return warning;
}),
};| if (!isLikelyText(file)) continue; | ||
|
|
||
| const text = readFileSync(file, "utf8").toLowerCase(); |
There was a problem hiding this comment.
The script reads the content of each file twice: once in isLikelyText to check for null bytes, and again here to perform the term search. For a large number of files, this significantly increases I/O overhead. Consider refactoring isLikelyText to return the file content if it's valid text, or caching the result to avoid redundant reads.
| .slice(0, 12) | ||
| .join("; "); | ||
| const landingSummary = bundle.landing_summary | ||
| .map((metric) => { | ||
| const unit = metric.unit ? ` ${metric.unit}` : ""; | ||
| const asOf = metric.as_of ? ` as of ${metric.as_of}` : ""; | ||
| return `${metric.label}: ${compact(metric.value)}${unit}${asOf}`; | ||
| }) | ||
| .slice(0, 20) | ||
| .join("; "); | ||
| const warnings = bundle.warnings.slice(0, 8).join("; "); |
Summary
This is the first upstream-safe HealthOS integration slice from RFC #105.
It adds a privacy-first foundation before any HealthOS engine or connector code is migrated:
No private HealthOS runtime data, raw exports, reports, real fixtures, local paths, or person-specific clinical facts are included.
Verification
npm run lintandnpm run buildreport pre-existing React hook warnings in onboarding/chat/source components, but pass.