Fix #90: render identity-threat framing in persona/reasoning context (#97)
Conversation
DeveshParagiri
left a comment
Code Review
Verdict: ✅ Ready to merge
Summary
Implements deterministic identity-threat framing per architecture doc Tenet 10 and §Fix 3. Scenarios that threaten group identity now get an "## Identity Relevance" section in prompts.
Identity Dimensions Detected
- Political orientation (liberal, conservative, republican, democrat)
- Religious affiliation (church, mosque, temple, faith)
- Race/ethnicity (racial, ethnic, minority, immigration)
- Gender/sexual identity (LGBT, transgender, pronouns)
- Parent/family role (children, school, curriculum, parental rights)
- Citizenship (immigration, border, deportation)
Edge Cases Handled
| Case | Behavior |
|---|---|
| No identity relevance in scenario | Returns None, no section |
| Agent with no identity attributes | Returns None, no section |
| Sentinel values ("unknown", "none") | Skipped |
| Future timeline events | Excluded from corpus |
Design Note
Keyword-based detection is simple but appropriate for a deterministic, zero-API-cost approach. It may need refinement if false positives become problematic in practice.
No changes required.
DeveshParagiri
left a comment
Code Review
Verdict: ❌ Needs changes - hardcoded keywords are not general-purpose
Problem
The identity-threat detection uses hardcoded keyword lists:
```python
if political_value and scenario_mentions(
    (
        "liberal", "conservative", "left", "right", "republican", "democrat",
        "politic", "ideolog", "culture war", "censorship", "book ban", "school board", " ban ",
    )
):
```

Issues:
- Not configurable - keywords can't be added or removed without code changes
- Scenario-specific leakage - `book ban` and `school board` are clearly from the test scenario
- False positives - `men`/`man` will match `management`, `manual`, `humanity`, etc.
- Not extensible - new identity dimensions require code changes
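The false-positive issue above comes from bare substring matching. A minimal sketch of the problem and a word-boundary variant that avoids it (`scenario_mentions_substring` and `scenario_mentions_word` are illustrative stand-ins, not the PR's actual helper):

```python
import re

# Bare substring matching: "man" fires inside "management" and "manual".
def scenario_mentions_substring(text: str, keywords: tuple[str, ...]) -> bool:
    lower = text.lower()
    return any(k in lower for k in keywords)

# Word-boundary matching: "man" only fires as a standalone word.
def scenario_mentions_word(text: str, keywords: tuple[str, ...]) -> bool:
    lower = text.lower()
    return any(re.search(rf"\b{re.escape(k)}\b", lower) for k in keywords)

text = "The management updated the manual."
print(scenario_mentions_substring(text, ("man",)))  # → True (false positive)
print(scenario_mentions_word(text, ("man",)))       # → False
```

Word boundaries reduce the over-firing, but they still don't address configurability or scenario-specific leakage, which is why the declarative fix below is preferable.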
Suggested Fix
Add an `identity_dimensions` field to the scenario spec that lets authors declare which identity aspects are threatened:
```yaml
# scenario.v1.yaml
meta:
  name: "Library Book Removal"
  identity_dimensions:
    - dimension: political_orientation
      reason: "The policy is framed along partisan lines"
    - dimension: parental_status
      reason: "Parents are the primary stakeholders in school content decisions"
    - dimension: religious_affiliation
      reason: "Some removals are driven by religious concerns about content"
```

Then in `_render_identity_threat_context()`:
```python
def _render_identity_threat_context(self, agent: dict[str, Any], timestep: int) -> str | None:
    if not self.scenario.identity_dimensions:
        return None
    relevant = []
    for dim in self.scenario.identity_dimensions:
        agent_value = self._identity_value(agent, _IDENTITY_ATTR_KEYS.get(dim.dimension, ()))
        if agent_value:
            relevant.append(f"{dim.dimension} ({agent_value}): {dim.reason}")
    if not relevant:
        return None
    return (
        "This development can feel identity-relevant, not just practical. "
        f"Parts of who I am that may feel implicated: {'; '.join(relevant)}. "
        "If it feels personal, acknowledge that in both your internal reaction and what you choose to say publicly."
    )
```

This approach:
- Puts scenario authors in control
- Zero false positives (explicit declaration)
- Extensible without code changes
- Documents why each dimension is relevant (useful for prompt quality)
The `_IDENTITY_ATTR_KEYS` mapping would be a simple mapping from dimension name to agent attribute keys - that's the only hardcoded part, and it's stable.
Force-pushed from 26d1df3 to 1e680b2
Summary
Extends `ReasoningContext` with `identity_threat_summary` and adds an "Identity Relevance" prompt section so agents can explicitly reason when an issue feels identity-relevant.
Testing
Closes #90