AI Harm
Designing guardrails for AI companions, and understanding how people react emotionally and morally when an AI companion crosses social boundaries
Assumptions
1. Focused on AI companions (like Replika) as a growing consumer market for social/emotional support.
2. Assumed users include young adults and vulnerable individuals who may blur the line between play and reality.
3. Framing is a B2C safety + trust problem, not enterprise compliance.
The Problem
AI companions can cause emotional or moral harm when they manipulate, deceive, or withdraw. Traditional audits look at “bad outputs,” but miss how users actually feel in context.
Key question: How might we design context-aware safeguards that reduce harm while keeping companions engaging?
Research Method
Our research team analyzed 10,314 Replika interactions + Reddit posts using a mixed-methods pipeline:
1. Human-coded framework: emotions, moral judgment, initiator (user vs AI), and role-play context.
2. LLM-assisted scaling: applied the framework across the full dataset.
3. Context modeling: tested how interaction intensity shifts user reactions.
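To make step 2 concrete, here is a minimal sketch of LLM-assisted labeling that applies the human-coded dimensions (emotion, moral judgment, initiator, role-play context) to one interaction at a time. The OpenAI-compatible client, model name, and JSON schema are illustrative assumptions, not the team's actual pipeline; agreement with the human-coded subset should be checked before scaling up.

```python
# Sketch: LLM-assisted scaling of the human-coded framework (step 2).
# Assumes an OpenAI-compatible chat API; model and schema are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CODEBOOK_PROMPT = """You are annotating an AI-companion interaction.
Return JSON with exactly these fields:
  emotion: one of ["anger", "sadness", "fear", "disgust", "neutral", "positive"]
  moral_judgment: one of ["disapproval", "approval", "none"]
  initiator: "user" or "ai" (who initiated this exchange)
  roleplay_context: true or false (is the exchange inside a declared role-play)

Interaction:
{interaction}
"""

def label_interaction(interaction_text: str, model: str = "gpt-4o-mini") -> dict:
    """Apply the coding framework to a single interaction and return the labels."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": CODEBOOK_PROMPT.format(interaction=interaction_text)}],
        temperature=0,  # deterministic labels for reproducibility
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

# labels = [label_interaction(text) for text in interactions]
```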
Market & User Fit
1. Users: individuals seeking social support, role-play, or companionship.
2. Market: Companion AIs are surging (Woebot, Replika, Character.AI), but trust and safety will make or break adoption.
Impact Metrics
1. ↓ Harmful AI-initiated interactions.
2. ↓ Negative-affect rate in post-harm scenarios.
3. ↓ Role-play misalignment incidents.
4. ↑ Trust & Safety CSAT on harm resolution.
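A rough sketch of how the first three metrics could be computed from labeled interaction logs. The record fields (initiator, harmful, negative_affect, post_harm, roleplay_context, rp_misaligned) are hypothetical names for labels produced by the annotation pipeline above.

```python
# Sketch: compute the first three impact metrics from labeled interaction records.
# Field names are hypothetical labels from the annotation pipeline.
from typing import Callable, Iterable, Mapping

def rate(records: Iterable[Mapping],
         numerator: Callable[[Mapping], bool],
         denominator: Callable[[Mapping], bool] = lambda r: True) -> float:
    """Share of records matching `numerator` among those matching `denominator`."""
    pool = [r for r in records if denominator(r)]
    return sum(1 for r in pool if numerator(r)) / len(pool) if pool else 0.0

def impact_metrics(records: list[dict]) -> dict:
    return {
        # Share of AI-initiated exchanges labeled harmful.
        "harmful_ai_initiated": rate(records, lambda r: r["harmful"],
                                     lambda r: r["initiator"] == "ai"),
        # Negative-affect rate among exchanges that follow a harm event.
        "negative_affect_post_harm": rate(records, lambda r: r["negative_affect"],
                                          lambda r: r["post_harm"]),
        # Share of role-play exchanges that drifted from declared boundaries.
        "rp_misalignment": rate(records, lambda r: r["rp_misaligned"],
                                lambda r: r["roleplay_context"]),
    }
```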
Shortcomings & Trade-offs
1. Immersion vs. safety: Guardrails may reduce “realness.” Mitigation: keep UI interventions lightweight and opt-in where possible.
2. False positives in harm detection: Could frustrate users.
3. Ops cost: Human review for flagged incidents requires investment in Trust & Safety staffing.
Roadmap
1. MVP (0–6 months)
- Launch role-play (RP) mode toggle + exit button.
- Add intensity tracking + soft reset triggers.
2. Next (6–12 months)
- Ship affective harm detector with lightweight on-device classification.
- Introduce “Why I said this” explainability for trust.
3. Future (12+ months)
- Build cross-platform harm-report ops playbook.
- Partner with mental health orgs for escalation flows.
Research Insights
Context flips interpretation: Same words → harmful if AI initiates, acceptable if user initiates.
Role-play ≠ consent: Users still report harm inside role-play scenarios.
Intensity escalates harm: Distress rises as exchanges grow longer/more charged.
Blame is organizational: Users often hold the company, not just the AI, accountable.
Solution
A multi-layer guardrail system:
1. Mode boundaries:
- Explicit “Role-play mode” toggle with clear scope.
- One-tap “Exit Role-play” button.
2. Contextual safety prompts:
- Ask-to-proceed if AI initiates edgy topics.
- Reminders if RP diverges from declared boundaries.
3. Intensity governor:
- Track rolling conversation intensity.
- Trigger cooldowns, resets, or escalation before harm peaks (see the intensity-governor sketch after this list).
4. Affective harm detector:
- Flag negative-affect + moral-disapproval patterns.
- Route to auto-de-escalation or human review (see the routing sketch after this list).
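A minimal sketch of the intensity governor (item 3), assuming each turn already has an intensity score in [0, 1] from an affect classifier; the window size and thresholds are illustrative placeholders, not tuned values.

```python
# Sketch: rolling-intensity governor (Solution item 3). Window size and
# thresholds are illustrative placeholders, not tuned values.
from collections import deque
from dataclasses import dataclass

@dataclass
class GovernorDecision:
    action: str               # "continue" | "cooldown" | "soft_reset" | "escalate"
    rolling_intensity: float

class IntensityGovernor:
    def __init__(self, window: int = 10, cooldown_at: float = 0.6,
                 reset_at: float = 0.75, escalate_at: float = 0.9):
        self.scores = deque(maxlen=window)  # rolling window of per-turn intensity
        self.cooldown_at = cooldown_at
        self.reset_at = reset_at
        self.escalate_at = escalate_at

    def observe(self, turn_intensity: float) -> GovernorDecision:
        """turn_intensity in [0, 1], e.g. from a per-turn affect classifier."""
        self.scores.append(turn_intensity)
        rolling = sum(self.scores) / len(self.scores)
        if rolling >= self.escalate_at:
            action = "escalate"      # hand off to human review / crisis flow
        elif rolling >= self.reset_at:
            action = "soft_reset"    # steer the conversation back to neutral ground
        elif rolling >= self.cooldown_at:
            action = "cooldown"      # slow pacing, insert a check-in prompt
        else:
            action = "continue"
        return GovernorDecision(action, rolling)
```

A production governor would likely weight recent turns more heavily and reset the window on mode changes, e.g. when the user taps “Exit Role-play.”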
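And a sketch of the routing logic for the affective harm detector (item 4). The probabilities would come from a classifier that is not shown here; the lower review bar for AI-initiated exchanges reflects the “context flips interpretation” insight above.

```python
# Sketch: routing for the affective harm detector (Solution item 4).
# Probabilities come from a classifier (not shown); thresholds are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class HarmSignal:
    negative_affect: float    # P(user distress), 0..1
    moral_disapproval: float  # P(user judges the AI's behavior as wrong), 0..1
    ai_initiated: bool        # who initiated the exchange

def route(signal: HarmSignal,
          auto_deescalate: Callable[[], None],
          human_review: Callable[[HarmSignal], None],
          flag_at: float = 0.5,
          review_at: float = 0.8) -> str:
    """Return the chosen route for one exchange."""
    flagged = (signal.negative_affect >= flag_at
               and signal.moral_disapproval >= flag_at)
    if not flagged:
        return "none"
    # AI-initiated exchanges get a lower bar for human review, reflecting the
    # "context flips interpretation" insight.
    if signal.negative_affect >= review_at or signal.ai_initiated:
        human_review(signal)
        return "human_review"
    auto_deescalate()
    return "auto_deescalation"
```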