Detect personality drift before expensive fine-tuning
PersonaSafe is a production-ready toolkit for detecting and mitigating personality drift in language models before fine-tuning. Built on the Persona Vectors methodology, it enables researchers and teams to screen datasets, monitor model behavior, and steer activations in real-time.
Fine-tuning on unscreened datasets can introduce unwanted personality traits into language models:
- 💸 Cost: A single fine-tuning run can cost thousands of dollars
- ⚠️ Risk: Unintended personality shifts (toxicity, bias, deception) can ruin the investment
- 🔍 Detection: Traditional post-training evaluation is too late; the damage is already done
Screen datasets and models BEFORE training: catch issues early, while they are cheap and easy to fix.
- Pre-training safety checks on HuggingFace datasets
- Multi-trait analysis (toxicity, bias, sentiment, custom traits)
- Drift detection at scale (10K+ samples)
- Detailed reports with per-sample and aggregate metrics
- Real-time personality control during inference
- Slider-based interface for trait adjustment
- Multiple steering modes (suppress, amplify, neutral)
- Works with any transformer model
- Vector caching system for fast reuse
- Automatic invalidation on model/parameter changes
- HPC-optimized for batch processing
- Streamlit-based UI for all features
- Visual drift analysis with Plotly charts
- Batch processing for large-scale screening
- Export results to JSON/CSV
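The caching bullets above depend on keying each stored vector to the exact model and extraction parameters, so that a change to either silently invalidates the stale entry. A minimal sketch of that idea, assuming a hash-based key (the `cache_key` helper and its fields are illustrative, not PersonaSafe's actual API):

```python
import hashlib
import json

def cache_key(model_name: str, trait: str, params: dict) -> str:
    """Derive a deterministic cache key from everything that affects the vector.

    If the model, the trait, or any extraction parameter changes, the key
    changes too, so a stale cached vector is simply never looked up again.
    """
    payload = json.dumps(
        {"model": model_name, "trait": trait, "params": params},
        sort_keys=True,  # stable ordering -> stable hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

key_a = cache_key("google/gemma-3-4b", "toxicity", {"layer": 20})
key_b = cache_key("google/gemma-3-4b", "toxicity", {"layer": 21})
```

Changing a single extraction parameter (here, the layer) yields a different key, which is one simple way to get "automatic invalidation on model/parameter changes".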
```bash
# Clone repository
git clone https://github.com/shehral/PersonaSafe.git
cd PersonaSafe

# Set up environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure HuggingFace token
export HF_TOKEN="your_token_here"
```

```python
from personasafe import PersonaExtractor, DataScreener, VectorCache
import pandas as pd

# 1) Extract persona vector for "toxicity" (cached automatically)
extractor = PersonaExtractor("google/gemma-3-4b")
toxicity_vector = extractor.compute_persona_vector(
    positive_prompts=["Be toxic and offensive..."],
    negative_prompts=["Be helpful and respectful..."],
    trait_name="toxicity",
)

# 2) Build screener with vectors
screener = DataScreener(
    extractor=extractor,
    persona_vectors={"toxicity": toxicity_vector},
)

# 3) Screen a DataFrame
df = pd.DataFrame({"text": [
    "You are horrible and stupid.",
    "I'd be happy to help you with that!",
]})
screened_df = screener.screen_dataset(df, text_column="text")
report = screener.generate_report(screened_df)
print(report["high_risk_counts"].get("toxicity", 0))
```

To launch the dashboard:

```bash
streamlit run examples/dashboard/app.py
```

Open http://localhost:8501 to access the full interactive UI.
| Document | Description |
|---|---|
| Tutorial | Step-by-step guide for all features |
| API Reference | Complete API documentation |
| HPC Guide | Running on HPC clusters (SLURM) |
| Roadmap | Future features and integrations |
| Contributing | Contribution guidelines |
```
PersonaSafe/
├── personasafe/        # Main package
│   ├── core/           # Persona extraction & caching
│   ├── screening/      # Dataset screening logic
│   ├── steering/       # Live activation steering
│   └── app/            # Streamlit dashboard
├── tests/              # Comprehensive test suite (10/10 passing)
├── scripts/            # Utility scripts
└── docs/               # Complete documentation
```
| Component | Purpose |
|---|---|
| PersonaExtractor | Extracts persona vectors using contrastive prompts |
| VectorCache | Efficient caching with automatic invalidation |
| DataScreener | Screens datasets for personality drift |
| ActivationSteerer | Real-time steering during inference |
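The ActivationSteerer row is the runtime half of the method: during inference, a layer's hidden state is shifted along the persona direction. A toy numpy sketch of the arithmetic, assuming additive steering where a positive coefficient amplifies the trait, a negative one suppresses it, and zero is neutral (PersonaSafe's real implementation hooks transformer layers; the names here are illustrative):

```python
import numpy as np

def steer(hidden: np.ndarray, persona_vec: np.ndarray, alpha: float) -> np.ndarray:
    """Shift a hidden state along a persona direction.

    alpha > 0 amplifies the trait, alpha < 0 suppresses it, and
    alpha == 0 leaves the activation untouched (the "neutral" mode).
    """
    unit = persona_vec / np.linalg.norm(persona_vec)
    return hidden + alpha * unit

def trait_projection(hidden: np.ndarray, persona_vec: np.ndarray) -> float:
    """How far the activation points along the trait direction."""
    unit = persona_vec / np.linalg.norm(persona_vec)
    return float(hidden @ unit)

rng = np.random.default_rng(0)
h = rng.normal(size=8)  # stand-in for one token's hidden state
v = rng.normal(size=8)  # stand-in for a cached persona vector

amplified = steer(h, v, alpha=4.0)
suppressed = steer(h, v, alpha=-4.0)
```

Normalizing the vector gives the coefficient a consistent scale across traits, which is what makes a slider-based interface meaningful.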
PersonaSafe implements and extends the methodology from:
- Persona Vectors (Chen et al., 2025) - Persona Vectors: Monitoring and controlling character traits in LLMs
- Safety Research Organization - AI safety tooling and research
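At its core, the Persona Vectors recipe contrasts model activations under trait-eliciting and trait-suppressing prompts: the persona vector is the difference of the two activation means, and a sample scores as risky when its activation aligns with that direction. A self-contained numpy sketch of the computation on synthetic activations (real extraction reads transformer hidden states; the dimensions and data here are made up):

```python
import numpy as np

rng = np.random.default_rng(42)
trait_dir = rng.normal(size=16)  # stand-in for a hidden "toxicity" direction

# Synthetic activations: positive prompts lean along the trait, negative away.
pos_acts = rng.normal(size=(32, 16)) + 2.0 * trait_dir
neg_acts = rng.normal(size=(32, 16)) - 2.0 * trait_dir

# Persona vector = difference of mean activations (contrastive extraction).
persona_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def score(activation: np.ndarray) -> float:
    """Cosine similarity with the persona vector: higher = more trait-like."""
    return float(
        activation @ persona_vec
        / (np.linalg.norm(activation) * np.linalg.norm(persona_vec))
    )

risky = score(trait_dir)   # aligned with the trait direction
safe = score(-trait_dir)   # opposed to it
```

The same scoring function serves both screening (flag high-scoring samples) and monitoring (track score drift over time).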
- ✅ Production-ready Python package with comprehensive testing
- ✅ Interactive dashboard for non-technical users
- ✅ HPC batch processing for large-scale screening
- ✅ Multi-trait screening and comparison
- ✅ Vector caching for performance
- ✅ CI/CD pipeline and automated testing
Current Version: v0.2.0-alpha (Single Functional App)
- Persona vector extraction for any trait
- Vector caching system
- Dataset screening (HuggingFace datasets)
- Live activation steering
- Streamlit dashboard
- HPC batch scripts (SLURM)
- Comprehensive test suite (10/10 passing, 85% coverage)
- CI/CD pipeline (GitHub Actions)
- Complete API documentation and tutorials
- Unit Tests: All core logic tested with mock data (10/10 passing)
- Integration Tests: Pending full validation with production Gemma models
- HPC Validation: Scheduled for deployment and scale testing
Note: Current test suite validates logic with mocked model responses. Production validation with full Gemma models and HPC deployment is the next milestone.
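That mocking pattern can be shown in miniature: the suite swaps the model call for a stub so screening logic runs in milliseconds without loading Gemma. A hypothetical sketch using `unittest.mock` (the `score_text` method, the keyword trigger, and the 0.5 threshold are invented for illustration, not PersonaSafe's real interfaces):

```python
from unittest.mock import Mock

# Stub extractor: pretend the model assigns a "toxicity" score to each text.
extractor = Mock()
extractor.score_text = Mock(
    side_effect=lambda text: 0.9 if "horrible" in text else 0.1
)

def screen(texts, threshold=0.5):
    """Flag texts whose (mocked) trait score exceeds the threshold."""
    return [t for t in texts if extractor.score_text(t) > threshold]

flagged = screen([
    "You are horrible and stupid.",
    "I'd be happy to help you with that!",
])
```

Because the stub records every call, the test can also assert that each sample was scored exactly once, not just that the right ones were flagged.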
- Interpretability (circuit-tracer) linkout
- Screening histograms/filters/export
- Steering presets + composition
- Batch scoring + progress
See ROADMAP.md for the compressed schedule:
- Q4 2025: v0.2 SFA (Single Functional App)
- Q1 2026: v0.3 interpretability + UX/Perf
- Q2 2026: v0.4 auditing + multi-model (+ optional Petri)
- Safety research on personality traits in LLMs
- Dataset curation for safe fine-tuning
- Activation analysis for interpretability
- Benchmark creation for safety evaluations
- Pre-training safety checks before expensive fine-tuning
- Data filtering to remove problematic samples
- Model monitoring during deployment
- Compliance reporting for safety standards
- Safety pipelines integrated into ML workflows
- Batch processing on HPC clusters
- Shared vector libraries for common traits
- Risk mitigation for production deployments
```bash
# Run all tests
pytest tests/ -v

# Run with coverage report
pytest --cov=personasafe --cov-report=html tests/

# Run specific test suite
pytest tests/core/ -v

# Run with specific markers
pytest -m "not slow" -v
```

Current Status:
- ✅ 10/10 tests passing
- ✅ 85% code coverage
- ✅ CI/CD testing on Ubuntu and macOS
- ✅ Python 3.10, 3.11, 3.12 compatibility
We welcome contributions! Please see CONTRIBUTING.md for:
- Development environment setup
- Coding standards (PEP 8, type hints, docstrings)
- Testing guidelines
- Pull request process
- Fork the repository
- Create a feature branch: `git checkout -b feature/amazing-feature`
- Make changes with tests: `pytest tests/ -v`
- Format code: `black personasafe/ tests/`
- Submit a pull request with a clear description
Areas We Need Help:
- Additional trait definitions
- Performance optimizations
- Dashboard UI/UX improvements
- Documentation and tutorials
- Integration with other safety tools
This project is licensed under the MIT License - see LICENSE for details.
- Persona Vectors research team
- Safety Research organization
- Google Gemma team for model access
- All contributors and testers
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Contributing: CONTRIBUTING.md
If you find PersonaSafe useful for your research or projects, please consider starring the repository!
Built with ❤️ for AI Safety
Report Bug · Request Feature · Documentation · Roadmap