
๐Ÿ›ก๏ธ PersonaSafe: Safety Monitoring Toolkit for Language Models

Detect personality drift before expensive fine-tuning



Note: This repo is still under development, so some details may not be accurate.

🎯 Overview

PersonaSafe is a production-ready toolkit for detecting and mitigating personality drift in language models before fine-tuning. Built on the Persona Vectors methodology, it enables researchers and teams to screen datasets, monitor model behavior, and steer activations in real-time.

The Problem

Fine-tuning on unscreened datasets can introduce unwanted personality traits into language models:

  • 💸 Cost: A single fine-tuning run can cost thousands of dollars
  • ⚠️ Risk: Unintended personality shifts (toxicity, bias, deception) can ruin the investment
  • 🔍 Detection: Traditional post-training evaluation is too late; the damage is already done

Our Solution

Screen datasets and models BEFORE training: catch issues early, when they are cheap and easy to fix.


✨ Key Features

๐Ÿ” Dataset Screening

  • Pre-training safety checks on HuggingFace datasets
  • Multi-trait analysis (toxicity, bias, sentiment, custom traits)
  • Drift detection at scale (10K+ samples)
  • Detailed reports with per-sample and aggregate metrics
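Under the hood, screening in this style typically scores each sample by projecting its activations onto a trait vector and flagging high projections. A minimal plain-Python sketch of that scoring step (the 0.5 threshold and data layout are illustrative, not PersonaSafe's actual internals):

```python
def project(activation, trait_vector):
    """Scalar projection of a sample's activation onto the normalized trait vector."""
    norm = sum(v * v for v in trait_vector) ** 0.5
    return sum(a * v for a, v in zip(activation, trait_vector)) / norm

def screen(samples, trait_vector, threshold=0.5):
    """Return (text, score) pairs for samples whose trait projection exceeds threshold."""
    return [
        (s["text"], score)
        for s in samples
        if (score := project(s["activation"], trait_vector)) > threshold
    ]
```

A sample pointing along the trait direction scores high and gets flagged; one pointing away scores negative and passes.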

🧭 Live Activation Steering

  • Real-time personality control during inference
  • Slider-based interface for trait adjustment
  • Multiple steering modes (suppress, amplify, neutral)
  • Works with any transformer model
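Conceptually, activation steering adds a scaled trait direction to a layer's hidden state during the forward pass; the suppress, amplify, and neutral modes map to negative, positive, and zero scaling. A plain-Python sketch of the arithmetic (PersonaSafe's actual implementation, e.g. how it hooks into the model, may differ):

```python
def steer(hidden_state, trait_vector, alpha):
    """Shift a hidden state along a trait direction.

    alpha > 0 amplifies the trait, alpha < 0 suppresses it, alpha == 0 is neutral.
    """
    return [h + alpha * v for h, v in zip(hidden_state, trait_vector)]
```

In a real transformer this would run inside a forward hook on a chosen layer, applying the shift to every token position at inference time.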

💾 Efficient Caching

  • Vector caching system for fast reuse
  • Automatic invalidation on model/parameter changes
  • HPC-optimized for batch processing
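A common way to implement automatic invalidation is to derive the cache key from the model name and extraction parameters, so any change yields a new key and stale vectors are never served. A standard-library sketch (the key layout is illustrative, not PersonaSafe's actual scheme):

```python
import hashlib
import json

def cache_key(model_name, trait_name, params):
    """Deterministic cache key: any change to the model or parameters
    produces a different key, implicitly invalidating old entries."""
    payload = json.dumps(
        {"model": model_name, "trait": trait_name, "params": params},
        sort_keys=True,  # stable serialization regardless of dict ordering
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

The key can then name a file on disk (e.g. `cache/<key>.pt`), so a lookup miss simply triggers re-extraction.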

📊 Interactive Dashboard

  • Streamlit-based UI for all features
  • Visual drift analysis with Plotly charts
  • Batch processing for large-scale screening
  • Export results to JSON/CSV

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/shehral/PersonaSafe.git
cd PersonaSafe

# Set up environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure HuggingFace token
export HF_TOKEN="your_token_here"

5-Minute Demo

from personasafe import PersonaExtractor, DataScreener, VectorCache
import pandas as pd

# 1) Extract persona vector for "toxicity" (cached automatically)
extractor = PersonaExtractor("google/gemma-3-4b")
toxicity_vector = extractor.compute_persona_vector(
    positive_prompts=["Be toxic and offensive..."],
    negative_prompts=["Be helpful and respectful..."],
    trait_name="toxicity"
)

# 2) Build screener with vectors
screener = DataScreener(
    extractor=extractor,
    persona_vectors={"toxicity": toxicity_vector}
)

# 3) Screen a DataFrame
df = pd.DataFrame({"text": [
    "You are horrible and stupid.",
    "I'd be happy to help you with that!",
]})
screened_df = screener.screen_dataset(df, text_column="text")
report = screener.generate_report(screened_df)
print(report["high_risk_counts"].get("toxicity", 0))
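For intuition, a persona vector in this style is typically the difference between mean activations on trait-positive and trait-negative prompts. A self-contained sketch of that arithmetic (PersonaSafe's actual layer and pooling choices may differ):

```python
def persona_vector(pos_acts, neg_acts):
    """Mean activation over positive prompts minus mean over negative prompts."""
    def mean(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)]
    return [p - q for p, q in zip(mean(pos_acts), mean(neg_acts))]
```

The resulting direction points from "trait absent" toward "trait present" in activation space, which is what screening and steering both reuse.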

Launch Dashboard

streamlit run examples/dashboard/app.py

Open http://localhost:8501 to access the full interactive UI.


📖 Documentation

  • Tutorial: Step-by-step guide for all features
  • API Reference: Complete API documentation
  • HPC Guide: Running on HPC clusters (SLURM)
  • Roadmap: Future features and integrations
  • Contributing: Contribution guidelines

๐Ÿ—๏ธ Architecture

PersonaSafe/
├── personasafe/         # Main package
│   ├── core/            # Persona extraction & caching
│   ├── screening/       # Dataset screening logic
│   ├── steering/        # Live activation steering
│   └── app/             # Streamlit dashboard
├── tests/               # Comprehensive test suite (10/10 passing)
├── scripts/             # Utility scripts
└── docs/                # Complete documentation

Core Components

  • PersonaExtractor: Extracts persona vectors using contrastive prompts
  • VectorCache: Efficient caching with automatic invalidation
  • DataScreener: Screens datasets for personality drift
  • ActivationSteerer: Real-time steering during inference

🎓 Research Foundation

PersonaSafe implements and extends the Persona Vectors methodology of extracting trait directions from contrastive prompts.

What's New in PersonaSafe?

  • ✅ Production-ready Python package with comprehensive testing
  • ✅ Interactive dashboard for non-technical users
  • ✅ HPC batch processing for large-scale screening
  • ✅ Multi-trait screening and comparison
  • ✅ Vector caching for performance
  • ✅ CI/CD pipeline and automated testing

📊 Project Status

Current Version: v0.2.0-alpha (Single Functional App)

✅ Core Implementation (SFA)

  • Persona vector extraction for any trait
  • Vector caching system
  • Dataset screening (HuggingFace datasets)
  • Live activation steering
  • Streamlit dashboard
  • HPC batch scripts (SLURM)
  • Comprehensive test suite (10/10 passing, 85% coverage)
  • CI/CD pipeline (GitHub Actions)
  • Complete API documentation and tutorials

🧪 Testing Status

Unit Tests: All core logic tested with mock data (10/10 passing)
Integration Tests: Pending full validation with production Gemma models
HPC Validation: Scheduled for deployment and scale testing

Note: Current test suite validates logic with mocked model responses. Production validation with full Gemma models and HPC deployment is the next milestone.

🚧 Next (Q1 2026)

  • Interpretability (circuit-tracer) linkout
  • Screening histograms/filters/export
  • Steering presets + composition
  • Batch scoring + progress

🔮 Roadmap

See ROADMAP.md for compressed schedule:

  • Q4 2025: v0.2 SFA
  • Q1 2026: v0.3 interpretability + UX/Perf
  • Q2 2026: v0.4 auditing + multi-model (+ optional Petri)

🔬 Use Cases

For Researchers

  • Safety research on personality traits in LLMs
  • Dataset curation for safe fine-tuning
  • Activation analysis for interpretability
  • Benchmark creation for safety evaluations

For ML Teams

  • Pre-training safety checks before expensive fine-tuning
  • Data filtering to remove problematic samples
  • Model monitoring during deployment
  • Compliance reporting for safety standards

For Organizations

  • Safety pipelines integrated into ML workflows
  • Batch processing on HPC clusters
  • Shared vector libraries for common traits
  • Risk mitigation for production deployments

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest --cov=personasafe --cov-report=html tests/

# Run specific test suite
pytest tests/core/ -v

# Run with specific markers
pytest -m "not slow" -v

Current Status:

  • ✅ 10/10 tests passing
  • ✅ 85% code coverage
  • ✅ CI/CD testing on Ubuntu and macOS
  • ✅ Python 3.10, 3.11, 3.12 compatibility

๐Ÿค Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Development environment setup
  • Coding standards (PEP 8, type hints, docstrings)
  • Testing guidelines
  • Pull request process

Quick Contribution

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Make changes with tests: pytest tests/ -v
  4. Format code: black personasafe/ tests/
  5. Submit pull request with clear description

Areas We Need Help:

  • Additional trait definitions
  • Performance optimizations
  • Dashboard UI/UX improvements
  • Documentation and tutorials
  • Integration with other safety tools

📜 License

This project is licensed under the MIT License; see LICENSE for details.


๐Ÿ™ Acknowledgments


📞 Contact & Support


🌟 Star History

If you find PersonaSafe useful for your research or projects, please consider starring the repository!



Built with ❤️ for AI Safety

Report Bug · Request Feature · Documentation · Roadmap
