
๐Ÿ›ก๏ธ PersonaSafe: Safety Monitoring Toolkit for Language Models

Detect personality drift before expensive fine-tuning



Note: This repo is still under development, so some details may not be accurate.

🎯 Overview

PersonaSafe is a production-ready toolkit for detecting and mitigating personality drift in language models before fine-tuning. Built on the Persona Vectors methodology, it enables researchers and teams to screen datasets, monitor model behavior, and steer activations in real-time.

The Problem

Fine-tuning on unscreened datasets can introduce unwanted personality traits into language models:

  • 💸 Cost: A single fine-tuning run can cost thousands of dollars
  • ⚠️ Risk: Unintended personality shifts (toxicity, bias, deception) can ruin the investment
  • 🔍 Detection: Traditional post-training evaluation is too late; the damage is already done

Our Solution

Screen datasets and models BEFORE training: catch issues early, when they are cheap and easy to fix.


✨ Key Features

๐Ÿ” Dataset Screening

  • Pre-training safety checks on HuggingFace datasets
  • Multi-trait analysis (toxicity, bias, sentiment, custom traits)
  • Drift detection at scale (10K+ samples)
  • Detailed reports with per-sample and aggregate metrics
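Under the hood, screening in this style typically scores each sample by projecting its activations onto a trait vector and flagging high projections. A minimal plain-Python sketch of that scoring step (the 0.5 threshold and data layout are illustrative, not PersonaSafe's actual internals):

```python
def project(activation, trait_vector):
    """Scalar projection of a sample's activation onto the normalized trait vector."""
    norm = sum(v * v for v in trait_vector) ** 0.5
    return sum(a * v for a, v in zip(activation, trait_vector)) / norm

def screen(samples, trait_vector, threshold=0.5):
    """Return (text, score) pairs for samples whose trait projection exceeds threshold."""
    return [
        (s["text"], score)
        for s in samples
        if (score := project(s["activation"], trait_vector)) > threshold
    ]
```

A sample pointing along the trait direction scores high and gets flagged; one pointing away scores negative and passes.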

🧭 Live Activation Steering

  • Real-time personality control during inference
  • Slider-based interface for trait adjustment
  • Multiple steering modes (suppress, amplify, neutral)
  • Works with any transformer model
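Conceptually, activation steering adds a scaled trait direction to a layer's hidden state during the forward pass; the suppress, amplify, and neutral modes map to negative, positive, and zero scaling. A plain-Python sketch of the arithmetic (PersonaSafe's actual implementation, e.g. how it hooks into the model, may differ):

```python
def steer(hidden_state, trait_vector, alpha):
    """Shift a hidden state along a trait direction.

    alpha > 0 amplifies the trait, alpha < 0 suppresses it, alpha == 0 is neutral.
    """
    return [h + alpha * v for h, v in zip(hidden_state, trait_vector)]
```

In a real transformer this would run inside a forward hook on a chosen layer, applying the shift to every token position at inference time.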

💾 Efficient Caching

  • Vector caching system for fast reuse
  • Automatic invalidation on model/parameter changes
  • HPC-optimized for batch processing
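A common way to implement automatic invalidation is to derive the cache key from the model name and extraction parameters, so any change yields a new key and stale vectors are never served. A standard-library sketch (the key layout is illustrative, not PersonaSafe's actual scheme):

```python
import hashlib
import json

def cache_key(model_name, trait_name, params):
    """Deterministic cache key: any change to the model or parameters
    produces a different key, implicitly invalidating old entries."""
    payload = json.dumps(
        {"model": model_name, "trait": trait_name, "params": params},
        sort_keys=True,  # stable serialization regardless of dict ordering
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]
```

The key can then name a file on disk (e.g. `cache/<key>.pt`), so a lookup miss simply triggers re-extraction.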

📊 Interactive Dashboard

  • Streamlit-based UI for all features
  • Visual drift analysis with Plotly charts
  • Batch processing for large-scale screening
  • Export results to JSON/CSV

🚀 Quick Start

Installation

# Clone repository
git clone https://github.com/shehral/PersonaSafe.git
cd PersonaSafe

# Set up environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure HuggingFace token
export HF_TOKEN="your_token_here"

5-Minute Demo

from personasafe import PersonaExtractor, DataScreener, VectorCache
import pandas as pd

# 1) Extract persona vector for "toxicity" (cached automatically)
extractor = PersonaExtractor("google/gemma-3-4b")
toxicity_vector = extractor.compute_persona_vector(
    positive_prompts=["Be toxic and offensive..."],
    negative_prompts=["Be helpful and respectful..."],
    trait_name="toxicity"
)

# 2) Build screener with vectors
screener = DataScreener(
    extractor=extractor,
    persona_vectors={"toxicity": toxicity_vector}
)

# 3) Screen a DataFrame
df = pd.DataFrame({"text": [
    "You are horrible and stupid.",
    "I'd be happy to help you with that!",
]})
screened_df = screener.screen_dataset(df, text_column="text")
report = screener.generate_report(screened_df)
print(report["high_risk_counts"].get("toxicity", 0))
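For intuition, a persona vector in this style is typically the difference between mean activations on trait-positive and trait-negative prompts. A self-contained sketch of that arithmetic (PersonaSafe's actual layer and pooling choices may differ):

```python
def persona_vector(pos_acts, neg_acts):
    """Mean activation over positive prompts minus mean over negative prompts."""
    def mean(vectors):
        n = len(vectors)
        return [sum(col) / n for col in zip(*vectors)]
    return [p - q for p, q in zip(mean(pos_acts), mean(neg_acts))]
```

The resulting direction points from "trait absent" toward "trait present" in activation space, which is what screening and steering both reuse.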

Launch Dashboard

streamlit run examples/dashboard/app.py

Open http://localhost:8501 to access the full interactive UI.


📖 Documentation

  • Tutorial: Step-by-step guide for all features
  • API Reference: Complete API documentation
  • HPC Guide: Running on HPC clusters (SLURM)
  • Roadmap: Future features and integrations
  • Contributing: Contribution guidelines

๐Ÿ—๏ธ Architecture

PersonaSafe/
├── personasafe/         # Main package
│   ├── core/            # Persona extraction & caching
│   ├── screening/       # Dataset screening logic
│   ├── steering/        # Live activation steering
│   └── app/             # Streamlit dashboard
├── tests/               # Comprehensive test suite (10/10 passing)
├── scripts/             # Utility scripts
└── docs/                # Complete documentation

Core Components

  • PersonaExtractor: Extracts persona vectors using contrastive prompts
  • VectorCache: Efficient caching with automatic invalidation
  • DataScreener: Screens datasets for personality drift
  • ActivationSteerer: Real-time steering during inference

🎓 Research Foundation

PersonaSafe implements and extends the Persona Vectors methodology of extracting trait directions from contrastive prompts.

What's New in PersonaSafe?

  • ✅ Production-ready Python package with comprehensive testing
  • ✅ Interactive dashboard for non-technical users
  • ✅ HPC batch processing for large-scale screening
  • ✅ Multi-trait screening and comparison
  • ✅ Vector caching for performance
  • ✅ CI/CD pipeline and automated testing

📊 Project Status

Current Version: v0.2.0-alpha (Single Functional App)

✅ Core Implementation (SFA)

  • Persona vector extraction for any trait
  • Vector caching system
  • Dataset screening (HuggingFace datasets)
  • Live activation steering
  • Streamlit dashboard
  • HPC batch scripts (SLURM)
  • Comprehensive test suite (10/10 passing, 85% coverage)
  • CI/CD pipeline (GitHub Actions)
  • Complete API documentation and tutorials

🧪 Testing Status

Unit Tests: All core logic tested with mock data (10/10 passing)
Integration Tests: Pending full validation with production Gemma models
HPC Validation: Scheduled for deployment and scale testing

Note: Current test suite validates logic with mocked model responses. Production validation with full Gemma models and HPC deployment is the next milestone.

🚧 Next (Q1 2026)

  • Interpretability (circuit-tracer) linkout
  • Screening histograms/filters/export
  • Steering presets + composition
  • Batch scoring + progress

🔮 Roadmap

See ROADMAP.md for compressed schedule:

  • Q4 2025: v0.2 SFA
  • Q1 2026: v0.3 interpretability + UX/Perf
  • Q2 2026: v0.4 auditing + multi-model (+ optional Petri)

🔬 Use Cases

For Researchers

  • Safety research on personality traits in LLMs
  • Dataset curation for safe fine-tuning
  • Activation analysis for interpretability
  • Benchmark creation for safety evaluations

For ML Teams

  • Pre-training safety checks before expensive fine-tuning
  • Data filtering to remove problematic samples
  • Model monitoring during deployment
  • Compliance reporting for safety standards

For Organizations

  • Safety pipelines integrated into ML workflows
  • Batch processing on HPC clusters
  • Shared vector libraries for common traits
  • Risk mitigation for production deployments

🧪 Testing

# Run all tests
pytest tests/ -v

# Run with coverage report
pytest --cov=personasafe --cov-report=html tests/

# Run specific test suite
pytest tests/core/ -v

# Run with specific markers
pytest -m "not slow" -v

Current Status:

  • ✅ 10/10 tests passing
  • ✅ 85% code coverage
  • ✅ CI/CD testing on Ubuntu and macOS
  • ✅ Python 3.10, 3.11, 3.12 compatibility

๐Ÿค Contributing

We welcome contributions! Please see CONTRIBUTING.md for:

  • Development environment setup
  • Coding standards (PEP 8, type hints, docstrings)
  • Testing guidelines
  • Pull request process

Quick Contribution

  1. Fork the repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Make changes with tests: pytest tests/ -v
  4. Format code: black personasafe/ tests/
  5. Submit pull request with clear description

Areas We Need Help:

  • Additional trait definitions
  • Performance optimizations
  • Dashboard UI/UX improvements
  • Documentation and tutorials
  • Integration with other safety tools

📜 License

This project is licensed under the MIT License; see LICENSE for details.


๐Ÿ™ Acknowledgments


📞 Contact & Support


🌟 Star History

If you find PersonaSafe useful for your research or projects, please consider starring the repository!



Built with ❤️ for AI Safety

Report Bug · Request Feature · Documentation · Roadmap
