axonstan/cav4apd


Concept Direction Ablation for Large Language Models

Description

This project builds on the paper "Refusal in Language Models Is Mediated by a Single Direction" (https://arxiv.org/abs/2406.11717).
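As a rough illustration of the paper's core idea (not this repository's code), a concept direction can be estimated as the difference between mean activations on two contrasting prompt sets, and then removed from a hidden state by projecting it out. In the sketch below, plain Python lists stand in for residual-stream activations, and the names `mean_vector`, `unit_direction`, and `ablate` are hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(concept_acts, baseline_acts):
    """Difference of means between two activation sets, scaled to unit length."""
    diff = [
        c - b
        for c, b in zip(mean_vector(concept_acts), mean_vector(baseline_acts))
    ]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def ablate(x, direction):
    """Remove the component of x that lies along a unit-length direction."""
    proj = sum(xi * di for xi, di in zip(x, direction))
    return [xi - proj * di for xi, di in zip(x, direction)]


# Toy activations (assumed values): the two sets differ only in coordinate 0.
concept_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
baseline_acts = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

direction = unit_direction(concept_acts, baseline_acts)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

After ablation the hidden state has no component along `direction`, while the other coordinates are unchanged. In a real model the same projection would be applied to tensors at the chosen layers and token positions.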

We take the idea further by experimenting with a politeness concept, by testing cross-language concept understanding, and by building an interface for convenient vector shifting.
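One plausible reading of "vector shifting" is moving a hidden state along a concept direction by a signed, user-chosen amount, for example a value set from a slider in the interface. The sketch below assumes that reading. The function `shift`, the coefficient `alpha`, and the toy vectors are hypothetical and not taken from the repository.

```python
def shift(x, direction, alpha):
    """Move hidden state x along a direction by a signed coefficient alpha."""
    return [xi + alpha * di for xi, di in zip(x, direction)]


# Toy values (assumed): coordinate 1 plays the role of a politeness direction.
hidden = [0.5, -1.0, 2.0]
politeness_direction = [0.0, 1.0, 0.0]

more_polite = shift(hidden, politeness_direction, 2.0)
less_polite = shift(hidden, politeness_direction, -2.0)
unchanged = shift(hidden, politeness_direction, 0.0)
```

A positive `alpha` pushes the state toward the concept, a negative one pushes it away, and zero leaves the state as it was.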

Qwen-1_8B-chat served as the base model for implementing the original paper's method and building the Gradio web interface. For the cross-language experiments we chose the 4-bit quantized YandexGPT-5-Lite-8B-instruct to test how well direction vectors obtained from English examples work for Russian.

Visualization

[Two images, no captions provided]

Data

https://github.com/llm-attacks/llm-attacks

How to run

Setup

uv python install 3.12
uv sync

Gradio App

uvx gradio app/demo.py
