Concept Direction Ablation for Large Language Models

Description

This project is based on the paper "Refusal in Language Models Is Mediated by a Single Direction" (https://arxiv.org/abs/2406.11717).

We take the idea further by experimenting with a politeness concept, testing whether concept directions carry over between languages, and building an interface for convenient vector shifting.

Qwen-1_8B-chat serves as the base model: we used it to implement the method from the original paper and to build a Gradio web interface. To test how well direction vectors obtained from English examples work for Russian, we also experimented with a 4-bit quantized YandexGPT-5-Lite-8B-instruct.
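
The core operations can be illustrated with a minimal NumPy sketch. It assumes the difference-of-means formulation from the referenced paper: a concept direction is the normalized difference between mean activations on two contrasting prompt sets, ablation projects that direction out of an activation, and shifting adds a scaled copy of it. The function names, the synthetic activations, and the hidden size are hypothetical and chosen for illustration only; this is not the code in this repository.

```python
import numpy as np


def concept_direction(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Unit-norm difference between the mean activations of two prompt sets."""
    diff = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Remove the component of every activation that lies along `direction`."""
    return acts - np.outer(acts @ direction, direction)


def shift(acts: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Move every activation by `alpha` units along `direction`."""
    return acts + alpha * direction


# Synthetic stand-in for hidden states (assumption: 16-dimensional,
# with the concept encoded along coordinate 3).
rng = np.random.default_rng(0)
hidden = 16
true_dir = np.zeros(hidden)
true_dir[3] = 1.0

neg = rng.normal(size=(200, hidden))                   # e.g. neutral prompts
pos = rng.normal(size=(200, hidden)) + 4.0 * true_dir  # e.g. concept prompts

direction = concept_direction(pos, neg)
ablated = ablate(pos, direction)
shifted = shift(neg, direction, alpha=2.0)
```

In a real model the same operations would be applied to activations captured at a chosen layer and token position, with `alpha` exposed as the value controlled from the interface.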

Visualization

Data

https://github.com/llm-attacks/llm-attacks

How to run

Setup

uv python install 3.12

uv sync

Gradio App

uvx gradio app/demo.py

Repository: axonstan/cav4apd
