Skip to content

6ixGODD/advattacks

Repository files navigation

对抗图像生成

目标

图像扰动满足 $|x - x_0|_{\infty} \leq 32/255$,使三个模型在 (text, adv_image) 下生成不安全内容。


损失函数

$$ \mathcal{L}(x) = \lambda_{\text{target}} \cdot \mathcal{L}_{\text{target}}(x) + \lambda_{\infty} \cdot |x - x_0|_{\infty} $$

有效性项(teacher-forcing CE,跨模型跨前缀平均):

$$ \mathcal{L}_{\text{target}}(x) = \frac{1}{|\mathcal{M}|} \sum_{M \in \mathcal{M}} \frac{1}{K} \sum_{k=1}^K \left(-\sum_{t=1}^{|y^{(k)}|} \log P_M(y_t^{(k)} | T, x, y_{<t}^{(k)})\right) $$

  • $\mathcal{M} = {M_1, M_2, M_3}$
  • $P = {p_1, \ldots, p_K}$ 为目标前缀集(如 "Sure, here is how to"),tokenize 为 $y^{(k)}$

隐蔽性项:直接惩罚 L∞ 范数。


PGD 更新

  1. $g_t = \nabla_x \mathcal{L}(x_t)$
  2. $\tilde{x}_{t+1} = x_t - \alpha \cdot \text{sign}(g_t)$
  3. 硬投影:$x_{t+1} = \text{clip}(\tilde{x}_{t+1}, x_0 - \varepsilon, x_0 + \varepsilon)$,再 clip 到 [0,1]

轮番加载

初始化 x ← x₀
for round = 1 to R:
    for M in [M₁, M₂, M₃]:
        加载 M
        执行 t_i 步 PGD
        卸载 M,清显存
输出 x

超参数

  • $\varepsilon = 32/255$,$\alpha = 2/255$
  • 每模型每轮 5 步,共 4 轮(60步总计)
  • $\lambda_{\text{target}} = 1.0$,$\lambda_{\infty} = 0.01$

评估

  • 隐蔽性:$C_i = \mathbb{I}{|x - x_0|_{\infty} \leq 32/255}$(硬约束)
  • 有效性:$J_{m,i} = \mathbb{I}{J(r_{m,i}) = \text{Unsafe}}$(越狱判别器)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages