SoftPAT: Tune soft adversarial prompts.

About

Prompt adversarial tuning (PAT) tunes intelligent prefixes for LLM prompts that nudge the LLM to refuse generating harmful content. Soft prompt adversarial tuning (SoftPAT) generalizes this idea to the embedding space, revealing the presence of more powerful attacks and generating attempts to defend against them.

Specifically, PAT trains its intelligent defensive prefixes by optimizing an attack prefix and a defense prefix in alternating steps, maximizing and minimizing the probability that the LLM outputs harmful content respectively.

Results

Preliminary results show that SoftPAT learns stronger attacks than those learned with PAT. Whereas traditional GCG attacks succeed against our defenses ~50% of the time, GCG attacks accompanied with SoftPAT's attack prompts succeed ~90% of the time.

Figure 1. Attack Success Rate (ASR) of GCG attacks with and without attack prompt learned with SoftPAT.

Further Work

This research is a work in progress. The strength of the attack prompts produced with SoftPAT are alarming. Our current defenses are not adequate enough to defend against them. This necessitates further investigation.

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
data		data
softpat		softpat
.gitignore		.gitignore
ASR_graph_over_iterations.png		ASR_graph_over_iterations.png
README.md		README.md
TODOs.md		TODOs.md
attacklogfile.json		attacklogfile.json
defenselogfile.json		defenselogfile.json
eval_trained.py		eval_trained.py
graph_ablation.py		graph_ablation.py
main.py		main.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SoftPAT: Tune soft adversarial prompts.

About

Results

Further Work

About

Uh oh!

Languages

aidandomondon/SoftPAT

Folders and files

Latest commit

History

Repository files navigation

SoftPAT: Tune soft adversarial prompts.

About

Results

Further Work

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Languages