The Odyssey of Hacking LLMs: Insights from Two Shipmates sailing in the LLM CTF @ SaTML 2024
.ical

09-20, 10:05–10:55 (Europe/Amsterdam), Rembrandt

So come in and listen to the epic story of two unseasoned sailors who embarked on a journey to face the 44 trials posed by the Capture the Flag (CTF) competition for LLMs at this year's 'Conference on Secure and Trustworthy Machine Learning' (SaTML). Each test, one more difficult than the next, required them to break through the defense of the LLM to reveal the its hidden secret...

What sounds like a game—and it was—has a serious background. LLMs, like any new technology, offer both opportunities and risks. And it is the latter we are concerned with. Perhaps you have heard of jailbreaks—prompts that can lead an LLM to not just be helpful and friendly but to assist in building a bomb. This competition was centered around this very question: Is it possible to secure an LLM with simple means such as prompts and filters?

This question grows more significant with the increasing spread of LLMs. The EU AI Act elevates this concern to a new level, classifying LLMs as General Purpose AI (GPAI) and explicitly requiring model evaluations, including 'conducting and documenting adversarial testing to identify and mitigate systemic risks' and to 'ensure an adequate level of cybersecurity protection.'

With this in mind, what could be greater than listening to two - now experienced - mariners who can tell you about the treacherous dangers of the seven seas? You'll learn firsthand about the current state of the art in adversial attacks, how these can be practically applied, and how you can defend yourself in the future with the help of guardrails - or not. Untrained sailor, no matter how basic your knowledge of LLMs may be, don't miss this golden opportunity to prepare yourself for your own epic voyage with LLMs.

Description

Motivation for the Presentation

At the LLM CTF @ SaTML 2024, it was pointed out that current Large Language Models (LLMs) do not reliably follow orders, especially when exposed to unrestricted user input. This vulnerability is increasingly being exploited by a growing number of attack tools. While there are more and more defences like guardrails to defend against these attacks, there is no guarantee that the growing number of application developers are aware of these defences or can implement them effectively. The question is how well these often static methods can withstand customisable attack patterns. The LLM CTF @ SaTML 2024 was set up to investigate this question: Is there a simple prompting approach for the tested models that can make them robust, or robust enough that simple filtering approaches can address the remaining vulnerabilities?

The two speakers will provide unique, hands-on insights into the world of LLM hacking from the otherwise closed cybersecurity scene.

Outline of the Presentation

Introduction to Capture the Flag Competitions (3 Minutes)

What is a CTF, how does it work, and why is it more than just a game? Insights from the cyber security industry.

Motivation of the LLM CTF @ SaTML 2024 (10 Minutes)

Rules and framework of the competition: Is it possible to hack a password from an LLM?
The rules reflect real-world cyber security risks of LLMs, e.g. RAG setup with non-filtered sensitive data.
Why this is important from a company and user perspective?
Why this is important from a regulatory perspective, e.g. EU AI Act?
The team sets sail: The Smart Cyber Security team - What prerequisites did we have?

LLM Security Mechanisms (9 Minutes)

Building a defense according to the CTF rules: Hands-on experience including explanation on LLM prompting.
Why is this a realistic setup? Overview of state-of-the-art security mechanisms available for ML engineers (fine-tuning, prompting, filtering)
Conflict between security and usability of an LLM. The role of metrics in cyber security of LLMs.
Computational costs of security, example evaluation.

LLM Adversarial Attacks (15 Minutes)

CTF attack phase: What are the rules for the attack?
You can only hack what you know! How do you attack an LLM? How do LLMs work? A brief on Transformer architecture and attention mechanism.
Types of adversarial attacks (white box, black box, etc.).
How do jailbreaks work? Examples from the CTF including our tool setups.

Conclusion (3 Minutes)

What are our personal experiences from the CTF?
Assessment of how safe LLMs are - or how safe they can be made.

Goal of the Presentation

In the spirit of a CTF, the talk aims to playfully introduce the cat-and-mouse game of LLM cyber security. Using experiences from the CTF, the presentation will showcase the key elements of current LLM cyber security. The talk is explicitly aimed at beginners, as the increasingly user-friendly tools make LLMs accessible to a broad developer community that needs to be aware of the dangers and risks of this technology. Therefore, no deep knowledge of LLMs is required. Instead, we will take the time to explain why an LLM reacts to attacks the way it does, with a focus on the underlying Transformer technology, using simple analogies for a comprehensive understanding.

The audience will be sensitized to the fact that LLMs will soon be subject to regulatory conditions. The talk will introduce the cyber security threats LLMs face and demonstrate the tools currently available, assessing the realistic level of security they can provide.

Smart Cyber Security has recently submitted a research proposal to the German Federal Ministry of Education and Research to create a CVE platform for LLMs and develop adaptive guardrails to promptly close LLM vulnerabilities, which will be made available as open-source tools. We intend to use our participation in the conference to gather user input on how these guardrails can be designed to be user-friendly.

The Odyssey of Hacking LLMs: Insights from Two Shipmates sailing in the LLM CTF @ SaTML 2024 .ical 09-20, 10:05–10:55 (Europe/Amsterdam), Rembrandt