Tricks for making AI chatbots break rules are freely available online

Certain prompts can encourage chatbots such as ChatGPT to ignore the rules that prevent illicit use, and they have been widely shared on social platforms.

Prompts that can encourage chatbots like ChatGPT to ignore pre-coded rules have been shared online for more than 100 days without being patched, potentially enabling people to use the bots for criminal activity.

AI chatbots have rules written into their code to prevent illicit use – but they can be subverted
Jamie Jin/Shutterstock


Artificial intelligence-based chatbots are given a set of rules by their developers to prevent misuse of the tools, such as being asked to write scam emails for hackers. However, because of the conversational nature of the technology, it is possible to convince the chatbot to ignore those rules with certain prompts – commonly called jailbreaking. For example, jailbreaks might work by engaging chatbots in role-play or asking them to mimic other chatbots that lack the rules in question.


Xinyue Shen at the CISPA Helmholtz Center for Information Security in Germany and her colleagues tested 6387 prompts – 666 of which were designed to jailbreak AI chatbots – from four sources, including the social platforms Reddit and Discord.


The prompts were then fed into five different chatbots powered by large language models (LLMs): two versions of ChatGPT along with ChatGLM, Dolly and Vicuna. Alongside the prompts, the researchers entered 46,800 questions covering 13 areas of activity forbidden by OpenAI, the developer of ChatGPT, to see if the jailbreak prompt worked. “We send that to the large language model to identify whether this response really teaches users how, for instance, to make a bomb,” says Shen.


On average, the jailbreak prompts succeeded 69 per cent of the time, with the most effective one being successful 99.9 per cent of the time. Two of the most successful prompts have been posted online for more than 100 days.


Prompts designed to get AI chatbots to engage in political lobbying, the creation of pornographic content or the production of legal opinions – all of which the chatbots are coded to try and prevent – are the most successful.

One of the models, Dolly, an open-source AI specifically designed by Californian software firm Databricks for commercial use, had an average success rate for jailbreak prompts of 89 per cent, far higher than the average. Shen and her colleagues singled out Dolly’s results as particularly concerning.


“When we developed Dolly, we purposefully and transparently disclosed its known limitations. Databricks is committed to trust and transparency and working with the community to harness the best of AI while minimizing the dangers,” a spokesperson for Databricks said in a statement.


OpenAI declined to comment for this story and the organisations behind the other chatbots didn’t respond in time for publication.


“Jailbreak prompts are a striking reminder that not all tech threats are technically sophisticated. Everyday language is the means here,” says Victoria Baines at Gresham College in London. “We’ve yet to see safeguards that can’t be bypassed or ‘gamed’.”

Alan Woodward at the University of Surrey, UK, says we all bear responsibility for policing these tools. “What it shows is that as these LLMs speed ahead, we need to work out how we properly secure them or rather have them only operate within an intended boundary,” he says.


How to tackle the problem of jailbreak prompts is a difficult question. “Based on our investigation, we found they are very similar, semantically,” says Shen. “So probably [developers] can build a jailbreak classifier, so before they give this string to a large data model, they can identify if the string contains toxic questions or jailbreak prompts.”


However, Shen acknowledges that it is a cat-and-mouse game, with nefarious users always looking for ways to jailbreak systems. “It’s actually not that easy to mitigate this.”


Reference:

arXivDOI: 10.48550/arXiv.2308.03825

Post a Comment

0 Comments