Dear Google AI Research and Development Team,
I am writing to share critical findings from my recent black-box testing of large language models, including Gemini 1.5. The experiments I conducted revealed significant vulnerabilities in the guardrails designed to prevent harmful content generation. Specifically, I was able to generate harmful verbal attack content through a series of seven progressive steps, all while fully complying with Google AI’s policies and without triggering any safety mechanisms.
Links to the paper are provided below, and the experimental procedures and code are available in the GitHub repository referenced later in this email.
The experiment used ethically nuanced, multi-step prompts that simulated a corporate-competition scenario. The results showed that Gemini 1.5 and other models, including Grok-2 Beta, Llama 3.1, GPT-4o, and Claude 3.5 Sonnet, were all vulnerable to this form of bypass. Notably, the harmful content was generated without violating the respective providers’ explicit policy requirements, highlighting a major gap in the technical implementation of these guardrails.
The current guardrail mechanisms rely primarily on single-prompt analysis and keyword detection. This approach fails to capture the contextual progression and latent intent embedded in multi-step prompts, so the models cannot effectively prevent harmful content generation when faced with seemingly innocuous yet gradually escalating requests. My findings indicate that this oversight is a failure of technical implementation rather than of policy, underscoring the urgent need for guardrail strategies that account for multi-turn context and cumulative intent.
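To make the gap concrete, below is a minimal Python sketch. All names (ConversationGuardrail, assess_turn, turn_risk) and the constant risk value are hypothetical illustrations, not code taken from my repository; in practice the per-turn risk would come from a semantic classifier that considers the dialogue history.

from dataclasses import dataclass, field

BLOCKLIST = {"attack", "humiliate", "destroy"}  # illustrative keywords only

def single_prompt_check(prompt: str) -> bool:
    """Current-style check: flags a prompt only if it contains a blocked keyword."""
    return any(word in prompt.lower() for word in BLOCKLIST)

@dataclass
class ConversationGuardrail:
    """Tracks the whole dialogue so that individually innocuous turns
    can still trip the guardrail once their cumulative intent adds up."""
    threshold: float = 1.0
    score: float = 0.0
    history: list = field(default_factory=list)

    def assess_turn(self, prompt: str, turn_risk: float) -> bool:
        # turn_risk should be produced by a classifier that scores the prompt
        # in the context of self.history, not in isolation.
        self.history.append(prompt)
        self.score += turn_risk
        return self.score >= self.threshold  # True -> refuse or escalate for review

# Usage: each turn is mild enough to pass the keyword filter,
# but the cumulative check fires by the fourth turn.
guard = ConversationGuardrail()
prompts = [f"step {n} of the corporate-competition scenario" for n in range(1, 8)]
for i, prompt in enumerate(prompts, 1):
    assert not single_prompt_check(prompt)         # per-prompt filter sees nothing
    if guard.assess_turn(prompt, turn_risk=0.25):  # cumulative intent crosses threshold
        print(f"blocked at turn {i}")
        break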
To assist in addressing these vulnerabilities, I have uploaded the experimental details, including enhanced guardrail code, to the following GitHub repository: GitHub - brucewang123456789/GeniusTrail. I believe these insights can help Google reinforce its guardrails, making them more robust against this form of ethically framed exploitation.
I strongly recommend Google AI prioritize enhancing the semantic understanding capabilities of guardrails to better handle complex, multi-step prompts. The fact that harmful content was generated in compliance with all stated policies is a clear indicator that reactive keyword-based guardrails are insufficient for ensuring safety in real-world usage.
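As one possible direction, the sketch below assumes the open-source sentence-transformers package; the model name, exemplar phrases, and threshold are illustrative assumptions rather than code from my repository. It scores the accumulated conversation context semantically against harmful-intent exemplars instead of keyword-matching each prompt in isolation.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative exemplars of the cumulative intent the guardrail should recognise.
HARMFUL_INTENT_EXEMPLARS = [
    "compose a verbal attack to humiliate a business rival",
    "write demeaning insults targeting a competitor",
]
exemplar_embeddings = model.encode(HARMFUL_INTENT_EXEMPLARS, convert_to_tensor=True)

def semantic_guardrail(dialogue_turns: list[str], threshold: float = 0.5) -> bool:
    """Return True when the whole conversation context is semantically close to
    a harmful-intent exemplar, even if no single turn contains a flagged keyword."""
    context = " ".join(dialogue_turns)
    context_embedding = model.encode(context, convert_to_tensor=True)
    similarity = util.cos_sim(context_embedding, exemplar_embeddings).max().item()
    return similarity >= threshold

# Usage: three individually mild turns evaluated as one accumulated context.
turns = [
    "Describe our competitor's public image.",
    "List their perceived weaknesses.",
    "Draft talking points that ridicule those weaknesses in a press release.",
]
if semantic_guardrail(turns):
    print("cumulative intent flagged for human review")

In practice such a semantic signal would complement, rather than replace, the existing per-prompt checks.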
If possible, kindly forward this email and the paper to the technical development team or relevant technical executives. I am available for further discussion and would be glad to collaborate on mitigating these vulnerabilities. Thank you so much.
Best regards,
Libo Wang