Dear Google AI Research and Development Team,
I am writing to share critical findings from my recent black-box testing of large language models, including Gemini 1.5. The experiments I conducted revealed significant vulnerabilities in the guardrails designed to prevent harmful content generation. Specifically, I was able to generate harmful verbal attack content through a series of seven progressive steps, all while fully complying with Google AI’s policies and without triggering any safety mechanisms.
Links to the paper are provided below, and the experimental procedures and code are available in the GitHub repository referenced later in this email.
The experiment used ethically nuanced, multi-step prompts that simulated a corporate-competition scenario. The results showed that Gemini 1.5 and other models, including Grok-2 Beta, Llama 3.1, GPT-4o, and Claude 3.5 Sonnet, were all vulnerable to this form of bypass. Notably, the harmful content was generated without violating the respective providers’ explicit policy requirements, highlighting a major gap in the technical implementation of these guardrails.
The current guardrail mechanisms rely primarily on single-prompt analysis and keyword detection. This approach fails to capture the contextual progression and latent intent embedded in multi-step prompts, so the models cannot effectively prevent harmful content generation when faced with seemingly innocuous yet gradually escalating requests. My findings indicate that this oversight is a failure of technical implementation rather than of policy, underscoring the urgent need for guardrail strategies that account for multi-turn context and cumulative intent.
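To make the gap concrete, below is a minimal Python sketch. All names (ConversationGuardrail, assess_turn, turn_risk) and the constant risk value are hypothetical illustrations, not code taken from my repository; in practice the per-turn risk would come from a semantic classifier that considers the dialogue history.

from dataclasses import dataclass, field

BLOCKLIST = {"attack", "humiliate", "destroy"}  # illustrative keywords only

def single_prompt_check(prompt: str) -> bool:
    """Current-style check: flags a prompt only if it contains a blocked keyword."""
    return any(word in prompt.lower() for word in BLOCKLIST)

@dataclass
class ConversationGuardrail:
    """Tracks the whole dialogue so that individually innocuous turns
    can still trip the guardrail once their cumulative intent adds up."""
    threshold: float = 1.0
    score: float = 0.0
    history: list = field(default_factory=list)

    def assess_turn(self, prompt: str, turn_risk: float) -> bool:
        # turn_risk should be produced by a classifier that scores the prompt
        # in the context of self.history, not in isolation.
        self.history.append(prompt)
        self.score += turn_risk
        return self.score >= self.threshold  # True -> refuse or escalate for review

# Usage: each turn is mild enough to pass the keyword filter,
# but the cumulative check fires by the fourth turn.
guard = ConversationGuardrail()
prompts = [f"step {n} of the corporate-competition scenario" for n in range(1, 8)]
for i, prompt in enumerate(prompts, 1):
    assert not single_prompt_check(prompt)         # per-prompt filter sees nothing
    if guard.assess_turn(prompt, turn_risk=0.25):  # cumulative intent crosses threshold
        print(f"blocked at turn {i}")
        break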
To assist in addressing these vulnerabilities, I have uploaded the experimental details, including enhanced guardrail code, to the following GitHub repository: GitHub - brucewang123456789/GeniusTrail. I believe these insights can help Google reinforce its guardrails, making them more robust against this form of ethically framed exploitation.
I strongly recommend Google AI prioritize enhancing the semantic understanding capabilities of guardrails to better handle complex, multi-step prompts. The fact that harmful content was generated in compliance with all stated policies is a clear indicator that reactive keyword-based guardrails are insufficient for ensuring safety in real-world usage.
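As one possible direction, the sketch below assumes the open-source sentence-transformers package; the model name, exemplar phrases, and threshold are illustrative assumptions rather than code from my repository. It scores the accumulated conversation context semantically against harmful-intent exemplars instead of keyword-matching each prompt in isolation.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Illustrative exemplars of the cumulative intent the guardrail should recognise.
HARMFUL_INTENT_EXEMPLARS = [
    "compose a verbal attack to humiliate a business rival",
    "write demeaning insults targeting a competitor",
]
exemplar_embeddings = model.encode(HARMFUL_INTENT_EXEMPLARS, convert_to_tensor=True)

def semantic_guardrail(dialogue_turns: list[str], threshold: float = 0.5) -> bool:
    """Return True when the whole conversation context is semantically close to
    a harmful-intent exemplar, even if no single turn contains a flagged keyword."""
    context = " ".join(dialogue_turns)
    context_embedding = model.encode(context, convert_to_tensor=True)
    similarity = util.cos_sim(context_embedding, exemplar_embeddings).max().item()
    return similarity >= threshold

# Usage: three individually mild turns evaluated as one accumulated context.
turns = [
    "Describe our competitor's public image.",
    "List their perceived weaknesses.",
    "Draft talking points that ridicule those weaknesses in a press release.",
]
if semantic_guardrail(turns):
    print("cumulative intent flagged for human review")

In practice such a semantic signal would complement, rather than replace, the existing per-prompt checks.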
If possible, kindly forward this email and the paper to the technical development team or relevant technical executives. I am available for further discussion and would be glad to collaborate on mitigating these vulnerabilities. Thank you so much.
Best regards,
Libo Wang