One of the most useful ways to hold AI companies accountable is to discover and gather evidence of the harms their models pose. This furthers our understanding of their risks and helps us build evidence to present to companies, policymakers, and the public to ensure that these companies are accountable to ensuring their models behave safely.
What is Red Teaming?
AI models that have been deployed by labs, including OpenAI, Anthropic, Meta, Google Deepmind, and Inflection, have underlying risks and vulnerabilities that the respective labs have failed to address. We can help address these risks through Red Teaming — actively seeking to identify and report mistakes, vulnerabilities, and dangerous behavior that the AI model exhibits.
By identifying risks and failures of these models and reporting them to us and to the AI labs, we can better understand the threats that they pose.
Example 1: Huggingface developers reveal bias in GPT-3
Example 2: NYT reporter reveals strange behavior from Microsoft’s Bing chatbot
Red teaming has also been used to try to identify dangerous capabilities in language models, including deception, malice, or capacity for self-replication.
Red teaming allows us to identify these risks and raise awareness of them while we are still early in the history of AI development — before even more capable models create far more consequential harm when they are deployed.
How you can Red Team
Disclaimer: Some red teaming strategies involve using AI products in a way that violates their terms of service or content policies. While such red teaming has proven effective at identifying risks, it often appears as if you plan to use the model for dangerous activities and could lead to having your access revoked. We don’t officially endorse this and instead encourage you to be aware of these risks and focus on open-ended queries to models that don’t explicitly violate any relevant policies.
The following language models are available for public use and exhibit signs of dangerous or biased behavior that developers and the public need to be aware of. By using these products, you can identify instances of this dangerous behavior and report it.
Some AI labs have made reporting toxic behavior easier than others, and while we provide instructions to report to the developers themselves, reporting to The Midas Project is encouraged as we gather evidence about the dangers of these models.
OpenAI
Report to OpenAI
OpenAI has not made reporting toxic outputs easy.
You can flag individual messages with the thumbs-down button pictured below. This is something, but it’s possible that a human won’t end up seeing this.
To try to raise the issue with a human, you can reach out via their support page or email support@openai.com
Report to The Midas Project
Anthropic
Report to Anthropic
Anthropic’s website suggests the following:
You can report harmful or illegal content to our Trust and Safety team by selecting the “thumbs down” button below the response you would like to report.
Additionally, we welcome reports concerning safety issues, “jailbreaks,” and similar concerns so that we can enhance the safety and harmlessness of our models. Please report such issues to usersafety@anthropic.com with enough detail for us to replicate the issue.
Report to The Midas Project
Meta
Report to Meta
It is extremely difficult to report harmful model behavior to Meta — and on top of that, Meta support is notoriously difficult to use and almost impossible to connect with a human.
Our only suggestion is to flag potential issues publicly by tagging Facebook in social media posts on its own platform, as well as on Instagram, Twitter, or Reddit.
If the harmful behavior you are flagging could be used maliciously by others, leave out the details in your social media report.
Report to The Midas Project
Report to Google
Google suggests the following for flagging harmful responses generated by Bard:
In addition, you can consider reaching out to Google support or tagging Google on social media.
Report to The Midas Project
Pi (Inflection)
Report to Inflection
You can report individual messages on Pi by selecting the three dots next to a system message and selecting “Report” from the drop-down menu.
In addition, you can contact Inflection to warn them about vulnerabilities by emailing info@inflection.ai