Skip to main content

HackerOne AI Red Teaming (AIRT)

Organizations: Objective-based adversarial testing for AI systems beyond checklists

Updated this week

HackerOne AI Red Teaming (AIRT) is an objective-based engagement that delivers scoped, adversarial testing across your AI models and systems. It is designed to move beyond standard checklists by using human ingenuity to probe for safety, security, and regulation issues under real-world abuse conditions. This approach exposes blind spots that automated tools and internal teams often miss.

Our offering provides the flexibility and agility to tailor the engagement to your specific needs. It combines in-depth testing by a curated community of AI security researchers with strategic guidance from our specialized AI Security experts in our technical engagements team.

When to Choose AI Red Teaming

AIRT is the right solution when your goals are to:

  • Simulate Real-World Abuse: Go beyond theoretical vulnerabilities to test for unintended behaviors like jailbreaks, toxic outputs, and model misuse in production-like scenarios.

  • Validate Rule Alignment: Ensure your AI system's behavior aligns with internal acceptable use standards and external regulatory requirements.

  • Stress-Test Defenses Creatively: Test your AI's defenses against a diverse set of adversarial approaches that automated, checklist-driven testing cannot replicate.

  • Support Governance and Risk Frameworks: Generate the evidence needed to comply with frameworks such as Gartner's AI TRiSM, the NIST AI Risk Management Framework (RMF), the EU AI Act, and the MITRE ATLAS.

How AI Red Teaming Works

AIRT is an expert-led engagement that combines deep testing with strategic guidance. Each engagement is delivered as a 15- or 30-day focused project.

  1. Pre-Engagement Scoping (~1 Week): We begin by working with you to determine the models and systems in scope. We help you identify your organization’s unique AI safety and security risk priorities to establish clear objectives for the engagement.

  2. Threat Modeling and Design: A dedicated Solutions Architect (SA) leads an in-depth threat modeling workshop with your stakeholders. This collaborative process distills your business concerns into a tailored test plan with clear success criteria, ensuring the engagement targets your highest-priority risks.

  3. Talent Sourcing: We source the right talent from our community of over a thousand AI-focused security researchers. Researchers are selected based on their specific skills, domain expertise, and alignment with the threat model to ensure a successful outcome. We also handle any specific demographic or geographical requirements for the testing team.

  4. Testing and Engagement Management (15 or 30 days): The testing period commences, with researchers attempting to achieve the defined objectives. Your dedicated SA manages the engagement by providing real-time updates, assisting with the evaluation of subjective AI-related reports, and ensuring consistent communication between your team and the researchers.

  5. Reporting (~1 week): At the conclusion of the testing period, you receive a comprehensive, audit-ready report. This final deliverable details the threat model, discovered failure modes, successful jailbreaks, and other key findings, along with recommendations for strengthening your AI systems.

  6. Remediation and Retesting (Ongoing): We help you map findings to business risks and provide guidance on remediation. While some issues can be retested quickly, many AI safety vulnerabilities require a broader approach, such as retraining the model, rather than a simple code fix.

Key Outcomes and Benefits

  • Uncover High-Impact AI Vulnerabilities: Reveal universal jailbreaks and adversarial evasions that automated assessments miss.

  • Support AI Governance and Compliance: Map findings to frameworks like the OWASP Top 10 for LLM applications and NIST AI RMF to provide clear evidence for legal and compliance teams.

  • Reduce Business and Regulatory Exposure: Assess AI systems in real-world abuse scenarios before production to avoid reputational damage and unsafe launches.

  • Leverage Expert Security Advisory: Every AIRT includes dedicated support from a Solutions Architect to guide threat modeling, define success criteria, and advise on remediation.

AI Red Teaming Best Practices

Successful AI Red Teaming engagements require careful preparation, clear objectives, and structured collaboration between the customer team, HackerOne technical engagements team, and the participating security researchers. The following best practices help ensure that AIRT engagements produce actionable insights and high-impact findings.

Define Clear Objectives and Threat Scenarios

AI Red Teaming engagements are most effective when the organization clearly defines the behaviors or risks they want to test. During the scoping and threat modeling phase, teams should identify realistic abuse scenarios that reflect how adversaries might exploit the AI system in production.

Examples include:

  • Prompt injection attacks that bypass safety controls

  • Data leakage or exposure of sensitive information through model responses

  • Unauthorized actions triggered by manipulated prompts or tool integrations

  • Generation of harmful or regulation-violating content that could damage trust or compliance

Organizations should also identify their ‘nightmare scenarios’, which are the highest-impact outcomes they want to prevent. Translating these risks into concrete testing objectives helps researchers focus their efforts on meaningful attack paths.

Establish a Well-Defined Scope

A clear scope definition is essential to avoid confusion and ensure researchers test the right components. AI systems often consist of multiple layers, including models, APIs, user interfaces, and integrations with external services.

A strong scope definition should include:

  • AI models and services in scope (e.g., chatbot interface, APIs, plugins)

  • Associated infrastructure or backend systems, with which researchers may interact

  • Any third-party AI services involved in the workflow

  • Explicit out-of-scope areas or restricted testing techniques

Providing architecture diagrams, model descriptions, and workflow documentation helps researchers better understand how the AI system operates and identify potential attack opportunities.

Provide Context and Documentation

AI systems are highly context-dependent. The more information researchers have about how the system works, the more effective their testing will be.

Recommended documentation may include:

  • System architecture diagrams

  • Model types and versions

  • Known safety mechanisms or guardrails

  • Example prompts and expected outputs

  • Integration points with other systems or tools

Providing this context allows researchers to explore deeper attack strategies and reduces time spent reverse-engineering system behavior.

Tip: Hai can also generate architecture diagrams

Translate Threat Models into Testable Objectives

A practical strategy for AI Red Teaming is to translate your threat model into explicit testing objectives, such as harmful behaviors or policy violations that researchers should attempt to reproduce.

Examples include:

  • Specific harmful output categories (‘flags’), researchers should attempt to trigger

  • Descriptions of high-impact abuse scenarios or misuse cases

  • Policy alignment tests against internal acceptable-use standards or external frameworks

These objectives help guide researcher creativity while ensuring that the engagement remains aligned with the organization’s risk priorities.

Ensure a Safe and Stable Testing Environment

Whenever possible, AI Red Teaming should be conducted in a controlled testing environment that mirrors production behavior while minimizing operational risk.

Best practices include:

  • Providing a dedicated sandbox or staging environment

  • Supplying test accounts or API keys to streamline researcher access

  • Ensuring that system updates or model changes are minimized during testing

  • Logging researcher interactions to identify system failures or abuse patterns

A stable environment ensures consistent results and enables researchers to effectively reproduce and validate findings.

Encourage Creative Adversarial Exploration

Unlike traditional penetration testing, AI Red Teaming relies heavily on creativity and experimentation. Researchers should be encouraged to explore unconventional attack methods and emergent behaviors.

Examples include:

  • Multi-step prompt injection strategies

  • Indirect attacks through tool integrations or external data sources

  • Attempts to bypass content filters or safety mechanisms

  • Chaining vulnerabilities across AI and traditional application components

Encouraging this exploratory mindset often reveals systemic weaknesses that structured test cases alone would miss.

Align Incentives with High-Impact Findings

Reward structures play an important role in attracting skilled security researchers and encouraging creative adversarial exploration.

Effective incentive strategies include:

  • Higher rewards for complex AI vulnerabilities or universal jailbreak techniques

  • Bonus incentives for novel attacks or previously unknown failure modes

  • Clear severity criteria for AI-related impacts, such as data exposure, unauthorized actions, or compliance violations

Well-designed reward programs help ensure sustained engagement and motivate researchers to pursue deeper, more sophisticated attacks.

Example Rewards

Assuming the AI asset in scope has some security and safety controls, we recommend a minimum bounty level of the following:

  • $3000–$5000 for Critical reports

  • $2000 – $2500 for High reports

  • $750 – $1500 for Medium reports

  • $200 – $500 for Low reports

For AI safety issues, we recommend using flags with a fixed bounty per severity level, since there may be multiple ways to obtain them. For example, a prompt that generates an image containing blood would constitute a valid flag. The bounty for this flag could be set at $250.

If an AI asset has already been thoroughly tested and/or requires an onerous testing setup, then a competitive bounty table would look like:

  • $7000 – $10,000 for Critical reports

  • $2500 – $5000 for High reports

  • $750 – $2000 for Medium reports

  • $100 – $500 for Low reports

However, an AI asset that is significantly hardened may warrant a higher rewards tier.

You can also boost engagement with AI-specific bonuses (e.g., manually tracking first-finder awards for specific vulnerability types). Non-monetary recognition, such as blog features, hall-of-fame shout-outs, or speaking opportunities, also motivates the community. Above all, transparency around how you’ll grade and reward AI vulnerabilities builds trust and drives participation.

Example Severity Criteria:

Severity level

Example vulnerability categories

Bounty payment

Critical

  • Insecure Plugin Design

  • Insecure Output Handling

  • High-Impact Prompt Injection

Examples:

RAG Data Poisoning: An Attacker can inject or overwrite retrieval sources so that every user sees malicious or misleading content.

Unauthorized Account Takeover: Attacker uses the assistant to change another user’s password or permissions without proper authentication.

$7000

High

  • Supply Chain Vulnerabilities

  • Broad Sensitive Information Disclosure / Inferred Sensitive Data

Examples:

Prompt Injection with Limited Impact: Attacker crafts a prompt that causes the assistant to execute administrative commands (e.g., modify user settings) on behalf of another user.

Context Leakage: Assistant reveals hidden system prompts or sensitive request headers.

$5000

Medium

  • Sensitive Information Disclosure / Inferred Sensitive Data about another user

  • Excessive Agency

Examples:

System Information Disclosure: Finding a way to call internal APIs that should be restricted, but no actual data or settings change occurs without additional steps.

RAG Retrieval Bypass: Attacker triggers the retrieval of non‐public documents without altering them

$1500

Low

  • Low Severity Context Leakage

Examples:

  • Prompt leaks that lead to non-sensitive internal information about the model, such as its original conditioning prompt(s).

  • Sensitive Information Disclosure / Inferred Sensitive Data only about the current user.

$500

Maintain Strong Communication During the Engagement

Real-time collaboration between the customer team, HackerOne technical engagements team, and security researchers is critical during the testing phase.

Effective communication practices include:

  • Providing a dedicated channel for researcher questions

  • Rapid triage and clarification of subjective AI findings

  • Sharing updates on scope changes or system updates

  • Providing guidance when researchers encounter environmental issues

Active engagement management helps maintain testing momentum and ensures researchers stay focused on the most impactful objectives.

Use Findings to Strengthen Long-Term AI Security and Safety

The value of AI Red Teaming extends beyond the discovery of individual vulnerabilities. Findings often reveal systemic weaknesses in model behavior, safety and security controls, or operational processes.

Organizations should use engagement results to:

  • Improve model guardrails and safety policies

  • Update monitoring and detection mechanisms

  • Refine threat models for future testing cycles

  • Inform governance and compliance programs

Because many AI security and safety issues require broader mitigations, AI Red Teaming is most effective when integrated into a continuous security and governance strategy.

Did this answer your question?