A Mutation-Based Fuzzer for Evaluating Prompt Robustness in LLM-based Applications
In production LLM-based applications, small variations in prompts can lead to dramatically different outputs, potentially breaking critical functionality. A single typo, case change, or rephrasing can cause a well-designed prompt to fail catastrophically. Quirx addresses this gap by systematically testing prompt robustness through controlled mutations.
Why does this matter?
When to use Quirx:
Quirx is a Python-based tool designed to assess the robustness of LLM-based applications (built on models such as GPT-4 or Claude) by introducing controlled, semantics-preserving mutations to prompts or input text. It helps detect prompt brittleness, semantic drift, and inconsistent outputs, which are major issues in prompt-based systems.
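To make "semantics-preserving mutation" concrete, here is an illustrative sketch of the kind of lexical operators a mutation engine might apply. The function names and operators below are hypothetical examples, not Quirx's actual internals:

```python
# Illustrative sketch only: simple lexical mutation operators of the kind a
# mutation engine might apply. These are hypothetical, not Quirx's implementation.

def case_mutation(text: str) -> str:
    """Flip the casing of the first word."""
    words = text.split(" ")
    words[0] = words[0].swapcase()
    return " ".join(words)

def punctuation_mutation(text: str) -> str:
    """Drop trailing punctuation."""
    return text.rstrip(".!?")

def spacing_mutation(text: str) -> str:
    """Introduce a double space after the first word."""
    return text.replace(" ", "  ", 1)

prompt = "Classify the sentiment of the following text."
for mutate in (case_mutation, punctuation_mutation, spacing_mutation):
    print(mutate(prompt))
```

A human reader treats all three mutants as the same instruction; a brittle prompt pipeline may not.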
```mermaid
graph TD
    A(Prompt File + Input) --> B[Mutation Engine]
    B --> C[Lexical Mutations]
    B --> D[Semantic Mutations]
    B --> E[Structural Mutations]
    C --> F[Case Changes<br/>Punctuation<br/>Spacing]
    D --> G[Synonyms<br/>Paraphrasing]
    E --> H[Reordering<br/>Restructuring]
    F --> I(Mutated Prompts)
    G --> I
    H --> I
    I --> J[LLM Runner]
    A --> J
    J --> K{OpenAI Provider}
    J --> L{Anthropic Provider}
    J --> M{Mock Provider}
    K --> N(Original Response)
    K --> O(Mutated Responses)
    L --> N
    L --> O
    M --> N
    M --> O
    N --> P[Output Comparer]
    O --> P
    P --> Q[Token Similarity]
    P --> R[Semantic Similarity]
    P --> S[Structural Similarity]
    Q --> T[Classification]
    R --> T
    S --> T
    T --> U{Equivalent}
    T --> V{Minor Variation}
    T --> W{Behavioral Deviation}
    U --> X[Report Generator]
    V --> X
    W --> X
    X --> Y(Markdown Report)
    X --> Z(JSON Report)
    X --> AA(HTML Report)

    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style J fill:#fff3e0
    style P fill:#e8f5e8
    style X fill:#fce4ec
```
For a high-level overview, see the Simplified Architecture diagram.
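As a conceptual illustration of the comparer stage, the sketch below uses Jaccard token overlap as a stand-in for the "Token Similarity" signal. This is an assumption for illustration; Quirx's actual metrics may differ:

```python
# Conceptual sketch of a token-overlap similarity, one plausible building
# block for the "Token Similarity" stage. Not Quirx's actual metric.

def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

print(token_similarity("POSITIVE", "POSITIVE"))  # identical outputs -> 1.0
print(token_similarity("POSITIVE", "NEGATIVE"))  # disjoint outputs -> 0.0
```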
The main features of the tool include:
Coming soon, after paper review results and a few more needed improvements to the CLI syntax. Please install by cloning the repository as explained below.

```shell
pip install quirx
```
```shell
git clone https://github.com/souhailaS/quirx.git
cd quirx
pip install -e .
```
Quirx requires Python 3.8+ and the following core dependencies:

- `openai` - For OpenAI API integration
- `anthropic` - For Claude API integration
- `nltk` - For natural language processing
- `sentence-transformers` - For semantic similarity analysis
- `numpy` - For numerical computations

Create a file called `prompt.txt`:
```text
Classify the sentiment of the following text as either POSITIVE, NEGATIVE, or NEUTRAL.

Guidelines:
- Consider the overall tone and emotion
- Look for sentiment indicators like adjectives and context
- Return only one word: POSITIVE, NEGATIVE, or NEUTRAL
- Be objective in your assessment

Text to classify:
```
```shell
# Basic usage with mock provider (no API key needed)
quirx --prompt-file prompt.txt --input "I love this product!" --provider mock

# With OpenAI (set environment variable first)
export OPENAI_API_KEY="your-openai-key"
quirx --prompt-file prompt.txt --input "I love this product!" --model gpt-3.5-turbo

# Generate more mutations and save to file
quirx --prompt-file prompt.txt --input "I love this product!" --mutations 50 --output results.md

# CI mode with JSON output
quirx --prompt-file prompt.txt --input "I love this product!" --ci-mode --format json
```
Quirx will generate a detailed report showing:
```shell
# Test SQL generation prompt (with environment variable)
export OPENAI_API_KEY="your-openai-key"
quirx --prompt-file examples/prompt_sql.txt --input "Show all users from the database" --model gpt-4

# Or with command line argument
quirx --prompt-file examples/prompt_sql.txt --input "Show all users from the database" --model gpt-4 --api-key "your-key"
```
```shell
# Test with custom parameters
quirx --prompt-file prompt.txt \
  --input "Your input here" \
  --model gpt-3.5-turbo \
  --mutations 30 \
  --output report.html \
  --format html \
  --seed 42 \
  --verbose

# CI/CD integration
quirx --prompt-file prompt.txt --input "test input" --ci-mode --format json
echo $?  # Check exit code: 0=pass, 1=fail, 2=warning
```
```python
from quirx.core.mutator import Mutator
from quirx.core.runner import LLMRunner
from quirx.core.comparer import OutputComparer

# Initialize components
mutator = Mutator(seed=42)
runner = LLMRunner(provider='openai')
comparer = OutputComparer()

# Generate mutations
prompt = "Your prompt here"
mutations = mutator.generate_mutations(prompt, count=20)

# Test original response
original_response = runner.run_prompt(prompt)

# Test mutations and compare
for mutation in mutations:
    mutated_response = runner.run_prompt(mutation.mutated_text)
    comparison = comparer.compare_outputs(
        original_response.text,
        mutated_response.text
    )
    print(f"Similarity: {comparison.overall_similarity:.3f}")
```
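To turn raw similarity scores into the three report categories, you could bucket them with thresholds of your own choosing. The 0.95/0.80 cutoffs below are illustrative assumptions, not Quirx's built-in defaults:

```python
# Illustrative bucketing of similarity scores into the report's three
# categories. The thresholds are assumptions, not Quirx's built-in values.

def classify(similarity: float) -> str:
    if similarity >= 0.95:
        return "equivalent"
    if similarity >= 0.80:
        return "minor_variation"
    return "behavioral_deviation"

scores = [1.0, 0.97, 0.85, 0.40]
counts = {"equivalent": 0, "minor_variation": 0, "behavioral_deviation": 0}
for s in scores:
    counts[classify(s)] += 1
print(counts)
```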
```markdown
# Quirx Report

## Summary

- **Robustness Score:** 0.85/1.00
- **Equivalent Responses:** 15 (75.0%)
- **Minor Variations:** 4 (20.0%)
- **Behavioral Deviations:** 1 (5.0%)
```
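One plausible weighting that reproduces the sample numbers above (this is an assumption; the actual formula may differ): count equivalent responses at full weight, minor variations at half weight, and behavioral deviations at zero.

```python
# Hypothetical weighting that reproduces the sample report's 0.85 score;
# the real Quirx formula may differ.
equivalent, minor, deviation = 15, 4, 1
total = equivalent + minor + deviation
score = (equivalent * 1.0 + minor * 0.5 + deviation * 0.0) / total
print(round(score, 2))  # 0.85
```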
```json
{
  "timestamp": "2024-01-15T10:30:00",
  "summary": {
    "robustness_score": 0.85,
    "equivalent_count": 15,
    "behavioral_deviation_count": 1
  },
  "results": [...]
}
```
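A CI job can also parse the JSON report directly and gate on its summary fields. The sketch below assumes the field names shown above; note that `--ci-mode` already sets an exit code, so this is only useful for custom gating logic:

```python
# Sketch: derive a CI exit code from the JSON report's summary.
# Assumes the "summary" structure shown above; adjust names if yours differs.
import json

def gate(report_json: str) -> int:
    """Return 0 if no behavioral deviations were found, else 1."""
    summary = json.loads(report_json)["summary"]
    return 1 if summary["behavioral_deviation_count"] > 0 else 0

sample = '{"summary": {"robustness_score": 0.85, "behavioral_deviation_count": 1}}'
print(gate(sample))  # 1: the sample report contains one deviation
```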
Interactive HTML report with visual charts and detailed analysis.
Quirx supports multiple ways to provide API keys, in order of precedence:
```shell
quirx --prompt-file prompt.txt --input "test" --api-key "your-api-key-here"
```
```shell
# Copy the sample configuration
cp config/api_keys.env.sample config/api_keys.env

# Edit with your actual keys (file is ignored by git)
nano config/api_keys.env

# Load the configuration
source config/api_keys.env

# Run Quirx
quirx --prompt-file prompt.txt --input "test" --provider openai
```
```shell
# For OpenAI
export OPENAI_API_KEY="your-openai-key"

# For Anthropic Claude
export ANTHROPIC_API_KEY="your-anthropic-key"

# Then run without --api-key
quirx --prompt-file prompt.txt --input "test" --provider openai
```
```python
from quirx.core.runner import LLMRunner

# Pass API key directly
runner = LLMRunner(provider='openai', api_key='your-api-key')

# Or rely on environment variable
runner = LLMRunner(provider='openai')  # Uses OPENAI_API_KEY env var
```
```shell
# Use mock provider for testing without real API calls
quirx --prompt-file prompt.txt --input "test" --provider mock
```
Quirx is particularly useful for testing:
```yaml
name: Prompt Robustness Check

on: [push, pull_request]

jobs:
  fuzz-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install quirx
      - name: Run Quirx
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          quirx --prompt-file prompts/classify.txt \
            --input "Tweet: I hate this product" \
            --ci-mode \
            --format json
```
- `0`: All tests passed (robust prompt)
- `1`: Behavioral deviations detected (failed)
- `2`: High variation rate (warning)

This project is licensed under the MIT License - see the LICENSE file for details.