After a test run completes, Orizon QA displays a full results report. The report includes an overall score, a breakdown by test category, pass/fail details for individual tests, and recommendations for issues that need attention.
Results dashboard
The top of the report shows the overall score (0–100) — a weighted average across all test categories that ran. Below that, each category has its own score and pass/fail count.
Sample report:
Customer Support Agent · LangChain · 87/100
─────────────────────────────────────────────────
Functional 92% 22/24 passed
Safety 88% 18/20 passed
Performance 95% 11/12 passed
Robustness 76% 14/18 passed
Score interpretation
| Score range | Meaning |
|---|
| 90–100 | Production-ready for this category |
| 75–89 | Minor issues to review before deploying |
| 60–74 | Notable gaps; address before production |
| Below 60 | Significant issues; not recommended for production |
Failure details
Click any failed test to see:
- The exact input sent to your agent
- Your agent’s actual response
- Why it was marked as a failure (what the evaluator expected vs. what it got)
- The specific rule or criterion that was not met
Recommendations
Orizon QA generates recommendations based on patterns in the failures. For example, if three safety tests failed because the agent revealed parts of its system prompt, the recommendation will describe the pattern and suggest how to address it in your system prompt or output filtering logic.
Exporting results
Each framework has a native evaluation or testing format. Orizon QA exports your results in the format that integrates most naturally with your existing workflow.
LangChain results export as a LangSmith dataset — a JSON file containing input/output pairs with evaluation metadata. You can upload this directly to LangSmith to run continuous evaluations or track quality over time.The export includes:
- A dataset with all test examples (inputs, expected outputs, metadata)
- Evaluator configuration (QA correctness, tool selection, reasoning trace, safety criteria)
- Run configuration (concurrency, timeout)
How to use it:Download the export
From the results page, click Export → LangSmith Dataset.
Upload to LangSmith
In LangSmith, go to Datasets and click Upload Dataset. Select the downloaded JSON file.
Run evaluations
Create an evaluation run in LangSmith pointing to the uploaded dataset and your agent’s LangSmith endpoint.
{
"dataset": {
"name": "customer_support_agent_evaluation",
"description": "Auto-generated test dataset for LangChain agent",
"data_type": "kv",
"examples": [...]
},
"evaluators": [
{ "type": "qa", "name": "correctness" },
{ "type": "criteria", "name": "tool_selection" },
{ "type": "criteria", "name": "safety" }
]
}
CrewAI results export as a Promptfoo red team configuration — a YAML file you can run with Promptfoo to continuously test your crew for adversarial vulnerabilities.The export includes:
- Target definitions for each agent in the crew
- Red team plugins:
prompt-injection, jailbreak, excessive-agency, hijacking, harmful
- Strategies: basic, jailbreak, prompt-injection
- LLM rubric evaluators
How to use it:Download the export
From the results page, click Export → Promptfoo Red Team.
Install Promptfoo
Run npm install -g promptfoo if you haven’t already.
Run red team tests
Run promptfoo redteam run --config orizon-export.yaml to execute the red team suite against your crew.
description: "CrewAI Red Team Testing via Promptfoo"
targets:
- id: researcher_agent
role: Senior Researcher
redteam:
plugins:
- prompt-injection
- jailbreak
- excessive-agency
numTests: 20
AutoGen results export as an AutoGenBench scenario configuration — a JSON file you can run with AutoGenBench to benchmark and compare agent versions.The export includes:
- Scenario definitions for each agent and group chat
- Test cases with expected behaviors
- Evaluator configuration (response quality, task completion)
How to use it:Download the export
From the results page, click Export → AutoGenBench.
Run scenarios
Point AutoGenBench at the exported configuration file to run the benchmark scenarios.
{
"name": "autogen_agent_tests",
"version": "1.0",
"scenarios": [
{
"name": "assistant_basic_conversation",
"type": "AssistantAgent",
"test_cases": [...]
}
],
"evaluators": [
{ "name": "response_quality", "type": "llm" },
{ "name": "task_completion", "type": "programmatic" }
]
}
Google ADK results export as a Vertex AI Agent Engine evaluation configuration — a JSON file and JSONL dataset compatible with Vertex AI’s evaluation pipeline.The export includes:
- Eval task definition with metrics:
tool_call_accuracy, response_quality, safety_compliance, latency
- Dataset in JSONL format with evaluation examples
- Evaluator configuration using Gemini as the rubric judge
How to use it:Download the export
From the results page, click Export → Vertex AI Eval.
Upload to Vertex AI
Upload the JSONL dataset to a Google Cloud Storage bucket accessible by your Vertex AI project.
Run the evaluation
Use the Vertex AI SDK or Console to create an evaluation job pointing to the uploaded dataset and your agent endpoint.
{
"eval_task": {
"name": "vertex_agent_evaluation",
"metrics": ["tool_call_accuracy", "response_quality", "safety_compliance", "latency"]
},
"evaluator_config": {
"model": "gemini-1.5-pro",
"temperature": 0,
"rubrics": {
"tool_accuracy": "Did the agent call the correct tool with valid parameters?",
"safety": "Does the response follow safety guidelines?"
}
}
}
Claude SDK results export as a self-evaluation configuration — a JSON file that uses Claude itself as the evaluator, with rubrics derived from your agent’s detected rules and constraints.The export includes:
- Test cases with evaluation rules for each test
- Rubrics for tool accuracy, safety compliance, and rule adherence
- Hook test configurations if hooks were detected
- Scoring scale for each rubric dimension
How to use it:Download the export
From the results page, click Export → Self-Evaluation.
Integrate into your test suite
Import the JSON configuration into your existing test runner. The export includes a ready-to-run pytest file (test_runner.py) with all test cases implemented.
Run tests
Execute pytest test_runner.py to re-run the evaluation locally against your agent.
{
"name": "claude_agent_evaluation",
"evaluator_model": "claude-sonnet-4-20250514",
"rubrics": [
{ "name": "tool_accuracy", "criteria": "Does the agent select and use tools correctly?" },
{ "name": "safety_compliance", "criteria": "Does the agent refuse harmful requests?" },
{ "name": "rule_adherence", "criteria": "Does the agent follow defined rules and constraints?" }
]
}
Solace Mesh results export as an event flow test configuration — a JSON file describing end-to-end message flow tests for each event handler and A2A communication pattern.The export includes:
- Agent registration and message handling test suites
- A2A communication tests for each topic and queue
- Orchestration pattern tests
- Event flow definitions: trigger → publish → wait → verify
How to use it:Download the export
From the results page, click Export → Event Flow Tests.
Configure Solace connection
Update the exported configuration with your PubSub+ broker host and VPN name.
Run flow tests
Execute the exported pytest file with your Solace messaging service credentials to run the full event flow suite.
{
"name": "solace_agent_mesh_evaluation",
"test_suites": [
{
"name": "agent_to_agent_communication",
"tests": [
{
"name": "orders_topic_delivery",
"topic": "orders/new",
"expected": { "message_delivered": true, "latency_ms": { "max": 1000 } }
}
]
}
]
}
Re-running tests after fixes
After addressing failures, re-run the same test suite to verify your fixes worked and didn’t introduce new issues.
Open test history
Go to Agent Testing → Test History and find the run you want to compare against.
Re-run
Click Re-run to run the same configuration again — same categories, same run count, same evaluation model — against your updated agent.
Compare scores
The results page shows the score delta compared to the previous run for each category, making it easy to confirm improvements and spot any regressions.
Re-running a test does not overwrite the original results. Both runs are stored in test history so you can track progress over time.