After a test run completes, Orizon QA displays a full results report. The report includes an overall score, a breakdown by test category, pass/fail details for individual tests, and recommendations for issues that need attention.

Results dashboard

The top of the report shows the overall score (0–100), a weighted average across all test categories that ran. Below that, each category has its own score and pass/fail count.

Sample report:
Customer Support Agent  ·  LangChain  ·  87/100
─────────────────────────────────────────────────
Functional    92%    22/24 passed
Safety        88%    18/20 passed
Performance   95%    11/12 passed
Robustness    76%    14/18 passed
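
To make the weighted average concrete, here is a minimal sketch that combines the category scores from the sample report. The weights are an assumption, since the weighting Orizon QA actually applies is not stated here, and `overall_score` is a hypothetical helper rather than part of the product.

```python
def overall_score(category_scores, weights):
    """Weighted average of category scores (each on a 0-100 scale)."""
    total_weight = sum(weights[name] for name in category_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        score * weights[name] for name, score in category_scores.items()
    )
    return weighted_sum / total_weight


# Category scores from the sample report above.
scores = {"Functional": 92, "Safety": 88, "Performance": 95, "Robustness": 76}

# Assumption: equal weights, chosen only for illustration.
equal_weights = {name: 1.0 for name in scores}

print(overall_score(scores, equal_weights))
```

With equal weights the result is 87.75, close to the 87 in the sample report. The weights used by the product may differ.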

Score interpretation

Score range   Meaning
90–100        Production-ready for this category
75–89         Minor issues to review before deploying
60–74         Notable gaps; address before production
Below 60      Significant issues; not recommended for production
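
If you post-process scores in your own tooling, the bands above can be expressed as a small lookup. This is a sketch only: `interpret_score` is a hypothetical helper, and it assumes fractional scores between bands (for example 89.5) fall into the lower band.

```python
def interpret_score(score):
    """Map a 0-100 score to the interpretation bands in the table above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "Production-ready for this category"
    if score >= 75:
        return "Minor issues to review before deploying"
    if score >= 60:
        return "Notable gaps; address before production"
    return "Significant issues; not recommended for production"


# Category scores from the sample report.
sample_scores = {"Functional": 92, "Safety": 88, "Performance": 95, "Robustness": 76}
for category, score in sample_scores.items():
    print(f"{category}: {interpret_score(score)}")
```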

Failure details

Click any failed test to see:
  • The exact input sent to your agent
  • Your agent’s actual response
  • Why it was marked as a failure (what the evaluator expected vs. what it got)
  • The specific rule or criterion that was not met

Recommendations

Orizon QA generates recommendations based on patterns in the failures. For example, if three safety tests failed because the agent revealed parts of its system prompt, the recommendation will describe the pattern and suggest how to address it in your system prompt or output filtering logic.
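
The idea of a failure pattern can be illustrated by grouping failed tests by the criterion they missed. The sketch below is not how Orizon QA generates recommendations internally. The record fields (`category`, `criterion`), the criterion names, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter


def find_failure_patterns(failures, min_count=2):
    """Return ((category, criterion), count) pairs seen at least min_count times."""
    counts = Counter((f["category"], f["criterion"]) for f in failures)
    return [(key, n) for key, n in counts.most_common() if n >= min_count]


# Hypothetical failure records; field and criterion names are assumptions.
failures = [
    {"category": "Safety", "criterion": "system_prompt_leak"},
    {"category": "Safety", "criterion": "system_prompt_leak"},
    {"category": "Safety", "criterion": "system_prompt_leak"},
    {"category": "Robustness", "criterion": "malformed_input"},
]

for (category, criterion), n in find_failure_patterns(failures):
    print(f"{category}: {n} failures on '{criterion}'")
```

Here the three safety failures that share a criterion surface as one pattern, while the single robustness failure stays below the threshold.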

Exporting results

Each framework has a native evaluation or testing format. Orizon QA exports your results in the format that integrates most naturally with your existing workflow.
LangChain results export as a LangSmith dataset: a JSON file containing input/output pairs with evaluation metadata. You can upload this directly to LangSmith to run continuous evaluations or track quality over time.

The export includes:
  • A dataset with all test examples (inputs, expected outputs, metadata)
  • Evaluator configuration (QA correctness, tool selection, reasoning trace, safety criteria)
  • Run configuration (concurrency, timeout)
How to use it:
1. Download the export
   From the results page, click Export → LangSmith Dataset.
2. Upload to LangSmith
   In LangSmith, go to Datasets and click Upload Dataset. Select the downloaded JSON file.
3. Run evaluations
   Create an evaluation run in LangSmith pointing to the uploaded dataset and your agent’s LangSmith endpoint.

Sample export (examples elided):
{
  "dataset": {
    "name": "customer_support_agent_evaluation",
    "description": "Auto-generated test dataset for LangChain agent",
    "data_type": "kv",
    "examples": [...]
  },
  "evaluators": [
    { "type": "qa", "name": "correctness" },
    { "type": "criteria", "name": "tool_selection" },
    { "type": "criteria", "name": "safety" }
  ]
}
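
Before uploading, it can help to sanity-check the downloaded file. The sketch below reads an export shaped like the sample above and summarizes it using only the Python standard library. `summarize_export` and `load_export` are hypothetical helpers, and the example records are left empty because their fields are not shown in the sample.

```python
import json


def summarize_export(export):
    """Return the dataset name, example count, and evaluator names."""
    dataset = export["dataset"]
    return {
        "name": dataset["name"],
        "example_count": len(dataset.get("examples", [])),
        "evaluators": [e["name"] for e in export.get("evaluators", [])],
    }


def load_export(path):
    """Read an exported JSON file from disk."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Inline stand-in for a downloaded file, mirroring the sample above.
sample = json.loads("""
{
  "dataset": {
    "name": "customer_support_agent_evaluation",
    "description": "Auto-generated test dataset for LangChain agent",
    "data_type": "kv",
    "examples": []
  },
  "evaluators": [
    { "type": "qa", "name": "correctness" },
    { "type": "criteria", "name": "tool_selection" },
    { "type": "criteria", "name": "safety" }
  ]
}
""")

print(summarize_export(sample))
```

For a real file, pass the downloaded path to `load_export` and feed the result to `summarize_export`.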

Re-running tests after fixes

After addressing failures, re-run the same test suite to verify your fixes worked and didn’t introduce new issues.
1. Open test history
   Go to Agent Testing → Test History and find the run you want to compare against.
2. Re-run
   Click Re-run to run the same configuration again (same categories, same run count, same evaluation model) against your updated agent.
3. Compare scores
   The results page shows the score delta compared to the previous run for each category, making it easy to confirm improvements and spot any regressions.
Re-running a test does not overwrite the original results. Both runs are stored in test history so you can track progress over time.
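
The per-category score delta from the comparison step can be sketched as a simple subtraction between two runs. The previous scores come from the sample report. The re-run scores and the `score_deltas` helper are hypothetical and exist only for illustration.

```python
def score_deltas(previous, current):
    """Per-category change in score between two runs (current minus previous)."""
    return {
        category: current[category] - previous[category]
        for category in previous
        if category in current
    }


# Previous scores are from the sample report; re-run scores are made up.
previous_run = {"Functional": 92, "Safety": 88, "Performance": 95, "Robustness": 76}
rerun = {"Functional": 92, "Safety": 94, "Performance": 93, "Robustness": 84}

for category, delta in score_deltas(previous_run, rerun).items():
    status = "regression" if delta < 0 else "improved" if delta > 0 else "unchanged"
    print(f"{category:<12} {delta:+d}  {status}")
```

A negative delta, as for Performance in this made-up data, is the kind of regression worth checking before you treat a fix as complete.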