Evaluation Framework¶
The openjarvis-evals package provides a structured harness for measuring model quality
across research benchmarks. It is a separate package from the main openjarvis library,
installed from the evals/ directory, and exposes a CLI (openjarvis-eval) plus a
Python API for programmatic use.
The framework is organized around three ABCs — InferenceBackend, DatasetProvider,
and Scorer — plus the concrete EvalRunner, all wired together by RunConfig. A TOML-based
suite configuration system expands a models-by-benchmarks matrix into individual
RunConfig objects, so an entire comparison table can be launched from a single file.
Installation¶
The evaluation framework is a separate package. Install it from the repository root:
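A typical editable install from the repository root might look like the following; the exact command is an assumption based on the package living in `evals/`, so adjust to your packaging setup:

```bash
# Assumed install command: editable install of the evals/ subpackage
# from the repository root.
pip install -e evals/
```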
The package requires Python 3.10+, openjarvis>=1.0.0, and datasets>=2.14.
Core Types (evals.core.types)¶
These dataclasses are the shared vocabulary for every component in the framework.
EvalRecord¶
A single evaluation sample loaded from a dataset.
| Field | Type | Default | Description |
|---|---|---|---|
| `record_id` | `str` | — | Unique identifier for this sample |
| `problem` | `str` | — | The prompt or question presented to the model |
| `reference` | `str` | — | Ground-truth answer used for scoring |
| `category` | `str` | — | Task category: `"chat"`, `"reasoning"`, `"rag"`, or `"agentic"` |
| `subject` | `str` | `""` | Subject area or sub-topic within the benchmark |
| `metadata` | `Dict[str, Any]` | `{}` | Benchmark-specific extra fields (options, difficulty, file paths, etc.) |
```python
from evals.core.types import EvalRecord

record = EvalRecord(
    record_id="supergpqa-0",
    problem="What is the capital of France?\nOptions:\nA. Berlin\nB. Paris\nC. Madrid",
    reference="B",
    category="reasoning",
    subject="geography",
    metadata={"difficulty": "easy", "options": ["Berlin", "Paris", "Madrid"]},
)
```
EvalResult¶
The result of running inference on a single EvalRecord.
| Field | Type | Default | Description |
|---|---|---|---|
| `record_id` | `str` | — | Matches the source `EvalRecord.record_id` |
| `model_answer` | `str` | — | Raw text output from the model |
| `is_correct` | `Optional[bool]` | `None` | Scoring verdict; `None` if scoring could not be determined |
| `score` | `Optional[float]` | `None` | Numeric score (typically 1.0 / 0.0); may be `None` if `is_correct` is `None` |
| `latency_seconds` | `float` | `0.0` | Wall-clock generation time |
| `prompt_tokens` | `int` | `0` | Input token count from usage metadata |
| `completion_tokens` | `int` | `0` | Output token count from usage metadata |
| `cost_usd` | `float` | `0.0` | Estimated inference cost in USD |
| `error` | `Optional[str]` | `None` | Exception message if inference or scoring failed |
| `scoring_metadata` | `Dict[str, Any]` | `{}` | Scorer-specific details (extracted letter, judge output, match type, etc.) |
Distinguishing errors from wrong answers
A non-None error field means inference itself failed. When error is None but
is_correct is None, scoring was attempted but the scorer could not determine a
verdict (for example, the judge returned an unparseable response).
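For example, a quick post-hoc triage of a results list can separate the three cases. This is only a sketch; it assumes `results` is a list of `EvalResult` objects already loaded in memory:

```python
# Partition results into inference errors, unscored samples, and wrong answers.
# Assumes `results` is a list of EvalResult objects.
inference_errors = [r for r in results if r.error is not None]
unscored = [r for r in results if r.error is None and r.is_correct is None]
wrong = [r for r in results if r.is_correct is False]
print(f"{len(inference_errors)} errors, {len(unscored)} unscored, {len(wrong)} wrong")
```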
RunConfig¶
Configuration for a single evaluation run (one model on one benchmark).
| Field | Type | Default | Description |
|---|---|---|---|
| `benchmark` | `str` | — | Benchmark name: `"supergpqa"`, `"gaia"`, `"frames"`, or `"wildchat"` |
| `backend` | `str` | — | Backend identifier: `"jarvis-direct"` or `"jarvis-agent"` |
| `model` | `str` | — | Model identifier passed to the backend (e.g., `"qwen3:8b"`, `"gpt-4o"`) |
| `max_samples` | `Optional[int]` | `None` | Limit the dataset to this many records; `None` uses the full dataset |
| `max_workers` | `int` | `4` | Number of parallel threads for inference |
| `temperature` | `float` | `0.0` | Sampling temperature |
| `max_tokens` | `int` | `2048` | Maximum output tokens per sample |
| `judge_model` | `str` | `"gpt-4o"` | Model identifier used by the LLM judge scorer |
| `engine_key` | `Optional[str]` | `None` | Override the OpenJarvis engine (`"ollama"`, `"vllm"`, `"cloud"`, etc.) |
| `agent_name` | `Optional[str]` | `None` | Agent name for the jarvis-agent backend; defaults to `"orchestrator"` |
| `tools` | `List[str]` | `[]` | Tool names enabled for the agent (e.g., `["calculator", "file_read"]`) |
| `output_path` | `Optional[str]` | `None` | JSONL output file path; auto-generated from benchmark and model name if `None` |
| `seed` | `int` | `42` | Random seed for dataset shuffling |
| `dataset_split` | `Optional[str]` | `None` | Override the dataset split (e.g., `"validation"`, `"test"`) |
```python
from evals.core.types import RunConfig

config = RunConfig(
    benchmark="supergpqa",
    backend="jarvis-direct",
    model="qwen3:8b",
    max_samples=100,
    max_workers=8,
    engine_key="ollama",
    output_path="results/supergpqa_qwen3-8b.jsonl",
)
```
RunSummary¶
Aggregate statistics produced by EvalRunner.run() at the end of a completed run.
| Field | Type | Default | Description |
|---|---|---|---|
| `benchmark` | `str` | — | Benchmark name |
| `category` | `str` | — | Task category (inferred from records; falls back to the benchmark name) |
| `backend` | `str` | — | Backend used |
| `model` | `str` | — | Model identifier |
| `total_samples` | `int` | — | Total records processed (including errors) |
| `scored_samples` | `int` | — | Records where `is_correct` is not `None` |
| `correct` | `int` | — | Records where `is_correct` is `True` |
| `accuracy` | `float` | — | `correct / scored_samples`, rounded to 4 decimal places |
| `errors` | `int` | — | Records where inference or scoring raised an exception |
| `mean_latency_seconds` | `float` | — | Mean wall-clock latency across all successful inferences |
| `total_cost_usd` | `float` | — | Sum of `cost_usd` across all records |
| `per_subject` | `Dict[str, Dict[str, float]]` | `{}` | Per-subject breakdown: `{subject: {accuracy, total, scored, correct}}` |
| `started_at` | `float` | `0.0` | Unix timestamp at run start |
| `ended_at` | `float` | `0.0` | Unix timestamp at run end |
The runner also writes a .summary.json file alongside the JSONL output, containing
the serialized RunSummary.
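A small sketch of reading that file back, assuming the JSON keys mirror the RunSummary field names and that the summary filename is derived from the JSONL path (both assumptions):

```python
import json

# Load the serialized RunSummary written next to the JSONL output.
# Path and key names are assumptions based on the RunSummary fields above.
with open("results/supergpqa_qwen3-8b.summary.json") as f:
    summary = json.load(f)

print(summary["benchmark"], summary["model"])
print(f"accuracy={summary['accuracy']} over {summary['scored_samples']} scored samples")
```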
Suite Config Types (evals.core.types)¶
These dataclasses map directly to sections in a TOML eval suite config file.
They are populated by load_eval_config() and consumed by expand_suite().
MetaConfig¶
Maps to the [meta] TOML section. Both fields are optional and used only for
display output in the CLI.
DefaultsConfig¶
Maps to [defaults]. These values are the lowest-priority settings in the merge
precedence: benchmark-level > model-level > [defaults] > built-in defaults.
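Neither dataclass definition is reproduced on this page; a plausible sketch, inferred from the `[meta]` and `[defaults]` TOML sections they map to (field names and defaults are assumptions, not the verbatim definitions), would be:

```python
from dataclasses import dataclass

# Hypothetical sketches inferred from the [meta] and [defaults] TOML sections;
# the real field names and defaults may differ.
@dataclass
class MetaConfig:
    name: str = ""
    description: str = ""

@dataclass
class DefaultsConfig:
    temperature: float = 0.0
    max_tokens: int = 2048
```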
JudgeConfig¶
```python
@dataclass
class JudgeConfig:
    model: str = "gpt-4o"
    provider: Optional[str] = None
    temperature: float = 0.0
    max_tokens: int = 1024
```
Maps to [judge]. The judge model is used by LLM-as-judge scorers (GAIA, FRAMES,
WildChat, SuperGPQA). The provider field is reserved for future routing; currently
the judge backend is always constructed with engine_key="cloud".
ExecutionConfig¶
Maps to [run]. output_dir is the base directory for all JSONL output files;
individual filenames are auto-generated as {benchmark}_{model-slug}.jsonl.
ModelConfig¶
```python
@dataclass
class ModelConfig:
    name: str = ""
    engine: Optional[str] = None
    provider: Optional[str] = None
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
```
Maps to each [[models]] entry. name is required. temperature and max_tokens
override [defaults] for every benchmark this model runs against, unless a
benchmark-level override also exists.
BenchmarkConfig¶
```python
@dataclass
class BenchmarkConfig:
    name: str = ""
    backend: str = "jarvis-direct"
    max_samples: Optional[int] = None
    split: Optional[str] = None
    agent: Optional[str] = None
    tools: List[str] = field(default_factory=list)
    judge_model: Optional[str] = None
    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
```
Maps to each [[benchmarks]] entry. name is required. backend must be one of
"jarvis-direct" or "jarvis-agent". judge_model overrides [judge].model for
this benchmark only.
EvalSuiteConfig¶
The top-level config object returned by load_eval_config().
```python
@dataclass
class EvalSuiteConfig:
    meta: MetaConfig
    defaults: DefaultsConfig
    judge: JudgeConfig
    run: ExecutionConfig
    models: List[ModelConfig]
    benchmarks: List[BenchmarkConfig]
```
expand_suite(suite) iterates over models x benchmarks to produce one RunConfig
per pair, applying the merge precedence rules documented in DefaultsConfig.
Config Module (evals.core.config)¶
EvalConfigError¶
Raised by load_eval_config() for structural validation failures: missing required
fields, invalid backend names, or empty [[models]] / [[benchmarks]] lists.
load_eval_config¶
Load and validate an eval suite configuration from a TOML file.
Uses the standard library tomllib on Python 3.11+ and the tomli backport on
Python 3.10.
Parameters:
| Parameter | Type | Description |
|---|---|---|
| `path` | `str \| Path` | Path to the TOML config file |
Returns: EvalSuiteConfig
Raises:
- `EvalConfigError` — structural validation failures (missing `name`, invalid backend, no models/benchmarks defined)
- `FileNotFoundError` — if the config file does not exist
```python
from evals.core.config import load_eval_config

suite = load_eval_config("evals/configs/full-suite.toml")
print(f"{len(suite.models)} models, {len(suite.benchmarks)} benchmarks")
```
expand_suite¶
Expand an EvalSuiteConfig into a flat list of RunConfig objects, one per
model-benchmark pair, with all override layers merged.
Merge precedence (highest wins):
- Benchmark-level (`BenchmarkConfig.temperature`, `.max_tokens`, `.judge_model`)
- Model-level (`ModelConfig.temperature`, `.max_tokens`)
- Suite defaults (`DefaultsConfig`)
- Built-in dataclass defaults
Output paths are auto-generated as {output_dir}/{benchmark}_{model-slug}.jsonl,
where model-slug replaces / and : with -.
Parameters:
| Parameter | Type | Description |
|---|---|---|
suite |
EvalSuiteConfig |
Parsed suite configuration |
Returns: List[RunConfig] — one entry per model-benchmark combination.
```python
from evals.core.config import load_eval_config, expand_suite

suite = load_eval_config("evals/configs/full-suite.toml")
run_configs = expand_suite(suite)  # e.g., 3 models x 4 benchmarks = 12 RunConfigs

for rc in run_configs:
    print(f"{rc.benchmark} / {rc.model} -> {rc.output_path}")
```
Abstract Base Classes¶
InferenceBackend (evals.core.backend)¶
Base class for all inference backends. A backend wraps an engine or agent and provides a uniform text-in / text-out interface for the runner.
Class attribute:
| Attribute | Type | Description |
|---|---|---|
| `backend_id` | `str` | Registry identifier (e.g., `"jarvis-direct"`, `"jarvis-agent"`) |
Abstract methods:
generate¶
```python
@abstractmethod
def generate(
    self,
    prompt: str,
    *,
    model: str,
    system: str = "",
    temperature: float = 0.0,
    max_tokens: int = 2048,
) -> str
```
Generate a response and return the text content only.
| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `str` | — | User message or formatted problem text |
| `model` | `str` | — | Model identifier |
| `system` | `str` | `""` | Optional system prompt |
| `temperature` | `float` | `0.0` | Sampling temperature |
| `max_tokens` | `int` | `2048` | Maximum output tokens |
Returns: str — model text output.
generate_full¶
```python
@abstractmethod
def generate_full(
    self,
    prompt: str,
    *,
    model: str,
    system: str = "",
    temperature: float = 0.0,
    max_tokens: int = 2048,
) -> Dict[str, Any]
```
Generate a response and return full details including usage and cost metadata.
Returns: dict with keys:
| Key | Type | Description |
|---|---|---|
| `content` | `str` | Model text output |
| `usage` | `dict` | Token usage (`prompt_tokens`, `completion_tokens`) |
| `model` | `str` | Model identifier used |
| `latency_seconds` | `float` | Wall-clock generation time |
| `cost_usd` | `float` | Estimated inference cost |
close¶
Release resources held by the backend (connections, engine handles, etc.). The default implementation is a no-op; subclasses override as needed.
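As an illustration of the contract only, a minimal "echo" backend that returns the prompt unchanged might look like the sketch below. The import path, class attribute, and return keys are taken from this page; the `EchoBackend` class itself is hypothetical and only useful for dry-running the harness:

```python
import time
from typing import Any, Dict

from evals.core.backend import InferenceBackend

# Hypothetical backend illustrating the InferenceBackend contract.
class EchoBackend(InferenceBackend):
    backend_id = "echo"

    def generate(self, prompt: str, *, model: str, system: str = "",
                 temperature: float = 0.0, max_tokens: int = 2048) -> str:
        # Return the prompt unchanged; no model is called.
        return prompt

    def generate_full(self, prompt: str, *, model: str, system: str = "",
                      temperature: float = 0.0, max_tokens: int = 2048) -> Dict[str, Any]:
        start = time.time()
        content = self.generate(prompt, model=model, system=system,
                                temperature=temperature, max_tokens=max_tokens)
        return {
            "content": content,
            "usage": {"prompt_tokens": 0, "completion_tokens": 0},
            "model": model,
            "latency_seconds": time.time() - start,
            "cost_usd": 0.0,
        }
```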
DatasetProvider (evals.core.dataset)¶
Base class for all evaluation dataset providers. Datasets are loaded lazily via
load() and then consumed record-by-record through iter_records().
Class attributes:
| Attribute | Type | Description |
|---|---|---|
| `dataset_id` | `str` | Short identifier matching the CLI benchmark name |
| `dataset_name` | `str` | Human-readable display name |
Abstract methods:
load¶
```python
@abstractmethod
def load(
    self,
    *,
    max_samples: Optional[int] = None,
    split: Optional[str] = None,
    seed: Optional[int] = None,
) -> None
```
Load the dataset, optionally downloading from HuggingFace Hub. Must be called
before iter_records().
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_samples` | `Optional[int]` | `None` | Truncate to this many records after shuffling |
| `split` | `Optional[str]` | `None` | Dataset split override (e.g., `"test"`, `"validation"`) |
| `seed` | `Optional[int]` | `None` | Shuffle seed; `None` preserves the original order |
iter_records¶
Iterate over the loaded EvalRecord objects. Raises if called before load().
size¶
Return the count of loaded records.
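For example, iterating a small shuffled sample of a provider (using SuperGPQADataset, documented below, and assuming `size()` is a plain method) might look like:

```python
from evals.datasets.supergpqa import SuperGPQADataset

# Load a small, shuffled sample and inspect a few records.
dataset = SuperGPQADataset()
dataset.load(max_samples=5, seed=42)
print(f"{dataset.size()} records loaded")
for record in dataset.iter_records():
    print(record.record_id, record.subject)
```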
Scorer (evals.core.scorer)¶
Base class for all scorers. A scorer compares a model's answer to the reference
in an EvalRecord and returns a correctness verdict with optional metadata.
Class attribute:
| Attribute | Type | Description |
|---|---|---|
| `scorer_id` | `str` | Short identifier matching the benchmark name |
Abstract method:
score¶
```python
@abstractmethod
def score(
    self,
    record: EvalRecord,
    model_answer: str,
) -> Tuple[Optional[bool], Dict[str, Any]]
```
Score a model answer against the reference.
| Parameter | Type | Description |
|---|---|---|
| `record` | `EvalRecord` | The source sample, including reference and metadata |
| `model_answer` | `str` | Raw text output from the model |
Returns: (is_correct, metadata) tuple where:
- `is_correct` is `True`, `False`, or `None` (if scoring could not be determined)
- `metadata` is a `dict` of scorer-specific details stored in `EvalResult.scoring_metadata`
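As a sketch of this contract (not a scorer shipped in the package), a trivial case-insensitive exact-match scorer could look like:

```python
from typing import Any, Dict, Optional, Tuple

from evals.core.scorer import Scorer
from evals.core.types import EvalRecord

# Hypothetical scorer illustrating the (is_correct, metadata) contract.
class CaseInsensitiveExactScorer(Scorer):
    scorer_id = "exact-ci"

    def score(self, record: EvalRecord, model_answer: str) -> Tuple[Optional[bool], Dict[str, Any]]:
        if not record.reference:
            # Cannot determine a verdict without a reference answer.
            return None, {"reason": "missing_reference"}
        normalized = model_answer.strip().lower()
        is_correct = normalized == record.reference.strip().lower()
        return is_correct, {"normalized_answer": normalized}
```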
LLMJudgeScorer¶
```python
class LLMJudgeScorer(Scorer):
    def __init__(self, judge_backend: InferenceBackend, judge_model: str) -> None
```
Convenience base class for scorers that call an LLM to evaluate answers.
Exposes _ask_judge() to subclasses.
```python
def _ask_judge(
    self,
    prompt: str,
    *,
    system: str = "",
    temperature: float = 0.0,
    max_tokens: int = 1024,
) -> str
```
Send a prompt to the judge LLM and return the response text. Delegates to
judge_backend.generate().
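A subclass typically builds a grading prompt and parses the judge's reply. The following is a hypothetical example only; the prompt wording and the CORRECT/INCORRECT protocol are illustrative, not the package's actual judge prompts:

```python
from typing import Any, Dict, Optional, Tuple

from evals.core.scorer import LLMJudgeScorer
from evals.core.types import EvalRecord

# Hypothetical judge-based scorer showing how a subclass might use _ask_judge().
class YesNoJudgeScorer(LLMJudgeScorer):
    scorer_id = "yes-no-judge"

    def score(self, record: EvalRecord, model_answer: str) -> Tuple[Optional[bool], Dict[str, Any]]:
        prompt = (
            f"Question: {record.problem}\n"
            f"Reference answer: {record.reference}\n"
            f"Candidate answer: {model_answer}\n"
            "Reply with exactly CORRECT or INCORRECT."
        )
        verdict = self._ask_judge(prompt).strip().upper()
        if verdict not in ("CORRECT", "INCORRECT"):
            # Judge response could not be parsed; no verdict.
            return None, {"raw_judge_output": verdict}
        return verdict == "CORRECT", {"raw_judge_output": verdict}
```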
EvalRunner (evals.core.runner)¶
The EvalRunner wires together a RunConfig, DatasetProvider, InferenceBackend,
and Scorer and executes the benchmark. Inference is parallelized using a
ThreadPoolExecutor. Results are written to JSONL incrementally so progress is
not lost if the run is interrupted.
Constructor¶
```python
class EvalRunner:
    def __init__(
        self,
        config: RunConfig,
        dataset: DatasetProvider,
        backend: InferenceBackend,
        scorer: Scorer,
    ) -> None
```
| Parameter | Type | Description |
|---|---|---|
| `config` | `RunConfig` | Run parameters (model, workers, output path, etc.) |
| `dataset` | `DatasetProvider` | Dataset to evaluate against |
| `backend` | `InferenceBackend` | Inference backend for generation |
| `scorer` | `Scorer` | Scorer for comparing model answers to references |
run¶
Execute the full evaluation and return aggregate statistics.
The method:
- Calls `dataset.load()` with the `RunConfig` sampling parameters
- Submits all records to a `ThreadPoolExecutor` with `config.max_workers` threads
- For each record, calls `backend.generate_full()` then `scorer.score()`
- Writes each `EvalResult` to a JSONL file as it completes
- Writes a `.summary.json` alongside the JSONL at the end
Returns: RunSummary
```python
from evals.core.types import RunConfig
from evals.core.runner import EvalRunner
from evals.datasets.supergpqa import SuperGPQADataset
from evals.backends.jarvis_direct import JarvisDirectBackend
from evals.scorers.supergpqa_mcq import SuperGPQAScorer

config = RunConfig(
    benchmark="supergpqa",
    backend="jarvis-direct",
    model="qwen3:8b",
    max_samples=50,
    engine_key="ollama",
)

dataset = SuperGPQADataset()
backend = JarvisDirectBackend(engine_key="ollama")
judge_backend = JarvisDirectBackend(engine_key="cloud")
scorer = SuperGPQAScorer(judge_backend=judge_backend, judge_model="gpt-4o")

runner = EvalRunner(config, dataset, backend, scorer)
summary = runner.run()

print(f"Accuracy: {summary.accuracy:.4f} ({summary.correct}/{summary.scored_samples})")
print(f"Mean latency: {summary.mean_latency_seconds:.2f}s")
print(f"Total cost: ${summary.total_cost_usd:.4f}")

backend.close()
judge_backend.close()
```
Backends¶
JarvisDirectBackend (evals.backends.jarvis_direct)¶
Engine-level inference via SystemBuilder. Routes directly to the configured
InferenceEngine without an agent loop, making it the fastest backend and
appropriate for benchmarks that do not require tool use.
```python
class JarvisDirectBackend(InferenceBackend):
    backend_id = "jarvis-direct"

    def __init__(self, engine_key: Optional[str] = None) -> None
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `engine_key` | `Optional[str]` | `None` | OpenJarvis engine identifier. `None` uses the auto-discovered engine from `~/.openjarvis/config.toml` |
Telemetry and traces are disabled for eval runs. The backend calls
SystemBuilder().engine(engine_key).telemetry(False).traces(False).build().
Compatible benchmarks: supergpqa, frames, wildchat (any benchmark that
does not require multi-step tool calling).
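A minimal usage example; the engine key and model name are placeholders:

```python
from evals.backends.jarvis_direct import JarvisDirectBackend

# Direct engine-level generation without an agent loop.
backend = JarvisDirectBackend(engine_key="ollama")
text = backend.generate(
    "Summarize the Pythagorean theorem in one sentence.",
    model="qwen3:8b",
    temperature=0.0,
    max_tokens=256,
)
print(text)
backend.close()
```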
JarvisAgentBackend (evals.backends.jarvis_agent)¶
Agent-level inference via JarvisSystem.ask(). Wraps the full OpenJarvis agent
harness, enabling multi-turn tool-calling loops for agentic benchmarks.
```python
class JarvisAgentBackend(InferenceBackend):
    backend_id = "jarvis-agent"

    def __init__(
        self,
        engine_key: Optional[str] = None,
        agent_name: str = "orchestrator",
        tools: Optional[List[str]] = None,
    ) -> None
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `engine_key` | `Optional[str]` | `None` | OpenJarvis engine identifier |
| `agent_name` | `str` | `"orchestrator"` | Agent to use (`"orchestrator"`, `"react"`, etc.) |
| `tools` | `Optional[List[str]]` | `None` | Tool names to enable (e.g., `["calculator", "file_read"]`) |
The generate_full() return dict includes two additional keys beyond the standard
InferenceBackend contract:
| Key | Type | Description |
|---|---|---|
| `turns` | `int` | Number of agent turns completed |
| `tool_results` | `list` | Tool call results from the agent loop |
Compatible benchmarks: gaia (requires file reading and multi-step reasoning).
```python
from evals.backends.jarvis_agent import JarvisAgentBackend

backend = JarvisAgentBackend(
    engine_key="ollama",
    agent_name="orchestrator",
    tools=["file_read", "calculator"],
)
result = backend.generate_full(
    "How many pages is the attached PDF?",
    model="qwen3:8b",
)
print(result["content"])
print(f"Completed in {result['turns']} turn(s)")
backend.close()
```
Dataset Providers¶
SuperGPQADataset (evals.datasets.supergpqa)¶
Loads the SuperGPQA multiple-choice benchmark from HuggingFace (m-a-p/SuperGPQA).
Records have category="reasoning" and subject set to the discipline subfield.
- Default split: `"train"`
- HuggingFace path: `m-a-p/SuperGPQA`
- Each problem is formatted with lettered options (A, B, C, ...) and the instruction "Respond with the correct letter only."
- `record.reference` is the correct answer letter (e.g., `"B"`).
GAIADataset (evals.datasets.gaia)¶
Loads the GAIA agentic benchmark from HuggingFace (gaia-benchmark/GAIA).
Records have category="agentic" and subject set to level_1, level_2, or
level_3.
```python
class GAIADataset(DatasetProvider):
    dataset_id = "gaia"
    dataset_name = "GAIA"

    def __init__(self, cache_dir: Optional[str] = None) -> None
```
| Parameter | Type | Default | Description |
|---|---|---|---|
| `cache_dir` | `Optional[str]` | `~/.cache/gaia_benchmark` | Local directory for the HuggingFace snapshot download |
- Default split: `"validation"`
- Default subset: `"2023_all"`
- Downloads the full dataset snapshot, including associated files (PDFs, images, CSVs) referenced in questions. File paths are embedded in the problem prompt.
Dataset access
GAIA requires accepting the HuggingFace dataset terms of service and being logged
in with huggingface-cli login before the snapshot download can proceed.
FRAMESDataset (evals.datasets.frames)¶
Loads the FRAMES multi-hop factual retrieval benchmark from HuggingFace
(google/frames-benchmark). Records have category="rag" and subject set to
the reasoning type(s) (e.g., "multi-hop, temporal").
- Default split: `"test"`
- Wikipedia article links referenced in each question are included in the problem prompt.
WildChatDataset (evals.datasets.wildchat)¶
Loads the WildChat-1M dataset (allenai/WildChat-1M) and filters to English
single-turn conversations for chat quality evaluation. Records have
category="chat" and subject="conversation".
- Default split: `"train"`
- Filters by `language == "english"` and exactly two turns (one user + one assistant).
- `record.problem` is the user message; `record.reference` is the original assistant response, used as the quality baseline by the judge scorer.
Scorers¶
SuperGPQAScorer (evals.scorers.supergpqa_mcq)¶
LLM-based letter extraction followed by exact match against the reference letter.
The judge LLM extracts the final answer letter from potentially verbose model
responses, then compares it to record.reference.
Scoring metadata keys:
| Key | Description |
|---|---|
| `reference_letter` | Correct answer letter from the dataset |
| `candidate_letter` | Letter extracted by the judge LLM |
| `valid_letters` | Valid answer letters for this question (e.g., `"ABCD"`) |
| `reason` | Set to `"missing_reference_letter"` or `"no_choice_letter_extracted"` on failure |
GAIAScorer (evals.scorers.gaia_exact)¶
Normalized exact match with an LLM fallback for semantic comparison. Tries exact match first (no API call); falls back to the judge LLM only when exact match fails.
Normalization rules for exact match:
- Numbers: strips `$`, `%`, and `,`, then compares as `float`
- Lists (comma- or semicolon-separated): splits and compares element by element
- Strings: lowercases, strips whitespace and punctuation
Scoring metadata keys:
| Key | Description |
|---|---|
| `match_type` | `"exact"` or `"llm_fallback"` |
| `raw_judge_output` | Full LLM judge response (`llm_fallback` only) |
| `extracted_answer` | Answer extracted by the judge (`llm_fallback` only) |
The exact_match helper function is also exported and can be used independently:
```python
from evals.scorers.gaia_exact import exact_match

assert exact_match("$1,000", "1000") is True
assert exact_match("paris", "Paris") is True
assert exact_match("3, 5", "3,5") is True
```
FRAMESScorer (evals.scorers.frames_judge)¶
LLM-as-judge scorer for FRAMES multi-hop factual retrieval. Uses a structured grading rubric that focuses on semantic equivalence, ignoring formatting and capitalization differences.
Scoring metadata keys:
| Key | Description |
|---|---|
| `raw_judge_output` | Full LLM judge response |
| `extracted_answer` | Answer extracted by the judge |
WildChatScorer (evals.scorers.wildchat_judge)¶
Dual-comparison LLM-as-judge for chat quality. Runs two comparisons — once with the model answer as Assistant A and once as Assistant B — to reduce position bias. The model answer is considered correct if it wins or ties in either comparison.
The judge uses a five-point verdict scale: [[A>>B]], [[A>B]], [[A=B]],
[[B>A]], [[B>>A]]. A tie (A=B) is counted as correct.
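The decision rule can be summarized as: given the verdict from each pass, the model answer is correct if it wins or ties in either one. A sketch of that logic follows; the package's internal helper may be structured differently:

```python
# Sketch of the dual-comparison decision rule. `verdict_as_a` is from the pass
# where the model answer was Assistant A, `verdict_as_b` from the pass where it was B.
def is_win_or_tie(verdict_as_a: str, verdict_as_b: str) -> bool:
    wins_as_a = verdict_as_a in ("[[A>>B]]", "[[A>B]]", "[[A=B]]")
    wins_as_b = verdict_as_b in ("[[B>>A]]", "[[B>A]]", "[[A=B]]")
    return wins_as_a or wins_as_b
```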
Scoring metadata keys:
| Key | Description |
|---|---|
| `generated_as_a` | `{verdict, response}` from the first comparison pass |
| `generated_as_b` | `{verdict, response}` from the second comparison pass |
CLI Reference¶
The evaluation framework ships an openjarvis-eval CLI built with Click.
openjarvis-eval run¶
Run a single benchmark or a full suite from a TOML config.
```bash
openjarvis-eval run \
  --benchmark supergpqa \
  --model qwen3:8b \
  --engine ollama \
  --max-samples 100 \
  --max-workers 8 \
  --output results/supergpqa_qwen3-8b.jsonl
```
| Option | Short | Default | Description |
|---|---|---|---|
| `--config` | `-c` | `None` | TOML suite config file; enables suite mode |
| `--benchmark` | `-b` | — | Benchmark name (required in single-run mode) |
| `--backend` | | `jarvis-direct` | `jarvis-direct` or `jarvis-agent` |
| `--model` | `-m` | — | Model identifier (required in single-run mode) |
| `--engine` | `-e` | `None` | Engine key override |
| `--agent` | | `orchestrator` | Agent name for the jarvis-agent backend |
| `--tools` | | `""` | Comma-separated tool names |
| `--max-samples` | `-n` | `None` | Sample limit |
| `--max-workers` | `-w` | `4` | Parallel threads |
| `--judge-model` | | `gpt-4o` | LLM judge model |
| `--output` | `-o` | auto | JSONL output path |
| `--seed` | | `42` | Shuffle seed |
| `--split` | | `None` | Dataset split override |
| `--temperature` | | `0.0` | Sampling temperature |
| `--max-tokens` | | `2048` | Maximum output tokens |
| `--verbose` | `-v` | `False` | Enable debug logging |
openjarvis-eval run-all¶
Run all four benchmarks against a single model.
```bash
openjarvis-eval run-all \
  --model qwen3:8b \
  --engine ollama \
  --max-samples 50 \
  --output-dir results/
```
openjarvis-eval summarize¶
Recompute summary statistics from an existing JSONL output file.
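A hypothetical invocation, assuming the JSONL path is passed as a positional argument (check `openjarvis-eval summarize --help` for the exact usage):

```bash
# Hypothetical: recompute the summary for an existing results file.
openjarvis-eval summarize results/supergpqa_qwen3-8b.jsonl
```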
openjarvis-eval list¶
List all available benchmarks and backends.
TOML Suite Config Format¶
A suite config drives a full models x benchmarks comparison matrix with a single
command. All sections except [[models]] and [[benchmarks]] are optional.
```toml
[meta]
name = "full-suite-v1"
description = "Evaluate all benchmarks against production models"

[defaults]
temperature = 0.0
max_tokens = 2048

[judge]
model = "gpt-4o"
temperature = 0.0
max_tokens = 1024

[run]
max_workers = 4
output_dir = "results/"
seed = 42

# One [[models]] entry per model to evaluate
[[models]]
name = "qwen3:8b"
engine = "ollama"
temperature = 0.3

[[models]]
name = "gpt-4o"
provider = "openai"

# One [[benchmarks]] entry per benchmark
[[benchmarks]]
name = "supergpqa"
backend = "jarvis-direct"
max_samples = 200

[[benchmarks]]
name = "gaia"
backend = "jarvis-agent"
agent = "orchestrator"
tools = ["file_read", "calculator"]
max_samples = 50
judge_model = "claude-sonnet-4-20250514"  # override judge for this benchmark

[[benchmarks]]
name = "frames"
backend = "jarvis-direct"
max_samples = 100

[[benchmarks]]
name = "wildchat"
backend = "jarvis-direct"
max_samples = 150
temperature = 0.7
```
```bash
openjarvis-eval run --config evals/configs/full-suite.toml
# Suite: full-suite-v1
# 2 model(s) x 4 benchmark(s) = 8 run(s)
```
See Also¶
- Benchmarks Module — `openjarvis.bench` performance benchmarks (latency, throughput) for the inference engine, separate from the eval framework
- Telemetry & Traces — `openjarvis.telemetry` and `openjarvis.traces` for production monitoring
- Python SDK — the `Jarvis` class used internally by eval backends
- Agents — agent implementations invoked by `JarvisAgentBackend`