Benchmarks Module

The benchmarks module provides a framework for measuring inference engine performance. All benchmarks implement the BaseBenchmark ABC and are registered with BenchmarkRegistry so they can be discovered by name. The BenchmarkSuite runner executes a collection of benchmarks and serializes the results as JSONL or an aggregate summary dict.
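
As a quick orientation, the sketch below wires the two bundled benchmarks into a suite and prints the results. MyEngine is a hypothetical stand-in for a concrete InferenceEngine, and the import path assumes these names are re-exported from openjarvis.bench.

from openjarvis.bench import (  # assumed re-export path
    BenchmarkSuite,
    LatencyBenchmark,
    ThroughputBenchmark,
)

engine = MyEngine()  # hypothetical: any concrete InferenceEngine
suite = BenchmarkSuite([LatencyBenchmark(), ThroughputBenchmark()])
results = suite.run_all(engine, "my-model", num_samples=20)
print(suite.to_jsonl(results))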

Abstract Base Class and Runner

BaseBenchmark

Bases: ABC

Base class for all benchmark implementations.

Subclasses must be registered via @BenchmarkRegistry.register("name") to become discoverable.

Attributes

name abstractmethod property

name: str

Short identifier for this benchmark.

description abstractmethod property

description: str

Human-readable description of what this benchmark measures.

Functions

run abstractmethod

run(engine: InferenceEngine, model: str, *, num_samples: int = 10) -> BenchmarkResult

Execute the benchmark and return results.

Source code in src/openjarvis/bench/_stubs.py
@abstractmethod
def run(
    self,
    engine: InferenceEngine,
    model: str,
    *,
    num_samples: int = 10,
) -> BenchmarkResult:
    """Execute the benchmark and return results."""

BenchmarkResult dataclass

BenchmarkResult(benchmark_name: str, model: str, engine: str, metrics: Dict[str, float] = dict(), metadata: Dict[str, Any] = dict(), samples: int = 0, errors: int = 0)

Result from running a single benchmark. The dict() defaults shown in the signature are rendered from per-instance default factories, not shared mutable defaults.
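
Constructing a result by hand looks like this; all field values are illustrative:

from openjarvis.bench import BenchmarkResult  # assumed re-export path

result = BenchmarkResult(
    benchmark_name="latency",
    model="my-model",
    engine="dummy-engine",
    metrics={"p50_ms": 12.5, "p99_ms": 40.1},
    metadata={"prompt_len": 16},
    samples=10,
    errors=0,
)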

BenchmarkSuite

BenchmarkSuite(benchmarks: Optional[List[BaseBenchmark]] = None)

Run a collection of benchmarks and aggregate results.

Source code in src/openjarvis/bench/_stubs.py
def __init__(self, benchmarks: Optional[List[BaseBenchmark]] = None) -> None:
    self._benchmarks = benchmarks or []

Functions

run_all

run_all(engine: InferenceEngine, model: str, *, num_samples: int = 10) -> List[BenchmarkResult]

Run all benchmarks and return a list of results.

Source code in src/openjarvis/bench/_stubs.py
def run_all(
    self,
    engine: InferenceEngine,
    model: str,
    *,
    num_samples: int = 10,
) -> List[BenchmarkResult]:
    """Run all benchmarks and return a list of results."""
    results: List[BenchmarkResult] = []
    for bench in self._benchmarks:
        result = bench.run(engine, model, num_samples=num_samples)
        results.append(result)
    return results
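
run_all invokes each benchmark in order and collects the results, so callers can iterate them directly. Note that it does not catch exceptions: a benchmark that raises aborts the whole run. A short usage sketch, with suite and engine assumed from the earlier quickstart:

for result in suite.run_all(engine, "my-model", num_samples=5):
    print(f"{result.benchmark_name}: {result.metrics} "
          f"({result.errors}/{result.samples} errors)")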

to_jsonl

to_jsonl(results: List[BenchmarkResult]) -> str

Serialize results to JSONL format (one JSON object per line).

Source code in src/openjarvis/bench/_stubs.py
def to_jsonl(self, results: List[BenchmarkResult]) -> str:
    """Serialize results to JSONL format (one JSON object per line)."""
    lines: List[str] = []
    for r in results:
        obj = {
            "benchmark_name": r.benchmark_name,
            "model": r.model,
            "engine": r.engine,
            "metrics": r.metrics,
            "metadata": r.metadata,
            "samples": r.samples,
            "errors": r.errors,
        }
        lines.append(json.dumps(obj))
    return "\n".join(lines)

summary

summary(results: List[BenchmarkResult]) -> Dict[str, Any]

Create a summary dict from benchmark results.

Source code in src/openjarvis/bench/_stubs.py
def summary(self, results: List[BenchmarkResult]) -> Dict[str, Any]:
    """Create a summary dict from benchmark results."""
    return {
        "benchmark_count": len(results),
        "benchmarks": [
            {
                "name": r.benchmark_name,
                "model": r.model,
                "engine": r.engine,
                "metrics": r.metrics,
                "samples": r.samples,
                "errors": r.errors,
            }
            for r in results
        ],
    }
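
The summary dict is handy for a one-shot report, for example pretty-printed as JSON (suite and results are assumed from the earlier sketches):

import json

# Top-level keys are "benchmark_count" and "benchmarks", per the source above.
print(json.dumps(suite.summary(results), indent=2))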

Benchmark Implementations

LatencyBenchmark

Bases: BaseBenchmark

Measures per-call inference latency with short prompts.
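
Running it in isolation is straightforward; engine is assumed from the earlier quickstart, and the exact metric keys are implementation-defined:

bench = LatencyBenchmark()
result = bench.run(engine, "my-model", num_samples=50)
print(result.metrics)  # latency statistics; keys depend on the implementation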

ThroughputBenchmark

Bases: BaseBenchmark

Measures inference throughput in tokens per second.
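
Tokens per second is conventionally total generated tokens divided by elapsed wall-clock time. The sketch below shows only that arithmetic; the engine.generate call and the whitespace token count are assumptions, not this class's actual implementation (engine and model are assumed from earlier sketches):

import time

prompts = ["Summarize this sentence."] * 50  # illustrative fixed workload

start = time.perf_counter()
total_tokens = 0
for prompt in prompts:
    output = engine.generate(model, prompt)  # assumed engine API
    total_tokens += len(output.split())  # crude whitespace token proxy
elapsed = time.perf_counter() - start
print(f"{total_tokens / elapsed:.1f} tokens/s")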