Inference Engine Pillar¶

The Engine pillar provides the inference runtime -- the layer that connects OpenJarvis to language model servers. All backends implement a uniform interface, making it straightforward to swap between local and cloud inference without changing application code.

InferenceEngine ABC¶

Every engine backend extends the InferenceEngine abstract base class:

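from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Dict, List, Sequence

from openjarvis.core.types import Message


class InferenceEngine(ABC):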
    engine_id: str

    @abstractmethod
    def generate(
        self,
        messages: Sequence[Message],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        """Synchronous completion -- returns a dict with 'content' and 'usage'."""

    @abstractmethod
    async def stream(
        self,
        messages: Sequence[Message],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> AsyncIterator[str]:
        """Yield token strings as they are generated."""

    @abstractmethod
    def list_models(self) -> List[str]:
        """Return identifiers of models available on this engine."""

    @abstractmethod
    def health(self) -> bool:
        """Return True when the engine is reachable and healthy."""

    def prepare(self, model: str) -> None:
        """Optional warm-up hook called before the first request."""

Return Format¶

The generate() method returns a dictionary with the following structure:

{
    "content": "The model's response text",
    "usage": {
        "prompt_tokens": 42,
        "completion_tokens": 128,
        "total_tokens": 170,
    },
    "model": "qwen3:8b",
    "finish_reason": "stop",
    "tool_calls": [...]  # Optional, present if model requested tool calls
}
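
A backend that produces this structure could look like the following sketch. EchoEngine is a hypothetical in-memory engine invented for illustration: it takes plain dicts instead of Message objects and does not subclass the real InferenceEngine, so the snippet runs without OpenJarvis installed. Word counts stand in for real token counts.

```python
import asyncio
from typing import Any, AsyncIterator, Dict, List, Sequence


class EchoEngine:
    """Hypothetical stand-in backend: echoes the last message back."""

    engine_id = "echo"

    def generate(
        self,
        messages: Sequence[Dict[str, str]],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> Dict[str, Any]:
        content = messages[-1]["content"] if messages else ""
        # Whitespace-separated words approximate tokens in this sketch.
        prompt_tokens = sum(len(m["content"].split()) for m in messages)
        completion_tokens = len(content.split())
        return {
            "content": content,
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            },
            "model": model,
            "finish_reason": "stop",
        }

    async def stream(
        self,
        messages: Sequence[Dict[str, str]],
        *,
        model: str,
        temperature: float = 0.7,
        max_tokens: int = 1024,
        **kwargs: Any,
    ) -> AsyncIterator[str]:
        # An async generator: each word is yielded as one "token".
        text = messages[-1]["content"] if messages else ""
        for word in text.split():
            yield word

    def list_models(self) -> List[str]:
        return ["echo-1"]

    def health(self) -> bool:
        return True


async def collect(engine: EchoEngine, messages, model: str) -> List[str]:
    """Drain the stream into a list so it can be inspected synchronously."""
    return [token async for token in engine.stream(messages, model=model)]


engine = EchoEngine()
messages = [{"role": "user", "content": "hello there"}]
result = engine.generate(messages, model="echo-1")
tokens = asyncio.run(collect(engine, messages, "echo-1"))
```

Because generate() and stream() share the same keyword-only parameters, calling code can switch between blocking and streaming use without touching how messages are built.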

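When the model requests tool calls, they are extracted from the provider response and returned in a flat format, where each call carries an id, a name, and its arguments as a JSON-encoded string: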

{
    "tool_calls": [
        {
            "id": "call_abc123",
            "name": "calculator",
            "arguments": "{\"expression\": \"2 + 2\"}"
        }
    ]
}
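
Since arguments arrives as a JSON string rather than a dict, a caller has to decode it before invoking a tool. The sketch below shows one way to do that. The TOOLS registry, the toy calculator, and the shape of the returned tool results are assumptions made for illustration, not part of OpenJarvis.

```python
import json
from typing import Any, Callable, Dict, List


def calculator(expression: str) -> str:
    """Toy tool: sums the integers in an expression such as '2 + 2'."""
    return str(sum(int(term) for term in expression.split("+")))


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {"calculator": calculator}


def run_tool_calls(result: Dict[str, Any]) -> List[Dict[str, str]]:
    """Execute every tool call in a generate() result and collect outputs."""
    outputs = []
    # "tool_calls" is optional, so default to an empty list.
    for call in result.get("tool_calls", []):
        arguments = json.loads(call["arguments"])  # JSON string -> dict
        tool = TOOLS[call["name"]]
        outputs.append({"tool_call_id": call["id"], "content": tool(**arguments)})
    return outputs


result = {
    "content": "",
    "tool_calls": [
        {
            "id": "call_abc123",
            "name": "calculator",
            "arguments": "{\"expression\": \"2 + 2\"}",
        }
    ],
}
outputs = run_tool_calls(result)
```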

Multi-Provider Tool Call Extraction¶

Engine backends normalize tool calls from different providers into the standard flat format used by agents:

Provider	Source Format	Extraction Logic
OpenAI	`choices[0].message.tool_calls[].function.{name, arguments}`	Direct extraction, add `id` from `tool_calls[].id`
Anthropic	`content[]` blocks with `type: "tool_use"`	Filter `tool_use` blocks, map `input` dict to JSON `arguments`
Google	`candidates[0].content.parts[]` with `function_call`	Extract `function_call.name` and `function_call.args`, serialize args to JSON
LiteLLM	Flat `{id, name, arguments}` dicts (proxy pre-normalizes)	Pass through directly
Ollama	`message.tool_calls[].function.{name, arguments}`	Extract from Ollama native format, serialize arguments dict to JSON

All providers produce the same output format consumed by agents:

{
    "tool_calls": [
        {"id": "call_abc", "name": "calculator", "arguments": "{\"expression\": \"2+2\"}"}
    ]
}
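
The extraction rules in the table can be sketched as one small function per provider. This is an illustration on plain dicts that mirror the shapes listed above, not the OpenJarvis implementation, which may operate on SDK response objects. The synthetic call_N ids used for Google and Ollama are an assumption, since the table does not say how ids are produced for providers that do not supply one.

```python
import json
from typing import Any, Dict, List

ToolCall = Dict[str, str]


def from_openai(response: Dict[str, Any]) -> List[ToolCall]:
    message = response["choices"][0]["message"]
    return [
        {
            "id": call["id"],
            "name": call["function"]["name"],
            # OpenAI already delivers arguments as a JSON string.
            "arguments": call["function"]["arguments"],
        }
        for call in message.get("tool_calls") or []
    ]


def from_anthropic(response: Dict[str, Any]) -> List[ToolCall]:
    return [
        {"id": block["id"], "name": block["name"], "arguments": json.dumps(block["input"])}
        for block in response.get("content", [])
        if block.get("type") == "tool_use"
    ]


def from_google(response: Dict[str, Any]) -> List[ToolCall]:
    parts = response["candidates"][0]["content"]["parts"]
    return [
        {
            "id": f"call_{index}",  # assumed: synthesized id
            "name": part["function_call"]["name"],
            "arguments": json.dumps(part["function_call"]["args"]),
        }
        for index, part in enumerate(parts)
        if "function_call" in part
    ]


def from_ollama(response: Dict[str, Any]) -> List[ToolCall]:
    calls = response.get("message", {}).get("tool_calls") or []
    return [
        {
            "id": f"call_{index}",  # assumed: synthesized id
            "name": call["function"]["name"],
            "arguments": json.dumps(call["function"]["arguments"]),
        }
        for index, call in enumerate(calls)
    ]


anthropic_response = {
    "content": [
        {"type": "text", "text": "Let me compute that."},
        {"type": "tool_use", "id": "toolu_01", "name": "calculator", "input": {"expression": "2+2"}},
    ]
}
ollama_response = {
    "message": {
        "tool_calls": [{"function": {"name": "calculator", "arguments": {"expression": "2+2"}}}]
    }
}
anthropic_calls = from_anthropic(anthropic_response)
ollama_calls = from_ollama(ollama_response)
```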

Backend Comparison¶

Backend	Registry Key	Protocol	GPU Required	Best For
Ollama	`ollama`	Native HTTP API	No (GPU optional)	Getting started, consumer GPUs, Apple Silicon
vLLM	`vllm`	OpenAI-compatible	NVIDIA recommended	Datacenter GPUs (A100, H100), high throughput
SGLang	`sglang`	OpenAI-compatible	NVIDIA recommended	Structured generation, speculative decoding
llama.cpp	`llamacpp`	OpenAI-compatible	No (CPU-optimized)	CPU-only systems, GGUF models, edge devices
Cloud	`cloud`	Provider SDKs	No	OpenAI, Anthropic, Google API access

Ollama¶

The Ollama backend communicates via Ollama's native HTTP API at /api/chat and /api/tags. It is the default engine on Apple Silicon and consumer NVIDIA GPUs.

Default host: http://localhost:11434
Health check: GET /api/tags
Model listing: GET /api/tags (extracts model names)
Tool support: Passes tools in the request payload and extracts tool_calls from responses
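
As a rough illustration of the native protocol, the sketch below builds a non-streaming /api/chat request body and extracts model names from an /api/tags response. The helper names are hypothetical, and mapping max_tokens to Ollama's num_predict option is an assumption about how the backend translates its parameters.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_chat_payload(
    messages: Sequence[Dict[str, Any]],
    model: str,
    temperature: float = 0.7,
    max_tokens: int = 1024,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> Dict[str, Any]:
    """Build a non-streaming request body for POST /api/chat."""
    payload: Dict[str, Any] = {
        "model": model,
        "messages": list(messages),
        "stream": False,
        # Ollama reads sampling settings from "options";
        # num_predict caps the number of generated tokens.
        "options": {"temperature": temperature, "num_predict": max_tokens},
    }
    if tools:
        payload["tools"] = tools
    return payload


def model_names(tags_response: Dict[str, Any]) -> List[str]:
    """Extract model names from a GET /api/tags response body."""
    return [entry["name"] for entry in tags_response.get("models", [])]


payload = build_chat_payload([{"role": "user", "content": "Hello"}], model="qwen3:8b")
names = model_names({"models": [{"name": "qwen3:8b"}, {"name": "llama3.2:3b"}]})
```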

vLLM¶

The vLLM backend uses the OpenAI-compatible /v1/chat/completions API. It is recommended for datacenter NVIDIA GPUs (A100, H100, H200, L40, A10, A30) and for AMD GPUs.

Default host: http://localhost:8000
Health check: GET /v1/models
Tool fallback: If the server returns HTTP 400 when tools are included, the engine automatically retries without tools
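
The tool fallback can be sketched independently of any HTTP library by injecting the function that performs the POST. Everything here is illustrative: post_with_tool_fallback and fake_server are hypothetical names, and dropping tool_choice along with tools on the retry is an assumption.

```python
from typing import Any, Callable, Dict, List, Tuple

# A transport takes a JSON payload and returns (status_code, response_body).
PostFn = Callable[[Dict[str, Any]], Tuple[int, Dict[str, Any]]]


def post_with_tool_fallback(post: PostFn, payload: Dict[str, Any]) -> Dict[str, Any]:
    """POST the payload; on HTTP 400 with tools present, retry once without them."""
    status, body = post(payload)
    if status == 400 and "tools" in payload:
        stripped = {
            key: value
            for key, value in payload.items()
            if key not in ("tools", "tool_choice")
        }
        status, body = post(stripped)
    if status >= 400:
        raise RuntimeError(f"request failed with HTTP {status}")
    return body


seen: List[Dict[str, Any]] = []


def fake_server(payload: Dict[str, Any]) -> Tuple[int, Dict[str, Any]]:
    """Stand-in for a server that rejects any request containing tools."""
    seen.append(payload)
    if "tools" in payload:
        return 400, {"error": "tools are not supported"}
    return 200, {"choices": [{"message": {"content": "4"}}]}


body = post_with_tool_fallback(
    fake_server,
    {"model": "m", "messages": [], "tools": [{"type": "function"}]},
)
```

The retry happens at most once, so a server that returns 400 for an unrelated reason still surfaces an error instead of looping.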

SGLang¶

The SGLang backend also uses the OpenAI-compatible API. It shares the same _OpenAICompatibleEngine base class as vLLM and llama.cpp.

Default host: http://localhost:30000
Health check: GET /v1/models

llama.cpp¶

The llama.cpp backend connects to a llama-server instance via the OpenAI-compatible API. It is recommended for CPU-only systems and GGUF-quantized models.

Default host: http://localhost:8080
Health check: GET /v1/models

Cloud¶

The Cloud backend provides access to OpenAI, Anthropic, and Google models via their respective Python SDKs. It automatically detects the provider based on the model name:

Models containing "claude" route to the Anthropic client
Models containing "gemini" route to the Google client
All other models route to the OpenAI client
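
These routing rules amount to a substring check on the model name. A minimal sketch, assuming the match is case-insensitive (the rules above do not say) and using a hypothetical function name:

```python
def detect_provider(model: str) -> str:
    """Map a model name to a provider key using the substring rules above."""
    name = model.lower()  # assumed: matching ignores case
    if "claude" in name:
        return "anthropic"
    if "gemini" in name:
        return "google"
    # Everything else falls through to the OpenAI client.
    return "openai"


providers = [detect_provider(m) for m in ("claude-sonnet-4", "gemini-2.0-flash", "gpt-4o")]
```

One consequence of the fall-through rule is that an unrecognized or misspelled model name is sent to the OpenAI client rather than rejected up front.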

API Keys

Cloud models require API keys set as environment variables: OPENAI_API_KEY, ANTHROPIC_API_KEY, GEMINI_API_KEY (or GOOGLE_API_KEY). The cloud engine is only registered if the corresponding SDK packages are installed.

Hardware Auto-Detection¶

OpenJarvis automatically detects system hardware to recommend the best engine. Detection runs at config load time via detect_hardware():

Detection	Method	Information Extracted
NVIDIA GPU	`nvidia-smi`	GPU name, VRAM (GB), count
AMD GPU	`rocm-smi`	GPU name
Apple Silicon	`system_profiler SPDisplaysDataType`	Chipset model name
CPU	`/proc/cpuinfo` or `sysctl`	Brand string
RAM	`/proc/meminfo` or `sysctl hw.memsize`	Total GB
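
To give a feel for what this extraction involves, the sketch below parses two of the sources in the table. It is an illustration, not the detect_hardware() code: the function names are hypothetical, and the nvidia-smi parser assumes the tool was invoked as nvidia-smi --query-gpu=name,memory.total --format=csv,noheader,nounits, which may differ from what OpenJarvis runs.

```python
import re
from typing import List, Optional, Tuple


def parse_meminfo_total_gb(meminfo: str) -> Optional[float]:
    """Return total RAM in GB from the text of /proc/meminfo, or None."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo, re.MULTILINE)
    if match is None:
        return None
    # MemTotal is reported in kB (KiB); divide by 1024**2 to get GB (GiB).
    return round(int(match.group(1)) / 1024 ** 2, 1)


def parse_nvidia_smi_csv(output: str) -> List[Tuple[str, float]]:
    """Return (gpu_name, vram_gb) pairs from 'name, memory.total' CSV rows."""
    gpus = []
    for line in output.strip().splitlines():
        # Split on the last comma so commas inside a GPU name are preserved.
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, round(int(memory_mib) / 1024, 1)))
    return gpus


ram_gb = parse_meminfo_total_gb("MemTotal:       16777216 kB\nMemFree:         1234567 kB\n")
gpus = parse_nvidia_smi_csv("NVIDIA A100-SXM4-80GB, 81920\nNVIDIA A100-SXM4-80GB, 81920\n")
```

The GPU count in the table corresponds to len(gpus) here.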

Engine Recommendation Logic¶

The recommend_engine() function maps hardware to the best engine:

graph TD
    A["detect_hardware()"] --> B{"GPU detected?"}
    B -->|No| C["llamacpp"]
    B -->|Yes| D{"GPU vendor?"}
    D -->|Apple| E["ollama"]
    D -->|NVIDIA| F{"Datacenter card?<br/>(A100, H100, H200,<br/>L40, A10, A30)"}
    F -->|Yes| G["vllm"]
    F -->|No| H["ollama"]
    D -->|AMD| I["vllm"]
    D -->|Other| J["llamacpp"]
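
The decision tree translates almost directly into code. The sketch below is one possible reading of the chart, not the actual recommend_engine() source: the Hardware dataclass and its field names are hypothetical, and matching datacenter cards as whole tokens in the GPU name is an assumption (it means variants such as L40S would need their own entry).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Datacenter card names from the chart, matched as whole tokens so that
# e.g. "RTX A1000" is not mistaken for an A10 or A100.
_DATACENTER = re.compile(r"\b(A100|H100|H200|L40|A10|A30)\b")


@dataclass
class Hardware:
    """Hypothetical detection result; the real structure may differ."""

    gpu_vendor: Optional[str] = None  # "nvidia", "amd", "apple", other, or None
    gpu_name: str = ""


def recommend_engine(hw: Hardware) -> str:
    if hw.gpu_vendor is None:
        return "llamacpp"  # no GPU detected
    if hw.gpu_vendor == "apple":
        return "ollama"
    if hw.gpu_vendor == "nvidia":
        return "vllm" if _DATACENTER.search(hw.gpu_name) else "ollama"
    if hw.gpu_vendor == "amd":
        return "vllm"
    return "llamacpp"  # any other vendor


choices = [
    recommend_engine(Hardware()),
    recommend_engine(Hardware("apple", "Apple M3 Max")),
    recommend_engine(Hardware("nvidia", "NVIDIA A100-SXM4-80GB")),
    recommend_engine(Hardware("nvidia", "NVIDIA GeForce RTX 4090")),
    recommend_engine(Hardware("amd", "AMD Instinct MI300X")),
    recommend_engine(Hardware("intel", "Intel Arc A770")),
]
```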

Engine Discovery¶

The _discovery.py module provides three functions for finding and instantiating engines at runtime.

`get_engine(config, engine_key=None)`¶

Returns a (key, engine_instance) tuple, or None if no suitable engine is available. Resolution proceeds in this order:

If engine_key is specified, try to instantiate and health-check that specific engine
Otherwise, try the default engine from config
If the default is unhealthy, fall back to any healthy engine via discover_engines()
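
The resolution order can be sketched with stub engines. This is an illustration only: pick_engine and FakeEngine are hypothetical names, the registry is a plain dict of already-instantiated engines, and returning None (rather than falling back) when an explicitly requested engine is unhealthy is an assumption about the intended behavior.

```python
from typing import Dict, Optional, Tuple


class FakeEngine:
    """Stub exposing only the health() check used during resolution."""

    def __init__(self, healthy: bool) -> None:
        self._healthy = healthy

    def health(self) -> bool:
        return self._healthy


def pick_engine(
    registry: Dict[str, FakeEngine],
    default: str,
    engine_key: Optional[str] = None,
) -> Optional[Tuple[str, FakeEngine]]:
    # 1. An explicit key is honored or rejected; no fallback (assumed).
    if engine_key is not None:
        engine = registry.get(engine_key)
        if engine is not None and engine.health():
            return engine_key, engine
        return None
    # 2. Otherwise try the configured default.
    engine = registry.get(default)
    if engine is not None and engine.health():
        return default, engine
    # 3. Fall back to the first healthy engine of any kind.
    for key, candidate in registry.items():
        if candidate.health():
            return key, candidate
    return None


registry = {"ollama": FakeEngine(False), "vllm": FakeEngine(True)}
fallback = pick_engine(registry, default="ollama")
explicit = pick_engine(registry, default="ollama", engine_key="ollama")
```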

`discover_engines(config)`¶

Probes every registered engine with a health check and returns a list of the healthy (key, engine) pairs, with the configured default engine placed first.

from openjarvis.engine import discover_engines
from openjarvis.core.config import load_config

config = load_config()
healthy = discover_engines(config)
# [("ollama", OllamaEngine(...)), ("vllm", VLLMEngine(...))]
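
The default-first ordering can be expressed as a sort key. In this sketch the stub class and function name are hypothetical, and ordering the remaining engines alphabetically is an assumption, since only the position of the default is specified.

```python
from typing import Dict, List, Tuple


class StubEngine:
    """Stub exposing only health(), enough to demonstrate the ordering."""

    def __init__(self, healthy: bool) -> None:
        self._healthy = healthy

    def health(self) -> bool:
        return self._healthy


def healthy_engines(
    registry: Dict[str, StubEngine], default: str
) -> List[Tuple[str, StubEngine]]:
    healthy = [(key, engine) for key, engine in registry.items() if engine.health()]
    # False sorts before True, so the default engine (key != default is False)
    # comes first; the rest follow alphabetically (assumed tie-break).
    return sorted(healthy, key=lambda pair: (pair[0] != default, pair[0]))


registry = {
    "vllm": StubEngine(False),
    "llamacpp": StubEngine(True),
    "ollama": StubEngine(True),
}
ordered_keys = [key for key, _ in healthy_engines(registry, default="ollama")]
```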

`discover_models(engines)`¶

Calls list_models() on each engine and returns a dictionary mapping engine keys to model ID lists:

from openjarvis.core.config import load_config
from openjarvis.engine import discover_engines, discover_models

config = load_config()
engines = discover_engines(config)
models = discover_models(engines)
# {"ollama": ["qwen3:8b", "llama3.2:3b"], "vllm": ["mistral:7b"]}

OpenAI Compatibility Layer¶

The _OpenAICompatibleEngine base class provides a shared implementation for engines that serve the standard /v1/chat/completions endpoint. vLLM, SGLang, and llama.cpp all extend this base class with minimal overrides -- typically just setting engine_id and _default_host.

import httpx


class _OpenAICompatibleEngine(InferenceEngine):
    engine_id: str = ""
    _default_host: str = "http://localhost:8000"

    def __init__(self, host: str | None = None, *, timeout: float = 120.0):
        self._host = (host or self._default_host).rstrip("/")
        self._client = httpx.Client(base_url=self._host, timeout=timeout)

Key behaviors:

Synchronous generation: POST /v1/chat/completions with stream=False
Streaming: POST /v1/chat/completions with stream=True, parsing SSE data: lines
Model listing: GET /v1/models, extracting data[].id
Health check: GET /v1/models with a 2-second timeout
Tool call fallback: On HTTP 400 with tools in the payload, retries without tools (handles engines that do not support function calling)
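
The streaming behavior relies on reading server-sent events line by line. The sketch below parses the data: lines of an OpenAI-style stream and yields the text deltas. The function name is hypothetical and the input is a list of strings rather than a live HTTP response, so this shows only the parsing step.

```python
import json
from typing import Iterable, Iterator


def iter_sse_tokens(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from OpenAI-style SSE lines."""
    for line in lines:
        # Skip blank keep-alive lines and any non-data fields.
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # sentinel that ends the stream
            break
        chunk = json.loads(data)
        choices = chunk.get("choices") or []
        if not choices:
            continue
        token = choices[0].get("delta", {}).get("content")
        if token:
            yield token


sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
text = "".join(iter_sse_tokens(sample))
```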

Configuration¶

Engine hosts and defaults are configured in ~/.openjarvis/config.toml using nested per-engine sub-sections:

[engine]
default = "ollama"

[engine.ollama]
host = "http://localhost:11434"

[engine.vllm]
host = "http://localhost:8000"

[engine.sglang]
host = "http://localhost:30000"

# [engine.llamacpp]
# host = "http://localhost:8080"
# binary_path = ""

The EngineConfig dataclass and its per-engine sub-dataclasses map these settings:

Config Class	Field	Default	Description
`EngineConfig`	`default`	`"ollama"` (hardware-dependent)	Preferred engine backend
`OllamaEngineConfig`	`host`	`http://localhost:11434`	Ollama server URL
`VLLMEngineConfig`	`host`	`http://localhost:8000`	vLLM server URL
`SGLangEngineConfig`	`host`	`http://localhost:30000`	SGLang server URL
`LlamaCppEngineConfig`	`host`	`http://localhost:8080`	llama.cpp server URL
`LlamaCppEngineConfig`	`binary_path`	`""`	Path to llama.cpp binary (for managed mode)

Backward compatibility

The old flat field names ollama_host, vllm_host, llamacpp_host, llamacpp_path, and sglang_host under [engine] are still accepted as backward-compatible properties on EngineConfig. New configurations should use the nested sub-section format.
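
One common way to provide such aliases is a property that forwards to the nested sub-config. The sketch below shows the pattern for ollama_host only. It assumes the sub-config is stored in an attribute named ollama, which is a guess, and it is not the actual EngineConfig source.

```python
from dataclasses import dataclass, field


@dataclass
class OllamaEngineConfig:
    host: str = "http://localhost:11434"


@dataclass
class EngineConfig:
    default: str = "ollama"
    # Assumed attribute name for the nested [engine.ollama] section.
    ollama: OllamaEngineConfig = field(default_factory=OllamaEngineConfig)

    @property
    def ollama_host(self) -> str:
        """Legacy flat name; reads through to the nested config."""
        return self.ollama.host

    @ollama_host.setter
    def ollama_host(self, value: str) -> None:
        self.ollama.host = value


config = EngineConfig()
config.ollama_host = "http://gpu-box:11434"  # old-style assignment
nested_host = config.ollama.host             # new-style read sees the change
```

Because the property stores nothing itself, the old and new names can never disagree.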

Utility Functions¶

`messages_to_dicts()`¶

Converts a sequence of Message objects to OpenAI-format dictionaries, handling tool calls and tool call IDs:

from openjarvis.engine._base import messages_to_dicts
from openjarvis.core.types import Message, Role

messages = [Message(role=Role.USER, content="Hello")]
dicts = messages_to_dicts(messages)
# [{"role": "user", "content": "Hello"}]

`EngineConnectionError`¶

A custom exception raised when an engine is unreachable. All engine backends catch httpx.ConnectError and httpx.TimeoutException and re-raise them as EngineConnectionError, so callers only need to handle one exception type:

from openjarvis.engine import EngineConnectionError

try:
    result = engine.generate(messages, model="qwen3:8b")
except EngineConnectionError as exc:
    print(f"Engine unavailable: {exc}")
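
On the backend side, the translation is a try/except around the transport call. The sketch below shows the pattern with stand-ins so it runs without httpx or OpenJarvis: the local EngineConnectionError class mimics the real one, the built-in ConnectionError and TimeoutError take the place of httpx.ConnectError and httpx.TimeoutException, and call_engine is a hypothetical helper.

```python
from typing import Any, Callable


class EngineConnectionError(Exception):
    """Local stand-in for openjarvis.engine.EngineConnectionError."""


def call_engine(send: Callable[[], Any], host: str) -> Any:
    """Run a transport call, translating connection failures."""
    try:
        return send()
    except (ConnectionError, TimeoutError) as exc:
        # Chain with "from" so the original error stays in the traceback.
        raise EngineConnectionError(f"cannot reach engine at {host}: {exc}") from exc


def unreachable() -> None:
    """Simulates a server that refuses the connection."""
    raise ConnectionRefusedError("connection refused")


try:
    call_engine(unreachable, "http://localhost:11434")
    message = ""
except EngineConnectionError as exc:
    message = str(exc)
    cause = exc.__cause__
```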