Learning Module¶
The learning module implements learning policies that improve routing, agent-behavior,
and tool-selection decisions based on historical interaction outcomes. The module provides
a LearningPolicy ABC taxonomy with specialized sub-ABCs for intelligence
(model routing), agent behavior, and tool selection. It also includes reward
functions for scoring inference results.
Abstract Base Classes¶
RouterPolicy¶
QueryAnalyzer¶
RoutingContext¶
RoutingContext is defined in core/types.py.
RewardFunction¶
LearningPolicy Taxonomy¶
The learning system defines a hierarchy of learning policy ABCs:
- LearningPolicy -- base ABC for all learning policies
- IntelligenceLearningPolicy -- specialization for model routing decisions
- AgentLearningPolicy -- specialization for agent behavior advice (ICL examples, tool-use strategies)
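The sketch below illustrates how this hierarchy would look under standard abc conventions. The exact abstract interface is not shown in this reference; the update hook here mirrors the update methods documented on the concrete policies below and is otherwise an assumption.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class LearningPolicy(ABC):
    """Base ABC for all learning policies."""

    @abstractmethod
    def update(self) -> Dict[str, Any]:
        """Analyze traces and update internal state (assumed hook)."""


class IntelligenceLearningPolicy(LearningPolicy):
    """Specialization for model routing decisions."""


class AgentLearningPolicy(LearningPolicy):
    """Specialization for agent behavior advice (ICL examples, tool use)."""
```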
Policy Implementations¶
TraceDrivenPolicy¶
TraceDrivenPolicy(analyzer: Optional[TraceAnalyzer] = None, *, available_models: Optional[List[str]] = None, default_model: str = '', fallback_model: str = '')
Bases: RouterPolicy
Router policy that learns from historical traces.
Maintains a mapping of query_class → best_model derived from
trace outcomes. Falls back to the provided default when no trace
data is available for a query class.
The policy is updated by calling update_from_traces, which
reads the TraceAnalyzer and recomputes the mapping.
Source code in src/openjarvis/learning/trace_policy.py
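A minimal usage sketch, assuming only the constructor and methods documented in this section. The model names are placeholders, and the RoutingContext construction is elided because its fields are defined in core/types.py rather than here.

```python
from openjarvis.learning.trace_policy import TraceDrivenPolicy

policy = TraceDrivenPolicy(
    available_models=["model-large", "model-small"],  # placeholder names
    default_model="model-small",
    fallback_model="model-small",
)

# Recompute the query_class -> best_model table from recorded traces;
# the returned dict summarises what changed (useful for logging).
summary = policy.update_from_traces()

# Route a query. RoutingContext is defined in core/types.py; its
# construction is not shown in this reference, so it is elided here.
# model = policy.select_model(context)

print(policy.policy_map)  # read-only copy of the learned routing table
```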
Attributes¶
policy_map property¶
Current learned routing decisions (read-only copy).
Functions¶
select_model¶
select_model(context: RoutingContext) -> str
Select the best model based on learned policy or fallback.
Source code in src/openjarvis/learning/trace_policy.py
update_from_traces¶
update_from_traces(*, since: Optional[float] = None, until: Optional[float] = None) -> Dict[str, Any]
Recompute the policy map from trace history.
Returns a summary of what changed for logging/debugging.
Source code in src/openjarvis/learning/trace_policy.py
observe¶
Record a single observation for online (incremental) updates.
This is a lighter-weight alternative to update_from_traces
for use cases where you want to update the policy after every
interaction rather than in batch.
Source code in src/openjarvis/learning/trace_policy.py
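observe's parameters are not listed above, so the call below is a hypothetical illustration of the per-interaction update pattern rather than the real signature.

```python
# Hypothetical arguments; the actual signature is in
# src/openjarvis/learning/trace_policy.py.
# policy.observe(query_class="code", model="model-large", success=True)
```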
classify_query¶
Classify a query into a broad category for routing.
Source code in src/openjarvis/learning/trace_policy.py
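The shipped classifier is in trace_policy.py; the sketch below is a plausible heuristic over the query classes named in the SFTRouterPolicy docs (code, math, short, long, general), not the actual implementation.

```python
def classify_query_sketch(query: str) -> str:
    """Illustrative heuristic only, not the shipped classify_query."""
    q = query.lower()
    if any(tok in q for tok in ("def ", "class ", "import ", "traceback")):
        return "code"
    if any(tok in q for tok in ("solve", "equation", "integral", "derivative")):
        return "math"
    words = q.split()
    if len(words) <= 8:
        return "short"
    if len(words) >= 150:
        return "long"
    return "general"
```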
SFTRouterPolicy¶
Bases: IntelligenceLearningPolicy
Trace-driven router that learns query_class → model mappings.
Reads historical traces, groups by query class (code, math, short, long, general), scores each model via a composite metric (60% outcome + 40% feedback), and produces a routing table that maps query classes to their best-performing model.
Source code in src/openjarvis/learning/sft_policy.py
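A sketch of the scoring loop described above. The 60/40 weights come from the text; the trace tuple shape is an assumption for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed trace shape: (query_class, model, outcome, feedback),
# with outcome and feedback already normalised to [0, 1].
Trace = Tuple[str, str, float, float]


def build_routing_table(traces: Iterable[Trace]) -> Dict[str, str]:
    """Map each query class to its best model under 0.6*outcome + 0.4*feedback."""
    scores: Dict[str, Dict[str, List[float]]] = defaultdict(lambda: defaultdict(list))
    for query_class, model, outcome, feedback in traces:
        scores[query_class][model].append(0.6 * outcome + 0.4 * feedback)
    return {
        qc: max(models, key=lambda m: sum(models[m]) / len(models[m]))
        for qc, models in scores.items()
    }
```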
Functions¶
update¶
Analyze trace outcomes and update the policy map.
Source code in src/openjarvis/learning/sft_policy.py
AgentAdvisorPolicy¶
Bases: AgentLearningPolicy
A higher-level LM analyzes traces and suggests changes to agent structure.
It does NOT auto-apply changes; it returns recommendations that can be reviewed or applied via config.
Source code in src/openjarvis/learning/agent_advisor.py
Functions¶
update¶
Analyze traces and return agent improvement recommendations.
Source code in src/openjarvis/learning/agent_advisor.py
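A sketch of the review-before-apply flow. The no-argument constructor and the shape of the returned recommendations are assumptions, since neither is documented above.

```python
from openjarvis.learning.agent_advisor import AgentAdvisorPolicy

advisor = AgentAdvisorPolicy()  # constructor arguments assumed
result = advisor.update()       # analyzes traces; nothing is auto-applied

# The structure of `result` is an assumption for illustration:
# for rec in result.get("recommendations", []):
#     print(rec)  # review each suggestion, then apply via config if accepted
```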
ICLUpdaterPolicy¶
Bases: AgentLearningPolicy
Updates in-context examples and discovers skills from traces.
Analyzes traces for successful tool call patterns, extracts in-context learning examples, and discovers reusable multi-tool sequences ("skills"). This updates agent logic (ICL examples and tool-use strategies), not tool implementations themselves.
Source code in src/openjarvis/learning/icl_updater.py
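The dataclasses below illustrate plausible shapes for the extracted artifacts; the actual structures live in icl_updater.py and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ICLExample:             # assumed shape, for illustration only
    query: str
    tool_calls: List[str]     # successful tool-call pattern from a trace
    outcome: str


@dataclass
class Skill:                  # assumed shape, for illustration only
    name: str
    tool_sequence: List[str]  # reusable multi-tool sequence
```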
Functions¶
update¶
Analyze traces and extract ICL examples + skills.
Source code in src/openjarvis/learning/icl_updater.py
GRPORouterPolicy¶
Bases: RouterPolicy
Placeholder for GRPO-trained router policy (Phase 5).
Raises NotImplementedError until training infrastructure is ready.
Source code in src/openjarvis/learning/grpo_policy.py
Reward Functions¶
HeuristicRewardFunction¶
HeuristicRewardFunction(*, weight_latency: float = 0.4, weight_cost: float = 0.3, weight_efficiency: float = 0.3, max_latency: float = 30.0, max_cost: float = 0.01)
Bases: RewardFunction
Computes a scalar reward based on latency, cost, and token efficiency.
Each component is normalised to [0, 1] and combined via a weighted sum.
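A sketch of the weighted sum described above. The weights and caps mirror the constructor signature; the normalisation direction (lower latency and cost score higher) and the token-efficiency input are assumptions.

```python
def heuristic_reward(
    latency: float,
    cost: float,
    efficiency: float,  # assumed to arrive already normalised to [0, 1]
    *,
    weight_latency: float = 0.4,
    weight_cost: float = 0.3,
    weight_efficiency: float = 0.3,
    max_latency: float = 30.0,
    max_cost: float = 0.01,
) -> float:
    # Lower latency and cost score higher; both are capped before normalising.
    latency_score = 1.0 - min(latency, max_latency) / max_latency
    cost_score = 1.0 - min(cost, max_cost) / max_cost
    return (
        weight_latency * latency_score
        + weight_cost * cost_score
        + weight_efficiency * efficiency
    )


# Example: a fast, cheap, reasonably efficient call scores about 0.85.
reward = heuristic_reward(latency=2.5, cost=0.002, efficiency=0.8)
```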