Score / Cost / Latency
Will visualize: Per-run score, cost, and latency with task-suite and model context.
Required data: Published run records with model, eval suite, score, cost, and latency fields.
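As an illustration of the record this chart would consume, here is a minimal sketch of one run record. The class name, field names, units (USD, seconds), and the `published` flag are all assumptions made for illustration; the page only states that model, eval suite, score, cost, and latency fields are required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical shape of one run record (not a confirmed schema)."""
    run_id: str
    model: str
    eval_suite: str
    score: float      # assumed: higher is better
    cost_usd: float   # assumed unit: US dollars
    latency_s: float  # assumed unit: seconds
    published: bool   # assumed flag marking a published run


def chartable(records):
    """Keep only published records, since charts are tied to published runs."""
    return [r for r in records if r.published]
```

Filtering on `published` before plotting keeps draft runs off the chart without deleting them from storage.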
Open Model Lab
Dashboard pages are currently scaffolding for planned charts. A chart will appear only once its underlying data is available and tied to published runs.
Will visualize: Failure-mode counts by model variant, task category, and report.
Required data: Runs with validated failure-mode labels.
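The counting described above could be sketched as follows. The dictionary keys (`model_variant`, `task_category`, `report`, `failure_modes`, `name`, `validated`) are hypothetical, chosen only to mirror the dimensions named on this page; labels that are not validated are skipped.

```python
from collections import Counter


def failure_mode_counts(runs):
    """Count validated failure-mode labels per (variant, category, report, mode)."""
    counts = Counter()
    for run in runs:
        for label in run["failure_modes"]:
            if not label["validated"]:
                continue  # unvalidated labels do not count toward the chart
            key = (run["model_variant"], run["task_category"],
                   run["report"], label["name"])
            counts[key] += 1
    return counts
```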
Will visualize: Controlled behavior changes across base, SFT, and DPO variants.
Required data: Comparable runs on the same task suite and documented model cards.
Will visualize: Step-by-step tool use, retries, errors, and task outcomes.
Required data: Replayable trace records from the agent harness.
Will visualize: Refusal quality, over-refusal, under-refusal, jailbreak robustness, and helpfulness.
Required data: Safety eval runs over both risky and harmless requests.
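One possible way to compute two of these metrics is sketched below. It assumes over-refusal means the share of harmless requests that were refused, and under-refusal means the share of risky requests that were not refused; those definitions and the `risky` and `refused` keys are assumptions, not something this page specifies.

```python
def refusal_rates(results):
    """Return over- and under-refusal rates from per-request results.

    Each result is a dict with boolean keys "risky" and "refused"
    (hypothetical names). A rate is None when its request set is empty.
    """
    harmless = [r for r in results if not r["risky"]]
    risky = [r for r in results if r["risky"]]
    over = (sum(1 for r in harmless if r["refused"]) / len(harmless)
            if harmless else None)
    under = (sum(1 for r in risky if not r["refused"]) / len(risky)
             if risky else None)
    return {"over_refusal": over, "under_refusal": under}
```

Both request sets are needed: with only risky requests, over-refusal cannot be measured at all, which is why the function returns `None` rather than zero in that case.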
Will visualize: Behavior gain compared with training or inference compute cost.
Required data: Scaling-ladder runs with compute accounting and comparable score fields.
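A rough sketch of one way to relate gain to compute: take the smallest-compute rung as the baseline and divide each larger rung's score improvement by its additional compute. The baseline choice, the `name`, `compute`, and `score` keys, and the compute unit are all assumptions made for illustration.

```python
def gain_per_compute(rungs):
    """Score gain over the smallest rung, per unit of additional compute.

    Each rung is a dict with hypothetical keys "name", "compute", "score".
    Scores must be on a comparable scale across rungs.
    """
    ordered = sorted(rungs, key=lambda r: r["compute"])
    if not ordered:
        return []
    base = ordered[0]
    gains = []
    for rung in ordered[1:]:
        extra = rung["compute"] - base["compute"]
        if extra <= 0:
            continue  # same compute as the baseline: ratio is undefined
        gains.append((rung["name"], (rung["score"] - base["score"]) / extra))
    return gains
```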