Supported model servers

Each server below is recognised even while Idle, so it stays on the dashboard when its model is unloaded. Per-model VRAM comes from the server's own API where one reports it; otherwise it is attributed from nvidia-smi.

| Server | Model name | Per-model VRAM |
| --- | --- | --- |
| Ollama | ✅ loaded + pulled catalogue | ✅ via /api/ps (validated) |
| vLLM | ✅ via /v1/models | attributed |
| llama.cpp / llama-server | ✅ via /v1/models | attributed |
| LocalAI | ✅ via /v1/models | attributed |
| HF TGI / TEI | ✅ via /info | attributed |
| faster-whisper / Speaches | ✅ via /v1/models | attributed |
| koboldcpp | ✅ via /api/v1/model | attributed |
| tabbyAPI · text-generation-webui · LM Studio · xinference · Aphrodite · Infinity | ✅ via /v1/models | attributed |
| SGLang · OpenLLM · LiteLLM · GPUStack · Cortex / Jan · Ramalama · Nexa · mistral.rs | ✅ via /v1/models | attributed |
| LoRAX | ✅ via /info | attributed |
| Whisper ASR webservice / WhisperX | ✅ up via /openapi.json (single entry) | attributed |
| Wyoming (HA voice: faster-whisper / Piper / openWakeWord) | ✅ via describe over TCP | attributed |
| OpenedAI-Speech | ✅ via /v1/models | attributed |
| NVIDIA Triton | ✅ up via /v2 (single entry) | attributed |
| Stable Diffusion (A1111 / Forge / SD.Next) | ✅ via /sdapi/v1/options | attributed |
| InvokeAI | ✅ via /api/v2/models/ | attributed |
| ComfyUI | ✅ checkpoints via /object_info | attributed |
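
To make the two VRAM sources concrete, here is a minimal sketch of the parsing on each side. It is not the monitor's actual code. The function names are hypothetical. The first function assumes the `models[].size_vram` field that Ollama's /api/ps returns; the second assumes nvidia-smi is queried with `--query-compute-apps=pid,used_memory --format=csv,noheader,nounits`. How the monitor maps a PID back to a server, and what "validated" checks for Ollama, is not covered here.

```python
def ollama_vram_by_model(ps_payload: dict) -> dict[str, int]:
    """Map model name -> bytes resident in VRAM, from an Ollama /api/ps body."""
    return {
        m["name"]: int(m.get("size_vram", 0))
        for m in ps_payload.get("models", [])
    }


def parse_compute_apps(csv_text: str) -> dict[int, int]:
    """Map PID -> used GPU memory in MiB.

    Expects the output of (assumed invocation):
      nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits
    """
    usage: dict[int, int] = {}
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip blank lines and non-numeric values such as "[N/A]"
        pid, mib = int(parts[0]), int(parts[1])
        usage[pid] = usage.get(pid, 0) + mib  # the same PID can appear once per GPU
    return usage
```

For servers marked "attributed", only the second path is available, so the figure is per process rather than per model.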

Don't see yours? Adding a probe is a one-liner — append to PROBES in app.py. Most servers speak the OpenAI /v1/models shape and differ only by port. See How it works and Contributing.
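
As a rough illustration of what such a probe involves, the sketch below queries an OpenAI-style /v1/models endpoint and extracts the model ids. The real layout of PROBES in app.py is not shown here, so the `(name, port, path)` tuple shape, the `EXAMPLE_PROBES` name, and both helper functions are assumptions for illustration only. Check app.py for the actual format before adding an entry.

```python
import json
from urllib.request import urlopen

# Hypothetical entry layout: (display name, port, path). The real PROBES
# structure in app.py may differ; the ports below are examples only.
EXAMPLE_PROBES = [
    ("vLLM", 8000, "/v1/models"),
    ("my-server", 9000, "/v1/models"),  # the kind of line you would append
]


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [m["id"] for m in payload.get("data", []) if "id" in m]


def probe(host: str, port: int, path: str, timeout: float = 2.0) -> list[str]:
    """Return the model ids a server reports, or [] if the request fails
    or the body is not valid JSON."""
    try:
        with urlopen(f"http://{host}:{port}{path}", timeout=timeout) as resp:
            return model_ids(json.load(resp))
    except (OSError, ValueError):
        return []
```

Because most servers share this response shape, a new probe in this sketch differs only in its name and port, which matches the "one-liner" claim above.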

Who's calling? (caller attribution)

Model-server APIs never reveal who is calling them. The monitor works it out from the outside: it samples each container's own established connections and matches the remote port to a model server, then attributes connection-time per caller → server — surfaced as the "Driven by" breakdown on each server card. It's sampled, so long LLM streams are tracked reliably while sub-second calls (e.g. embeddings) are approximate; the hub's own probe traffic is excluded.
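
The sampling idea can be sketched as follows. This is an illustration under stated assumptions, not the monitor's implementation. It assumes connections are read from the Linux /proc/net/tcp table inside each container's network namespace, that each sample tick adds one interval of connection-time, and that several connections from one caller to the same server count once per tick. The function names and container names are hypothetical.

```python
from collections import defaultdict

TCP_ESTABLISHED = "01"  # state code used in /proc/net/tcp


def established_remote_ports(proc_net_tcp: str) -> list[int]:
    """Return the remote port of each ESTABLISHED socket in /proc/net/tcp text."""
    ports = []
    for line in proc_net_tcp.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) < 4 or fields[3] != TCP_ESTABLISHED:
            continue
        _ip_hex, port_hex = fields[2].split(":")  # rem_address is HEXIP:HEXPORT
        ports.append(int(port_hex, 16))
    return ports


def attribute(samples, server_ports, interval_s):
    """Accumulate connection-time in seconds per (caller, server) pair.

    samples: iterable of (caller, [remote ports]), one entry per container per tick.
    server_ports: {port: server name} for the known model servers.
    """
    totals = defaultdict(float)
    for caller, ports in samples:
        for port in set(ports):  # assumption: count each server once per tick
            server = server_ports.get(port)
            if server is not None:
                totals[(caller, server)] += interval_s
    return dict(totals)
```

For example, with a 2-second interval, a caller seen connected to port 11434 on two consecutive ticks would be credited 4 seconds against the server on that port. A call that starts and finishes between two ticks is never observed, which is why sub-second requests are only approximate.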