Supported model servers¶
Recognised even while Idle, so the server stays on the dashboard when its model is
unloaded. Per-model VRAM comes from the server's API where available, otherwise it's
attributed from nvidia-smi.
| Server | Model name | Per-model VRAM |
|---|---|---|
| Ollama | ✅ loaded + pulled catalogue | ✅ via /api/ps (validated) |
| vLLM | ✅ via /v1/models |
attributed |
| llama.cpp / llama-server | ✅ via /v1/models |
attributed |
| LocalAI | ✅ via /v1/models |
attributed |
| HF TGI / TEI | ✅ via /info |
attributed |
| faster-whisper / Speaches | ✅ via /v1/models |
attributed |
| koboldcpp | ✅ via /api/v1/model |
attributed |
| tabbyAPI · text-generation-webui · LM Studio · xinference · Aphrodite · Infinity | ✅ via /v1/models |
attributed |
| SGLang · OpenLLM · LiteLLM · GPUStack · Cortex / Jan · Ramalama · Nexa · mistral.rs | ✅ via /v1/models |
attributed |
| LoRAX | ✅ via /info |
attributed |
| Whisper ASR webservice / WhisperX | ✅ up via /openapi.json (single entry) |
attributed |
| Wyoming (HA voice: faster-whisper / Piper / openWakeWord) | ✅ via describe over TCP |
attributed |
| OpenedAI-Speech | ✅ via /v1/models |
attributed |
| NVIDIA Triton | ✅ up via /v2 (single entry) |
attributed |
| Stable Diffusion (A1111 / Forge / SD.Next) | ✅ via /sdapi/v1/options |
attributed |
| InvokeAI | ✅ via /api/v2/models/ |
attributed |
| ComfyUI | ✅ checkpoints via /object_info |
attributed |
Don't see yours? Adding a probe is a one-liner — append to PROBES in app.py. Most
servers speak the OpenAI /v1/models shape and differ only by port. See
How it works and Contributing.
Who's calling? (caller attribution)¶
Model-server APIs never reveal who is calling them. The monitor works it out from the outside: it samples each container's own established connections and matches the remote port to a model server, then attributes connection-time per caller → server — surfaced as the "Driven by" breakdown on each server card. It's sampled, so long LLM streams are tracked reliably while sub-second calls (e.g. embeddings) are approximate; the hub's own probe traffic is excluded.