Go + Python · OpenAI Compatible · Apache 2.0

Inference Architecture

A Go API server drives Python inference workers over JSON-RPC. MLX serves Apple Silicon; llama.cpp covers CPU and NVIDIA GPUs.

Go Layer (inferred/)
- API Server: chi router · OpenAI compat
- Worker Pool: LRU eviction · memory budget
- Worker Manager: Python subprocess lifecycle
- Model Registry: download + cache
- Prometheus: inference metrics
- JSON-RPC 2.0: stdin/stdout IPC
- CLI (cmd/latticeinference): serve / pull / models

Python Layer (python/)
- worker.py: JSON-RPC server
- Model Loader: format detection
- MLX Backend: Apple Silicon (Metal)
- llama.cpp: GGUF · CPU/NVIDIA
- HF Pipeline: transformers
- MLX Distributed: multi-GPU
- detect_format(): .gguf / .safetensors
- is_apple_silicon(): platform routing

Cluster (Phase 3)
- mDNS Discovery: zero-config LAN discovery
- RDMA: GPU interconnect
- Leader Election: cluster coordination
- Topology: node awareness
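The Model Loader's routing combines the two helpers named above: detect_format() inspects the file extension, and is_apple_silicon() decides whether safetensors models go to MLX or to the transformers pipeline. A hedged sketch, assuming extension-based detection and a pick_backend() helper that is illustrative rather than part of the actual codebase:

```python
import platform
from pathlib import Path

def detect_format(model_path):
    # Route by file extension: GGUF targets llama.cpp,
    # safetensors targets MLX or the HF pipeline.
    suffix = Path(model_path).suffix.lower()
    if suffix == ".gguf":
        return "gguf"
    if suffix == ".safetensors":
        return "safetensors"
    raise ValueError(f"unsupported model format: {suffix}")

def is_apple_silicon():
    # True on macOS with an arm64 CPU, where Metal is available for MLX.
    return platform.system() == "Darwin" and platform.machine() == "arm64"

def pick_backend(model_path):
    # GGUF always uses llama.cpp; safetensors prefers MLX on Apple
    # Silicon and falls back to the transformers pipeline elsewhere.
    if detect_format(model_path) == "gguf":
        return "llama.cpp"
    return "mlx" if is_apple_silicon() else "hf"
```

Routing on extension alone keeps the decision cheap; a real loader might additionally sniff magic bytes, since GGUF files begin with a fixed header.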
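The Worker Pool's "LRU eviction · memory budget" policy means a model stays resident until loading another would exceed the budget, at which point the least recently used workers are evicted first. A conceptual sketch (in Python for consistency with the other examples, though the real pool lives in the Go layer; the class and sizes are illustrative):

```python
from collections import OrderedDict

class WorkerPool:
    def __init__(self, budget_bytes):
        self.budget = budget_bytes
        self.used = 0
        # model id -> estimated memory, ordered least- to most-recently used
        self.workers = OrderedDict()

    def acquire(self, model_id, size_bytes):
        # A hit refreshes recency; a miss loads the model, evicting
        # least-recently-used entries until it fits under the budget.
        if model_id in self.workers:
            self.workers.move_to_end(model_id)
            return
        while self.workers and self.used + size_bytes > self.budget:
            _, freed = self.workers.popitem(last=False)  # evict LRU entry
            self.used -= freed
        self.workers[model_id] = size_bytes
        self.used += size_bytes
```

Budgeting by estimated model size rather than worker count matters here because model footprints vary by an order of magnitude, so a fixed worker cap could still exhaust memory.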