nomyo-router is a transparent proxy for Ollama with model deployment-aware routing. It runs between your frontend application and Ollama backend and is transparent for both.

Architecture

How It Works

nomyo-router accepts any Ollama or OpenAI API request on the configured port and checks the available backends for each request. Embedding, chat, and generate calls are forwarded to an appropriate Ollama server, and the response is relayed back to the client through the router.

When another request for the same model configuration arrives, nomyo-router already knows which model runs on which server and routes it to a server where that model is deployed. Once the maximum number of concurrent connections per model is reached, it falls back to the serving server with the fewest active connections for the fastest completion.

This makes backend utilization significantly more efficient than simple weighted round-robin or least-connection approaches.
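The routing policy described above can be sketched as follows. This is a hypothetical illustration, not the router's actual implementation; the function and parameter names (`pick_endpoint`, `deployments`, `active`, `max_conn`) are invented for this sketch:

```python
# Hypothetical sketch of deployment-aware routing: prefer endpoints that
# already have the model loaded and spare capacity; otherwise fall back
# to the least-loaded endpoint (which will then load the model).

def pick_endpoint(model, deployments, active, max_conn):
    """deployments: endpoint -> set of models loaded on that endpoint
    active: endpoint -> current number of in-flight connections
    max_conn: per endpoint-model connection limit"""
    # Endpoints already serving this model that are under the limit.
    warm = [ep for ep, models in deployments.items()
            if model in models and active.get(ep, 0) < max_conn]
    if warm:
        # Among warm endpoints, pick the one with the fewest connections.
        return min(warm, key=lambda ep: active.get(ep, 0))
    # No warm capacity: least-connections fallback across all endpoints.
    return min(deployments, key=lambda ep: active.get(ep, 0))
```

With the model loaded only on `ollama0`, requests stay on `ollama0` until its per-model limit is hit, and only then spill over to the least-loaded endpoint, which is what makes this cheaper than plain round-robin.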


Key Features

Model Deployment-Aware Routing

nomyo-router tracks which models are loaded on which endpoints and routes requests to servers that already have the model deployed, avoiding unnecessary reloads.

Semantic LLM Cache

Repeated or semantically similar LLM requests are served from cache with no endpoint round-trip, no token cost, and <10ms response times.

  • Exact match mode – identical requests served instantly
  • Semantic matching – “What is Python?” matches “What’s Python?”
  • Cache keys are scoped to model + system_prompt to prevent cross-tenant leakage
  • MOE requests (moe-*) always bypass the cache
# config.yaml -- semantic cache
cache_enabled: true
cache_backend: sqlite
cache_similarity: 0.90
cache_ttl: 3600
cache_history_weight: 0.3
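The lookup behavior described above can be sketched in a few lines, assuming precomputed embedding vectors (the router's real embedding model, storage backend, and TTL handling are not shown; `cache_lookup` and its entry layout are invented for this sketch):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cache_lookup(model, system_prompt, query_vec, cache, threshold=0.90):
    """cache: list of (model, system_prompt, vector, response) entries.
    Entries are scoped to model + system_prompt, mirroring the
    cross-tenant isolation described above."""
    if model.startswith("moe-"):
        return None  # MOE requests always bypass the cache
    best, best_sim = None, 0.0
    for m, sp, vec, resp in cache:
        if (m, sp) != (model, system_prompt):
            continue  # scope check: never match across model/prompt
        sim = cosine(query_vec, vec)
        if sim >= threshold and sim > best_sim:
            best, best_sim = resp, sim
    return best  # None signals a cache miss
```

The `threshold` parameter plays the role of `cache_similarity` in the config: at 0.90, near-paraphrases match while unrelated queries fall through to the backend.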

Multi-Endpoint Support

Configure any combination of Ollama and OpenAI-compatible endpoints:

endpoints:
  - http://ollama0:11434
  - http://ollama1:11434
  - https://api.openai.com/v1

llama_server_endpoints:
  - http://192.168.0.33:8889/v1

API Format Translation

Seamlessly translates between Ollama API and OpenAI API v1 formats. Connect existing applications to any backend without code changes.
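A minimal sketch of one direction of this translation, mapping an Ollama /api/chat body to an OpenAI /v1/chat/completions body. The field names follow the two public API formats; the function itself is invented for illustration and is not the router's actual translation layer:

```python
# Hypothetical sketch: translate an Ollama /api/chat request body into
# an OpenAI /v1/chat/completions request body.

def ollama_chat_to_openai(body):
    out = {
        "model": body["model"],
        # Both APIs use the same role/content message structure.
        "messages": body["messages"],
        # Ollama streams by default; OpenAI does not.
        "stream": body.get("stream", False),
    }
    # Ollama nests sampling options under "options";
    # OpenAI keeps them at the top level.
    opts = body.get("options", {})
    if "temperature" in opts:
        out["temperature"] = opts["temperature"]
    if "num_predict" in opts:
        out["max_tokens"] = opts["num_predict"]
    return out
```

The response path would do the inverse mapping, which is why clients written against either API can talk to either kind of backend through the router.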

Concurrency Control

Set max_concurrent_connections per endpoint-model pair to match your OLLAMA_NUM_PARALLEL settings and prevent overload.
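One way to picture this limit is a semaphore per (endpoint, model) pair, sized by max_concurrent_connections. This is a hypothetical sketch (the `ConnectionLimiter` class is invented for illustration), not the router's actual code:

```python
import asyncio

class ConnectionLimiter:
    """Caps in-flight requests per (endpoint, model) pair, mirroring
    the backend's OLLAMA_NUM_PARALLEL setting."""

    def __init__(self, max_concurrent_connections):
        self._limit = max_concurrent_connections
        self._sems = {}  # (endpoint, model) -> asyncio.Semaphore

    def _sem(self, endpoint, model):
        key = (endpoint, model)
        if key not in self._sems:
            self._sems[key] = asyncio.Semaphore(self._limit)
        return self._sems[key]

    async def run(self, endpoint, model, coro_fn):
        # Queues the request if the pair is already at its limit.
        async with self._sem(endpoint, model):
            return await coro_fn()
```

A request for a pair at its limit simply waits instead of overloading the backend; combined with the routing fallback above, excess load spills to other endpoints first.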

Optional Authentication

Lock down the router with an API key. Clients authenticate via the Authorization: Bearer header or an api_key query parameter.

Configuration

# config.yaml
endpoints:
  - http://ollama0:11434
  - http://ollama1:11434
  - http://ollama2:11434
  - https://api.openai.com/v1

llama_server_endpoints:
  - http://192.168.0.33:8889/v1

max_concurrent_connections: 2

nomyo-router-api-key: ""

api_keys:
  "http://192.168.0.50:11434": "ollama"
  "http://192.168.0.51:11434": "ollama"
  "http://192.168.0.52:11434": "ollama"
  "https://api.openai.com/v1": "${OPENAI_KEY}"
  "http://192.168.0.33:8889/v1": "llama"

Installation

# Lean image (~300 MB)
docker pull bitfreedom.net/nomyo-ai/nomyo-router:latest

# Semantic cache image (~800 MB)
docker pull bitfreedom.net/nomyo-ai/nomyo-router:latest-semantic

Run with mounted config:

docker run -d \
  --name nomyo-router \
  -p 12434:12434 \
  -v /absolute/path/to/config:/app/config/ \
  -e NOMYO_ROUTER_CONFIG_PATH=/app/config/config.yaml \
  bitfreedom.net/nomyo-ai/nomyo-router:latest

From Source

python3 -m venv .venv/router
source .venv/router/bin/activate
pip3 install -r requirements.txt

# Optional: set environment variables
export OPENAI_KEY=YOUR_SECRET_API_KEY
export NOMYO_ROUTER_API_KEY=YOUR_ROUTER_KEY

uvicorn router:app --host 127.0.0.1 --port 12434

For high-concurrency scenarios (>500 simultaneous requests):

uvicorn router:app --host 127.0.0.1 --port 12434 --loop uvloop

Cache Management

# View cache statistics (hit rate, counters, config)
curl http://localhost:12434/api/cache/stats

# Invalidate all cache entries
curl -X POST http://localhost:12434/api/cache/invalidate

Cached Routes

  • /api/chat
  • /api/generate
  • /v1/chat/completions
  • /v1/completions

Authenticate Requests

If nomyo-router-api-key is configured, include the key with every request:

# Header (recommended)
curl -H "Authorization: Bearer $NOMYO_ROUTER_API_KEY" http://localhost:12434/api/tags

# Query param (fallback)
curl "http://localhost:12434/api/tags?api_key=$NOMYO_ROUTER_API_KEY"

Source Code

Available at bitfreedom.net/nomyo-ai/nomyo-router

Contact Us

Email: ichi@nomyo.ai

Phone: +1 (415) 289-9022

Address: 2810 N Church St, PMB 947236, Wilmington, DE 19802-4447, USA