All endpoints (except /health) require a Bearer token in the Authorization header.
Authorization: Bearer YOUR_API_KEY
Tip: Set your base URL to https://4ort.io/v1 and pass your API key; this works with any OpenAI-compatible SDK or tool. Bare paths such as /chat/completions and /embeddings (without the /v1 prefix) also work.
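A minimal sketch of the auth header and base URL described above, using only Python's standard library. The helper name `build_request` is hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://4ort.io/v1"  # base URL from the tip above


def build_request(path: str, payload: dict, api_key: str) -> urllib.request.Request:
    """Build (but do not send) an authenticated JSON POST request."""
    return urllib.request.Request(
        url=f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request("/chat/completions", {"messages": []}, "YOUR_API_KEY")
print(req.full_url)                     # https://4ort.io/v1/chat/completions
print(req.get_header("Authorization"))  # Bearer YOUR_API_KEY
```

To actually send it, something like `urllib.request.urlopen(req, timeout=120)` should work; any HTTP client can be swapped in.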
Rate limits: Round-robin routing distributes requests across 18 locked free models (mostly NVIDIA NIM at 38 RPM each, about 684 RPM of combined headroom). Per-model RPM is tracked proactively, so no attempts are wasted on 429 responses. A sustained 10-15 req/sec is comfortable; higher bursts work but increase tail latency.
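The headroom figure is simple arithmetic. The sketch below reproduces it under the simplifying assumption that all 18 models share the 38 RPM limit (the text says "mostly", so the real number may differ).

```python
MODELS = 18          # locked free models in the round-robin pool
RPM_PER_MODEL = 38   # assumed uniform here; the source says "mostly"

combined_rpm = MODELS * RPM_PER_MODEL
combined_rps = combined_rpm / 60

print(combined_rpm)            # 684
print(round(combined_rps, 1))  # 11.4
```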
POST /v1/chat/completions
OpenAI-compatible chat completions. Supports streaming, tool calling, vision, and structured output. 130+ chat models auto-selected by priority, or specify one explicitly. Models not in the registry are passed through to OpenRouter automatically.
| Parameter | Type | Description |
|---|---|---|
| messages | array | Array of message objects with role and content. Required. |
| model | string | Model ID (e.g. google/gemini-2.5-flash). Optional; auto-selects the best available model if omitted. |
| stream | boolean | Enable SSE streaming. Default: false |
| max_tokens | integer | Maximum tokens to generate. Optional |
| temperature | number | Sampling temperature 0-2. Optional |
| tools | array | Tool/function definitions for tool calling. Optional |
| response_format | object | Structured output format, e.g. {"type":"json_object"}. Optional |
// Request
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
],
"model": "google/gemini-2.5-flash",
"stream": false,
"max_tokens": 1024
}
Auto-routing: Omit the model field and the proxy will automatically select the highest-priority healthy model. If a model fails, the request is automatically retried on the next best model.
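As a rough illustration of auto-routing versus pinning a model, the sketch below builds the request body both ways. The helper name `chat_payload` is hypothetical; only the field names come from the parameter list above.

```python
from typing import List, Optional


def chat_payload(
    messages: List[dict],
    model: Optional[str] = None,
    stream: bool = False,
    max_tokens: Optional[int] = None,
) -> dict:
    """Build a chat completions body; leaving `model` out triggers auto-routing."""
    payload = {"messages": messages, "stream": stream}
    if model is not None:
        payload["model"] = model
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    return payload


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

auto = chat_payload(messages)  # proxy picks the model
pinned = chat_payload(messages, model="google/gemini-2.5-flash", max_tokens=1024)

print("model" in auto)   # False
print(pinned["model"])   # google/gemini-2.5-flash
```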
Routing Headers (optional)
| Header | Values | Description |
|---|---|---|
| X-Routing | round-robin, priority | Load-balancing strategy. round-robin distributes across models; priority always picks the top model first. Default: server config |
| X-Parallel | 1-5 | Number of models to try simultaneously. 1 = sequential (saves quota), 2-3 = fast failover (uses more quota). Default: server config |
| X-No-Fallback | true | When set with an explicit model, disables all fallback behavior. If the requested model fails, returns the error directly instead of trying other models. Useful for premium/paid models where you want consistent quality. Default: off |
| X-App-Name | string | Tag requests with your app name for usage tracking and analytics. Optional |
// Example: fast mode (round-robin + 3 parallel attempts)
X-Routing: round-robin
X-Parallel: 3
// Example: quota-saver (priority routing, no parallelism)
X-Routing: priority
X-Parallel: 1
// Example: premium model (no fallback to free models)
X-No-Fallback: true
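One way to assemble these headers on the client is sketched below. The function name `routing_headers` and its validation are illustrative assumptions; the header names and allowed values are taken from the routing header descriptions.

```python
from typing import Dict, Optional


def routing_headers(
    routing: Optional[str] = None,
    parallel: Optional[int] = None,
    no_fallback: bool = False,
    app_name: Optional[str] = None,
) -> Dict[str, str]:
    """Return only the routing headers that were explicitly requested."""
    headers: Dict[str, str] = {}
    if routing is not None:
        if routing not in ("round-robin", "priority"):
            raise ValueError(f"unsupported X-Routing value: {routing!r}")
        headers["X-Routing"] = routing
    if parallel is not None:
        if not 1 <= parallel <= 5:
            raise ValueError("X-Parallel must be between 1 and 5")
        headers["X-Parallel"] = str(parallel)
    if no_fallback:
        headers["X-No-Fallback"] = "true"
    if app_name:
        headers["X-App-Name"] = app_name
    return headers


fast = routing_headers(routing="round-robin", parallel=3)
premium = routing_headers(no_fallback=True)

print(fast)     # {'X-Routing': 'round-robin', 'X-Parallel': '3'}
print(premium)  # {'X-No-Fallback': 'true'}
```

Omitted headers fall back to the server configuration, so the helper leaves them out instead of guessing a default.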
Throughput Benchmarks
Real-world load test results with round-robin routing and per-model RPM limiting (April 2026).
| Concurrency | Requests | Success | p50 Latency | p90 Latency |
|---|---|---|---|---|
| 15 | 100 | 99% | 1.6s | 90s |
| 20 | 200 | 79.5% | 963ms | 37s |
Recommendation: Stay under 15 concurrent requests for ~99% success and snappy p50. Sub-second median latency on small/medium models. Big models (Mistral Large 675B, DeepSeek 671B) can take 60-90s — set client timeout to 120s+. Big-model timeouts mostly happen at high concurrency.
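A client can enforce this recommendation with a semaphore and a per-request timeout. The sketch below uses `asyncio` with a stand-in worker instead of a real HTTP call; `run_bounded` and `fake_worker` are hypothetical names.

```python
import asyncio

MAX_CONCURRENCY = 15    # recommended ceiling from the benchmarks
CLIENT_TIMEOUT_S = 120  # large models can take 60-90s


async def run_bounded(jobs, worker, limit=MAX_CONCURRENCY, timeout=CLIENT_TIMEOUT_S):
    """Run worker(job) for every job with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await asyncio.wait_for(worker(job), timeout=timeout)

    # gather() returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_worker(job):
    # Stand-in for a real request; replace with your HTTP client call.
    await asyncio.sleep(0.01)
    return job * 2


results = asyncio.run(run_bounded(range(30), fake_worker))
print(len(results))  # 30
```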
POST /v1/embeddings
OpenAI-compatible text embeddings across 5 models. Auto-selects highest-priority available model — no fallback (embedding dimensions differ between providers).
| Parameter | Type | Description |
|---|---|---|
| input | string or array | Text or array of texts to embed. Required. |
| model | string | Model ID. Optional; auto-selects the highest-priority available model. |
Available Models
| Model ID | Dim | Notes |
|---|---|---|
| 4ort/qwen3-embedding-4b | 2000 | Self-hosted Qwen3-Embedding-4B (truncated from 2560 for pgvector HNSW). Highest priority. |
| gemini/gemini-embedding-001 | 3072 | Google Gemini embeddings. Free tier. |
| openai/text-embedding-3-large | 3072 | OpenAI large embeddings. |
| openai/text-embedding-3-small | 1536 | OpenAI small embeddings. |
| siliconflow/BAAI/bge-m3 | 1024 | BAAI BGE-M3 multilingual. |
// Request
{
"input": "The quick brown fox jumps over the lazy dog",
"model": "4ort/qwen3-embedding-4b"
}
Batch input: Pass an array of strings to input to embed multiple texts in a single request — far more efficient than one request per text.
No auto-fallback: Different embedding models have different dimensions (1024-3072), so a fallback would silently change the vector shape. Pin a model explicitly if you're storing vectors in a database.
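To make the pinning advice concrete, the sketch below builds a batch request for a fixed model and checks returned vectors against the dimension listed for it. The helper names are hypothetical and the vectors are fabricated placeholders, not real embeddings.

```python
from typing import Iterable, List

# Dimensions copied from the Available Models list
EMBEDDING_DIMS = {
    "4ort/qwen3-embedding-4b": 2000,
    "gemini/gemini-embedding-001": 3072,
    "openai/text-embedding-3-large": 3072,
    "openai/text-embedding-3-small": 1536,
    "siliconflow/BAAI/bge-m3": 1024,
}


def embedding_payload(texts: Iterable[str], model: str) -> dict:
    """Build a batch embeddings body with an explicitly pinned model."""
    if model not in EMBEDDING_DIMS:
        raise ValueError(f"unknown embedding model: {model!r}")
    return {"input": list(texts), "model": model}


def check_dimensions(vectors: List[List[float]], model: str) -> int:
    """Raise if any vector does not match the pinned model's dimension."""
    expected = EMBEDDING_DIMS[model]
    for index, vector in enumerate(vectors):
        if len(vector) != expected:
            raise ValueError(f"vector {index} has {len(vector)} dims, expected {expected}")
    return expected


model = "siliconflow/BAAI/bge-m3"
payload = embedding_payload(["first text", "second text"], model)
placeholder_vectors = [[0.0] * 1024, [0.0] * 1024]  # stand-in for a response

print(len(payload["input"]))                         # 2
print(check_dimensions(placeholder_vectors, model))  # 1024
```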
POST /v1/images/generations
OpenAI-compatible image generation. 23 image models across Gemini (Nano Banana 2/Pro), OpenAI (GPT Image 1.5, DALL-E 3), x.ai (Grok Imagine), and SiliconFlow (FLUX, Qwen-Image, Stable Diffusion, Kolors). Auto-selects best model or specify one.
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Text description of the image to generate. Required. |
| model | string | Model ID (e.g. gemini/gemini-3.1-flash-image-preview). Optional; auto-selects the best model. |
| size | string | Image size, e.g. 1024x1024, 1536x1024. Default: 1024x1024 |
| n | integer | Number of images to generate (1-4). Default: 1 |
| quality | string | Quality level: standard or hd. Optional, model-dependent |
// Request
{
"prompt": "A futuristic city skyline at sunset, cyberpunk style",
"model": "gemini/gemini-3.1-flash-image-preview",
"size": "1024x1024",
"n": 1
}
// Response
{
"created": 1740600000,
"data": [
{
"b64_json": "iVBORw0KGgo..."
// or "url": "https://..." depending on model
}
]
}
Response format varies by provider: Gemini and OpenAI return base64 in b64_json (heavy — ~100KB+ per image). x.ai and SiliconFlow return temporary url. All wrapped in OpenAI's standard { data: [...] } shape. Tip: if you're feeding the response back into an LLM context window, prefer providers that return URLs.
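Because the payload differs by provider, client code usually has to handle both shapes. A minimal sketch follows; `extract_images` is a hypothetical helper, and the sample response is fabricated to show both cases in one list.

```python
import base64
from typing import List, Tuple, Union


def extract_images(response: dict) -> List[Tuple[str, Union[bytes, str]]]:
    """Return ("bytes", raw_image) or ("url", link) for each item in data."""
    images = []
    for item in response.get("data", []):
        if "b64_json" in item:
            images.append(("bytes", base64.b64decode(item["b64_json"])))
        elif "url" in item:
            images.append(("url", item["url"]))
    return images


# Fabricated sample mixing both shapes for illustration
sample = {
    "created": 1740600000,
    "data": [
        {"b64_json": base64.b64encode(b"fake image bytes").decode("ascii")},
        {"url": "https://example.com/image.png"},
    ],
}

for kind, value in extract_images(sample):
    print(kind)
```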
POST /v1/images/edits
OpenAI-compatible image editing — edit an existing image with a text prompt. Currently routed through x.ai Grok Imagine. Supports single or multi-image inputs.
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Edit instruction (e.g. "Make the sky red"). Required. |
| image | string or array | URL or base64 data URI of the source image. Required (or use images). |
| images | array | Array of URLs or base64 data URIs for multi-image edits. Alternative to image. |
| model | string | Model ID. Optional; auto-selects. |
| n | integer | Number of variations to generate. Default: 1 |
| size | string | Output size. Optional |
// Request
{
"prompt": "Add a sunset glow and dramatic clouds",
"image": "https://example.com/photo.jpg",
"model": "xai/grok-imagine-image"
}
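The sketch below shows one way to choose between `image` and `images` and to turn local bytes into a base64 data URI. Both helper names are hypothetical, and the single-versus-multiple rule is an assumption based on the parameter descriptions.

```python
import base64
from typing import List, Optional


def to_data_uri(raw: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URI."""
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def edit_payload(prompt: str, sources: List[str], model: Optional[str] = None) -> dict:
    """Use `image` for one source and `images` for several."""
    if not sources:
        raise ValueError("at least one source image is required")
    payload = {"prompt": prompt}
    if len(sources) == 1:
        payload["image"] = sources[0]
    else:
        payload["images"] = list(sources)
    if model is not None:
        payload["model"] = model
    return payload


single = edit_payload("Make the sky red", ["https://example.com/photo.jpg"])
multi = edit_payload("Blend these", [to_data_uri(b"a"), to_data_uri(b"b")])

print(sorted(single))  # ['image', 'prompt']
print(sorted(multi))   # ['images', 'prompt']
```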
POST /v1/videos/generations
Async video generation via 6 models across x.ai and fal.ai. Returns a request_id — poll GET /v1/videos/:requestId for the result.
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Text description of the video to generate. Required. |
| model | string | Model ID. Required. Available: xai/grok-imagine-video, fal/minimax-hailuo, fal/kling-v2.1, fal/wan-2.6, fal/hunyuan-video, fal/pika-2.2 |
| duration | integer | Video duration in seconds (model-dependent max: 5-15s). Optional |
| aspect_ratio | string | Aspect ratio: 1:1, 16:9, 9:16, 4:3, etc. Optional, varies by model |
| resolution | string | 480p, 720p, or 1080p. Optional, varies by model |
| image_url | string | URL or base64 data URI for image-to-video generation. Optional (xAI only) |
| source_video_url | string | URL of an existing video to extend (continuation). New duration is appended to the source. Source must be ≤15s. Optional (xAI only) |
// Step 1: Create video generation job
POST /v1/videos/generations
{
"prompt": "A rocket launching from a tropical island at sunset",
"model": "xai/grok-imagine-video",
"duration": 10,
"aspect_ratio": "16:9",
"resolution": "720p"
}
Async workflow: Video generation takes 30-180 seconds. Poll the status endpoint every 5-10 seconds until status is "done" or "failed". Pricing: $0.04-$0.08/second depending on model.
Continuation chain: To build a longer narrative, generate a base clip → grab its video.url from the status response → POST a new request with source_video_url set to it. Repeat to chain. xAI keeps style/character consistent from the source clip's last frame. Source must be ≤15 seconds.
Request ID prefix: Returned IDs are prefixed with the provider (e.g. xai_abc123) so the same status endpoint works regardless of which provider generated it.
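The polling workflow can be sketched as a loop around an injected status fetcher, which keeps the example free of network calls. `poll_video`, `provider_of`, and the intermediate "pending" status are assumptions; only "done", "failed", `video.url`, and the provider prefix come from the text above.

```python
import time
from typing import Callable


def provider_of(request_id: str) -> str:
    """Return the provider prefix, e.g. 'xai' for 'xai_abc123'."""
    return request_id.split("_", 1)[0]


def poll_video(
    request_id: str,
    fetch_status: Callable[[str], dict],
    interval_s: float = 5.0,
    max_wait_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job is done or failed, or time runs out."""
    waited = 0.0
    while True:
        status = fetch_status(request_id)
        if status.get("status") in ("done", "failed"):
            return status
        if waited >= max_wait_s:
            raise TimeoutError(f"{request_id} not finished after {max_wait_s}s")
        sleep(interval_s)
        waited += interval_s


# Fabricated status sequence standing in for GET /v1/videos/:requestId
_responses = iter([
    {"status": "pending"},  # assumed intermediate status name
    {"status": "pending"},
    {"status": "done", "video": {"url": "https://example.com/clip.mp4"}},
])

result = poll_video("xai_abc123", lambda _id: next(_responses), sleep=lambda s: None)
print(result["video"]["url"])     # https://example.com/clip.mp4
print(provider_of("xai_abc123"))  # xai
```

For a continuation chain, the `video.url` from a finished job would be passed as `source_video_url` in the next request.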
POST /v1/entities/search
Unified entity search across 4 knowledge graph providers. Each provider returns standardized Entity objects with provider-specific metadata for SEO, academic research, and identity resolution.
| Parameter | Type | Description |
|---|---|---|
| query | string | Search query. Required. |
| provider | string | One of: wikidata, openalex, crossref, orcid. Default: wikidata |
Cross-linking: Wikidata QIDs are the universal identifier — many entities have ORCID iDs (P496), DOIs (P356), and OpenAlex IDs (P10283) attached. Use SPARQL to traverse these relationships.
ORCID note: ORCID search is a two-step lookup (search + per-profile fetch) so it's slower but returns much richer person data than the others.
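A small sketch of building the search body with the provider validated on the client side. The helper name `entity_search_payload` is hypothetical; the provider list and default are taken from the parameter description.

```python
PROVIDERS = ("wikidata", "openalex", "crossref", "orcid")


def entity_search_payload(query: str, provider: str = "wikidata") -> dict:
    """Build an entity search body, rejecting unknown providers early."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if provider not in PROVIDERS:
        raise ValueError(f"provider must be one of {PROVIDERS}")
    return {"query": query, "provider": provider}


print(entity_search_payload("Marie Curie"))
print(entity_search_payload("Marie Curie", provider="orcid"))
```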
POST /v1/entities/sparql
Raw SPARQL queries against Wikidata's graph database. Returns unprocessed bindings for maximum flexibility — relationship traversal, complex filters, and cross-linking entities to ORCID, DOI, OpenAlex.
| Parameter | Type | Description |
|---|---|---|
| query | string | Valid SPARQL query against the Wikidata endpoint. Required. |
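As an example of cross-linking, the sketch below wraps a query that lists entities carrying an ORCID iD (property P496). It assumes the endpoint accepts the usual Wikidata Query Service prefixes (`wdt:`, `wikibase:`, `bd:`) without declaring them; the payload is only built, not sent.

```python
# SPARQL text assumes the standard Wikidata Query Service prefixes are predefined.
ORCID_QUERY = """
SELECT ?person ?personLabel ?orcid WHERE {
  ?person wdt:P496 ?orcid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 5
""".strip()


def sparql_payload(query: str) -> dict:
    """Build the request body for the raw SPARQL endpoint."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return {"query": query}


sparql_body = sparql_payload(ORCID_QUERY)
print(sparql_body["query"].splitlines()[0])
```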
Specific engines: duckduckgo, brave, wikipedia, wikidata. Optional
| Parameter | Type | Description |
|---|---|---|
| language | string | Language code (e.g. en). Default: en |
| pageno | number | Page number for pagination. Default: 1 |
| time_range | string | day, week, month, year. Optional |
| safesearch | number | 0 = off, 1 = moderate, 2 = strict. Default: 0 |
// Response
{
"query": "typescript generics",
"number_of_results": 5,
"results": [
{
"title": "TypeScript: Documentation - Generics",
"url": "https://www.typescriptlang.org/docs/handbook/2/generics.html",
"content": "Generics provide a way to make components work with any data type...",
"engine": "brave",
"category": "general",
"score": 1.0,
"publishedDate": null,
"thumbnail": null
}
],
"suggestions": [],
"infoboxes": []
}
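A response in this shape can be reduced to ranked (title, url) pairs with a few lines. The sketch below uses a trimmed copy of the sample response; `top_results` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def top_results(
    response: dict, min_score: float = 0.0, engine: Optional[str] = None
) -> List[Tuple[str, str]]:
    """Return (title, url) pairs sorted by score, optionally filtered by engine."""
    hits = [
        r for r in response.get("results", [])
        if r.get("score", 0.0) >= min_score
        and (engine is None or r.get("engine") == engine)
    ]
    hits.sort(key=lambda r: r.get("score", 0.0), reverse=True)
    return [(r["title"], r["url"]) for r in hits]


# Trimmed copy of the sample response
search_sample = {
    "query": "typescript generics",
    "results": [
        {
            "title": "TypeScript: Documentation - Generics",
            "url": "https://www.typescriptlang.org/docs/handbook/2/generics.html",
            "engine": "brave",
            "score": 1.0,
        }
    ],
}

print(top_results(search_sample, engine="brave"))
```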
POST /v1/extract
Extract clean article content from any URL using Mozilla Readability. Returns title, author, date, and markdown body. Lightweight alternative to Firecrawl for article extraction.
SEO difficulty scores (0-100) showing how hard it is to rank for each keyword. Up to 1,000 keywords. Requires location + language.
POST /v1/seo/keywords/intent
Classify search intent (informational, navigational, commercial, transactional) for each keyword. Up to 1,000. Requires language.
POST /v1/seo/keywords/ai-volume
AI search volume (queries to ChatGPT, Perplexity, Claude, etc.) for up to 1,000 keywords. Useful for tracking AI-driven traffic. Requires location + language.
Location: Pass either a country name (e.g. "United States") or a DataForSEO location code (e.g. 2840).
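Lists longer than the 1,000-keyword limit need to be split across requests. The sketch below shows one way to batch them. The body field names (`keywords`, `location`, `language`) are assumptions, since the exact request schema is not shown here.

```python
from typing import Iterator, List, Union

MAX_KEYWORDS = 1000  # per-request limit stated for the SEO endpoints


def chunk_keywords(keywords: List[str], size: int = MAX_KEYWORDS) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` keywords."""
    for start in range(0, len(keywords), size):
        yield keywords[start:start + size]


def seo_payload(keywords: List[str], location: Union[str, int], language: str) -> dict:
    # Field names are hypothetical; location may be a country name or a numeric code.
    return {"keywords": keywords, "location": location, "language": language}


all_keywords = [f"kw{i}" for i in range(2500)]
batches = [seo_payload(chunk, 2840, "en") for chunk in chunk_keywords(all_keywords)]

print([len(b["keywords"]) for b in batches])  # [1000, 1000, 500]
```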
Pricing: SEO endpoints are not free — billed via the underlying DataForSEO account. Other endpoints on 4ort.io are free or use the proxy's free-tier providers.
GET /v1/models
List all available models — chat, image, video, embedding, and entity-search. Each entry has a type field and a pricing object. 0 = free.
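Filtering the list for free models of a given type might look like the sketch below. The entry shape is an assumption: the `id` field and the keys inside `pricing` are hypothetical, and only the `type` field, the `pricing` object, and "0 = free" come from the description above.

```python
from typing import List


def is_free(model: dict) -> bool:
    """Treat a model as free only if it has pricing and every value is 0."""
    pricing = model.get("pricing") or {}
    return bool(pricing) and all(float(value) == 0 for value in pricing.values())


def free_models(models: List[dict], model_type: str) -> List[str]:
    """Return the IDs of free models of the given type."""
    return [m["id"] for m in models if m.get("type") == model_type and is_free(m)]


# Fabricated entries; real IDs and pricing keys may differ
sample_models = [
    {"id": "example/free-chat", "type": "chat", "pricing": {"prompt": 0, "completion": 0}},
    {"id": "example/paid-chat", "type": "chat", "pricing": {"prompt": 0.5, "completion": 1.5}},
    {"id": "example/free-embed", "type": "embedding", "pricing": {"prompt": 0}},
]

print(free_models(sample_models, "chat"))  # ['example/free-chat']
```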
Programmatic service discovery — returns all enabled endpoints, their parameter schemas, and provider lists. Useful for agents that need to discover capabilities at runtime.
Live model rankings, health, rate-limit usage, and routing breakdown. Updated continuously from real request data — useful for monitoring which models are healthy and available right now.
Status of the self-hosted inference server (where 4ort/* models run). Returns CPU, memory, per-model status, queue depth, and request stats. Cached for 3 seconds — safe to poll often.