# HTTP server

Run a local HTTP server that exposes an OpenAI-compatible API.

## Overview

To run the server, you need the `@qvac/sdk` and `@qvac/cli` npm packages installed in your project. The server is provided by `@qvac/cli` and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to `http://localhost:11434/v1/` and work without changes.
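Any HTTP client can target the base URL directly. A minimal Node.js sketch (the helper name `chatRequest` and the structure of the returned object are illustrative, not part of the API):

```javascript
// Base URL of the local QVAC server (default host/port from this page).
const BASE_URL = "http://localhost:11434/v1";

// Build the URL and fetch options for a chat completion request.
// Returning them (rather than sending immediately) keeps the shape inspectable.
function chatRequest(model, messages) {
  return {
    url: `${BASE_URL}/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, messages }),
    },
  };
}

// To actually send it (requires the server to be running):
//   const { url, options } = chatRequest("my-llm", [{ role: "user", content: "Hello!" }]);
//   const res = await fetch(url, options);
```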
## AI capabilities

At present, the HTTP server supports the following QVAC AI capabilities:

- Text generation (chat completions), including streaming and tool calling
- Text embeddings
- Audio transcription
## Compatible tools

The following tools have been verified to work as drop-in replacements by pointing their base URL at the QVAC server:
| Tool | Required endpoints |
|---|---|
| Continue.dev | `/v1/chat/completions` (streaming SSE), `/v1/models` |
| LangChain | `/v1/chat/completions`, `/v1/embeddings`, `/v1/models` |
| Open Interpreter | `/v1/chat/completions` (streaming, tool calls), `/v1/models` |
## Running the server

Install the SDK and CLI in your project:

```sh
npm install @qvac/sdk @qvac/cli
```

See Installation for environment-specific SDK setup instructions.

Create a `qvac.config.*` file at the root of your project declaring which models the server can load. For example:

```json
{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "config": { "ctx_size": 8192 }
      }
    }
  }
}
```

Start the server:

```sh
qvac serve openai
```

Send a request:

```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```

## Configuration
Models are declared in `qvac.config.*` under the `serve.models` key. The server can only load models listed under this key — requests for unlisted models return 404. Each key in `serve.models` is a model alias: the name that HTTP clients use in the `model` field of their requests. For the full schema of `serve.models`, see Configuration — ServeConfig.
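That resolution rule can be sketched as follows. This is an illustration of the behavior described above, not the server's actual code (the function name `resolveModel` is made up, and endpoint categories are ignored for brevity):

```javascript
// Resolve the model alias for an incoming request against serve.models.
// - A listed alias is used as-is.
// - A missing "model" field falls back to the alias marked "default": true.
// - An unlisted alias yields null (the server responds 404 in that case).
function resolveModel(serveModels, requestedAlias) {
  if (requestedAlias) {
    return serveModels[requestedAlias] ? requestedAlias : null;
  }
  const fallback = Object.keys(serveModels).find((a) => serveModels[a].default);
  return fallback ?? null;
}
```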
### Example

```json
{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "preload": true,
        "config": { "ctx_size": 8192, "tools": true }
      },
      "my-embed": {
        "model": "GTE_LARGE_FP16",
        "default": true
      },
      "whisper": {
        "model": "WHISPER_TINY",
        "default": true,
        "preload": true,
        "config": { "language": "en", "strategy": "greedy" }
      }
    }
  }
}
```

- `model`: SDK model constant name (e.g., `QWEN3_600M_INST_Q4`). The server resolves it to a download source and addon type automatically.
- `default`: when `true`, this model is used when a request doesn't specify a `model` field, or when a tool picks the default for an endpoint category.
- `preload`: when `true`, the model is loaded into memory on server startup. When `false`, it is loaded on first request (cold start).
- `config`: model config overrides passed to the underlying addon. Same options as `modelConfig` in `loadModel()`.
## CLI

```
qvac serve openai [options]

  -c, --config <path>    Config file path (default: auto-detect qvac.config.*)
  -p, --port <number>    Port to listen on (default: 11434)
  -H, --host <address>   Host to bind to (default: 127.0.0.1)
  --model <alias>        Model alias to preload (repeatable, must be in config)
  --api-key <key>        Require Bearer token authentication
  --cors                 Enable CORS headers
  -v, --verbose          Detailed output
```

## API
All endpoints follow the OpenAI API request and response format. Base path: `/v1`.

### Endpoints

- `GET /v1/models` — list loaded models
- `GET /v1/models/:id` — get model details
- `DELETE /v1/models/:id` — unload a model
- `POST /v1/chat/completions` — chat completions (blocking + SSE streaming)
- `POST /v1/embeddings` — text embeddings (single + batch)
- `POST /v1/audio/transcriptions` — audio transcription
### GET /v1/models

List all loaded models.

```sh
curl http://localhost:11434/v1/models
```

Response:

```json
{
  "object": "list",
  "data": [
    { "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
  ]
}
```

### GET /v1/models/:id

Get details of a specific loaded model.

```sh
curl http://localhost:11434/v1/models/my-llm
```

### DELETE /v1/models/:id

Unload a model, releasing its resources.

```sh
curl -X DELETE http://localhost:11434/v1/models/my-llm
```

Response:

```json
{ "id": "my-llm", "object": "model", "deleted": true }
```

### POST /v1/chat/completions
Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, and per-request generation parameters.
Blocking request:

```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

Streaming request (server-sent events):

```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": true
  }'
```

Tool calling:

```sh
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }]
  }'
```

#### Generation parameters
The following OpenAI parameters are forwarded to the model on each request:
| OpenAI parameter | SDK parameter | Description |
|---|---|---|
| `temperature` | `temp` | Sampling temperature |
| `max_tokens` | `predict` | Maximum tokens to generate |
| `max_completion_tokens` | `predict` | Alias for `max_tokens` |
| `top_p` | `top_p` | Nucleus sampling threshold |
| `seed` | `seed` | Random seed for deterministic output |
| `frequency_penalty` | `frequency_penalty` | Penalize frequent tokens |
| `presence_penalty` | `presence_penalty` | Penalize already-present tokens |
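The table above amounts to a simple renaming map. A sketch of that mapping (the function name `toSdkParams` is illustrative, not the server's actual code):

```javascript
// Map OpenAI-style generation parameters to their SDK equivalents,
// following the table above. Unmapped keys are simply dropped here.
const PARAM_MAP = {
  temperature: "temp",
  max_tokens: "predict",
  max_completion_tokens: "predict",
  top_p: "top_p",
  seed: "seed",
  frequency_penalty: "frequency_penalty",
  presence_penalty: "presence_penalty",
};

function toSdkParams(openaiParams) {
  const sdk = {};
  for (const [key, value] of Object.entries(openaiParams)) {
    const mapped = PARAM_MAP[key];
    if (mapped) sdk[mapped] = value;
  }
  return sdk;
}
```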
#### Unsupported parameters

The following OpenAI parameters are accepted but ignored (a warning is logged): `n`, `logprobs`, `response_format`, `stop`, `top_logprobs`, `logit_bias`, `parallel_tool_calls`, `stream_options`.
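On the client side, the streaming response is standard OpenAI-style SSE: each event is a `data:` line carrying a JSON chunk, terminated by `data: [DONE]`. A minimal parser sketch, assuming the chunks follow the OpenAI `chat.completion.chunk` shape:

```javascript
// Collect the assistant text from an SSE body by concatenating
// the content delta of each "data:" chunk until the [DONE] sentinel.
function collectSseText(sseBody) {
  let text = "";
  for (const line of sseBody.split("\n")) {
    if (!line.startsWith("data:")) continue;
    const payload = line.slice(5).trim();
    if (payload === "[DONE]") break;
    const chunk = JSON.parse(payload);
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```

In a real client this would be fed incrementally from the response body stream; splitting a fully buffered body keeps the sketch short.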
### POST /v1/embeddings

Generate text embeddings. Accepts a single string or a batch of strings.

Single input:

```sh
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": "The quick brown fox"
  }'
```

Batch input:

```sh
curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": ["First sentence", "Second sentence"]
  }'
```

Response:

```json
{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034, ...] }
  ],
  "model": "my-embed",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}
```

`encoding_format` (only `float` is supported) and `dimensions` are accepted but ignored.
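A common use of the returned vectors is similarity search. A sketch of cosine similarity over two `embedding` arrays from the response:

```javascript
// Cosine similarity between two embedding vectors of equal length:
// dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```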
### POST /v1/audio/transcriptions

Transcribe audio using Whisper or Parakeet models. This endpoint uses `multipart/form-data` (not JSON). Maximum file size: 25 MB.

To receive a JSON response (the default):

```sh
curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=json"
```

Response:

```json
{ "text": "transcribed text here" }
```

To receive a plain-text response:

```sh
curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=text"
```

With a prompt (Whisper uses it as `initial_prompt`):

```sh
curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "prompt=President Kennedy speech about space exploration"
```

#### Parameters

| Parameter | Description | Required |
|---|---|---|
| `file` | Audio file to transcribe. | Yes |
| `model` | Model alias (must be in config). | Yes |
| `response_format` | `json` (default) or `text`. | No |
| `prompt` | Optional prompt forwarded to the model. | No |

Unsupported `response_format` values (`srt`, `vtt`, `verbose_json`) return a 400 error.

`language` and `temperature` are accepted but currently only configurable at model load time (via the `serve.models` config), not per-request. A warning is logged when they are sent.
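From Node.js (18+), the same request can be built with the global `FormData` and sent with `fetch`. A sketch (the helper name `transcriptionForm` is illustrative; the audio bytes are wrapped in a `Blob` directly rather than read from disk):

```javascript
// Build the multipart form for POST /v1/audio/transcriptions.
// In a real call the Blob would wrap actual audio bytes (e.g. from fs.readFile).
function transcriptionForm(audioBytes, modelAlias, responseFormat = "json") {
  const form = new FormData();
  form.append("file", new Blob([audioBytes]), "audio.wav");
  form.append("model", modelAlias);
  form.append("response_format", responseFormat);
  return form;
}

// Sending it (requires the server to be running):
//   const res = await fetch("http://localhost:11434/v1/audio/transcriptions", {
//     method: "POST",
//     body: transcriptionForm(bytes, "whisper"),
//   });
```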
## Authentication

By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the `--api-key` flag:

```sh
qvac serve openai --api-key my-secret-token
```

Clients must then include the token in the `Authorization` header:

```sh
curl http://localhost:11434/v1/models \
  -H "Authorization: Bearer my-secret-token"
```

Requests without a valid token receive a 401 response.
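The check amounts to comparing the `Authorization` header against the configured key. A sketch of the rule as described above (not the server's actual code; the function name `authStatus` is made up):

```javascript
// Return the HTTP status applied for a given Authorization header.
// With no API key configured, every request passes (200 stands for "allowed").
// With a key configured, only an exact "Bearer <key>" match passes; anything
// else receives 401.
function authStatus(apiKey, authorizationHeader) {
  if (!apiKey) return 200;
  return authorizationHeader === `Bearer ${apiKey}` ? 200 : 401;
}
```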