HTTP server

Run a local HTTP server that exposes an OpenAI-compatible API.

Overview

To run the server, you need the @qvac/sdk and @qvac/cli npm packages installed in your project. The server is provided by @qvac/cli and internally translates HTTP requests into SDK calls. As a result, any system compatible with the OpenAI REST API can point to http://localhost:11434/v1/ and work without changes.
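Because the API is OpenAI-compatible, any HTTP client can talk to it. As a minimal sketch using the fetch global available in Node 18+ (the helper names here are illustrative, not part of the SDK; my-llm is a model alias declared in qvac.config.*):

```typescript
// Minimal OpenAI-compatible chat call against the local QVAC server.
const BASE_URL = "http://localhost:11434/v1";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// Build the request options for POST /v1/chat/completions.
function buildChatRequest(model: string, messages: ChatMessage[]) {
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, messages }),
  };
}

// Send the request and return the assistant's reply text.
async function chat(model: string, messages: ChatMessage[]): Promise<string> {
  const res = await fetch(`${BASE_URL}/chat/completions`, buildChatRequest(model, messages));
  const data = await res.json();
  return data.choices[0].message.content;
}

// chat("my-llm", [{ role: "user", content: "Hello!" }]).then(console.log);
```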

AI capabilities

At present, the HTTP server supports the following QVAC AI capabilities:

  • Chat completions (text generation, streaming, and tool calling)
  • Text embeddings
  • Audio transcription

Compatible tools

The following tools have been verified to work as drop-in replacements by pointing their base URL to the QVAC server:

Tool               Required endpoints
Continue.dev       /v1/chat/completions (streaming SSE), /v1/models
LangChain          /v1/chat/completions, /v1/embeddings, /v1/models
Open Interpreter   /v1/chat/completions (streaming, tool calls), /v1/models

Running the server

Install the SDK and CLI in your project:

npm install @qvac/sdk @qvac/cli

See Installation for environment-specific installation instructions for the SDK.

Create a qvac.config.* file at the root of your project that declares which models the server can load. For example:

qvac.config.json
{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "config": { "ctx_size": 8192 }
      }
    }
  }
}

Start the server:

qvac serve openai

Send a request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Configuration

Models are declared in qvac.config.* under the serve.models key. Each key in serve.models is a model alias: the name that HTTP clients use in the model field of their requests. The server can only load models listed under this key; requests for unlisted models return 404. For the full schema of serve.models, see Configuration — ServeConfig.

Example

qvac.config.json
{
  "serve": {
    "models": {
      "my-llm": {
        "model": "QWEN3_600M_INST_Q4",
        "default": true,
        "preload": true,
        "config": { "ctx_size": 8192, "tools": true }
      },
      "my-embed": {
        "model": "GTE_LARGE_FP16",
        "default": true
      },
      "whisper": {
        "model": "WHISPER_TINY",
        "default": true,
        "preload": true,
        "config": { "language": "en", "strategy": "greedy" }
      }
    }
  }
}
  • model: SDK model constant name (e.g., QWEN3_600M_INST_Q4). The server resolves it to a download source and addon type automatically.
  • default: when true, this model is used when a request doesn't specify a model field, or when a tool picks the default for an endpoint category.
  • preload: when true, the model is loaded into memory on server startup. When false, it is loaded on first request (cold start).
  • config: model config overrides passed to the underlying addon. Same options as modelConfig in loadModel().
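The alias and default rules above can be sketched as a small resolver. This is an illustrative approximation of the documented behavior, not the server's actual code (in particular, the real server picks the default per endpoint category, e.g. an embeddings request falls back to the default embedding model):

```typescript
// Shape of one entry under serve.models in qvac.config.*.
interface ModelEntry {
  model: string;
  default?: boolean;
  preload?: boolean;
  config?: Record<string, unknown>;
}

type ServeModels = Record<string, ModelEntry>;

// Resolve a request's "model" field to a config alias.
// Unlisted aliases yield null (the server responds 404);
// a missing "model" field falls back to the default alias.
function resolveAlias(requested: string | undefined, models: ServeModels): string | null {
  if (requested !== undefined) {
    return requested in models ? requested : null;
  }
  for (const [alias, entry] of Object.entries(models)) {
    if (entry.default) return alias;
  }
  return null;
}
```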

CLI

qvac serve openai [options]
  -c, --config <path>    Config file path (default: auto-detect qvac.config.*)
  -p, --port <number>    Port to listen on (default: 11434)
  -H, --host <address>   Host to bind to (default: 127.0.0.1)
  --model <alias>        Model alias to preload (repeatable, must be in config)
  --api-key <key>        Require Bearer token authentication
  --cors                 Enable CORS headers
  -v, --verbose          Detailed output

API

All endpoints follow the OpenAI API request and response format. Base path: /v1.

Endpoints

  • GET /v1/models — list loaded models
  • GET /v1/models/:id — get model details
  • DELETE /v1/models/:id — unload a model
  • POST /v1/chat/completions — chat completions (blocking + SSE streaming)
  • POST /v1/embeddings — text embeddings (single + batch)
  • POST /v1/audio/transcriptions — audio transcription

GET /v1/models

List all loaded models.

curl http://localhost:11434/v1/models

Response:

{
  "object": "list",
  "data": [
    { "id": "my-llm", "object": "model", "created": 1718000000, "owned_by": "qvac" }
  ]
}

GET /v1/models/:id

Get details of a specific loaded model.

curl http://localhost:11434/v1/models/my-llm

DELETE /v1/models/:id

Unload a model, releasing its resources.

curl -X DELETE http://localhost:11434/v1/models/my-llm

Response:

{ "id": "my-llm", "object": "model", "deleted": true }

POST /v1/chat/completions

Generate a chat completion. Supports both blocking and streaming (SSE) modes, tool/function calling, and per-request generation parameters.

Blocking request:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request (server-sent events):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "stream": true
  }'
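In streaming mode the response body is a series of SSE events: each event line is "data: <JSON chunk>" and the stream ends with "data: [DONE]", with the text increments in choices[0].delta.content (the standard OpenAI streaming format). A minimal parser sketch:

```typescript
// Collect the full reply text from an SSE response body.
function collectStreamText(sseBody: string): string {
  let text = "";
  for (const line of sseBody.split("\n")) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length).trim();
    if (payload === "[DONE]") break; // end-of-stream sentinel
    const chunk = JSON.parse(payload);
    text += chunk.choices?.[0]?.delta?.content ?? "";
  }
  return text;
}
```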

Tool calling:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-llm",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
          "type": "object",
          "properties": { "location": { "type": "string" } },
          "required": ["location"]
        }
      }
    }]
  }'
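When the model decides to call a tool, the response contains tool_calls with JSON-encoded arguments; the client runs the matching local function and sends the result back as a "tool" role message in a follow-up request. A sketch of that round trip, using a stand-in get_weather implementation for the hypothetical tool declared above:

```typescript
// One tool call as returned inside choices[0].message.tool_calls.
interface ToolCall {
  id: string;
  function: { name: string; arguments: string };
}

// Local implementations, keyed by tool name (stand-ins for illustration).
const localTools: Record<string, (args: any) => string> = {
  get_weather: ({ location }) => `Sunny in ${location}`,
};

// Execute a tool call and build the "tool" message to append to the
// conversation before the follow-up /v1/chat/completions request.
function runToolCall(call: ToolCall) {
  const args = JSON.parse(call.function.arguments);
  const result = localTools[call.function.name](args);
  return { role: "tool", tool_call_id: call.id, content: result };
}
```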

Generation parameters

The following OpenAI parameters are forwarded to the model on each request:

OpenAI parameter        SDK parameter       Description
temperature             temp                Sampling temperature
max_tokens              predict             Maximum tokens to generate
max_completion_tokens   predict             Alias for max_tokens
top_p                   top_p               Nucleus sampling threshold
seed                    seed                Random seed for deterministic output
frequency_penalty       frequency_penalty   Penalize frequent tokens
presence_penalty        presence_penalty    Penalize already-present tokens

Unsupported parameters

The following OpenAI parameters are accepted but ignored (a warning is logged): n, logprobs, response_format, stop, top_logprobs, logit_bias, parallel_tool_calls, stream_options.
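The forwarding and ignore rules above can be sketched as a small mapper. This is an illustrative approximation of the documented behavior, not the server's code:

```typescript
// OpenAI parameter name -> SDK parameter name (from the table above).
const PARAM_MAP: Record<string, string> = {
  temperature: "temp",
  max_tokens: "predict",
  max_completion_tokens: "predict",
  top_p: "top_p",
  seed: "seed",
  frequency_penalty: "frequency_penalty",
  presence_penalty: "presence_penalty",
};

// Parameters accepted but ignored (a warning is logged for each).
const IGNORED = new Set([
  "n", "logprobs", "response_format", "stop",
  "top_logprobs", "logit_bias", "parallel_tool_calls", "stream_options",
]);

// Split a request body into forwarded SDK params and warnings.
function mapParams(body: Record<string, unknown>) {
  const sdk: Record<string, unknown> = {};
  const warnings: string[] = [];
  for (const [key, value] of Object.entries(body)) {
    if (key in PARAM_MAP) sdk[PARAM_MAP[key]] = value;
    else if (IGNORED.has(key)) warnings.push(`ignoring unsupported parameter: ${key}`);
  }
  return { sdk, warnings };
}
```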

POST /v1/embeddings

Generate text embeddings. Accepts a single string or a batch of strings.

Single input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": "The quick brown fox"
  }'

Batch input:

curl http://localhost:11434/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-embed",
    "input": ["First sentence", "Second sentence"]
  }'

Response:

{
  "object": "list",
  "data": [
    { "object": "embedding", "index": 0, "embedding": [0.012, -0.034, ...] }
  ],
  "model": "my-embed",
  "usage": { "prompt_tokens": 0, "total_tokens": 0 }
}

encoding_format (only float is supported) and dimensions are accepted but ignored.
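A common use of the batch endpoint is comparing the returned vectors by cosine similarity. A minimal helper:

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Apply it to the `embedding` arrays in `data` from a batch response to rank inputs by semantic closeness.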

POST /v1/audio/transcriptions

Transcribe audio using Whisper or Parakeet models. This endpoint uses multipart/form-data (not JSON). Maximum file size: 25 MB.

To receive a JSON response (the default):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=json"

Response:

{ "text": "transcribed text here" }

To receive a plain-text response:

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "response_format=text"

With a prompt (Whisper forwards it as initial_prompt):

curl http://localhost:11434/v1/audio/transcriptions \
  -F "file=@audio.wav" \
  -F "model=whisper" \
  -F "prompt=President Kennedy speech about space exploration"

Parameters

Parameter         Description                               Required
file              Audio file to transcribe.                 Yes
model             Model alias (must be in config).          Yes
response_format   json (default) or text.                   No
prompt            Optional prompt forwarded to the model.   No

Unsupported response_format values (srt, vtt, verbose_json) return a 400 error.

language and temperature are accepted but are currently configurable only at model load time (via the serve.models config), not per request. A warning is logged when they are sent.

Authentication

By default, the server accepts unauthenticated requests on 127.0.0.1. To require a Bearer token, run the server with the --api-key flag:

qvac serve openai --api-key my-secret-token

Clients must then include the token in the Authorization header:

curl http://localhost:11434/v1/models \
  -H "Authorization: Bearer my-secret-token"

Requests without a valid token receive a 401 response.
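In client code, the token can be attached with a small helper (illustrative; a hand-rolled init type is used here instead of the DOM's RequestInit to keep the sketch self-contained):

```typescript
// Minimal shape of fetch options used by this sketch.
interface FetchInit {
  method?: string;
  headers?: Record<string, string>;
  body?: string;
}

// Return a copy of the options with the Bearer token attached.
function withAuth(init: FetchInit, apiKey: string): FetchInit {
  return {
    ...init,
    headers: { ...(init.headers ?? {}), Authorization: `Bearer ${apiKey}` },
  };
}

// Usage:
// fetch("http://localhost:11434/v1/models", withAuth({ method: "GET" }, "my-secret-token"));
```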
