Context length configuration

Aurora Swarm supports both config-based and dynamic context length management when using VLLMPool.

Config-based

Set default and aggregation token limits via environment variables or constructor parameters:

Environment variables:

  • AURORA_SWARM_MAX_TOKENS — default cap on generated tokens per request (e.g. 512)

  • AURORA_SWARM_MAX_TOKENS_AGGREGATION — cap for aggregation/reduce steps (e.g. 2048)

  • AURORA_SWARM_MODEL_MAX_CONTEXT — model’s max context length (optional; if unset, fetched from vLLM /v1/models)
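
For example, the variables can be set in-process before creating the pool (exporting the same names in a shell works equally well); a minimal sketch, assuming they are read when VLLMPool is constructed:

import os

# Set before constructing VLLMPool so the pool picks them up.
os.environ["AURORA_SWARM_MAX_TOKENS"] = "512"
os.environ["AURORA_SWARM_MAX_TOKENS_AGGREGATION"] = "2048"
os.environ["AURORA_SWARM_MODEL_MAX_CONTEXT"] = "131072"  # optional; else fetched from /v1/models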

Constructor parameters (take priority over environment variables):

from aurora_swarm import VLLMPool, AgentEndpoint

pool = VLLMPool(
    endpoints=[AgentEndpoint("host1", 8000)],
    model="openai/gpt-oss-120b",
    max_tokens=1024,              # default cap on generated tokens per request
    max_tokens_aggregation=2048,  # cap for aggregation/reduce steps
    model_max_context=131072,     # optional; skips the /v1/models query
    buffer=512,                   # headroom reserved below the model limit
)

Per-request override: pass max_tokens to post() for a single call.
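
A minimal example; the positional prompt argument and the returned value are assumptions, as only the max_tokens keyword is documented here:

response = pool.post(
    "Summarize the incident report.",
    max_tokens=256,  # applies to this call only; configured defaults are unchanged
)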

Dynamic sizing

When max_tokens is not passed to post(), the pool computes a safe value:

  • Fetches the model’s max context from vLLM /v1/models (cached)

  • Estimates prompt tokens (e.g. len(prompt) // 4)

  • Uses min(configured_cap, model_max - prompt_est - buffer) so responses are not truncated (see the sketch after this list)
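
A minimal sketch of that computation; the helper names (estimate_tokens, compute_max_tokens) are illustrative, not part of the Aurora Swarm API:

def estimate_tokens(prompt: str) -> int:
    # Rough heuristic from above: ~4 characters per token.
    return len(prompt) // 4

def compute_max_tokens(prompt: str, configured_cap: int,
                       model_max: int, buffer: int) -> int:
    # Reserve room for the prompt plus a safety buffer, then never
    # exceed the configured cap.
    available = model_max - estimate_tokens(prompt) - buffer
    return min(configured_cap, available)

# A 4,000-character prompt against a 131,072-token context with buffer=512:
# estimate_tokens -> 1000; 131072 - 1000 - 512 = 129560; min(1024, 129560) = 1024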

Aggregation steps in patterns (e.g. the reduce step of broadcast_and_reduce(), pipeline stages 2 and later) use max_tokens_aggregation when the pool provides it.
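
For instance, a reduce step might look like the following; the import path and call shape are assumptions, since only the function name and the max_tokens_aggregation behavior appear above:

from aurora_swarm import broadcast_and_reduce  # import path is an assumption

# Hypothetical call shape: fan one prompt out to every endpoint, then merge
# the answers in a single reduce call. That reduce call uses the pool's
# max_tokens_aggregation (2048 in the example above) rather than max_tokens.
summary = broadcast_and_reduce(
    pool,
    prompt="List failure modes of distributed consensus.",
    reduce_prompt="Merge these answers into one ranked summary:",
)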

Full details, examples, and reasoning-model notes are in the repo: CONTEXT_LENGTH.md.