Batch Prompting

Overview

Batch prompting is a performance optimization that dramatically reduces HTTP request overhead by sending multiple prompts to each agent in a single request. This feature uses the OpenAI Python client’s completions API, which supports sending a list of prompts that are processed together.

Performance Impact

Request Reduction

Without batching:

10,000 prompts with 100 agents = 10,000 HTTP requests (100 per agent)

With batching:

10,000 prompts with 100 agents = 100 HTTP requests (1 per agent)

Result: 100× reduction in HTTP requests
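
As a quick sanity check, the request counts above follow directly from one request per prompt versus one request per agent; the numbers below are the same illustrative figures used in this section, not measurements:

n_prompts = 10_000
n_agents = 100

requests_without_batching = n_prompts   # one HTTP request per prompt
requests_with_batching = n_agents       # one batched HTTP request per agent

print(requests_without_batching / requests_with_batching)  # 100.0 (100x reduction)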

Measured Performance

Testing with 20 prompts on 4 vLLM endpoints showed:

  • Batch mode: 2.76s (7.24 prompts/sec) using 4 HTTP requests

  • Non-batch mode: 3.09s (6.46 prompts/sec) using 20 HTTP requests

  • Improvement: 12% faster, 80% fewer requests

At scale (1,000 prompts, 4 agents), batch mode uses only 4 HTTP requests versus 1,000 for non-batch mode.
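
A comparison like this can be reproduced with a short script. The sketch below reuses the VLLMPool and scatter_gather API documented on this page; the hostfile path, model name, prompt set, and token budget are placeholders, and actual numbers depend on your endpoints:

import asyncio
import time

from aurora_swarm import VLLMPool, parse_hostfile
from aurora_swarm.patterns.scatter_gather import scatter_gather


async def time_mode(use_batch: bool) -> float:
    endpoints = parse_hostfile("agents.hostfile")  # placeholder hostfile
    prompts = [f"Summarize record {i}" for i in range(20)]
    async with VLLMPool(
        endpoints,
        model="openai/gpt-oss-120b",
        max_tokens=256,
        use_batch=use_batch,
    ) as pool:
        start = time.perf_counter()
        await scatter_gather(pool, prompts)
        return time.perf_counter() - start


async def main():
    for use_batch in (True, False):
        elapsed = await time_mode(use_batch)
        print(f"use_batch={use_batch}: {elapsed:.2f}s")


asyncio.run(main())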

Usage

Basic Example

Batch prompting is enabled by default in VLLMPool:

from aurora_swarm import VLLMPool, parse_hostfile
from aurora_swarm.patterns.scatter_gather import scatter_gather

# Load endpoints
endpoints = parse_hostfile("agents.hostfile")

# Create pool with batch mode enabled (default)
async with VLLMPool(
    endpoints,
    model="openai/gpt-oss-120b",
    max_tokens=1024,
    use_batch=True,  # Default, can omit
) as pool:
    # Generate many prompts
    prompts = [f"Analyze gene {i}" for i in range(10000)]

    # scatter_gather automatically uses batch API
    responses = await scatter_gather(pool, prompts)

Disable Batching

For debugging or compatibility, you can disable batching:

pool = VLLMPool(
    endpoints,
    model="openai/gpt-oss-120b",
    use_batch=False,  # Falls back to individual requests
)

Manual Batch Control

You can also manually control batching:

# Send batch to specific agent
prompts_for_agent_0 = ["prompt1", "prompt2", "prompt3"]
responses = await pool.post_batch(0, prompts_for_agent_0)

# Or use send_all_batched directly
all_prompts = [f"task-{i}" for i in range(100)]
responses = await pool.send_all_batched(all_prompts, max_tokens=512)

Implementation Details

API Endpoints

Batch mode uses the /v1/completions endpoint:
  • Accepts prompt as str or list[str]

  • Returns one choice per prompt

  • Prompts sent as raw text

Non-batch mode uses the /v1/chat/completions endpoint:
  • Wraps prompts in {"role": "user", "content": prompt}

  • vLLM handles chat template formatting

  • One message per request
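
To make the difference concrete, here is how the two endpoints are typically called with the OpenAI Python client directly (the same client the pool uses under the hood). This is a minimal sketch against a single endpoint; the base URL, API key, and model name are placeholders:

import asyncio

from openai import AsyncOpenAI


async def main():
    client = AsyncOpenAI(base_url="http://hostname:port/v1", api_key="EMPTY")

    # Batch mode: one /v1/completions request carrying a list of raw prompts.
    completion = await client.completions.create(
        model="model-name",
        prompt=["Analyze gene ABC1", "Analyze gene ABC2"],
        max_tokens=64,
    )
    # One choice per prompt; choice.index maps each choice back to its prompt.
    texts = [c.text for c in sorted(completion.choices, key=lambda c: c.index)]

    # Non-batch mode: one /v1/chat/completions request per prompt,
    # with the prompt wrapped in a chat message.
    chat = await client.chat.completions.create(
        model="model-name",
        messages=[{"role": "user", "content": "Analyze gene ABC1"}],
        max_tokens=64,
    )
    print(texts, chat.choices[0].message.content)


asyncio.run(main())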

Architecture

When you call scatter_gather(pool, prompts), the following steps run (a simplified sketch follows the list):

  1. scatter_gather calls pool.send_all_batched(prompts)

  2. Prompts are grouped by target agent using round-robin (i % pool.size)

  3. For each agent, post_batch(agent_idx, prompts_for_agent) sends one request

  4. AsyncOpenAI.completions.create(prompt=list_of_prompts) processes the batch

  5. Responses are reconstructed in original input order
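
A simplified sketch of steps 2–5 (not the library's actual implementation) shows how prompts are grouped round-robin, one batched request is sent per agent, and responses are reassembled in input order; the concurrent fan-out via asyncio.gather is an assumption about how the per-agent requests are dispatched:

import asyncio


def group_round_robin(prompts, pool_size):
    # Step 2: group (original_index, prompt) pairs by target agent (i % pool_size).
    groups = {agent: [] for agent in range(pool_size)}
    for i, prompt in enumerate(prompts):
        groups[i % pool_size].append((i, prompt))
    return groups


async def send_all_batched_sketch(pool, prompts, max_tokens=None):
    groups = group_round_robin(prompts, pool.size)
    results = [None] * len(prompts)

    async def run_agent(agent_idx, items):
        # Steps 3-4: one HTTP request per agent, carrying all of its prompts.
        responses = await pool.post_batch(
            agent_idx, [p for _, p in items], max_tokens=max_tokens
        )
        # Step 5: put each response back at its prompt's original position.
        for (original_index, _), response in zip(items, responses):
            results[original_index] = response

    await asyncio.gather(
        *(run_agent(agent, items) for agent, items in groups.items() if items)
    )
    return results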

Key Methods

async VLLMPool.post_batch(agent_index, prompts, max_tokens=None)

Send multiple prompts to one agent in a single request.

Parameters:
  • agent_index (int) – Index of the agent to send prompts to

  • prompts (list[str]) – List of prompts to send in one batch

  • max_tokens (int) – Optional override for max tokens

Returns:

List of Response objects, one per prompt

Return type:

list[Response]
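
A hedged usage sketch, assuming pool is an open VLLMPool (the prompts are placeholders):

prompts = ["Summarize sample A", "Summarize sample B", "Summarize sample C"]
responses = await pool.post_batch(0, prompts, max_tokens=256)
assert len(responses) == len(prompts)  # one Response per prompt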

async VLLMPool.send_all_batched(prompts, max_tokens=None)

Distribute prompts across all agents with batching.

Automatically groups prompts by target agent and sends one batched request per agent. Returns responses in the same order as input prompts.

Parameters:
  • prompts (list[str]) – List of prompts to send

  • max_tokens (int) – Optional override for max tokens

Returns:

Responses in input order

Return type:

list[Response]

Pattern Integration

The following patterns automatically use batching when available:

  • scatter_gather() - Distributes prompts with batching

  • map_gather() - Uses scatter_gather internally

Note: tree_reduce() uses send_all() (chat completions) instead of batching for both leaf and supervisor phases. This ensures compatibility with instruction-tuned models that expect chat-formatted prompts.

Backward Compatibility

The batch prompting feature is fully backward compatible:

  • All existing tests pass

  • AgentPool.send_all() unchanged

  • Non-VLLMPool usage unchanged

  • post() method unchanged (uses chat completions)

  • Can disable batching with use_batch=False

  • Pattern APIs unchanged (batching is transparent)

Chat Template Handling

The completions API sends prompts as raw text. For instruction-tuned models that expect specific chat formatting, you may need to format prompts before sending:

# Example: Add chat template manually
prompt = "<|user|>\nAnalyze gene ABC123\n<|assistant|>\n"
prompts = [prompt]  # Send formatted prompt

This ensures the model receives the expected format even when using the completions endpoint.
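
Rather than hand-writing template tokens, one option is to apply the model's own chat template with the transformers tokenizer before sending. This is a sketch, assuming the tokenizer for the served model is available locally, that it defines a chat template, and that pool is an open VLLMPool:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-120b")

raw_prompts = [f"Analyze gene {i}" for i in range(100)]
formatted = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": p}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for p in raw_prompts
]

responses = await pool.send_all_batched(formatted)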

Troubleshooting

Connection Errors

If batch requests fail with connection errors, verify your vLLM endpoints support /v1/completions:

curl -X POST http://hostname:port/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"model-name","prompt":"test","max_tokens":10}'

Empty Responses

If responses are empty, the model may need:

  • More specific prompts

  • Higher max_tokens value

  • Different prompt formatting (e.g., chat template)

For example, raise the token budget:

responses = await pool.send_all_batched(prompts, max_tokens=2048)

Timeout Errors

For large batches, increase the timeout:

pool = VLLMPool(endpoints, timeout=600.0, use_batch=True)

Testing

Unit tests with mock endpoints:

pytest tests/test_vllm_pool.py -v

Integration tests with real vLLM endpoints:

pytest tests/integration/ --hostfile=/path/to/hostfile -v

Performance comparison:

python test_batch_integration.py /path/to/hostfile

See Also