Batch Prompting
===============

Overview
--------

Batch prompting is a performance optimization that dramatically reduces HTTP
request overhead by sending multiple prompts to each agent in a single request.
This feature uses the OpenAI Python client's completions API, which supports
sending a list of prompts that are processed together.

Performance Impact
------------------

Request Reduction
~~~~~~~~~~~~~~~~~

**Without batching:** 10,000 prompts with 100 agents = 10,000 HTTP requests
(100 per agent)

**With batching:** 10,000 prompts with 100 agents = 100 HTTP requests
(1 per agent)

**Result: 100× reduction in HTTP requests**

Measured Performance
~~~~~~~~~~~~~~~~~~~~

Testing with 20 prompts on 4 vLLM endpoints showed:

- **Batch mode:** 2.76s (7.24 prompts/sec) using 4 HTTP requests
- **Non-batch mode:** 3.09s (6.46 prompts/sec) using 20 HTTP requests
- **Improvement:** 12% higher throughput, 80% fewer requests

At scale (1,000 prompts, 4 agents), batch mode uses only 4 HTTP requests versus
1,000 for non-batch mode.

Usage
-----

Basic Example
~~~~~~~~~~~~~

Batch prompting is enabled by default in :class:`VLLMPool`:

.. code-block:: python

    from aurora_swarm import VLLMPool, parse_hostfile
    from aurora_swarm.patterns.scatter_gather import scatter_gather

    # Load endpoints
    endpoints = parse_hostfile("agents.hostfile")

    # Create pool with batch mode enabled (default)
    async with VLLMPool(
        endpoints,
        model="openai/gpt-oss-120b",
        max_tokens=1024,
        use_batch=True,  # Default, can omit
    ) as pool:
        # Generate many prompts
        prompts = [f"Analyze gene {i}" for i in range(10000)]

        # scatter_gather automatically uses the batch API
        responses = await scatter_gather(pool, prompts)

Disable Batching
~~~~~~~~~~~~~~~~

For debugging or compatibility, you can disable batching:

.. code-block:: python

    pool = VLLMPool(
        endpoints,
        model="openai/gpt-oss-120b",
        use_batch=False,  # Fall back to individual requests
    )

Manual Batch Control
~~~~~~~~~~~~~~~~~~~~

You can also control batching manually:

.. code-block:: python

    # Send a batch to a specific agent
    prompts_for_agent_0 = ["prompt1", "prompt2", "prompt3"]
    responses = await pool.post_batch(0, prompts_for_agent_0)

    # Or use send_all_batched directly
    all_prompts = [f"task-{i}" for i in range(100)]
    responses = await pool.send_all_batched(all_prompts, max_tokens=512)

Implementation Details
----------------------

API Endpoints
~~~~~~~~~~~~~

**Batch mode** uses the ``/v1/completions`` endpoint:

- Accepts ``prompt`` as ``str`` or ``list[str]``
- Returns one choice per prompt
- Prompts are sent as raw text

**Non-batch mode** uses the ``/v1/chat/completions`` endpoint:

- Wraps each prompt in ``{"role": "user", "content": prompt}``
- vLLM handles chat template formatting
- One message per request

Architecture
~~~~~~~~~~~~

When you call ``scatter_gather(pool, prompts)``:

1. ``scatter_gather`` calls ``pool.send_all_batched(prompts)``
2. Prompts are grouped by target agent using round-robin assignment
   (``i % pool.size``)
3. For each agent, ``post_batch(agent_idx, prompts_for_agent)`` sends one request
4. ``AsyncOpenAI.completions.create(prompt=list_of_prompts)`` processes the batch
5. Responses are reassembled in the original input order
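The grouping in step 2 and the reordering in step 5 can be pictured with a short
sketch. The following is illustrative only, not the library's implementation; it
assumes a pool object exposing the ``size`` attribute and the ``post_batch``
coroutine documented under Key Methods below.

.. code-block:: python

    import asyncio


    async def send_all_batched_sketch(pool, prompts):
        """Illustrative round-robin batching (not the actual implementation)."""
        # Step 2: group prompt indices by target agent (prompt i -> agent i % pool.size)
        groups = {}
        for i in range(len(prompts)):
            groups.setdefault(i % pool.size, []).append(i)

        # Steps 3-4: one batched request per agent, issued concurrently
        async def run(agent_idx, indices):
            batch = [prompts[i] for i in indices]
            return indices, await pool.post_batch(agent_idx, batch)

        results = await asyncio.gather(*(run(a, idx) for a, idx in groups.items()))

        # Step 5: reassemble responses in the original input order
        ordered = [None] * len(prompts)
        for indices, responses in results:
            for i, resp in zip(indices, responses):
                ordered[i] = resp
        return ordered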
Key Methods
~~~~~~~~~~~

.. py:method:: VLLMPool.post_batch(agent_index, prompts, max_tokens=None)
    :async:

    Send multiple prompts to one agent in a single request.

    :param int agent_index: Index of the agent to send the prompts to
    :param list[str] prompts: List of prompts to send in one batch
    :param int max_tokens: Optional override for max tokens
    :return: List of Response objects, one per prompt
    :rtype: list[Response]

.. py:method:: VLLMPool.send_all_batched(prompts, max_tokens=None)
    :async:

    Distribute prompts across all agents with batching. Automatically groups
    prompts by target agent and sends one batched request per agent. Returns
    responses in the same order as the input prompts.

    :param list[str] prompts: List of prompts to send
    :param int max_tokens: Optional override for max tokens
    :return: Responses in input order
    :rtype: list[Response]

Pattern Integration
~~~~~~~~~~~~~~~~~~~

The following patterns automatically use batching when available:

- ``scatter_gather()`` - Distributes prompts with batching
- ``map_gather()`` - Uses ``scatter_gather`` internally

**Note:** ``tree_reduce()`` uses ``send_all()`` (chat completions) instead of
batching for both the leaf and supervisor phases. This ensures compatibility
with instruction-tuned models that expect chat-formatted prompts.

Backward Compatibility
----------------------

The batch prompting feature is fully backward compatible:

✓ All existing tests pass
✓ ``AgentPool.send_all()`` unchanged
✓ Non-VLLMPool usage unchanged
✓ ``post()`` method unchanged (uses chat completions)
✓ Batching can be disabled with ``use_batch=False``
✓ Pattern APIs unchanged (batching is transparent)

Chat Template Handling
----------------------

The completions API sends prompts as raw text. For instruction-tuned models
that expect specific chat formatting, you may need to format prompts yourself
before sending:

.. code-block:: python

    # Example: apply a chat template manually (the tags are model-specific)
    prompt = "<|user|>\nAnalyze gene ABC123\n<|assistant|>\n"
    prompts = [prompt]  # Send the formatted prompt

This ensures the model receives the expected format even when using the
completions endpoint.

Troubleshooting
---------------

Connection Errors
~~~~~~~~~~~~~~~~~

If batch requests fail with connection errors, verify that your vLLM endpoints
support ``/v1/completions``:

.. code-block:: bash

    curl -X POST http://hostname:port/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"model-name","prompt":"test","max_tokens":10}'

Empty Responses
~~~~~~~~~~~~~~~

If responses come back empty, the model may need:

- More specific prompts
- A higher ``max_tokens`` value
- Different prompt formatting (e.g., a chat template)

.. code-block:: python

    responses = await pool.send_all_batched(prompts, max_tokens=2048)

Timeout Errors
~~~~~~~~~~~~~~

For large batches, increase the timeout:

.. code-block:: python

    pool = VLLMPool(endpoints, timeout=600.0, use_batch=True)

Testing
-------

Unit tests with mock endpoints:

.. code-block:: bash

    pytest tests/test_vllm_pool.py -v

Integration tests with real vLLM endpoints:

.. code-block:: bash

    pytest tests/integration/ --hostfile=/path/to/hostfile -v

Performance comparison (a simplified sketch of such a comparison appears at the
end of this page):

.. code-block:: bash

    python test_batch_integration.py /path/to/hostfile

See Also
--------

- :doc:`api` - Full API reference, including ``VLLMPool``
- :doc:`context_length` - Context length configuration
- ``BATCH_PROMPTING.md`` - Detailed implementation documentation
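Timing Sketch
-------------

The performance comparison referenced under Testing is handled by
``test_batch_integration.py`` in the repository. As a rough illustration of what
such a comparison involves, the sketch below times batch versus non-batch mode on
the same prompts using only the ``VLLMPool`` and ``scatter_gather`` APIs shown
earlier on this page. It is a sketch, not the actual script, and the model name
is a placeholder.

.. code-block:: python

    import asyncio
    import sys
    import time

    from aurora_swarm import VLLMPool, parse_hostfile
    from aurora_swarm.patterns.scatter_gather import scatter_gather


    async def time_mode(endpoints, prompts, use_batch):
        # Time one full scatter/gather pass in the given mode
        async with VLLMPool(
            endpoints,
            model="openai/gpt-oss-120b",  # placeholder; use your model
            max_tokens=128,
            use_batch=use_batch,
        ) as pool:
            start = time.perf_counter()
            await scatter_gather(pool, prompts)
            return time.perf_counter() - start


    async def main(hostfile):
        endpoints = parse_hostfile(hostfile)
        prompts = [f"Summarize record {i}" for i in range(20)]
        for use_batch in (True, False):
            elapsed = await time_mode(endpoints, prompts, use_batch)
            label = "batch" if use_batch else "non-batch"
            print(f"{label}: {elapsed:.2f}s ({len(prompts) / elapsed:.2f} prompts/sec)")


    if __name__ == "__main__":
        asyncio.run(main(sys.argv[1]))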