Process thousands of prompts through an LLM in a single session. Deploy a GPU, run vLLM in offline batch mode, collect results, tear down. Pay only for the compute hours you use.

1. Prepare your input file

Create a JSONL file with one prompt per line:
{"prompt": "Summarize this review: 'Great product, fast shipping.'", "id": "r001"}
{"prompt": "Summarize this review: 'Arrived broken. Returning.'", "id": "r002"}
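If the prompts come from existing data, the file can be generated rather than written by hand. The sketch below assumes a plain list of review strings; `write_prompts` and `validate_prompts` are hypothetical helper names, not part of any Runcrate or vLLM tooling.

```python
import json


def write_prompts(reviews, path):
    """Write one JSON object per line with a sequential id and a prompt."""
    with open(path, "w", encoding="utf-8") as f:
        for i, review in enumerate(reviews, start=1):
            record = {
                "prompt": f"Summarize this review: '{review}'",
                "id": f"r{i:03d}",
            }
            f.write(json.dumps(record) + "\n")


def validate_prompts(path):
    """Check that every line parses, has a string prompt, and has a unique id.

    Returns the number of valid records; raises ValueError on the first problem.
    """
    seen = set()
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            record = json.loads(line)
            if not isinstance(record.get("prompt"), str) or "id" not in record:
                raise ValueError(f"line {lineno}: missing 'prompt' or 'id'")
            if record["id"] in seen:
                raise ValueError(f"line {lineno}: duplicate id {record['id']}")
            seen.add(record["id"])
    return len(seen)


write_prompts(
    ["Great product, fast shipping.", "Arrived broken. Returning."],
    "prompts.jsonl",
)
print(validate_prompts("prompts.jsonl"), "prompts written")
```

Validating locally before uploading avoids paying for GPU time on a job that fails on a malformed line.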

2. Deploy and upload

runcrate instances create --name batch-job --gpu H100 --template ubuntu-inference
runcrate instances status batch-job

runcrate cp ./prompts.jsonl batch-job:/workspace/prompts.jsonl

3. Install vLLM and upload the batch script

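runcrate ssh batch-job -- "pip install vllm"

Save the following as batch_infer.py on your local machine:

# batch_infer.py
import json
from vllm import LLM, SamplingParams

# Llama checkpoints on Hugging Face are gated, so the instance needs
# credentials for an account that has been granted access.
# A 70B model in 16-bit precision needs roughly 140 GB for weights alone,
# which does not fit on one 80 GB GPU. Use a quantized checkpoint, or
# several GPUs with tensor_parallel_size set to the GPU count.
llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", max_model_len=4096)
# Low temperature keeps output near-deterministic; responses are capped at 256 tokens.
params = SamplingParams(max_tokens=256, temperature=0.1)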

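# Read one JSON object per line, skipping blank lines.
with open("/workspace/prompts.jsonl") as f:
    items = [json.loads(line) for line in f if line.strip()]

# generate() batches internally and returns outputs in the same order as prompts.
prompts = [item["prompt"] for item in items]
outputs = llm.generate(prompts, params)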

with open("/workspace/results.jsonl", "w") as f:
    for item, output in zip(items, outputs):
        f.write(json.dumps({
            "id": item["id"],
            "response": output.outputs[0].text,
        }) + "\n")

print(f"Processed {len(items)} prompts.")

Upload the script and run it:

runcrate cp ./batch_infer.py batch-job:/workspace/batch_infer.py
runcrate ssh batch-job -- "cd /workspace && python batch_infer.py"

4. Download results and tear down

runcrate cp batch-job:/workspace/results.jsonl ./results.jsonl
runcrate instances delete batch-job
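Because the instance is gone once it is deleted, it may be worth confirming that the downloaded file is complete before running the delete command. The following is a minimal sketch that compares ids in the two local files; `load_ids` and `missing_ids` are hypothetical helper names.

```python
import json


def load_ids(path):
    """Return the 'id' field of every non-blank JSONL line, in file order."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["id"] for line in f if line.strip()]


def missing_ids(prompts_path, results_path):
    """Return prompt ids that have no matching entry in the results file."""
    done = set(load_ids(results_path))
    return [pid for pid in load_ids(prompts_path) if pid not in done]


# Example usage after downloading results.jsonl:
#   gaps = missing_ids("prompts.jsonl", "results.jsonl")
#   print("complete" if not gaps else f"{len(gaps)} prompts missing: {gaps[:5]}")
```

An empty list means every prompt has a response and the instance can be deleted.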

Monitor progress

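Run these from a second terminal while the job is still running, before the instance is deleted. The first shows GPU utilization and memory use; the second counts completed results. The script above writes results.jsonl only after generation finishes, so the count appears once the whole batch is done.

runcrate ssh batch-job -- nvidia-smi
runcrate ssh batch-job -- "wc -l /workspace/results.jsonl"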

Cost estimate

Prompts    Model        GPU           Time       Approx. cost
10,000     Llama 8B     RTX 4090      ~15 min    ~$0.09
10,000     Llama 70B    A100 80 GB    ~30 min    ~$0.80
100,000    Llama 70B    H100 80 GB    ~2 hrs     ~$5.00
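The figures above follow the usual relation cost = hours × hourly rate. As a rough sketch, the hourly rates implied by each row can be backed out and reused for other run lengths. These are rates derived from the table, not quoted prices, and the function names are illustrative.

```python
def implied_hourly_rate(cost_usd: float, minutes: float) -> float:
    """Hourly rate implied by a (cost, duration) pair from the table."""
    return cost_usd / (minutes / 60)


def estimate_cost(minutes: float, hourly_rate_usd: float) -> float:
    """Cost of a run of the given length at the given hourly rate."""
    return (minutes / 60) * hourly_rate_usd


# Rows from the table above: (GPU, minutes, approximate cost in USD).
TABLE = [
    ("RTX 4090", 15, 0.09),
    ("A100 80 GB", 30, 0.80),
    ("H100 80 GB", 120, 5.00),
]

for gpu, minutes, cost in TABLE:
    print(f"{gpu}: ~${implied_hourly_rate(cost, minutes):.2f}/hr")
```

Actual run time depends on prompt length, output length, and model size, so treat any estimate built this way as approximate.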

Tips

  • vLLM's offline mode schedules prompts with continuous batching, which gives much higher throughput than sending sequential API calls.
  • For 100K+ prompts, split the file and process it in chunks to avoid running out of memory (OOM).
  • Check your balance before starting: runcrate billing balance.
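The chunking tip above can be sketched as follows. This is one possible approach, not the only one: `read_chunks` and `run_in_chunks` are hypothetical helpers, the chunk size of 5000 is an arbitrary starting point, and `generate_fn` stands in for whatever turns a list of prompts into a list of response strings. Results are appended after each chunk, so a line count of the results file reflects progress during the run.

```python
import json
from itertools import islice


def read_chunks(path, chunk_size):
    """Yield lists of parsed JSONL records, at most chunk_size per list."""
    with open(path, encoding="utf-8") as f:
        records = (json.loads(line) for line in f if line.strip())
        while True:
            chunk = list(islice(records, chunk_size))
            if not chunk:
                return
            yield chunk


def run_in_chunks(prompts_path, results_path, generate_fn, chunk_size=5000):
    """Process prompts chunk by chunk, appending results as each chunk finishes.

    generate_fn takes a list of prompt strings and returns a list of
    response strings in the same order. Returns the number of prompts processed.
    """
    total = 0
    with open(results_path, "a", encoding="utf-8") as out:
        for chunk in read_chunks(prompts_path, chunk_size):
            texts = generate_fn([item["prompt"] for item in chunk])
            for item, text in zip(chunk, texts):
                out.write(json.dumps({"id": item["id"], "response": text}) + "\n")
            out.flush()  # make finished chunks visible to `wc -l`
            total += len(chunk)
    return total


# With vLLM (llm and params as in batch_infer.py), generate_fn could be:
#   lambda prompts: [o.outputs[0].text for o in llm.generate(prompts, params)]
```

Because the results file is opened in append mode, remove or rename any results file from an earlier run before starting a new one.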