h200 with wide-ep

(blog.vllm.ai)

84 points | by robertnishihara 14 hours ago

5 comments

snakepit 1 hour ago
Still have to update it for snakepit 0.11.0, but I did start a vLLM wrapper for Elixir
https://hex.pm/packages/vllm
kingstnap 3 hours ago
Impressive performance work. It's interesting that you can still see 40+% perf gains like this.
Makes you think that you will continue to see the costs for a fixed level of "intelligence" dropping.
- whoevercares 3 hours ago
  Absolutely. LLM inference is still a greenfield — things like overlap scheduling and JIT CUDA kernels are very recent. We’re just getting started optimizing for modern LLM architectures, so cost/perf will keep improving fast.
androiddrew 3 hours ago
Now all we need is better support for AMD gpus, both CDNA and RDNA types
- mappu 2 hours ago
  ZLUDA implements CUDA on top of AMD ROCm - they are explicitly targeting vLLM as their PyTorch compatibility test: https://vosen.github.io/ZLUDA/blog/zluda-update-q4-2025/#pyt...
  (PyTorch also supports ROCm generally; a ROCm GPU shows up as a CUDA device.)
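To illustrate the point about ROCm showing up as a CUDA device: ROCm builds of PyTorch reuse the `torch.cuda` namespace, and `torch.version.hip` is a version string on those builds and `None` otherwise. Below is a minimal sketch of telling the two apart. The names `describe_gpu_backend` and `fake_torch` are hypothetical helpers made up for this example, and `fake_torch` is only a stand-in so the logic can be tried without a GPU.

```python
from types import SimpleNamespace


def describe_gpu_backend(torch_module):
    """Return 'rocm', 'cuda', or 'none' for a torch-like module.

    Assumes the module exposes cuda.is_available() and version.hip,
    as the real torch module does.
    """
    if not torch_module.cuda.is_available():
        return "none"
    # On ROCm builds torch.version.hip is a version string; on CUDA builds it is None.
    if getattr(torch_module.version, "hip", None):
        return "rocm"
    return "cuda"


def fake_torch(available, hip_version):
    """Hypothetical stand-in for the torch module, for trying this without a GPU."""
    return SimpleNamespace(
        cuda=SimpleNamespace(is_available=lambda: available),
        version=SimpleNamespace(hip=hip_version),
    )
```

With PyTorch installed, `import torch; describe_gpu_backend(torch)` runs the same check against the real module.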
danielhanchen 3 hours ago
Love vLLM!
vessenes 3 hours ago
As a user of a lot of coding tokens I’m most interested in latency - these numbers are presumably for heavily batched workloads. I dearly wish Claude had a Cerebras endpoint.
I’m sure I’d use more tokens because I’d get more revs, but I don’t think token usage would increase linearly with speed: I need time to think about what I want to do and what’s happened or is proposed. But I feel like I would be able to stay in flow state if the responses were faster, and that’s super appealing.