Summary
This report benchmarks GPU options for deploying Scope's real-time video diffusion inference pipelines. We evaluate performance, memory fit (OOM risk), and cost trade-offs across multiple resolutions and four pipelines: reward-forcing, longlive, streamdiffusionv2, and krea-realtime-video.
Key Takeaways
- H200 SXM is the fastest in every measured case where it runs, typically delivering 5–15% higher FPS than H100 SXM and materially higher FPS than RTX 5090.
- H100 SXM is the best “default”: it supports high resolutions reliably (large VRAM), performs close to H200 SXM, and avoids the OOM failures that hit the RTX 5090 at lower resolution thresholds.
- RTX 5090 offers strong value when workloads fit in 32 GB VRAM, but it hits memory limits at higher resolutions and on memory-heavy pipelines (notably krea-realtime-video).
- Provider choice can matter at low resolutions (underutilized GPU): RTX 5090 on TensorDock was 10–30% faster than on RunPod in several low-res tests, while results converge at high resolutions where GPUs saturate.
- Based on current results, reward-forcing is the strongest-performing pipeline, delivering the highest FPS across GPUs and resolutions in this benchmark set.
Next Steps to Optimize Performance
- Accelerate inference: evaluate TensorRT, torch.compile, and quantization (FP16/INT8 where feasible) to reduce latency and boost throughput.
- Lower-res + upscale: run diffusion at a smaller base resolution, then apply an upscaler to reach the target output size—balancing quality against cost.
- Increase GPU packing: at low resolutions, GPU utilization often falls below 90% with free VRAM remaining. Test running multiple concurrent streams per GPU to maximize throughput per dollar.
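To make the packing idea above concrete, the sketch below estimates aggregate FPS per dollar-hour as a function of concurrent streams per GPU. The contention factor, prices, and FPS figures are illustrative assumptions for the sketch, not benchmark data; the actual degradation curve should be measured empirically per pipeline.

```python
def fps_per_dollar(streams: int, base_fps: float, hourly_cost: float,
                   contention_factor: float = 0.9) -> float:
    """Estimate aggregate FPS per dollar-hour when packing `streams`
    concurrent streams on one GPU.

    `contention_factor` models the per-stream slowdown from shared
    compute/memory-bandwidth contention (an illustrative assumption;
    measure the real curve per pipeline and resolution).
    """
    per_stream_fps = base_fps * (contention_factor ** (streams - 1))
    return streams * per_stream_fps / hourly_cost

# Illustrative comparison (placeholder price and FPS, not measured values):
single = fps_per_dollar(1, base_fps=30.0, hourly_cost=0.80)
packed = fps_per_dollar(3, base_fps=30.0, hourly_cost=0.80)
# Under these assumptions, packing 3 streams improves throughput per dollar
# as long as per-stream FPS stays above the real-time floor you require.
```

The useful property of this model is that packing wins whenever the contention penalty per added stream is smaller than the linear gain from the extra stream, which is exactly the regime low-resolution workloads sit in when GPU utilization is below 90%.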
Which GPU should you choose?
Use this as the default selection logic; detailed evidence appears in later sections.
| Situation | Recommended default | Why |
| --- | --- | --- |
| Lowest cost for workloads that fit (no OOM) | RTX 5090 | Best economics when VRAM is sufficient |
| High resolution and/or memory-heavy pipelines | H100 SXM | Large VRAM headroom and strong throughput |
| Maximum throughput / lowest latency, cost secondary | H200 SXM | Highest FPS (usually 5–15% above H100 SXM) |
| Need ultra-high resolution where the 5090 OOMs, but want cheaper than Hopper | RTX A6000 Ada (situational) | More VRAM than the 5090; slower, but can run cases where the 5090 fails (can skip for the current pipelines) |
Benchmark
Methodology
Pipelines tested: