piqc scans your Kubernetes cluster (read-only) and identifies which models are running on the wrong GPU tier, along with the cost attributed to each. It runs in a minute. I'd like to hear the community's experiences with and thoughts on our detection approach and its benefits.
Agreed. At high concurrency, you are better off spending the compute budget on parallel requests rather than on draft prediction. The challenging part is that most deployments don't have static traffic profiles. A configuration that was right at launch may no longer be correct months later, and there is no signal that tells you when you have crossed the threshold.
Great README. Genuinely one of the clearest walkthroughs of inference internals. The KV cache section is worth lingering on, as most OOM and throughput issues trace back to it and are normally difficult to reason about. Sequence length and batch size fill the cache in ways that show up under real traffic.
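For a rough sense of how the two interact, here is a back-of-the-envelope sketch of KV cache size. The model shape (32 layers, 8 KV heads, head dimension 128, fp16) is a hypothetical example rather than anything taken from the README, and the function name is mine. It also ignores paging, fragmentation, and any allocator overhead, so treat it as a lower bound:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading 2 accounts for one key tensor and one value tensor per layer.
    bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
    for seq_len, batch_size in [(2048, 8), (4096, 16), (8192, 32)]:
        gib = kv_cache_bytes(seq_len=seq_len, batch_size=batch_size, **shape) / 1024**3
        print(f"seq_len={seq_len:5d} batch={batch_size:3d} -> {gib:.1f} GiB")
```

With this shape each token costs 128 KiB of cache, so 16 concurrent sequences of 4096 tokens already need 8 GiB. Because the size is the product of sequence length and batch size, doubling both quadruples it, which is why a setup that looks comfortable in a single-request test can run out of memory under production load.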