Speed
Speed of Serving
LLM deployment and inference require a significant infrastructure undertaking to make model serving performant and cost-efficient. Achieving sub-second inference latency while serving thousands of concurrent users requires a combination of state-of-the-art hardware and software.
Latency & Throughput
For the software to take full advantage of this hardware, we run NVIDIA’s TensorRT-LLM, an open-source library that accelerates and optimizes LLM inference. TensorRT-LLM wraps TensorRT’s deep learning compiler and includes the latest optimized kernels, with cutting-edge implementations of FlashAttention and masked multi-head attention (MHA) for the context and generation phases of LLM execution. On top of this, our models are served with a series of optimizations: appropriate numeric precisions (FP16, FP8, INT4, etc.), optimal batch sizes, and efficient multi-head decoding that predicts multiple future tokens simultaneously. This combination of state-of-the-art hardware, novel algorithms, and efficient deployment strategies enables us to achieve a TTFB (time to first byte) in the range of a couple hundred milliseconds at very high throughput in tokens per second.
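To make the latency and throughput figures concrete, here is a minimal sketch of how these two metrics can be measured from the client side. It assumes the model is exposed behind an OpenAI-compatible streaming completions endpoint; the URL, model name, and prompt below are placeholders, not part of the deployment described above, and token counts are approximated by counting streamed chunks.

```python
import time

import requests

# Placeholder endpoint and model name for an OpenAI-compatible streaming server.
ENDPOINT = "http://localhost:8000/v1/completions"
MODEL = "my-model"


def measure_ttfb_and_throughput(prompt: str, max_tokens: int = 256) -> None:
    """Send a streaming completion request and report TTFB and decode throughput."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": True,
    }
    start = time.perf_counter()
    first_chunk_time = None
    num_chunks = 0  # Each streamed chunk typically carries one newly generated token.

    with requests.post(ENDPOINT, json=payload, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events are prefixed with "data: "; skip keep-alives.
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            if first_chunk_time is None:
                first_chunk_time = time.perf_counter()
            num_chunks += 1

    end = time.perf_counter()
    if first_chunk_time is None:
        print("No tokens received.")
        return
    ttfb = first_chunk_time - start
    decode_time = end - first_chunk_time
    print(f"TTFB: {ttfb * 1000:.0f} ms")
    print(f"Decode throughput: ~{num_chunks / decode_time:.1f} tokens/s")


measure_ttfb_and_throughput("Explain KV caching in one paragraph.")
```

TTFB here covers the full context (prefill) phase plus network overhead, while the tokens-per-second figure reflects only the generation (decode) phase, which is why the two are reported separately.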