
Apple has announced Accelerating LLM Inference on NVIDIA GPUs with ReDrafter


Accelerating LLM inference is an important ML research problem, as auto-regressive token generation is computationally expensive and relatively slow, and improving inference efficiency can reduce latency for users. In addition to ongoing efforts to accelerate inference on Apple silicon, we have recently made significant progress in accelerating LLM inference for the NVIDIA GPUs widely used for production applications across the industry.
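
To illustrate why auto-regressive generation is slow, here is a minimal sketch of a standard greedy decoding loop. The model is stubbed out with random logits purely for illustration; the point is that every new token costs one full forward pass of the (expensive) model.

```python
# Minimal sketch of standard auto-regressive (greedy) decoding.
# `dummy_lm` is a stand-in for an expensive transformer forward pass;
# in practice each step is a full model call, which is why generation is slow.
import numpy as np

VOCAB_SIZE = 32_000
rng = np.random.default_rng(0)

def dummy_lm(token_ids: list[int]) -> np.ndarray:
    """Placeholder for a causal LM: returns next-token logits for the sequence."""
    return rng.standard_normal(VOCAB_SIZE)

def greedy_decode(prompt_ids: list[int], max_new_tokens: int = 8) -> list[int]:
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = dummy_lm(tokens)              # one full forward pass ...
        tokens.append(int(np.argmax(logits)))  # ... yields exactly one new token
    return tokens

print(greedy_decode([1, 2, 3]))
```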

Earlier this year, we published and open sourced Recurrent Drafter (ReDrafter), a novel approach to speculative decoding that achieves state of the art performance. ReDrafter uses an RNN draft model, and combines beam search with dynamic tree attention to speed up LLM token generation by up to 3.5 tokens per generation step for open source models, surpassing the performance of prior speculative decoding techniques.
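
The draft-and-verify idea behind speculative decoding can be sketched as follows. This is a simplified, single-sequence illustration with hypothetical `draft_model` and `target_model` stubs, not ReDrafter itself, which drafts with an RNN head and verifies a beam of candidates with dynamic tree attention in a single pass.

```python
# Simplified draft-and-verify loop to illustrate speculative decoding.
# NOT the ReDrafter algorithm; a single-sequence sketch with random stub models.
import numpy as np

VOCAB = 100
rng = np.random.default_rng(0)

def draft_model(tokens):   # cheap drafter: proposes one next token
    return int(rng.integers(VOCAB))

def target_model(tokens):  # expensive target model: the "ground truth" next token
    return int(rng.integers(VOCAB))

def speculative_step(tokens, num_draft=4):
    """Draft `num_draft` tokens cheaply, then keep the longest prefix the
    target model agrees with, plus one corrected or bonus token from the target."""
    draft, ctx = [], list(tokens)
    for _ in range(num_draft):
        t = draft_model(ctx)
        draft.append(t)
        ctx.append(t)
    # Verification: for clarity we call the target once per draft position here;
    # a real system scores all draft positions in one batched forward pass.
    accepted, ctx = [], list(tokens)
    for t in draft:
        expected = target_model(ctx)
        if t == expected:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(expected)  # replace the first mismatch and stop
            break
    else:
        accepted.append(target_model(ctx))  # bonus token if all drafts accepted
    return tokens + accepted

print(speculative_step([1, 2, 3]))
```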


Productionizing ReDrafter to Speed up NVIDIA TensorRT-LLM

This research work demonstrated strong results, but its greater impact comes from being applied in production to accelerate LLM inference. To make this advancement production-ready for NVIDIA GPUs, we collaborated with NVIDIA to integrate ReDrafter into the NVIDIA TensorRT-LLM inference acceleration framework.

Although TensorRT-LLM supports numerous open source LLMs and the Medusa speculative decoding method, ReDrafter’s beam search and tree attention algorithms rely on operators that had never been used in previous applications. To enable the integration of ReDrafter, NVIDIA added new operators or exposed existing ones, which considerably improved TensorRT-LLM's capability to accommodate sophisticated models and decoding methods. ML developers using NVIDIA GPUs can now easily benefit from ReDrafter’s accelerated token generation for their production LLM applications with TensorRT-LLM.
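
As a rough illustration of what "tree attention" over drafted beams involves (not TensorRT-LLM's actual operators, which are implemented inside the inference engine), the sketch below deduplicates a few hypothetical draft beams into a token tree and builds a mask in which each node attends only to itself and its ancestors, so all candidates can be verified in a single forward pass.

```python
# Illustrative sketch of tree-attention masking for verifying multiple draft
# beams at once. The beams and mask layout here are hypothetical.
import numpy as np

def build_token_tree(beams):
    """Flatten draft beams into unique tree nodes: (parent_index, token)."""
    nodes = []   # list of (parent_idx, token); parent -1 means "the current context"
    index = {}   # (parent_idx, token) -> node index
    leaves = []
    for beam in beams:
        parent = -1
        for tok in beam:
            key = (parent, tok)
            if key not in index:
                index[key] = len(nodes)
                nodes.append(key)
            parent = index[key]
        leaves.append(parent)          # leaf node index for each beam
    return nodes, leaves

def tree_attention_mask(nodes):
    """Each node may attend only to itself and its ancestors in the tree."""
    n = len(nodes)
    mask = np.zeros((n, n), dtype=bool)
    for i, (parent, _) in enumerate(nodes):
        mask[i, i] = True
        while parent != -1:
            mask[i, parent] = True
            parent = nodes[parent][0]
    return mask

beams = [[5, 7, 9], [5, 7, 2], [5, 3, 9]]   # drafts sharing the prefix "5"
nodes, _ = build_token_tree(beams)
print(nodes)                                 # shared prefixes appear only once
print(tree_attention_mask(nodes).astype(int))
```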

In benchmarking a tens-of-billions parameter production model on NVIDIA GPUs, using the NVIDIA TensorRT-LLM inference acceleration framework with ReDrafter, we have seen a 2.7x speed-up in generated tokens per second for greedy decoding (see Figure 1). These benchmark results indicate this technology could significantly reduce the latency users may experience, while also using fewer GPUs and consuming less power.
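
As a back-of-the-envelope example of what that speed-up means for users, the snippet below applies the reported 2.7x factor to a purely hypothetical baseline throughput (the benchmark does not give absolute numbers here):

```python
# Hypothetical illustration of a 2.7x tokens/sec speed-up on response latency.
baseline_tps = 30.0            # assumed baseline tokens/sec (not from the benchmark)
speedup = 2.7                  # reported speed-up for greedy decoding
redrafter_tps = baseline_tps * speedup

response_tokens = 500
print(f"baseline latency : {response_tokens / baseline_tps:.1f} s")
print(f"with ReDrafter   : {response_tokens / redrafter_tps:.1f} s")
```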

For additional detail, see this post on the NVIDIA developer blog.

Conclusion

LLMs are increasingly being used to power production applications, and improving inference efficiency can both lower computational costs and reduce latency for users. With ReDrafter’s novel approach to speculative decoding integrated into the NVIDIA TensorRT-LLM framework, developers can now benefit from faster token generation on NVIDIA GPUs for their production LLM applications.

A Breakthrough Announcement from NVIDIA + TSMC

On another note, NVIDIA offered an upbeat outlook for silicon photonics while presenting its AI GPU technology at IEDM 2024, the world's leading semiconductor conference, held in the United States on Dec. 7. During the conference, NVIDIA showcased a silicon photonics prototype developed in collaboration with TSMC.

An industry insider commented on the significance of this technology, stating, "It is hundreds of times faster than the existing method where data moves through metals like copper." This speed advantage is crucial for data-intensive applications such as AI data centers, where efficient data transfer is paramount. For more on Nvidia/TSMC news, read the full report by BusinessKorea.
