Apple and NVIDIA have officially announced that they are working hand in hand to advance large language model (LLM) technology. The first results already show significant progress, highlighting the effectiveness of this strategic collaboration.
A collaboration to accelerate large language models (LLMs) using ReDrafter
Tech giants Apple and NVIDIA are joining forces to push the boundaries of artificial intelligence (AI) with a new approach to faster text generation. In a blog post, Apple engineers revealed details of their collaboration to integrate their ReDrafter method into NVIDIA's TensorRT-LLM framework.
This advancement promises to transform the way large language models (LLMs) are used in production, significantly reducing latency and hardware resource requirements.
ReDrafter: an innovation in text generation
Introduced by Apple earlier this year, ReDrafter is an advanced text generation method that combines two innovative techniques:
- Beam search: a method that explores several possible paths to generate the most coherent text.
- Dynamic tree attention: a technique that optimizes choice management using a tree structure, thereby improving speed and efficiency.
ReDrafter offers faster text generation while maintaining industry-leading quality. This innovation is particularly useful for large language models used in complex applications.
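The core idea behind this family of techniques is speculative decoding: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them all in a single pass. The sketch below illustrates that draft-and-verify loop with toy "models"; it is a deliberate simplification for intuition only, not Apple's implementation, which uses a recurrent draft model with beam search and dynamic tree attention on the GPU.

```python
def target_next(ctx):
    """Toy stand-in for the expensive target model: the token it would emit."""
    return (ctx[-1] + 1) % 100

def draft_next(ctx):
    """Toy stand-in for the cheap draft model: usually right, sometimes wrong."""
    nxt = (ctx[-1] + 1) % 100
    return 0 if nxt % 10 == 0 else nxt  # deliberately wrong at multiples of 10

def speculative_decode(prompt, n_tokens, k=4):
    """Generate n_tokens; each target pass verifies k drafted tokens at once."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) Draft model proposes k tokens sequentially (cheap).
        drafts, ctx = [], list(out)
        for _ in range(k):
            t = draft_next(ctx)
            drafts.append(t)
            ctx.append(t)
        # 2) One target pass verifies all k positions (parallel in practice).
        target_passes += 1
        accepted, ctx = [], list(out)
        for t in drafts:
            expected = target_next(ctx)
            if t == expected:          # draft matches: accept for free
                accepted.append(t)
                ctx.append(t)
            else:                      # mismatch: keep the target's token, stop
                accepted.append(expected)
                break
        else:
            # All drafts accepted: the same pass also yields one bonus token.
            accepted.append(target_next(ctx))
        out.extend(accepted)
    return out[:len(prompt) + n_tokens], target_passes

tokens, passes = speculative_decode([1], 20, k=4)
# Output matches plain greedy decoding, but with far fewer target passes.
```

Because output quality is decided entirely by the verification step, the generated sequence is identical to what greedy decoding would produce; the draft model only changes how many expensive model calls are needed.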
A strategic collaboration with NVIDIA
To deploy ReDrafter at scale, Apple partnered with NVIDIA, a leader in GPUs and artificial intelligence acceleration tools. Through this collaboration, ReDrafter has been integrated into TensorRT-LLM, NVIDIA's framework designed to accelerate language model inference on GPUs.
Improvements from NVIDIA
To enable this integration, NVIDIA added new operators and exposed existing ones in TensorRT-LLM, making the framework more flexible and efficient. They also optimized the text generation process, bringing significant benefits to developers using NVIDIA GPUs for production LLM applications.
Impressive results: a 2.7x speedup
In production testing with a model of tens of billions of parameters, Apple and NVIDIA measured a 2.7x increase in tokens generated per second during greedy decoding.
Key benefits include:
- Reduced latency: users experience less waiting time for AI-generated responses.
- Lower cost: fewer GPUs are required to achieve similar performance, reducing computational costs.
- Better energy efficiency: decreased energy consumption contributes to a lower carbon footprint for production applications.
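A back-of-envelope calculation shows how a 2.7x token-rate speedup translates into the latency and cost reductions above. All baseline numbers here are hypothetical, chosen only for illustration; only the 2.7x factor comes from the announcement.

```python
import math

baseline_tps = 100.0                     # hypothetical tokens/sec per GPU
speedup = 2.7                            # reported greedy-decoding speedup
redrafter_tps = baseline_tps * speedup   # 270 tokens/sec per GPU

# Latency for a hypothetical 540-token response:
response_tokens = 540
baseline_latency = response_tokens / baseline_tps    # 5.4 s
redrafter_latency = response_tokens / redrafter_tps  # 2.0 s

# GPUs needed to serve a hypothetical aggregate demand of 2700 tokens/sec:
demand_tps = 2700.0
baseline_gpus = math.ceil(demand_tps / baseline_tps)    # 27 GPUs
redrafter_gpus = math.ceil(demand_tps / redrafter_tps)  # 10 GPUs
```

Under these assumed numbers, the same traffic is served with roughly a third of the hardware, which is where the cost and energy savings come from.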
Significant impact for developers and businesses
Apple researchers emphasize that this advance meets a growing demand for efficient solutions in production. LLMs are used today in areas such as:
- Virtual assistants
- Automated content generation
- Advanced analysis tools
With the integration of ReDrafter into TensorRT-LLM, developers can now leverage improved performance on NVIDIA GPUs to create faster, more accessible applications.
A revolution in generative AI
This collaboration between Apple and NVIDIA marks an important step in optimizing large language models. By combining Apple’s algorithmic innovations with NVIDIA’s hardware and software capabilities, the two companies show how strategic partnerships can accelerate real-world adoption of AI.