NVIDIA, the graphics card company, presented the RTX 4000 series on September 20. During the conference we saw the RTX 4090 and 4080 along with the architecture that brings them to life: Ada Lovelace. These GPUs promise to be the most powerful GeForce cards in history.
We have already covered everything there was to say about the RTX 4000 series: models, release date, prices, specifications… Now it’s time to focus on the architecture that makes these graphics cards unique.
At the heart of the GeForce RTX 4090 is the gigantic AD102 silicon. Built on a 4nm silicon manufacturing process, this chip measures 608mm² in area and contains 76.3 billion transistors.
And the good news is that we’ve now been able to take a better look at the AD102’s silicon-level block diagram, which includes the introduction of several new components.
This is the new architecture of the NVIDIA GeForce
The AD102 has a PCI-Express 4.0 x16 host interface and a 384-bit GDDR6X memory interface. The GigaThread Engine acts as the main resource allocation component of the silicon.
Ada introduces the Optical Flow Accelerator, a crucial component that lets DLSS 3 generate entire frames without involving the graphics rendering machinery.
The chip has twice as many media encoding hardware engines as Ampere, including hardware-accelerated AV1 encoding and decoding. Multiple accelerators allow you to transcode several video streams at once (great for content creators).
The main graphics rendering components of the AD102 are the GPCs (Graphics Processing Clusters). There are 12 of them, compared to 7 in the previous-generation GA102. Each GPC has a raster engine and render backends shared by its six TPCs (Texture Processing Clusters).
Each TPC contains two SMs (streaming multiprocessors), the indivisible number-crunching machinery of the NVIDIA GPU. The SM is where NVIDIA makes its biggest architectural innovations and where the expected performance gains come from.
Each SM contains a third-generation RT core, a 128 KB L1 cache, and four TMUs, along with four clusters, each containing 16 FP32 CUDA cores, 16 FP32/INT32 CUDA cores, 4 load/store units, a tiny L0 cache, a register file, and the all-important fourth-generation Tensor core.
Therefore, each SM contains a total of 128 CUDA cores, 4 Tensor cores and one RT core. There are 12 SMs per GPC, which means 1,536 CUDA cores, 48 Tensor cores and 12 RT cores per GPC. Across the twelve GPCs, that adds up to 18,432 CUDA cores, 576 Tensor cores and 144 RT cores.
Each GPC also contributes 16 ROPs, so there are a whopping 192 ROPs on the chip. An L2 cache serves as the meeting point where the GPCs, the memory controllers, and the PCIe host interface exchange data.
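The unit counts above are simple multiplication, and a few lines of Python make the hierarchy easy to check. This is only a sketch built from the figures quoted in this article; the function and dictionary key names are our own and do not come from NVIDIA.

```python
# Unit counts quoted in the article for the full AD102 die.
GPCS = 12
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
CUDA_PER_SM = 128      # four clusters of 16 + 16 CUDA cores
TENSOR_PER_SM = 4      # one Tensor core per cluster
RT_PER_SM = 1
ROPS_PER_GPC = 16


def ad102_totals():
    """Multiply the per-unit counts up to per-GPC and full-chip totals."""
    sms_per_gpc = TPCS_PER_GPC * SMS_PER_TPC
    per_gpc = {
        "sm": sms_per_gpc,
        "cuda": sms_per_gpc * CUDA_PER_SM,
        "tensor": sms_per_gpc * TENSOR_PER_SM,
        "rt": sms_per_gpc * RT_PER_SM,
        "rop": ROPS_PER_GPC,
    }
    chip = {name: count * GPCS for name, count in per_gpc.items()}
    return per_gpc, chip


if __name__ == "__main__":
    per_gpc, chip = ad102_totals()
    print("per GPC:", per_gpc)
    print("full chip:", chip)
```

Changing a single constant, for example the number of GPCs, shows how the totals would scale on a cut-down version of the chip.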
NVIDIA has not mentioned the size of this L2 cache, but it is said to be significantly larger than in the previous generation and to play an important role in lubricating the memory subsystem, enough for NVIDIA to keep the same 21 Gbps data rate on a 384-bit bus as the previous generation.
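To put those memory figures in perspective, peak bandwidth can be estimated with the usual rule of thumb: per-pin data rate multiplied by bus width, divided by 8 to convert bits to bytes. The helper below is a hypothetical name of our own, and the result is a theoretical peak, not a measured figure.

```python
def peak_bandwidth_gb_s(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    data_rate_gbps: per-pin data rate in gigabits per second.
    bus_width_bits: width of the memory interface in bits.
    """
    return data_rate_gbps * bus_width_bits / 8  # 8 bits per byte


if __name__ == "__main__":
    # 21 Gbps GDDR6X on a 384-bit interface, as quoted in the article.
    print(peak_bandwidth_gb_s(21, 384))  # 1008.0
```

Since the raw bandwidth of roughly 1 TB/s does not change between generations under these numbers, any improvement in effective memory performance has to come from elsewhere, which is where the larger L2 cache fits in.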
NVIDIA is introducing Shader Execution Reordering (SER), a new technology that regroups math workloads so that related work lands on the same worker threads, allowing the SIMD components to process it more efficiently.
This is expected to have an especially big impact on ray-traced game rendering. In Cyberpunk 2077, with its new Overdrive graphics preset, which greatly increases RT calculations per pixel, SER improves performance by up to 44%.
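The idea behind reordering can be illustrated with a toy model. NVIDIA GPUs execute threads in groups of 32 (warps), and when the threads of a warp need different shaders, those shaders run one after another. The sketch below is not NVIDIA's implementation: the cost model, the function names and the workload are all assumptions made for illustration, and plain sorting stands in for whatever the hardware actually does.

```python
import random

WARP_SIZE = 32  # threads that execute in lockstep on NVIDIA GPUs


def divergence_cost(shader_ids, warp_size=WARP_SIZE):
    """Sum, over all warps, of the number of distinct shaders in each warp.

    Toy cost model (an assumption): a warp containing k different shaders
    runs them one after another, so it costs k passes.
    """
    cost = 0
    for start in range(0, len(shader_ids), warp_size):
        warp = shader_ids[start:start + warp_size]
        cost += len(set(warp))
    return cost


def reorder_by_shader(shader_ids):
    """Place threads that need the same shader next to each other."""
    return sorted(shader_ids)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical workload: 1024 ray hits spread over 8 material shaders.
    hits = [rng.randrange(8) for _ in range(1024)]
    print("passes before reordering:", divergence_cost(hits))
    print("passes after reordering: ", divergence_cost(reorder_by_shader(hits)))
```

In ray tracing, neighbouring rays often hit very different materials, so incoherent warps like the ones simulated here are common. That is consistent with the technique mattering most in heavily ray-traced scenes.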
NVIDIA has the difficult job of justifying its new generation after two years of shortages, sky-high prices and little information. Launching these first high-end models at official prices never seen before is surely not the best of ideas.