If you compare Intel's L4 cache and AMD's 3D V-Cache...

Rumours have resurfaced that Intel's upcoming 14th generation Core may use an L4 (fourth-level) cache. Some commentators believe that Intel is adding L4 cache to the next generation Core in order to compete head-on with AMD's 3D V-Cache. Are the two really comparable?


Working on on-chip cache has long been a tradition in CPU microarchitecture updates. For the past two years we have been discussing AMD's 3D V-Cache (hereafter V-Cache), a solution that stacks additional L3 cache vertically to increase the processor's L3 cache capacity.


Recently, there have been rumours that Intel's next-generation 14th generation Core may feature L4 cache, that is, a fourth level of cache, which is not new but is still a rarity in PCs. Some commentators believe that Intel is adding L4 cache to the next generation of Core in order to compete head-on with V-Cache.


In fact, last year Chips and Cheese modelled the possible increase in cache access latency from V-Cache and its negative impact on different applications and workloads. In other words, cache stacking may actually hurt certain types of applications.


Our view at the time was that AMD's offering processor versions with and without V-Cache essentially came down to whether the user would rather pay for cache capacity or for core count, and that these two preferences suit different application scenarios.


Recently Chips and Cheese actually tested this generation's AMD Ryzen 9 7950X3D and came to a more interesting conclusion. Although the AMD Ryzen 7000 series seems to be in the midst of an overload incident these days, we would like to use Chips and Cheese's data, together with the rumours of an L4 cache in the 14th generation Core, to talk about the value of "piling on cache" and what the future of PC client processor cache might look like.


The L4 cache rumour and its history

About half a month ago, a Linux patch appeared with the following description:

[Image: screenshot of the Linux patch description]

The latter sentence mentions an L4 cache, although from the patch alone it is not clear what the ADM in front of "L4 cache" stands for.


The former sentence says that on 14th generation Core processors the core graphics (iGPU) can no longer use LLC resources; only the CPU can. LLC here means Last Level Cache, which on a typical CPU is the L3 cache.


We have previously written about this point: 14th generation Core processors are switching to a chiplet design, in which the graphics chiplet that houses the core graphics (iGPU) sits further away from the chiplet that houses the CPU. In earlier Intel processor architectures the core graphics was attached to the ring bus, so it could share the LLC with the CPU.


The chiplet design of the 14th generation Core means that the core graphics can no longer share the LLC, that is, the L3 cache, with the CPU, which is probably why the 14th generation Core intends to start using L4 cache.


The ADM/L4 cache that appears in the second half of the Linux patch description is in fact corroborated elsewhere. In the last couple of days, overseas media uncovered a new Intel patent that mentions an L4 cache, codenamed Adamantine, for certain CPUs.


According to the patent, the ADM cache can improve communication performance not only between the CPU and memory but also between the CPU and the security controller. It can also be used for boot optimisation and even for retaining data to reduce load times. The patent's schematic, in which Gen 12.7, RWC (Redwood Cove) and CMT (Crestmont) appear, basically reveals that this should be the 14th generation Core.


From the schematic, it appears that the L4 cache may be located in the base die, but it is not clear how large this L4 cache will be, and no other relevant information is available.


Those familiar with Intel processors will know that this is not the first time Intel has added an extra level of storage beyond the L3 cache on a client CPU. We don't know the specifics of the L4 cache on Meteor Lake yet, but it would not be correct to say that it exists to compete with AMD's V-Cache.


V-Cache expands the L3 cache, but it is still L3 cache (some of the data from the Chips and Cheese tests is given later). It may therefore differ significantly from an L4 cache, which adds a new level of cache.


We know that a multi-level cache hierarchy places more capacity at the lower levels, those further from the core. The value of this is to further improve cache hit rates and hide the high latency of processor-memory accesses, but the lower levels of cache are also significantly slower. One of the main reasons why L4 cache has not been widely used in PC processors is that its impact on system performance can be very limited.
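The trade-off described above can be made concrete with the textbook average memory access time (AMAT) model. The following is a minimal sketch; every latency and hit rate in it is a made-up illustrative value, not a measurement of any Intel or AMD part, and the function name `amat` is our own.

```python
def amat(levels, memory_latency_ns):
    """Average memory access time for a chain of cache levels.

    levels: list of (latency_ns, local_hit_rate) tuples, ordered from the
    level closest to the core outwards. Every access that reaches a level
    pays that level's latency; a miss falls through to the next level and
    finally to main memory.
    """
    total = 0.0
    reach_prob = 1.0  # probability that an access gets as far as this level
    for latency_ns, hit_rate in levels:
        total += reach_prob * latency_ns
        reach_prob *= 1.0 - hit_rate
    return total + reach_prob * memory_latency_ns


# HYPOTHETICAL numbers, chosen only for illustration.
three_level = [(1.0, 0.90), (3.0, 0.70), (10.0, 0.60)]
four_level = three_level + [(35.0, 0.50)]  # add a large but slow L4

without_l4 = amat(three_level, memory_latency_ns=80.0)
with_l4 = amat(four_level, memory_latency_ns=80.0)
print(f"AMAT without L4: {without_l4:.2f} ns")
print(f"AMAT with L4:    {with_l4:.2f} ns")
```

With these invented inputs only about 1.2% of accesses miss in L3, so even an L4 that catches half of them barely moves the average. The benefit grows only when the L3 miss rate is high or the L4 is much faster than memory.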


The L4 cache we usually refer to today is DRAM rather than SRAM, and the best-known L4 cache in PCs appeared in the Intel Haswell microarchitecture era (2013, 4th generation Core). At that time, Intel gave certain high-end models of that generation up to 128MB of eDRAM (embedded DRAM) in the package, which is regarded as L4 cache. So why don't we call it a capacity extension of the L3 cache?

[Image]

Back then, the eDRAM on Intel processors was a separate die (the die on the left in the picture above) and was very conspicuous from the outside. This L4 cache was used for sharing data between the core graphics and the CPU, and it also initially served as a victim cache for the L3 cache.


Intel carried this eDRAM design through to Coffee Lake (2017, 8th generation Core). However, only a handful of processor models in each architecture generation used it, so not many people know about L4 cache.

[Images: latency and bandwidth graphs]
Source: Chips and Cheese

This time Chips and Cheese also measured several important properties of Intel's eDRAM, including latency and bandwidth. In both graphs, the red line is the V-Cache of the AMD Ryzen 9 7950X3D and the blue line is the eDRAM of the Intel Core i7-5775C (Broadwell, 5th generation Core).


In the test, once the access footprint reaches into the eDRAM, the latter shows a latency of over 30ns, which is quite high for a processor-side cache. In terms of bandwidth, eDRAM delivers about 50GB/s, roughly the level of dual-channel DDR3 memory and only about 1/4 of the bandwidth of this processor's own L3 cache.


Of course, this has a lot to do with the architecture design, including the eDRAM controller, OPIO transfers and so on. The result is not surprising given the large age gap between the two processors. The comparison mainly reflects the advances in semiconductor technology, and incidentally shows how the same idea, expanding the storage available to the CPU inside the package, has changed over a period of about 10 years.


Compared with the other CPU cache tiers, eDRAM is a very slow cache level; in effect, the L3 cache still sits between the cores and the eDRAM and insulates them from its high latency. On the Broadwell generation, Intel uses part of the L3 cache data array as tags for the L4 in order to mitigate the high eDRAM latency as far as possible. This way the L3 controller knows whether a cache miss can be served by eDRAM.
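To illustrate the two ideas mentioned for the eDRAM era, an L4 that receives L3's evicted lines (a victim cache) and L4 tags that can be checked at the L3 controller, here is a toy sketch. It is not Intel's actual design: the class name, the LRU policy, the exclusive fill behaviour and the tiny capacities are all assumptions made for illustration.

```python
from collections import OrderedDict


class VictimL4Hierarchy:
    """Toy model: an LRU L3 whose evicted lines spill into an L4 victim cache.

    The L4 tags are modelled as a plain dictionary that is consulted right
    after an L3 miss, so the controller can tell whether the slow L4 holds
    the line without first paying the L4 access latency.
    """

    def __init__(self, l3_lines, l4_lines):
        self.l3 = OrderedDict()  # address -> None, ordered by recency
        self.l4 = OrderedDict()  # stands in for the L4 tag array
        self.l3_lines = l3_lines
        self.l4_lines = l4_lines

    def _fill_l3(self, addr):
        self.l3[addr] = None
        if len(self.l3) > self.l3_lines:
            victim, _ = self.l3.popitem(last=False)  # evict the LRU line
            self.l4[victim] = None                   # the victim moves to L4
            if len(self.l4) > self.l4_lines:
                self.l4.popitem(last=False)          # L4 is full: drop oldest

    def access(self, addr):
        """Return which level served the access: 'L3', 'L4' or 'DRAM'."""
        if addr in self.l3:
            self.l3.move_to_end(addr)
            return "L3"
        if addr in self.l4:  # tag check performed next to the L3
            del self.l4[addr]
            self._fill_l3(addr)
            return "L4"
        self._fill_l3(addr)
        return "DRAM"


hierarchy = VictimL4Hierarchy(l3_lines=2, l4_lines=4)
trace = [1, 2, 3, 1, 2, 3]
print([hierarchy.access(addr) for addr in trace])
```

In this trace the working set (three lines) is one line larger than the toy L3, so after the first pass every access misses L3 but is caught by the L4 instead of going to memory.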


On the Skylake architecture (6th generation Core, circa 2015), the eDRAM controller is placed in the CPU's system agent and has dedicated tags. The eDRAM of the Skylake era performs worse than Broadwell's, with longer latency and about the same bandwidth. After all, eDRAM is still DRAM: it needs to be refreshed and it does not run fast. This is also why eDRAM is an L4 cache rather than an expansion of the L3 cache.


If L4 cache is reintroduced in the 14th generation Core, the design will certainly be very different from the original Haswell eDRAM. Semiconductor technology has advanced over the years, so perhaps this time the L4 cache will bring surprises. But since it is called L4 cache, it will generally be slower than the L3 cache.


At the end of the day, though, eDRAM on PC processors was more of a luxury storage configuration intended to boost core graphics performance. For applications like gaming, where the latency requirements on the storage medium are naturally less strict, piling on capacity appears to be the more effective approach.


We think the decoupling of the core graphics from the processor's LLC, which the Linux patch mentions, is probably the main reason why L4 cache is appearing on the 14th generation Core processors. In other words, future chiplet-based Intel processors may further enhance core graphics performance, and the L4 cache serves that goal too. It is also quite likely that L4 cache will be optional, just as eDRAM was back in the day.


So Intel's L4 cache is not really comparable to AMD's 3D V-Cache; the two start from different places. But considering that on the PC platform both may end up serving graphics applications like gaming, the approaches do share some similarity, even if the implementations are very different.


A sort of "heterogeneity" created by V-Cache

If you are not familiar with V-Cache, take a look at the articles we have written about it, including the one on the hybrid bonding approach used in its 3D packaging. AMD processors with V-Cache are also a showcase for TSMC's cutting-edge 3D packaging technology.


In simple terms, 3D V-Cache stacks additional L3 cache vertically on top of the original CPU die. As a result, the processor has a larger L3 cache capacity.


When Chips and Cheese first simulated the negative effects of the higher latency that comes with a high-capacity L3 cache, we were generally not optimistic about this solution for PC applications other than gaming. At that time the AMD Ryzen 7 5800X3D was still a small-scale trial. This year, the Zen 4 architecture applies the same technology with apparent ease, especially since some of the earlier problems have been solved.


For example, with V-Cache stacked on top, the CPU core frequency previously could not be pushed very high, which hurt applications that are sensitive to core performance. The new generation (Ryzen 7000 series) benefits greatly from the 5nm process and has improved in all areas. In terms of specific products, AMD processors with higher core counts can now carry V-Cache as well; more importantly, core frequencies are no longer as low as before.


Take the 7950X3D, which has 16 processor cores in total, with each group of 8 cores forming one CCD. In reality only one CCD (8 cores) has V-Cache stacked on it; the other is still a normal die.


This design has an interesting consequence: (1) of the 16 cores, 8 (call them Group A) have a larger cache capacity; (2) the other 8 (call them Group B) run at a higher frequency.


Although the 16 cores are identical in core architecture, the differences in cache capacity and core frequency make Groups A and B look somewhat heterogeneous, although "heterogeneous" may not be quite the right word. A processor with two different core architectures, such as Intel's P-cores and E-cores, is a heterogeneous-core CPU in the standard sense.


But consider this: if an application already achieves a high cache hit rate with 32MB of L3, or has a working set so large that even stacked V-Cache is of little use, then it is obviously better to run it on Group B. If, on the other hand, the larger V-Cache yields a higher cache hit rate for an application, or the application is less sensitive to core frequency, then it is better to run it on Group A.


In fact, the 5.2GHz core frequency of Group A and the 5.7GHz of Group B, together with their different cache capacities, mean that an application can perform differently depending on which core cluster it runs on. It is also worth noting that when AMD launched this generation of V-Cache products, it mentioned working with Microsoft on operating-system-level optimisations. These presumably involve coarse-grained scheduling, where different games run on different core clusters, although at this stage it may be limited to certain application types.
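As a rough picture of what coarse-grained placement means, the sketch below restricts a process to the logical CPUs of one CCD using Python's `os.sched_setaffinity`, which is available on Linux. This is not AMD's or Microsoft's mechanism. The helper names are hypothetical, and the mapping from CPU ids to CCDs is an assumption that has to be checked against the real topology of a given machine.

```python
import os


def cluster_cpus(cluster, cores_per_ccd=8, smt_threads=1):
    """Logical CPU ids for one CCD under an ASSUMED enumeration.

    Assumption: logical CPUs are numbered CCD by CCD, so CCD 0 owns ids
    0 .. cores_per_ccd * smt_threads - 1 and CCD 1 owns the next block.
    Real systems may enumerate cores and SMT siblings differently.
    """
    width = cores_per_ccd * smt_threads
    start = cluster * width
    return set(range(start, start + width))


def pin_to_cluster(pid, cluster):
    """Restrict a process to one CCD. Linux only; returns None elsewhere."""
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only keep CPUs that the process is currently allowed to use.
    wanted = cluster_cpus(cluster) & os.sched_getaffinity(pid)
    if wanted:
        os.sched_setaffinity(pid, wanted)
    return wanted


print(sorted(cluster_cpus(0)))
print(sorted(cluster_cpus(1)))
```

Calling `pin_to_cluster(0, 0)` would pin the current process to the first block of CPUs. Which block corresponds to the V-Cache CCD is not something this sketch can know.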


Chips and Cheese also comments that uneven performance across cores seems to be something of a trend these days, with different core configurations used to cover different application scenarios. AMD may not have set out to produce this result with V-Cache, but in practice it has. In other words, if an application that is not cache-sensitive runs on a Group A core, it will not perform as well as it could. That said, in AMD's design this effect is probably far smaller than with Intel's big-core and small-core design, where poor scheduling has a huge impact.


The true performance of V-Cache

Finally, we borrow data from Chips and Cheese's tests to see how V-Cache really performs in this generation of processors. From this data, we believe that the latency increase caused by V-Cache is much less severe than we thought, so the overall results are quite good.

[Image]

The two main objects of comparison are the aforementioned Group A (blue line) and Group B (orange line) core clusters, that is, the 8 cores of the AMD Ryzen 9 7950X3D with V-Cache stacked on them versus the 8 cores without.


The V-Cache on the Zen 4 architecture does introduce some latency penalty: Group A cores take 4 extra cycles to fetch data from L3 compared with Group B. Note that in absolute time the latency impact is somewhat larger, because Group A also runs at a lower frequency; still, +1.61ns of latency is worth it for a 3x capacity increase.
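The way a cycle penalty and a clock gap combine can be sketched with a one-line conversion, latency in ns = cycles / frequency in GHz. In the sketch below the 4 extra cycles and the 5.2GHz and 5.7GHz clocks come from this article, but the 50-cycle baseline for Group B is an assumed value picked for illustration, not a figure reported here, and the clocks during the actual test may have differed.

```python
def latency_ns(cycles, freq_ghz):
    """Convert a latency in core cycles to nanoseconds."""
    return cycles / freq_ghz


BASE_L3_CYCLES = 50  # ASSUMED Group B L3 latency, illustration only
EXTRA_CYCLES = 4     # extra L3 cycles on Group A (from the article)
FREQ_A_GHZ = 5.2     # Group A clock quoted in the article
FREQ_B_GHZ = 5.7     # Group B clock quoted in the article

lat_b = latency_ns(BASE_L3_CYCLES, FREQ_B_GHZ)
lat_a = latency_ns(BASE_L3_CYCLES + EXTRA_CYCLES, FREQ_A_GHZ)
cycles_only = latency_ns(EXTRA_CYCLES, FREQ_A_GHZ)

print(f"Group B L3 latency: {lat_b:.2f} ns")
print(f"Group A L3 latency: {lat_a:.2f} ns")
print(f"Total gap:          {lat_a - lat_b:.2f} ns")
print(f"Gap from the 4 extra cycles alone: {cycles_only:.2f} ns")
```

Under these assumptions the 4 extra cycles account for roughly half of the gap, and the rest comes from every L3 cycle taking longer at the lower clock. With the assumed 50-cycle baseline the total lands close to the +1.61ns quoted above.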


In terms of bandwidth, when a single core uses the L3 bandwidth, Group B has an 11% advantage over Group A, which should be mainly due to the frequency difference. When all cores fetch data, the maximum all-core frequency drops, of course; after the drop, Group A still runs slower than Group B, so Group B still has the better all-core L3 cache bandwidth. At this point, however, the bandwidth difference is greater than the frequency difference.


According to Chips and Cheese's data, the Group A core cluster averages 18.45 bytes per cycle per core, compared with 20.8 bytes for Group B. We feel there should be no difference here in theory, and the reason for the difference is unknown.
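Because these figures are per cycle, the shortfall they imply is independent of clock speed and stacks on top of any frequency gap. A small sketch of the arithmetic follows; the two bytes-per-cycle values are from the article, while the all-core clocks passed to `bandwidth_gbs` are hypothetical placeholders, since the article does not give them.

```python
BYTES_PER_CYCLE_A = 18.45  # Group A (V-Cache CCD), from the article
BYTES_PER_CYCLE_B = 20.8   # Group B (plain CCD), from the article


def per_cycle_deficit(a, b):
    """Fraction by which a falls short of b."""
    return 1.0 - a / b


def bandwidth_gbs(bytes_per_cycle, freq_ghz, cores):
    """Aggregate bandwidth in GB/s: bytes/cycle x 1e9 cycles/s x cores."""
    return bytes_per_cycle * freq_ghz * cores


deficit = per_cycle_deficit(BYTES_PER_CYCLE_A, BYTES_PER_CYCLE_B)
print(f"Per-cycle deficit of Group A: {deficit:.1%}")

# HYPOTHETICAL all-core clocks, for illustration only.
bw_a = bandwidth_gbs(BYTES_PER_CYCLE_A, freq_ghz=4.8, cores=8)
bw_b = bandwidth_gbs(BYTES_PER_CYCLE_B, freq_ghz=5.1, cores=8)
print(f"Bandwidth gap with these clocks: {1.0 - bw_a / bw_b:.1%}")
```

The per-cycle deficit works out to a little over 11%, which is why the all-core bandwidth gap ends up larger than the frequency gap alone.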

[Image]

As for cache hit rates, Chips and Cheese measured several games, including GHPC, Cyberpunk 2077, Digital Combat Simulator World (DCS) and Call of Duty: Black Ops Cold War. Here we only summarise the results; those interested in the details can read the original article.


In GHPC, V-Cache delivered a 33% increase in L3 cache hit rate, reaching an average hit rate of 78%, which almost completely offsets the small increase in latency. Group A ended up with a 9.67% IPC gain over the Group B cores, which is pretty good.


Cyberpunk 2077 showed a 13.4% IPC lead for Group A over Group B (above). We would call a new generation of CPU architecture quite successful if it delivered an IPC increase of this magnitude on average.


In DCS, a game that already has a relatively high L3 hit rate, V-Cache gains very little; instead, the increased latency becomes a major factor dragging down the frame rate. DCS is thus a counter-example to the benefit of V-Cache.


Call of Duty: Black Ops Cold War, a game with an average L3 hit rate of just under 50%, saw a 47% increase in hit rate with the addition of V-Cache, resulting in a 19% IPC increase on paper. The share of backend-bound stalls dropped significantly, meaning that the execution units were used more efficiently thanks to the higher cache hit rate; however, there was a small impact on front-end performance.


Finally, Chips and Cheese also compared video encoding and file compression scenarios. In the libx264 video encoding test, the hit rate improved by 16.75% and IPC by about 4.9%, but since the Group B cores are clocked 7% higher, Group A ends up with no performance advantage to speak of.


In fact, the libx264 test nicely illustrates why AMD uses two different sets of cores, A and B, that is, why it stacks V-Cache on only one CCD: higher core frequencies still deliver better performance in a large number of application scenarios.


In the 7-Zip file compression test, V-Cache delivered a nearly 30% improvement in cache hit rate and a 9.75% improvement in IPC. Considering that Group B's 9% core frequency advantage does not usually translate linearly into performance, this test shows the value that V-Cache can bring to some non-gaming workloads. However, AMD's default strategy appears to be to run non-gaming applications on the CCD without V-Cache, so AMD seems to be getting its first taste of the scheduling problems that come with "heterogeneity"...
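The libx264 and 7-Zip comparisons can be checked with a simple first-order model in which performance is proportional to IPC times frequency. This is an assumption on our part and a best case for the higher-clocked cores, since real performance rarely scales linearly with frequency. The percentages are the ones quoted in this article; the function name is hypothetical.

```python
def relative_performance(ipc_gain_a, freq_advantage_b):
    """Group A performance relative to Group B (1.0 means parity).

    Assumes performance is proportional to IPC x frequency, which
    overstates the benefit of Group B's higher clock in most workloads.
    """
    return (1.0 + ipc_gain_a) / (1.0 + freq_advantage_b)


# IPC gain of Group A and clock advantage of Group B, as quoted above.
libx264 = relative_performance(ipc_gain_a=0.049, freq_advantage_b=0.07)
seven_zip = relative_performance(ipc_gain_a=0.0975, freq_advantage_b=0.09)

print(f"libx264: Group A at {libx264:.3f}x of Group B")
print(f"7-Zip:   Group A at {seven_zip:.3f}x of Group B")
```

Even under this frequency-friendly model, Group A trails by only about 2% in libx264 and comes out marginally ahead in 7-Zip, which is consistent with the article's reading of both tests.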


In fact, the main point of this article is that the value V-Cache brings is not as limited as we previously thought, although the cost in die area still needs to be considered. The first part of the article discussed the possible L4 cache in the 14th generation Core, but only as a side note. We still have limited information on the 14th generation Core, so we do not know how much benefit the L4 cache will actually deliver, or exactly how.


But what is certain is that future AMD client processors will stick with the V-Cache scheme. We just wonder whether AMD will need to think further about how to handle cores that are starting to differ in performance. Even V-Cache alone, which looks like a purely technical change to CPU caching, shows the need for special optimisation for different types of applications (although maybe V-Cache is just a by-product of server-side technology...).
