Intel Xe2 GPUs Official: 50% Performance Uplift, New Ray Tracing Cores, Coming To Lunar Lake First & Battlemage Discrete Cards Later

Hassan Mujtaba

Intel's Xe2 is official and will be coming to Lunar Lake CPUs and the next-gen Arc discrete graphics lineup codenamed "Battlemage".

Intel Xe2 GPU Architecture Achieves 50% Performance Improvement, New Ray Tracing Units & VVC Support Onboard, Arc Battlemage Coming Soon!

At ITT 2024, Intel squashed all rumors around the cancelation or delay of its GPU and Arc lineup. Tom Petersen gave one of the most charged presentations during the event, which was centered around the next-generation Xe2 architecture. Starting with the details, Intel is making things simpler: instead of using the LP, LPG, HP, and HPG naming schemes, the company is simply calling its next-gen lineup Xe2. Internally, these chips will still carry those codenames, but they won't be used on the client side anymore.

Some of Intel's goals with Xe2 were to achieve higher utilization, improved work distribution, and less software overhead. It is a ground-up design that fixes several major issues noticed with the Xe "Alchemist" GPUs. Right off the bat, Intel wowed the audience with an IP performance efficiency chart showing gains of up to 12.5x, which are quite significant, and this deep dive covers what Xe2 is and how Intel is achieving these gains.

Intel states that the Xe2 architecture, just like Xe, is highly scalable which will lead to its integration within low-power mobile SOCs such as Lunar Lake and up to higher-end Arc graphics cards with discrete options that come out later.

Intel Xe2 Architectural Deep-Dive

So beginning our deep-dive, the second generation Xe core or Xe2 comes with several compute resources that are repartitioned into native SIMD16 engines for increased efficiency.

The Xe2 core features:

  • 8 512-bit Vector Engines
  • 8 2048-bit XMX Engines
  • 64b atomic ops support
  • 192KB Shared L1$/SLM
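
As a rough sanity check on these figures, the lane count of a vector engine follows from its width. The sketch below assumes 32-bit (FP32) lanes, which the list does not state explicitly, and the variable names are purely illustrative.

```python
# Assumption: each SIMD lane is 32 bits wide (FP32). Names are illustrative.
VECTOR_ENGINE_WIDTH_BITS = 512   # from the core feature list
VECTOR_ENGINES_PER_CORE = 8      # from the core feature list
LANE_WIDTH_BITS = 32             # assumed, not stated in the article

# Lanes in one vector engine, then across all engines in one Xe2 core.
lanes_per_engine = VECTOR_ENGINE_WIDTH_BITS // LANE_WIDTH_BITS
lanes_per_core = lanes_per_engine * VECTOR_ENGINES_PER_CORE

print(lanes_per_engine)  # 16, consistent with the native SIMD16 design
print(lanes_per_core)    # 128
```

Under that assumption, a 512-bit engine works out to exactly 16 lanes, which lines up with the native SIMD16 design.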

The Vector Engine has also been updated and now includes:

  • SIMD16 native ALUs - Support for SIMD16 and SIMD32 ops
  • Xe Matrix Extensions (Support for INT2, INT4, INT8, FP16, BF16)
  • Extended Math & FP64 - Transcendentals: SIN, COS, LOG, EXP
  • 3-way co-issue - FP + INT/EM + XMX

The Xe Matrix Engines or XMX units were also present on Alchemist "Xe" GPUs but what has changed now is that they support more data types and run much faster with FP16 rated at 2048 OPS/clock & INT8 rated at 4096 OPS/clock.

With those two out of the way, let's see how these new engines stack within the Xe2 Render Slice, which is the fundamental building block of the Xe2 GPU. Render Slices can be stacked and scaled as needed and are optimized to reduce latency, remove stalls, and improve the HW/SW handshake. They are connected to a Command Front End which natively supports Execute Indirect.

The Render Slice also includes a new Geometry engine with 3x vertex fetch throughput and 3x mesh shading performance (with vertex re-use), a new L1$/SLM cache for out-of-order sampling (with compressed textures), 2x throughput for sampling without filtering plus programmable offsets, and a new HiZ unit which has 50% more cache and supports Early HiZ culling of small primitives. Lastly, there are two new Pixel Backends which offer twice the blending throughput, a 33% increase in pixel color cache, and render target pre-fetch to L2$.

Xe2's Latest Ray Tracing Unit Improves Upon Xe1

A major block of the Xe2 core is its RTU (Ray Tracing Unit), which features 3 traversal pipelines, 18 box intersections (6 per traversal pipeline across the 3 pipelines), and 2 triangle intersections.

So that's the low-level summary of Intel's Xe2 GPU architecture which offers:

  • 2nd Gen Xe2 Cores
  • Enhanced Vector Engines
  • Deeper Caches
  • New XMX Engines
  • Performance & Efficiency - Optimized front-end
  • Native hardware support for Execute Indirect commands
  • Larger Ray Tracing Units

Overall, Intel's Xe2 GPU architecture is designed to be more compatible with games and achieve higher utilization. Games use Execute Indirect to accelerate draw calls, so native hardware support with a gain of up to 12.5x bodes well for gamers, since engines such as Unreal Engine use it a lot.

Intel Lunar Lake Gets The First Xe2 GPU IP, Full Deep-Dive of Integrated Xe2

The first product to feature Xe2 GPUs is Lunar Lake and it comes in the integrated configuration. Several blocks within Lunar Lake are tied to the GPU such as the Media Engine and the Display Engine.

Before we get into those, let's talk about the Xe2 configuration for Lunar Lake:

  • 8 Xe2 Cores
  • 64 Vector Engines
  • 2 Geometry Pipelines
  • 8 Samplers
  • 4 Pixel backends
  • 8 ray-tracing units
  • 8 MB L2$

The Lunar Lake Xe2 GPU features 8 Xe2 cores, and each Xe2 core has 8 XMX and 8 Vector units, a Load/Store unit, a Thread Sorting Unit, and a dedicated L1$/SLM cache. Every four Xe2 cores make up a single Render Slice.
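
The per-core and per-slice figures can be cross-checked against the Lunar Lake totals listed above. The short sketch below does that arithmetic; the one-RTU-per-core and two-pixel-backends-per-slice splits are inferred from the quoted totals rather than stated outright, and all names are illustrative.

```python
# Figures quoted in the article; the per-slice and per-core splits are inferred.
XE2_CORES = 8
VECTOR_ENGINES_PER_CORE = 8
XMX_ENGINES_PER_CORE = 8
CORES_PER_RENDER_SLICE = 4
PIXEL_BACKENDS_PER_SLICE = 2       # "two new Pixel Backends" per Render Slice

# Per-RTU figures, assuming one RTU per Xe2 core (8 cores, 8 RTUs).
TRAVERSAL_PIPELINES_PER_RTU = 3
BOX_INTERSECTIONS_PER_RTU = 18
TRIANGLE_INTERSECTIONS_PER_RTU = 2

render_slices = XE2_CORES // CORES_PER_RENDER_SLICE

totals = {
    "render_slices": render_slices,
    "vector_engines": XE2_CORES * VECTOR_ENGINES_PER_CORE,
    "xmx_engines": XE2_CORES * XMX_ENGINES_PER_CORE,
    "pixel_backends": render_slices * PIXEL_BACKENDS_PER_SLICE,
    "traversal_pipelines": XE2_CORES * TRAVERSAL_PIPELINES_PER_RTU,
    "box_intersections": XE2_CORES * BOX_INTERSECTIONS_PER_RTU,
    "triangle_intersections": XE2_CORES * TRIANGLE_INTERSECTIONS_PER_RTU,
}

for name, value in totals.items():
    print(f"{name}: {value}")
```

The derived 64 vector engines and 4 pixel backends match the configuration list, and two Render Slices is consistent with the 2 geometry pipelines quoted there.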

So how does this all scale in terms of performance compared to Meteor Lake's Xe GPU? Intel states that the Xe2 GPU achieves 50% higher performance at ISO power and significantly lower power at the same performance.

The XMX engines are also significant, contributing up to 67 peak INT8 TOPS to the overall AI prowess offered by Lunar Lake CPUs. In total, the chip offers 120 platform TOPS, which includes 48 TOPS from the NPU4 and 5 TOPS from the CPU itself.
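
The TOPS figures add up as quoted, and they also hint at the GPU clock behind the peak number. The sketch below is a back-of-the-envelope calculation only: it assumes the 4096 INT8 OPS/clock rating applies per Xe2 core, and the implied clock is a derived estimate, not a figure Intel quoted here.

```python
# TOPS figures quoted in the article.
GPU_TOPS = 67
NPU_TOPS = 48
CPU_TOPS = 5

platform_tops = GPU_TOPS + NPU_TOPS + CPU_TOPS
print(platform_tops)  # 120

# Assumption: the 4096 INT8 OPS/clock rating is per Xe2 core.
INT8_OPS_PER_CLOCK_PER_CORE = 4096
XE2_CORES = 8

# peak ops/s = cores * ops per clock * clock, so solve for the clock.
ops_per_clock = INT8_OPS_PER_CLOCK_PER_CORE * XE2_CORES
implied_clock_ghz = GPU_TOPS * 1e12 / ops_per_clock / 1e9
print(round(implied_clock_ghz, 2))  # 2.04
```

Under that assumption, the 67 TOPS peak corresponds to a GPU clock of roughly 2 GHz.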

Xe Display Engine For Lunar Lake

Now we shift from the GPU to the other blocks on the Lunar Lake CPU itself, starting with the Display Engine. The Display Engine comes with 3 Display Pipes with up to 8K60 HDR support, up to 3x 4K60 HDR support, and up to 1080p360 or 1440p360 support. The display engine supports HDMI 2.1, DisplayPort 2.1, and the new eDP 1.5 capabilities.

The front end of the Display Engine includes Decode/Decrypt and a Streaming Buffer Zone. For the pixel processing pipeline, you are getting 6 planes per pipeline with hardware support for color conversion and composition while being flexible & power efficient.

There's also an additional Low-Power optimized pipeline with Panel Replay (power gating during idle frames) and a new Brightness sensor with LACE (Local Adaptive Contrast Enhancement). On the compression and encoding side, you get a display stream compression engine with 3:1 visually lossless compression and transport encoding (stream encode for HDMI and DisplayPort protocols). Router and Ports include Stream assembly and Port Routing with up to 4 ports supported for added flexibility.
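
To give a feel for why display stream compression matters at these resolutions, here is a rough estimate of the pixel data rate for an 8K60 HDR stream. It is a sketch under stated assumptions: a 7680x4320 frame, 10 bits per color component with three components, no blanking overhead, and a 3:1 compression ratio. None of these parameters are Intel-quoted link figures.

```python
# Assumptions (not from Intel): 7680x4320 frame, 10-bit RGB, no blanking.
WIDTH, HEIGHT = 7680, 4320
REFRESH_HZ = 60
BITS_PER_PIXEL = 30        # 3 components x 10 bits, assumed
COMPRESSION_RATIO = 3      # assumed 3:1 display stream compression

# Raw pixel payload per second, then the same stream after compression.
raw_gbps = WIDTH * HEIGHT * REFRESH_HZ * BITS_PER_PIXEL / 1e9
compressed_gbps = raw_gbps / COMPRESSION_RATIO

print(f"{raw_gbps:.2f} Gbps raw, {compressed_gbps:.2f} Gbps compressed")
# 59.72 Gbps raw, 19.91 Gbps compressed
```

Real link budgets also have to account for blanking intervals and protocol overhead, so these numbers are a lower bound on what the port actually carries.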

Coming back to eDP (eDisplayPort) 1.5 with Panel Replay, it's being referred to as an evolution of panel self-refresh with selective updates, early transport, and adaptive sync support. The new display capability offers reduced judder and improved playback while offering higher power efficiency.

Xe Media Engine For Lunar Lake - VVC Support, Side-Cache & Better Encoding

The last block of the Lunar Lake SOC that is connected to the Xe2 GPU is the Media Engine which now comes with its own dedicated 8 MB of shared side cache. This new cache can be used by the rest of the chip but there's no need for it since the rest of the cores have a dedicated cache themselves.

This side-cache gives Lunar Lake significant bandwidth savings since it reduces traffic to system memory across media workloads. It also allows significant power reductions for encode workloads.

Diving into the Media Engine, it supports up to 8K60 10-bit HDR decode, up to 8K60 10-bit HDR encode, AVC, VP9, H.265 HEVC, AV1, and a brand new VVC engine. The VVC engine significantly reduces bitrate while delivering the same quality as AV1 (up to 10% file size reduction). It also supports Adaptive Resolution Streaming and Screen content coding.

And lastly, we have the Windows GPU software stack which is ready for Xe2 GPUs. Intel said that it spent a lot of time tuning the API-level performance of its Alchemist "Xe" GPUs, especially DX9, and all of that software work is moving over to Xe2 with support for all the latest APIs and Frameworks along with their runtimes.

That wraps things up for Xe2, a brand new graphics architecture that brings huge performance improvements, the latest feature sets, and a lot more to both integrated solutions such as Lunar Lake and discrete options with the upcoming Arc Battlemage lineup. The company will share more on Battlemage discrete offerings later in the year.
