cuQuantum

Accelerate quantum computing research.

NVIDIA cuQuantum is an SDK of optimized libraries and tools for accelerating quantum computing emulations at both the circuit and device level by orders of magnitude.


Features and Benefits

Flexible

Choose the best approach for your work from algorithm-agnostic accelerated quantum circuit simulation methods.

State vector method features include optimized memory management and gate application kernels.

Tensor network method features include accelerated tensor network contraction, order optimization, and approximate contractions.

Density matrix method features include arbitrary operator action on the state.
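To make "gate application kernel" concrete, here is a minimal pure-Python sketch of what such a kernel computes for the state vector method. It is an illustration only, not cuQuantum code: the function name `apply_single_qubit_gate` and the little-endian qubit ordering are assumptions made for this example.

```python
import math

def apply_single_qubit_gate(state, gate, target):
    """Apply a 2x2 gate to qubit `target` of a state vector.

    Assumes little-endian ordering: qubit k corresponds to bit k of the index.
    """
    new_state = list(state)
    stride = 1 << target
    for base in range(len(state)):
        if base & stride:
            continue  # visit each amplitude pair once, starting from its |0> member
        a0, a1 = state[base], state[base | stride]
        new_state[base] = gate[0][0] * a0 + gate[0][1] * a1
        new_state[base | stride] = gate[1][0] * a0 + gate[1][1] * a1
    return new_state

h = 1 / math.sqrt(2)
HADAMARD = [[h, h], [h, -h]]

# Two qubits initialised to |00>; apply a Hadamard gate to qubit 0.
state = [1 + 0j, 0j, 0j, 0j]
state = apply_single_qubit_gate(state, HADAMARD, target=0)
```

Every gate touches all 2^n amplitudes in independent pairs, which is why the operation maps well onto GPUs.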

Scalable

Leverage the power of multi-node, multi-GPU clusters using the latest GPUs on premises or in the cloud.

Low-level C++ APIs provide increased control and flexibility for a single GPU and single-node multi-GPU clusters.

The high-level Python API supports drop-in multi-node execution.

Fast

Simulate bigger problems faster and get more work done sooner.

Using an NVIDIA H200 Tensor Core GPU over CPU implementations delivers orders-of-magnitude speedups on key quantum problems, including random quantum circuits, Shor’s algorithm, and the Variational Quantum Eigensolver.

Leveraging the NVIDIA Eos supercomputer, cuQuantum generated a sample from a full-circuit simulation of the Google Sycamore processor in less than five minutes.


Framework Integrations

cuQuantum is integrated with leading quantum simulation frameworks. Download cuQuantum to dramatically accelerate performance using your framework of choice, with zero code changes.

NVIDIA cuQuantum is integrated with Amazon Web Services (AWS)
NVIDIA cuQuantum is integrated with blueqat
NVIDIA cuQuantum is integrated with Cirq
NVIDIA cuQuantum is integrated with ExaTN
NVIDIA cuQuantum is integrated with PennyLane
NVIDIA cuQuantum is integrated with Qibo
NVIDIA cuQuantum is integrated with Qiskit
NVIDIA cuQuantum is integrated with QuEST
NVIDIA cuQuantum is integrated with Qulacs
NVIDIA cuQuantum is integrated with TKET
NVIDIA cuQuantum is integrated with TorchQuantum
NVIDIA cuQuantum is integrated with XACC

Components

Tools to accelerate quantum emulations on NVIDIA hardware

Largest Scale Dynamics

Designing quantum computers and devices has always been challenging. Simulations for these problems can be slow and limited in their ability to scale. cuQuantum now includes time dynamics functionality, which enables users to accelerate analog Hamiltonian dynamics to unprecedented scales. Users can now learn faster than before how to optimize the design of devices where quantum phenomena occur.

By distributing the state and operators across multi-GPU multi-node systems, cuQuantum allows phase space exploration larger than ever before, only limited by the number of GPUs you have access to.

A 36-qubit multi-node quantum dynamics simulation

Google was able to scale simulations of analog dynamics on its processors to 36 qubits with 64 GPUs using NVIDIA’s Eos supercomputer. This enables QPU builders like Google to understand long-range effects on their devices, perform validation, and design more effectively than ever before, ushering in a new age for QPU design.

Fastest GPU Implementations

The core operator action API gives developers of custom solvers the flexibility to apply arbitrary time-dependent operators to the quantum state more efficiently than was previously possible. Our advanced algorithms allow us to scale further with the same hardware memory.
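As a rough illustration of what "operator action" means inside a custom solver, the sketch below evaluates the right-hand side of the von Neumann equation, dρ/dt = -i[H(t), ρ], for a single qubit, with ħ set to 1. It is plain Python rather than the cuQuantum API; the driven-qubit Hamiltonian and the names `hamiltonian` and `operator_action` are hypothetical choices made for this example.

```python
import math

def matmul(a, b):
    """Multiply two small square matrices stored as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def hamiltonian(t, omega=1.0, amplitude=0.5):
    """Hypothetical driven qubit: H(t) = (omega/2) Z + amplitude*cos(t) X."""
    drive = amplitude * math.cos(t)
    return [[omega / 2, drive], [drive, -omega / 2]]

def operator_action(t, rho):
    """Right-hand side of the von Neumann equation: -i [H(t), rho]."""
    h = hamiltonian(t)
    h_rho, rho_h = matmul(h, rho), matmul(rho, h)
    n = len(rho)
    return [[-1j * (h_rho[i][j] - rho_h[i][j]) for j in range(n)]
            for i in range(n)]

rho_plus = [[0.5, 0.5], [0.5, 0.5]]  # density matrix of the |+> state
drho_dt = operator_action(0.0, rho_plus)
```

A time integrator would call a function like `operator_action` at every step, so its cost dominates the simulation. Open-system solvers add dissipative terms to the same right-hand side.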

This enables users to design better quantum systems more quickly than was previously possible. With multi-GPU memory, developers can drastically accelerate their QPU design cycle by simulating 473 different quantum systems in the time it formerly took to do just one. Strong scaling shows that these APIs can speed up a range of Hamiltonians and operator terms to even further accelerate your hardware development cycle.

A line graph showing multi-GPU dynamics

cuDensityMat speeds up and scales simulations beyond what was previously possible with the next best alternatives. Simulating a qudit with 2 resonators is now 56X faster than GPU alternatives and 116X faster than CPU alternatives. A 13-qubit 1D spin chain is now 49X faster than GPU alternatives and 78X faster than CPU.

Multi-GPU Speedups

State vector simulation tracks the entire state of the system over time, through each gate operation. It’s an excellent tool for simulating deep or highly entangled quantum circuits, and for simulating noisy qubits.
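Tracking the entire state has a memory cost that doubles with every added qubit. The short calculation below shows that growth, assuming double-precision complex amplitudes (16 bytes each); the helper name `state_vector_bytes` is made up for this example.

```python
def state_vector_bytes(num_qubits, bytes_per_amplitude=16):
    """Memory for a full state vector: 2**n amplitudes, complex128 by default."""
    return (2 ** num_qubits) * bytes_per_amplitude

for n in (30, 36, 40):
    gib = state_vector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.0f} GiB")
# 30 qubits: 16 GiB
# 36 qubits: 1,024 GiB
# 40 qubits: 16,384 GiB
```

Under these assumptions, a 30-qubit state fits on one large GPU, while states of 36 qubits and beyond must be spread across many GPUs.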

Recent software updates to our offering have enabled a 4.4X speedup over previously reported numbers. Combined with ~2X speedups offered by Hopper GPUs, users see even greater speedups over CPU implementations despite CPU hardware and software improvements.

Dual-Socket CPU vs. up to 8x GPU

A graph showing how cuStateVec speeds up simulations of popular quantum algorithms

cuStateVec speeds up simulations of popular quantum algorithms like quantum Fourier transform, Shor’s algorithm, and quantum supremacy circuits by 90–369X on NVIDIA H100 80GB Tensor Core GPUs over CPU implementations on dual Intel Xeon Platinum 8480C CPUs.

Multi-Node Speedups

Multi-node capability enables users of the NVIDIA Quantum platform to achieve the most performant quantum circuit simulations at supercomputer scales. On key problems like quantum phase estimation, quantum approximate optimization algorithm (QAOA), quantum volume, and more, the newest cuQuantum Appliance is over two orders of magnitude faster than previous implementations and seamlessly scales from a single GPU to a supercomputer.
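One common way to spread a state vector over many GPUs is to let the highest-order index bits select the device. The sketch below shows that scheme in plain Python; it is a generic illustration, not a description of cuQuantum's internal layout, and the function names are hypothetical.

```python
def locate_amplitude(global_index, num_qubits, num_gpus):
    """Map a global amplitude index to (gpu_rank, local_index).

    The top log2(num_gpus) index bits ("global" qubits) choose the GPU;
    the remaining bits ("local" qubits) address memory on that GPU.
    """
    num_global_qubits = num_gpus.bit_length() - 1
    assert 1 << num_global_qubits == num_gpus, "num_gpus must be a power of two"
    num_local_qubits = num_qubits - num_global_qubits
    rank = global_index >> num_local_qubits
    local_index = global_index & ((1 << num_local_qubits) - 1)
    return rank, local_index

def gate_needs_communication(target_qubit, num_qubits, num_gpus):
    """A gate on a global qubit pairs amplitudes stored on different GPUs."""
    num_local_qubits = num_qubits - (num_gpus.bit_length() - 1)
    return target_qubit >= num_local_qubits

# 5 qubits on 4 GPUs: 2 global qubits, 3 local qubits (8 amplitudes per GPU).
rank, local_index = locate_amplitude(0b10110, num_qubits=5, num_gpus=4)
# rank == 2, local_index == 6
```

In this scheme, gates on local qubits run independently on each GPU, while gates on global qubits require exchanging amplitudes between devices, so interconnect bandwidth becomes the limiting factor at scale.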


Pathfinding Performance

Tensor network methods are rapidly gaining popularity to simulate hundreds or thousands of qubits for near-term quantum algorithms. Tensor networks scale with the number of quantum gates rather than the number of qubits. This makes it possible to simulate very large qubit counts with smaller gate counts on large supercomputers.

Tensor contractions dramatically reduce the memory requirement for running a circuit on a tensor network simulator. The research community is investing heavily in improving pathfinding methods for quickly finding near-optimal tensor contractions before running a simulation.
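A three-matrix chain is enough to show why the contraction path matters. The sketch below counts multiply-add operations and the size of the intermediate tensor for the two possible orders; the shapes are arbitrary values chosen for illustration and `contract_pair` is a hypothetical helper.

```python
def contract_pair(left, right):
    """Return (result_shape, multiply_adds) for contracting two matrices."""
    (m, k), (k2, n) = left, right
    assert k == k2, "shared index dimensions must match"
    return (m, n), m * k * n

A, B, C = (10, 1000), (1000, 10), (10, 1000)

# Path 1: contract A with B first, then the result with C.
ab, cost_ab = contract_pair(A, B)
_, cost_ab_c = contract_pair(ab, C)
path1_cost = cost_ab + cost_ab_c        # 200,000 multiply-adds
path1_intermediate = ab[0] * ab[1]      # 100 elements

# Path 2: contract B with C first, then A with the result.
bc, cost_bc = contract_pair(B, C)
_, cost_a_bc = contract_pair(A, bc)
path2_cost = cost_bc + cost_a_bc        # 20,000,000 multiply-adds
path2_intermediate = bc[0] * bc[1]      # 1,000,000 elements
```

Both paths give the same result, but here the second needs 100 times more arithmetic and a 10,000 times larger intermediate. In a circuit-sized network the number of possible paths grows combinatorially, which is why pathfinding is a problem in its own right.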

A graph showing time to find an optimized contraction path using single core

Performance for cuTensorNet pathfinding compared to Cotengra in terms of seconds per sample. Both runs use a single core of a Xeon Platinum 8480+.

Sycamore refers to 53-qubit random quantum circuits of depth 10 and 20 from Arute et al., Quantum supremacy using a programmable superconducting processor. www.nature.com/articles/s41586-019-1666-5

Cotengra: Gray & Kourtis, Hyper-optimized Tensor Network Contraction, 2021. quantum-journal.org/papers/q-2021-03-15-410

State-of-the-Art Performance for Contraction Time

Contraction performance for cuTensorNet compared to Torch, CuPy, and NumPy. All runs leverage the same best contraction path. cuTensorNet, CuPy, and Torch all ran on one NVIDIA H200 GPU. NumPy was run on a single-socket Xeon 8480+.

cuTensorNet provides state-of-the-art performance for both the pathfinding and contraction stages of tensor network simulation.
Using cuQuantum, NVIDIA researchers were able to simulate a variational quantum algorithm for solving the MaxCut optimization problem using 1,688 qubits to encode 3,375 vertices on an NVIDIA DGX SuperPOD™ system, a 16X improvement over the previous largest simulation—and multiple orders of magnitude larger than the largest problem run on quantum hardware to date.

Three bar charts showing state-of-the-art performance for contraction time

Sycamore Circuit: 53 qubits depth 10
Quantum Fourier Transform: 34 qubits
Inverse Quantum Fourier Transform: 36 qubits
Quantum Volume: 26 and 30 qubits with depth 30
QAOA: 36 qubits with one and four parameters
PEPS: tensor network with dimensions of 3x3 and operator depth of 30.

Approximate Tensor Network Methods

As the quantum problems of interest can vary greatly in both size and complexity, researchers have developed highly customized approximate tensor network algorithms to address the gamut of possibilities. To enable easy integration with these frameworks and libraries, cuTensorNet provides a set of APIs to cover the following common use cases: tensor QR, tensor SVD, and gate split. These primitives enable users to accelerate and scale different types of quantum circuit simulators.

Matrix product states (MPS, also known as tensor train) are a common approach to simulating quantum computers that takes advantage of these methods. Users can leverage these new cuTensorNet APIs to accelerate MPS-based quantum circuit simulators. The gate split and tensor SVD APIs enable nearly an order of magnitude speedup over state-of-the-art CPU implementations. Tensor QR is the most efficient, with nearly two orders-of-magnitude speedup over the same Xeon 8480+ CPU.
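The appeal of MPS comes from capping the bond dimension, typically by keeping only the largest singular values after each SVD or gate split. The sketch below counts how many numbers an MPS stores for a given cap, assuming open boundary conditions and one 2-level site per qubit; `mps_parameter_count` is a hypothetical helper, not a cuTensorNet function.

```python
def mps_parameter_count(num_qubits, max_bond_dim):
    """Count tensor elements in an open-boundary MPS with a capped bond dimension."""
    total = 0
    for site in range(num_qubits):
        # A bond can never exceed the dimension of the smaller side of the cut.
        left = min(2 ** site, 2 ** (num_qubits - site), max_bond_dim)
        right = min(2 ** (site + 1), 2 ** (num_qubits - site - 1), max_bond_dim)
        total += left * 2 * right
    return total

mps_size = mps_parameter_count(50, max_bond_dim=64)
full_size = 2 ** 50  # amplitudes in the exact 50-qubit state vector
```

With a fixed cap the storage grows linearly with the number of qubits instead of exponentially, at the price of an approximation error that depends on how much entanglement the truncation discards.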

A chart showing MPS gate split performance on GPU

MPS gate split performance is measured in execution time as a function of bond dimension. We execute this on an NVIDIA H200 140GB GPU and compare it to NumPy running on a Xeon 8480+ data center CPU.


Get started with NVIDIA cuQuantum.
