Description

TensorRT Error: [genericReformat.cuh::copyVectorizedRunKernel::1579] Error Code 1: Cuda Runtime (invalid resource handle) in ROS Callback Function

Environment

TensorRT Version: TensorRT 8.6.1
GPU Type: RTX 4060
Nvidia Driver Version: 551.86
CUDA Version: 11.7
CUDNN Version: 8.9.7
Operating System + Version: Linux Ubuntu 20.04
Python Version (if applicable): 3.8.0
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.13.0+cu117
Baremetal or Container (if container which image + tag):

Relevant Files

Here are my ROS subscriber node code, launch file, and engine files:
ROS_lidarcallback_tensorRT.zip (8.9 MB)

Steps To Reproduce

I encounter the following error during inference in a TensorRT-based ROS node for real-time LiDAR point cloud processing:
[genericReformat.cuh::copyVectorizedRunKernel::1579] Error Code 1: Cuda Runtime (invalid resource handle)
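For context, the callback wraps inference in an explicit push/pop of the CUDA context, roughly like the sketch below (helper names are illustrative; the actual code is in the attached zip). One known cause of "invalid resource handle" is using a stream or device buffer under a different CUDA context than the one it was created in, so the sketch assumes the stream, device allocations, and execution context all belong to the same pushed context.

```python
import numpy as np

def as_device_ready(arr, dtype=np.float32):
    # memcpy_htod_async needs a C-contiguous host array of the exact dtype.
    return np.ascontiguousarray(arr, dtype=dtype)

def callback_inference(cuda_ctx, trt_ctx, stream, bindings, host_in, dev_in):
    # Simplified callback body: push the context that owns the engine,
    # do async H2D -> enqueue -> sync, then pop. The stream, the device
    # allocations, and trt_ctx must all have been created while cuda_ctx
    # was current; mixing contexts is a classic source of
    # "invalid resource handle".
    import pycuda.driver as cuda  # lazy import so the sketch parses without a GPU
    cuda_ctx.push()
    try:
        cuda.memcpy_htod_async(dev_in, as_device_ready(host_in), stream)
        ok = trt_ctx.execute_async_v2(bindings=bindings,
                                      stream_handle=stream.handle)
        stream.synchronize()
        return ok
    finally:
        cuda_ctx.pop()
```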

This is the node initialization output:

roslaunch pointpillar_ros pointpillar_ros_tsrrt.launch
… logging to /root/.ros/log/380476a8-b6b1-11ef-97cb-0242ac110002/roslaunch-b5914d4fa2e0-9533.log
Checking log directory for disk usage. This may take a while.
Press Ctrl-C to interrupt
Done checking log file disk usage. Usage is <1GB.
SUMMARY
========
PARAMETERS
pointpillar_ros/detection_score_threshold: 0.3
pointpillar_ros/engine_file_name: /root/mmdeploy/po…
pointpillar_ros/pub_detection_topic: /detections
pointpillar_ros/sub_point_cloud_topic: /livox/lidar
rosdistro: noetic
rosversion: 1.16.0
NODES
/
pointpillar_ros (pointpillar_ros/pointpillar_ros_tsrrt.py)
ROS_MASTER_URI=http://localhost:11311
process[pointpillar_ros-1]: started with pid [9578]
[INFO] [1733810688.300759, 0.000000]: Pointpillar ROS node initialized with the following settings:
Engine File Name: /root/mmdeploy/pointpillars/optimized_engine_fp16.engine
detection_score_threshold: 0.3
Input Point Cloud Topic: /livox/lidar
Output Detection Topic: /detections
/root/catkin_ws/src/pointpillar-ros-node/pointpillar_ros/src/pointpillar_ros_tsrrt.py:72: DeprecationWarning: Use get_tensor_mode instead.
is_input = self.engine.binding_is_input(binding) # check whether this binding is an input
[INFO] [1733810688.407896, 0.000000]: Input Binding: voxels, Min Shape: (2000, 32, 4), Opt Shape: (5000, 32, 4), Max Shape: (9000, 32, 4)
[INFO] [1733810688.412809, 0.000000]: Input Binding: num_points, Min Shape: (2000,), Opt Shape: (5000,), Max Shape: (9000,)
[INFO] [1733810688.414620, 0.000000]: Input Binding: coors, Min Shape: (2000, 4), Opt Shape: (5000, 4), Max Shape: (9000, 4)
[INFO] [1733810688.415342, 0.000000]: Output Binding: cls_score0, Shape: (1, 18, 248, 216)
[INFO] [1733810688.418381, 0.000000]: Output Binding: bbox_pred0, Shape: (1, 42, 248, 216)
[INFO] [1733810688.423864, 0.000000]: Output Binding: dir_cls_pred0, Shape: (1, 12, 248, 216)
[INFO] [1733810688.426648, 0.000000]: Binding 0: Name: voxels, Shape: (-1, 32, 4)
[INFO] [1733810688.427285, 0.000000]: Binding 1: Name: num_points, Shape: (-1,)
[INFO] [1733810688.427775, 0.000000]: Binding 2: Name: coors, Shape: (-1, 4)
[INFO] [1733810688.428298, 0.000000]: Binding 3: Name: cls_score0, Shape: (1, 18, 248, 216)
[INFO] [1733810688.428794, 0.000000]: Binding 4: Name: bbox_pred0, Shape: (1, 42, 248, 216)
[INFO] [1733810688.429497, 0.000000]: Binding 5: Name: dir_cls_pred0, Shape: (1, 12, 248, 216)
[INFO] [1733810688.430391, 0.000000]: PointPillar TensorRT engine is initialized!
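Regarding the DeprecationWarning above: in TensorRT 8.5+ the binding_* calls are superseded by the name-based tensor API. A small sketch of how the same I/O enumeration could look with get_tensor_mode (illustrative; it compares the mode by string only so the helper can be exercised without tensorrt installed):

```python
def list_io_tensors(engine):
    # Enumerate I/O tensors with the TRT 8.5+ tensor API
    # (num_io_tensors / get_tensor_name / get_tensor_mode / get_tensor_shape)
    # instead of the deprecated binding_is_input / binding_* calls.
    infos = []
    for i in range(engine.num_io_tensors):
        name = engine.get_tensor_name(i)
        # trt.TensorIOMode.INPUT stringifies as "TensorIOMode.INPUT".
        is_input = str(engine.get_tensor_mode(name)).endswith("INPUT")
        infos.append((name, is_input, tuple(engine.get_tensor_shape(name))))
    return infos
```

With a real engine, `engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT` is the direct comparison; the string form here is only to keep the sketch self-contained.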

The following is the inference log from this node after it receives data from the publisher node:

[INFO] [1733810924.488454, 2977.618000]: CUDA context pushed successfully.
[INFO] [1733810924.489487, 2977.619000]: Active CUDA context is valid.
[INFO] [1733810924.505155, 2977.628000]: Point count in current frame: 20000
[INFO] [1733810924.506446, 2977.634000]: First few points: [[-0.19559641 1.0795102 1.407952 0. ]
[-0.22286038 1.1716075 1.3791255 0. ]
[-0.2683279 1.3539146 1.4516338 0. ]
[-0.25758672 1.2537537 1.2325326 0. ]
[-0.17430878 1.0882461 1.4154265 0. ]]
[INFO] [1733810924.508901, 2977.637000]: Clipped Point Cloud Size: 19079
[INFO] [1733810924.534154, 2977.661000]: Non-zero voxels: 3856
[INFO] [1733810924.535470, 2977.662000]: Setting binding shapes: voxels=(3856, 32, 4), num_points=(3856,), coors=(3856, 4)
[INFO] [1733810924.536435, 2977.663000]: Size check passed for voxels: Expected 1341440, Actual 1341440
[INFO] [1733810924.537632, 2977.664000]: Size check passed for num_points: Expected 10480, Actual 10480
[INFO] [1733810924.538666, 2977.665000]: Size check passed for coors: Expected 41920, Actual 41920
[INFO] [1733810924.539395, 2977.666000]: Preparing to transfer voxels data to GPU…
[INFO] [1733810924.540418, 2977.667000]: voxels data transferred successfully.
[INFO] [1733810924.545913, 2977.672000]: Saved voxels data from GPU to /tmp/pointpillar_input/voxels_from_gpu.bin
[INFO] [1733810924.547234, 2977.674000]: Preparing to transfer num_points data to GPU…
[INFO] [1733810924.548853, 2977.675000]: num_points data transferred successfully.
[INFO] [1733810924.551899, 2977.678000]: Saved num_points data from GPU to /tmp/pointpillar_input/num_points_from_gpu.bin
[INFO] [1733810924.552984, 2977.679000]: Preparing to transfer coors data to GPU…
[INFO] [1733810924.554074, 2977.680000]: coors data transferred successfully.
[INFO] [1733810924.557126, 2977.683000]: Saved coors data from GPU to /tmp/pointpillar_input/coors_from_gpu.bin
[INFO] [1733810924.558111, 2977.684000]: Executing inference…
[12/10/2024-06:08:44] [TRT] [E] 1: [genericReformat.cuh::copyVectorizedRunKernel::1579] Error Code 1: Cuda Runtime (invalid resource handle)
[INFO] [1733810924.577849, 2977.703000]: Inference executed successfully.
[INFO] [1733810924.588943, 2977.714000]: All outputs successfully retrieved from GPU.
[INFO] [1733810924.590820, 2977.714000]: Detections: [array([0., 0., 0., …, 0., 0., 0.], dtype=float32), array([0., 0., 0., …, 0., 0., 0.], dtype=float32), array([0., 0., 0., …, 0., 0., 0.], dtype=float32)]
[INFO] [1733810924.592081, 2977.714000]: Publishing detections is not fully implemented.
[INFO] [1733810924.593354, 2977.714000]: CUDA context popped successfully.

As you can see from the log above, the CUDA context is valid and inference appears to run, but all output data is zero.
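One detail worth flagging: the "Inference executed successfully" line is printed even though TensorRT logged an error just before it. execute_async_v2 returns a bool, and asynchronous CUDA errors often only surface at stream synchronization, so a stricter success check might look like this (sketch with illustrative names):

```python
import numpy as np

def outputs_all_zero(host_outputs):
    # The symptom above: every output buffer comes back exactly zero.
    return all(not np.any(o) for o in host_outputs)

def checked_inference(trt_ctx, bindings, stream):
    # execute_async_v2 returns False on enqueue failure; logging "success"
    # unconditionally can mask the TRT error seen in the log above.
    ok = trt_ctx.execute_async_v2(bindings=bindings,
                                  stream_handle=stream.handle)
    if not ok:
        raise RuntimeError("TensorRT enqueue failed; see TRT logger output")
    stream.synchronize()  # async CUDA errors surface here, not at enqueue
```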

The interesting thing is that I extracted my input data after transferring it to the GPU, to make sure the inputs themselves were not the problem. I saved them as .bin files and ran inference through trtexec, which appears to work fine. The relevant command and its output are below, and with the --dumpOutput option I can also see normal inference output.

root@b5914d4fa2e0:~/mmdeploy/pointpillars# trtexec \
    --loadEngine=optimized_engine_fp16.engine \
    --shapes=voxels:3856x32x4,num_points:3856,coors:3856x4 \
    --loadInputs=voxels:/tmp/pointpillar_input/voxels_from_gpu.bin,num_points:/tmp/pointpillar_input/num_points_from_gpu.bin,coors:/tmp/pointpillar_input/coors_from_gpu.bin \
    --verbose
&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --loadEngine=optimized_engine_fp16.engine --shapes=voxels:3856x32x4,num_points:3856,coors:3856x4 --loadInputs=voxels:/tmp/pointpillar_input/voxels_from_gpu.bin,num_points:/tmp/pointpillar_input/num_points_from_gpu.bin,coors:/tmp/pointpillar_input/coors_from_gpu.bin --verbose
[12/10/2024-06:15:48] [I] === Model Options ===
[12/10/2024-06:15:48] [I] Format: *
[12/10/2024-06:15:48] [I] Model:
[12/10/2024-06:15:48] [I] Output:
[12/10/2024-06:15:48] [I] === Build Options ===
[12/10/2024-06:15:48] [I] Max batch: explicit batch
[12/10/2024-06:15:48] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[12/10/2024-06:15:48] [I] minTiming: 1
[12/10/2024-06:15:48] [I] avgTiming: 8
[12/10/2024-06:15:48] [I] Precision: FP32
[12/10/2024-06:15:48] [I] LayerPrecisions:
[12/10/2024-06:15:48] [I] Layer Device Types:
[12/10/2024-06:15:48] [I] Calibration:
[12/10/2024-06:15:48] [I] Refit: Disabled
[12/10/2024-06:15:48] [I] Version Compatible: Disabled
[12/10/2024-06:15:48] [I] TensorRT runtime: full
[12/10/2024-06:15:48] [I] Lean DLL Path:
[12/10/2024-06:15:48] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[12/10/2024-06:15:48] [I] Exclude Lean Runtime: Disabled
[12/10/2024-06:15:48] [I] Sparsity: Disabled
[12/10/2024-06:15:48] [I] Safe mode: Disabled
[12/10/2024-06:15:48] [I] Build DLA standalone loadable: Disabled
[12/10/2024-06:15:48] [I] Allow GPU fallback for DLA: Disabled
[12/10/2024-06:15:48] [I] DirectIO mode: Disabled
[12/10/2024-06:15:48] [I] Restricted mode: Disabled
[12/10/2024-06:15:48] [I] Skip inference: Disabled
[12/10/2024-06:15:48] [I] Save engine:
[12/10/2024-06:15:48] [I] Load engine: optimized_engine_fp16.engine
[12/10/2024-06:15:48] [I] Profiling verbosity: 0
[12/10/2024-06:15:48] [I] Tactic sources: Using default tactic sources
[12/10/2024-06:15:48] [I] timingCacheMode: local
[12/10/2024-06:15:48] [I] timingCacheFile:
[12/10/2024-06:15:48] [I] Heuristic: Disabled
[12/10/2024-06:15:48] [I] Preview Features: Use default preview flags.
[12/10/2024-06:15:48] [I] MaxAuxStreams: -1
[12/10/2024-06:15:48] [I] BuilderOptimizationLevel: -1
[12/10/2024-06:15:48] [I] Input(s)s format: fp32:CHW
[12/10/2024-06:15:48] [I] Output(s)s format: fp32:CHW
[12/10/2024-06:15:48] [I] Input build shape: voxels=3856x32x4+3856x32x4+3856x32x4
[12/10/2024-06:15:48] [I] Input build shape: num_points=3856+3856+3856
[12/10/2024-06:15:48] [I] Input build shape: coors=3856x4+3856x4+3856x4
[12/10/2024-06:15:48] [I] Input calibration shapes: model
[12/10/2024-06:15:48] [I] === System Options ===
[12/10/2024-06:15:48] [I] Device: 0
[12/10/2024-06:15:48] [I] DLACore:
[12/10/2024-06:15:48] [I] Plugins:
[12/10/2024-06:15:48] [I] setPluginsToSerialize:
[12/10/2024-06:15:48] [I] dynamicPlugins:
[12/10/2024-06:15:48] [I] ignoreParsedPluginLibs: 0
[12/10/2024-06:15:48] [I]
[12/10/2024-06:15:48] [I] === Inference Options ===
[12/10/2024-06:15:48] [I] Batch: Explicit
[12/10/2024-06:15:48] [I] Input inference shape: coors=3856x4
[12/10/2024-06:15:48] [I] Input inference shape: num_points=3856
[12/10/2024-06:15:48] [I] Input inference shape: voxels=3856x32x4
[12/10/2024-06:15:48] [I] Iterations: 10
[12/10/2024-06:15:48] [I] Duration: 3s (+ 200ms warm up)
[12/10/2024-06:15:48] [I] Sleep time: 0ms
[12/10/2024-06:15:48] [I] Idle time: 0ms
[12/10/2024-06:15:48] [I] Inference Streams: 1
[12/10/2024-06:15:48] [I] ExposeDMA: Disabled
[12/10/2024-06:15:48] [I] Data transfers: Enabled
[12/10/2024-06:15:48] [I] Spin-wait: Disabled
[12/10/2024-06:15:48] [I] Multithreading: Disabled
[12/10/2024-06:15:48] [I] CUDA Graph: Disabled
[12/10/2024-06:15:48] [I] Separate profiling: Disabled
[12/10/2024-06:15:48] [I] Time Deserialize: Disabled
[12/10/2024-06:15:48] [I] Time Refit: Disabled
[12/10/2024-06:15:48] [I] NVTX verbosity: 0
[12/10/2024-06:15:48] [I] Persistent Cache Ratio: 0
[12/10/2024-06:15:48] [I] Inputs:
[12/10/2024-06:15:48] [I] coors<-/tmp/pointpillar_input/coors_from_gpu.bin
[12/10/2024-06:15:48] [I] num_points<-/tmp/pointpillar_input/num_points_from_gpu.bin
[12/10/2024-06:15:48] [I] voxels<-/tmp/pointpillar_input/voxels_from_gpu.bin
[12/10/2024-06:15:48] [I] === Reporting Options ===
[12/10/2024-06:15:48] [I] Verbose: Enabled
[12/10/2024-06:15:48] [I] Averages: 10 inferences
[12/10/2024-06:15:48] [I] Percentiles: 90,95,99
[12/10/2024-06:15:48] [I] Dump refittable layers:Disabled
[12/10/2024-06:15:48] [I] Dump output: Disabled
[12/10/2024-06:15:48] [I] Profile: Disabled
[12/10/2024-06:15:48] [I] Export timing to JSON file:
[12/10/2024-06:15:48] [I] Export output to JSON file:
[12/10/2024-06:15:48] [I] Export profile to JSON file:
[12/10/2024-06:15:48] [I]
[12/10/2024-06:15:48] [I] === Device Information ===
[12/10/2024-06:15:48] [I] Selected Device: NVIDIA GeForce RTX 4060 Laptop GPU
[12/10/2024-06:15:48] [I] Compute Capability: 8.9
[12/10/2024-06:15:48] [I] SMs: 24
[12/10/2024-06:15:48] [I] Device Global Memory: 8187 MiB
[12/10/2024-06:15:48] [I] Shared Memory per SM: 100 KiB
[12/10/2024-06:15:48] [I] Memory Bus Width: 128 bits (ECC disabled)
[12/10/2024-06:15:48] [I] Application Compute Clock Rate: 1.89 GHz
[12/10/2024-06:15:48] [I] Application Memory Clock Rate: 8.001 GHz
[12/10/2024-06:15:48] [I]
[12/10/2024-06:15:48] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[12/10/2024-06:15:48] [I]
[12/10/2024-06:15:48] [I] TensorRT version: 8.6.1
[12/10/2024-06:15:48] [I] Loading standard plugins
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::BatchedNMS_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::BatchTilePlugin_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Clip_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::CoordConvAC version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::CropAndResizeDynamic version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::CropAndResize version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::DecodeBbox3DPlugin version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::DetectionLayer_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::EfficientNMS_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::FlattenConcat_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::GenerateDetection_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::GridAnchor_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::GridAnchorRect_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::InstanceNormalization_TRT version 2
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::LReLU_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ModulatedDeformConv2d version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::MultilevelProposeROI_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::MultiscaleDeformableAttnPlugin_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::NMSDynamic_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::NMS_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Normalize_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::PillarScatterPlugin version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::PriorBox_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ProposalDynamic version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ProposalLayer_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Proposal version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::PyramidROIAlign_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Region_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Reorg_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ResizeNearest_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ROIAlign_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::RPROI_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::ScatterND version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::SpecialSlice_TRT version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::Split version 1
[12/10/2024-06:15:48] [V] [TRT] Registered plugin creator - ::VoxelGeneratorPlugin version 1
[12/10/2024-06:15:48] [I] Engine loaded in 0.0373064 sec.
[12/10/2024-06:15:49] [I] [TRT] Loaded engine size: 35 MiB
[12/10/2024-06:15:49] [V] [TRT] Deserialization required 13445 microseconds.
[12/10/2024-06:15:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +35, now: CPU 0, GPU 35 (MiB)
[12/10/2024-06:15:49] [I] Engine deserialized in 0.115949 sec.
[12/10/2024-06:15:49] [I] [TRT] [MS] Running engine with multi stream info
[12/10/2024-06:15:49] [I] [TRT] [MS] Number of aux streams is 1
[12/10/2024-06:15:49] [I] [TRT] [MS] Number of total worker streams is 2
[12/10/2024-06:15:49] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[12/10/2024-06:15:49] [V] [TRT] Total per-runner device persistent memory is 436224
[12/10/2024-06:15:49] [V] [TRT] Total per-runner host persistent memory is 112256
[12/10/2024-06:15:49] [V] [TRT] Allocated activation device memory of size 119940608
[12/10/2024-06:15:49] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +115, now: CPU 0, GPU 150 (MiB)
[12/10/2024-06:15:49] [V] [TRT] CUDA lazy loading is enabled.
[12/10/2024-06:15:49] [I] Setting persistentCacheLimit to 0 bytes.
[12/10/2024-06:15:49] [V] Using enqueueV3.
[12/10/2024-06:15:49] [I] Using values loaded from /tmp/pointpillar_input/voxels_from_gpu.bin for input voxels
[12/10/2024-06:15:49] [I] Input binding for voxels with dimensions 3856x32x4 is created.
[12/10/2024-06:15:49] [I] Using values loaded from /tmp/pointpillar_input/num_points_from_gpu.bin for input num_points
[12/10/2024-06:15:49] [I] Input binding for num_points with dimensions 3856 is created.
[12/10/2024-06:15:49] [I] Using values loaded from /tmp/pointpillar_input/coors_from_gpu.bin for input coors
[12/10/2024-06:15:49] [I] Input binding for coors with dimensions 3856x4 is created.
[12/10/2024-06:15:49] [I] Output binding for cls_score0 with dimensions 1x18x248x216 is created.
[12/10/2024-06:15:49] [I] Output binding for bbox_pred0 with dimensions 1x42x248x216 is created.
[12/10/2024-06:15:49] [I] Output binding for dir_cls_pred0 with dimensions 1x12x248x216 is created.
[12/10/2024-06:15:49] [I] Starting inference
[12/10/2024-06:15:52] [I] Warmup completed 30 queries over 200 ms
[12/10/2024-06:15:52] [I] Timing trace has 628 queries over 3.0095 s
[12/10/2024-06:15:52] [I]
[12/10/2024-06:15:52] [I] === Trace details ===
[12/10/2024-06:15:52] [I] Trace averages of 10 runs:
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 4.12191 ms - Host latency: 5.47151 ms (enqueue 0.159129 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.77248 ms - Host latency: 5.11964 ms (enqueue 0.181818 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35788 ms - Host latency: 4.70271 ms (enqueue 0.185059 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37201 ms - Host latency: 4.71761 ms (enqueue 0.158688 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37152 ms - Host latency: 4.71684 ms (enqueue 0.182654 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37487 ms - Host latency: 4.72102 ms (enqueue 0.161847 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37375 ms - Host latency: 4.71888 ms (enqueue 0.158905 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36915 ms - Host latency: 4.71435 ms (enqueue 0.177661 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37141 ms - Host latency: 4.71702 ms (enqueue 0.201007 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.366 ms - Host latency: 4.71411 ms (enqueue 0.165039 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.4324 ms - Host latency: 4.77982 ms (enqueue 0.19967 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35989 ms - Host latency: 4.70529 ms (enqueue 0.136676 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36709 ms - Host latency: 4.71309 ms (enqueue 0.142969 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.3632 ms - Host latency: 4.70928 ms (enqueue 0.175732 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37323 ms - Host latency: 4.71774 ms (enqueue 0.178882 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36292 ms - Host latency: 4.70944 ms (enqueue 0.161359 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36313 ms - Host latency: 4.71176 ms (enqueue 0.145978 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37172 ms - Host latency: 4.72012 ms (enqueue 0.150372 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36921 ms - Host latency: 4.71473 ms (enqueue 0.134192 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.3712 ms - Host latency: 4.7178 ms (enqueue 0.192749 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36425 ms - Host latency: 4.71234 ms (enqueue 0.161755 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36514 ms - Host latency: 4.71072 ms (enqueue 0.156323 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37191 ms - Host latency: 4.71639 ms (enqueue 0.179016 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36534 ms - Host latency: 4.71136 ms (enqueue 0.159509 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37058 ms - Host latency: 4.71952 ms (enqueue 0.15802 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36624 ms - Host latency: 4.71301 ms (enqueue 0.169043 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36863 ms - Host latency: 4.71532 ms (enqueue 0.158386 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36655 ms - Host latency: 4.71377 ms (enqueue 0.173035 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36372 ms - Host latency: 4.70869 ms (enqueue 0.244519 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36874 ms - Host latency: 4.71503 ms (enqueue 0.231055 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36719 ms - Host latency: 4.71497 ms (enqueue 0.166785 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36875 ms - Host latency: 4.71411 ms (enqueue 0.249377 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37035 ms - Host latency: 4.715 ms (enqueue 0.188538 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.51608 ms - Host latency: 4.86587 ms (enqueue 0.192163 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35508 ms - Host latency: 4.70072 ms (enqueue 0.164709 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35593 ms - Host latency: 4.70707 ms (enqueue 0.185376 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35631 ms - Host latency: 4.70314 ms (enqueue 0.156543 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35504 ms - Host latency: 4.70233 ms (enqueue 0.158655 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35815 ms - Host latency: 4.70228 ms (enqueue 0.22959 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.3658 ms - Host latency: 4.71321 ms (enqueue 0.205151 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.3616 ms - Host latency: 4.70959 ms (enqueue 0.194409 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37131 ms - Host latency: 4.72273 ms (enqueue 0.151489 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36489 ms - Host latency: 4.70918 ms (enqueue 0.201807 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37083 ms - Host latency: 4.71838 ms (enqueue 0.189331 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36443 ms - Host latency: 4.70889 ms (enqueue 0.149023 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.3657 ms - Host latency: 4.70981 ms (enqueue 0.13103 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.377 ms - Host latency: 4.72109 ms (enqueue 0.166602 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36782 ms - Host latency: 4.71375 ms (enqueue 0.148657 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37646 ms - Host latency: 4.72266 ms (enqueue 0.180396 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.55532 ms - Host latency: 4.90393 ms (enqueue 0.182227 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.38179 ms - Host latency: 4.8187 ms (enqueue 0.193896 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35364 ms - Host latency: 4.69976 ms (enqueue 0.197412 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35605 ms - Host latency: 4.703 ms (enqueue 0.194531 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.35488 ms - Host latency: 4.70063 ms (enqueue 0.2073 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36482 ms - Host latency: 4.70996 ms (enqueue 0.186572 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36196 ms - Host latency: 4.70725 ms (enqueue 0.192236 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37131 ms - Host latency: 4.71636 ms (enqueue 0.214722 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36292 ms - Host latency: 4.7073 ms (enqueue 0.203149 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.37307 ms - Host latency: 4.7189 ms (enqueue 0.17229 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.47739 ms - Host latency: 4.82246 ms (enqueue 0.17124 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.47888 ms - Host latency: 4.82722 ms (enqueue 0.226318 ms)
[12/10/2024-06:15:52] [I] Average on 10 runs - GPU latency: 3.36506 ms - Host latency: 4.70986 ms (enqueue 0.185767 ms)
[12/10/2024-06:15:52] [I]
[12/10/2024-06:15:52] [I] === Performance summary ===
[12/10/2024-06:15:52] [I] Throughput: 208.672 qps
[12/10/2024-06:15:52] [I] Latency: min = 4.67957 ms, max = 6.32764 ms, mean = 4.74274 ms, median = 4.70273 ms, percentile(90%) = 4.76379 ms, percentile(95%) = 4.79004 ms, percentile(99%) = 5.47668 ms
[12/10/2024-06:15:52] [I] Enqueue Time: min = 0.0947266 ms, max = 0.555786 ms, mean = 0.178175 ms, median = 0.159363 ms, percentile(90%) = 0.281738 ms, percentile(95%) = 0.308105 ms, percentile(99%) = 0.403076 ms
[12/10/2024-06:15:52] [I] H2D Latency: min = 0.166016 ms, max = 0.194336 ms, mean = 0.1684 ms, median = 0.16748 ms, percentile(90%) = 0.169922 ms, percentile(95%) = 0.173645 ms, percentile(99%) = 0.180542 ms
[12/10/2024-06:15:52] [I] GPU Compute Time: min = 3.33411 ms, max = 4.97046 ms, mean = 3.3949 ms, median = 3.3576 ms, percentile(90%) = 3.41809 ms, percentile(95%) = 3.44287 ms, percentile(99%) = 4.12672 ms
[12/10/2024-06:15:52] [I] D2H Latency: min = 1.17676 ms, max = 2.09326 ms, mean = 1.17944 ms, median = 1.17712 ms, percentile(90%) = 1.17822 ms, percentile(95%) = 1.1875 ms, percentile(99%) = 1.19409 ms
[12/10/2024-06:15:52] [I] Total Host Walltime: 3.0095 s
[12/10/2024-06:15:52] [I] Total GPU Compute Time: 2.132 s
[12/10/2024-06:15:52] [W] * GPU compute time is unstable, with coefficient of variance = 4.73301%.
[12/10/2024-06:15:52] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[12/10/2024-06:15:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[12/10/2024-06:15:52] [V]
[12/10/2024-06:15:52] [V] === Explanations of the performance metrics ===
[12/10/2024-06:15:52] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[12/10/2024-06:15:52] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[12/10/2024-06:15:52] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[12/10/2024-06:15:52] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[12/10/2024-06:15:52] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[12/10/2024-06:15:52] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[12/10/2024-06:15:52] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[12/10/2024-06:15:52] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[12/10/2024-06:15:52] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --loadEngine=optimized_engine_fp16.engine --shapes=voxels:3856x32x4,num_points:3856,coors:3856x4 --loadInputs=voxels:/tmp/pointpillar_input/voxels_from_gpu.bin,num_points:/tmp/pointpillar_input/num_points_from_gpu.bin,coors:/tmp/pointpillar_input/coors_from_gpu.bin --verbose
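For completeness, the input buffers fed to trtexec above were dumped back from the GPU roughly like this (an illustrative sketch, not the exact node code; the copy function is injected so the helper can be shown without a GPU — in the node it would be pycuda.driver.memcpy_dtoh):

```python
import numpy as np

def dump_device_buffer(copy_dtoh, dev_ptr, shape, dtype, path):
    # Copy a device buffer back to host and write raw row-major bytes,
    # which is the format trtexec --loadInputs expects.
    host = np.empty(shape, dtype=dtype)
    copy_dtoh(host, dev_ptr)  # e.g. pycuda.driver.memcpy_dtoh(dest, src)
    host.tofile(path)
    return host.nbytes
```

For example, dumping the voxels tensor would be `dump_device_buffer(cuda.memcpy_dtoh, d_voxels, (3856, 32, 4), np.float32, "/tmp/pointpillar_input/voxels_from_gpu.bin")`.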