TensorRT provides capabilities to take models trained in single (FP32) and half (FP16) precision and convert them for deployment with INT8 quantizations at reduced precision with minimal accuracy loss. INT8 models compute faster and place lower requirements on bandwidth but present a challenge in representing weights and activations of neural networks because of the reduced dynamic range available.
Minimum Positive Value
-3.4×1038 ~ +3.4×1038
1.4 × 10−45
65504 ~ +65504
5.96 x 10-8
-128 ~ +127
To address this, TensorRT uses a calibration process that minimizes the information loss when approximating the FP32 network with a limited 8-bit integer representation. With the new integration, after optimizing the TensorFlow graph with TensorRT, you can pass the graph to TensorRT for calibration as below.
The rest of the inference workflow remains unchanged from above. The output of this step is a frozen graph that is executed by TensorFlow as described earlier.
Automatically use Tensor Cores on NVIDIA Volta GPUs
TensorRT runs half precision TensorFlow models on Tensor Cores in VOLTA GPUs for inference. Tensor Cores, provide 8x more throughput than single precision math pipelines. Half precision (also known as FP16) data compared to higher precision FP32 vs FP64 reduces memory usage of the neural network. This allows training and deployment of larger networks, and FP16 data transfers take less time than FP32 or FP64 transfers.
Each Tensor Core performs D = A x B + C, where A, B, C and D are matrices. A and B are half-precision 4×4 matrices, whereas D and C can be either half or single precision 4×4 matrices. The peak performance of Tensor Cores on the V100 is about an order of magnitude (10x) faster than double precision (FP64) and about 4 times faster than single precision (FP32).
We are excited about this release and will continue to work closely with NVIDIA to enhance this integration. We expect the new solutions ensure the highest performance possible while maintaining the ease and flexibility of TensorFlow. And as TensorRT supports more networks, you will automatically benefit from the updates without any changes to your code.
To get the new solution, you can use the standard pip install process once TensorFlow 1.7 is released: