Optimized INT8 Inference

provides capabilities to take models trained in single (FP32) and half (FP16) precision and convert them for deployment with INT8 quantizations at reduced precision with minimal accuracy loss. INT8 models compute and place lower requirements on bandwidth but present a challenge in representing weights and activations of neural networks because of the reduced dynamic range available.

Dynamic Range Minimum Positive Value
FP32 -3.4×1038 ~ +3.4×1038 1.4 × 10−45
FP16 65504 ~ +65504 5.96 x 10-8
INT8 -128 ~ +127 1

To address this, TensorRT uses a calibration process that minimizes the information loss when approximating the FP32 network with a limited 8-bit integer representation. With the new , after optimizing the graph with TensorRT, you can pass the graph to TensorRT for calibration as below.


The of the inference workflow remains unchanged from above. The output of this step is a frozen graph that is executed by TensorFlow as described earlier.

Automatically use Tensor Cores on NVIDIA Volta GPUs

TensorRT runs half precision TensorFlow models on Tensor Cores in VOLTA GPUs for inference. Tensor Cores, provide 8x more throughput than single precision math pipelines. Half precision (also known as FP16) data compared to higher precision FP32 vs FP64 reduces memory usage of the neural network. This allows training and deployment of larger networks, and FP16 data transfers take less than FP32 or FP64 transfers.

Each Tensor Core performs D = A x B + C, where A, B, C and D are matrices. A and B are half-precision 4×4 matrices, whereas D and C can be either half or single precision 4×4 matrices. The peak performance of Tensor Cores on the V100 is about an order of magnitude (10x) faster than double precision (FP64) and about 4 times faster than single precision (FP32).


We are excited about this release and will continue to work closely with NVIDIA to enhance this integration. We expect the new solutions ensure the highest performance possible while maintaining the ease and flexibility of TensorFlow. And as TensorRT supports more networks, you will automatically benefit from the updates without any changes to your code.

To get the new solution, you can use the standard pip install process once TensorFlow 1.7 is released:

pip install tensorflow-gpu r1.7

Till then, find detailed installation instructions here:

Try it out and let us know what you think!

Source link


Please enter your comment!
Please enter your name here