
[Performance] Max operator became 4.5X slower after Fixing NaN propagation for float16 min and max operators. #23337

Open
SuhwanSong opened this issue Jan 13, 2025 · 1 comment
Labels
ep:Xnnpack (issues related to XNNPACK EP), performance (issues related to performance regressions)

Comments

@SuhwanSong

Describe the issue

ONNX Runtime 1.20.1 introduces a ~50% end-to-end slowdown on this model compared to version 1.17.0.
The regression originates from commit ce13f65, which fixed NaN propagation for the float16 Min and Max operators.

Bisected Commit Range:
cfa45df6b5060af6327a98a625eb9fe74580f56c..ce13f651d86952335a126f04e741d68bc41323fa
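
For context, the commit makes float16 Min/Max propagate NaN as the ONNX spec requires, which rules out a cheaper element-wise comparison. A minimal NumPy sketch of the two semantics (using np.fmax as a stand-in for a NaN-ignoring fast path is an illustrative assumption, not ONNX Runtime's actual old kernel):

import numpy as np

a = np.array([1.0, np.nan], dtype=np.float16)
b = np.array([2.0, 0.0], dtype=np.float16)

# NaN-propagating max (ONNX spec semantics, post-fix): NaN wins.
print(np.maximum(a, b))  # -> [2. nan]

# NaN-ignoring max (illustrative stand-in for the old behavior): NaN is dropped.
print(np.fmax(a, b))     # -> [2. 0.]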

Model

(screenshot of the model graph; the model itself is attached below as poc.onnx.zip)

Environment

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.31.4
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep  5 2023, 06:03:44) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.8.0-nyx+-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

To reproduce

  1. Download and unzip the attached poc.onnx.zip to get "poc.onnx".
  2. Run the following script.

poc.onnx.zip

import time
import onnxruntime
import numpy as np

# Set the random seed
np.random.seed(0)

onnx_model_path = 'poc.onnx'

# Load the ONNX model with the CPUExecutionProvider
ort_session = onnxruntime.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
ort_session.get_modelmeta()
inputs = ort_session.get_inputs()

nth = 100000

# Warm-up inference to cache optimizations
warmup_input = np.random.randn(1,1,1,58,34).astype('float16')
warmup_dict = {inputs[0].name: warmup_input}
ort_session.run(None, warmup_dict)

# Measure inference time excluding input creation
total_time_ns = 0
for _ in range(nth):
    np_input = np.random.randn(1,1,1,58,34).astype('float16')
    input_ = {inputs[0].name: np_input}

    start_ns = time.perf_counter_ns()
    ort_session.run(None, input_)
    end_ns = time.perf_counter_ns()

    total_time_ns += end_ns - start_ns

avg_time_ns = total_time_ns / nth
avg_time_ms = avg_time_ns / 1e6

print(f'[{onnxruntime.__version__}] Average inference time: {avg_time_ms:.5f} ms')
[1.17.0] Average inference time: 0.01710 ms
[1.20.1] Average inference time: 0.02532 ms

Urgency

No response

Platform

Linux

OS Version

6.8.0

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.20.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

@SuhwanSong SuhwanSong added the performance issues related to performance regressions label Jan 13, 2025
@github-actions github-actions bot added the ep:Xnnpack issues related to XNNPACK EP label Jan 13, 2025
@skottmckay skottmckay removed the ep:Xnnpack issues related to XNNPACK EP label Jan 13, 2025
@SuhwanSong
Author

Bug Analysis Report

Summary

A performance regression was identified by comparing two commits: cfa45df (baseline) and ce13f65. Profiling revealed significant increases in execution time for key operations, indicating inefficiencies introduced by the newer commit.

Key Findings

  1. Overall Regression

    • Total execution time increased from 5,230,050,197 ns to 11,253,223,058 ns, i.e. the newer build takes 215.16% of the baseline (a ~2.15× slowdown).
  2. Significant Time Increases

    • /Max_kernel_time: grew to 475.43% of the baseline (from 1,444,747 ns to 6,868,824 ns), a ~4.75× slowdown that points at the Max kernel itself.
    • InsertedPrecisionFreeCast_/Max_output_0_kernel_time: grew to 113.49% of the baseline (from 247,874 ns to 281,315 ns), a ~13.5% increase in this specific operation.
    • InsertedPrecisionFreeCast_v2_0_kernel_time: grew to 133.52% of the baseline (from 1,473,947 ns to 1,967,967 ns), a ~33.5% increase, further indicating kernel-level performance issues.
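
The per-node timings above were collected with ONNX Runtime's built-in profiler; a minimal sketch of how to record such a trace for each build under test (the resulting JSON file name is whatever end_profiling() returns):

import numpy as np
import onnxruntime

sess_options = onnxruntime.SessionOptions()
sess_options.enable_profiling = True  # emit a JSON trace with per-node kernel times

ort_session = onnxruntime.InferenceSession('model.onnx', sess_options=sess_options,
                                           providers=['CPUExecutionProvider'])

input_data = np.load("input.npy", allow_pickle=True).item()
ort_session.run(None, input_data)

profile_file = ort_session.end_profiling()  # e.g. onnxruntime_profile_<timestamp>.json
print(profile_file)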

How to Reproduce

  1. Download and unzip model.zip.
  2. Run the following code.
import time
import onnxruntime
import numpy as np

# Set the random seed
np.random.seed(0)

onnx_model_path = 'model.onnx'

# Load the ONNX model with the CPUExecutionProvider
ort_session = onnxruntime.InferenceSession(onnx_model_path, providers=['CPUExecutionProvider'])
ort_session.get_modelmeta()
inputs = ort_session.get_inputs()

nth = 100000

# Warm-up inference to cache optimizations
input_data = np.load("input.npy", allow_pickle=True).item()
ort_session.run(None, input_data)

# Measure inference time excluding input creation
total_time_ns = 0
for _ in range(nth):
    start_ns = time.perf_counter_ns()
    ort_session.run(None, input_data)
    end_ns = time.perf_counter_ns()

    total_time_ns += end_ns - start_ns

avg_time_ns = total_time_ns / nth
avg_time_ms = avg_time_ns / 1e6

print(f'[{onnxruntime.__version__}] Average inference time: {avg_time_ms:.5f} ms')

Profile Log

Found a performance regression between cfa45df6b5060af6327a98a625eb9fe74580f56c (baseline) and ce13f651d86952335a126f04e741d68bc41323fa. Each line below reads: name: new_ns / old_ns : new as a percentage of old.

total_time: 11253223058 / 5230050197 : 215.1647237430903%
model_loading_uri: 401 / 391 : 102.55754475703324%
session_initialization: 3464 / 3514 : 98.57712009106432%
/Max_fence_before: 5 / 18 : 27.77777777777778%
/Max_kernel_time: 6868824 / 1444747 : 475.4343840132563%
/Max_fence_after: 102 / 13 : 784.6153846153845%
InsertedPrecisionFreeCast_/Max_output_0_fence_before: 100 / 39 : 256.4102564102564%
InsertedPrecisionFreeCast_/Max_output_0_kernel_time: 281315 / 247874 : 113.49112855725085%
InsertedPrecisionFreeCast_/Max_output_0_fence_after: 465 / 26 : 1788.4615384615383%
/Abs_fence_before: 170 / 0 : 0%
/Abs_kernel_time: 307071 / 311399 : 98.61014325672208%
/Abs_fence_after: 22 / 27 : 81.48148148148148%
InsertedPrecisionFreeCast_v2_0_fence_before: 37 / 5 : 740.0%
InsertedPrecisionFreeCast_v2_0_kernel_time: 1967967 / 1473947 : 133.51680894903276%
InsertedPrecisionFreeCast_v2_0_fence_after: 0 / 18 : 0.0%
/Greater_fence_before: 16 / 0 : 0%
/Greater_kernel_time: 372283 / 373171 : 99.76203938676906%
/Greater_fence_after: 0 / 18 : 0.0%
SequentialExecutor::Execute: 10367625 / 4420489 : 234.53570408160726%
model_run: 10437331 / 4487777 : 232.5724072296819%
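
The ratio lines above can be regenerated from two such profiler traces; a sketch, assuming ONNX Runtime's Chrome-trace JSON layout (a list of events with 'name' and 'dur' fields) and placeholder file names:

import json
from collections import defaultdict

def total_durations(path):
    # Sum the duration of every profiler event, grouped by event name.
    totals = defaultdict(int)
    with open(path) as f:
        for event in json.load(f):
            totals[event['name']] += event.get('dur', 0)
    return totals

old = total_durations('profile_cfa45df.json')  # placeholder file names
new = total_durations('profile_ce13f65.json')

for name, dur in sorted(new.items(), key=lambda kv: -kv[1]):
    if old.get(name):
        print(f'{name}: {dur} / {old[name]} : {100 * dur / old[name]}%')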

@SuhwanSong SuhwanSong changed the title [Performance] Fixing NaN propagation for float16 min and max operators introduces 50% slowdown. [Performance] Max operator became 4.5X slower after Fixing NaN propagation for float16 min and max operators. Jan 14, 2025
@github-actions github-actions bot added the ep:Xnnpack issues related to XNNPACK EP label Jan 14, 2025