Tensor primitives divide int32 #111505

alexcovington · 2025-01-16T18:26:50Z

Add a vector path for int in TensorPrimitives.Divide.

Improves performance in microbenchmarks:

| Method        | Job        | Toolchain        | BufferLength | Mean       | Error    | StdDev   | Median     | Min        | Max        | Ratio | Allocated | Alloc Ratio |
|-------------- |----------- |----------------- |------------- |-----------:|---------:|---------:|-----------:|-----------:|-----------:|------:|----------:|------------:|
| Divide_Vector | Job-SUKFCS | Base             | 128          |   174.4 ns |  0.83 ns |  0.69 ns |   174.2 ns |   173.5 ns |   176.0 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 128          |   111.5 ns |  0.56 ns |  0.50 ns |   111.5 ns |   110.7 ns |   112.3 ns |  0.64 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 128          |   138.9 ns |  0.88 ns |  0.78 ns |   138.5 ns |   137.8 ns |   140.5 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 128          |   104.8 ns |  0.50 ns |  0.42 ns |   104.7 ns |   104.3 ns |   105.8 ns |  0.75 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Vector | Job-SUKFCS | Base             | 3079         | 4,038.8 ns | 13.47 ns | 11.94 ns | 4,034.4 ns | 4,025.5 ns | 4,062.2 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 3079         | 1,358.5 ns |  5.78 ns |  4.83 ns | 1,356.7 ns | 1,351.8 ns | 1,367.6 ns |  0.34 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 3079         | 3,315.1 ns | 12.55 ns | 11.12 ns | 3,313.0 ns | 3,301.0 ns | 3,337.9 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 3079         | 1,294.8 ns |  7.84 ns |  6.95 ns | 1,293.1 ns | 1,286.2 ns | 1,309.0 ns |  0.39 |         - |          NA |

huoyaoyuan · 2025-01-16T19:30:22Z

...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs

+                }
+
+                Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
+                Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);


None of this should be written with x86 intrinsics. Comparison, conversion, and division all have cross-platform intrinsics available.

Comparison and divide have appropriate cross-platform intrinsics, but the conversion uses x86 intrinsics that convert Int32 -> Double and Double -> Int32. I did not see a cross-platform convert that supports that case, but maybe I missed it? If there is a cross-platform convert that supports the conversion, I'm happy to switch to that.

It would be ConvertToDouble(WidenLower(vectorOfInt32)), which we don't currently optimize but would like to pattern match so that it implicitly emits the more optimal Sse2.ConvertToVector128Double API.

Correspondingly, we'd want something like Create(ConvertToDouble(WidenLower(vectorOfInt32)), ConvertToDouble(WidenUpper(vectorOfInt32))) or one of the other patterns to work for Avx.ConvertToVector256Double.

I think it's fine to write the xplat code here first and then log an issue for optimizing it, which should also help prioritize getting that optimization implemented and ensure that it lights up on more hardware by default.
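The Create(ConvertToDouble(WidenLower(...)), ConvertToDouble(WidenUpper(...))) pattern described above could be written roughly as follows. This is only a sketch: the `WidenSketch` class and `ToDouble` method names are hypothetical and not part of the PR, and whether the JIT lowers this to a single conversion instruction is exactly the optimization the comment says does not exist yet.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, not taken from the PR.
static class WidenSketch
{
    // Converts four Int32 lanes into four Double lanes using only
    // cross-platform APIs: Int32 -> Int64 (widen) -> Double (convert).
    public static Vector256<double> ToDouble(Vector128<int> v)
    {
        Vector128<double> lower = Vector128.ConvertToDouble(Vector128.WidenLower(v));
        Vector128<double> upper = Vector128.ConvertToDouble(Vector128.WidenUpper(v));

        // Lower two lanes first, then the upper two lanes.
        return Vector256.Create(lower, upper);
    }
}
```

Every Int32 value is exactly representable as a Double, so the conversion itself loses no precision.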

stephentoub · 2025-01-16T20:28:00Z

...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs

@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
        internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
        {
            public static bool Vectorizable => typeof(T) == typeof(float)
-                                            || typeof(T) == typeof(double);
+                                            || typeof(T) == typeof(double)
+#if NET10_0_OR_GREATER


Why is this limited to .NET 10? Are the APIs being used newly introduced only in .NET 10?

I was running into issues with FloatRoundingMode not being available in .NET 8.0.

That was added in 9, not 10.

tannergooding · 2025-01-17T00:33:00Z

...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs

+
+                Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
+                Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);
+                if (denominator_zero != Vector128<int>.Zero)


This can be the following on .NET 10:

if (Vector128.Any(y.AsInt32(), 0)) { ThrowHelper.DivideByZeroException(); }

or for .NET 8/9:

if (Vector128.EqualsAny(y.AsInt32(), Vector128<int>.Zero)) { ThrowHelper.DivideByZeroException(); }
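As a standalone illustration of the .NET 8/9 form suggested above, a minimal sketch might look like this. `ZeroCheckSketch` and `ThrowIfAnyZero` are hypothetical names, and the sketch throws `DivideByZeroException` directly because `ThrowHelper` is internal to the library.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, not taken from the PR.
static class ZeroCheckSketch
{
    // Throws if any lane of the divisor is zero, mirroring the behavior
    // of scalar integer division.
    public static void ThrowIfAnyZero(Vector128<int> divisor)
    {
        if (Vector128.EqualsAny(divisor, Vector128<int>.Zero))
        {
            throw new DivideByZeroException();
        }
    }
}
```

`Vector128.EqualsAny` compares lane by lane and returns true as soon as one pair matches, so no x86-specific compare is needed.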

tannergooding · 2025-01-17T00:35:33Z

...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs

+                Vector256<double> num_pd = Avx.ConvertToVector256Double(x.AsInt32());
+                Vector256<double> den_pd = Avx.ConvertToVector256Double(y.AsInt32());
+                Vector256<double> div_pd = Avx.Divide(num_pd, den_pd);
+                Vector128<int> div_epi32 = Avx.ConvertToVector128Int32WithTruncation(div_pd);
+                return div_epi32.As<int, T>();


I think it'd be better to use the xplat algorithm around ConvertToDouble(WidenLower(x.AsInt32())) / ConvertToDouble(WidenLower(y.AsInt32())) so it can light up for .NET 8/9.

We should then ideally accelerate Vector128.operator / on .NET 10 and just use that there, so it can be handled directly and light up for anyone else using operator / as well.
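One way the xplat algorithm might look for a full Vector128<int> is sketched below. It is not the code that was merged. `DivideSketch` and `Divide` are hypothetical names, the caller is assumed to have already rejected zero divisors, and the sketch makes no attempt to match scalar behavior for int.MinValue / -1.

```csharp
using System;
using System.Runtime.Intrinsics;

// Hypothetical helper, not taken from the PR.
static class DivideSketch
{
    // Divides four Int32 lanes by going through Double, using only
    // cross-platform Vector128 APIs. Assumes y contains no zero lanes.
    public static Vector128<int> Divide(Vector128<int> x, Vector128<int> y)
    {
        // Lower two lanes: Int32 -> Int64 -> Double, then divide.
        Vector128<double> lower =
            Vector128.ConvertToDouble(Vector128.WidenLower(x)) /
            Vector128.ConvertToDouble(Vector128.WidenLower(y));

        // Upper two lanes, same steps.
        Vector128<double> upper =
            Vector128.ConvertToDouble(Vector128.WidenUpper(x)) /
            Vector128.ConvertToDouble(Vector128.WidenUpper(y));

        // Double -> Int64 truncates toward zero, like integer division;
        // Narrow then packs the two Int64 vectors back into Int32 lanes.
        return Vector128.Narrow(
            Vector128.ConvertToInt64(lower),
            Vector128.ConvertToInt64(upper));
    }
}
```

For example, dividing (7, -7, 100, 9) by (2, 2, -10, 3) with this sketch yields (3, -3, -10, 3), matching truncating scalar division.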

Alex Covington (Advanced Micro Devices) added 2 commits January 16, 2025 09:50

Add vectorized path for Int32 type in TensorPrimitives.Divide

dfa44fa

Add ISA guards and Debug.Assert

37afdd2

dotnet-issue-labeler bot added the area-System.Numerics.Tensors label Jan 16, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 16, 2025

This was referenced Jan 16, 2025

ProcessThreadTests.TestStartTimeProperty failure in CI #105526

Open

restarted. Azure DevOps can't recover from restarts. dotnet/dnceng#3879

Open

[WASI] Sockets - unknown handle index #108726

Open

Simplify vectorizable check, simplify preprocessor guard

0cc3b79
