Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tensor primitives divide int32 #111505

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

alexcovington
Copy link
Contributor

Add a vector path for int in TensorPrimitives.Divide.

Improves performance in microbenchmarks:

| Method        | Job        | Toolchain        | BufferLength | Mean       | Error    | StdDev   | Median     | Min        | Max        | Ratio | Allocated | Alloc Ratio |
|-------------- |----------- |----------------- |------------- |-----------:|---------:|---------:|-----------:|-----------:|-----------:|------:|----------:|------------:|
| Divide_Vector | Job-SUKFCS | Base             | 128          |   174.4 ns |  0.83 ns |  0.69 ns |   174.2 ns |   173.5 ns |   176.0 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 128          |   111.5 ns |  0.56 ns |  0.50 ns |   111.5 ns |   110.7 ns |   112.3 ns |  0.64 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 128          |   138.9 ns |  0.88 ns |  0.78 ns |   138.5 ns |   137.8 ns |   140.5 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 128          |   104.8 ns |  0.50 ns |  0.42 ns |   104.7 ns |   104.3 ns |   105.8 ns |  0.75 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Vector | Job-SUKFCS | Base             | 3079         | 4,038.8 ns | 13.47 ns | 11.94 ns | 4,034.4 ns | 4,025.5 ns | 4,062.2 ns |  1.00 |         - |          NA |
| Divide_Vector | Job-OOXWSE | Diff             | 3079         | 1,358.5 ns |  5.78 ns |  4.83 ns | 1,356.7 ns | 1,351.8 ns | 1,367.6 ns |  0.34 |         - |          NA |
|               |            |                  |              |            |          |          |            |            |            |       |           |             |
| Divide_Scalar | Job-SUKFCS | Base             | 3079         | 3,315.1 ns | 12.55 ns | 11.12 ns | 3,313.0 ns | 3,301.0 ns | 3,337.9 ns |  1.00 |         - |          NA |
| Divide_Scalar | Job-OOXWSE | Diff             | 3079         | 1,294.8 ns |  7.84 ns |  6.95 ns | 1,293.1 ns | 1,286.2 ns | 1,309.0 ns |  0.39 |         - |          NA |

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 16, 2025
}

Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these shouldn't be written in x86 intrinsics. Comparison, conversion and divide are all having cross platform intrinsic available.

Copy link
Contributor Author

@alexcovington alexcovington Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparison and divide have appropriate cross platform instrinsics, but the conversion uses x86 intrinsics that convert Int32 -> Double and Double -> Int32. I did not see a cross platform convert that supports that case, but maybe I missed it? If there is a cross platform convert that supports the conversion, I'm happy to switch to that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be for ConvertToDouble(WidenLower(vectorOfInt32)) which we don't currently optimize but would like to pattern match and have implicitly emit the more optimal Sse2.ConvertToVector128Double API

Correspondingly we'd want something like Create(ConvertToDouble(WidenLower(vectorOfInt32)), ConvertToDouble(WidenUpper(vectorOfInt32))) or one of the other patterns to work for Avx.ConvertToVector256Double

I think its fine to write the xplat code here first and then log the issue for optimizing it, which should also help prioritize getting that optimization implemented and ensure that it lights up on more hardware by default.

@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
{
public static bool Vectorizable => typeof(T) == typeof(float)
|| typeof(T) == typeof(double);
|| typeof(T) == typeof(double)
#if NET10_0_OR_GREATER
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this limited to .NET 10? Are the APIs being used newly introduced only in .NET 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running into issues with FloatRoundingMode not being available in .NET 8.0.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was added in 9, not 10.


Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);
if (denominator_zero != Vector128<int>.Zero)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be the following on .NET 10:

if (Vector128.Any(y.AsInt32(), 0))
{
    ThrowHelper.DivideByZeroException();
}

or for .NET 8/9:

if (Vector128.EqualsAny(y.AsInt32(), Vector128<int>.Zero))
{
    ThrowHelper.DivideByZeroException();
}

Comment on lines 97 to 101
Vector256<double> num_pd = Avx.ConvertToVector256Double(x.AsInt32());
Vector256<double> den_pd = Avx.ConvertToVector256Double(y.AsInt32());
Vector256<double> div_pd = Avx.Divide(num_pd, den_pd);
Vector128<int> div_epi32 = Avx.ConvertToVector128Int32WithTruncation(div_pd);
return div_epi32.As<int, T>();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better to use the xplat algorithm around ConvertToDouble(WidenLower(x.AsInt32())) / ConvertToDouble(WidenLower(y.AsInt32())) so it can lightup for .NET 8/9.

We should then ideally accelerate Vector128.operator / on .NET 10 and just use that there, so it can be directly handled and lightup anyone else using operator / as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Numerics.Tensors community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants