Tensor primitives divide int32 #111505
Conversation
}

Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);
None of these should be written with x86 intrinsics. Comparison, conversion, and divide all have cross-platform intrinsics available.
Comparison and divide have appropriate cross-platform intrinsics, but the conversion uses x86 intrinsics that convert Int32 -> Double and Double -> Int32. I did not see a cross-platform convert that supports that case, but maybe I missed it? If there is a cross-platform convert that supports the conversion, I'm happy to switch to that.
It would be ConvertToDouble(WidenLower(vectorOfInt32)), which we don't currently optimize but would like to pattern match and have implicitly emit the more optimal Sse2.ConvertToVector128Double API.
Correspondingly, we'd want something like Create(ConvertToDouble(WidenLower(vectorOfInt32)), ConvertToDouble(WidenUpper(vectorOfInt32))), or one of the other patterns, to work for Avx.ConvertToVector256Double.
I think it's fine to write the xplat code here first and then log an issue for optimizing it, which should also help prioritize getting that optimization implemented and ensure that it lights up on more hardware by default.
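For reference, here is a minimal sketch of the cross-platform widen-and-convert pattern described above. It is not code from this PR, and the class and method names are illustrative; on AVX hardware it has the Create(ConvertToDouble(WidenLower(...)), ConvertToDouble(WidenUpper(...))) shape that could later be folded into a single Avx.ConvertToVector256Double.

using System.Runtime.Intrinsics;

internal static class WidenConvertSketch
{
    // Converts four Int32 lanes to four double lanes using only cross-platform APIs:
    // widen Int32 -> Int64 for each half, convert Int64 -> double, then recombine.
    public static Vector256<double> ConvertToVector256Double(Vector128<int> value)
    {
        Vector128<double> lower = Vector128.ConvertToDouble(Vector128.WidenLower(value));
        Vector128<double> upper = Vector128.ConvertToDouble(Vector128.WidenUpper(value));
        return Vector256.Create(lower, upper);
    }
}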
...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs
@@ -70,11 +72,80 @@ public static void Divide<T>(T x, ReadOnlySpan<T> y, Span<T> destination)
 internal readonly struct DivideOperator<T> : IBinaryOperator<T> where T : IDivisionOperators<T, T, T>
 {
     public static bool Vectorizable => typeof(T) == typeof(float)
-                                        || typeof(T) == typeof(double);
+                                        || typeof(T) == typeof(double)
+#if NET10_0_OR_GREATER
Why is this limited to .NET 10? Were the APIs being used introduced only in .NET 10?
I was running into issues with FloatRoundingMode not being available in .NET 8.0.
That was added in 9, not 10.
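A hypothetical sketch of that adjustment, not taken from the PR: gate the new int case on NET9_0_OR_GREATER rather than NET10_0_OR_GREATER, since FloatRoundingMode shipped in .NET 9. The || typeof(T) == typeof(int) line is assumed from context, as the diff above cuts off at the #if.

internal readonly struct DivideOperatorSketch<T>
{
    // Simplified stand-in for the real DivideOperator<T>.Vectorizable property.
    public static bool Vectorizable =>
        typeof(T) == typeof(float)
        || typeof(T) == typeof(double)
#if NET9_0_OR_GREATER
        || typeof(T) == typeof(int)
#endif
        ;
}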
...aries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.Divide.cs
Debug.Assert(Avx.IsSupported && typeof(T) == typeof(int));
Vector128<int> denominator_zero = Sse2.CompareEqual(y.AsInt32(), Vector128<int>.Zero);
if (denominator_zero != Vector128<int>.Zero)
This can be the following on .NET 10:
if (Vector128.Any(y.AsInt32(), 0))
{
    ThrowHelper.DivideByZeroException();
}
or for .NET 8/9:
if (Vector128.EqualsAny(y.AsInt32(), Vector128<int>.Zero))
{
    ThrowHelper.DivideByZeroException();
}
Vector256<double> num_pd = Avx.ConvertToVector256Double(x.AsInt32());
Vector256<double> den_pd = Avx.ConvertToVector256Double(y.AsInt32());
Vector256<double> div_pd = Avx.Divide(num_pd, den_pd);
Vector128<int> div_epi32 = Avx.ConvertToVector128Int32WithTruncation(div_pd);
return div_epi32.As<int, T>();
I think it'd be better to use the xplat algorithm around ConvertToDouble(WidenLower(x.AsInt32())) / ConvertToDouble(WidenLower(y.AsInt32())) so it can light up on .NET 8/9.
We should then ideally accelerate Vector128's operator / on .NET 10 and just use that there, so it can be directly handled and light up for anyone else using operator / as well.
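A minimal sketch of that cross-platform algorithm, assuming the suggestion above. This is not the PR's code, the names are hypothetical, and it throws DivideByZeroException directly rather than going through the library's internal ThrowHelper.

using System;
using System.Runtime.Intrinsics;

internal static class Int32DivideSketch
{
    public static Vector128<int> Divide(Vector128<int> x, Vector128<int> y)
    {
        // Integer division by zero must throw rather than produce infinity.
        if (Vector128.EqualsAny(y, Vector128<int>.Zero))
        {
            throw new DivideByZeroException();
        }

        // Widen Int32 -> Int64 for each half, convert exactly to double, and divide.
        Vector128<double> lowerQuotient = Vector128.ConvertToDouble(Vector128.WidenLower(x)) /
                                          Vector128.ConvertToDouble(Vector128.WidenLower(y));
        Vector128<double> upperQuotient = Vector128.ConvertToDouble(Vector128.WidenUpper(x)) /
                                          Vector128.ConvertToDouble(Vector128.WidenUpper(y));

        // Convert back toward zero (matching integer division truncation) and
        // narrow the two Int64 halves into a single Vector128<int>.
        Vector128<long> lower = Vector128.ConvertToInt64(lowerQuotient);
        Vector128<long> upper = Vector128.ConvertToInt64(upperQuotient);
        return Vector128.Narrow(lower, upper);
    }
}

Note that this sketch does not special-case int.MinValue / -1, which overflows in scalar Int32 division.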
Add a vector path for int in TensorPrimitives.Divide. Improves performance in microbenchmarks: