Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix #105969 #106218

Merged
merged 13 commits into from
Aug 15, 2024
Merged

fix #105969 #106218

merged 13 commits into from
Aug 15, 2024

Conversation

Ruihan-Yin
Copy link
Contributor

attempt to resolve #105969

@Ruihan-Yin
Copy link
Contributor Author

Build analysis is green, all failures are known.
@JulieLeeMSFT @tannergooding PTAL.

Comment on lines 11127 to 11133
// Avx512F.Insert/ExtractVector256 accepts inputs with base type smaller that 64
// bits. it makes operSize wrong in those case.
if (op2->AsHWIntrinsic()->GetHWIntrinsicId() == NI_AVX512F_InsertVector256 ||
op2->AsHWIntrinsic()->GetHWIntrinsicId() == NI_AVX512F_ExtractVector256)
{
operSize = 8;
}
Copy link
Member

@tannergooding tannergooding Aug 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the issue, but I'm not sure its entirely the "right" fix (and may miss some other similar cases).

We have a few instructions (namely those listed here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/instr.cpp#L110) where they effectively operate "bitwise" without masking but which need to consider the actual operand size when masking (or broadcast, if supported) is used.

So we probably need to ensure all of these are handled and are picking the "right" operand size (generally 4 when the input is of a "small type").

In the particular case of Insert/ExtractVector256 we should probably be using vinserti32x8 instead of vinserti64x4 for the small types and for int/uint. Those are technically AVX512DQ, but the JIT relies on all of F+BW+CD+DQ+VL being available together, so it should be fine to just adjust them in hwintrinsiclistxarch.h and get the "better" codegen. Some of the other intrinsics that need special handling may be in the same boat.
-- Alternatively we could just explicitly lower cases like NI_AVX512F_InsertVector256 to be NI_AVX512DQ_InsertVector256 and change the base type to int/uint for byte/ubyte, short/ushort, and int/uint. We could similarly lower cases like NI_AVX512F_And to normalize the base type to int/uint, which would likely also fix the issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fix the issue, but I'm not sure its entirely the "right" fix (and may miss some other similar cases).

We have a few instructions (namely those listed here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/instr.cpp#L110) where they effectively operate "bitwise" without masking but which need to consider the actual operand size when masking (or broadcast, if supported) is used.

Thanks for the inputs,

At a quick glance on the instructions on the list, and/andn/or/xor/insert/extract may hit the same bug, movudqx/round/broadcast does not have EmbMask support yet. Do we consider fix them all? or leave those unsupported instructions for now (those instructions intrinsically have the support for embedded masking).

So we probably need to ensure all of these are handled and are picking the "right" operand size (generally 4 when the input is of a "small type").

In the particular case of Insert/ExtractVector256 we should probably be using vinserti32x8 instead of vinserti64x4 for the small types and for int/uint. Those are technically AVX512DQ, but the JIT relies on all of F+BW+CD+DQ+VL being available together, so it should be fine to just adjust them in hwintrinsiclistxarch.h and get the "better" codegen. Some of the other intrinsics that need special handling may be in the same boat. -- Alternatively we could just explicitly lower cases like NI_AVX512F_InsertVector256 to be NI_AVX512DQ_InsertVector256 and change the base type to int/uint for byte/ubyte, short/ushort, and int/uint. We could similarly lower cases like NI_AVX512F_And to normalize the base type to int/uint, which would likely also fix the issue

I would try to use the first option, as this is how those logical instructions (And/Or/AndN/Xor) are designed as well and based on how Avx512 ISA family is being checked, it should safe to use DQ instructions. These might still need to be handled together during lowering to normalize the base data type to int/uint unless it is a long type. (did some extensive testing, And with base type to be byte will have the same bug.)

@JulieLeeMSFT JulieLeeMSFT added this to the 9.0.0 milestone Aug 12, 2024
@JulieLeeMSFT
Copy link
Member

  • @TIHan for code review for .NET 9.

…atches with the data type implied by the intrinsics.
@tannergooding
Copy link
Member

Going to give this another passover later tonight before approving/merging.

Want to double check the tables myself and make sure we didn't miss any other obvious edge cases here. The changes look generally good/correct, however, so it's more just an extra audit I'm wanting to do for completeness

@JulieLeeMSFT JulieLeeMSFT requested a review from TIHan August 14, 2024 16:02
@JulieLeeMSFT
Copy link
Member

@Ruihan-Yin, is this ready for final review?
Rerunning failed tests.

@Ruihan-Yin
Copy link
Contributor Author

@Ruihan-Yin, is this ready for final review? Rerunning failed tests.

Yes, changes are ready for review, thanks for the help from @tannergooding

@JulieLeeMSFT
Copy link
Member

@Ruihan-Yin, there is an assertion failure. PTAL.

Pipeline: runtime-coreclr jitstress-isas-avx512/20240813.7
Log:

  Starting:    profiler.elt.XUnitWrapper (parallel test collections = on [4 threads], stop on fail = off)
    profiler\elt\slowpatheltenter\slowpatheltenter.cmd [FAIL]
      
      Assert failure(PID 4496 [0x00001190], Thread: 5920 [0x1720]): Assertion failed 'IS_ALIGNED(addr, byteSize)' in 'System.Text.Ascii:GetIndexOfFirstNonAsciiChar_Vector(uint,uint):uint' during 'Emit code' (IL size 810; hash 0x3c71a05c; FullOpts)
      
          File: D:\a\_work\1\s\src\coreclr\jit\emitxarch.cpp:14542
          Image: C:\h\w\B3E509A2\p\CoreRun.exe
      
      Unhandled exception. System.Exception: Profiler tests are expected to contain the text 'PROFILER TEST PASSES' in the console output of the profilee app to indicate a passing test. Usually it is printed from the Shutdown() method of the profiler implementation. This text was not found in the output above. Profilee returned exit code -1073740286.
         at Profiler.Tests.ProfilerTestRunner.FailFastWithMessage(String error)
         at Profiler.Tests.ProfilerTestRunner.Run(String profileePath, String testName, Guid profilerClsid, String profileeArguments, ProfileeOptions profileeOptions, Dictionary`2 envVars, String reverseServerName, Boolean loadAsNotification, Int32 notificationCopies)
         at SlowPathELTTests.SlowPathELTEnter.Main(String[] args)

@tannergooding
Copy link
Member

Wanted to call out that spmi-diffs shows size regression, but that's namely because it doesn't account for data constant size at all. With the base type being int/uint it's recognizing cases like this now:

- RWD00  	dq	00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh, 00FF00FF00FF00FFh
+ RWD00  	dd	00FF00FFh

So while the encoding used for vpand is 2-bytes larger due to the embedded broadcast, we actually have 28 bytes of size savings.


There also appears to be a disassembly quirk in that its showing {1to1} when it should say {1to4} or {1to8}. This looks to be because emitDispEmbBroadcastCount is using emitGetBaseMemOpSize which is taking broadcast size into account, when rather broadcast size should only be taking into account for emitGetMemOpSize. This is unrelated to this PR, so I'll get a separate fix up for it.

@tannergooding
Copy link
Member

tannergooding commented Aug 14, 2024

#106405 fixes the disassembly and the IS_ALIGNED assert

Edit: Changed to include the fix directly in this PR so they can make the cutoff

@tannergooding
Copy link
Member

There's still a remaining failure caused by this change. Validating the fix locally and should have it pushed up shortly.

Comment on lines +3098 to +3153
GenTree* nestedOp1 = op1Intrinsic->Op(1);
GenTree* nestedOp2 = op1Intrinsic->Op(2);

if (nestedOp2->isContained() && nestedOp2->OperIsHWIntrinsic())
{
GenTreeHWIntrinsic* nestedIntrin = nestedOp2->AsHWIntrinsic();
NamedIntrinsic nestedIntrinId = nestedIntrin->GetHWIntrinsicId();

if ((nestedIntrinId == NI_SSE3_MoveAndDuplicate) ||
(nestedIntrinId == NI_AVX2_BroadcastScalarToVector128) ||
(nestedIntrinId == NI_AVX2_BroadcastScalarToVector256) ||
(nestedIntrinId == NI_AVX512F_BroadcastScalarToVector512))
{
// We need to rewrite the embedded broadcast back to a regular constant
// so that the subsequent containment check for ptestm can determine
// if the embedded broadcast is still relevant

GenTree* broadcastOp = nestedIntrin->Op(1);

if (broadcastOp->OperIsHWIntrinsic(NI_Vector128_CreateScalarUnsafe))
{
BlockRange().Remove(broadcastOp);
broadcastOp = broadcastOp->AsHWIntrinsic()->Op(1);
}

assert(broadcastOp->OperIsConst());

GenTree* vecCns =
comp->gtNewSimdCreateBroadcastNode(simdType, broadcastOp,
op1Intrinsic->GetSimdBaseJitType(), simdSize);

BlockRange().InsertAfter(broadcastOp, vecCns);
nestedOp2 = vecCns;

BlockRange().Remove(broadcastOp);
BlockRange().Remove(nestedIntrin);
}
}

node->Op(1) = nestedOp1;
node->Op(2) = nestedOp2;

// Make sure we aren't contained since ptestm will do its own containment check
node->Op(2)->ClearContained();
nestedOp2->ClearContained();

if (varTypeIsSmall(simdBaseType))
{
// Fixup the base type so embedded broadcast and the mask size checks still work
node->NormalizeJitBaseTypeToInt(testIntrinsicId, simdBaseType);

simdBaseJitType = node->GetSimdBaseJitType();
simdBaseType = node->GetSimdBaseType();

maskBaseJitType = simdBaseJitType;
maskBaseType = simdBaseType;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the normalization, we could now more easily encounter something that looked like (x & y) == 0 where (x & y) were TYP_INT or TYP_UINT but tmp == 0 was still a small type. -- This was also possible before the fix, so an existing bug, but less likely since it wouldn't have been typical for a user to do something like (x & y).AsByte() == 0

In such a case, (x & y) may have also already made y an `embedded broadcast and that would cause a disconnect.

The simplest fix would've been to simply not do this optimization if op2 of the AND operation was already contained, but that had some fairly significant size regressions and a measurable perf hit to some core APIs, including GetIndexOfFirstNonAsciiChar_Vector as the inner loop codegen was no longer doing a simple check and branch.

So I went with the more verbose "proper" fix that instead ensures the type of tmp == 0 is fixed up and the containment of op2 is properly resolved so that the rewritten PTESTM can properly contain the operation itself.

@JulieLeeMSFT
Copy link
Member

@TIHan is checking spmi asmdiffs test failure.

@TIHan
Copy link
Contributor

TIHan commented Aug 14, 2024

The diffs were failing to acquire the mch files, so it didn't run anything. It's re-running now and has been for about 45 min.

@tannergooding
Copy link
Member

The last set of spmi failures were all due to the downloads failing, generally that's due to a JIT/EE version change going in (like the one we had earlier today: https://github.com/dotnet/runtime/commits/9d9af3d22f4b9fdd453790eecab63795f896ea1a/src/coreclr/inc/jiteeversionguid.h)

The current jitstress-isas-avx512 failure is unrelated, the history shows its been failing off and on for a while now: https://dev.azure.com/dnceng-public/public/_build/results?buildId=776127&view=ms.vss-test-web.build-test-results-tab&runId=19865742&resultId=118465&paneView=history

@tannergooding
Copy link
Member

I ran the full PMI diffs locally (jit-diff.exe diff --diff --pmi --tests, jit-diff.exe diff --diff --pmi --frameworks, and jit-diff.exe diff --diff --pmi --benchmarks), and no failures as compared to existing dotnet/main (there's historically been few like for FSharp.Core and a few interop tests)

@tannergooding
Copy link
Member

/ba-g runtime queue completed but results were not reported due to GitHub outage: https://dev.azure.com/dnceng-public/public/_build/results?buildId=776122&view=results

Issues are known #100558 and #106428

@Ruihan-Yin
Copy link
Contributor Author

Thanks for the investigation! is there any other blocker before merging?

@tannergooding
Copy link
Member

Some spmi-diffs jobs failed to acquire the MCH files again, due to the JIT/EE version guid change

Local runs of both SPMI diffs and full PMI diffs are passing. Just waiting on SPMI replay to finish

@tannergooding
Copy link
Member

SPMI replay has also finished, only failures are also MCH acquisition issues due to the JIT/EE version change. They remain passing locally.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JIT: Potential bad codegen with Avx512BW intrinsic functions
4 participants