Revamp the generation of runtime division checks on ARM64 #111543

snickolls-arm · 2025-01-17T16:23:46Z

This patch introduces a new compilation phase that passes over the GenTrees looking for GT_DIV/GT_UDIV nodes on integral types, and morphs the code to introduce the necessary conformance checks (overflow/divide-by-zero) early on in the compilation pipeline. Currently these are added during the Emit phase, meaning optimizations don't run on any code introduced.

The aim is to allow the compiler to make decisions on code position and instruction selection for these checks. For example, on ARM64 this enables certain scenarios to choose the cbz instruction over cmp/beq, which can lead to more compact code. It also allows some of the comparisons in the checks to be hoisted out of loops.

Fixes dotnet#64795

dotnet-policy-service · 2025-01-17T16:24:27Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

snickolls-arm · 2025-01-17T16:27:07Z

@kunalspathak @a74nh

This is WIP. Rather than adding new nodes, I've taken a different approach and added a pass that modifies the HIR.

The pass runs through all of the code in the function looking for GT_DIV/GT_UDIV nodes. On ARM64 we need to run this after morph so that we catch any GT_DIV nodes that might have been introduced by conversions such as MOD to SUB-MUL-DIV. If the pass encounters a GT_DIV node, it uses fgSplitBlockBeforeTree to ensure any side effects of the tree run before the runtime check. It then adds the runtime checks to the graph just after these side effects, but before the actual division occurs.

For example, the added HIR for the signed overflow check looks like this. It checks for (dividend < 0 && divisor == -1), which should throw an overflow exception.

------------ BB06 [0005] [???..???) -> BB07(0.01),BB05(0.99) (cond), preds={BB02} succs={BB05,BB07}

***** BB06 [0005]
STMT00007 ( ??? ... ??? )
               [000032] -----------                         *  JTRUE     void
               [000030] J----------                         \--*  EQ        int
               [000028] -----------                            +--*  AND       int
               [000025] -----------                            |  +--*  EQ        int
               [000022] -----------                            |  |  +--*  LCL_VAR   int    V01 arg1
               [000024] -----------                            |  |  \--*  CNS_INT   int    -1
               [000027] -----------                            |  \--*  LT        int
               [000023] -----------                            |     +--*  LCL_VAR   int    V03 loc0
               [000026] -----------                            |     \--*  CNS_INT   int    0
               [000029] -----------                            \--*  CNS_INT   int    1

------------ BB07 [0006] [???..???) (throw), preds={BB06} succs={}

***** BB07 [0006]
STMT00006 ( ??? ... ??? )
               [000031] --CXG------                         *  CALL help void   CORINFO_HELP_OVERFLOW
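
Read as ordinary code, the JTRUE condition in BB06 is roughly the predicate below. This is only a C++ sketch for illustration: the function and variable names are made up, and it assumes (from the description above) that V01 holds the divisor and V03 holds the dividend.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the tree in BB06:
//   JTRUE(EQ(AND(EQ(V01, -1), LT(V03, 0)), 1))
// Assumes V01 is the divisor and V03 is the dividend.
bool OverflowCheckTaken(int32_t dividend, int32_t divisor)
{
    int32_t divisorIsMinusOne  = (divisor == -1) ? 1 : 0; // EQ(V01, -1)
    int32_t dividendIsNegative = (dividend < 0) ? 1 : 0;  // LT(V03, 0)

    // The two relop results are combined with a bitwise AND and compared
    // with 1, so a single branch leads to the throw block (BB07).
    return (divisorIsMinusOne & dividendIsNegative) == 1;
}
```

Because the condition is one expression rather than a chain of branches, each half of it is an ordinary subtree that later phases can treat independently.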

Here's the example @kunalspathak mentioned in #64795:

// See https://aka.ms/new-console-template for more information
using System;

namespace MyApp
{
    internal class Program
    {
        public static int issue2(int x, int y, int z)
        {
            int result = x;
            for (int i = 0; i < z; i++)
            {
                // result = x % y; <-- this hoists things properly because both dividend and divisor are invariant.
                result = result % y;
            }
            return result;
        }

        static void Main(string[] args)
        {
            var rand = new Random(1234);
            Console.WriteLine(issue2(rand.Next(), rand.Next(), rand.Next()));
        }
    }
}

Before the change:

; Total bytes of code 80, prolog size 8, PerfScore 81.00, instruction count 24, allocated bytes for code 80 (MethodHash=3a9665a0) for method MyApp.Program:issue2(int,int,int):int (FullOpts)
; ============================================================

*************** After end code gen, before unwindEmit()
G_M39519_IG01:        ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0015: 000000      stp     fp, lr, [sp, #-0x10]!
IN0016: 000004      mov     fp, sp

G_M39519_IG02:        ; offs=0x000008, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz

IN0001: 000008      cmp     w2, #0
IN0002: 00000C      ble     G_M39519_IG06

G_M39519_IG03:        ; offs=0x000010, size=0x0000, bbWeight=0.25, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0005], byref, isz

IN0003: 000010      align   [0 bytes for IG04]
IN0004: 000010      align   [0 bytes]
IN0005: 000010      align   [0 bytes]
IN0006: 000010      align   [0 bytes]

G_M39519_IG04:        ; offs=0x000010, size=0x0018, bbWeight=4, PerfScore 18.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB03 [0001], byref, isz

IN0007: 000010      cmp     w1, #0
IN0008: 000014      beq     G_M39519_IG07
IN0009: 000018      cmn     w1, #1
IN000a: 00001C      bne     G_M39519_IG05
IN000b: 000020      cmp     w0, #1
IN000c: 000024      bvs     G_M39519_IG08

G_M39519_IG05:        ; offs=0x000028, size=0x0010, bbWeight=4, PerfScore 58.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, loop=IG04, BB03 [0001], byref, isz

IN000d: 000028      sdiv    w3, w0, w1
IN000e: 00002C      msub    w0, w3, w1, w0
IN000f: 000030      sub     w2, w2, #1
IN0010: 000034      cbnz    w2, G_M39519_IG04

G_M39519_IG06:        ; offs=0x000038, size=0x0008, bbWeight=1, PerfScore 2.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc

IN0017: 000038      ldp     fp, lr, [sp], #0x10
IN0018: 00003C      ret     lr

G_M39519_IG07:        ; offs=0x000040, size=0x0008, bbWeight=0, PerfScore 0.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0007], gcvars, byref

IN0011: 000040      bl      CORINFO_HELP_THROWDIVZERO
IN0012: 000044      brk     #0

G_M39519_IG08:        ; offs=0x000048, size=0x0008, bbWeight=0, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB07 [0008], byref

IN0013: 000048      bl      CORINFO_HELP_OVERFLOW
IN0014: 00004C      brk     #0

After the change:

; Total bytes of code 84, prolog size 8, PerfScore 79.25, instruction count 25, allocated bytes for code 84 (MethodHash=3a9665a0) for method MyApp.Program:issue2(int,int,int):int (FullOpts)
; ============================================================

*************** After end code gen, before unwindEmit()
G_M39519_IG01:        ; func=00, offs=0x000000, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, nogc <-- Prolog IG

IN0016: 000000      stp     fp, lr, [sp, #-0x10]!
IN0017: 000004      mov     fp, sp

G_M39519_IG02:        ; offs=0x000008, size=0x0008, bbWeight=1, PerfScore 1.50, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB01 [0000], byref, isz

IN0001: 000008      cmp     w2, #0
IN0002: 00000C      ble     G_M39519_IG05

G_M39519_IG03:        ; offs=0x000010, size=0x0008, bbWeight=0.25, PerfScore 0.25, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB02 [0011], byref, isz

IN0003: 000010      cmn     w1, #1
IN0004: 000014      cset    x3, eq
IN0005: 000018      align   [0 bytes for IG04]
IN0006: 000018      align   [0 bytes]
IN0007: 000018      align   [0 bytes]
IN0008: 000018      align   [0 bytes]

G_M39519_IG04:        ; offs=0x000018, size=0x0024, bbWeight=4, PerfScore 74.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, loop=IG04, BB03 [0001], BB04 [0004], BB05 [0007], byref, isz

IN0009: 000018      lsr     w4, w0, #31
IN000a: 00001C      and     w4, w3, w4
IN000b: 000020      cmp     w4, #1
IN000c: 000024      beq     G_M39519_IG07
IN000d: 000028      cbz     w1, G_M39519_IG06
IN000e: 00002C      sdiv    w4, w0, w1
IN000f: 000030      msub    w0, w4, w1, w0
IN0010: 000034      sub     w2, w2, #1
IN0011: 000038      cbnz    w2, G_M39519_IG04

G_M39519_IG05:        ; offs=0x00003C, size=0x0008, bbWeight=1, PerfScore 2.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, byref, epilog, nogc

IN0018: 00003C      ldp     fp, lr, [sp], #0x10
IN0019: 000040      ret     lr

G_M39519_IG06:        ; offs=0x000044, size=0x0008, bbWeight=0, PerfScore 0.00, gcVars=0000000000000000 {}, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB07 [0009], gcvars, byref

IN0012: 000044      bl      CORINFO_HELP_THROWDIVZERO
IN0013: 000048      brk     #0

G_M39519_IG07:        ; offs=0x00004C, size=0x0008, bbWeight=0, PerfScore 0.00, gcrefRegs=0000 {}, byrefRegs=0000 {}, BB06 [0006], byref

IN0014: 00004C      bl      CORINFO_HELP_OVERFLOW
IN0015: 000050      brk     #0

The main difference is at label IG04. Rather than emitting a fixed sequence of compare and branch instructions chosen at the emit stage, the compiler has built a logical expression for the overflow check and emitted a cbz for the divide-by-zero check. The loop hoisting optimization has decided that the test for (divisor == -1) can be performed outside of the loop, saving an instruction inside the loop; this is computed in IG03. Building a logical expression instead of a branch sequence has also allowed the compiler to perform these checks with two compare-and-branch pairs instead of three.
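
To make the shape of the "after" code easier to follow, here is a rough C++ model of what IG02 to IG04 compute. It is a hand-written sketch based on the disassembly above, not output from the JIT; the function name and the use of C++ exceptions in place of the helper calls are illustrative assumptions.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical C++ model of the "after" code for issue2; names are illustrative.
int32_t Issue2Model(int32_t x, int32_t y, int32_t z)
{
    int32_t result = x;
    if (z <= 0) // IG02: cmp w2, #0 / ble
    {
        return result;
    }

    // IG03: hoisted out of the loop (cmn w1, #1 / cset x3, eq).
    uint32_t divisorIsMinusOne = (y == -1) ? 1u : 0u;

    do
    {
        // IG04: lsr w4, w0, #31 extracts the sign bit of the dividend.
        uint32_t dividendIsNegative = static_cast<uint32_t>(result) >> 31;
        if ((divisorIsMinusOne & dividendIsNegative) == 1u) // and / cmp / beq
        {
            throw std::overflow_error("CORINFO_HELP_OVERFLOW");
        }
        if (y == 0) // cbz w1
        {
            throw std::domain_error("CORINFO_HELP_THROWDIVZERO");
        }
        int32_t quotient = result / y;  // sdiv
        result = result - quotient * y; // msub
        z--;                            // sub w2, w2, #1
    } while (z != 0);                   // cbnz w2

    return result;
}
```

Only the sign-bit extraction, the AND, and the two branches remain inside the loop; the comparison of the divisor against -1 is evaluated once before the loop is entered.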

The approach is working well when:
• The trees containing GT_DIV don't have many side-effects, as these will have to be split out and this can result in spilling, especially in MinOpts.
• GT_DIV occurs in a loop, as some of the expression tree for the check can now be hoisted outside the loop.
• There are a lot of GT_DIV nodes in a function, as now the compiler seems to choose cbz more often than cmp/beq.

It seems to have an adverse effect on MinOpts though, because splitting the tree will often spill and there aren't any optimization passes running to clear up these spills.

At the moment I haven't focused on the efficiency of the pass itself but I believe it could be improved. I could borrow the recursive traversals in the earlier morph phase to build a work-list for where checks need to be added. Then the pass can be linear over a pre-built list of nodes rather than a search in a loop. I would just have to be careful to update all of the locations of the nodes after any trees are split, but I think this should be possible.
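
As a rough illustration of the work-list idea, the sketch below collects division nodes with one recursive traversal and then processes the list linearly. The `Node` and `Oper` types and both function names are toy stand-ins invented for this example, not the JIT's GenTree API, and the sketch ignores the problem of nodes moving when trees are split.

```cpp
#include <cstddef>
#include <vector>

// Toy stand-ins for JIT structures; these are NOT the real GenTree types.
enum class Oper { Div, UDiv, Add, LclVar, Const };

struct Node
{
    Oper               oper;
    std::vector<Node*> operands;
};

// Recursive traversal that records every division node it sees,
// visiting operands before the node itself.
void CollectDivisions(Node* node, std::vector<Node*>& workList)
{
    if (node == nullptr)
    {
        return;
    }
    for (Node* operand : node->operands)
    {
        CollectDivisions(operand, workList);
    }
    if (node->oper == Oper::Div || node->oper == Oper::UDiv)
    {
        workList.push_back(node);
    }
}

// Linear pass over the pre-built list; returns how many checks would be added.
std::size_t AddChecks(const std::vector<Node*>& workList)
{
    std::size_t added = 0;
    for (Node* div : workList)
    {
        (void)div; // a real pass would split the block and insert checks here
        added++;
    }
    return added;
}
```

In a real pass the list entries would also need to record, and keep up to date, the block and statement each node lives in, since splitting moves nodes between blocks.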

I've also had to make a temporary fix for a problem in the tree splitting code, where it wasn't correctly updating the node flags after splitting out side effects. After splitting the tree I traverse it post-order to update all of the flags. There might be a more efficient way of doing this.
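
The post-order flag update can be pictured with the toy model below. Everything in it is hypothetical (the flag constants, the `FlagNode` type, and `RecomputeFlags` are invented for illustration and are not the JIT's flags or helpers); it only shows why post-order is the natural direction for rebuilding summary flags.

```cpp
#include <cstdint>
#include <vector>

// Toy flag bits and node type; hypothetical, not the JIT's flags or GenTree.
constexpr uint32_t kFlagCall      = 0x1;
constexpr uint32_t kFlagException = 0x2;

struct FlagNode
{
    uint32_t               ownEffects; // effects this node causes by itself
    uint32_t               flags;      // summary: own effects plus all operands'
    std::vector<FlagNode*> operands;
};

// Post-order walk: operands are finalized before their parent, so every
// node's summary is rebuilt from already up-to-date children.
uint32_t RecomputeFlags(FlagNode* node)
{
    uint32_t flags = node->ownEffects;
    for (FlagNode* operand : node->operands)
    {
        flags |= RecomputeFlags(operand);
    }
    node->flags = flags;
    return flags;
}
```

A stale bit left behind on a parent after one of its operands was split out is cleared by this walk, because the parent's summary is recomputed only from what is still underneath it.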

snickolls-arm · 2025-01-17T16:54:26Z

I think the build is failing on Release mode due to use of GenTree::gtTreeID so I'll need to look into having access to this, or some similar identifier, for all modes as it is part of the algorithm.

kunalspathak · 2025-01-18T02:26:34Z

can you also eliminate the regressions?

jakobbotsch · 2025-01-18T09:34:41Z

> I think the build is failing on Release mode due to use of GenTree::gtTreeID so I'll need to look into having access to this, or some similar identifier, for all modes as it is part of the algorithm.

What do you need this for? Increasing the size of GenTree is hard to justify. I do not think this transformation qualifies. Most likely you have other options.

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 17, 2025

dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Jan 17, 2025

build-analysis bot mentioned this pull request Jan 17, 2025

System.Data.Common.Tests Assert failure on Linx x64 CI test run #108070

Open
