More optimizations #101

turol · 2024-08-31T20:18:11Z

GCC was putting a conditional move in the copy loop for some reason; the first commit moves that condition out of the loop header. The second commit is less certain, and its benefit seems to be highly sensitive to other changes.

    N           Min           Max        Median           Avg        Stddev
x  30      18039044      18336971      18278984      18267936     61730.367
+  30      18453057      18728267      18672697      18658935     56689.819
Difference at 95.0% confidence
	390999 +/- 30634.3
	2.14036% +/- 0.167694%
	(Student's t, pooled s = 59263.7)
Instructions/second, higher is better
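
For readers unfamiliar with this output format, the headline numbers can be reproduced from the two Avg columns. The sketch below assumes the `x` row is the baseline and the `+` row is the build with this PR applied, which the output itself does not state; the function names are made up for illustration.

```c
#include <assert.h>

/* Absolute difference between the candidate and baseline averages. */
double mean_difference(double baseline_avg, double candidate_avg)
{
    return candidate_avg - baseline_avg;
}

/* Express a value as a percentage of the baseline average. */
double percent_of_baseline(double value, double baseline_avg)
{
    assert(baseline_avg != 0.0); /* a zero baseline has no meaningful percentage */
    return 100.0 * value / baseline_avg;
}
```

Calling `mean_difference(18267936.0, 18658935.0)` gives the 390999 shown above, and `percent_of_baseline` applied to that difference and to the 30634.3 margin gives the reported 2.14036% and 0.167694% relative to the baseline average.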

vlutas · 2024-09-02T06:46:09Z

bddisasm/bdx86_decoder.c

-    for (opIndex = 0; 
-         opIndex < ((Size < ND_MAX_INSTRUCTION_LENGTH) ? Size : ND_MAX_INSTRUCTION_LENGTH); 
-         opIndex++)
+    if (Size < ND_MAX_INSTRUCTION_LENGTH)


This just moves the condition out of the for header. Modern compilers should be able to deal with the original form as it is (for example, MSVC unrolls the entire loop, so introducing this if actually makes things worse).

This is very likely GCC doing something stupid. I'll test clang too and see what happens.
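
To make the two shapes under discussion concrete, here is a minimal standalone sketch. It is not the bddisasm code: `MAX_LEN` is a hypothetical stand-in for ND_MAX_INSTRUCTION_LENGTH (assumed to be 15, the x86 maximum instruction length), and the function names are invented. Both variants copy the same bytes; they differ only in where the bound is evaluated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for ND_MAX_INSTRUCTION_LENGTH (assumed value). */
#define MAX_LEN 15

/* Variant 1: the bound is a conditional expression in the loop header. */
size_t copy_bound_in_header(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < ((size < MAX_LEN) ? size : MAX_LEN); i++) {
        dst[i] = src[i];
    }
    return i;
}

/* Variant 2: the bound is decided once, before the loop starts. */
size_t copy_bound_hoisted(uint8_t *dst, const uint8_t *src, size_t size)
{
    size_t limit = MAX_LEN;
    size_t i;

    assert(dst != NULL && src != NULL);
    if (size < MAX_LEN) {
        limit = size;
    }
    for (i = 0; i < limit; i++) {
        dst[i] = src[i];
    }
    return i;
}

/* Runs both variants on the same input. Returns the number of bytes
 * copied if they agree on count and contents, or -1 if they differ. */
int check_variants(size_t size)
{
    uint8_t src[2 * MAX_LEN];
    uint8_t a[2 * MAX_LEN] = { 0 };
    uint8_t b[2 * MAX_LEN] = { 0 };
    size_t k, na, nb;

    assert(size <= sizeof(src));
    for (k = 0; k < sizeof(src); k++) {
        src[k] = (uint8_t)(k + 1);
    }
    na = copy_bound_in_header(a, src, size);
    nb = copy_bound_hoisted(b, src, size);
    if (na != nb || memcmp(a, b, sizeof(a)) != 0) {
        return -1;
    }
    return (int)na;
}
```

Since the two variants are semantically identical, any speed difference comes down to what a given compiler generates for each, so comparing the generated assembly across GCC, clang and MSVC is the only way to settle it.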

vlutas · 2024-09-02T06:50:44Z

bddisasm/bdx86_decoder.c

@@ -3813,22 +3813,27 @@ NdGetEffectiveAddrAndOpMode(
    static const ND_UINT8 szLut[3] = { ND_SIZE_16BIT, ND_SIZE_32BIT, ND_SIZE_64BIT };
    ND_BOOL w64, f64, d64, has66;

-    if ((ND_CODE_64 != Instrux->DefCode) && !!(Instrux->Attributes & ND_FLAG_IWO64))
+    // Branchless form of (ND_CODE_64 != Instrux->DefCode) && !!(Instrux->Attributes & ND_FLAG_IWO64)
+    if (((ND_CODE_64 ^ Instrux->DefCode) * (Instrux->Attributes & ND_FLAG_IWO64)) != 0)


I would expect the branch predictors in modern CPUs to be quite competent at predicting the branches here, since they operate on INSTRUX state that will very often be the same, allowing for a very high prediction rate.
I understand that these changes may remove some conditional moves or branches, but is it really worth it? At some point, one has to decide whether the (small) increase in performance is worth the less readable code. I don't think this is the case here. In addition, depending on the architecture, the multiplication may be a new source of overhead.

A multiplication pipelines more easily than a branch. I wasn't entirely happy with this commit either, but it did make things faster.

The compiler may even ditch the multiplication altogether, and rely on conditional instructions (such as CMOVcc or SETcc), but this is not the point. The point is that it shouldn't really matter, since modern branch prediction should easily handle these branches with minimal overhead. These statements are not a performance bottleneck.
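
For reference, the equivalence claimed in the code comment can be checked in isolation. The sketch below uses hypothetical constants (`CODE_16`, `CODE_32`, `CODE_64`, `FLAG_IWO64`); their values are assumptions chosen for illustration and are not taken from the bddisasm headers. It also widens the product to 64 bits, which is an addition not present in the diff: a product of two non-zero values is only guaranteed to be non-zero if it cannot wrap around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical values, for illustration only. */
#define CODE_16    0u
#define CODE_32    1u
#define CODE_64    2u
#define FLAG_IWO64 0x00000200u

/* Original form: short-circuit && of two tests. */
int cond_logical(uint32_t def_code, uint32_t attributes)
{
    return (CODE_64 != def_code) && !!(attributes & FLAG_IWO64);
}

/* Multiplication form: the XOR is non-zero when the codes differ, the AND
 * is non-zero when the flag is set, and the product is non-zero only when
 * both are. The cast to uint64_t rules out wraparound to zero. */
int cond_multiply(uint32_t def_code, uint32_t attributes)
{
    return ((uint64_t)(CODE_64 ^ def_code) * (attributes & FLAG_IWO64)) != 0;
}

/* Compares both forms over every code and a few attribute patterns.
 * Returns the number of disagreements (0 means they are equivalent here). */
int count_mismatches(void)
{
    static const uint32_t attrs[] = { 0u, FLAG_IWO64, ~FLAG_IWO64, 0xFFFFFFFFu };
    uint32_t code;
    size_t i;
    int mismatches = 0;

    for (code = CODE_16; code <= CODE_64; code++) {
        for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
            assert(attrs[i] <= 0xFFFFFFFFu);
            if (cond_logical(code, attrs[i]) != cond_multiply(code, attrs[i])) {
                mismatches++;
            }
        }
    }
    return mismatches;
}
```

One caveat about the multiplication form in general: if both operands were 32 bits wide and the flag happened to occupy bit 31, an XOR result of 2 would make the product wrap to zero and the two forms would disagree. Whether that can happen in the actual code depends on the real operand types and flag value, which this sketch does not know.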

vlutas · 2024-09-02T07:26:38Z

On Windows (MSVC compiler), the performance with these changes is essentially the same.

turol · 2024-09-02T17:31:01Z

Does your CI pipeline upload the MSVC-built binaries somewhere?

turol added 2 commits August 31, 2024 21:24

Move condition away from loop header

21da6ed

Convert some complex conditions to branchless form

7a705ce

vlutas reviewed Sep 2, 2024
