Skip to content

Optimize multicast stubs#130207

Open
MichalPetryka wants to merge 2 commits into
dotnet:mainfrom
MichalPetryka:multicast-invoke
Open

Optimize multicast stubs#130207
MichalPetryka wants to merge 2 commits into
dotnet:mainfrom
MichalPetryka:multicast-invoke

Conversation

@MichalPetryka

Copy link
Copy Markdown
Contributor

Rewrites multicast stubs to byref loops to remove bounds checks and covarianc helpers from every iteration.

This assumes the wrapper struct is the same size as a ref, is such assumption fine for the VM? @jkotas @MichalStrehovsky

@MichalPetryka

Copy link
Copy Markdown
Contributor Author

@EgorBot -arm -amd --envvars DOTNET_JitDisasm:IL_STUB_MulticastDelegate_Invoke

using BenchmarkDotNet.Attributes;

public class MyBenchmarks
{
    public Action a;

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < 10000; i++)
            a += new Action(() => {});
    }

    [Benchmark]
    public void Bench() => a();
}

@MichalPetryka

MichalPetryka commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Hmm the asm seems better but the arm perf is much worse, I assume it's cause the JIT is ordering blocks wrong which causes branch mispredictions.

EDIT: the branch order is also different than what I get locally, is the bot using any weird settings? @EgorBo

@jkotas

jkotas commented Jul 5, 2026

Copy link
Copy Markdown
Member

This assumes the wrapper struct is the same size as a ref, is such assumption fine for the VM

These sort of Unsafe.As UB casts can confuse optimizations that depend on accurate type information. I think we should be rather fixing the few places where we still do them in CoreLib instead of introducing new ones.

the arm perf is much worse,

We have seen number of cases where unsafe code "optimizations" result in worse performance. It sounds like this is another one of those.

covariant helpers from every iteration.

What are the covariant helpers that this is eliminating?

for (int i = 0; i < 10000; i++)
a += new Action(() => {})

The typical multicast delegate has very few targets, and the targets are typically different. This is not very representative microbenchmark.

@MichalPetryka

Copy link
Copy Markdown
Contributor Author

@EgorBot -arm -amd --envvars DOTNET_JitDisasm:IL_STUB_MulticastDelegate_Invoke

using BenchmarkDotNet.Attributes;

public class MyBenchmarks
{
    public Action a;

    [GlobalSetup]
    public void Setup()
    {
        for (int i = 0; i < 10000; i++)
            a += new Action(() => {});
    }

    [Benchmark]
    public void Bench() => a();
}

@MichalPetryka

Copy link
Copy Markdown
Contributor Author

These sort of Unsafe.As UB casts can confuse optimizations that depend on accurate type information. I think we should be rather fixing the few places where we still do them in CoreLib instead of introducing new ones.

It's not UB here since the As is on the ref, not on the object. It's just layout reliant. If desired, I can swithc the code to fetch the sizeof of thee struct and use ldfld to avoid assuming the field is at offset 0.

We have seen number of cases where unsafe code "optimizations" result in worse performance. It sounds like this is another one of those.

I assume the bad perf is from the horrible layout from the JIT caused by the debugging helper check which is predicted to be hot. Does the code need to recheck this on every iteration?

What are the covariant helpers that this is eliminating?

I misread the loop as doing ldelema, it's just a bounds check, not a covariance helper.

The typical multicast delegate has very few targets, and the targets are typically different. This is not very representative microbenchmark.

The targets being different shouldnt matter much outside of CPU branch prediction here and it makes results more stable. This should also not scale badly with size, I assume this'd be faster even with just 2 delegates if the JIT used a proper layout.

The actual diff of the loop, excluding the debugging call is this:
image

#endif // DEBUGGING_SUPPORTED

ILCodeLabel *realLoopStart = pCode->NewCodeLabel();
pCode->EmitBEQ(realLoopStart);

@huoyaoyuan huoyaoyuan Jul 5, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look optimal for branch predicting (forward conditional jump as hot path). The original logic was constructed for optimizing this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants