2020-02-05 12:34:57

by Chen, Rong A

[permalink] [raw]
Subject: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Greeting,

FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:


commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
with following parameters:

nr_task: 100%
mode: process
test: signal1
cpufreq_governor: performance
ucode: 0x500002c

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale



If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-7/performance/x86_64-rhel-7.6/process/100%/debian-x86_64-20191114.cgz/lkp-csl-2ap4/signal1/will-it-scale/0x500002c

commit:
v5.0-rc6
81ec3f3c4c ("perf/x86: Add check_period PMU callback")

v5.0-rc6 81ec3f3c4c4d78f2d3b6689c981
---------------- ---------------------------
%stddev %change %stddev
\ | \
17987 -5.5% 16997 will-it-scale.per_process_ops
3453717 -5.5% 3263556 will-it-scale.workload
3.032e+08 ± 22% -56.2% 1.329e+08 ± 71% cpuidle.C6.time
435366 ± 25% -51.4% 211628 ± 35% cpuidle.C6.usage
5620 ± 50% -33.6% 3731 ± 4% softirqs.CPU187.RCU
7972 ± 42% -44.7% 4407 ± 3% softirqs.CPU51.RCU
381824 ± 27% -66.5% 128076 ± 91% turbostat.C6
0.48 ± 24% -0.3 0.18 ± 93% turbostat.C6%
640.83 ± 6% +14.9% 736.62 ± 6% sched_debug.cfs_rq:/.util_avg.min
56437 ± 9% -17.3% 46662 ± 7% sched_debug.cpu.nr_switches.max
5224 ± 5% -7.1% 4852 ± 6% sched_debug.cpu.nr_switches.stddev
54643 ± 9% -17.9% 44853 ± 7% sched_debug.cpu.sched_count.max
26160 ± 10% -17.6% 21557 ± 7% sched_debug.cpu.ttwu_count.max
25875 ± 9% -17.3% 21398 ± 7% sched_debug.cpu.ttwu_local.max
745.75 ± 16% -46.8% 396.75 ± 38% interrupts.33:PCI-MSI.524291-edge.eth0-TxRx-2
952.50 ±108% -93.2% 65.00 ± 50% interrupts.CPU1.RES:Rescheduling_interrupts
745.75 ± 16% -46.8% 396.75 ± 38% interrupts.CPU11.33:PCI-MSI.524291-edge.eth0-TxRx-2
740.50 ±171% -100.0% 0.25 ±173% interrupts.CPU166.RES:Rescheduling_interrupts
3207 ± 6% +27.7% 4095 ± 5% interrupts.CPU185.CAL:Function_call_interrupts
152.75 ±168% -99.2% 1.25 ± 34% interrupts.CPU50.RES:Rescheduling_interrupts
698.25 ±166% -98.7% 9.25 ±135% interrupts.CPU70.RES:Rescheduling_interrupts
3367 ± 2% +13.5% 3821 ± 6% interrupts.CPU89.CAL:Function_call_interrupts
202.75 ±117% -85.1% 30.25 ± 83% interrupts.CPU96.RES:Rescheduling_interrupts
71307 ± 3% -11.0% 63441 numa-vmstat.node2.nr_file_pages
7081 ± 3% -19.6% 5696 numa-vmstat.node2.nr_kernel_stack
1854 ± 7% -40.2% 1109 numa-vmstat.node2.nr_mapped
2524 ± 6% -18.1% 2068 ± 2% numa-vmstat.node2.nr_page_table_pages
5132 ± 15% -38.5% 3159 ± 10% numa-vmstat.node2.nr_slab_reclaimable
15668 ± 9% -22.2% 12192 ± 5% numa-vmstat.node2.nr_slab_unreclaimable
70254 ± 2% -9.9% 63317 numa-vmstat.node2.nr_unevictable
70254 ± 2% -9.9% 63317 numa-vmstat.node2.nr_zone_unevictable
361503 ± 20% -23.9% 274980 ± 6% numa-vmstat.node2.numa_hit
275672 ± 26% -31.3% 189397 ± 10% numa-vmstat.node2.numa_local
285230 ± 3% -11.0% 253766 numa-meminfo.node2.FilePages
1707 ± 11% -73.2% 457.50 ±118% numa-meminfo.node2.Inactive
20532 ± 15% -38.4% 12638 ± 10% numa-meminfo.node2.KReclaimable
7082 ± 3% -19.6% 5696 numa-meminfo.node2.KernelStack
7112 ± 5% -37.6% 4436 numa-meminfo.node2.Mapped
590073 ± 6% -15.9% 496353 ± 9% numa-meminfo.node2.MemUsed
10101 ± 6% -18.0% 8282 ± 2% numa-meminfo.node2.PageTables
20532 ± 15% -38.4% 12638 ± 10% numa-meminfo.node2.SReclaimable
62680 ± 9% -22.2% 48775 ± 5% numa-meminfo.node2.SUnreclaim
83213 ± 8% -26.2% 61414 ± 6% numa-meminfo.node2.Slab
281021 ± 2% -9.9% 253271 numa-meminfo.node2.Unevictable
3.322e+09 -5.1% 3.152e+09 perf-stat.i.branch-instructions
1.17 +0.0 1.18 perf-stat.i.branch-miss-rate%
38492834 -4.2% 36888923 perf-stat.i.branch-misses
42.83 -0.7 42.11 perf-stat.i.cache-miss-rate%
43547238 -6.0% 40916641 ± 2% perf-stat.i.cache-misses
1.014e+08 -4.5% 96860566 perf-stat.i.cache-references
34.91 +5.3% 36.76 perf-stat.i.cpi
13610 +6.2% 14457 ± 2% perf-stat.i.cycles-between-cache-misses
5.049e+09 -5.4% 4.777e+09 perf-stat.i.dTLB-loads
0.00 ± 8% +0.0 0.00 ± 2% perf-stat.i.dTLB-store-miss-rate%
17697 ± 5% +90.1% 33643 ± 3% perf-stat.i.dTLB-store-misses
3.162e+09 -5.3% 2.994e+09 perf-stat.i.dTLB-stores
26478258 -8.6% 24200892 perf-stat.i.iTLB-load-misses
1.682e+10 -5.1% 1.596e+10 perf-stat.i.instructions
640.74 +3.6% 664.01 ± 2% perf-stat.i.instructions-per-iTLB-miss
3611851 -4.9% 3435033 perf-stat.i.node-load-misses
7022617 -3.0% 6811102 perf-stat.i.node-store-misses
1.16 +0.0 1.17 perf-stat.overall.branch-miss-rate%
42.96 -0.7 42.24 perf-stat.overall.cache-miss-rate%
34.96 +5.3% 36.81 perf-stat.overall.cpi
13501 +6.4% 14364 ± 2% perf-stat.overall.cycles-between-cache-misses
0.00 ± 5% +0.0 0.00 ± 2% perf-stat.overall.dTLB-store-miss-rate%
635.29 +3.8% 659.66 perf-stat.overall.instructions-per-iTLB-miss
0.03 -5.0% 0.03 perf-stat.overall.ipc
3.308e+09 -5.1% 3.139e+09 perf-stat.ps.branch-instructions
38324306 -4.1% 36739072 perf-stat.ps.branch-misses
43365638 -6.0% 40745255 ± 2% perf-stat.ps.cache-misses
1.01e+08 -4.5% 96451003 perf-stat.ps.cache-references
5.028e+09 -5.4% 4.757e+09 perf-stat.ps.dTLB-loads
17634 ± 5% +90.1% 33526 ± 3% perf-stat.ps.dTLB-store-misses
3.149e+09 -5.3% 2.982e+09 perf-stat.ps.dTLB-stores
26369184 -8.6% 24103019 perf-stat.ps.iTLB-load-misses
1.675e+10 -5.1% 1.59e+10 perf-stat.ps.instructions
3597149 -4.9% 3420963 perf-stat.ps.node-load-misses
6994250 -3.0% 6784176 perf-stat.ps.node-store-misses
5.199e+12 -5.2% 4.931e+12 perf-stat.total.instructions



will-it-scale.per_process_ops

19500 +-+-----------------------------------------------------------------+
|..+.+..+. |
19000 +-+ +.. |
| |
18500 +-+ +..+. .+.+..+.+..+..+ |
| +..+. + |
18000 +-+ +..+.+..+..+.+..+..+ |
| |
17500 +-+ O O |
| O |
17000 +-+ O O O O O O O O O
| O O |
16500 O-+O O O O O O O O O |
| O O O |
16000 +-+-----------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (9.89 kB)
config-5.0.0-rc6-00001-g81ec3f3c4c4d7 (190.62 kB)
job-script (7.61 kB)
job.yaml (5.23 kB)
reproduce (322.00 B)
Download all attachments

2020-02-05 12:59:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
>
>
> commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>

I'm fairly sure this bisect/result is bogus.

2020-02-06 03:05:33

by Philip Li

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

> Subject: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5%
> regression
>
> On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to
> commit:
> >
> >
> > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add
> check_period PMU callback")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> I'm fairly sure this bisect/result is bogus.
Hi Peter, thanks for the feedback, we will investigate this as soon as possible.


2020-02-21 08:04:51

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression


On Wed, Feb 05, 2020 at 01:58:04PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > Greeting,
> >
> > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
> >
> >
> > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
>
> I'm fairly sure this bisect/result is bogus.


Hi Peter,

Some updates:

We checked more on this. We ran the test 14 times, and the
results consistently show the 5.5% degradation. We also ran
the same test on several other platforms; their results are
likewise consistent, though no such -5.5% drop is seen there.

We are also curious because the commit seems completely
unrelated to this signal scalability test, which starts a task
for each online CPU, keeps calling raise(), and counts the
completed runs.
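
[ For reference, a minimal userspace sketch of that pattern -- this is not
the actual will-it-scale signal1.c, just an illustration of "one worker per
CPU, each looping on raise() and counting operations"; the 5-second run
time and the SIGALRM-based stop flag are arbitrary choices for the sketch: ]

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static volatile sig_atomic_t stop;

static void on_alarm(int sig) { stop = 1; }
static void on_usr1(int sig)  { /* just take the delivery */ }

static void worker(unsigned int seconds)
{
	unsigned long long ops = 0;

	signal(SIGUSR1, on_usr1);
	signal(SIGALRM, on_alarm);
	alarm(seconds);
	while (!stop) {
		raise(SIGUSR1);		/* queue + deliver a signal to ourselves */
		ops++;
	}
	printf("%llu ops\n", ops);
}

int main(int argc, char **argv)
{
	/* one worker process per online CPU by default */
	int i, nr = argc > 1 ? atoi(argv[1]) : (int)sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < nr; i++) {
		if (fork() == 0) {
			worker(5);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return 0;
}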

One experiment we did is checking which part of the commit
really affects the test, and it turned out to be the change to
"struct pmu". Effectively, applying just this hunk on top of
5.0-rc6 triggers the same regression:

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 1d5c551..e1a0517 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -447,6 +447,11 @@ struct pmu {
 	 * Filter events for PMU-specific reasons.
 	 */
 	int (*filter_match) (struct perf_event *event); /* optional */
+
+	/*
+	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
+	 */
+	int (*check_period) (struct perf_event *event, u64 value); /* optional */
 };

So most likely this commit changes the layout of the kernel
text and data, which may trigger some cacheline-level change.
Comparing the system maps of the 2 kernels, a big chunk of
symbols following the global "pmu" have their addresses shifted:

5.0-rc6-systemap:

ffffffff8221d000 d pmu
ffffffff8221d100 d pmc_reserve_mutex
ffffffff8221d120 d amd_f15_PMC53
ffffffff8221d160 d amd_f15_PMC50

5.0-rc6+pmu-change-systemap:

ffffffff8221d000 d pmu
ffffffff8221d120 d pmc_reserve_mutex
ffffffff8221d140 d amd_f15_PMC53
ffffffff8221d180 d amd_f15_PMC50

But we can hardly identify which exact symbol is responsible
for the change, as too many symbols are offset.

btw, we've seen a similar case where a seemingly irrelevant
commit changed a benchmark, like a hugetlb patch improving the
pagefault test on a platform that never uses hugetlb:
https://lkml.org/lkml/2020/1/14/150

Thanks,
Feng


2020-02-21 10:59:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Fri, Feb 21, 2020 at 04:03:25PM +0800, Feng Tang wrote:

> We checked more on this. We run 14 times test for it, and the
> results are consistent about the 5.5% degradation, and we
> run the same test on several other platforms, whose test results
> are also consistent, though there are no such -5.5% seen.

> So likely, this commit changes the layout of the kernel text
> and data, which may trigger some cacheline level change. From
> the system map of the 2 kernels, a big trunk of symbol's address
> changes which follow the global "pmu",
>
> 5.0-rc6-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d100 d pmc_reserve_mutex
> ffffffff8221d120 d amd_f15_PMC53
> ffffffff8221d160 d amd_f15_PMC50
>
> 5.0-rc6+pmu-change-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d120 d pmc_reserve_mutex
> ffffffff8221d140 d amd_f15_PMC53
> ffffffff8221d180 d amd_f15_PMC50
>
> But we can hardly identify which exact symbol is responsible
> for the change, as too many symbols are offseted.

*groan*, that's horrible.

Thanks for digging into this.

2020-02-21 13:22:49

by Jiri Olsa

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Fri, Feb 21, 2020 at 04:03:25PM +0800, Feng Tang wrote:
>
> On Wed, Feb 05, 2020 at 01:58:04PM +0100, Peter Zijlstra wrote:
> > On Wed, Feb 05, 2020 at 08:32:16PM +0800, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -5.5% regression of will-it-scale.per_process_ops due to commit:
> > >
> > >
> > > commit: 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb ("perf/x86: Add check_period PMU callback")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> >
> > I'm fairly sure this bisect/result is bogus.
>
>
> Hi Peter,
>
> Some updates:
>
> We checked more on this. We run 14 times test for it, and the
> results are consistent about the 5.5% degradation, and we
> run the same test on several other platforms, whose test results
> are also consistent, though there are no such -5.5% seen.
>
> We are also curious that the commit seems to be completely not
> relative to this scalability test of signal, which starts a task
> for each online CPU, and keeps calling raise(), and calculating
> the run numbers.
>
> One experiment we did is checking which part of the commit
> really affects the test, and it turned out to be the change of
> "struct pmu". Effectively, applying this patch upon 5.0-rc6
> which triggers the same regression.
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 1d5c551..e1a0517 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -447,6 +447,11 @@ struct pmu {
>  	 * Filter events for PMU-specific reasons.
>  	 */
>  	int (*filter_match) (struct perf_event *event); /* optional */
> +
> +	/*
> +	 * Check period value for PERF_EVENT_IOC_PERIOD ioctl.
> +	 */
> +	int (*check_period) (struct perf_event *event, u64 value); /* optional */
>  };
>
> So likely, this commit changes the layout of the kernel text
> and data, which may trigger some cacheline level change. From
> the system map of the 2 kernels, a big trunk of symbol's address
> changes which follow the global "pmu",

nice, I wonder if we could see that in the perf c2c output ;-)
I'll try to run it and check

thanks,
jirka

>
> 5.0-rc6-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d100 d pmc_reserve_mutex
> ffffffff8221d120 d amd_f15_PMC53
> ffffffff8221d160 d amd_f15_PMC50
>
> 5.0-rc6+pmu-change-systemap:
>
> ffffffff8221d000 d pmu
> ffffffff8221d120 d pmc_reserve_mutex
> ffffffff8221d140 d amd_f15_PMC53
> ffffffff8221d180 d amd_f15_PMC50
>
> But we can hardly identify which exact symbol is responsible
> for the change, as too many symbols are offseted.
>
> btw, we've seen similar case that an irrelevant commit changes
> the benchmark, like a hugetlb patch improves pagefault test on
> a platform that never uses hugetlb https://lkml.org/lkml/2020/1/14/150
>
> Thanks,
> Feng
>

2020-02-21 18:05:27

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression



>So likely, this commit changes the layout of the kernel text
>and data,

It should be only data here. text changes all the time anyways,
but data tends to be more stable.

> which may trigger some cacheline level change. From
>the system map of the 2 kernels, a big trunk of symbol's address
>changes which follow the global "pmu",

I wonder if it's the effect Andrew predicted a long time ago from
using __read_mostly. If all the __read_mostlies are moved somewhere
else the remaining read/write variables will get more sensitive to false sharing.

A simple experiment would be to add a __cacheline_aligned to align it,
and then add

____cacheline_aligned char dummy[0];

at the end to pad it to 64 bytes.

Or hopefully Jiri can figure it out from the C2C data.
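
[ A rough sketch of what that padding experiment could look like on top of
5.0-rc6 -- illustration only; the "dummy" member is just the placeholder
name from the suggestion above, not anything that exists upstream: ]

 struct pmu {
 	...
 	int (*check_period) (struct perf_event *event, u64 value); /* optional */
+
+	/*
+	 * Pad/align the struct to a full cacheline, so the data placed
+	 * after the global 'pmu' keeps its old cacheline offsets.
+	 */
+	____cacheline_aligned char dummy[0];
 };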

>btw, we've seen similar case that an irrelevant commit changes
>the benchmark, like a hugetlb patch improves pagefault test on
>a platform that never uses hugetlb https://lkml.org/lkml/2020/1/14/150

Yes we've had similar problems with the data segment before.

-Andi

2020-02-22 12:44:51

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Andi,

On Sat, Feb 22, 2020 at 02:05:02AM +0800, Kleen, Andi wrote:
>
>
> >So likely, this commit changes the layout of the kernel text
> >and data,
>
> It should be only data here. text changes all the time anyways,
> but data tends to be more stable.

Yes, I also did an experiment modifying the gcc options to
align all function addresses to 32 or 64 bytes, and the 5.5%
gap still exists for the 2 commits.

> > which may trigger some cacheline level change. From
> >the system map of the 2 kernels, a big trunk of symbol's address
> >changes which follow the global "pmu",
>
> I wonder if it's the effect Andrew predicted a long time ago from
> using __read_mostly. If all the __read_mostlies are moved somewhere
> else the remaining read/write variables will get more sensitive to false sharing.
>
> A simple experiment would be to add a __cacheline_aligned to align it,
> and then add
>
> ____cacheline_aligned char dummy[0];
>
> at the end to pad it to 64bytes.

Thanks for the suggestion, I tried this and the 5.5% regression is gone!
This also confirms that the offset of the bulk of stuff following "pmu"
causes the performance drop.

>
> Or hopefully Jiri can figure it out from the C2C data.

I'm also trying to debug it following Jiri's "perf c2c" suggestion.

Thanks,
Feng

2020-02-22 17:09:11

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

>Thanks for the suggestion, I tried this and the 5.5 regrssion is gone!
>which also confirms the offset for the bulk of stuff following "pmu"
>causes the performance drop.

Okay that confirms that it is false sharing. Looking at c2c is the right way to
go then.

Peter, could we adopt the padding as a workaround for now?

-Andi

2020-02-23 14:12:28

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Jiri,

On Fri, Feb 21, 2020 at 02:20:48PM +0100, Jiri Olsa wrote:

> > We are also curious that the commit seems to be completely not
> > relative to this scalability test of signal, which starts a task
> > for each online CPU, and keeps calling raise(), and calculating
> > the run numbers.
> >
> > One experiment we did is checking which part of the commit
> > really affects the test, and it turned out to be the change of
> > "struct pmu". Effectively, applying this patch upon 5.0-rc6
> > which triggers the same regression.
> > So likely, this commit changes the layout of the kernel text
> > and data, which may trigger some cacheline level change. From
> > the system map of the 2 kernels, a big trunk of symbol's address
> > changes which follow the global "pmu",
>
> nice, I wonder we could see that in perf c2c output ;-)
> I'll try to run and check

Thanks for the "perf c2c" suggestion.

I tried to use perf-c2c on one platform (not the one that shows
the 5.5% regression), and found the main "hitm" points to the
"root_user" global data, as there is a task for each CPU doing
the signal stress test, and both __sigqueue_alloc() and
__sigqueue_free() will call get_uid() and free_uid() to inc/dec
this root_user's refcount.
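
[ For context, the 5.0-era hot path does two atomics on the shared
root_user object for every queued/dequeued signal -- simplified from
kernel/signal.c, matching the hunks quoted later in this thread: ]

	/* __sigqueue_alloc(), simplified */
	rcu_read_lock();
	user = get_uid(__task_cred(t)->user);	/* atomic refcount inc on root_user */
	atomic_inc(&user->sigpending);		/* second atomic, same cacheline */
	rcu_read_unlock();

	/* __sigqueue_free(), simplified */
	atomic_dec(&q->user->sigpending);
	free_uid(q->user);			/* atomic refcount dec */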

Then I added some alignment inside struct "user_struct" (for
"root_user"), and the -5.5% is gone, with a +2.6% instead.

One c2c report log is attached.

One thing I don't understand is that this -5.5% only happens on
one 2-socket, 96C/192T Cascade Lake platform, while we've run
the same test on several different platforms. In theory, the
false sharing should also take effect there?

Thanks,
Feng


Attachments:
(No filename) (1.64 kB)
c2c_wis_sig_32T.log (173.83 kB)
Download all attachments

2020-02-23 17:37:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 6:11 AM Feng Tang <[email protected]> wrote:
>
> I tried to use perf-c2c on one platform (not the one that show
> the 5.5% regression), and found the main "hitm" points to the
> "root_user" global data, as there is a task for each CPU doing
> the signal stress test, and both __sigqueue_alloc() and
> __sigqueue_free() will call get_user() and free_uid() to inc/dec
> this root_user's refcount.

What's around it for you?

There might be that 'uidhash_lock' spinlock right next to it, and
maybe that exacerbates the issue?

> Then I added some alignement inside struct "user_struct" (for
> "root_user"), then the -5.5% is gone, with a +2.6% instead.

Do you actually need to align things inside the struct, or is it
sufficient to just align the structure itself?

IOW, is the cache conflicts _within_ the user_struct itself, or is it
with some nearby data (like that uidhash_lock or whatever?)

> One thing I don't understand is, this -5.5% only happens in
> one 2 sockets, 96C/192T Cascadelake platform, as we've run
> the same test on several different platforms. In therory,
> the false sharing may also take effect?

Is that the biggest machine you have access to?

Maybe it just isn't noticeable with smaller core counts. A lot of
conflict loads tend to have "exponential" behavior - when things get
overloaded, performance plummets because it just makes things worse as
everybody gets slower at that contention point and now it gets even
more contended...

Linus

2020-02-23 19:37:30

by Jiri Olsa

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 10:11:47PM +0800, Feng Tang wrote:
> Hi Jiri,

hi,

>
> On Fri, Feb 21, 2020 at 02:20:48PM +0100, Jiri Olsa wrote:
>
> > > We are also curious that the commit seems to be completely not
> > > relative to this scalability test of signal, which starts a task
> > > for each online CPU, and keeps calling raise(), and calculating
> > > the run numbers.
> > >
> > > One experiment we did is checking which part of the commit
> > > really affects the test, and it turned out to be the change of
> > > "struct pmu". Effectively, applying this patch upon 5.0-rc6
> > > which triggers the same regression.
> > > So likely, this commit changes the layout of the kernel text
> > > and data, which may trigger some cacheline level change. From
> > > the system map of the 2 kernels, a big trunk of symbol's address
> > > changes which follow the global "pmu",
> >
> > nice, I wonder we could see that in perf c2c output ;-)
> > I'll try to run and check
>
> Thanks for the "perf c2c" suggestion.

I'm fighting with lkp tests.. looks like it's not fedora friendly ;-)

which specific test is doing this? perhaps I can dig it out and run
without the script machinery..

>
> I tried to use perf-c2c on one platform (not the one that show
> the 5.5% regression), and found the main "hitm" points to the
> "root_user" global data, as there is a task for each CPU doing
> the signal stress test, and both __sigqueue_alloc() and
> __sigqueue_free() will call get_user() and free_uid() to inc/dec
> this root_user's refcount.
>
> Then I added some alignement inside struct "user_struct" (for
> "root_user"), then the -5.5% is gone, with a +2.6% instead.

could you share the change?

>
> One c2c report log is attached.

could you also post one (for the same data) without the callchains?

# perf c2c report --stdio --call-graph none

it should show the read/write/offset for cachelines in a more
readable way

I'd also be interested to see the data if you can share it (no
worries if not) ... I'd need the perf.data and the bz2 file from
'perf archive' run on the perf.data

>
> One thing I don't understand is, this -5.5% only happens in
> one 2 sockets, 96C/192T Cascadelake platform, as we've run
> the same test on several different platforms. In therory,
> the false sharing may also take effect?

I don't have access to a Cascade Lake, but AFAICT the bigger the
machine, the bigger the issues with false sharing ;-)

jirka

2020-02-24 00:34:01

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Linus,

On Sun, Feb 23, 2020 at 09:37:06AM -0800, Linus Torvalds wrote:
> On Sun, Feb 23, 2020 at 6:11 AM Feng Tang <[email protected]> wrote:
> >
> > I tried to use perf-c2c on one platform (not the one that show
> > the 5.5% regression), and found the main "hitm" points to the
> > "root_user" global data, as there is a task for each CPU doing
> > the signal stress test, and both __sigqueue_alloc() and
> > __sigqueue_free() will call get_user() and free_uid() to inc/dec
> > this root_user's refcount.
>
> What's around it for you?
>
> There might be that 'uidhash_lock' spinlock right next to it, and
> maybe that exacerbates the issue?

The system map shows:

ffffffff8225b520 d __syscall_meta__ptrace
ffffffff8225b560 d args__ptrace
ffffffff8225b580 d types__ptrace
ffffffff8225b5c0 D root_user
ffffffff8225b680 D init_user_ns
ffffffff8225b880 d ratelimit_state.56624
ffffffff8225b8c0 d event_exit__sigsuspend

I also searched for the uidhash_lock:
ffffffff82b04c80 b uidhash_lock

> > Then I added some alignement inside struct "user_struct" (for
> > "root_user"), then the -5.5% is gone, with a +2.6% instead.
>
> Do you actually need to align things inside the struct, or is it
> sufficient to just align the structure itself?

Initially I just added the ____cacheline_aligned after the definition
of struct 'user_struct', which only makes the following 'init_user_ns'
aligned, and the test result doesn't show much change.

Then I added

struct user_struct {
+ char dummy[0] ____cacheline_aligned;

to make 'root_user' itself aligned.


> IOW, is the cache conflicts _within_ the user_struct itself, or is it
> with some nearby data (like that uidhash_lock or whatever?)

From the perf c2c data and from checking the source code, the
conflicts only happen on root_user.__count and root_user.sigpending,
as all running tasks are accessing this global data for get/put and
other operations.

> > One thing I don't understand is, this -5.5% only happens in
> > one 2 sockets, 96C/192T Cascadelake platform, as we've run
> > the same test on several different platforms. In therory,
> > the false sharing may also take effect?
>
> Is that the biggest machine you have access to?
>
> Maybe it just isn't noticeable with smaller core counts. A lot of
> conflict loads tend to have "exponential" behavior - when things get
> overloaded, performance plummets because it just makes things worse as
> everybody gets slower at that contention point and now it gets even
> more contended...

No, it's not the biggest; I tried another machine, 'Xeon Phi(TM) CPU 7295',
which has 72C/288T, and the regression is not seen there. This is the part
confusing me :)

Also, I've tried to limit the concurrent task number from 192 to 96/48/24/6/1,
and the regression did get smaller as the task count decreased.

Thanks,
Feng

> Linus

2020-02-24 01:07:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 4:33 PM Feng Tang <[email protected]> wrote:
>
> From the perf c2c data, and the source code checking, the conflicts
> only happens for root_user.__count, and root_user.sigpending, as
> all running tasks are accessing this global data for get/put and
> other operations.

That's odd.

Why? Because those two would be guaranteed to be in the same cacheline
_after_ you've aligned that user_struct.

So if it were a false sharing issue between those two, it would
actually get _worse_ with alignment. Those two fields are basically
next to each other.

But maybe it was straddling a cacheline before, and it caused two
cache accesses each time?

I find this as confusing as you do.

If it's sigpending vs the __refcount, then we almost always change
them together. sigpending gets incremented by __sigqueue_alloc() -
which also does a "get_uid()", and then we decrement it in
__sigqueue_free() - which also does a "free_uid()".

That said, exactly *because* they get incremented and decremented
together, maybe we could do something clever: make the "sigpending" be
a separate user counter, kind of how we do mm->user vs mm->count.

And we'd only increment __refcount as the sigpending goes from zero to
non-zero, and decrement it as sigpending goes back to zero. Avoiding
the double atomics for the case of "lots of signals".
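
[ In code terms, the idea is roughly the following -- just a sketch of the
pattern; the actual patch is quoted in full later in the thread: ]

	/* take the uid reference only on the 0 -> 1 transition of the
	 * per-user pending-signal count ... */
	if (atomic_inc_return(&user->sigpending) == 1)
		get_uid(user);

	/* ... and drop it only on the 1 -> 0 transition */
	if (atomic_dec_and_test(&user->sigpending))
		free_uid(user);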

> ffffffff8225b580 d types__ptrace
> ffffffff8225b5c0 D root_user
> ffffffff8225b680 D init_user_ns

I'm assuming this is after the alignment patch (since that's 64-byte
aligned there).

What was it without the alignment?

> No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> which has 72C/288T, and the regression is not seen. This is the part
> confusing me :)

Hmm.

Humor me - what happens if you turn off SMT on that Cascade Lake
system? Maybe it's about the thread ID bit in the L1? Although again,
I'd have expected things to get _worse_ if it's the two fields that
are now in the same cacheline thanks to alignment.

The Xeon Phi is the small-core setup, right? They may be slow enough
to not show the issue as clearly despite having more cores. And it
wouldn't show effects of some out-of-order speculative cache accesses.

Linus

2020-02-24 01:59:57

by Huang, Ying

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> On Sun, Feb 23, 2020 at 4:33 PM Feng Tang <[email protected]> wrote:
>>
>> From the perf c2c data, and the source code checking, the conflicts
>> only happens for root_user.__count, and root_user.sigpending, as
>> all running tasks are accessing this global data for get/put and
>> other operations.
>
> That's odd.
>
> Why? Because those two would be guaranteed to be in the same cacheline
> _after_ you've aligned that user_struct.
>
> So if it were a false sharing issue between those two, it would
> actually get _worse_ with alignment. Those two fields are basically
> next to each other.
>
> But maybe it was straddling a cacheline before, and it caused two
> cache accesses each time?
>
> I find this as confusing as you do.
>
> If it's sigpending vs the __refcount, then we almost always change
> them together. sigpending gets incremented by __sigqueue_alloc() -
> which also does a "get_uid()", and then we decrement it in
> __sigqueue_free() - which also does a "free_uid().
>

One way to verify this is to change the layout of user_struct (or
root_user) to put the __count and sigpending fields in 2 separate
cache lines explicitly.
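
[ A sketch of that layout experiment -- illustration only, with the other
fields omitted; the annotations force __count and sigpending onto
different cachelines: ]

	struct user_struct {
		refcount_t	__count ____cacheline_aligned;	  /* reference count */
		atomic_t	sigpending ____cacheline_aligned; /* pending signals */
		...
	};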

Best Regards,
Huang, Ying

2020-02-24 02:20:03

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 05:06:33PM -0800, Linus Torvalds wrote:

> > ffffffff8225b580 d types__ptrace
> > ffffffff8225b5c0 D root_user
> > ffffffff8225b680 D init_user_ns
>
> I'm assuming this is after the alignment patch (since that's 64-byte
> aligned there).
>
> What was it without the alignment?

For 5.0-rc6:
ffffffff8225b4c0 d types__ptrace
ffffffff8225b4e0 D root_user
ffffffff8225b580 D init_user_ns

For 5.0-rc6 + 81ec3f3c4c4
ffffffff8225b580 d types__ptrace
ffffffff8225b5a0 D root_user
ffffffff8225b640 D init_user_ns

The sigpending and __count are in the same cacheline.

>
> > No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> > which has 72C/288T, and the regression is not seen. This is the part
> > confusing me :)
>
> Hmm.
>
> Humor me - what happens if you turn off SMT on that Cascade Lake
> system? Maybe it's about the thread ID bit in the L1? Although again,
> I'd have expected things to get _worse_ if it's the two fields that
> are now in the same cachline thanks to alignment.

I'll try it and report back.

> The Xeon Phi is the small-core setup, right? They may be slow enough
> to not show the issue as clearly despite having more cores. And it
> wouldn't show effects of some out-of-order speculative cache accesses.

Yes, it seems the Xeon Phi is using 72 Silvermont cores. And the
next smaller platform I tested was a 2-socket 48C/96T Cascade Lake
platform, which doesn't reproduce the regression.

Thanks,
Feng

2020-02-24 13:21:57

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 10:19:15AM +0800, Feng Tang wrote:
> >
> > > No, it's not the biggest, I tried another machine 'Xeon Phi(TM) CPU 7295',
> > > which has 72C/288T, and the regression is not seen. This is the part
> > > confusing me :)
> >
> > Hmm.
> >
> > Humor me - what happens if you turn off SMT on that Cascade Lake
> > system? Maybe it's about the thread ID bit in the L1? Although again,
> > I'd have expected things to get _worse_ if it's the two fields that
> > are now in the same cachline thanks to alignment.
>
> I'll try it and report back.

I added "nosmt=force" on the 2S 4 nodes 96C/192T machine, and tested
both 96 and 192 processes, and the regression still exists.

Also, Ying's suggestion of separating 'sigpending' into a different
cache line from '__refcount' cannot heal the regression either.

Thanks,
Feng

2020-02-24 19:25:28

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Sun, Feb 23, 2020 at 6:19 PM Feng Tang <[email protected]> wrote:
>
> > What was it without the alignment?
>
> For 5.0-rc6:
> ffffffff8225b4c0 d types__ptrace
> ffffffff8225b4e0 D root_user
> ffffffff8225b580 D init_user_ns
>
> For 5.0-rc6 + 81ec3f3c4c4
> ffffffff8225b580 d types__ptrace
> ffffffff8225b5a0 D root_user
> ffffffff8225b640 D init_user_ns
>
> The sigpending and __count are in the same cachline.

Ok, so they used to be 32-byte aligned, and making it 64-byte aligned
changed something.

None of it makes any sense, though, since as you say, the two fields
you see having cache movement are still in the same cacheline.

The only difference ends up being whether they are in the first or
second half of the cacheline.

I thought that Cascade Lake ends up having full-cacheline transfers at
all caching levels, though, so even that shouldn't matter.

That said, it's a 2-socket system, so maybe there's something in the
cache transfer between sockets that cares which half of the cacheline
goes first.

Or maybe some critical-word-first logic that is simply buggy (or just
has unfortunate interactions).

I did try your little micro-benchmark on my desktop (small
8-core/16-thread 9900K CPU), just to verify the hotspots.

It does show that the fact that we have *two* atomics is a big deal:
the profiles show __sigqueue_alloc as having about half the cost being
that initial "lock xadd" for the refcount update, and a quarter being
the "lock inc" for the sigpending update.

The sigpending update is cheaper, because clearly the cacheline is
generally on the same core (since we just got it for the refcount).

The dequeuing isn't quite as clear in the profiles, because the "lock
decl" is in __dequeue_signal(), and then we have that free_uid ->
refcount_dec_and_lock_irqsave() chain to the 'lock cmpxchg' which is
the combined lock and decrement (it's basically
refcount_dec_not_one()).

The rest is xsave/restore and the userspace return (which is very
expensive due to the Intel CPU bugs - 30% of all CPU cycles are on
that stupid 'verw').

I'm guessing this might be another thing that makes Cascade Lake show
things: maybe Intel fixed the CPU bug, and thus the contention is much
more visible because it's not being hidden by the overhead?

ANYWAY.

Considering that none of the numbers make any sense at all, I think
that what's going on is (WARNING: wild handwaving commences) that this
is just extremely timing-sensitive for just _when_ the cacheline
transfer happens, and depending on pure bad luck you can get into a
situation where a transfer is likely to happen between the two
locked accesses (particularly maybe on the dequeuing path where they
aren't right next to each other), so instead of doing both accesses
with the same cacheline ownership, you get a bounce in between them.

And maybe there is some data transfer path where the cacheline is
transferred as two 32-byte transfers, and if the two words are in the
"high" 32 bytes, it takes longer to get them initially, and then it's
also likelier that you end up losing it again between accesses.

Yeah, if we could harness the energy from that handwaving, we could
probably power a small village.

I don't know. This does not seem to be a particularly serious load.
But it does feel like it should be possible to combine the two atomic
accesses into one, where you don't need to do the refcount thing
except for the case where sigcount goes from zero to non-zero (and
back to zero again).

But is it worth spending resources on it?

It might be a good idea to ask a hardware person why that 32-byte
cacheline placement might matter on that platform.

Does anybody else have any ideas?

Linus

2020-02-24 19:43:08

by Andi Kleen

[permalink] [raw]
Subject: RE: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

>Does anybody else have any ideas?

We've had problems with atomics having strange ordering constraints before.

But it might be c2c missing the right cache line. There are some known cases where that can happen.

Feng, can you double check with

perf record -a -d -e mem_load_l3_hit_retired.xsnp_hitm:pp,mem_load_l3_miss_retired.remote_hitm:pp ...

Does it report the same hot IPs as c2c?
What's the breakdown between the first and the second event for the hot IPs? The first is inside the socket, the second between sockets.

-Andi

2020-02-24 20:18:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 11:24 AM Linus Torvalds
<[email protected]> wrote:
>
> I don't know. This does not seem to be a particularly serious load.
> But it does feel like it should be possible to combine the two atomic
> accesses into one, where you don't need to do the refcount thing
> except for the case where sigcount goes from zero to non-zero (and
> back to zero again).

Ok, that looks just as simple as I thought it would be.

TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
but it _looks_ trivial enough. Just make the rule be that "if we have
any user->sigpending cases, we'll get a ref to the user for the first
one, and drop it only when getting rid of the last one".

So it might be worth testing this. But again: I have NOT done so.

There might be some silly reason why this doesn't work because I just
did the tests wrong or missed some case.

Or there might be some subtle reason why it doesn't work because I
didn't think this through properly.

But it _looks_ obvious and simple enough. And it compiles for me. So
maybe it works.

Linus


Attachments:
patch.diff (1.40 kB)

2020-02-24 20:57:16

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

[ Adding a few more people that tend to be involved in signal
handling. Just in case - even if they probably don't care ]

On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
<[email protected]> wrote:
>
> TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
> but it _looks_ trivial enough.

I've tested it, and the profiles on the silly microbenchmark look much
nicer. Now it's just the sigpending update that shows up; the refcount
case clearly still occasionally happens, but it's now in the noise.

I made slight changes to the __sigqueue_alloc() case to generate
better code: since we now use that atomic_inc_return() anyway, we
might as well then use the value that is returned for the
RLIMIT_SIGPENDING check too, instead of reading it again.

That might avoid another potential cacheline bounce, plus the
generated code just looks better.

Updated (and now slightly tested!) patch attached.

It would be interesting if this is noticeable on your benchmark
numbers. I didn't actually _time_ anything, I just looked at profiles.

But my setup clearly isn't going to see the horrible contention case
anyway, so my timing numbers wouldn't be all that interesting.

Linus


Attachments:
patch.diff (1.66 kB)

2020-02-24 21:22:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> [ Adding a few more people that tend to be involved in signal
> handling. Just in case - even if they probably don't care ]
>
> On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
> <[email protected]> wrote:
>>
>> TOTALLY UNTESTED patch attached. It may be completely buggy garbage,
>> but it _looks_ trivial enough.
>
> I've tested it, and the profiles on the silly microbenchmark look much
> nicer. Now it's just the sigpending update shows up, the refcount case
> clearly still occasionally happens, but it's now in the noise.
>
> I made slight changes to the __sigqueue_alloc() case to generate
> better code: since we now use that atomic_inc_return() anyway, we
> might as well then use the value that is returned for the
> RLIMIT_SIGPENDING check too, instead of reading it again.
>
> That might avoid another potential cacheline bounce, plus the
> generated code just looks better.
>
> Updated (and now slightly tested!) patch attached.
>
> It would be interesting if this is noticeable on your benchmark
> numbers. I didn't actually _time_ anything, I just looked at profiles.
>
> But my setup clearly isn't going to see the horrible contention case
> anyway, so my timing numbers wouldn't be all that interesting.
>
> Linus

I keep looking at your patch and wondering if there isn't a way
to remove the uid refcount entirely on this path.

Linus I might be wrong but I have this sense that your change will only
help when signal delivery is backed up. I expect in the common case
there won't be any pending signals outstanding for the user.

Not that I see anything bad jumping out at me from your patch.

Eric



> kernel/signal.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5b2396350dd1 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -413,27 +413,32 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>  {
>  	struct sigqueue *q = NULL;
>  	struct user_struct *user;
> +	int sigpending;
>
>  	/*
>  	 * Protect access to @t credentials. This can go away when all
>  	 * callers hold rcu read lock.
> +	 *
> +	 * NOTE! A pending signal will hold on to the user refcount,
> +	 * and we get/put the refcount only when the sigpending count
> +	 * changes from/to zero.
>  	 */
>  	rcu_read_lock();
> -	user = get_uid(__task_cred(t)->user);
> -	atomic_inc(&user->sigpending);
> +	user = __task_cred(t)->user;
> +	sigpending = atomic_inc_return(&user->sigpending);
> +	if (sigpending == 1)
> +		get_uid(user);
>  	rcu_read_unlock();
>
> -	if (override_rlimit ||
> -	    atomic_read(&user->sigpending) <=
> -			task_rlimit(t, RLIMIT_SIGPENDING)) {
> +	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>  		q = kmem_cache_alloc(sigqueue_cachep, flags);
>  	} else {
>  		print_dropped_signal(sig);
>  	}
>
>  	if (unlikely(q == NULL)) {
> -		atomic_dec(&user->sigpending);
> -		free_uid(user);
> +		if (atomic_dec_and_test(&user->sigpending))
> +			free_uid(user);
>  	} else {
>  		INIT_LIST_HEAD(&q->list);
>  		q->flags = 0;
> @@ -447,8 +452,8 @@ static void __sigqueue_free(struct sigqueue *q)
>  {
>  	if (q->flags & SIGQUEUE_PREALLOC)
>  		return;
> -	atomic_dec(&q->user->sigpending);
> -	free_uid(q->user);
> +	if (atomic_dec_and_test(&q->user->sigpending))
> +		free_uid(q->user);
>  	kmem_cache_free(sigqueue_cachep, q);
>  }
>

2020-02-24 21:52:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 1:22 PM Eric W. Biederman <[email protected]> wrote:
>
> I keep looking at your patch and wondering if there isn't a way
> to remove the uid refcount entirely on this path.

I agree. I tried to come up with something, but couldn't.

> Linus I might be wrong but I have this sense that your change will only
> help when signal delivery is backed up. I expect in the common case
> there won't be any pending signals outstanding for the user.

Again, 100% agreed.

HOWEVER.

The normal case where there's only one signal pending is also the case
where we don't care about the extra atomic RMW access. By definition
that's not ever going to show up as a performance issue or as
cacheline contention.

So the only case that matters from a performance standpoint is the
"lots of signals" case, in which case you'll see that sigqueue become
backed up.

But as I said in the original thread (before you got added to the list):

"I don't know. This does not seem to be a particularly serious load."

I'm not convinced this will show up outside of this kind of
signal-sending microbenchmark.

That said, I don't really see any downside to the patch either, so...

Linus

2020-02-24 22:03:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Linus Torvalds <[email protected]> writes:

> On Mon, Feb 24, 2020 at 1:22 PM Eric W. Biederman <[email protected]> wrote:
>>
>> I keep looking at your patch and wondering if there isn't a way
>> to remove the uid refcount entirely on this path.
>
> I agree. I tried to come up with something, but couldn't.
>
>> Linus I might be wrong but I have this sense that your change will only
>> help when signal delivery is backed up. I expect in the common case
>> there won't be any pending signals outstanding for the user.
>
> Again, 100% agreed.
>
> HOWEVER.
>
> The normal case where there's one only signal pending is also the case
> where we don't care about the extra atomic RMW access. By definition
> that's not going to ever going to show up as a performance issue or
> for cacheline contention.
>
> So the only case that matters from a performance standpoint is the
> "lots of signals" case, in which case you'll see that sigqueue become
> backed up.
>
> But as I said in the original thread (before you got added to the list):
>
> "I don't know. This does not seem to be a particularly serious load."
>
> I'm not convinced this will show up outside of this kind of
> signal-sending microbenchmark.
>
> That said, I don't really see any downside to the patch either, so...

Other than scratching my head about why we are optimizing this, neither do I.

It would help to have a comment somewhere in the code or the commit
message that says the issue is contention under load. So the next time
someone goes through the code and wonders why we aren't doing the stupid
and simple thing, it will be clear.

Eric

2020-02-24 22:20:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 2:02 PM Eric W. Biederman <[email protected]> wrote:
>
> Other than scratching my head about why are we optimizing neither do I.

You can see the background on lore

https://lore.kernel.org/lkml/20200205123216.GO12867@shao2-debian/

and the thread about the largely unexplained regression there. I had a
wild handwaving theory on what's going on in

https://lore.kernel.org/lkml/CAHk-=wjkSb1OkiCSn_fzf2v7A=K0bNsUEeQa+06XMhTO+oQUaA@mail.gmail.com/

but yes, the contention only happens once you have a lot of cores.

That said, I suspect it actually improves performance on that
microbenchmark even without the contention - just not as noticeably.
I'm running a kernel with the patch right now, but I wasn't going to
boot back into an old kernel just to test that. I was hoping that the
kernel test robot people would just check it out.

> It would help to have a comment somewhere in the code or the commit
> message that says the issue is contetion under load.

Note that even without the contention, on that "send a lot of signals"
case it does avoid the second atomic op, and the profile really does
look better.

That profile improvement I can see even on my own machine, and I see
how the nasty CPU bug avoidance (the "verw" on the system call exit
path) goes from 30% to 31% cost.

And that increase in the relative cost of the "verw" on the profile
must mean that the actual real code just improved in performance (even
if I didn't actually time it).

With the contention, you get that added odd extra regression that
seems to depend on exact cacheline placement.

So I think the patch improves performance (for this "lots of queued
signals" case) in general, and I hope it will also then get rid of
that contention regression.

Linus

2020-02-25 02:59:19

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 12:47:14PM -0800, Linus Torvalds wrote:
> [ Adding a few more people that tend to be involved in signal
> handling. Just in case - even if they probably don't care ]
>
> On Mon, Feb 24, 2020 at 12:09 PM Linus Torvalds
>
> I've tested it, and the profiles on the silly microbenchmark look much
> nicer. Now it's just the sigpending update shows up, the refcount case
> clearly still occasionally happens, but it's now in the noise.
>
> I made slight changes to the __sigqueue_alloc() case to generate
> better code: since we now use that atomic_inc_return() anyway, we
> might as well then use the value that is returned for the
> RLIMIT_SIGPENDING check too, instead of reading it again.
>
> That might avoid another potential cacheline bounce, plus the
> generated code just looks better.
>
> Updated (and now slightly tested!) patch attached.
> It would be interesting if this is noticeable on your benchmark
> numbers. I didn't actually _time_ anything, I just looked at profiles.
>
> But my setup clearly isn't going to see the horrible contention case
> anyway, so my timing numbers wouldn't be all that interesting.

Thanks for the optimization patch for signal!

It makes a big difference: the performance score is tripled, bumping
from the original 17000 to 54000. Also the gap between 5.0-rc6 and
5.0-rc6+Jiri's patch is reduced to around 2%.

The test I ran inserts your patch right before 5.0-rc6, then runs
the test for both 5.0-rc6 and 5.0-rc6+Jiri's patch. Sorry it took
quite some time, as the test platform is not local but inside
0day's framework, which takes some time for scheduling, kbuilding
and testing.

Thanks,
Feng


> Linus

> kernel/signal.c | 23 ++++++++++++++---------
> 1 file changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/signal.c b/kernel/signal.c
> index 9ad8dea93dbb..5b2396350dd1 100644
> --- a/kernel/signal.c
> +++ b/kernel/signal.c
> @@ -413,27 +413,32 @@ __sigqueue_alloc(int sig, struct task_struct *t, gfp_t flags, int override_rlimi
>  {
>  	struct sigqueue *q = NULL;
>  	struct user_struct *user;
> +	int sigpending;
>
>  	/*
>  	 * Protect access to @t credentials. This can go away when all
>  	 * callers hold rcu read lock.
> +	 *
> +	 * NOTE! A pending signal will hold on to the user refcount,
> +	 * and we get/put the refcount only when the sigpending count
> +	 * changes from/to zero.
>  	 */
>  	rcu_read_lock();
> -	user = get_uid(__task_cred(t)->user);
> -	atomic_inc(&user->sigpending);
> +	user = __task_cred(t)->user;
> +	sigpending = atomic_inc_return(&user->sigpending);
> +	if (sigpending == 1)
> +		get_uid(user);
>  	rcu_read_unlock();
>
> -	if (override_rlimit ||
> -	    atomic_read(&user->sigpending) <=
> -			task_rlimit(t, RLIMIT_SIGPENDING)) {
> +	if (override_rlimit || likely(sigpending <= task_rlimit(t, RLIMIT_SIGPENDING))) {
>  		q = kmem_cache_alloc(sigqueue_cachep, flags);
>  	} else {
>  		print_dropped_signal(sig);
>  	}
>
>  	if (unlikely(q == NULL)) {
> -		atomic_dec(&user->sigpending);
> -		free_uid(user);
> +		if (atomic_dec_and_test(&user->sigpending))
> +			free_uid(user);
>  	} else {
>  		INIT_LIST_HEAD(&q->list);
>  		q->flags = 0;
> @@ -447,8 +452,8 @@ static void __sigqueue_free(struct sigqueue *q)
>  {
>  	if (q->flags & SIGQUEUE_PREALLOC)
>  		return;
> -	atomic_dec(&q->user->sigpending);
> -	free_uid(q->user);
> +	if (atomic_dec_and_test(&q->user->sigpending))
> +		free_uid(q->user);
>  	kmem_cache_free(sigqueue_cachep, q);
>  }
>

2020-02-25 03:15:59

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

On Mon, Feb 24, 2020 at 6:57 PM Feng Tang <[email protected]> wrote:
>
> Thanks for the optimization patch for signal!
>
> It makes a big difference, that the performance score is tripled!
> bump from original 17000 to 54000. Also the gap between 5.0-rc6 and
> 5.0-rc6+Jiri's patch is reduced to around 2%.

Ok, so what I think is happening is that the exact same issue still
exists, but now with less contention it's not quite as noticeable.

Can you find some Intel CPU hardware person who could spend a moment
on that odd 32-byte sub-block issue?

Considering that this effect apparently doesn't happen on any other
platform you've tested, and this Cascade Lake platform is the newly
released current Intel server platform, I think it's worth looking at.

That microbenchmark is not important on its own, but the odd timing
behaviour it has would be good to have explained.

And while the signal sending microbenchmark is not likely to be very
relevant to much anything else, I guess I'll apply the patch. Even if
it's just a microbenchmark, it's not like we haven't used those before
to pinpoint some very specific behavior. We used lmbench (and whatever
that odd page cache benchmark was) to do some fairly fundamental
optimizations back in the days.

If you fix the details on all the microbenchmarks you find, eventually
you probably do well on real loads too..

Linus

2020-02-25 04:54:45

by Feng Tang

[permalink] [raw]
Subject: Re: [LKP] Re: [perf/x86] 81ec3f3c4c: will-it-scale.per_process_ops -5.5% regression

Hi Linus,

On Mon, Feb 24, 2020 at 07:15:15PM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2020 at 6:57 PM Feng Tang <[email protected]> wrote:
> >
> > Thanks for the optimization patch for signal!
> >
> > It makes a big difference, that the performance score is tripled!
> > bump from original 17000 to 54000. Also the gap between 5.0-rc6 and
> > 5.0-rc6+Jiri's patch is reduced to around 2%.
>
> Ok, so what I think is happening is that the exact same issue still
> exists, but now with less contention it's not quite as noticeable.

I thought that too.

Since we have the reproducible platform, we will keep an eye on it,
and report back if anything is found.

You've mentioned the patch's effect on small systems in another mail;
I ran the benchmark on a 4-core Skylake desktop, and it only brought
a 2% performance gain, as expected.

>
> Can you find some Intel CPU hardware person who could spend a moment
> on that odd 32-byte sub-block issue?
>
> Considering that this effect apparently doesn't happen on any other
> platform you've tested, and this Cascade Lake platform is the newly
> released current Intel server platform, I think it's worth looking at.

I'll try to reach some silicon people, and get back if I find anything.


> That microbenchmark is not important on its own, but the odd timing
> behaviour it has would be good to have explained.
>
> And while the signal sending microbenchmark is not likely to be very
> relevant to much anything else, I guess I'll apply the patch. Even if
> it's just a microbenchmark, it's not like we haven't used those before
> to pinpoint some very specific behavior. We used lmbench (and whatever
> that odd page cache benchmark was) to do some fairly fundamental
> optimizations back in the days.

Thanks again for the patch.

- Feng
>
> If you fix the details on all the microbenchmarks you find, eventually
> you probably do well on real loads too..
>
> Linus