LinuxLists.cc - hackbench regression with 2.6.36-rc1

2010-08-18 06:19:10

Subject: hackbench regression with 2.6.36-rc1

Compared with 2.6.35, hackbench (thread mode) shows about an
80% regression on a dual-socket Nehalem machine and about a 90% regression
on 4-socket Tigerton machines.

Command to start hackbench:
#./hackbench 100 thread 2000

process mode has no such regression.
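
For readers unfamiliar with the workload, the following is a minimal sketch of the kind of traffic involved: one sender thread and one receiver exchanging fixed-size messages over an AF_UNIX stream socketpair, which is the path where unix_stream_sendmsg and unix_stream_recvmsg in the profile come from. This is not the hackbench source; the function name run_thread_pair and the 100-byte MSG_SIZE are assumptions chosen for illustration.

```c
#include <pthread.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define MSG_SIZE 100 /* assumed message size, for illustration only */

struct sender_args {
    int fd;
    int loops;
};

/* Sender thread: write `loops` fixed-size messages into the socket. */
static void *sender_thread(void *p)
{
    struct sender_args *a = p;
    char buf[MSG_SIZE];

    memset(buf, 'x', sizeof(buf));
    for (int i = 0; i < a->loops; i++) {
        size_t off = 0;
        while (off < sizeof(buf)) { /* handle short writes */
            ssize_t n = write(a->fd, buf + off, sizeof(buf) - off);
            if (n <= 0)
                return NULL;
            off += (size_t)n;
        }
    }
    return NULL;
}

/* Run one sender thread against a receiver in the calling thread.
 * Returns the number of bytes received, or -1 on setup failure. */
long run_thread_pair(int loops)
{
    int sv[2];
    pthread_t tid;
    struct sender_args args;
    char buf[MSG_SIZE];
    long total = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    args.fd = sv[0];
    args.loops = loops;
    if (pthread_create(&tid, NULL, sender_thread, &args) != 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    while (total < (long)loops * MSG_SIZE) {
        ssize_t n = read(sv[1], buf, sizeof(buf));
        if (n <= 0)
            break;
        total += n;
    }
    pthread_join(tid, NULL);
    close(sv[0]);
    close(sv[1]);
    return total;
}
```

Calling run_thread_pair(2000) moves 2000 messages through a single socket pair; the real benchmark runs many such senders and receivers per group at once.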

Profiling shows:
#perf top
samples pcnt function DSO
_______ _____ ________________________ ________________________

74415.00 29.9% put_pid [kernel.kallsyms]
38395.00 15.4% unix_stream_recvmsg [kernel.kallsyms]
34877.00 14.0% unix_stream_sendmsg [kernel.kallsyms]
25204.00 10.1% pid_vnr [kernel.kallsyms]
21864.00 8.8% unix_scm_to_skb [kernel.kallsyms]
13637.00 5.5% cred_to_ucred [kernel.kallsyms]
6520.00 2.6% unix_destruct_scm [kernel.kallsyms]
4731.00 1.9% sock_alloc_send_pskb [kernel.kallsyms]

With 2.6.35, perf doesn't show put_pid/pid_vnr.

Alex Shi and I did a quick bisect and narrowed it down to the 2 patches below.
1) commit 7361c36c5224519b258219fe3d0e8abc865d8134
Author: Eric W. Biederman <[email protected]>
Date: Sun Jun 13 03:34:33 2010 +0000

af_unix: Allow credentials to work across user and pid namespaces.

In unix_skb_parms store pointers to struct pid and struct cred instead
of raw uid, gid, and pid values, then translate the credentials on
reception into values that are meaningful in the receiving processes
namespaces.

2) commit 257b5358b32f17e0603b6ff57b13610b0e02348f
Author: Eric W. Biederman <[email protected]>
Date: Sun Jun 13 03:32:34 2010 +0000

scm: Capture the full credentials of the scm sender.

Start capturing not only the userspace pid, uid and gid values of the
sending process but also the struct pid and struct cred of the sending
process as well.
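
The user-visible side of these two commits is the credential record a receiver can request on an AF_UNIX socket. The sketch below is a user-space illustration, not kernel code: it enables SO_PASSCRED on the receiving end of a socketpair, sends one byte, and reads back the sender pid that the kernel reports in the SCM_CREDENTIALS control message. The function name received_sender_pid is hypothetical.

```c
#define _GNU_SOURCE /* for struct ucred */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one byte over an AF_UNIX socketpair and return the sender pid
 * that the kernel reports to the receiver, or -1 on failure. */
pid_t received_sender_pid(void)
{
    int sv[2];
    int on = 1;
    char byte = 'x';
    char data;
    union { /* union keeps the control buffer suitably aligned */
        char buf[CMSG_SPACE(sizeof(struct ucred))];
        struct cmsghdr align;
    } ctrl;
    struct iovec iov = { .iov_base = &data, .iov_len = 1 };
    struct msghdr msg;
    struct cmsghdr *cmsg;
    pid_t pid = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    /* Ask the kernel to attach sender credentials on the receiving end. */
    if (setsockopt(sv[1], SOL_SOCKET, SO_PASSCRED, &on, sizeof(on)) < 0)
        goto out;
    if (write(sv[0], &byte, 1) != 1)
        goto out;

    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl.buf;
    msg.msg_controllen = sizeof(ctrl.buf);
    if (recvmsg(sv[1], &msg, 0) != 1)
        goto out;

    /* Walk the control messages looking for the credentials record. */
    for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
        if (cmsg->cmsg_level == SOL_SOCKET &&
            cmsg->cmsg_type == SCM_CREDENTIALS) {
            struct ucred uc;
            memcpy(&uc, CMSG_DATA(cmsg), sizeof(uc));
            pid = uc.pid;
        }
    }
out:
    close(sv[0]);
    close(sv[1]);
    return pid;
}
```

Because sender and receiver are the same process here, the reported pid should equal getpid(). The translation step described in the first commit matters when the two ends live in different pid namespaces.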

2010-08-18 10:57:12

by Eric W. Biederman

Subject: Re: hackbench regression with 2.6.36-rc1

"Zhang, Yanmin" <[email protected]> writes:

> Compared with 2.6.35, hackbench (thread mode) shows about an
> 80% regression on a dual-socket Nehalem machine and about a 90% regression
> on 4-socket Tigerton machines.

That seems unfortunate. Do you only show a regression in the pthread
hackbench test? Do you show a regression when you use pipes?

Does the size of the regression vary based on the number of loop
iterations? I ask because it appears that on the last message the
sender will exit, necessitating that the receiver put the sender's pid,
which should be atypical.

> Command to start hackbench:
> #./hackbench 100 thread 2000
>
> process mode has no such regression.
>
> Profiling shows:
> #perf top
> samples pcnt function DSO
> _______ _____ ________________________ ________________________
>
> 74415.00 29.9% put_pid [kernel.kallsyms]
> 38395.00 15.4% unix_stream_recvmsg [kernel.kallsyms]
> 34877.00 14.0% unix_stream_sendmsg [kernel.kallsyms]
> 25204.00 10.1% pid_vnr [kernel.kallsyms]
> 21864.00 8.8% unix_scm_to_skb [kernel.kallsyms]
> 13637.00 5.5% cred_to_ucred [kernel.kallsyms]
> 6520.00 2.6% unix_destruct_scm [kernel.kallsyms]
> 4731.00 1.9% sock_alloc_send_pskb [kernel.kallsyms]
>
>
> With 2.6.35, perf doesn't show put_pid/pid_vnr.

Yes. 2.6.35 is imperfect and can report the wrong pid in some
circumstances. I am surprised that nothing related to the reference count on
struct cred shows up in your profiling traces.

You are performing statistical sampling so I don't believe the
percentage of hits per function is the same as the percentage of
time per function.

Given that we are talking about a scheduler benchmark that is
doing something rather artificial (inter thread communication via
sockets), I don't know that this case is worth worrying about.

> Alex Shi and I did a quick bisect and narrowed it down to the 2 patches below.

That is a plausible result. The atomic reference counts may
be causing you to ping pong cache lines between cpus.
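
A user-space analogue of the suspected effect, as a hedged sketch: several threads repeatedly take and drop a reference on one shared atomic counter, so the cache line holding it has to move between CPUs on nearly every operation. This is not the kernel's get_pid()/put_pid() code; shared_refcount, get_put_worker, and run_get_put are hypothetical names. Threads of one process presumably hit the same struct pid in the same way, which would fit the observation that process mode does not regress.

```c
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 64

/* One counter shared by every thread, like a refcount on a shared object. */
static atomic_long shared_refcount = 1;

struct worker_args {
    long iterations;
};

/* Each iteration models one message: take a reference, then drop it. */
static void *get_put_worker(void *p)
{
    long iters = ((struct worker_args *)p)->iterations;

    for (long i = 0; i < iters; i++) {
        atomic_fetch_add(&shared_refcount, 1); /* "get" */
        atomic_fetch_sub(&shared_refcount, 1); /* "put" */
    }
    return NULL;
}

/* Run nthreads workers against the shared counter and return its final
 * value, which is back at 1 once every get has been matched by a put. */
long run_get_put(int nthreads, long iterations)
{
    pthread_t tids[MAX_THREADS];
    struct worker_args args = { iterations };
    int started = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    atomic_store(&shared_refcount, 1);
    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[i], NULL, get_put_worker, &args) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    return atomic_load(&shared_refcount);
}
```

Timing run_get_put with one thread and then with one thread per CPU, at the same total number of iterations, is one way to see how much the shared cache line costs on a given machine. Giving each thread its own counter removes the contention.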

Eric

> 1) commit 7361c36c5224519b258219fe3d0e8abc865d8134
> Author: Eric W. Biederman <[email protected]>
> Date: Sun Jun 13 03:34:33 2010 +0000
>
> af_unix: Allow credentials to work across user and pid namespaces.
>
> In unix_skb_parms store pointers to struct pid and struct cred instead
> of raw uid, gid, and pid values, then translate the credentials on
> reception into values that are meaningful in the receiving processes
> namespaces.
>
>
> 2) commit 257b5358b32f17e0603b6ff57b13610b0e02348f
> Author: Eric W. Biederman <[email protected]>
> Date: Sun Jun 13 03:32:34 2010 +0000
>
> scm: Capture the full credentials of the scm sender.
>
> Start capturing not only the userspace pid, uid and gid values of the
> sending process but also the struct pid and struct cred of the sending
> process as well.

2010-08-19 08:52:41

by Yanmin Zhang

Subject: Re: hackbench regression with 2.6.36-rc1

On Wed, 2010-08-18 at 03:56 -0700, Eric W. Biederman wrote:
> "Zhang, Yanmin" <[email protected]> writes:
>
> > Compared with 2.6.35, hackbench (thread mode) shows about an
> > 80% regression on a dual-socket Nehalem machine and about a 90% regression
> > on 4-socket Tigerton machines.
>
> That seems unfortunate.

> Do you only show a regression in the pthread
> hackbench test?
Yes.

> Do you show a regression when you use pipes?
No.

>
> Does the size of the regression vary based on the number of loop
> iterations?
No. I tried 1000 and got a similar regression ratio.
I chose a large loop count of 2000 because I wanted a stable result.

It's easy to reproduce. We see it on almost all of our machines.

> I ask because it appears that on the last message the
> sender will exit, necessitating that the receiver put the sender's pid,
> which should be atypical.
I don't agree with that. With hackbench, each sender sends loops*receiver_num_per_group
messages before exiting.
In addition, 'perf top' shows put_pid as the hottest function right from the start
of the hackbench run.
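
To put a number on that, here is a small sketch of the arithmetic. The helper name messages_per_sender is hypothetical, and the figure of 20 receivers per group used in the note below is an assumption about hackbench's defaults, not something stated in this thread.

```c
/* Messages one sender writes before it exits:
 * loops * receiver_num_per_group, as described above. */
long messages_per_sender(long loops, long receivers_per_group)
{
    return loops * receivers_per_group;
}
```

With loops = 2000 and an assumed 20 receivers per group, each sender writes 40000 messages, so a cost paid only on the final message would be far too small to explain the profile.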

>
> > Command to start hackbench:
> > #./hackbench 100 thread 2000
> >
> > process mode has no such regression.
> >
> > Profiling shows:
> > #perf top
> > samples pcnt function DSO
> > _______ _____ ________________________ ________________________
> >
> > 74415.00 29.9% put_pid [kernel.kallsyms]
> > 38395.00 15.4% unix_stream_recvmsg [kernel.kallsyms]
> > 34877.00 14.0% unix_stream_sendmsg [kernel.kallsyms]
> > 25204.00 10.1% pid_vnr [kernel.kallsyms]
> > 21864.00 8.8% unix_scm_to_skb [kernel.kallsyms]
> > 13637.00 5.5% cred_to_ucred [kernel.kallsyms]
> > 6520.00 2.6% unix_destruct_scm [kernel.kallsyms]
> > 4731.00 1.9% sock_alloc_send_pskb [kernel.kallsyms]
> >
> >
> > With 2.6.35, perf doesn't show put_pid/pid_vnr.
>
> Yes. 2.6.35 is imperfect and can report the wrong pid in some
> circumstances. I am surprised that nothing related to the reference count on
> struct cred shows up in your profiling traces.
>

> You are performing statistical sampling so I don't believe the
> percentage of hits per function is the same as the percentage of
> time per function.
Agreed. But from a performance-tuning point of view, the percentage of hits is enough
to help developers investigate.

I provided the 'perf top' data to help you debug, not to prove that your patches
cause the regression. We used bisect to locate them.

>
> Given that we are talking about a scheduler benchmark that is
> doing something rather artificial (inter thread communication via
> sockets), I don't know that this case is worth worrying about.
Good question. I don't know. How about the scenario below?
Start 2 processes, and have every process create many threads. The threads of process 1
communicate with the threads of process 2.

>
> > Alex Shi and I did a quick bisect and narrowed it down to the 2 patches below.
>
> That is a plausible result.

> The atomic reference counts may
> be causing you to ping pong cache lines between cpus.
Agree.

2010-08-19 20:25:26

by Eric W. Biederman

Subject: Re: hackbench regression with 2.6.36-rc1

"Zhang, Yanmin" <[email protected]> writes:

> On Wed, 2010-08-18 at 03:56 -0700, Eric W. Biederman wrote:
>> "Zhang, Yanmin" <[email protected]> writes:
>>
>> > Compared with 2.6.35, hackbench (thread mode) shows about an
>> > 80% regression on a dual-socket Nehalem machine and about a 90% regression
>> > on 4-socket Tigerton machines.
>>
>> That seems unfortunate.
>
>> Do you only show a regression in the pthread
>> hackbench test?
> Yes.
>
>> Do you show a regression when you use pipes?
> No.
>
>>
>> Does the size of the regression vary based on the number of loop
>> iterations?
> No. I tried 1000 and got a similar regression ratio.
> I chose a large loop count of 2000 because I wanted a stable
> result.
>
> It's easy to reproduce. We see it on almost all of our machines.
>
>> I ask because it appears that on the last message the
>> sender will exit, necessitating that the receiver put the sender's pid,
>> which should be atypical.
> I don't agree with that. With hackbench, each sender sends loops*receiver_num_per_group
> messages before exiting.
> In addition, 'perf top' shows put_pid as the hottest function right from the start
> of the hackbench run.

If increasing the number of loops does not improve the performance, the
hypothesis that it is only the last message that has the regression
is shot.

>> > Command to start hackbench:
>> > #./hackbench 100 thread 2000
>> >
>> > process mode has no such regression.
>> >
>> > Profiling shows:
>> > #perf top
>> > samples pcnt function DSO
>> > _______ _____ ________________________ ________________________
>> >
>> > 74415.00 29.9% put_pid [kernel.kallsyms]
>> > 38395.00 15.4% unix_stream_recvmsg [kernel.kallsyms]
>> > 34877.00 14.0% unix_stream_sendmsg [kernel.kallsyms]
>> > 25204.00 10.1% pid_vnr [kernel.kallsyms]
>> > 21864.00 8.8% unix_scm_to_skb [kernel.kallsyms]
>> > 13637.00 5.5% cred_to_ucred [kernel.kallsyms]
>> > 6520.00 2.6% unix_destruct_scm [kernel.kallsyms]
>> > 4731.00 1.9% sock_alloc_send_pskb [kernel.kallsyms]
>> >
>> >
>> > With 2.6.35, perf doesn't show put_pid/pid_vnr.
>>
>> Yes. 2.6.35 is imperfect and can report the wrong pid in some
>> circumstances. I am surprised that nothing related to the reference count on
>> struct cred shows up in your profiling traces.
>>
>
>> You are performing statistical sampling so I don't believe the
>> percentage of hits per function is the same as the percentage of
>> time per function.
> Agreed. But from a performance-tuning point of view, the percentage of hits is enough
> to help developers investigate.
>
> I provided the 'perf top' data to help you debug, not to prove that your patches
> cause the regression. We used bisect to locate them.

Sure. I was just trying to figure out how to explain why the creds
don't show a similar hit. I still don't have a complete explanation
for the profile, but the cred put and get are inline functions, so they
won't be present as distinct functions in the profile.

>> Given that we are talking about a scheduler benchmark that is
>> doing something rather artificial (inter thread communication via
>> sockets), I don't know that this case is worth worrying about.
> Good question. I don't know. How about the scenario below?
> Start 2 processes, and have every process create many threads. The threads of process 1
> communicate with the threads of process 2.

Maybe. A lot depends on the timing, and what it takes to trigger
the cross cpu cache line bounce.

And we still have pipes for ultimate performance. Grrr.

I will give it some thought to see if I can find a less expensive way
but I don't have any good ideas at the moment.

Eric