2021-12-10 05:37:58

by kernel test robot

[permalink] [raw]
Subject: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression



Greeting,

FYI, we noticed a -5.7% regression of will-it-scale.per_thread_ops due to commit:


commit: 054aa8d439b9185d4f5eb9a90282d1ce74772969 ("fget: check that the fd still exists after getting a ref to it")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory
with following parameters:

nr_task: 50%
mode: thread
test: poll2
cpufreq_governor: performance
ucode: 0x42e

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -6.0% regression |
| test machine | 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=16 |
| | test=poll2 |
| | ucode=0x42e |
+------------------+---------------------------------------------------------------------------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-ivb-2ep1/poll2/will-it-scale/0x42e

commit:
5f58da2bef ("Merge tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm")
054aa8d439 ("fget: check that the fd still exists after getting a ref to it")

5f58da2befa58edf 054aa8d439b9185d4f5eb9a9028
---------------- ---------------------------
%stddev %change %stddev
\ | \
6666720 -5.7% 6288280 will-it-scale.24.threads
277779 -5.7% 262011 will-it-scale.per_thread_ops
6666720 -5.7% 6288280 will-it-scale.workload
173.02 +1.0% 174.71 turbostat.CorWatt
27.16 ? 10% +4.3 31.51 ? 2% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.91 ? 10% +4.4 27.34 ? 2% perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
26.33 ? 10% +4.4 30.70 ? 2% perf-profile.children.cycles-pp.__fget_light
22.92 ? 10% +4.4 27.35 ? 2% perf-profile.children.cycles-pp.__fget_files
22.70 ? 10% +4.4 27.11 ? 2% perf-profile.self.cycles-pp.__fget_files
17234 ? 13% -33.0% 11544 ? 26% softirqs.CPU18.RCU
6001 ? 83% +418.0% 31085 ? 39% softirqs.CPU18.SCHED
14695 ? 4% -20.3% 11714 ? 12% softirqs.CPU27.RCU
15579 ? 16% -23.0% 12000 ? 10% softirqs.CPU35.RCU
10842 ? 9% +49.4% 16201 ? 14% softirqs.CPU42.RCU
37992 ? 13% -66.8% 12604 ?101% softirqs.CPU42.SCHED
8899 ? 35% +87.0% 16640 ? 41% softirqs.CPU47.SCHED
5611 ? 47% -53.2% 2624 ? 35% interrupts.CPU12.NMI:Non-maskable_interrupts
5611 ? 47% -53.2% 2624 ? 35% interrupts.CPU12.PMI:Performance_monitoring_interrupts
5372 ? 42% -57.6% 2277 ? 49% interrupts.CPU13.NMI:Non-maskable_interrupts
5372 ? 42% -57.6% 2277 ? 49% interrupts.CPU13.PMI:Performance_monitoring_interrupts
6058 ? 38% -52.6% 2871 ? 87% interrupts.CPU18.NMI:Non-maskable_interrupts
6058 ? 38% -52.6% 2871 ? 87% interrupts.CPU18.PMI:Performance_monitoring_interrupts
284.33 ? 14% -65.3% 98.67 ? 97% interrupts.CPU18.RES:Rescheduling_interrupts
1323 ? 6% -20.4% 1053 ? 26% interrupts.CPU22.CAL:Function_call_interrupts
2175 ? 47% +176.5% 6015 ? 38% interrupts.CPU42.NMI:Non-maskable_interrupts
2175 ? 47% +176.5% 6015 ? 38% interrupts.CPU42.PMI:Performance_monitoring_interrupts
102.33 ?136% +233.9% 341.67 ? 48% interrupts.CPU42.TLB:TLB_shootdowns
312.83 ? 34% -69.3% 96.00 ?110% interrupts.CPU43.TLB:TLB_shootdowns
5386 ? 36% -55.6% 2393 ? 27% interrupts.CPU45.NMI:Non-maskable_interrupts
5386 ? 36% -55.6% 2393 ? 27% interrupts.CPU45.PMI:Performance_monitoring_interrupts
0.20 -0.0 0.19 perf-stat.i.branch-miss-rate%
44316621 -5.5% 41893733 perf-stat.i.branch-misses
0.74 -12.9% 0.65 perf-stat.i.cpi
0.19 ? 3% -0.0 0.14 ? 8% perf-stat.i.dTLB-load-miss-rate%
2.296e+10 +25.4% 2.88e+10 perf-stat.i.dTLB-loads
0.10 -0.0 0.08 perf-stat.i.dTLB-store-miss-rate%
14444127 -5.8% 13613178 ? 2% perf-stat.i.dTLB-store-misses
1.458e+10 +21.9% 1.777e+10 perf-stat.i.dTLB-stores
7406987 -4.6% 7067379 perf-stat.i.iTLB-load-misses
9.764e+10 +14.6% 1.118e+11 perf-stat.i.instructions
13170 +20.0% 15808 perf-stat.i.instructions-per-iTLB-miss
1.35 +14.9% 1.55 perf-stat.i.ipc
1277 +15.0% 1469 perf-stat.i.metric.M/sec
0.19 -0.0 0.17 perf-stat.overall.branch-miss-rate%
0.74 -13.0% 0.64 perf-stat.overall.cpi
0.19 ? 3% -0.0 0.14 ? 8% perf-stat.overall.dTLB-load-miss-rate%
0.10 -0.0 0.08 perf-stat.overall.dTLB-store-miss-rate%
13182 +20.1% 15826 perf-stat.overall.instructions-per-iTLB-miss
1.35 +14.9% 1.55 perf-stat.overall.ipc
4411089 +21.5% 5360082 perf-stat.overall.path-length
44188606 -5.4% 41787951 perf-stat.ps.branch-misses
2.289e+10 +25.4% 2.871e+10 perf-stat.ps.dTLB-loads
14395868 -5.8% 13567459 ? 2% perf-stat.ps.dTLB-store-misses
1.453e+10 +21.9% 1.771e+10 perf-stat.ps.dTLB-stores
7382256 -4.6% 7043603 perf-stat.ps.iTLB-load-misses
9.731e+10 +14.6% 1.115e+11 perf-stat.ps.instructions
2.941e+13 +14.6% 3.371e+13 perf-stat.total.instructions


***************************************************************************************************
lkp-ivb-2ep1: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz with 112G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/16/debian-10.4-x86_64-20200603.cgz/lkp-ivb-2ep1/poll2/will-it-scale/0x42e

commit:
5f58da2bef ("Merge tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm")
054aa8d439 ("fget: check that the fd still exists after getting a ref to it")

5f58da2befa58edf 054aa8d439b9185d4f5eb9a9028
---------------- ---------------------------
%stddev %change %stddev
\ | \
4461267 -6.0% 4194062 will-it-scale.16.threads
278828 -6.0% 262128 will-it-scale.per_thread_ops
4461267 -6.0% 4194062 will-it-scale.workload
105.14 ? 56% +129.9% 241.71 ? 18% interrupts.CPU15.RES:Rescheduling_interrupts
162.86 ? 46% -75.6% 39.71 ? 69% interrupts.CPU39.RES:Rescheduling_interrupts
26439 ? 38% -69.0% 8190 ? 43% softirqs.CPU15.SCHED
18393 ? 51% +96.6% 36159 ? 10% softirqs.CPU39.SCHED
15.89 ? 45% -9.7 6.18 ? 95% turbostat.C1E%
49.22 ? 13% +8.3 57.51 ? 7% turbostat.C6%
0.23 -0.0 0.21 ? 2% perf-stat.i.branch-miss-rate%
34323793 -4.1% 32902170 perf-stat.i.branch-misses
0.76 -12.5% 0.66 perf-stat.i.cpi
1.545e+10 +25.2% 1.934e+10 perf-stat.i.dTLB-loads
9.781e+09 +21.6% 1.19e+10 perf-stat.i.dTLB-stores
5312601 -5.1% 5039573 perf-stat.i.iTLB-load-misses
6.538e+10 +14.5% 7.489e+10 perf-stat.i.instructions
12305 +20.6% 14845 perf-stat.i.instructions-per-iTLB-miss
1.33 +14.4% 1.52 perf-stat.i.ipc
857.08 +14.9% 984.90 perf-stat.i.metric.M/sec
0.22 -0.0 0.21 perf-stat.overall.branch-miss-rate%
0.75 -12.6% 0.66 perf-stat.overall.cpi
12310 +20.7% 14859 perf-stat.overall.instructions-per-iTLB-miss
1.33 +14.4% 1.52 perf-stat.overall.ipc
4411591 +21.9% 5379108 perf-stat.overall.path-length
34213688 -4.1% 32797737 perf-stat.ps.branch-misses
1.539e+10 +25.2% 1.927e+10 perf-stat.ps.dTLB-loads
9.749e+09 +21.6% 1.186e+10 perf-stat.ps.dTLB-stores
5294998 -5.1% 5022765 perf-stat.ps.iTLB-load-misses
6.516e+10 +14.5% 7.464e+10 perf-stat.ps.instructions
1.968e+13 +14.6% 2.256e+13 perf-stat.total.instructions





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (10.99 kB)
config-5.16.0-rc3-00228-g054aa8d439b9 (169.48 kB)
job-script (7.73 kB)
job.yaml (5.24 kB)
reproduce (335.00 B)
Download all attachments

2021-12-10 18:34:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Thu, Dec 9, 2021 at 9:38 PM kernel test robot <[email protected]> wrote:
>
> FYI, we noticed a -5.7% regression of will-it-scale.per_thread_ops due to commit:
> 054aa8d439b9 ("fget: check that the fd still exists after getting a ref to it")

Well, some downside of the new checks was expected, that's just much
more than I really like or would have thought.

But it's exactly where you'd expect:

> 27.16 ± 10% +4.3 31.51 ± 2% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 22.91 ± 10% +4.4 27.34 ± 2% perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
> 26.33 ± 10% +4.4 30.70 ± 2% perf-profile.children.cycles-pp.__fget_light
> 22.92 ± 10% +4.4 27.35 ± 2% perf-profile.children.cycles-pp.__fget_files
> 22.70 ± 10% +4.4 27.11 ± 2% perf-profile.self.cycles-pp.__fget_files

although there's odd spikes in dTLB-loads etc.

I checked whether it's some unexpected code generation issue, but the
new "re-check file table after refcount update" really looks very
cheap when I look at what gcc generates, there's nothing really
unexpected there.

What did change was:

(a) some branches go other ways, which might well affect branch
prediction and just be unlucky. It might be that just marking the
mismatch case "unlikely()" will help.

(b) the obvious few new instructions (re-load and check file table
pointer, re-load and check file pointer)

(c) that __fget_files() function is now no longer a leaf function in
a simple config case, since it calls "fput_many" in the error case.

And that (c) is worth mentioning simply because it means that the
function goes from not having any stack frame at all, to having to
save/restore four registers. So now it has the usual push/pop
sequences.

It may also be that the test-case actually does a lot of threaded
open/close/poll, and either actually triggers the re-lookup looping
case (unlikely) or just sees a lot of cacheline bouncing that now got
worse due to the re-check of the file pointer.

So this regression looks real, and the issue seems to be that
__fget_files() really is _that_ important for this do_sys_poll()
benchmark, and even just the handful of extra instructions end up
being meaningful.

Oliver - I'm attaching the obvious "unlikely9)" oneliner in case it's
just "gcc thought the retry loop was the common case" issue and bad
branch prediction.

And it would perhaps be interesting to get an actual instruction-level
profile of that __fget_files() thing for that benchmark, if that
pinpoints exactly what is going on and in case that would be easy to
get on that machine.

Because it might just be truly horrendously bad luck, with the 32-byte
stack frame meaning that the kernel stack goes one more page down
(just jhandwaving from the dTLB number spike), and this all being just
random bad luck on that particular benchmark.

Of course, the thing about poll() is that for that case, we *don't*
really need the "re-check the file descriptor" code at all, since the
resulting fd isn't going to be installed as a new fd, and it doesn't
matter for the socket garbage collector logic.

So maybe it was a mistake to put that re-check in the generic fdget()
code - yes, it should be cheap, but it's also some of the most hot
code in the kernel on some loads.

But if we move it elsewhere, we'd need to come up with some list of
"these cases need it". Some are obvious: dup, dup2, unix domain file
passing. It's the non-obvious ones I'd worry about.

Anybody?

Linus


Attachments:
patch.diff (512.00 B)

2021-12-10 20:30:18

by Jann Horn

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 7:34 PM Linus Torvalds
<[email protected]> wrote:
> On Thu, Dec 9, 2021 at 9:38 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -5.7% regression of will-it-scale.per_thread_ops due to commit:
> > 054aa8d439b9 ("fget: check that the fd still exists after getting a ref to it")
>
> Well, some downside of the new checks was expected, that's just much
> more than I really like or would have thought.
>
> But it's exactly where you'd expect:
>
> > 27.16 ± 10% +4.3 31.51 ± 2% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 22.91 ± 10% +4.4 27.34 ± 2% perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
> > 26.33 ± 10% +4.4 30.70 ± 2% perf-profile.children.cycles-pp.__fget_light
> > 22.92 ± 10% +4.4 27.35 ± 2% perf-profile.children.cycles-pp.__fget_files
> > 22.70 ± 10% +4.4 27.11 ± 2% perf-profile.self.cycles-pp.__fget_files
>
> although there's odd spikes in dTLB-loads etc.
>
> I checked whether it's some unexpected code generation issue, but the
> new "re-check file table after refcount update" really looks very
> cheap when I look at what gcc generates, there's nothing really
> unexpected there.
>
> What did change was:
>
> (a) some branches go other ways, which might well affect branch
> prediction and just be unlucky. It might be that just marking the
> mismatch case "unlikely()" will help.
>
> (b) the obvious few new instructions (re-load and check file table
> pointer, re-load and check file pointer)
>
> (c) that __fget_files() function is now no longer a leaf function in
> a simple config case, since it calls "fput_many" in the error case.
>
> And that (c) is worth mentioning simply because it means that the
> function goes from not having any stack frame at all, to having to
> save/restore four registers. So now it has the usual push/pop
> sequences.
>
> It may also be that the test-case actually does a lot of threaded
> open/close/poll, and either actually triggers the re-lookup looping
> case (unlikely) or just sees a lot of cacheline bouncing that now got
> worse due to the re-check of the file pointer.
>
> So this regression looks real, and the issue seems to be that
> __fget_files() really is _that_ important for this do_sys_poll()
> benchmark, and even just the handful of extra instructions end up
> being meaningful.
>
> Oliver - I'm attaching the obvious "unlikely9)" oneliner in case it's
> just "gcc thought the retry loop was the common case" issue and bad
> branch prediction.
>
> And it would perhaps be interesting to get an actual instruction-level
> profile of that __fget_files() thing for that benchmark, if that
> pinpoints exactly what is going on and in case that would be easy to
> get on that machine.
>
> Because it might just be truly horrendously bad luck, with the 32-byte
> stack frame meaning that the kernel stack goes one more page down
> (just jhandwaving from the dTLB number spike), and this all being just
> random bad luck on that particular benchmark.
>
> Of course, the thing about poll() is that for that case, we *don't*
> really need the "re-check the file descriptor" code at all, since the
> resulting fd isn't going to be installed as a new fd, and it doesn't
> matter for the socket garbage collector logic.
>
> So maybe it was a mistake to put that re-check in the generic fdget()
> code - yes, it should be cheap, but it's also some of the most hot
> code in the kernel on some loads.
>
> But if we move it elsewhere, we'd need to come up with some list of
> "these cases need it". Some are obvious: dup, dup2, unix domain file
> passing. It's the non-obvious ones I'd worry about.

The thing is, even though my proof of concept used dup() to put the
file in the fd table again, that's not strictly necessary. Instead of
using dup() for the race, you could also use recvmsg() directly with
the right timing.

And the recvmsg() path is wired up to a ton of syscalls, including
read(), I believe? So you'd have to special-case read(), readv(),
recv(), recvmsg(), io_submit(), splice(), the io_uring stuff, and so
on. And I think read() is probably one of the hottest syscalls related
to file I/O?


Oh, and I just realized that your patch probably actually also fixes
an issue entirely unrelated to unix sockets. __fdget_pos() does this:

unsigned long __fdget_pos(unsigned int fd)
{
unsigned long v = __fdget(fd);
struct file *file = (struct file *)(v & ~3);

if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
if (file_count(file) > 1) {
v |= FDPUT_POS_UNLOCK;
mutex_lock(&file->f_pos_lock);
}
}
return v;
}

and with the same fget race, I think you could get past that
file_count(file) check?

FMODE_ATOMIC_POS is always set for regular files and directories,
which means that this is what protects getdents()'s access to the file
offset when it calls into f_op->iterate_shared():

SYSCALL_DEFINE3(getdents, unsigned int, fd,
struct linux_dirent __user *, dirent, unsigned int, count)
{
struct fd f;
[...]
f = fdget_pos(fd);
if (!f.file)
return -EBADF;

error = iterate_dir(f.file, &buf.ctx);
if (error >= 0)
error = buf.error;
[...]
fdput_pos(f);
return error;
}

int iterate_dir(struct file *file, struct dir_context *ctx)
{
struct inode *inode = file_inode(file);
bool shared = false;
int res = -ENOTDIR;
if (file->f_op->iterate_shared)
shared = true;
else if (!file->f_op->iterate)
goto out;
[...]
if (shared)
res = down_read_killable(&inode->i_rwsem);
else
[...]
if (res)
goto out;
[...]
if (!IS_DEADDIR(inode)) {
ctx->pos = file->f_pos;
if (shared)
res = file->f_op->iterate_shared(file, ctx);
else
res = file->f_op->iterate(file, ctx);
file->f_pos = ctx->pos;
[...]
}
if (shared)
inode_unlock_shared(inode);
else
[...]
out:
return res;
}

And the ext4 implementation of ->iterate_shared(), which doesn't seem
to be taking any exclusive locks, then also reads and writes
->f_version and relies on having that in sync with ->f_pos. (At least
in the inline storage case it looks that way, I haven't looked at the
rest.) So I think that without your fix, it might also be possible to
get ext4 to read a struct ext4_dir_entry_2 from a misaligned offset? I
don't think that would lead to memory corruption, just to getting
bogus data from getdents() and/or making ext4 think that the
filesystem is corrupted, but it's not exactly great...

2021-12-10 21:26:18

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 12:30 PM Jann Horn <[email protected]> wrote:
>
> Oh, and I just realized that your patch probably actually also fixes
> an issue entirely unrelated to unix sockets. __fdget_pos() does this:

Hmm. I was going to say that you're wrong, because that case doesn't
actually have that "offset" thing that the unix domain GC code has,
and a "true zero" reference count is special and will never come back
to life.

But I think you're right - the "close and resurrect" situation wrt the
file count can happen _after_ the __fdget() in __fdget_pos() (where we
have that implicit offset of our own ref), and thus show a false case
of "we're the only user".

So I think you've convinced me - doing it in __fget_files() was the
right thing to do, but I really don't like that 5% regression.

Maybe it's purely on that artificial benchmark, but a multi-threaded
poll loop doesn't sound super-unusual (I think a single-threaded one
is already protected from this all by our "__fget_light()" logic).

Sadly, looking at my gcc code generation, adding that "unlikely()"
does move the "fput_many()" call out to it's own out-of-line code
section, but gcc will still end up doing the stack frame around the
whole function.

So if it's all due to just extra code and stack references due to the
now necessary stack-frame, it doesn't look obvious how to improve on
that.

We could make a special light-weight version of files_lookup_fd_raw(),
I guess. We don't need the *whole* "look it up again". We don't need
to re-check the array bounds, and we don't need to do the nospec
lookup - we would have triggered a NULL file pointer if that happened
the first time around.

So all we'd need to do is "check that fdt is the same, and check that
fdt->fd[fd] is the same".

Linus

2021-12-10 21:59:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 1:25 PM Linus Torvalds
<[email protected]> wrote:
>
> We could make a special light-weight version of files_lookup_fd_raw(),
> I guess. We don't need the *whole* "look it up again". We don't need
> to re-check the array bounds, and we don't need to do the nospec
> lookup - we would have triggered a NULL file pointer if that happened
> the first time around.
>
> So all we'd need to do is "check that fdt is the same, and check that
> fdt->fd[fd] is the same".

This is an ENTIRELY UNTESTED patch to do that.

It basically rewrites __fget_files() from scratch: it really wants to
do the fd array lookup by hand, in order to cache the intermediate fdt
pointer, and in order to cache the intermediate speculation-safe fd
array index etc.

It's not a very complicated function, and rewriting it actually cleans
up the loop to not need the ugly goto.

I made it use a helper wrapper function for the rcu locking, so that
the "meat" of the function can just use plain "return NULL" for the
error cases.

However, not only is it entirely untested, this rewrite also means
that gcc has now decided that the result is so simple and clear that
it will inline it into all the callers.

I guess that's a good sign - writing the code in a way that makes the
compiler say "now it's so trivial that it should be inlined" is
certainly not a bad thing. But it makes it hard to really compare the
asm.

I did try a version with "noinline" just to make it more comparable,
and hey, it all looked sane to me there too.

I added more comments about what is going on.

Again - this is UNTESTED. I've looked at the code, I've looked at the
diff, and I've looked at the code it generates. It all looks fine to
me. But I've looked at it so much that I suspect that I'd be entirely
blind to any completely obvious bug by now.

Comments?

Oliver, does this make any difference in the performance department?

Linus


Attachments:
patch.diff (2.28 kB)

2021-12-10 23:30:04

by Jann Horn

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 10:59 PM Linus Torvalds
<[email protected]> wrote:
>
> On Fri, Dec 10, 2021 at 1:25 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > We could make a special light-weight version of files_lookup_fd_raw(),
> > I guess. We don't need the *whole* "look it up again". We don't need
> > to re-check the array bounds, and we don't need to do the nospec
> > lookup - we would have triggered a NULL file pointer if that happened
> > the first time around.
> >
> > So all we'd need to do is "check that fdt is the same, and check that
> > fdt->fd[fd] is the same".
>
> This is an ENTIRELY UNTESTED patch to do that.
>
> It basically rewrites __fget_files() from scratch: it really wants to
> do the fd array lookup by hand, in order to cache the intermediate fdt
> pointer, and in order to cache the intermediate speculation-safe fd
> array index etc.
>
> It's not a very complicated function, and rewriting it actually cleans
> up the loop to not need the ugly goto.
>
> I made it use a helper wrapper function for the rcu locking, so that
> the "meat" of the function can just use plain "return NULL" for the
> error cases.
>
> However, not only is it entirely untested, this rewrite also means
> that gcc has now decided that the result is so simple and clear that
> it will inline it into all the callers.
>
> I guess that's a good sign - writing the code in a way that makes the
> compiler say "now it's so trivial that it should be inlined" is
> certainly not a bad thing. But it makes it hard to really compare the
> asm.
>
> I did try a version with "noinline" just to make it more comparable,
> and hey, it all looked sane to me there too.
>
> I added more comments about what is going on.
>
> Again - this is UNTESTED. I've looked at the code, I've looked at the
> diff, and I've looked at the code it generates. It all looks fine to
> me. But I've looked at it so much that I suspect that I'd be entirely
> blind to any completely obvious bug by now.
>
> Comments?

One nit: The original implementation is using rcu_dereference_raw()
because it can run in different contexts, but here plain
rcu_dereference() would probably be more appropriate?

(I was wondering for a bit whether we should also change the
get_mm_exe_file() path, but I guess that's fine because it can only
ever happen for regular executable files and currently there's also no
path to pull out the mm->exe_file and use it for some other syscall?)

> Oliver, does this make any difference in the performance department?
>
> Linus

2021-12-11 01:03:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 3:30 PM Jann Horn <[email protected]> wrote:
>
> One nit: The original implementation is using rcu_dereference_raw()
> because it can run in different contexts, but here plain
> rcu_dereference() would probably be more appropriate?

Well, I actually did that somewhat on purpose.

The RCU locking is right there, so doing the checking seems pointless.

That's particularly true since the whole point of the patch was "this
is truly critical, let's rewrite it to generate basically perfect
code"..

Linus

2021-12-11 01:33:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Fri, Dec 10, 2021 at 1:59 PM Linus Torvalds
<[email protected]> wrote:
>
> This is an ENTIRELY UNTESTED patch to do that.

Oh, and I guess it's somewhat tested now. I've been running that patch
for the last couple of hours, and it seems to work fine.

Not that I actually stressed it in any particular way, but at least it
doesn't cause any obvious problems. So it's probably fine.

Famous last words.

Linus

2021-12-13 08:33:58

by Carel Si

[permalink] [raw]
Subject: Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

Hi Linus,

On Fri, Dec 10, 2021 at 01:59:08PM -0800, Linus Torvalds wrote:
> On Fri, Dec 10, 2021 at 1:25 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > We could make a special light-weight version of files_lookup_fd_raw(),
> > I guess. We don't need the *whole* "look it up again". We don't need
> > to re-check the array bounds, and we don't need to do the nospec
> > lookup - we would have triggered a NULL file pointer if that happened
> > the first time around.
> >
> > So all we'd need to do is "check that fdt is the same, and check that
> > fdt->fd[fd] is the same".
>
> This is an ENTIRELY UNTESTED patch to do that.
>
> It basically rewrites __fget_files() from scratch: it really wants to
> do the fd array lookup by hand, in order to cache the intermediate fdt
> pointer, and in order to cache the intermediate speculation-safe fd
> array index etc.
>
> It's not a very complicated function, and rewriting it actually cleans
> up the loop to not need the ugly goto.
>
> I made it use a helper wrapper function for the rcu locking, so that
> the "meat" of the function can just use plain "return NULL" for the
> error cases.
>
> However, not only is it entirely untested, this rewrite also means
> that gcc has now decided that the result is so simple and clear that
> it will inline it into all the callers.
>
> I guess that's a good sign - writing the code in a way that makes the
> compiler say "now it's so trivial that it should be inlined" is
> certainly not a bad thing. But it makes it hard to really compare the
> asm.
>
> I did try a version with "noinline" just to make it more comparable,
> and hey, it all looked sane to me there too.
>
> I added more comments about what is going on.
>
> Again - this is UNTESTED. I've looked at the code, I've looked at the
> diff, and I've looked at the code it generates. It all looks fine to
> me. But I've looked at it so much that I suspect that I'd be entirely
> blind to any completely obvious bug by now.
>
> Comments?
>
> Oliver, does this make any difference in the performance department?

We tested your patch, the performance regression has recovered from -5.7% to
-0.4%, thanks.

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-ivb-2ep1/poll2/will-it-scale/0x42e

commit:
5f58da2bef ("Merge tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm")
054aa8d439 ("fget: check that the fd still exists after getting a ref to it")
242d36be3a ("fixup-for-054aa8d439")

5f58da2befa58edf 054aa8d439b9185d4f5eb9a9028 242d36be3ad02063a1cf6df7802
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
6666720 -5.7% 6288280 -0.4% 6639231 will-it-scale.24.threads
277779 -5.7% 262011 -0.4% 276634 will-it-scale.per_thread_ops
6666720 -5.7% 6288280 -0.4% 6639231 will-it-scale.workload
0.09 ? 59% -24.7% 0.07 ? 16% +5.0% 0.10 ? 38% perf-stat.overall.MPKI
0.19 -0.0 0.17 +0.0 0.19 perf-stat.overall.branch-miss-rate%
8.12 ?118% -4.7 3.43 ? 14% +2.6 10.68 ?118% perf-stat.overall.cache-miss-rate%
0.74 -13.0% 0.64 -4.6% 0.71 perf-stat.overall.cpi
234529 ? 42% +15.5% 270984 -20.6% 186280 ? 55% perf-stat.overall.cycles-between-cache-misses
0.19 ? 3% -0.0 0.14 ? 8% -0.0 0.16 ? 7% perf-stat.overall.dTLB-load-miss-rate%
0.10 -0.0 0.08 -0.0 0.08 perf-stat.overall.dTLB-store-miss-rate%
95.61 -0.1 95.50 -0.4 95.22 perf-stat.overall.iTLB-load-miss-rate%
13182 +20.1% 15826 +0.4% 13238 ? 2% perf-stat.overall.instructions-per-iTLB-miss
1.35 +14.9% 1.55 +4.8% 1.42 perf-stat.overall.ipc
46.17 ? 3% -0.8 45.40 +0.2 46.40 ? 4% perf-stat.overall.node-load-miss-rate%
34.45 ? 10% -0.3 34.18 ? 3% -1.9 32.53 ? 12% perf-stat.overall.node-store-miss-rate%
4411089 +21.5% 5360082 +5.3% 4644112 perf-stat.overall.path-length
2.37e+10 +0.8% 2.388e+10 -4.3% 2.267e+10 perf-stat.ps.branch-instructions
44188606 -5.4% 41787951 -1.2% 43679032 perf-stat.ps.branch-misses
1254452 ?177% -78.9% 264997 ? 2% +15.0% 1442189 ?160% perf-stat.ps.cache-misses
9152124 ? 58% -13.4% 7923752 ? 17% +9.6% 10028813 ? 36% perf-stat.ps.cache-references
1317 ? 2% +0.5% 1323 ? 3% +2.4% 1348 ? 2% perf-stat.ps.context-switches
47842 -0.0% 47841 +0.0% 47842 perf-stat.ps.cpu-clock
7.202e+10 -0.3% 7.179e+10 -0.4% 7.175e+10 perf-stat.ps.cpu-cycles
72.64 +0.1% 72.68 ? 2% +1.7% 73.85 perf-stat.ps.cpu-migrations
42723891 ? 3% -7.1% 39692131 ? 8% -0.5% 42496149 ? 6% perf-stat.ps.dTLB-load-misses
2.289e+10 +25.4% 2.871e+10 +17.9% 2.698e+10 perf-stat.ps.dTLB-loads
14395868 -5.8% 13567459 ? 2% -1.2% 14222816 perf-stat.ps.dTLB-store-misses
1.453e+10 +21.9% 1.771e+10 +16.7% 1.695e+10 perf-stat.ps.dTLB-stores
7382256 -4.6% 7043603 +4.0% 7680167 ? 2% perf-stat.ps.iTLB-load-misses
339276 ? 2% -2.1% 332065 ? 4% +13.3% 384551 ? 29% perf-stat.ps.iTLB-loads
9.731e+10 +14.6% 1.115e+11 +4.4% 1.016e+11 perf-stat.ps.instructions
0.47 ? 14% -9.2% 0.43 ? 8% -6.6% 0.44 ? 12% perf-stat.ps.major-faults
2699 -0.3% 2690 +0.2% 2705 perf-stat.ps.minor-faults
788197 ?193% -86.7% 105083 ? 5% +13.3% 892699 ?174% perf-stat.ps.node-load-misses
817124 ?189% -84.5% 126259 ? 3% +13.4% 926398 ?170% perf-stat.ps.node-loads
248114 ?168% -74.7% 62837 ? 5% +8.2% 268474 ?160% perf-stat.ps.node-store-misses
434936 ?163% -72.2% 120915 ? 3% +9.9% 477830 ?149% perf-stat.ps.node-stores
2700 -0.3% 2691 +0.2% 2705 perf-stat.ps.page-faults
47842 -0.0% 47841 +0.0% 47842 perf-stat.ps.task-clock
2.941e+13 +14.6% 3.371e+13 +4.9% 3.083e+13 perf-stat.total.instructions
28.74 ? 23% -1.9 26.84 ? 6% +6.1 34.84 ? 20% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
27.82 ? 23% -1.9 25.93 ? 5% +5.7 33.54 ? 20% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.82 ? 23% -1.9 25.93 ? 5% +5.7 33.54 ? 20% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary
27.83 ? 23% -1.9 25.94 ? 5% +5.7 33.55 ? 20% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.83 ? 23% -1.9 25.94 ? 5% +5.7 33.55 ? 20% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.83 ? 23% -1.9 25.94 ? 5% +5.7 33.55 ? 20% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
28.81 ? 23% -1.9 26.93 ? 6% +6.0 34.85 ? 20% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
19.12 ? 9% -1.7 17.39 ? 2% -2.8 16.33 ? 11% perf-profile.calltrace.cycles-pp.fput_many.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.26 ?100% -0.2 0.10 ?223% -0.2 0.09 ?223% perf-profile.calltrace.cycles-pp.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.88 ? 9% -0.1 2.76 ? 2% -0.3 2.62 ? 11% perf-profile.calltrace.cycles-pp.testcase
3.27 ? 9% -0.1 3.18 ? 2% -0.2 3.06 ? 11% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
1.70 ? 9% -0.1 1.63 ? 2% -0.1 1.57 ? 11% perf-profile.calltrace.cycles-pp.__fdget.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.70 ? 10% -0.1 1.64 ? 2% -0.2 1.52 ? 10% perf-profile.calltrace.cycles-pp.fput.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.04 ? 8% -0.1 0.98 ? 4% -0.0 1.00 ? 10% perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.64 ? 9% -0.1 1.59 ? 3% -0.2 1.48 ? 12% perf-profile.calltrace.cycles-pp.__entry_text_start.__poll
1.69 ? 9% -0.0 1.65 ? 3% -0.1 1.57 ? 12% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.__poll
1.11 ? 9% -0.0 1.08 ? 4% -0.1 1.01 ? 11% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
1.48 ? 8% -0.0 1.45 ? 4% -0.1 1.34 ? 10% perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.94 ? 57% -0.0 0.92 ? 54% +0.4 1.31 ? 46% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% +0.4 1.31 ? 46% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% +0.4 1.31 ? 46% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% +0.4 1.31 ? 46% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_kernel
0.94 ? 57% -0.0 0.92 ? 54% +0.4 1.31 ? 46% perf-profile.calltrace.cycles-pp.start_kernel.secondary_startup_64_no_verify
0.54 ? 45% +0.1 0.59 ? 9% -0.2 0.29 ?100% perf-profile.calltrace.cycles-pp.kfree.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
67.66 ? 9% +2.0 69.69 ? 2% -5.7 61.98 ? 11% perf-profile.calltrace.cycles-pp.__poll
63.66 ? 9% +2.2 65.82 ? 2% -5.4 58.30 ? 10% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__poll
63.48 ? 9% +2.2 65.64 ? 2% -5.3 58.14 ? 11% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
59.85 ? 9% +2.3 62.12 ? 2% -5.1 54.77 ? 10% perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
57.74 ? 9% +2.4 60.09 ? 2% -4.9 52.86 ? 10% perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
27.16 ? 10% +4.3 31.51 ? 2% -1.1 26.04 ? 10% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.91 ? 10% +4.4 27.34 ? 2% -22.9 0.00 perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
27.83 ? 23% -1.9 25.94 ? 5% +5.7 33.55 ? 20% perf-profile.children.cycles-pp.start_secondary
28.80 ? 23% -1.9 26.92 ? 6% +6.0 34.85 ? 20% perf-profile.children.cycles-pp.cpuidle_enter
28.80 ? 23% -1.9 26.92 ? 6% +6.0 34.85 ? 20% perf-profile.children.cycles-pp.cpuidle_enter_state
28.81 ? 23% -1.9 26.93 ? 6% +6.0 34.85 ? 20% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
28.81 ? 23% -1.9 26.93 ? 6% +6.0 34.85 ? 20% perf-profile.children.cycles-pp.cpu_startup_entry
28.81 ? 23% -1.9 26.93 ? 6% +6.0 34.85 ? 20% perf-profile.children.cycles-pp.do_idle
28.78 ? 23% -1.9 26.91 ? 6% +6.1 34.84 ? 20% perf-profile.children.cycles-pp.intel_idle
18.28 ? 9% -1.7 16.59 ? 2% -2.7 15.57 ? 11% perf-profile.children.cycles-pp.fput_many
2.90 ? 9% -0.1 2.76 ? 2% -0.3 2.64 ? 11% perf-profile.children.cycles-pp.testcase
3.29 ? 9% -0.1 3.20 ? 2% -0.2 3.08 ? 11% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
1.69 ? 9% -0.1 1.63 ? 2% -0.1 1.56 ? 11% perf-profile.children.cycles-pp.__fdget
1.70 ? 9% -0.1 1.64 ? 2% -0.2 1.53 ? 10% perf-profile.children.cycles-pp.fput
1.82 ? 9% -0.1 1.76 ? 3% -0.2 1.66 ? 12% perf-profile.children.cycles-pp.__entry_text_start
1.07 ? 9% -0.1 1.02 ? 4% -0.0 1.04 ? 10% perf-profile.children.cycles-pp.__check_object_size
1.90 ? 9% -0.0 1.86 ? 2% -0.1 1.77 ? 12% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.50 ? 8% -0.0 1.47 ? 4% -0.1 1.37 ? 10% perf-profile.children.cycles-pp._copy_from_user
1.12 ? 9% -0.0 1.09 ? 4% -0.1 1.02 ? 11% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
0.30 ? 8% -0.0 0.26 ? 11% +0.0 0.31 ? 20% perf-profile.children.cycles-pp.__virt_addr_valid
0.12 ? 8% -0.0 0.09 ? 11% -0.0 0.09 ? 13% perf-profile.children.cycles-pp.poll@plt
0.04 ? 71% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.seq_read_iter
0.17 ? 26% -0.0 0.14 ? 32% -0.0 0.13 ? 23% perf-profile.children.cycles-pp.cmd_sched
0.17 ? 26% -0.0 0.14 ? 32% -0.0 0.13 ? 23% perf-profile.children.cycles-pp.cmd_record
0.04 ? 71% -0.0 0.02 ?144% -0.0 0.01 ?223% perf-profile.children.cycles-pp.ksys_read
0.04 ? 71% -0.0 0.02 ?144% -0.0 0.01 ?223% perf-profile.children.cycles-pp.vfs_read
0.61 ? 13% -0.0 0.59 ? 9% -0.1 0.51 ? 13% perf-profile.children.cycles-pp.kfree
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.13 ? 20% perf-profile.children.cycles-pp.__libc_start_main
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.13 ? 20% perf-profile.children.cycles-pp.main
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.13 ? 20% perf-profile.children.cycles-pp.run_builtin
0.02 ?142% -0.0 0.00 -0.0 0.00 perf-profile.children.cycles-pp.ksys_write
0.02 ?142% -0.0 0.00 -0.0 0.00 perf-profile.children.cycles-pp.vfs_write
0.02 ?142% -0.0 0.00 -0.0 0.00 perf-profile.children.cycles-pp.new_sync_write
0.04 ? 45% -0.0 0.02 ? 99% -0.0 0.02 ?141% perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
0.13 ? 10% -0.0 0.12 ? 6% -0.0 0.11 ? 6% perf-profile.children.cycles-pp.kmalloc_slab
0.02 ? 99% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.seq_read
0.02 ? 99% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.__libc_read
0.15 ? 26% -0.0 0.14 ? 31% -0.0 0.12 ? 20% perf-profile.children.cycles-pp.record__finish_output
0.15 ? 26% -0.0 0.14 ? 31% -0.0 0.12 ? 20% perf-profile.children.cycles-pp.perf_session__process_events
0.06 ? 46% -0.0 0.04 ? 73% -0.0 0.03 ?100% perf-profile.children.cycles-pp.machines__deliver_event
0.14 ? 27% -0.0 0.12 ? 32% -0.0 0.11 ? 20% perf-profile.children.cycles-pp.process_simple
0.09 ? 30% -0.0 0.08 ? 34% -0.0 0.07 ? 18% perf-profile.children.cycles-pp.perf_session__process_user_event
0.05 ? 47% -0.0 0.04 ? 71% -0.0 0.01 ?223% perf-profile.children.cycles-pp.exc_page_fault
0.10 ? 24% -0.0 0.08 ? 30% -0.0 0.07 ? 15% perf-profile.children.cycles-pp.perf_session__deliver_event
0.02 ?142% -0.0 0.01 ?223% -0.0 0.01 ?223% perf-profile.children.cycles-pp.poll_select_set_timeout
0.09 ? 24% -0.0 0.08 ? 28% -0.0 0.07 ? 14% perf-profile.children.cycles-pp.__ordered_events__flush
0.02 ?141% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.proc_reg_read
0.02 ?141% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.perf_output_sample
0.01 ?223% -0.0 0.00 -0.0 0.00 perf-profile.children.cycles-pp.do_user_addr_fault
0.04 ? 45% -0.0 0.04 ? 71% -0.0 0.01 ?223% perf-profile.children.cycles-pp.perf_callchain_user
0.29 ? 9% -0.0 0.28 ? 4% -0.0 0.28 ? 13% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.11 ? 12% -0.0 0.10 ? 8% -0.0 0.10 ? 14% perf-profile.children.cycles-pp.__might_sleep
0.05 ? 49% -0.0 0.05 ? 52% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.check_stack_object
0.16 ? 20% -0.0 0.16 ? 9% -0.0 0.13 ? 8% perf-profile.children.cycles-pp.perf_callchain_kernel
0.12 ? 20% -0.0 0.11 ? 8% -0.0 0.09 ? 7% perf-profile.children.cycles-pp.unwind_next_frame
0.29 ? 11% -0.0 0.28 ? 5% -0.0 0.26 ? 14% perf-profile.children.cycles-pp.__check_heap_object
0.43 ? 19% +0.0 0.43 ? 10% -0.1 0.36 ? 10% perf-profile.children.cycles-pp.update_process_times
0.23 ? 20% +0.0 0.23 ? 11% -0.1 0.18 ? 8% perf-profile.children.cycles-pp.perf_prepare_sample
0.21 ? 20% +0.0 0.21 ? 10% -0.0 0.17 ? 8% perf-profile.children.cycles-pp.perf_callchain
0.02 ?141% +0.0 0.02 ?141% -0.0 0.01 ?223% perf-profile.children.cycles-pp.ordered_events__queue
0.02 ?142% +0.0 0.02 ?142% -0.0 0.02 ?141% perf-profile.children.cycles-pp.io_serial_in
0.07 ? 12% +0.0 0.07 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.uart_console_write
0.07 ? 12% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.serial8250_console_write
0.21 ? 21% +0.0 0.21 ? 10% -0.0 0.17 ? 8% perf-profile.children.cycles-pp.get_perf_callchain
0.07 ? 12% +0.0 0.07 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.wait_for_xmitr
0.08 ? 12% +0.0 0.08 ? 6% -0.0 0.07 ? 13% perf-profile.children.cycles-pp.vprintk_emit
0.08 ? 12% +0.0 0.08 ? 6% -0.0 0.07 ? 13% perf-profile.children.cycles-pp.console_unlock
0.48 ? 17% +0.0 0.48 ? 9% -0.1 0.41 ? 10% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.32 ? 21% +0.0 0.32 ? 11% -0.1 0.25 ? 9% perf-profile.children.cycles-pp.update_curr
0.61 ? 16% +0.0 0.61 ? 7% -0.1 0.53 ? 10% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.30 ? 21% +0.0 0.30 ? 10% -0.1 0.24 ? 9% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.28 ? 20% +0.0 0.29 ? 10% -0.1 0.22 ? 8% perf-profile.children.cycles-pp.__perf_event_overflow
0.28 ? 19% +0.0 0.28 ? 11% -0.1 0.22 ? 7% perf-profile.children.cycles-pp.perf_event_output_forward
0.29 ? 10% +0.0 0.29 ? 9% -0.0 0.25 ? 10% perf-profile.children.cycles-pp.__might_fault
0.08 ? 12% +0.0 0.08 ? 22% -0.0 0.07 ? 7% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.worker_thread
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.process_one_work
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_fb_helper_damage_work
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_atomic_helper_dirtyfb
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_atomic_helper_commit
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.commit_tail
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_atomic_helper_commit_tail
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_atomic_helper_commit_planes
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.mgag200_simple_display_pipe_update
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.drm_fb_memcpy_dstclip
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.02 ?142% perf-profile.children.cycles-pp.memcpy_toio
0.07 ? 8% +0.0 0.07 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.serial8250_console_putchar
0.43 ? 19% +0.0 0.44 ? 9% -0.1 0.36 ? 10% perf-profile.children.cycles-pp.tick_sched_handle
0.57 ? 16% +0.0 0.58 ? 7% -0.1 0.50 ? 10% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.05 ? 47% +0.0 0.06 ? 8% -0.0 0.02 ? 99% perf-profile.children.cycles-pp.__unwind_start
0.28 ? 20% +0.0 0.29 ? 10% -0.1 0.22 ? 7% perf-profile.children.cycles-pp.perf_swevent_overflow
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.asm_sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.__sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.irq_work_run
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.irq_work_single
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.06 ? 17% perf-profile.children.cycles-pp._printk
0.44 ? 19% +0.0 0.45 ? 9% -0.1 0.37 ? 10% perf-profile.children.cycles-pp.tick_sched_timer
0.54 ? 17% +0.0 0.54 ? 8% -0.1 0.46 ? 9% perf-profile.children.cycles-pp.hrtimer_interrupt
0.54 ? 17% +0.0 0.55 ? 8% -0.1 0.47 ? 9% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.37 ? 19% +0.0 0.37 ? 10% -0.1 0.30 ? 9% perf-profile.children.cycles-pp.task_tick_fair
0.07 ? 9% +0.0 0.08 ? 8% -0.0 0.06 ? 17% perf-profile.children.cycles-pp.irq_work_run_list
0.06 ? 47% +0.0 0.07 ? 18% -0.0 0.06 ? 13% perf-profile.children.cycles-pp.asm_exc_page_fault
0.29 ? 21% +0.0 0.30 ? 10% -0.1 0.23 ? 8% perf-profile.children.cycles-pp.perf_tp_event
0.98 ? 47% +0.0 0.99 ? 38% +0.3 1.31 ? 46% perf-profile.children.cycles-pp.start_kernel
0.01 ?223% +0.0 0.02 ?141% +0.0 0.01 ?223% perf-profile.children.cycles-pp.__cond_resched
0.01 ?223% +0.0 0.02 ?141% +0.0 0.01 ?223% perf-profile.children.cycles-pp.queue_event
0.00 +0.0 0.01 ?223% +0.0 0.00 perf-profile.children.cycles-pp.build_id__mark_dso_hit
0.00 +0.0 0.01 ?223% +0.0 0.01 ?223% perf-profile.children.cycles-pp.poll_freewait
0.39 ? 19% +0.0 0.40 ? 9% -0.1 0.32 ? 9% perf-profile.children.cycles-pp.scheduler_tick
0.06 ? 8% +0.0 0.07 ? 11% -0.0 0.04 ? 73% perf-profile.children.cycles-pp.ret_from_fork
0.06 ? 8% +0.0 0.07 ? 11% -0.0 0.04 ? 73% perf-profile.children.cycles-pp.kthread
0.04 ? 71% +0.0 0.04 ? 45% +0.0 0.04 ? 71% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.19 ? 10% +0.0 0.20 ? 9% -0.0 0.16 ? 10% perf-profile.children.cycles-pp.__might_resched
0.02 ?141% +0.0 0.03 ? 70% +0.0 0.06 ? 50% perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.01 ?223% +0.0 0.03 ? 70% -0.0 0.00 perf-profile.children.cycles-pp.__get_user_nocheck_8
0.50 ? 12% +0.0 0.53 ? 9% -0.0 0.46 ? 12% perf-profile.children.cycles-pp.__kmalloc
67.89 ? 9% +2.0 69.92 ? 2% -5.7 62.20 ? 11% perf-profile.children.cycles-pp.__poll
63.84 ? 9% +2.1 65.97 ? 2% -5.4 58.43 ? 10% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
63.66 ? 9% +2.1 65.80 ? 2% -5.4 58.27 ? 10% perf-profile.children.cycles-pp.do_syscall_64
59.86 ? 9% +2.3 62.12 ? 2% -5.1 54.77 ? 10% perf-profile.children.cycles-pp.__x64_sys_poll
59.47 ? 9% +2.3 61.75 ? 2% -5.0 54.42 ? 10% perf-profile.children.cycles-pp.do_sys_poll
26.33 ? 10% +4.4 30.70 ? 2% -1.0 25.29 ? 10% perf-profile.children.cycles-pp.__fget_light
22.92 ? 10% +4.4 27.35 ? 2% -22.9 0.00 perf-profile.children.cycles-pp.__fget_files
28.78 ? 23% -1.9 26.91 ? 6% +6.1 34.84 ? 20% perf-profile.self.cycles-pp.intel_idle
17.29 ? 9% -1.6 15.64 ? 2% -2.6 14.67 ? 11% perf-profile.self.cycles-pp.fput_many
11.06 ? 9% -0.3 10.74 ? 2% -1.0 10.09 ? 11% perf-profile.self.cycles-pp.do_sys_poll
2.87 ? 9% -0.1 2.74 ? 2% -0.3 2.61 ? 12% perf-profile.self.cycles-pp.testcase
3.21 ? 9% -0.1 3.12 ? 2% -0.2 2.98 ? 11% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
1.60 ? 9% -0.1 1.54 ? 4% -0.2 1.45 ? 12% perf-profile.self.cycles-pp.__entry_text_start
1.90 ? 8% -0.0 1.86 ? 3% -0.1 1.77 ? 12% perf-profile.self.cycles-pp.syscall_return_via_sysret
2.55 ? 9% -0.0 2.51 ? 4% +21.8 24.30 ? 10% perf-profile.self.cycles-pp.__fget_light
0.84 ? 8% -0.0 0.81 ? 2% -0.1 0.77 ? 11% perf-profile.self.cycles-pp.fput
0.29 ? 8% -0.0 0.26 ? 11% +0.0 0.30 ? 20% perf-profile.self.cycles-pp.__virt_addr_valid
1.10 ? 9% -0.0 1.07 ? 4% -0.1 1.00 ? 11% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.28 ? 8% -0.0 0.26 ? 5% -0.0 0.24 ? 10% perf-profile.self.cycles-pp.do_syscall_64
0.84 ? 9% -0.0 0.82 ? 3% -0.1 0.78 ? 11% perf-profile.self.cycles-pp.__fdget
0.10 ? 6% -0.0 0.08 ? 10% -0.0 0.08 ? 14% perf-profile.self.cycles-pp.poll@plt
0.60 ? 12% -0.0 0.59 ? 9% -0.1 0.51 ? 14% perf-profile.self.cycles-pp.kfree
0.11 ? 12% -0.0 0.09 ? 7% -0.0 0.10 ? 12% perf-profile.self.cycles-pp._copy_from_user
0.44 ? 9% -0.0 0.43 ? 3% -0.0 0.42 ? 15% perf-profile.self.cycles-pp.__check_object_size
0.31 ? 10% -0.0 0.30 ? 3% -0.0 0.29 ? 12% perf-profile.self.cycles-pp.__x64_sys_poll
0.18 ? 8% -0.0 0.17 ? 5% -0.0 0.16 ? 10% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.02 ?142% -0.0 0.01 ?223% -0.0 0.01 ?223% perf-profile.self.cycles-pp.poll_select_set_timeout
0.04 ? 71% -0.0 0.02 ? 99% -0.0 0.02 ?141% perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
0.12 ? 10% -0.0 0.12 ? 6% -0.0 0.10 ? 10% perf-profile.self.cycles-pp.kmalloc_slab
0.01 ?223% -0.0 0.00 +0.0 0.04 ? 73% perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.06 ? 11% -0.0 0.05 ? 8% -0.0 0.04 ? 45% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.03 ?100% -0.0 0.02 ?142% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.check_stack_object
0.28 ? 11% -0.0 0.27 ? 5% -0.0 0.25 ? 13% perf-profile.self.cycles-pp.__check_heap_object
0.09 ? 13% -0.0 0.09 ? 7% -0.0 0.08 ? 14% perf-profile.self.cycles-pp.__might_sleep
0.08 ? 17% -0.0 0.08 ? 25% -0.0 0.08 ? 16% perf-profile.self.cycles-pp.__might_fault
0.03 ?100% -0.0 0.02 ? 99% -0.0 0.00 perf-profile.self.cycles-pp.unwind_next_frame
0.02 ?142% +0.0 0.02 ?142% -0.0 0.02 ?141% perf-profile.self.cycles-pp.io_serial_in
0.01 ?223% +0.0 0.01 ?223% +0.0 0.01 ?223% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.11 ? 9% +0.0 0.12 ? 8% -0.0 0.11 ? 9% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.44 ? 8% +0.0 0.45 ? 4% -0.0 0.42 ? 15% perf-profile.self.cycles-pp.__poll
0.01 ?223% +0.0 0.02 ?141% +0.0 0.01 ?223% perf-profile.self.cycles-pp.queue_event
0.18 ? 11% +0.0 0.19 ? 8% -0.0 0.16 ? 10% perf-profile.self.cycles-pp.__might_resched
0.26 ? 13% +0.0 0.30 ? 17% -0.0 0.25 ? 15% perf-profile.self.cycles-pp.__kmalloc
22.70 ? 10% +4.4 27.11 ? 2% -22.7 0.00 perf-profile.self.cycles-pp.__fget_files

>
> Linus

> fs/file.c | 69 ++++++++++++++++++++++++++++++++++++++++++++++++---------------
> 1 file changed, 53 insertions(+), 16 deletions(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index ad4a8bf3cf10..70662fb1ab32 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -841,28 +841,65 @@ void do_close_on_exec(struct files_struct *files)
> spin_unlock(&files->file_lock);
> }
>
> +static inline struct file *__fget_files_rcu(struct files_struct *files,
> + unsigned int fd, fmode_t mask, unsigned int refs)
> +{
> + for (;;) {
> + struct file *file;
> + struct fdtable *fdt = rcu_dereference_raw(files->fdt);
> + struct file __rcu **fdentry;
> +
> + if (unlikely(fd >= fdt->max_fds))
> + return NULL;
> +
> + fdentry = fdt->fd + array_index_nospec(fd, fdt->max_fds);
> + file = rcu_dereference_raw(*fdentry);
> + if (unlikely(!file))
> + return NULL;
> +
> + if (unlikely(file->f_mode & mask))
> + return NULL;
> +
> + /*
> + * Ok, we have a file pointer. However, because we do
> + * this all locklessly under RCU, we may be racing with
> + * that file being closed.
> + *
> + * Such a race can take two forms:
> + *
> + * (a) the file ref already went down to zero,
> + * and get_file_rcu_many() fails. Just try
> + * again:
> + */
> + if (unlikely(!get_file_rcu_many(file, refs)))
> + continue;
> +
> + /*
> + * (b) the file table entry has changed under us.
> + *
> + * If so, we need to put our refs and try again.
> + */
> + if (unlikely(rcu_dereference_raw(files->fdt) != fdt) ||
> + unlikely(rcu_dereference_raw(*fdentry) != file)) {
> + fput_many(file, refs);
> + continue;
> + }
> +
> + /*
> + * Ok, we have a ref to the file, and checked that it
> + * still exists.
> + */
> + return file;
> + }
> +}
> +
> static struct file *__fget_files(struct files_struct *files, unsigned int fd,
> fmode_t mask, unsigned int refs)
> {
> struct file *file;
>
> rcu_read_lock();
> -loop:
> - file = files_lookup_fd_rcu(files, fd);
> - if (file) {
> - /* File object ref couldn't be taken.
> - * dup2() atomicity guarantee is the reason
> - * we loop to catch the new file (or NULL pointer)
> - */
> - if (file->f_mode & mask)
> - file = NULL;
> - else if (!get_file_rcu_many(file, refs))
> - goto loop;
> - else if (files_lookup_fd_raw(files, fd) != file) {
> - fput_many(file, refs);
> - goto loop;
> - }
> - }
> + file = __fget_files_rcu(files, fd, mask, refs);
> rcu_read_unlock();
>
> return file;

> _______________________________________________
> LKP mailing list -- [email protected]
> To unsubscribe send an email to [email protected]


2021-12-13 10:59:33

by Carel Si

[permalink] [raw]
Subject: Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

Hi Linus,

On Fri, Dec 10, 2021 at 10:33:43AM -0800, Linus Torvalds wrote:
> On Thu, Dec 9, 2021 at 9:38 PM kernel test robot <[email protected]> wrote:
> >
> > FYI, we noticed a -5.7% regression of will-it-scale.per_thread_ops due to commit:
> > 054aa8d439b9 ("fget: check that the fd still exists after getting a ref to it")
>
> Well, some downside of the new checks was expected, that's just much
> more than I really like or would have thought.
>
> But it's exactly where you'd expect:
>
> > 27.16 ? 10% +4.3 31.51 ? 2% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 22.91 ? 10% +4.4 27.34 ? 2% perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
> > 26.33 ? 10% +4.4 30.70 ? 2% perf-profile.children.cycles-pp.__fget_light
> > 22.92 ? 10% +4.4 27.35 ? 2% perf-profile.children.cycles-pp.__fget_files
> > 22.70 ? 10% +4.4 27.11 ? 2% perf-profile.self.cycles-pp.__fget_files
>
> although there's odd spikes in dTLB-loads etc.
>
> I checked whether it's some unexpected code generation issue, but the
> new "re-check file table after refcount update" really looks very
> cheap when I look at what gcc generates, there's nothing really
> unexpected there.
>
> What did change was:
>
> (a) some branches go other ways, which might well affect branch
> prediction and just be unlucky. It might be that just marking the
> mismatch case "unlikely()" will help.
>
> (b) the obvious few new instructions (re-load and check file table
> pointer, re-load and check file pointer)
>
> (c) that __fget_files() function is now no longer a leaf function in
> a simple config case, since it calls "fput_many" in the error case.
>
> And that (c) is worth mentioning simply because it means that the
> function goes from not having any stack frame at all, to having to
> save/restore four registers. So now it has the usual push/pop
> sequences.
>
> It may also be that the test-case actually does a lot of threaded
> open/close/poll, and either actually triggers the re-lookup looping
> case (unlikely) or just sees a lot of cacheline bouncing that now got
> worse due to the re-check of the file pointer.
>
> So this regression looks real, and the issue seems to be that
> __fget_files() really is _that_ important for this do_sys_poll()
> benchmark, and even just the handful of extra instructions end up
> being meaningful.
>
> Oliver - I'm attaching the obvious "unlikely9)" oneliner in case it's
> just "gcc thought the retry loop was the common case" issue and bad
> branch prediction.

We tested your patch, it didn't work, still has -6.0% regression, thanks.

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/thread/50%/debian-10.4-x86_64-20200603.cgz/lkp-ivb-2ep1/poll2/will-it-scale/0x42e

commit:
5f58da2bef ("Merge tag 'drm-fixes-2021-12-03-1' of git://anongit.freedesktop.org/drm/drm")
054aa8d439 ("fget: check that the fd still exists after getting a ref to it")
ef8c68873e ("fixup-for-054aa8d439")

5f58da2befa58edf 054aa8d439b9185d4f5eb9a9028 ef8c68873e75cf486bec22c3e8d
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
6666720 -5.7% 6288280 -6.0% 6268295 will-it-scale.24.threads
277779 -5.7% 262011 -6.0% 261178 will-it-scale.per_thread_ops
6666720 -5.7% 6288280 -6.0% 6268295 will-it-scale.workload
28.74 ? 23% -1.9 26.84 ? 6% +1.2 29.92 ? 25% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
27.82 ? 23% -1.9 25.93 ? 5% +1.4 29.20 ? 26% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.82 ? 23% -1.9 25.93 ? 5% +1.4 29.20 ? 26% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary
27.83 ? 23% -1.9 25.94 ? 5% +1.4 29.20 ? 26% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.83 ? 23% -1.9 25.94 ? 5% +1.4 29.20 ? 26% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64_no_verify
27.83 ? 23% -1.9 25.94 ? 5% +1.4 29.20 ? 26% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64_no_verify
28.81 ? 23% -1.9 26.93 ? 6% +1.2 29.98 ? 25% perf-profile.calltrace.cycles-pp.secondary_startup_64_no_verify
19.12 ? 9% -1.7 17.39 ? 2% -2.5 16.64 ? 11% perf-profile.calltrace.cycles-pp.fput_many.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.26 ?100% -0.2 0.10 ?223% -0.1 0.17 ?141% perf-profile.calltrace.cycles-pp.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.88 ? 9% -0.1 2.76 ? 2% -0.2 2.64 ? 11% perf-profile.calltrace.cycles-pp.testcase
3.27 ? 9% -0.1 3.18 ? 2% -0.2 3.04 ? 11% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
1.70 ? 9% -0.1 1.63 ? 2% -0.1 1.56 ? 11% perf-profile.calltrace.cycles-pp.__fdget.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.70 ? 10% -0.1 1.64 ? 2% -0.1 1.58 ? 10% perf-profile.calltrace.cycles-pp.fput.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.04 ? 8% -0.1 0.98 ? 4% -0.1 0.96 ? 11% perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.64 ? 9% -0.1 1.59 ? 3% -0.1 1.50 ? 10% perf-profile.calltrace.cycles-pp.__entry_text_start.__poll
1.69 ? 9% -0.0 1.65 ? 3% -0.1 1.57 ? 12% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.__poll
1.11 ? 9% -0.0 1.08 ? 4% -0.1 1.02 ? 9% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
1.48 ? 8% -0.0 1.45 ? 4% -0.1 1.38 ? 10% perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.94 ? 57% -0.0 0.92 ? 54% -0.2 0.74 ? 56% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% -0.2 0.74 ? 56% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% -0.2 0.74 ? 56% perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_kernel.secondary_startup_64_no_verify
0.94 ? 57% -0.0 0.92 ? 54% -0.2 0.74 ? 56% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_kernel
0.94 ? 57% -0.0 0.92 ? 54% -0.2 0.74 ? 56% perf-profile.calltrace.cycles-pp.start_kernel.secondary_startup_64_no_verify
0.54 ? 45% +0.1 0.59 ? 9% +0.1 0.66 ? 6% perf-profile.calltrace.cycles-pp.kfree.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
67.66 ? 9% +2.0 69.69 ? 2% -0.9 66.78 ? 10% perf-profile.calltrace.cycles-pp.__poll
63.66 ? 9% +2.2 65.82 ? 2% -0.6 63.10 ? 10% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__poll
63.48 ? 9% +2.2 65.64 ? 2% -0.6 62.93 ? 10% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
59.85 ? 9% +2.3 62.12 ? 2% -0.3 59.56 ? 10% perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
57.74 ? 9% +2.4 60.09 ? 2% -0.1 57.62 ? 10% perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
27.16 ? 10% +4.3 31.51 ? 2% +3.0 30.21 ? 10% perf-profile.calltrace.cycles-pp.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
22.91 ? 10% +4.4 27.34 ? 2% +3.3 26.25 ? 10% perf-profile.calltrace.cycles-pp.__fget_files.__fget_light.do_sys_poll.__x64_sys_poll.do_syscall_64
27.83 ? 23% -1.9 25.94 ? 5% +1.4 29.20 ? 26% perf-profile.children.cycles-pp.start_secondary
28.80 ? 23% -1.9 26.92 ? 6% +1.2 29.98 ? 25% perf-profile.children.cycles-pp.cpuidle_enter
28.80 ? 23% -1.9 26.92 ? 6% +1.2 29.98 ? 25% perf-profile.children.cycles-pp.cpuidle_enter_state
28.81 ? 23% -1.9 26.93 ? 6% +1.2 29.98 ? 25% perf-profile.children.cycles-pp.secondary_startup_64_no_verify
28.81 ? 23% -1.9 26.93 ? 6% +1.2 29.98 ? 25% perf-profile.children.cycles-pp.cpu_startup_entry
28.81 ? 23% -1.9 26.93 ? 6% +1.2 29.98 ? 25% perf-profile.children.cycles-pp.do_idle
28.78 ? 23% -1.9 26.91 ? 6% +1.2 29.96 ? 25% perf-profile.children.cycles-pp.intel_idle
18.28 ? 9% -1.7 16.59 ? 2% -2.4 15.87 ? 11% perf-profile.children.cycles-pp.fput_many
2.90 ? 9% -0.1 2.76 ? 2% -0.3 2.64 ? 11% perf-profile.children.cycles-pp.testcase
3.29 ? 9% -0.1 3.20 ? 2% -0.2 3.07 ? 11% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
1.69 ? 9% -0.1 1.63 ? 2% -0.1 1.57 ? 10% perf-profile.children.cycles-pp.__fdget
1.70 ? 9% -0.1 1.64 ? 2% -0.1 1.57 ? 10% perf-profile.children.cycles-pp.fput
1.82 ? 9% -0.1 1.76 ? 3% -0.2 1.67 ? 11% perf-profile.children.cycles-pp.__entry_text_start
1.07 ? 9% -0.1 1.02 ? 4% -0.1 0.99 ? 10% perf-profile.children.cycles-pp.__check_object_size
1.90 ? 9% -0.0 1.86 ? 2% -0.1 1.77 ? 12% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.50 ? 8% -0.0 1.47 ? 4% -0.1 1.39 ? 10% perf-profile.children.cycles-pp._copy_from_user
1.12 ? 9% -0.0 1.09 ? 4% -0.1 1.03 ? 10% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
0.30 ? 8% -0.0 0.26 ? 11% -0.0 0.29 ? 20% perf-profile.children.cycles-pp.__virt_addr_valid
0.12 ? 8% -0.0 0.09 ? 11% -0.0 0.09 ? 12% perf-profile.children.cycles-pp.poll@plt
0.04 ? 71% -0.0 0.01 ?223% -0.0 0.03 ?100% perf-profile.children.cycles-pp.seq_read_iter
0.17 ? 26% -0.0 0.14 ? 32% -0.0 0.14 ? 23% perf-profile.children.cycles-pp.cmd_sched
0.17 ? 26% -0.0 0.14 ? 32% -0.0 0.14 ? 23% perf-profile.children.cycles-pp.cmd_record
0.04 ? 71% -0.0 0.02 ?144% -0.0 0.03 ?100% perf-profile.children.cycles-pp.ksys_read
0.04 ? 71% -0.0 0.02 ?144% -0.0 0.03 ?100% perf-profile.children.cycles-pp.vfs_read
0.61 ? 13% -0.0 0.59 ? 9% +0.1 0.66 ? 5% perf-profile.children.cycles-pp.kfree
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.15 ? 21% perf-profile.children.cycles-pp.__libc_start_main
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.15 ? 21% perf-profile.children.cycles-pp.main
0.17 ? 26% -0.0 0.15 ? 30% -0.0 0.15 ? 21% perf-profile.children.cycles-pp.run_builtin
0.02 ?142% -0.0 0.00 -0.0 0.00 perf-profile.children.cycles-pp.new_sync_write
0.02 ?142% -0.0 0.00 -0.0 0.01 ?223% perf-profile.children.cycles-pp.ksys_write
0.02 ?142% -0.0 0.00 -0.0 0.01 ?223% perf-profile.children.cycles-pp.vfs_write
0.04 ? 45% -0.0 0.02 ? 99% -0.0 0.02 ? 99% perf-profile.children.cycles-pp.__x86_indirect_thunk_rax
0.13 ? 10% -0.0 0.12 ? 6% -0.0 0.12 ? 15% perf-profile.children.cycles-pp.kmalloc_slab
0.02 ? 99% -0.0 0.01 ?223% +0.0 0.02 ? 99% perf-profile.children.cycles-pp.seq_read
0.02 ? 99% -0.0 0.01 ?223% +0.0 0.02 ? 99% perf-profile.children.cycles-pp.__libc_read
0.15 ? 26% -0.0 0.14 ? 31% -0.0 0.13 ? 23% perf-profile.children.cycles-pp.record__finish_output
0.15 ? 26% -0.0 0.14 ? 31% -0.0 0.13 ? 23% perf-profile.children.cycles-pp.perf_session__process_events
0.06 ? 46% -0.0 0.04 ? 73% -0.0 0.04 ? 73% perf-profile.children.cycles-pp.machines__deliver_event
0.14 ? 27% -0.0 0.12 ? 32% -0.0 0.12 ? 19% perf-profile.children.cycles-pp.process_simple
0.09 ? 30% -0.0 0.08 ? 34% -0.0 0.08 ? 17% perf-profile.children.cycles-pp.perf_session__process_user_event
0.05 ? 47% -0.0 0.04 ? 71% -0.0 0.02 ?142% perf-profile.children.cycles-pp.exc_page_fault
0.10 ? 24% -0.0 0.08 ? 30% -0.0 0.08 ? 17% perf-profile.children.cycles-pp.perf_session__deliver_event
0.02 ?142% -0.0 0.01 ?223% -0.0 0.01 ?223% perf-profile.children.cycles-pp.poll_select_set_timeout
0.09 ? 24% -0.0 0.08 ? 28% -0.0 0.08 ? 19% perf-profile.children.cycles-pp.__ordered_events__flush
0.02 ?141% -0.0 0.01 ?223% -0.0 0.00 perf-profile.children.cycles-pp.proc_reg_read
0.02 ?141% -0.0 0.01 ?223% -0.0 0.01 ?223% perf-profile.children.cycles-pp.perf_output_sample
0.01 ?223% -0.0 0.00 +0.0 0.01 ?223% perf-profile.children.cycles-pp.do_user_addr_fault
0.04 ? 45% -0.0 0.04 ? 71% -0.0 0.00 perf-profile.children.cycles-pp.perf_callchain_user
0.29 ? 9% -0.0 0.28 ? 4% -0.0 0.28 ? 10% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.11 ? 12% -0.0 0.10 ? 8% -0.0 0.10 ? 10% perf-profile.children.cycles-pp.__might_sleep
0.05 ? 49% -0.0 0.05 ? 52% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.check_stack_object
0.16 ? 20% -0.0 0.16 ? 9% -0.0 0.12 ? 11% perf-profile.children.cycles-pp.perf_callchain_kernel
0.12 ? 20% -0.0 0.11 ? 8% -0.0 0.10 ? 11% perf-profile.children.cycles-pp.unwind_next_frame
0.29 ? 11% -0.0 0.28 ? 5% -0.0 0.26 ? 9% perf-profile.children.cycles-pp.__check_heap_object
0.43 ? 19% +0.0 0.43 ? 10% -0.1 0.37 ? 9% perf-profile.children.cycles-pp.update_process_times
0.23 ? 20% +0.0 0.23 ? 11% -0.0 0.18 ? 9% perf-profile.children.cycles-pp.perf_prepare_sample
0.21 ? 20% +0.0 0.21 ? 10% -0.0 0.17 ? 10% perf-profile.children.cycles-pp.perf_callchain
0.02 ?142% +0.0 0.02 ?142% +0.0 0.02 ?142% perf-profile.children.cycles-pp.io_serial_in
0.02 ?141% +0.0 0.02 ?141% +0.0 0.02 ?141% perf-profile.children.cycles-pp.ordered_events__queue
0.07 ? 12% +0.0 0.07 ? 6% -0.0 0.07 ? 10% perf-profile.children.cycles-pp.uart_console_write
0.07 ? 12% +0.0 0.08 ? 6% -0.0 0.07 ? 10% perf-profile.children.cycles-pp.serial8250_console_write
0.21 ? 21% +0.0 0.21 ? 10% -0.0 0.17 ? 10% perf-profile.children.cycles-pp.get_perf_callchain
0.07 ? 12% +0.0 0.07 ? 6% -0.0 0.07 ? 10% perf-profile.children.cycles-pp.wait_for_xmitr
0.08 ? 12% +0.0 0.08 ? 6% -0.0 0.07 ? 12% perf-profile.children.cycles-pp.vprintk_emit
0.08 ? 12% +0.0 0.08 ? 6% -0.0 0.07 ? 12% perf-profile.children.cycles-pp.console_unlock
0.48 ? 17% +0.0 0.48 ? 9% -0.1 0.42 ? 10% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.32 ? 21% +0.0 0.32 ? 11% -0.1 0.26 ? 11% perf-profile.children.cycles-pp.update_curr
0.61 ? 16% +0.0 0.61 ? 7% -0.1 0.55 ? 10% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.30 ? 21% +0.0 0.30 ? 10% -0.1 0.24 ? 10% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.28 ? 20% +0.0 0.29 ? 10% -0.1 0.23 ? 11% perf-profile.children.cycles-pp.__perf_event_overflow
0.28 ? 19% +0.0 0.28 ? 11% -0.1 0.23 ? 10% perf-profile.children.cycles-pp.perf_event_output_forward
0.29 ? 10% +0.0 0.29 ? 9% -0.0 0.28 ? 10% perf-profile.children.cycles-pp.__might_fault
0.08 ? 12% +0.0 0.08 ? 22% -0.0 0.07 ? 14% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.worker_thread
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.process_one_work
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_fb_helper_damage_work
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_atomic_helper_dirtyfb
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_atomic_helper_commit
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.commit_tail
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_atomic_helper_commit_tail
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_atomic_helper_commit_planes
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.mgag200_simple_display_pipe_update
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.drm_fb_memcpy_dstclip
0.05 ? 8% +0.0 0.06 ? 13% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.memcpy_toio
0.07 ? 8% +0.0 0.07 ? 6% -0.0 0.07 ? 10% perf-profile.children.cycles-pp.serial8250_console_putchar
0.43 ? 19% +0.0 0.44 ? 9% -0.1 0.38 ? 9% perf-profile.children.cycles-pp.tick_sched_handle
0.57 ? 16% +0.0 0.58 ? 7% -0.1 0.52 ? 9% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.05 ? 47% +0.0 0.06 ? 8% -0.0 0.04 ? 71% perf-profile.children.cycles-pp.__unwind_start
0.28 ? 20% +0.0 0.29 ? 10% -0.1 0.23 ? 11% perf-profile.children.cycles-pp.perf_swevent_overflow
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.asm_sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.__sysvec_irq_work
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.irq_work_run
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.irq_work_single
0.07 ? 9% +0.0 0.08 ? 6% -0.0 0.07 ? 11% perf-profile.children.cycles-pp._printk
0.44 ? 19% +0.0 0.45 ? 9% -0.1 0.38 ? 9% perf-profile.children.cycles-pp.tick_sched_timer
0.54 ? 17% +0.0 0.55 ? 8% -0.1 0.49 ? 10% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.54 ? 17% +0.0 0.54 ? 8% -0.1 0.48 ? 10% perf-profile.children.cycles-pp.hrtimer_interrupt
0.37 ? 19% +0.0 0.37 ? 10% -0.1 0.31 ? 9% perf-profile.children.cycles-pp.task_tick_fair
0.06 ? 47% +0.0 0.07 ? 18% -0.0 0.06 ? 21% perf-profile.children.cycles-pp.asm_exc_page_fault
0.07 ? 9% +0.0 0.08 ? 8% -0.0 0.07 ? 11% perf-profile.children.cycles-pp.irq_work_run_list
0.98 ? 47% +0.0 0.99 ? 38% -0.2 0.78 ? 45% perf-profile.children.cycles-pp.start_kernel
0.29 ? 21% +0.0 0.30 ? 10% -0.1 0.24 ? 10% perf-profile.children.cycles-pp.perf_tp_event
0.01 ?223% +0.0 0.02 ?141% +0.0 0.01 ?223% perf-profile.children.cycles-pp.queue_event
0.00 +0.0 0.01 ?223% +0.0 0.01 ?223% perf-profile.children.cycles-pp.build_id__mark_dso_hit
0.01 ?223% +0.0 0.02 ?141% +0.0 0.02 ? 99% perf-profile.children.cycles-pp.__cond_resched
0.00 +0.0 0.01 ?223% +0.0 0.02 ? 99% perf-profile.children.cycles-pp.poll_freewait
0.39 ? 19% +0.0 0.40 ? 9% -0.1 0.34 ? 9% perf-profile.children.cycles-pp.scheduler_tick
0.04 ? 71% +0.0 0.04 ? 45% -0.0 0.02 ?141% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.06 ? 8% +0.0 0.07 ? 11% +0.0 0.06 ? 13% perf-profile.children.cycles-pp.ret_from_fork
0.06 ? 8% +0.0 0.07 ? 11% +0.0 0.06 ? 13% perf-profile.children.cycles-pp.kthread
0.19 ? 10% +0.0 0.20 ? 9% +0.0 0.19 ? 10% perf-profile.children.cycles-pp.__might_resched
0.02 ?141% +0.0 0.03 ? 70% -0.0 0.00 perf-profile.children.cycles-pp.syscall_exit_to_user_mode_prepare
0.01 ?223% +0.0 0.03 ? 70% -0.0 0.00 perf-profile.children.cycles-pp.__get_user_nocheck_8
0.50 ? 12% +0.0 0.53 ? 9% -0.0 0.48 ? 11% perf-profile.children.cycles-pp.__kmalloc
67.89 ? 9% +2.0 69.92 ? 2% -0.9 67.00 ? 10% perf-profile.children.cycles-pp.__poll
63.84 ? 9% +2.1 65.97 ? 2% -0.6 63.26 ? 10% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
63.66 ? 9% +2.1 65.80 ? 2% -0.6 63.09 ? 10% perf-profile.children.cycles-pp.do_syscall_64
59.86 ? 9% +2.3 62.12 ? 2% -0.3 59.57 ? 10% perf-profile.children.cycles-pp.__x64_sys_poll
59.47 ? 9% +2.3 61.75 ? 2% -0.2 59.22 ? 10% perf-profile.children.cycles-pp.do_sys_poll
26.33 ? 10% +4.4 30.70 ? 2% +3.1 29.43 ? 10% perf-profile.children.cycles-pp.__fget_light
22.92 ? 10% +4.4 27.35 ? 2% +3.3 26.26 ? 10% perf-profile.children.cycles-pp.__fget_files
28.78 ? 23% -1.9 26.91 ? 6% +1.2 29.96 ? 25% perf-profile.self.cycles-pp.intel_idle
17.29 ? 9% -1.6 15.64 ? 2% -2.3 14.96 ? 11% perf-profile.self.cycles-pp.fput_many
11.06 ? 9% -0.3 10.74 ? 2% -0.8 10.29 ? 10% perf-profile.self.cycles-pp.do_sys_poll
2.87 ? 9% -0.1 2.74 ? 2% -0.2 2.62 ? 11% perf-profile.self.cycles-pp.testcase
3.21 ? 9% -0.1 3.12 ? 2% -0.2 3.00 ? 11% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
1.60 ? 9% -0.1 1.54 ? 4% -0.1 1.47 ? 10% perf-profile.self.cycles-pp.__entry_text_start
1.90 ? 8% -0.0 1.86 ? 3% -0.1 1.76 ? 12% perf-profile.self.cycles-pp.syscall_return_via_sysret
2.55 ? 9% -0.0 2.51 ? 4% -0.2 2.36 ? 12% perf-profile.self.cycles-pp.__fget_light
0.84 ? 8% -0.0 0.81 ? 2% -0.1 0.78 ? 10% perf-profile.self.cycles-pp.fput
0.29 ? 8% -0.0 0.26 ? 11% -0.0 0.28 ? 20% perf-profile.self.cycles-pp.__virt_addr_valid
1.10 ? 9% -0.0 1.07 ? 4% -0.1 1.01 ? 10% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
0.28 ? 8% -0.0 0.26 ? 5% -0.0 0.26 ? 11% perf-profile.self.cycles-pp.do_syscall_64
0.84 ? 9% -0.0 0.82 ? 3% -0.0 0.79 ? 10% perf-profile.self.cycles-pp.__fdget
0.10 ? 6% -0.0 0.08 ? 10% -0.0 0.08 ? 14% perf-profile.self.cycles-pp.poll@plt
0.60 ? 12% -0.0 0.59 ? 9% +0.1 0.66 ? 6% perf-profile.self.cycles-pp.kfree
0.11 ? 12% -0.0 0.09 ? 7% -0.0 0.10 ? 13% perf-profile.self.cycles-pp._copy_from_user
0.44 ? 9% -0.0 0.43 ? 3% -0.0 0.40 ? 15% perf-profile.self.cycles-pp.__check_object_size
0.31 ? 10% -0.0 0.30 ? 3% -0.0 0.28 ? 15% perf-profile.self.cycles-pp.__x64_sys_poll
0.18 ? 8% -0.0 0.17 ? 5% -0.0 0.16 ? 11% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.02 ?142% -0.0 0.01 ?223% -0.0 0.01 ?223% perf-profile.self.cycles-pp.poll_select_set_timeout
0.04 ? 71% -0.0 0.02 ? 99% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.__x86_indirect_thunk_rax
0.12 ? 10% -0.0 0.12 ? 6% -0.0 0.12 ? 17% perf-profile.self.cycles-pp.kmalloc_slab
0.01 ?223% -0.0 0.00 -0.0 0.00 perf-profile.self.cycles-pp.syscall_exit_to_user_mode_prepare
0.06 ? 11% -0.0 0.05 ? 8% -0.0 0.04 ? 71% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.03 ?100% -0.0 0.02 ?142% -0.0 0.02 ?141% perf-profile.self.cycles-pp.check_stack_object
0.28 ? 11% -0.0 0.27 ? 5% -0.0 0.25 ? 11% perf-profile.self.cycles-pp.__check_heap_object
0.09 ? 13% -0.0 0.09 ? 7% -0.0 0.09 ? 10% perf-profile.self.cycles-pp.__might_sleep
0.08 ? 17% -0.0 0.08 ? 25% -0.0 0.08 ? 11% perf-profile.self.cycles-pp.__might_fault
0.03 ?100% -0.0 0.02 ? 99% -0.0 0.02 ?141% perf-profile.self.cycles-pp.unwind_next_frame
0.01 ?223% +0.0 0.01 ?223% -0.0 0.00 perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.02 ?142% +0.0 0.02 ?142% +0.0 0.02 ?142% perf-profile.self.cycles-pp.io_serial_in
0.11 ? 9% +0.0 0.12 ? 8% -0.0 0.11 ? 11% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.44 ? 8% +0.0 0.45 ? 4% -0.0 0.42 ? 12% perf-profile.self.cycles-pp.__poll
0.01 ?223% +0.0 0.02 ?141% +0.0 0.01 ?223% perf-profile.self.cycles-pp.queue_event
0.18 ? 11% +0.0 0.19 ? 8% -0.0 0.18 ? 10% perf-profile.self.cycles-pp.__might_resched
0.26 ? 13% +0.0 0.30 ? 17% -0.0 0.25 ? 13% perf-profile.self.cycles-pp.__kmalloc
22.70 ? 10% +4.4 27.11 ? 2% +3.3 26.04 ? 10% perf-profile.self.cycles-pp.__fget_files

>
> And it would perhaps be interesting to get an actual instruction-level
> profile of that __fget_files() thing for that benchmark, if that
> pinpoints exactly what is going on and in case that would be easy to
> get on that machine.
>
> Because it might just be truly horrendously bad luck, with the 32-byte
> stack frame meaning that the kernel stack goes one more page down
> (just jhandwaving from the dTLB number spike), and this all being just
> random bad luck on that particular benchmark.
>
> Of course, the thing about poll() is that for that case, we *don't*
> really need the "re-check the file descriptor" code at all, since the
> resulting fd isn't going to be installed as a new fd, and it doesn't
> matter for the socket garbage collector logic.
>
> So maybe it was a mistake to put that re-check in the generic fdget()
> code - yes, it should be cheap, but it's also some of the most hot
> code in the kernel on some loads.
>
> But if we move it elsewhere, we'd need to come up with some list of
> "these cases need it". Some are obvious: dup, dup2, unix domain file
> passing. It's the non-obvious ones I'd worry about.
>
> Anybody?
>
> Linus

> fs/file.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/file.c b/fs/file.c
> index ad4a8bf3cf10..f802360e240d 100644
> --- a/fs/file.c
> +++ b/fs/file.c
> @@ -858,7 +858,7 @@ static struct file *__fget_files(struct files_struct *files, unsigned int fd,
> file = NULL;
> else if (!get_file_rcu_many(file, refs))
> goto loop;
> - else if (files_lookup_fd_raw(files, fd) != file) {
> + else if (unlikely(files_lookup_fd_raw(files, fd) != file)) {
> fput_many(file, refs);
> goto loop;
> }

> _______________________________________________
> LKP mailing list -- [email protected]
> To unsubscribe send an email to [email protected]


2021-12-13 18:38:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Mon, Dec 13, 2021 at 12:34 AM Carel Si <[email protected]> wrote:
>
> We tested your patch, the performance regression has recovered from -5.7% to
> -0.4%, thanks.

Thanks for testing.

That probably means that the cost was literally mostly the overhead of
function call/exit, and now that it's simple enough for gcc to inline,
it went back to being equivalent to the old leaf-function overhead.

I also suspect that it means that the benchmark is likely not great (I
didn't look at details - I assume it's literally a microbenchmark just
doing a very tight poll() loop in multiple threads), but hey, that's
not all that uncommon.

And I do think the resulting code after this patch is better (in
organization, in comments, and obviously in code generation), so maybe
the benchmark isn't great, maybe this case doesn't matter all that
much in reality, but the end result is better for it.

So I'll just apply the patch. Thanks for the report and the testing
(including testing the one-liner that didn't help).

Linus

2021-12-13 19:44:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Mon, Dec 13, 2021 at 10:37 AM Linus Torvalds
<[email protected]> wrote:
>
> So I'll just apply the patch. Thanks for the report and the testing

Done, it's commit e386dfc56f83 ("fget: clarify and improve
__fget_files() implementation") in my tree now.

I didn't mark it as "Fixes:" or for stable, because I can't imagine
that it matters in real life.

But then it struck me that Greg has mentioned that he ends up getting
a lot of performance regression reports for people testing stable and
they can be distracting.

So I'm adding a stable cc here just so people are aware of this as a
"yeah, will-it-scale.poll2 performance regression has been reported,
has a fix available if somebody cares".

Linus

2021-12-15 12:54:39

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [LKP] Re: [fget] 054aa8d439: will-it-scale.per_thread_ops -5.7% regression

On Mon, Dec 13, 2021 at 11:44:31AM -0800, Linus Torvalds wrote:
> On Mon, Dec 13, 2021 at 10:37 AM Linus Torvalds
> <[email protected]> wrote:
> >
> > So I'll just apply the patch. Thanks for the report and the testing
>
> Done, it's commit e386dfc56f83 ("fget: clarify and improve
> __fget_files() implementation") in my tree now.
>
> I didn't mark it as "Fixes:" or for stable, because I can't imagine
> that it matters in real life.
>
> But then it struck me that Greg has mentioned that he ends up getting
> a lot of performance regression reports for people testing stable and
> they can be distracting.
>
> So I'm adding a stable cc here just so people are aware of this as a
> "yeah, will-it-scale.poll2 performance regression has been reported,
> has a fix available if somebody cares".

Thanks, I'll keep an eye on this.

greg k-h