2023-11-20 07:14:10

by Oliver Sang

Subject: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression



Hello,

kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:


commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

testcase: will-it-scale
test machine: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
parameters:

nr_task: 16
mode: thread
test: poll2
cpufreq_governor: performance




If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]


Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231120/[email protected]

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
93faf426e3 ("vfs: shave work on failed file open")
0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.01 ± 9% +58125.6% 4.17 ±175% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
89056 -2.0% 87309 proc-vmstat.nr_slab_unreclaimable
97958 ± 7% -9.7% 88449 ± 4% sched_debug.cpu.avg_idle.stddev
0.00 ± 12% +24.2% 0.00 ± 17% sched_debug.cpu.next_balance.stddev
6391048 -2.9% 6208584 will-it-scale.16.threads
399440 -2.9% 388036 will-it-scale.per_thread_ops
6391048 -2.9% 6208584 will-it-scale.workload
19.99 ± 4% -2.2 17.74 perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
1.27 ± 5% +0.8 2.11 ± 3% perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
32.69 ± 4% +5.0 37.70 perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
0.00 +27.9 27.85 perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
20.00 ± 4% -2.3 17.75 perf-profile.children.cycles-pp.fput
0.24 ± 10% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.48 ± 5% +0.5 1.98 ± 3% perf-profile.children.cycles-pp.__fdget
31.85 ± 4% +6.0 37.86 perf-profile.children.cycles-pp.__fget_light
0.00 +27.7 27.67 perf-profile.children.cycles-pp.__get_file_rcu
30.90 ± 4% -20.6 10.35 ± 2% perf-profile.self.cycles-pp.__fget_light
19.94 ± 4% -2.4 17.53 perf-profile.self.cycles-pp.fput
9.81 ± 4% -2.4 7.42 ± 2% perf-profile.self.cycles-pp.do_poll
0.23 ± 11% -0.1 0.17 ± 4% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.00 +26.5 26.48 perf-profile.self.cycles-pp.__get_file_rcu
2.146e+10 ± 2% +8.5% 2.329e+10 ± 2% perf-stat.i.branch-instructions
0.22 ± 14% -0.0 0.19 ± 14% perf-stat.i.branch-miss-rate%
1.404e+10 ± 2% +8.7% 1.526e+10 ± 2% perf-stat.i.dTLB-stores
70.87 -2.3 68.59 perf-stat.i.iTLB-load-miss-rate%
5267608 -5.5% 4979133 ± 2% perf-stat.i.iTLB-load-misses
2102507 +5.4% 2215725 perf-stat.i.iTLB-loads
18791 ± 3% +10.5% 20757 ± 2% perf-stat.i.instructions-per-iTLB-miss
266.67 ± 2% +6.8% 284.75 ± 2% perf-stat.i.metric.M/sec
0.01 ± 10% -10.5% 0.01 ± 5% perf-stat.overall.MPKI
0.19 -0.0 0.17 perf-stat.overall.branch-miss-rate%
0.65 -3.1% 0.63 perf-stat.overall.cpi
0.00 ± 4% -0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
71.48 -2.3 69.21 perf-stat.overall.iTLB-load-miss-rate%
18757 +10.0% 20629 perf-stat.overall.instructions-per-iTLB-miss
1.54 +3.2% 1.59 perf-stat.overall.ipc
4795147 +6.4% 5100406 perf-stat.overall.path-length
2.14e+10 ± 2% +8.5% 2.322e+10 ± 2% perf-stat.ps.branch-instructions
1.4e+10 ± 2% +8.7% 1.522e+10 ± 2% perf-stat.ps.dTLB-stores
5253923 -5.5% 4966218 ± 2% perf-stat.ps.iTLB-load-misses
2095770 +5.4% 2208605 perf-stat.ps.iTLB-loads
3.065e+13 +3.3% 3.167e+13 perf-stat.total.instructions




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


2023-11-20 07:42:07

by Mateusz Guzik

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

On Mon, Nov 20, 2023 at 03:11:31PM +0800, kernel test robot wrote:
>
>
> Hello,
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
>
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> 93faf426e3cc000c 0ede61d8589cc2d93aa78230d74
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
[snip]
> 30.90 ± 4% -20.6 10.35 ± 2% perf-profile.self.cycles-pp.__fget_light
> 0.00 +26.5 26.48 perf-profile.self.cycles-pp.__get_file_rcu
[snip]

So __fget_light now got a func call.

I don't know if this is worth patching (and benchmarking after), but if
sorting this out is of interest, the triviality below is probably the
easiest way out:

diff --git a/fs/file.c b/fs/file.c
index 5fb0b146e79e..d8d3e18800c4 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -856,14 +856,14 @@ void do_close_on_exec(struct files_struct *files)
spin_unlock(&files->file_lock);
}

-static struct file *__get_file_rcu(struct file __rcu **f)
+static __always_inline struct file *__get_file_rcu(struct file __rcu **f)
{
struct file __rcu *file;
struct file __rcu *file_reloaded;
struct file __rcu *file_reloaded_cmp;

file = rcu_dereference_raw(*f);
- if (!file)
+ if (unlikely(!file))
return NULL;

if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
@@ -891,7 +891,7 @@ static struct file *__get_file_rcu(struct file __rcu **f)
* If the pointers don't match the file has been reallocated by
* SLAB_TYPESAFE_BY_RCU.
*/
- if (file == file_reloaded_cmp)
+ if (likely(file == file_reloaded_cmp))
return file_reloaded;

fput(file);

2023-11-26 20:24:32

by Linus Torvalds

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

On Sun, 19 Nov 2023 at 23:11, kernel test robot <[email protected]> wrote:
>
> kernel test robot noticed a -2.9% regression of will-it-scale.per_thread_ops on:
>
> commit: 0ede61d8589cc2d93aa78230d74ac58b5b8d0244 ("file: convert to SLAB_TYPESAFE_BY_RCU")

Ok, so __fget_light() is one of our more performance-critical things,
and that commit ends up making it call a rather more expensive version
in __get_file_rcu(), so we have:

> 30.90 ± 4% -20.6 10.35 ± 2% perf-profile.self.cycles-pp.__fget_light
> 0.00 +26.5 26.48 perf-profile.self.cycles-pp.__get_file_rcu

and that "20% decrease balanced by 26% increase elsewhere" then
directly causes the ~3% regression.

I took a look at the code generation, and honestly, I think we're
better off just making __fget_files_rcu() have special logic for this
all, and not use __get_file_rcu().

The 'fd' case really is special because we need to do that
non-speculative pointer access.

Because it turns out that we already have to use array_index_nospec()
to safely generate that pointer to the fd entry, and once you have to
do that "use non-speculative accesses to generate a safe pointer", you
might as well just go whole hog.

So this takes a different approach, and uses the
array_index_mask_nospec() thing that we have exactly to generate that
non-speculative mask for these things:

/* Mask is a 0 for invalid fd's, ~0 for valid ones */
mask = array_index_mask_nospec(fd, fdt->max_fds);

and then it does something you can consider either horribly clever, or
horribly ugly (this first part is basically just
array_index_nospec()):

/* fdentry points to the 'fd' offset, or fdt->fd[0] */
fdentry = fdt->fd + (fd&mask);

and then we can just *unconditionally* do the load - but since we
might be loading fd[0] for an invalid case, we need to mask the result
too:

/* Do the load, then mask any invalid result */
file = rcu_dereference_raw(*fdentry);
file = (void *)(mask & (unsigned long)file);

but now we have done everything without any conditionals, and the only
conditional is now "did we load NULL" - which includes that "we masked
the bad value".

Then we just do that atomic_long_inc_not_zero() on the file, and
re-check the pointer chain we used.
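
For illustration only, here's roughly how those pieces fit together (a
sketch of the shape, not the patch itself; the helper name is made up,
and the real code sits inline in __fget_files_rcu()):

static struct file *fd_lookup_sketch(struct fdtable *fdt, unsigned int fd)
{
        struct file __rcu **fdentry;
        struct file *file;
        unsigned long mask;

        /* Mask is a 0 for invalid fd's, ~0 for valid ones */
        mask = array_index_mask_nospec(fd, fdt->max_fds);

        /* fdentry points to the 'fd' offset, or fdt->fd[0] */
        fdentry = fdt->fd + (fd & mask);

        /* Do the load unconditionally, then mask away any invalid result */
        file = rcu_dereference_raw(*fdentry);
        file = (void *)(mask & (unsigned long)file);
        if (!file)
                return NULL;

        /* Grab a reference; the caller still re-checks the fd table and
         * the file pointer before trusting it, as described above. */
        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                return NULL;

        return file;
}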

I made files_lookup_fd_raw() do the same thing.

The end result is much nicer code generation at least from my quick
check. And I assume the regression will be gone, and hopefully even
turned into an improvement since this is so hot.

Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that
__get_file_rcu() does, and I don't get it. Both things come from
volatile accesses, I don't see the point of those games, but I also
didn't care, since it's no longer in a critical code path.

Christian?

NOTE! This patch is not well tested. I verified an earlier version of
this, but have been playing with it since, so caveat emptor.

IOW, I might have messed up some "trivial cleanup" when prepping for
sending it out...

Linus


Attachments:
patch.diff (4.02 kB)

2023-11-26 23:22:03

by Linus Torvalds

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
<[email protected]> wrote:
>
> IOW, I might have messed up some "trivial cleanup" when prepping for
> sending it out...

Bah. Famous last words. One of the "trivial cleanups" made the code
more "obvious" by renaming the nospec mask as just "mask".

And that trivial rename broke that patch *entirely*, because now that
name shadowed the "fmode_t" mask argument.

Don't even ask how long it took me to go from "I *tested* this,
dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
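
In miniature, the failure mode was something like this (a made-up
example, not the actual function):

static int fd_mode_ok(unsigned int fd, unsigned int max_fds,
                      unsigned int f_mode, unsigned int mask)
{
        for (;;) {
                /* nospec-style index mask: ~0 for a valid fd, 0 otherwise */
                unsigned int mask = fd < max_fds ? ~0u : 0u;  /* shadows the fmode argument */

                /* meant to test the caller's fmode mask, but now tests
                 * the index mask, so basically every open file "fails" */
                if (f_mode & mask)
                        return 0;
                return 1;
        }
}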

So that nobody else would waste any time on this, attached is a new
attempt. This time actually tested *after* the changes.

Linus


Attachments:
0001-Improve-__fget_files_rcu-code-generation-and-thus-__.patch (4.90 kB)

2023-11-27 06:59:25

by Oliver Sang

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

hi, Linus,

On Sun, Nov 26, 2023 at 03:20:58PM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 12:23, Linus Torvalds
> <[email protected]> wrote:
> >
> > IOW, I might have messed up some "trivial cleanup" when prepping for
> > sending it out...
>
> Bah. Famous last words. One of the "trivial cleanups" made the code
> more "obvious" by renaming the nospec mask as just "mask".
>
> And that trivial rename broke that patch *entirely*, because now that
> name shadowed the "fmode_t" mask argument.
>
> Don't even ask how long it took me to go from "I *tested* this,
> dammit, now it doesn't work at all" to "Oh God, I'm so stupid".
>
> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

we applied the new patch on top of 0ede61d858, and confirmed the regression is
gone; it is even 3.4% better than 93faf426e3 now.

Tested-by: kernel test robot <[email protected]>

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/16/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
93faf426e3 ("vfs: shave work on failed file open")
0ede61d858 ("file: convert to SLAB_TYPESAFE_BY_RCU")
c712b4365b ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

93faf426e3cc000c 0ede61d8589cc2d93aa78230d74 c712b4365b5b4dbe1d1380edd37
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
228481 ± 4% -4.6% 217900 ± 6% -11.7% 201857 ± 5% meminfo.DirectMap4k
89056 -2.0% 87309 -1.6% 87606 proc-vmstat.nr_slab_unreclaimable
16.28 -0.7% 16.16 -1.0% 16.12 turbostat.RAMWatt
0.01 ± 9% +58125.6% 4.17 ±175% +23253.5% 1.67 ±222% perf-sched.sch_delay.max.ms.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
781.67 ± 10% +6.5% 832.50 ± 19% -14.3% 670.17 ± 4% perf-sched.wait_and_delay.count.schedule_timeout.rcu_gp_fqs_loop.rcu_gp_kthread.kthread
97958 ± 7% -9.7% 88449 ± 4% -0.6% 97399 ± 4% sched_debug.cpu.avg_idle.stddev
0.00 ± 12% +24.2% 0.00 ± 17% -5.2% 0.00 ± 7% sched_debug.cpu.next_balance.stddev
6391048 -2.9% 6208584 +3.4% 6605584 will-it-scale.16.threads
399440 -2.9% 388036 +3.4% 412848 will-it-scale.per_thread_ops
6391048 -2.9% 6208584 +3.4% 6605584 will-it-scale.workload
19.99 ± 4% -2.2 17.74 +1.2 21.18 ± 2% perf-profile.calltrace.cycles-pp.fput.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
1.27 ± 5% +0.8 2.11 ± 3% +31.1 32.36 ± 2% perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
32.69 ± 4% +5.0 37.70 -32.7 0.00 perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
0.00 +27.9 27.85 +0.0 0.00 perf-profile.calltrace.cycles-pp.__get_file_rcu.__fget_light.do_poll.do_sys_poll.__x64_sys_poll
20.00 ± 4% -2.3 17.75 +0.4 20.43 ± 2% perf-profile.children.cycles-pp.fput
0.24 ± 10% -0.1 0.18 ± 2% -0.1 0.18 ± 10% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.48 ± 5% +0.5 1.98 ± 3% +30.8 32.32 ± 2% perf-profile.children.cycles-pp.__fdget
31.85 ± 4% +6.0 37.86 -31.8 0.00 perf-profile.children.cycles-pp.__fget_light
0.00 +27.7 27.67 +0.0 0.00 perf-profile.children.cycles-pp.__get_file_rcu
30.90 ± 4% -20.6 10.35 ± 2% -30.9 0.00 perf-profile.self.cycles-pp.__fget_light
19.94 ± 4% -2.4 17.53 -0.3 19.62 ± 2% perf-profile.self.cycles-pp.fput
9.81 ± 4% -2.4 7.42 ± 2% +1.7 11.51 ± 4% perf-profile.self.cycles-pp.do_poll
0.23 ± 11% -0.1 0.17 ± 4% -0.1 0.18 ± 11% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.44 ± 7% +0.0 0.45 ± 5% +0.1 0.52 ± 4% perf-profile.self.cycles-pp.__poll
0.85 ± 4% +0.1 0.92 ± 3% +30.3 31.17 ± 2% perf-profile.self.cycles-pp.__fdget
0.00 +26.5 26.48 +0.0 0.00 perf-profile.self.cycles-pp.__get_file_rcu
2.146e+10 ± 2% +8.5% 2.329e+10 ± 2% -2.1% 2.101e+10 perf-stat.i.branch-instructions
0.22 ± 14% -0.0 0.19 ± 14% -0.0 0.20 ± 3% perf-stat.i.branch-miss-rate%
2.424e+10 ± 2% +4.1% 2.524e+10 ± 2% -4.7% 2.311e+10 perf-stat.i.dTLB-loads
1.404e+10 ± 2% +8.7% 1.526e+10 ± 2% -6.2% 1.316e+10 perf-stat.i.dTLB-stores
70.87 -2.3 68.59 -1.0 69.90 perf-stat.i.iTLB-load-miss-rate%
5267608 -5.5% 4979133 ± 2% -0.4% 5244253 perf-stat.i.iTLB-load-misses
2102507 +5.4% 2215725 +5.7% 2222286 perf-stat.i.iTLB-loads
18791 ± 3% +10.5% 20757 ± 2% -1.8% 18446 perf-stat.i.instructions-per-iTLB-miss
266.67 ± 2% +6.8% 284.75 ± 2% -4.1% 255.70 perf-stat.i.metric.M/sec
0.01 ± 10% -10.5% 0.01 ± 5% -1.8% 0.01 ± 6% perf-stat.overall.MPKI
0.19 -0.0 0.17 +0.0 0.20 perf-stat.overall.branch-miss-rate%
0.65 -3.1% 0.63 +6.1% 0.69 perf-stat.overall.cpi
0.00 ± 4% -0.0 0.00 ± 4% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
71.48 -2.3 69.21 -1.2 70.24 perf-stat.overall.iTLB-load-miss-rate%
18757 +10.0% 20629 -3.2% 18161 perf-stat.overall.instructions-per-iTLB-miss
1.54 +3.2% 1.59 -5.8% 1.45 perf-stat.overall.ipc
4795147 +6.4% 5100406 -9.0% 4365017 perf-stat.overall.path-length
2.14e+10 ± 2% +8.5% 2.322e+10 ± 2% -2.1% 2.094e+10 perf-stat.ps.branch-instructions
2.417e+10 ± 2% +4.1% 2.516e+10 ± 2% -4.7% 2.303e+10 perf-stat.ps.dTLB-loads
1.4e+10 ± 2% +8.7% 1.522e+10 ± 2% -6.3% 1.312e+10 perf-stat.ps.dTLB-stores
5253923 -5.5% 4966218 ± 2% -0.5% 5228207 perf-stat.ps.iTLB-load-misses
2095770 +5.4% 2208605 +5.7% 2214962 perf-stat.ps.iTLB-loads
3.065e+13 +3.3% 3.167e+13 -5.9% 2.883e+13 perf-stat.total.instructions

>
> Linus

2023-11-27 10:13:14

by Christian Brauner

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

> I took a look at the code generation, and honestly, I think we're
> better off just making __fget_files_rcu() have special logic for this
> all, and not use __get_file_rcu().

My initial massaging of the patch did that btw. Then I sat there
wondering whether it would matter if we just made it possible to reuse
that code and I went through a bunch of iterations. Oh well, it seems to
matter.

> Comments? I also looked at that odd OPTIMIZER_HIDE_VAR() that

Concept looks sane to me.

> __get_file_rcu() does, and I don't get it. Both things come from
> volatile accesses, I don't see the point of those games, but I also
> didn't care, since it's no longer in a critical code path.
>
> Christian?

Puts on his completely imagined "I understand RCU" hat.
SLAB_TYPESAFE_BY_RCU makes the RCU consume memory ordering, which the
compiler doesn't officially support (afaik), a bit wonky.

So the thinking was that we could have code patterns where you could
free the object and reallocate it while legitimatly passing the pointer
recheck. In that case there is no memory ordering between the allocation
and the pointer recheck because the last (re)allocation could have been
after the rcu_dereference().

To combat that, all future loads were made to have a dependency on the
first load using the hidevar trick.
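
For reference, the pattern being discussed is roughly this (a paraphrase
of the 0ede61d858 version of __get_file_rcu(); the comments here are a
summary, not the in-tree ones):

static struct file *__get_file_rcu(struct file __rcu **f)
{
        struct file __rcu *file;
        struct file __rcu *file_reloaded;
        struct file __rcu *file_reloaded_cmp;

        file = rcu_dereference_raw(*f);
        if (!file)
                return NULL;

        if (unlikely(!atomic_long_inc_not_zero(&file->f_count)))
                return ERR_PTR(-EAGAIN);

        file_reloaded = rcu_dereference_raw(*f);

        /* The hidevar trick: make the comparison below depend on this
         * second load, so the recheck can't be ordered before it. */
        file_reloaded_cmp = file_reloaded;
        OPTIMIZER_HIDE_VAR(file_reloaded_cmp);

        /* If the pointers still match we hold a reference to the right
         * file; if not, SLAB_TYPESAFE_BY_RCU reallocated it under us. */
        if (file == file_reloaded_cmp)
                return file_reloaded;

        fput(file);
        return ERR_PTR(-EAGAIN);
}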

I guess that might only be theoretically possible but not in practice?
But then I liked that we explicitly commented on it as a reminder.

2023-11-27 10:28:07

by Christian Brauner

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

> So that nobody else would waste any time on this, attached is a new
> attempt. This time actually tested *after* the changes.

So I've picked up your patch (vfs.misc). It's clever alright so thanks
for the comments in there otherwise I would've stared at this for far
too long.

It's a little unpleasant because of the cast-orama going on before we
check the file pointer but I don't see that it's in any way wrong. And
given how focussed people are on __fget_* performance, I think it might
even be the right thing to do.

But the cleverness means we have the same logic slightly differently
twice. Not too bad ofc, but not too nice either, especially because that
rcu lookup is pretty complicated already.

A few days ago I did just write a long explanatory off-list email to
someone who had questions about this and who is fairly experienced, so
we're not making it easy on people. But performance or simplicity: one
can't necessarily always have both.

2023-11-27 17:12:11

by Linus Torvalds

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

On Mon, 27 Nov 2023 at 02:27, Christian Brauner <[email protected]> wrote:
>
> So I've picked up your patch (vfs.misc). It's clever alright so thanks
> for the comments in there otherwise I would've stared at this for far
> too long.

Note that I should probably have commented on one other thing: that
whole "just load from fd[0] is always safe, because the fd[] array
always exists".

IOW, that whole "load and mask" thing only works when you know the
array exists at all.

Doing that "just mask the index" wouldn't be valid if "size = 0" is an
option and might mean that we don't have an array at all (ie if "->fd"
itself could be NULL.

But we never have a completely empty file descriptor array, and
fdp->fd is never NULL. At a minimum 'max_fds' is NR_OPEN_DEFAULT.

(The whole 'tsk->files' could be NULL, but only for kernel threads or
when exiting, so fget_task() will check for *that*, but it's a
separate thing)

So that's why it's safe to *entirely* remove the whole

if (unlikely(fd >= fdt->max_fds))

test, and do it *all* with just "mask the index, and mask the resulting load".

Because we can *always* do that load at "fdt->fd[0]", and we want to
check the result for NULL anyway, so the "mask at the end and check
for NULL" is both natural and generates very good code.

Anyway, not a big deal, but it might be worth noting before somebody
tries the same trick on some other array that *could* be zero-sized
and with a NULL base pointer, and where that 'array[0]' access isn't
necessarily guaranteed to be ok.

> It's a little unpleasant because of the cast-orama going on before we
> check the file pointer but I don't see that it's in any way wrong.

In my cleanup phase - which was a bit messy - I did wonder if I should
have some helper for it, since it shows up in both __fget_files_rcu()
and in files_lookup_fd_raw().

So I *could* have tried to add something like a
"masked_rcu_dereference()" that took the base pointer, the index, and
the mask, and did that whole dance.
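
Hypothetically, something along these lines (never actually written; the
name and shape are invented here from the description above):

static inline struct file *masked_rcu_dereference(struct file __rcu **base,
                                                  unsigned int idx,
                                                  unsigned long mask)
{
        struct file *file;

        /* index the array with the already-computed nospec mask ... */
        file = rcu_dereference_raw(*(base + (idx & mask)));

        /* ... and mask the loaded pointer so an out-of-range index
         * always yields NULL */
        return (void *)(mask & (unsigned long)file);
}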

Or I could have had just a "mask_pointer()" function, which we do
occasionally do in other places too (ie we hide data in low bits, and
then we mask them away when the pointer is used as a pointer).

But with only two users, it seemed to add more conceptual complexity
than it's worth, and I was not convinced that we'd want to expose that
pattern and have others use it.

So having a helper might clarify things, but it might also encourage
wrong users. I dunno.

I suspect the only real use for this ends up being this very special
"access the fdt->fd[] array using a file descriptor".

Anyway, that's why I largely just did it with comments, and commented
both places - and just kept the cast there in the open.

Linus

2023-11-28 15:52:50

by Christian Brauner

Subject: Re: [linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression

On Mon, Nov 27, 2023 at 09:10:54AM -0800, Linus Torvalds wrote:
> On Mon, 27 Nov 2023 at 02:27, Christian Brauner <[email protected]> wrote:
> >
> > So I've picked up your patch (vfs.misc). It's clever alright so thanks
> > for the comments in there otherwise I would've stared at this for far
> > too long.
>
> Note that I should probably have commented on one other thing: that
> whole "just load from fd[0] is always safe, because the fd[] array
> always exists".

I added a comment to that effect in the code.

>
> IOW, that whole "load and mask" thing only works when you know the
> array exists at all.
>
> Doing that "just mask the index" wouldn't be valid if "size = 0" is an
> option and might mean that we don't have an array at all (ie if "->fd"
> itself could be NULL.
>
> But we never have a completely empty file descriptor array, and
> fdp->fd is never NULL. At a minimum 'max_fds' is NR_OPEN_DEFAULT.
>
> (The whole 'tsk->files' could be NULL, but only for kernel threads or
> when exiting, so fget_task() will check for *that*, but it's a
> separate thing)

Yep.

>
> So that's why it's safe to *entirely* remove the whole
>
> if (unlikely(fd >= fdt->max_fds))
>
> test, and do it *all* with just "mask the index, and mask the resulting load".

Yep.