Making close(2) delegate fput finalization to task_work_add() runs
into a slowdown (atomics needed to do it), which is artificially worsened
in the presence of rseq, which glibc blindly uses if present (and it is
normally present) -- the rseq code added a user-memory-touching handler to
resume_user_mode_work(), where the thread leaving the kernel lands after
issuing task_work_add() from fput(). Said touching requires a SMAP
round-trip, which is quite expensive, and it always executes when landing
in the resume routine.
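For reference, the path in question looks roughly like this -- a heavily
simplified sketch from memory, not verbatim kernel code (the irq/kthread
fallbacks are omitted and struct field names are approximate):

void fput(struct file *file)
{
	if (atomic_long_dec_and_test(&file->f_count)) {
		/* defer the final __fput() instead of doing it here */
		init_task_work(&file->f_rcuhead, ____fput);
		task_work_add(current, &file->f_rcuhead, TWA_RESUME);
		/* TWA_RESUME -> set_notify_resume() -> TIF_NOTIFY_RESUME */
	}
}

/* on the way back to userspace: */
void resume_user_mode_work(struct pt_regs *regs)
{
	clear_thread_flag(TIF_NOTIFY_RESUME);
	if (task_work_pending(current))
		task_work_run();		/* runs ____fput() -> __fput() */
	/* ... */
	rseq_handle_notify_resume(NULL, regs);	/* touches user memory: SMAP round-trip */
}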
I'm going to write a separate e-mail about the rseq problem later, but
even if it gets sorted out there is still perf to gain (or rather,
overhead to avoid).
Numbers are below in the proposed patch, but tl;dr: without CONFIG_RSEQ
making things worse for the stock kernel, I see about a 7% increase in
ops/s with open+close.
Searching mailing lists for discussions explaining why close(2) was not
already doing this I found a patch with the easiest way out (call
__fput_sync() in filp_close()):
https://lore.kernel.org/all/[email protected]/
There was no response to it though.
From poking around, there are tons of filp_close() users (including from
close_fd()) and it is unclear to me if they are going to be fine with
such a change.
With the assumption this is not going to work, I wrote my own patch
which adds close_fd_sync() and filp_close_sync(). They are shipped as
dedicated func entry points, but perhaps inlines which internally add a
flag to the underlying routine would be preferred? Also adding __ in
front would be in line with __fput_sync, but having __filp_close_sync
call __filp_close looks weird to me.
All that said, if the simpler patch by Oleg Nesterov works, then I'm
happy to drop this one. I just would like to see this sorted out,
whichever way.
Thoughts?
============================================================
fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.
Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).
Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on a Sapphire
Rapids CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)
Patched result is the same as it dodges rseq.
As there are numerous close_fd() and filp_close() consumers which may or
may not tolerate __fput_sync() behavior, dedicated routines are added
for close(2).
Signed-off-by: Mateusz Guzik <[email protected]>
---
fs/file.c | 20 +++++++++++++++++---
fs/file_table.c | 2 --
fs/open.c | 21 ++++++++++++++++++---
include/linux/fdtable.h | 1 +
include/linux/fs.h | 1 +
5 files changed, 37 insertions(+), 8 deletions(-)
diff --git a/fs/file.c b/fs/file.c
index 3fd003a8604f..eedb8a9fb6d2 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -651,7 +651,7 @@ static struct file *pick_file(struct files_struct *files, unsigned fd)
return file;
}
-int close_fd(unsigned fd)
+static __always_inline int __close_fd(unsigned fd, bool sync)
{
struct files_struct *files = current->files;
struct file *file;
@@ -662,9 +662,23 @@ int close_fd(unsigned fd)
if (!file)
return -EBADF;
- return filp_close(file, files);
+ if (sync)
+ return filp_close_sync(file, files);
+ else
+ return filp_close(file, files);
+}
+
+int close_fd_sync(unsigned fd)
+{
+ return __close_fd(fd, true);
+}
+EXPORT_SYMBOL(close_fd_sync); /* for ksys_close() */
+
+int close_fd(unsigned fd)
+{
+ return __close_fd(fd, false);
}
-EXPORT_SYMBOL(close_fd); /* for ksys_close() */
+EXPORT_SYMBOL(close_fd);
/**
* last_fd - return last valid index into fd table
diff --git a/fs/file_table.c b/fs/file_table.c
index fc7d677ff5ad..c7b7fcd7a8b5 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -462,8 +462,6 @@ void fput(struct file *file)
void __fput_sync(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
- struct task_struct *task = current;
- BUG_ON(!(task->flags & PF_KTHREAD));
__fput(file);
}
}
diff --git a/fs/open.c b/fs/open.c
index e6ead0f19964..e5f03f891977 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1503,7 +1503,7 @@ SYSCALL_DEFINE2(creat, const char __user *, pathname, umode_t, mode)
* "id" is the POSIX thread ID. We use the
* files pointer for this..
*/
-int filp_close(struct file *filp, fl_owner_t id)
+static __always_inline int __filp_close(struct file *filp, fl_owner_t id, bool sync)
{
int retval = 0;
@@ -1520,12 +1520,27 @@ int filp_close(struct file *filp, fl_owner_t id)
dnotify_flush(filp, id);
locks_remove_posix(filp, id);
}
- fput(filp);
+ if (sync)
+ __fput_sync(filp);
+ else
+ fput(filp);
return retval;
}
+int filp_close_sync(struct file *filp, fl_owner_t id)
+{
+ return __filp_close(filp, id, true);
+}
+EXPORT_SYMBOL(filp_close_sync);
+
+int filp_close(struct file *filp, fl_owner_t id)
+{
+ return __filp_close(filp, id, false);
+}
EXPORT_SYMBOL(filp_close);
+extern unsigned long sysctl_fput_sync;
+
/*
* Careful here! We test whether the file pointer is NULL before
* releasing the fd. This ensures that one clone task can't release
@@ -1533,7 +1548,7 @@ EXPORT_SYMBOL(filp_close);
*/
SYSCALL_DEFINE1(close, unsigned int, fd)
{
- int retval = close_fd(fd);
+ int retval = close_fd_sync(fd);
/* can't restart close syscall because file table entry was cleared */
if (unlikely(retval == -ERESTARTSYS ||
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index e066816f3519..dd3d0505d34b 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -123,6 +123,7 @@ int iterate_fd(struct files_struct *, unsigned,
int (*)(const void *, struct file *, unsigned),
const void *);
+extern int close_fd_sync(unsigned int fd);
extern int close_fd(unsigned int fd);
extern int __close_range(unsigned int fd, unsigned int max_fd, unsigned int flags);
extern struct file *close_fd_get_file(unsigned int fd);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 562f2623c9c9..300ce66eef0a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2388,6 +2388,7 @@ static inline struct file *file_clone_open(struct file *file)
{
return dentry_open(&file->f_path, file->f_flags, file->f_cred);
}
+extern int filp_close_sync(struct file *, fl_owner_t id);
extern int filp_close(struct file *, fl_owner_t id);
extern struct filename *getname_flags(const char __user *, int, int *);
--
2.39.2
On Mon, Aug 07, 2023 at 01:06:27AM +0200, Mateusz Guzik wrote:
> With the assumption this is not going to work, I wrote my own patch
> which adds close_fd_sync() and filp_close_sync(). They are shipped as
> dedicated func entry points, but perhaps inlines which internally add a
> flag to the underlying routine would be preferred?
Yes, I think static inlines would be better here.
> Also adding __ in
> front would be in line with __fput_sync, but having __filp_close_sync
> call __filp_close looks weird to me.
I'd handle this as ...
int file_close_sync(struct file *, fl_owner_t, bool sync);

static inline int filp_close(struct file *file, fl_owner_t owner)
{
	return file_close_sync(file, owner, false);
}
> > And making filp_close() fully sync again is also really not great.
>
> The patch is not doing it.
No, but you were referencing the proposed patch as an alternative in
your commit message.
> > Yes, we just re-added the f_pos optimization because it may have had
> > an impact. And that makes more sense because that was something we had
> > changed just a few days/weeks before.
> >
>
> I don't think perf tax on something becomes more sensible the longer
> it is there.
One does need to answer the question of why it suddenly becomes
relevant after all these years though.
The original discussion was triggered by fifo ordering in task work,
which led to a noticeable regression and is why it was ultimately reverted.
The sync proposal for fput() was an orthogonal proposal and the
conclusion was that it wasn't safe generally
https://lore.kernel.org/all/[email protected]
even though it wasn't a direct response to the patch you linked.
Sure, for f_pos it was obvious when and how that happened. Here? It needs
a bit more justification.
If you care about it enough send a patch that just makes close(2) go
sync. We'll stuff it in a branch and we'll see what LKP has to say about
it or whether this gets lost in noise. I really don't think letting
micro-benchmarks become a decisive factor for code churn is a good
idea.
And fwiw, yes, maybe the difference between close(2) and other parts
doesn't matter for you, but for us mortals who maintain a bunch more
than just a few lines of code in file.c, a tiny collection of
differences in behavior everywhere adds up. The fact that you think
it's irrelevant doesn't mean we have that luxury.
That's not to say your patches haven't been useful. Not at all. The
close_range() tweak was very much appreciated and that f_pos thing was
good to fix as well.
On 8/8/23, Christian Brauner <[email protected]> wrote:
> It adds two new exports of filp_close_sync() and close_fd_sync() without
> any users. That's not something we do and we also shouldn't encourage
> random drivers to switch to sync behavior.
>
They don't need to be exported for the patch to work.
> That rseq thing is completely orthogonal and maybe that needs to be
> fixed and you can go and convince the glibc people to do it.
>
This is not a glibc problem but an rseq problem -- it should not be
lumped into any case which uses task_work_add.
> And making filp_close() fully sync again is also really not great.
The patch is not doing it.
> Simplicity wins out and if all codepaths share the same behavior we're
> better off than having parts use task work and other parts just not.
>
The difference is not particularly complicated.
> Yes, we just re-added the f_pos optimization because it may have had
> an impact. And that makes more sense because that was something we had
> changed just a few days/weeks before.
>
I don't think perf tax on something becomes more sensible the longer
it is there.
> But this is over 10 year old behavior and this micro benchmarking isn't
> a strong selling point imho. We could make close(2) go sync, sure. But
> I'm skeptical even about this without real-world data or from a proper
> testsuite.
>
I responded to this in my mail to Eric.
> (There's also a stray sysctl_fput_sync there which is scary to think that
> we'd ever allow a sysctl that allows userspace to control how we close
> fds.)
>
This is a leftover from my tests -- I added a runtime switch so I can
flip back and forth; it is most definitely not something I would expect
to be shipped.
--
Mateusz Guzik <mjguzik gmail.com>
On 8/8/23, Eric W. Biederman <[email protected]> wrote:
>
> Adding a couple more people.
>
> Mateusz Guzik <[email protected]> writes:
>
>> Making close(2) delegate fput finalization with task_work_add() runs
>> into a slowdown (atomics needed to do it) which is artificially worsened
>> in presence of rseq, which glibc blindly uses if present (and it is
>> normally present) -- they added a user memory-touching handler into
>> resume_user_mode_work(), where the thread leaving the kernel lands after
>> issuing task_work_add() from fput(). Said touching requires a SMAP
>> round-trip which is quite expensive and it always executes when landing
>> in the resume routine.
>>
>> I'm going to write a separate e-mail about the rseq problem later, but
>> even if it gets sorted out there is still perf to gain (or rather,
>> overhead to avoid).
>>
>> Numbers are below in the proposed patch, but tl;dr without CONFIG_RSEQ
>> making things worse for the stock kernel I see about 7% increase in
>> ops/s with open+close.
>>
>> Searching mailing lists for discussions explaining why close(2) was not
>> already doing this I found a patch with the easiest way out (call
>> __fput_sync() in filp_close()):
>> https://lore.kernel.org/all/[email protected]/
>
> What you need to search for is probably the opposite: why is
> task_work_add used in close?
>
You are splitting hairs here.
> Taking a quick look at the history it appears that fput was always
> synchronous until a decade ago when commit 4a9d4b024a31 ("switch fput to
> task_work_add") was merged.
>
> The next two commits 3ffa3c0e3f6e ("aio: now fput() is OK from interrupt
> context; get rid of manual delayed __fput()") and commit 6120d3dbb122
> ("get rid of ->scm_work_list") seem to demonstrate why fput was made
> asynchronous. They rely on the new fput behavior to break recursive
> calls and to allow fput from any context. That plus as Al talks about
> having any lock held over fput can potentially cause a deadlock.
>
> All 3 issues taken together say that a synchronous fput is a
> loaded foot gun that must be used very carefully. That said
> close(2) does seem to be a reliably safe place to be synchronous.
>
The benefits of fput not taking surprise sleepable locks (and only
sometimes) and whatnot are pretty obvious, and I'm not proposing
changing random consumers back to a __fput_sync equivalent.
What is not obvious is whether filp_close consumers, which already do a
lot of work, are in a spot where __fput_sync is safe to do. The
question was de facto brought up by Oleg's patch and did not get a
response; I don't see any explanations in other places either.
Cursory reading on my part suggested it is indeed dodgy, hence the
proposal which does not alter filp_close semantics.
As you stated yourself, the close syscall itself should be the safe
spot here and I was surprised to find this was not sorted out already
-- it genuinely looks like an oversight.
> The big question is: can your loop calling open then close going 7%
> faster translate into any real world improvements? How much can it generalize?
>
In this context you mean how many other spots can use it? I expect
none (apart from maybe close_range). But as I mention later, close is
not an obscure syscall, it is used all the time.
As for the real world, I don't think anyone will get a marked win as is.
As it happens, there is perf loss all over the kernel; here is perf top
from the bench with task_work out of the way:
7.07% [kernel] [k] entry_SYSCALL_64
3.59% [kernel] [k] do_dentry_open
2.95% [kernel] [k] strncpy_from_user
2.93% [kernel] [k] kmem_cache_free
2.88% [kernel] [k] init_file
2.86% [kernel] [k] __call_rcu_common.constprop.0
2.72% [kernel] [k] kmem_cache_alloc
2.70% [kernel] [k] memcg_slab_post_alloc_hook
2.46% libc.so.6 [.] __GI___libc_open
2.37% [kernel] [k] mod_objcg_state
2.28% [kernel] [k] link_path_walk.part.0.constprop.0
2.20% [kernel] [k] apparmor_current_getsecid_subj
2.19% [kernel] [k] apparmor_file_open
[snip]
For example, memory allocation/free is taking quite a bit of CPU time
and I strongly suspect that with enough hackery it can be made
significantly less expensive (it mostly bottlenecks on cmpxchg16b).
That is to say, if one were to sort all the other things out, there are
several more percent to recover.
I would agree with the concern if the patch was complex, but it is not.
> Taking a look at close_fd, it is used in autofs, cachefiles, bpf, among
> others. I think there is a very good argument that we can not say that
> filp_close is always a safe place to call __fput_sync. There is just
> too much going on in some of those places. A particular, possibly
> dangerous example is cachefiles_ondemand_daemon_read, which calls
> complete after close_fd. If, as Oleg suggested, filp_close started always
> calling __fput_sync, that call to complete looks like a deadlock waiting
> to happen.
>
I did not look at cachefiles specifically; a brief look at others was
indeed worrisome.
But then again, they are not a factor with the proposed patch.
> [snip]
> Which is a long way of saying that this only looks safe for close(2).
>
Yep. Except it is a highly popular syscall, so it's not like I'm
proposing a change to facilitate something which happens once a year.
>
> Are there any real world gains if close(2) is the only place this
> optimization can be applied? Is the added maintenance burden worth the
> speed up?
>
I don't think the patch at hand adds any particular burden and can be
easily modified to appease most concerns.
> I would not export any of your new _sync variants to modules. They are
> all loaded foot-guns.
>
Specifics, including glaring warning signs and some commentary to the
effect of "don't use it", can be trivially added if dodging task_work in
close is considered fine by the people maintaining the code.
Ultimately the real question I asked in my e-mail was whether close(2)
not being exempt from task_work was *intentional* (it does not look like
it). Writing a patch which only changes semantics for this syscall
(and not for anyone else) is trivial and there are numerous ways to do
it; I only shipped the one I almost had to write for benchmarking anyway.
Frankly, I expected changing the behavior for close(2) to be a
no-brainer, though I did expect pushback on the way the patch is
implemented.
That is to say, my willingness to argue about this is not particularly
high, so if people insist on not patching close I'm going to drop this
thread sooner than later.
--
Mateusz Guzik <mjguzik gmail.com>
On 8/8/23, Christian Brauner <[email protected]> wrote:
>> I don't think perf tax on something becomes more sensible the longer
>> it is there.
>
> One does need to answer the question of why it suddenly becomes
> relevant after all these years though.
>
There is some work I'm considering doing, but before that happens I'm
sanity checking performance of various syscalls and I keep finding
problems, some of which are trivially avoidable.
I'm genuinely confused with the strong opposition to the very notion
of making close(2) a special case (which I consider conceptually
trivial), but as you noted below I'm not ultimately the person on the
hook for any problems.
> The original discussion was triggered by fifo ordering in task work,
> which led to a noticeable regression and is why it was ultimately reverted.
> The sync proposal for fput() was an orthogonal proposal and the
> conclusion was that it wasn't safe generally
> https://lore.kernel.org/all/[email protected]
> even though it wasn't a direct response to the patch you linked.
>
Ok, I missed this e-mail. It further discourages patching filp_close,
but does not make an argument against *just* close(2) rolling with
sync which is what I'm proposing.
> If you care about it enough send a patch that just makes close(2) go
> sync.
But this is precisely what the submitted patch is doing. It adds
filp_close_sync(), then adds close_fd_sync() as its only consumer, and
only makes close(2) use it. *nobody* else has sync added.
One can argue the way this is sorted out is crap and I'm not going to
defend it. I am saying that making *just* close(2) roll with sync is very
easy; there are numerous ways to do it and anyone involved with
maintaining the vfs can write their own variant in minutes. Basically I
don't see *technical* problems here.
> We'll stuff it in a branch and we'll see what LKP has to say about
> it or whether this gets lost in noise. I really don't think letting
> micro-benchmarks become a decisive factor for code churn is a good
> idea.
>
That would be nice. Given the patch is already doing what you asked,
can you just take it as is?
I'll note though what I mentioned elsewhere
(https://lore.kernel.org/all/CAGudoHEG7vtCRWjn0yR5LMUsaw3KJANfa+Hkke9gy0imXQz6tg@mail.gmail.com/):
can they make sure to whack CONFIG_RANDOMIZE_KSTACK_OFFSET=y from
their kernel config? It is an *optional* measure and it comes at a
massive premium, so single-threaded changes are easily diminished.
--
Mateusz Guzik <mjguzik gmail.com>
Linus Torvalds <[email protected]> writes:
> On Tue, 8 Aug 2023 at 10:15, Christian Brauner <[email protected]> wrote:
>>
>> I think you're at least missing the removal of the PF_KTHREAD check
>
> Yup.
>
>> It'd be neat to leave that in so
>> __fput_sync() doesn't get proliferated to non PF_KTHREAD without us
>> noticing. So maybe we just need a tiny primitive.
>
> Considering that over the decade we've had this, we've only grown two
> cases of actually using it, I think we're fine.
That, and two cases of flush_delayed_fput() followed by task_work_run().
That, combined with a maintainer who was actively against any new
calls to __fput_sync and a version of __fput_sync that would BUG_ON()
if you used it.
So I am not 100% convinced that there were so few calls to __fput_sync
simply because people couldn't think of a need for it.
Eric
Adding a couple more people.
Mateusz Guzik <[email protected]> writes:
> Making close(2) delegate fput finalization with task_work_add() runs
> into a slowdown (atomics needed to do it) which is artificially worsened
> in presence of rseq, which glibc blindly uses if present (and it is
> normally present) -- they added a user memory-touching handler into
> resume_user_mode_work(), where the thread leaving the kernel lands after
> issuing task_work_add() from fput(). Said touching requires a SMAP
> round-trip which is quite expensive and it always executes when landing
> in the resume routine.
>
> I'm going to write a separate e-mail about the rseq problem later, but
> even if it gets sorted out there is still perf to gain (or rather,
> overhead to avoid).
>
> Numbers are below in the proposed patch, but tl;dr without CONFIG_RSEQ
> making things worse for the stock kernel I see about 7% increase in
> ops/s with open+close.
>
> Searching mailing lists for discussions explaining why close(2) was not
> already doing this I found a patch with the easiest way out (call
> __fput_sync() in filp_close()):
> https://lore.kernel.org/all/[email protected]/
What you need to search for is probably the opposite: why is
task_work_add used in close?
Taking a quick look at the history it appears that fput was always
synchronous until a decade ago when commit 4a9d4b024a31 ("switch fput to
task_work_add") was merged.
The next two commits 3ffa3c0e3f6e ("aio: now fput() is OK from interrupt
context; get rid of manual delayed __fput()") and commit 6120d3dbb122
("get rid of ->scm_work_list") seem to demonstrate why fput was made
asynchronous. They rely on the new fput behavior to break recursive
calls and to allow fput from any context. That, plus, as Al talks about,
having any lock held over fput can potentially cause a deadlock.
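Purely hypothetical illustration of that last point, not taken from any
real driver:

static DEFINE_MUTEX(foo_lock);

static int foo_release(struct inode *inode, struct file *file)
{
	mutex_lock(&foo_lock);		/* ->release() wants the driver lock */
	/* ... teardown ... */
	mutex_unlock(&foo_lock);
	return 0;
}

static void foo_put(struct file *file)
{
	mutex_lock(&foo_lock);
	/*
	 * If this drops the last reference, a synchronous fput would call
	 * foo_release() right here with foo_lock already held and deadlock.
	 * The deferred fput() avoids that by running __fput() later, from
	 * task_work or the delayed workqueue.
	 */
	fput(file);
	mutex_unlock(&foo_lock);
}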
All 3 issues taken together say that a synchronous fput is a
loaded foot gun that must be used very carefully. That said
close(2) does seem to be a reliably safe place to be synchronous.
The big question is: can your loop calling open then close going 7%
faster translate into any real world improvements? How much can it generalize?
Taking a look at close_fd, it is used in autofs, cachefiles, bpf, among
others. I think there is a very good argument that we can not say that
filp_close is always a safe place to call __fput_sync. There is just
too much going on in some of those places. A particular, possibly
dangerous example is cachefiles_ondemand_daemon_read, which calls
complete after close_fd. If, as Oleg suggested, filp_close started always
calling __fput_sync, that call to complete looks like a deadlock waiting
to happen.
I took a look at exit_files() to see if it would be a reasonable candidate
for this optimization. I see: make_task_dead() -> do_exit() ->
exit_files. The function make_task_dead is designed to be called from
all kinds of awkward places so I don't see the sync variant being
safe for exit_files().
Which is a long way of saying that this only looks safe for close(2).
Are there any real world gains if close(2) is the only place this
optimization can be applied? Is the added maintenance burden worth the
speed up?
I would not export any of your new _sync variants to modules. They are
all loaded foot-guns.
>
> There was no response to it though.
>
> From poking around, there are tons of filp_close() users (including from
> close_fd()) and it is unclear to me if they are going to be fine with
> such a change.
>
> With the assumption this is not going to work, I wrote my own patch
> which adds close_fd_sync() and filp_close_sync(). They are shipped as
> dedicated func entry points, but perhaps inlines which internally add a
> flag to the underlying routine would be preferred? Also adding __ in
> front would be in line with __fput_sync, but having __filp_close_sync
> call __filp_close looks weird to me.
>
> All that said, if the simpler patch by Oleg Nesterov works, then I'm
> happy to drop this one. I just would like to see this sorted out,
> whichever way.
>
> Thoughts?
Unless you can find some real world performance gains this looks like
a bad idea.
Eric
On 8/8/23, Linus Torvalds <[email protected]> wrote:
> On Tue, 8 Aug 2023 at 10:10, Mateusz Guzik <[email protected]> wrote:
>>
>> A few hours ago I sent another version which very closely resembles what
>> you did :)
>> 2 main differences:
>> - I somehow missed close_fd_get_file so I hacked my own based on close_fd
>> - you need to whack the kthread assert in __fput_sync
>
> Good call on the __fput_sync() test.
>
> That BUG_ON() made sense ten years ago when this was all re-organized,
> not so much now.
>
Christian proposes a dedicated routine, I have 0 opinion, you guys
sort it out ;)
>> I'm offended you ask, it's all in my opening e-mail.
>
> Heh. I wasn't actually cc'd on that, so I'm going by context and then
> peeking at web links..
>
Here it is again with some typos fixed and slightly reworded, not
necessarily with the quality of English improved. Feel free to quote it
to whatever extent in your commit message (including not at all).
[quote]
fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.
Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).
Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on a Sapphire
Rapids CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)
Patched result is the same regardless of rseq as the codepath is avoided.
[/quote]
--
Mateusz Guzik <mjguzik gmail.com>
> Are there any real world gains if close(2) is the only place this
> optimization can be applied? Is the added maintenance burden worth the
> speed up?
Tbh, none of this patch makes me excited.
It adds two new exports of filp_close_sync() and close_fd_sync() without
any users. That's not something we do and we also shouldn't encourage
random drivers to switch to sync behavior.
That rseq thing is completely orthogonal and maybe that needs to be
fixed and you can go and convince the glibc people to do it.
And making filp_close() fully sync again is also really not great.
Simplicity wins out and if all codepaths share the same behavior we're
better off than having parts use task work and other parts just not.
Yes, we just re-added the f_pos optimization because it may have had
an impact. And that makes more sense because that was something we had
changed just a few days/weeks before.
But this is over 10-year-old behavior and this micro-benchmarking isn't
a strong selling point imho. We could make close(2) go sync, sure. But
I'm skeptical even about this without real-world data or results from a
proper testsuite.
(There's also a stray sysctl_fput_sync there which is scary to think that
we'd ever allow a sysctl that allows userspace to control how we close
fds.)
> Unless you can find some real world performance gains this looks like
> a bad idea.
I agree.
On Tue, 8 Aug 2023 at 10:36, Christian Brauner <[email protected]> wrote:
>
> @Linus, you ok with the appended thing?
Yes.
I do think that the CHECK_DATA_CORRUPTION() case (used to be in
filp_close, now in filp_flush) is now very questionable since we'll
end up doing an "fput()" on it anyway.
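That check is, roughly (paraphrasing from memory, the exact message may
differ):

	if (CHECK_DATA_CORRUPTION(file_count(filp) == 0,
			"VFS: Close: file count is 0\n"))
		return 0;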
But I think that's actually not a new thing - it was always in the
wrong place, and only caught the "filp_close()" cases. Which -
considering that it would only happen with people using 'fput()'
incorrectly - was always quite suspicious.
The actual "CHECK_DATA_CORRUPTION()" part of the check is new, but the
check itself predates not just the git tree, but the BK history too.
Google does find that we had it trigger back in 1998, apparently.
I think we should probably remove it entirely - and just depend on all
our modern use-after-free infrastructure.
Or we could move it into __fput() itself - the ordering wrt any
flushing is immaterial, because it's no different from using read or
write or whatever on a stale file descriptor - and at least get much
better coverage of the situation where it would happen.
But that is, I think, a completely separate issue - this is all just
made more obvious by the reorganization.
Linus
On Tue, 8 Aug 2023 at 10:15, Christian Brauner <[email protected]> wrote:
>
> I think you're at least missing the removal of the PF_KTHREAD check
Yup.
> It'd be neat to leave that in so
> __fput_sync() doesn't get proliferated to non PF_KTHREAD without us
> noticing. So maybe we just need a tiny primitive.
Considering that over the decade we've had this, we've only grown two
cases of actually using it, I think we're fine.
Also, the name makes it fairly explicit what it's all about, so I
wouldn't worry.
Linus
On Tue, Aug 08, 2023 at 09:57:04AM -0700, Linus Torvalds wrote:
> On Mon, 7 Aug 2023 at 22:57, Eric W. Biederman <[email protected]> wrote:
> >
> > Taking a quick look at the history it appears that fput was always
> > synchronous [..]
>
> Indeed. Synchronous used to be the only case.
>
> The reason it's async now is because several drivers etc do the final
> close from nasty contexts, so 'fput()' needed to be async for the
> general case.
>
> > All 3 issues taken together say that a synchronous fput is a
> > loaded foot gun that must be used very carefully. That said
> > close(2) does seem to be a reliably safe place to be synchronous.
>
> Yes.
>
> That said, I detest Mateusz' patch. I hate these kinds of "do
> different things based on flags" interfaces. Particularly when it
> spreads out like this.
>
> So I do like having close() be synchronous, because we actually do
> have correctness issues wrt the close having completed properly by the
> time we return to user space, so we have that "task_work_add()" there
> that will force the synchronization anyway before we return.
>
> So the system call case is indeed a special case. Arguably
> close_range() could be too, but honestly, once you start doing ranges
> of file descriptors, you are (a) doing something fairly unusual, and
> (b) the "queue them up on the task work" might actually be a *good*
> thing.
>
> It's definitely not a good thing for the single-fd-close case, though.
>
> But even if we want to do this - and I have absolutely no objections
> to it conceptually as per above - we need to be a lot more surgical
> about it, and not pass stupid flags around.
>
> Here's a TOTALLY UNTESTED(!) patch that I think effectively does what
> Mateusz wants done, but does it all within just fs/open.c and only for
> the obvious context of the close() system call itself.
>
> All it needs is to just split out the "flush" part from filp_close(),
> and we already had all the other infrastructure for this operation.
>
> Mateusz, two questions:
>
> (a) does this patch work for you?
>
> (b) do you have numbers for this all?
I really would like to have good ways of testing the impact of such
things because I'm a little scared of endless optimization patches that
overall either complicate or uglify our code. Maybe I'm paranoid, maybe
that's dumb but it worries me.
>
> and if it all looks good I have no problems with this kind of much
> more targeted and obvious patch.
>
> Again: TOTALLY UNTESTED. It looks completely obvious, but mistakes happen.
I think you're at least missing the removal of the PF_KTHREAD check in
void __fput_sync(struct file *file)
{
if (atomic_long_dec_and_test(&file->f_count)) {
- struct task_struct *task = current;
- BUG_ON(!(task->flags & PF_KTHREAD));
__fput(file);
}
}
so right now we'd BUG_ON(). It'd be neat to leave that in so
__fput_sync() doesn't get proliferated to non PF_KTHREAD without us
noticing. So maybe we just need a tiny primitive.
> you guys sort it out ;)
You're all driving me nuts. ;)
Fixed patch appended and put on vfs.misc for testing...
@Linus, you ok with the appended thing?
On Mon, 7 Aug 2023 at 22:57, Eric W. Biederman <[email protected]> wrote:
>
> Taking a quick look at the history it appears that fput was always
> synchronous [..]
Indeed. Synchronous used to be the only case.
The reason it's async now is because several drivers etc do the final
close from nasty contexts, so 'fput()' needed to be async for the
general case.
> All 3 issues taken together say that a synchronous fput is a
> loaded foot gun that must be used very carefully. That said
> close(2) does seem to be a reliably safe place to be synchronous.
Yes.
That said, I detest Mateusz' patch. I hate these kinds of "do
different things based on flags" interfaces. Particularly when it
spreads out like this.
So I do like having close() be synchronous, because we actually do
have correctness issues wrt the close having completed properly by the
time we return to user space, so we have that "task_work_add()" there
that will force the synchronization anyway before we return.
So the system call case is indeed a special case. Arguably
close_range() could be too, but honestly, once you start doing ranges
of file descriptors, you are (a) doing something fairly unusual, and
(b) the "queue them up on the task work" might actually be a *good*
thing.
It's definitely not a good thing for the single-fd-close case, though.
But even if we want to do this - and I have absolutely no objections
to it conceptually as per above - we need to be a lot more surgical
about it, and not pass stupid flags around.
Here's a TOTALLY UNTESTED(!) patch that I think effectively does what
Mateusz wants done, but does it all within just fs/open.c and only for
the obvious context of the close() system call itself.
All it needs is to just split out the "flush" part from filp_close(),
and we already had all the other infrastructure for this operation.
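In rough outline it ends up looking something like this (a sketch only,
not the exact patch; filp_flush() is the name assumed here for the
split-out helper):

static int filp_flush(struct file *filp, fl_owner_t id)
{
	int retval = 0;

	if (filp->f_op->flush)
		retval = filp->f_op->flush(filp, id);

	if (likely(!(filp->f_mode & FMODE_PATH))) {
		dnotify_flush(filp, id);
		locks_remove_posix(filp, id);
	}
	return retval;
}

int filp_close(struct file *filp, fl_owner_t id)
{
	int retval = filp_flush(filp, id);

	/* everybody else keeps the deferred fput() */
	fput(filp);
	return retval;
}

SYSCALL_DEFINE1(close, unsigned int, fd)
{
	int retval;
	struct file *file = close_fd_get_file(fd);

	if (!file)
		return -EBADF;

	retval = filp_flush(file, current->files);

	/* we're returning to userspace anyway, no need for task_work */
	__fput_sync(file);

	/* can't restart close syscall because file table entry was cleared */
	if (unlikely(retval == -ERESTARTSYS ||
		     retval == -ERESTARTNOINTR ||
		     retval == -ERESTARTNOHAND ||
		     retval == -ERESTART_RESTARTBLOCK))
		retval = -EINTR;

	return retval;
}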
Mateusz, two questions:
(a) does this patch work for you?
(b) do you have numbers for this all?
and if it all looks good I have no problems with this kind of much
more targeted and obvious patch.
Again: TOTALLY UNTESTED. It looks completely obvious, but mistakes happen.
Linus
On 8/8/23, Linus Torvalds <[email protected]> wrote:
> That said, I detest Mateusz' patch. I hate these kinds of "do
> different things based on flags" interfaces. Particularly when it
> spreads out like this.
I agree it is kind of crap, I needed something to test the impact.
> But even if we want to do this - and I have absolutely no objections
> to it conceptually as per above - we need to be a lot more surgical
> about it, and not pass stupid flags around.
>
> Here's a TOTALLY UNTESTED(!) patch that I think effectively does what
> Mateusz wants done, but does it all within just fs/open.c and only for
> the obvious context of the close() system call itself.
>
> All it needs is to just split out the "flush" part from filp_close(),
> and we already had all the other infrastructure for this operation.
>
A few hours ago I sent another version which very closely resembles what
you did :)
2 main differences:
- I somehow missed close_fd_get_file so I hacked my own based on close_fd
- you need to whack the kthread assert in __fput_sync
> Mateusz, two questions:
>
> (a) does this patch work for you?
>
Well, I'm not *testing* the patch right now, but it does work for me in
the sense of taking care of the problem (modulo whatever fixups,
definitely the assert thing).
> (b) do you have numbers for this all?
>
I'm offended you ask, it's all in my opening e-mail.
Copy pasting again from the commit message:
[orig]
fs: use __fput_sync in close(2)
close(2) is a special case which guarantees a shallow kernel stack,
making delegation to task_work machinery unnecessary. Said delegation is
problematic as it involves atomic ops and interrupt masking trips, none
of which are cheap on x86-64. Forcing close(2) to do it looks like an
oversight in the original work.
Moreover presence of CONFIG_RSEQ adds an additional overhead as fput()
-> task_work_add(..., TWA_RESUME) -> set_notify_resume() makes the
thread returning to userspace land in resume_user_mode_work(), where
rseq_handle_notify_resume takes a SMAP round-trip if rseq is enabled for
the thread (and it is by default with contemporary glibc).
Sample result when benchmarking open1_processes -t 1 from will-it-scale
(that's an open + close loop) + tmpfs on /tmp, running on a Sapphire
Rapids CPU (ops/s):
stock+RSEQ: 1329857
stock-RSEQ: 1421667 (+7%)
patched: 1523521 (+14.5% / +7%) (with / without rseq)
Patched result is the same as it dodges rseq.
[/orig]
Since, modulo fixups, your patch clearly elides all of this, I don't
think there is a need to re-bench (but I can do it if you ship me a
version you consider committable, just to be on the safe side).
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, 8 Aug 2023 at 10:10, Mateusz Guzik <[email protected]> wrote:
>
> A few hours ago I sent another version which very closely resembles what
> you did :)
> 2 main differences:
> - I somehow missed close_fd_get_file so I hacked my own based on close_fd
> - you need to whack the kthread assert in __fput_sync
Good call on the __fput_sync() test.
That BUG_ON() made sense ten years ago when this was all re-organized,
not so much now.
> I'm offended you ask, it's all in my opening e-mail.
Heh. I wasn't actually cc'd on that, so I'm going by context and then
peeking at web links..
Linus