2022-08-03 18:56:24

by Al Viro

[permalink] [raw]
Subject: [git pull] vfs.git pile 3 - dcache

The following changes since commit b13baccc3850ca8b8cccbf8ed9912dbaa0fdf7f3:

Linux 5.19-rc2 (2022-06-12 16:11:37 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-work.dcache

for you to fetch changes up to 50417d22d0efbb1be76c3cb66b2329f83741c9c7:

fs/dcache: Move wakeup out of i_seq_dir write held region. (2022-07-30 00:38:16 -0400)

----------------------------------------------------------------
Main part here is making parallel lookups safe for RT - making
sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on
non-RT it's automatic, on RT it needs to to be done explicitly) and moving
wakeups from __d_lookup_done() inside of such to the end of those sections.
Wakeups can be safely delayed for as long as ->d_lock on in-lookup
dentry is held; proving that has caught a bug in d_add_ci() that allows
memory corruption when sufficiently bogus ntfs (or case-insensitive xfs)
image is mounted. Easily fixed, fortunately.

Signed-off-by: Al Viro <[email protected]>

----------------------------------------------------------------
Al Viro (1):
d_add_ci(): make sure we don't miss d_lookup_done()

Sebastian Andrzej Siewior (3):
fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
fs/dcache: Move wakeup out of i_seq_dir write held region.

fs/dcache.c | 54 ++++++++++++++++++++++++++++++++++++++++----------
include/linux/dcache.h | 9 +++------
2 files changed, 46 insertions(+), 17 deletions(-)


2022-08-03 19:04:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 3, 2022 at 11:39 AM Al Viro <[email protected]> wrote:
>
> Main part here is making parallel lookups safe for RT - making
> sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on
> non-RT it's automatic, on RT it needs to to be done explicitly) and moving
> wakeups from __d_lookup_done() inside of such to the end of those sections.

Ugh.

I really dislike this pattern:

if (IS_ENABLED(CONFIG_PREEMPT_RT))
preempt_disable();
...
if (IS_ENABLED(CONFIG_PREEMPT_RT))
preempt_enable();

and while the new comment explains *why* it exists, it's still very ugly indeed.

We have it in a couple of other places, and we also end up having
another variation on the theme that is about "migrate_{dis,en}able()",
except it is written as

if (IS_ENABLED(CONFIG_PREEMPT_RT))
migrate_disable();
else
preempt_disable();

because on non-PREEMPT_RT obviously preempt_disable() is the better
and simpler thing.

Can we please just introduce helper functions?

At least that

if (IS_ENABLED(CONFIG_PREEMPT_RT))
preempt_disable();
...

pattern could be much more naturally expressed as

preempt_disable_under_spinlock();
...

which would make the code really explain what is going on. I would
still encourage that *comment* about it, but I think we really should
strive for code that makes sense even without a comment.

The fact that then without PREEMPT_RT, the whole
"preempt_disable_under_spinlock()" becomes a no-op is then an
implementation detail - and not so different from how a regular
preempt_disable() becomes a no-op when on UP (or with PREEMPT_NONE).

And that "preempt_disable_under_spinlock()" really documents what is
going on, and I feel would make that code easier to understand? The
fact that PREEMPT_RT has different rules about preemption is not
something that the dentry code should care about.

The dentry code could just say "I want to disable preemption, and I
already hold a spinlock, so do what is best".

So then "preempt_disable_under_spinlock()" precisely documents what
the dentry code really wants.

No?

Anyway, I have pulled this, but I really would like fewer of these
random PREEMPT_RT turds around, and more "this code makes sense" code.

Linus

2022-08-03 19:13:25

by pr-tracker-bot

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

The pull request you sent on Wed, 3 Aug 2022 19:39:25 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-work.dcache

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/200e340f2196d7fd427a5810d06e893b932f145a

Thank you!

--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

2022-08-03 20:11:38

by Al Viro

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 03, 2022 at 11:57:27AM -0700, Linus Torvalds wrote:
> On Wed, Aug 3, 2022 at 11:39 AM Al Viro <[email protected]> wrote:
> >
> > Main part here is making parallel lookups safe for RT - making
> > sure preemption is disabled in start_dir_add()/ end_dir_add() sections (on
> > non-RT it's automatic, on RT it needs to to be done explicitly) and moving
> > wakeups from __d_lookup_done() inside of such to the end of those sections.
>
> Ugh.
>
> I really dislike this pattern:
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_disable();
> ...
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_enable();
>
> and while the new comment explains *why* it exists, it's still very ugly indeed.
>
> We have it in a couple of other places, and we also end up having
> another variation on the theme that is about "migrate_{dis,en}able()",
> except it is written as
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> migrate_disable();
> else
> preempt_disable();
>
> because on non-PREEMPT_RT obviously preempt_disable() is the better
> and simpler thing.
>
> Can we please just introduce helper functions?
>
> At least that
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_disable();
> ...
>
> pattern could be much more naturally expressed as
>
> preempt_disable_under_spinlock();
> ...
>
> which would make the code really explain what is going on. I would
> still encourage that *comment* about it, but I think we really should
> strive for code that makes sense even without a comment.
>
> The fact that then without PREEMPT_RT, the whole
> "preempt_disable_under_spinlock()" becomes a no-op is then an
> implementation detail - and not so different from how a regular
> preempt_disable() becomes a no-op when on UP (or with PREEMPT_NONE).
>
> And that "preempt_disable_under_spinlock()" really documents what is
> going on, and I feel would make that code easier to understand? The
> fact that PREEMPT_RT has different rules about preemption is not
> something that the dentry code should care about.
>
> The dentry code could just say "I want to disable preemption, and I
> already hold a spinlock, so do what is best".
>
> So then "preempt_disable_under_spinlock()" precisely documents what
> the dentry code really wants.
>
> No?

Fine by me, but I think that this is better dealt with by the rt folks;
I've no objections to replacing that open-coded stuff in dcache.c with
better documented primitives, so when such patches materialize...

2022-08-03 22:14:12

by Steven Rostedt

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 03, 2022 at 11:57:27AM -0700, Linus Torvalds wrote:
>
> I really dislike this pattern:
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_disable();
> ...
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_enable();
>
> and while the new comment explains *why* it exists, it's still very ugly indeed.
>
> We have it in a couple of other places, and we also end up having
> another variation on the theme that is about "migrate_{dis,en}able()",
> except it is written as
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> migrate_disable();
> else
> preempt_disable();
>
> because on non-PREEMPT_RT obviously preempt_disable() is the better
> and simpler thing.
>
> Can we please just introduce helper functions?
>
> At least that
>
> if (IS_ENABLED(CONFIG_PREEMPT_RT))
> preempt_disable();
> ...
>
> pattern could be much more naturally expressed as
>
> preempt_disable_under_spinlock();
> ...
>

The original patch years ago use to have:

preempt_disable_rt()

preempt_enable_rt()


That did exactly that, but an effort was made to get rid of it. But your more
descriptive "preempt_enable/disable_under_spinlock()" may make more sense.

-- Steve


2022-08-03 22:16:33

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 3, 2022 at 2:55 PM Steven Rostedt <[email protected]> wrote:
>
> The original patch years ago use to have:
>
> preempt_disable_rt()
>
> preempt_enable_rt()

That may be visually simpler, but I dislike how it's named for some
implementation detail, rather than for the semantic meaning.

Admittedly I think "preempt_enable_under_spinlock()" may be a bit
*too* cumbersome as a name. It does explain what is going on - and
both the implementation and the use end up being fairly clear (and the
non-RT case could have some debug version that actually tests that
preemption has already been disabled).

But it is also a ridiculously long name, no question about that.

I still feel is less cumbersome than having that
"IS_ENABLED(CONFIG_PREEMPT_RT)" test that also then pretty much
requires a comment to explain what is going on.

Linus

2022-08-03 23:05:11

by Steven Rostedt

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, 3 Aug 2022 15:09:23 -0700
Linus Torvalds <[email protected]> wrote:

> Admittedly I think "preempt_enable_under_spinlock()" may be a bit
> *too* cumbersome as a name. It does explain what is going on - and
> both the implementation and the use end up being fairly clear (and the
> non-RT case could have some debug version that actually tests that
> preemption has already been disabled).

preempt_disable_inlock() ?

-- Steve

2022-08-03 23:38:22

by Matthew Wilcox (Oracle)

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 03, 2022 at 06:59:36PM -0400, Steven Rostedt wrote:
> On Wed, 3 Aug 2022 15:09:23 -0700
> Linus Torvalds <[email protected]> wrote:
>
> > Admittedly I think "preempt_enable_under_spinlock()" may be a bit
> > *too* cumbersome as a name. It does explain what is going on - and
> > both the implementation and the use end up being fairly clear (and the
> > non-RT case could have some debug version that actually tests that
> > preemption has already been disabled).
>
> preempt_disable_inlock() ?

preempt_disable_locked()?

2022-08-03 23:45:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 3, 2022 at 4:24 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Aug 03, 2022 at 06:59:36PM -0400, Steven Rostedt wrote:
> >
> > preempt_disable_inlock() ?
>
> preempt_disable_locked()?

Heh. Shed painting in full glory.

Let's try just "preempt_enable_under_spinlock()" and see.

It's a bit long, but it's still shorter than the existing usage pattern.

And we don't have "inlock" anywhere else, and while "locked" is a real
pattern we have, it tends to be about other things (ie "I hold the
lock that you need, so don't take it").

And this is _explicitly_ only about spinning locks, because sleeping
locks don't do the preemption disable even without RT.

So let's make it verbose and clear and unambiguous. It's not like I
expect to see a _lot_ of those. Knock wood.

We had a handful of those things before (in mm/vmstat, and now added
another case to the dentry code. So it has become a pattern, but I
really really hope it's not exactly a common pattern.

And so because it's not common, typing a bit more is a good idea - and
making it really clear is probably also a good idea.

Linus

2022-08-04 01:05:30

by Matthew Wilcox (Oracle)

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 03, 2022 at 04:42:43PM -0700, Linus Torvalds wrote:
> On Wed, Aug 3, 2022 at 4:24 PM Matthew Wilcox <[email protected]> wrote:
> >
> > On Wed, Aug 03, 2022 at 06:59:36PM -0400, Steven Rostedt wrote:
> > >
> > > preempt_disable_inlock() ?
> >
> > preempt_disable_locked()?
>
> Heh. Shed painting in full glory.
>
> Let's try just "preempt_enable_under_spinlock()" and see.
>
> It's a bit long, but it's still shorter than the existing usage pattern.
>
> And we don't have "inlock" anywhere else, and while "locked" is a real
> pattern we have, it tends to be about other things (ie "I hold the
> lock that you need, so don't take it").
>
> And this is _explicitly_ only about spinning locks, because sleeping
> locks don't do the preemption disable even without RT.
>
> So let's make it verbose and clear and unambiguous. It's not like I
> expect to see a _lot_ of those. Knock wood.

Should we have it take a spinlock_t pointer? We could have lockdep
check it is actually held.

2022-08-04 02:05:41

by Steven Rostedt

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Thu, 4 Aug 2022 01:42:25 +0100
Matthew Wilcox <[email protected]> wrote:

> > So let's make it verbose and clear and unambiguous. It's not like I
> > expect to see a _lot_ of those. Knock wood.
>
> Should we have it take a spinlock_t pointer? We could have lockdep
> check it is actually held.

We don't care if the lock is held or not. The point of the matter is that
spinlocks in RT do not disable preemption. The code that the
preempt_disable_under_spinlock() is inside, can not be preempted. If it is,
bad things can happen.

Currently this code assumes that spinlocks disable preemption, so there's
no need to disable preemption here. But in RT, just holding the spinlock is
not enough to disable preemption, hence we need to explicitly call it here.

As Linus's name suggests, the "preempt_enable_under_spinlock" is to make
sure preemption is disabled regardless if it's under a normal spinlock that
disables preemption, or a RT spinlock that does not.

I wonder if raw_preempt_disable() would be another name to use? We have
raw_spin_lock() to denote that it's a real spinlock even under PREEMPT_RT.
We could say that "raw_preempt_disable()" makes sure the location really
has preemption disabled regardless of PREEMPT_RT.

-- Steve

2022-08-04 02:38:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 3, 2022 at 6:33 PM Steven Rostedt <[email protected]> wrote:
>
> We don't care if the lock is held or not. The point of the matter is that
> spinlocks in RT do not disable preemption. The code that the
> preempt_disable_under_spinlock() is inside, can not be preempted. If it is,
> bad things can happen.

I think you're missing Willy's point - the use would be to verify that
the spinlock really *is* held, because that's what disables preemption
on non-RT.

But no, I don't think it's worth the pain to have to specify which
spinlock is held, because the spinlock might have been taken by the
caller and we don't even have access to it - or care - we just know
somebody did take it.

If we want extra debuggingm it might be something like just verifying
that yes, the preempt count (on a non-RT preemptible kernel) really is
elevated already.

> I wonder if raw_preempt_disable() would be another name to use?

NO!

The point is that normal non-RT code does *not* disable preemption at
all, because it is already disabled thanks to the earlier spinlock.

So we definitely do *not* want to call this "raw_preempt_disable()",
because it's actually not supposed to normally disable anything at
all. Only for RT, where the spinlock code doesn't do it.

Linus

2022-08-04 10:57:29

by Steven Rostedt

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, 3 Aug 2022 19:16:12 -0700
Linus Torvalds <[email protected]> wrote:

> > I wonder if raw_preempt_disable() would be another name to use?
>
> NO!
>
> The point is that normal non-RT code does *not* disable preemption at
> all, because it is already disabled thanks to the earlier spinlock.
>
> So we definitely do *not* want to call this "raw_preempt_disable()",
> because it's actually not supposed to normally disable anything at
> all. Only for RT, where the spinlock code doesn't do it.

Yeah, I'm just brainstorming ideas on what we could use to make that name a
little shorter, and I'm not coming up with much.

OK, I'm becoming colorblind with this shed.

-- Steve

2022-08-08 22:44:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Wed, Aug 03 2022 at 16:42, Linus Torvalds wrote:
> On Wed, Aug 3, 2022 at 4:24 PM Matthew Wilcox <[email protected]> wrote:
>> On Wed, Aug 03, 2022 at 06:59:36PM -0400, Steven Rostedt wrote:
>> >
>> > preempt_disable_inlock() ?
>>
>> preempt_disable_locked()?
>
> Heh. Shed painting in full glory.
>
> Let's try just "preempt_enable_under_spinlock()" and see.
>
> It's a bit long, but it's still shorter than the existing usage pattern.
>
> And we don't have "inlock" anywhere else, and while "locked" is a real
> pattern we have, it tends to be about other things (ie "I hold the
> lock that you need, so don't take it").
>
> And this is _explicitly_ only about spinning locks, because sleeping
> locks don't do the preemption disable even without RT.
>
> So let's make it verbose and clear and unambiguous. It's not like I
> expect to see a _lot_ of those. Knock wood.
>
> We had a handful of those things before (in mm/vmstat, and now added
> another case to the dentry code. So it has become a pattern, but I
> really really hope it's not exactly a common pattern.
>
> And so because it's not common, typing a bit more is a good idea - and
> making it really clear is probably also a good idea.

Sebastian and me looked over it.

The use cases in mm/vmstat are not really all under spinlocks. That code
gets called also just from plain local_irq or even just preempt disabled
regions (depending on the stats item), which makes the proposed name
less accurate than you describe.

A worse case is the u64_stat code which is an ifdef maze (only partially
due to RT). Those stats updates can also be called from various contexts
where no spinlock is involved. That code is extra convoluted due to
irqsave variants and "optimizations" for 32bit UP. Removing the latter
would make a cleanup with write_seqcount_...() wrappers pretty simple.

Aside of that we have RT conditional preempt related code in
page_alloc() and kmap_atomic(). Both care only about the task staying
pinned on a CPU. In page_alloc() using preempt_disable() on !RT is more
lightweight than migrate_disable(). So something like
task_[un]pin_temporary() might work and be descriptive enough.

For kmap_atomic() it was decided back then when we introduced
kmap_local() that we don't do a wholesale conversion and leave it to the
maintainers/developers to look at it on a case by case basis as that has
quite some cleanup potential at the call sites. 18 month later we still
have 435 of the back then 527 call sites. Sadly enough there are 21 new
instances vs. 71 removed and about 20 related cleanup patches ignored.

So either we come up with something generic or we just resort to
different wrappers for those use cases. I'll have another look with
Sebastian tomorrow.

Thoughts?

Thanks,

tglx



2022-08-08 23:11:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Mon, Aug 8, 2022 at 3:06 PM Thomas Gleixner <[email protected]> wrote:
>
> The use cases in mm/vmstat are not really all under spinlocks. That code
> gets called also just from plain local_irq or even just preempt disabled
> regions (depending on the stats item), which makes the proposed name
> less accurate than you describe.

Augh.

How about "preempt_disable_nested()" with a big comment about how some
operations normally disable preemption (interrupts off, spinlocks,
anything else?) but not on PREEMPT_RT?

> A worse case is the u64_stat code which is an ifdef maze (only partially
> due to RT). Those stats updates can also be called from various contexts
> where no spinlock is involved. That code is extra convoluted due to
> irqsave variants and "optimizations" for 32bit UP. Removing the latter
> would make a cleanup with write_seqcount_...() wrappers pretty simple.

I think we most definitely can start removing optimisations for 32-bit
UP by now.

Let's not do them without any reason, but any time you hit a code that
makes you go "this makes it harder to do better", feel free to go all
Alexander the Great on the 32-bit UP code and just cut through the
problem by removing it.

> Aside of that we have RT conditional preempt related code in
> page_alloc() and kmap_atomic(). Both care only about the task staying
> pinned on a CPU. In page_alloc() using preempt_disable() on !RT is more
> lightweight than migrate_disable(). So something like
> task_[un]pin_temporary() might work and be descriptive enough.

Yeah, that was the other odd pattern. I'm not sure "temporary" is all
that relevant, but yes, if we end up having more of those, some kind
of "thread_{un]pin_cpu()" would probably be worth it.

But the kmap code may be so special that nothing else has _that_
particular issue.

Linus

2022-08-09 16:23:22

by Matthew Wilcox (Oracle)

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Tue, Aug 09, 2022 at 06:00:17PM +0200, Thomas Gleixner wrote:
> On Mon, Aug 08 2022 at 15:43, Linus Torvalds wrote:
> > But the kmap code may be so special that nothing else has _that_
> > particular issue.
>
> We just want to get rid of kmap_atomic() completely. I'll go and find
> minions.

Be sure to coordinate with Ira & Fabio who are working on this.

2022-08-09 16:35:05

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

Linus,

On Mon, Aug 08 2022 at 15:43, Linus Torvalds wrote:
> On Mon, Aug 8, 2022 at 3:06 PM Thomas Gleixner <[email protected]> wrote:
>> The use cases in mm/vmstat are not really all under spinlocks. That code
>> gets called also just from plain local_irq or even just preempt disabled
>> regions (depending on the stats item), which makes the proposed name
>> less accurate than you describe.
>
> Augh.
>
> How about "preempt_disable_nested()" with a big comment about how some
> operations normally disable preemption (interrupts off, spinlocks,
> anything else?) but not on PREEMPT_RT?

Let me do that.

>> A worse case is the u64_stat code which is an ifdef maze (only partially
>> due to RT). Those stats updates can also be called from various contexts
>> where no spinlock is involved. That code is extra convoluted due to
>> irqsave variants and "optimizations" for 32bit UP. Removing the latter
>> would make a cleanup with write_seqcount_...() wrappers pretty simple.
>
> I think we most definitely can start removing optimisations for 32-bit
> UP by now.
>
> Let's not do them without any reason, but any time you hit a code that
> makes you go "this makes it harder to do better", feel free to go all
> Alexander the Great on the 32-bit UP code and just cut through the
> problem by removing it.

With that mopped up:

1 file changed, 42 insertions(+), 84 deletions(-)

plus a followup cleanup of the then not longer required
_irqsave/restore() variants:

8 files changed, 33 insertions(+), 62 deletions(-)

is not a Great conquest, but makes the code definitely readable.

The fetch/retry_irq() variants are then obsolete as well, but that's just
a rename in 70 files and the removal of the two helpers.

>> Aside of that we have RT conditional preempt related code in
>> page_alloc() and kmap_atomic(). Both care only about the task staying
>> pinned on a CPU. In page_alloc() using preempt_disable() on !RT is more
>> lightweight than migrate_disable(). So something like
>> task_[un]pin_temporary() might work and be descriptive enough.
>
> Yeah, that was the other odd pattern. I'm not sure "temporary" is all
> that relevant, but yes, if we end up having more of those, some kind
> of "thread_{un]pin_cpu()" would probably be worth it.
>
> But the kmap code may be so special that nothing else has _that_
> particular issue.

We just want to get rid of kmap_atomic() completely. I'll go and find
minions.

Thanks,

tglx

2022-08-09 18:45:50

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [git pull] vfs.git pile 3 - dcache

On Tue, Aug 09 2022 at 17:15, Matthew Wilcox wrote:
> On Tue, Aug 09, 2022 at 06:00:17PM +0200, Thomas Gleixner wrote:
>> On Mon, Aug 08 2022 at 15:43, Linus Torvalds wrote:
>> > But the kmap code may be so special that nothing else has _that_
>> > particular issue.
>>
>> We just want to get rid of kmap_atomic() completely. I'll go and find
>> minions.
>
> Be sure to coordinate with Ira & Fabio who are working on this.

Will do.