2020-09-15 08:41:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Mon, Sep 14 2020 at 15:24, Linus Torvalds wrote:
> On Mon, Sep 14, 2020 at 2:55 PM Thomas Gleixner <[email protected]> wrote:
>>
>> Yes it does generate better code, but I tried hard to spot a difference
>> in various metrics exposed by perf. It's all in the noise and I only
>> can spot a difference when the actual preemption check after the
>> decrement
>
> I'm somewhat more worried about the small-device case.

I just checked on one of my old UP ARM toys which I run at home. The .text
increase is about 2% (75k) and none of the tests I ran showed any
significant difference. Couldn't verify with perf though as the PMU on
that piece of art is unusable.

> That said, the diffstat certainly has its very clear charm, and I do
> agree that it makes things simpler.
>
> I'm just not convinced people should ever EVER do things like that "if
> (preemptible())" garbage. It sounds like somebody is doing seriously
> bad things.

OTOH, having a working 'preemptible()' or maybe better named
'can_schedule()' check makes tons of sense to make decisions about
allocation modes or other things.

We're currently looking through all of in_atomic(), in_interrupt()
etc. usage sites and quite some of them are historic and have the clear
intent of checking whether the code is called from task context or
hard/softirq context. Lots of them are completely broken or just work by
chance.

But there is clearly historic precendence that context checks are
useful, but they only can be useful if we have a consistent mechanism
which works everywhere.

Of course we could mandate that every interface which might be called
from one or the other context has a context argument or provides two
variants of the same thing. But I'm not really convinced whether that's
a win over having a consistent and reliable set of checks.

Thanks,

tglx





2020-09-15 17:58:38

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
>
> OTOH, having a working 'preemptible()' or maybe better named
> 'can_schedule()' check makes tons of sense to make decisions about
> allocation modes or other things.

No. I think that those kinds of decisions about actual behavior are
always simply fundamentally wrong.

Note that this is very different from having warnings about invalid
use. THAT is correct. It may not warn in all configurations, but that
doesn't matter: what matters is that it warns in common enough
configurations that developers will catch it.

So having a warning in "might_sleep()" that doesn't always trigger,
because you have a limited configuration that can't even detect the
situation, that's fine and dandy and intentional.

But having code like

if (can_schedule())
.. do something different ..

is fundamentally complete and utter garbage.

It's one thing if you test for "am I in hardware interrupt context".
Those tests aren't great either, but at least they make sense.

But a driver - or some library routine - making a difference based on
some nebulous "can I schedule" is fundamentally and basically WRONG.

If some code changes behavior, it needs to be explicit to the *caller*
of that code.

So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
do_something_atomic()" is pure shite.

And I am not IN THE LEAST interested in trying to help people doing
pure shite. We need to fix them. Like the crypto code is getting
fixed.

Linus

2020-09-15 20:00:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue, Sep 15 2020 at 10:35, Linus Torvalds wrote:
> On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
>>
>> OTOH, having a working 'preemptible()' or maybe better named
>> 'can_schedule()' check makes tons of sense to make decisions about
>> allocation modes or other things.
>
> No. I think that those kinds of decisions about actual behavior are
> always simply fundamentally wrong.
>
> Note that this is very different from having warnings about invalid
> use. THAT is correct. It may not warn in all configurations, but that
> doesn't matter: what matters is that it warns in common enough
> configurations that developers will catch it.

You wish. I just found a 7 year old bug in a 10G network driver which
surely would have been found if people would enable debug configs and
not just run the crap on their PREEMPT_NONE, all debug off kernel. And
that driver is not subject to bitrot, it gets regular bug fixes from
people who seem to care (distro folks).

> So having a warning in "might_sleep()" that doesn't always trigger,
> because you have a limited configuration that can't even detect the
> situation, that's fine and dandy and intentional.

and lets people get away with their crap.

> But having code like
>
> if (can_schedule())
> .. do something different ..
>
> is fundamentally complete and utter garbage.
>
> It's one thing if you test for "am I in hardware interrupt context".
> Those tests aren't great either, but at least they make sense.

They make sense in limited situations like exception handlers and such
which really have to know from which context an exception was raised.

But with the above reasoning such checks do not make sense in any other
general code. 'in hard interrupt context' is just another context where
you can't do stuff which you can do when in preemptible task context.

Most tests are way broader than a single context. in_interrupt() is true
for hard interrupt, soft interrupt delivery and all BH disabled
contexts, which is completely ill defined.

> But a driver - or some library routine - making a difference based on
> some nebulous "can I schedule" is fundamentally and basically WRONG.
>
> If some code changes behavior, it needs to be explicit to the *caller*
> of that code.

I'm fine with that, but then we have to be consequent and ban _all_ of
these and not just declare can_schedule() to be a bad one.

Thanks,

tglx

2020-09-16 07:40:34

by Daniel Vetter

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
<[email protected]> wrote:
>
> On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> >
> > OTOH, having a working 'preemptible()' or maybe better named
> > 'can_schedule()' check makes tons of sense to make decisions about
> > allocation modes or other things.
>
> No. I think that those kinds of decisions about actual behavior are
> always simply fundamentally wrong.
>
> Note that this is very different from having warnings about invalid
> use. THAT is correct. It may not warn in all configurations, but that
> doesn't matter: what matters is that it warns in common enough
> configurations that developers will catch it.
>
> So having a warning in "might_sleep()" that doesn't always trigger,
> because you have a limited configuration that can't even detect the
> situation, that's fine and dandy and intentional.
>
> But having code like
>
> if (can_schedule())
> .. do something different ..
>
> is fundamentally complete and utter garbage.
>
> It's one thing if you test for "am I in hardware interrupt context".
> Those tests aren't great either, but at least they make sense.
>
> But a driver - or some library routine - making a difference based on
> some nebulous "can I schedule" is fundamentally and basically WRONG.
>
> If some code changes behavior, it needs to be explicit to the *caller*
> of that code.
>
> So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> do_something_atomic()" is pure shite.
>
> And I am not IN THE LEAST interested in trying to help people doing
> pure shite. We need to fix them. Like the crypto code is getting
> fixed.

Just figured I'll throw my +1 in from reading too many (gpu) drivers.
Code that tries to cleverly adjust its behaviour depending upon the
context it's running in is harder to understand and blows up in more
interesting ways. We still have drm_can_sleep() and it's mostly just
used for debug code, and I've largely ended up just deleting
everything that used it because when you're driver is blowing up the
last thing you want is to realize your debug code and output can't be
relied upon. Or worse, that the only Oops you have is the one in the
debug code, because the real one scrolled away - the original idea
behind drm_can_sleep was to make all the modeset code work
automagically both in normal ioctl/kworker context and in the panic
handlers or kgdb callbacks. Wishful thinking at best.

Also at least for me that extends to everything, e.g. I much prefer
explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
locks shared with interrupt handlers, since the former two gives me
clear information from which contexts such function can be called.
Other end is the memalloc_no*_save/restore functions, where I recently
made a real big fool of myself because I didn't realize how much that
impacts everything that's run within - suddenly "GFP_KERNEL for small
stuff never fails" is wrong everywhere.

It's all great for debugging and sanity checks (and we run with all
that stuff enabled in our CI), but really semantic changes depending
upon magic context checks freak my out :-)
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-09-16 18:44:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue, Sep 15, 2020 at 12:57 PM Thomas Gleixner <[email protected]> wrote:
>
> You wish. I just found a 7 year old bug in a 10G network driver which
> surely would have been found if people would enable debug configs and
> not just run the crap on their PREEMPT_NONE, all debug off kernel. And
> that driver is not subject to bitrot, it gets regular bug fixes from
> people who seem to care (distro folks).

That driver clearly cannot be very well maintained. All the distro
kernels have the basic debug checks in place, afaik.

Is it some wonderful "enterprise hardware" garbage again that only
gets used in special data centers?

Becasue the "enterprise" people really are special. Very much in the
"short bus" special kind of way. The fact that they have fooled so
much of the industry into thinking that they are the competent and
serious people is a disgrace.

Linus

2020-09-16 19:06:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> <[email protected]> wrote:
> >
> > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > >
> > > OTOH, having a working 'preemptible()' or maybe better named
> > > 'can_schedule()' check makes tons of sense to make decisions about
> > > allocation modes or other things.
> >
> > No. I think that those kinds of decisions about actual behavior are
> > always simply fundamentally wrong.
> >
> > Note that this is very different from having warnings about invalid
> > use. THAT is correct. It may not warn in all configurations, but that
> > doesn't matter: what matters is that it warns in common enough
> > configurations that developers will catch it.
> >
> > So having a warning in "might_sleep()" that doesn't always trigger,
> > because you have a limited configuration that can't even detect the
> > situation, that's fine and dandy and intentional.
> >
> > But having code like
> >
> > if (can_schedule())
> > .. do something different ..
> >
> > is fundamentally complete and utter garbage.
> >
> > It's one thing if you test for "am I in hardware interrupt context".
> > Those tests aren't great either, but at least they make sense.
> >
> > But a driver - or some library routine - making a difference based on
> > some nebulous "can I schedule" is fundamentally and basically WRONG.
> >
> > If some code changes behavior, it needs to be explicit to the *caller*
> > of that code.
> >
> > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > do_something_atomic()" is pure shite.
> >
> > And I am not IN THE LEAST interested in trying to help people doing
> > pure shite. We need to fix them. Like the crypto code is getting
> > fixed.
>
> Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> Code that tries to cleverly adjust its behaviour depending upon the
> context it's running in is harder to understand and blows up in more
> interesting ways. We still have drm_can_sleep() and it's mostly just
> used for debug code, and I've largely ended up just deleting
> everything that used it because when you're driver is blowing up the
> last thing you want is to realize your debug code and output can't be
> relied upon. Or worse, that the only Oops you have is the one in the
> debug code, because the real one scrolled away - the original idea
> behind drm_can_sleep was to make all the modeset code work
> automagically both in normal ioctl/kworker context and in the panic
> handlers or kgdb callbacks. Wishful thinking at best.
>
> Also at least for me that extends to everything, e.g. I much prefer
> explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> locks shared with interrupt handlers, since the former two gives me
> clear information from which contexts such function can be called.
> Other end is the memalloc_no*_save/restore functions, where I recently
> made a real big fool of myself because I didn't realize how much that
> impacts everything that's run within - suddenly "GFP_KERNEL for small
> stuff never fails" is wrong everywhere.
>
> It's all great for debugging and sanity checks (and we run with all
> that stuff enabled in our CI), but really semantic changes depending
> upon magic context checks freak my out :-)

All fair, but some of us need to write code that must handle being
invoked from a wide variety of contexts. Now perhaps you like the idea of
call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
are held, and so on. However, from what I can see, most people instead
consistently prefer that the RCU API instead be consolidated.

Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
needs to be able to allocate memory occasionally. It can do that when
invoked from some contexts, but not when invoked from others. Right now,
in !PREEMPT kernels, it cannot tell, and must either do things to the
memory allocators that some of the MM hate or must unnecessarily invoke
workqueues. Thomas's patches would allow the code to just allocate in
the common case when these primitives are invoked from contexts where
allocation is permitted.

If we want to restrict access to the can_schedule() or whatever primitive,
fine and good. We can add a check to checkpatch.pl, for example. Maybe
we can go back to the old brlock approach of requiring certain people's
review for each addition to the kernel.

But there really are use cases that it would greatly help.

Thanx, Paul

2020-09-16 20:30:54

by Daniel Vetter

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > >
> > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > allocation modes or other things.
> > >
> > > No. I think that those kinds of decisions about actual behavior are
> > > always simply fundamentally wrong.
> > >
> > > Note that this is very different from having warnings about invalid
> > > use. THAT is correct. It may not warn in all configurations, but that
> > > doesn't matter: what matters is that it warns in common enough
> > > configurations that developers will catch it.
> > >
> > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > because you have a limited configuration that can't even detect the
> > > situation, that's fine and dandy and intentional.
> > >
> > > But having code like
> > >
> > > if (can_schedule())
> > > .. do something different ..
> > >
> > > is fundamentally complete and utter garbage.
> > >
> > > It's one thing if you test for "am I in hardware interrupt context".
> > > Those tests aren't great either, but at least they make sense.
> > >
> > > But a driver - or some library routine - making a difference based on
> > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > >
> > > If some code changes behavior, it needs to be explicit to the *caller*
> > > of that code.
> > >
> > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > do_something_atomic()" is pure shite.
> > >
> > > And I am not IN THE LEAST interested in trying to help people doing
> > > pure shite. We need to fix them. Like the crypto code is getting
> > > fixed.
> >
> > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > Code that tries to cleverly adjust its behaviour depending upon the
> > context it's running in is harder to understand and blows up in more
> > interesting ways. We still have drm_can_sleep() and it's mostly just
> > used for debug code, and I've largely ended up just deleting
> > everything that used it because when you're driver is blowing up the
> > last thing you want is to realize your debug code and output can't be
> > relied upon. Or worse, that the only Oops you have is the one in the
> > debug code, because the real one scrolled away - the original idea
> > behind drm_can_sleep was to make all the modeset code work
> > automagically both in normal ioctl/kworker context and in the panic
> > handlers or kgdb callbacks. Wishful thinking at best.
> >
> > Also at least for me that extends to everything, e.g. I much prefer
> > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > locks shared with interrupt handlers, since the former two gives me
> > clear information from which contexts such function can be called.
> > Other end is the memalloc_no*_save/restore functions, where I recently
> > made a real big fool of myself because I didn't realize how much that
> > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > stuff never fails" is wrong everywhere.
> >
> > It's all great for debugging and sanity checks (and we run with all
> > that stuff enabled in our CI), but really semantic changes depending
> > upon magic context checks freak my out :-)
>
> All fair, but some of us need to write code that must handle being
> invoked from a wide variety of contexts. Now perhaps you like the idea of
> call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> are held, and so on. However, from what I can see, most people instead
> consistently prefer that the RCU API instead be consolidated.
>
> Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> needs to be able to allocate memory occasionally. It can do that when
> invoked from some contexts, but not when invoked from others. Right now,
> in !PREEMPT kernels, it cannot tell, and must either do things to the
> memory allocators that some of the MM hate or must unnecessarily invoke
> workqueues. Thomas's patches would allow the code to just allocate in
> the common case when these primitives are invoked from contexts where
> allocation is permitted.
>
> If we want to restrict access to the can_schedule() or whatever primitive,
> fine and good. We can add a check to checkpatch.pl, for example. Maybe
> we can go back to the old brlock approach of requiring certain people's
> review for each addition to the kernel.
>
> But there really are use cases that it would greatly help.

We can deadlock in random fun places if random stuff we're calling
suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
make it extra fun to reproduce. Maybe most driver subsystems are less
brittle, but gpu drivers definitely need to know about the details for
exactly this example. And yes gpu drivers use rcu for freeing
dma_fence structures, and that tends to happen in code that we only
recently figured out should really not allocate memory.

I think minimally you need to throw in an unconditional
fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
with full debugging knows what might happen. It's kinda like
might_sleep, but a lot more specific. might_sleep() alone is not
enough, because in the specific code paths I'm thinking of (and
created special lockdep annotations for just recently) sleeping is
allowed, but any memory allocations with GFP_RECLAIM set are no-go.

Cheers, Daniel




--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-09-16 21:00:26

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > <[email protected]> wrote:
> > > >
> > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > > >
> > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > allocation modes or other things.
> > > >
> > > > No. I think that those kinds of decisions about actual behavior are
> > > > always simply fundamentally wrong.
> > > >
> > > > Note that this is very different from having warnings about invalid
> > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > doesn't matter: what matters is that it warns in common enough
> > > > configurations that developers will catch it.
> > > >
> > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > because you have a limited configuration that can't even detect the
> > > > situation, that's fine and dandy and intentional.
> > > >
> > > > But having code like
> > > >
> > > > if (can_schedule())
> > > > .. do something different ..
> > > >
> > > > is fundamentally complete and utter garbage.
> > > >
> > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > Those tests aren't great either, but at least they make sense.
> > > >
> > > > But a driver - or some library routine - making a difference based on
> > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > >
> > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > of that code.
> > > >
> > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > do_something_atomic()" is pure shite.
> > > >
> > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > fixed.
> > >
> > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > Code that tries to cleverly adjust its behaviour depending upon the
> > > context it's running in is harder to understand and blows up in more
> > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > used for debug code, and I've largely ended up just deleting
> > > everything that used it because when you're driver is blowing up the
> > > last thing you want is to realize your debug code and output can't be
> > > relied upon. Or worse, that the only Oops you have is the one in the
> > > debug code, because the real one scrolled away - the original idea
> > > behind drm_can_sleep was to make all the modeset code work
> > > automagically both in normal ioctl/kworker context and in the panic
> > > handlers or kgdb callbacks. Wishful thinking at best.
> > >
> > > Also at least for me that extends to everything, e.g. I much prefer
> > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > locks shared with interrupt handlers, since the former two gives me
> > > clear information from which contexts such function can be called.
> > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > made a real big fool of myself because I didn't realize how much that
> > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > stuff never fails" is wrong everywhere.
> > >
> > > It's all great for debugging and sanity checks (and we run with all
> > > that stuff enabled in our CI), but really semantic changes depending
> > > upon magic context checks freak my out :-)
> >
> > All fair, but some of us need to write code that must handle being
> > invoked from a wide variety of contexts. Now perhaps you like the idea of
> > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > are held, and so on. However, from what I can see, most people instead
> > consistently prefer that the RCU API instead be consolidated.
> >
> > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > needs to be able to allocate memory occasionally. It can do that when
> > invoked from some contexts, but not when invoked from others. Right now,
> > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > memory allocators that some of the MM hate or must unnecessarily invoke
> > workqueues. Thomas's patches would allow the code to just allocate in
> > the common case when these primitives are invoked from contexts where
> > allocation is permitted.
> >
> > If we want to restrict access to the can_schedule() or whatever primitive,
> > fine and good. We can add a check to checkpatch.pl, for example. Maybe
> > we can go back to the old brlock approach of requiring certain people's
> > review for each addition to the kernel.
> >
> > But there really are use cases that it would greatly help.
>
> We can deadlock in random fun places if random stuff we're calling
> suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> make it extra fun to reproduce. Maybe most driver subsystems are less
> brittle, but gpu drivers definitely need to know about the details for
> exactly this example. And yes gpu drivers use rcu for freeing
> dma_fence structures, and that tends to happen in code that we only
> recently figured out should really not allocate memory.
>
> I think minimally you need to throw in an unconditional
> fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> with full debugging knows what might happen. It's kinda like
> might_sleep, but a lot more specific. might_sleep() alone is not
> enough, because in the specific code paths I'm thinking of (and
> created special lockdep annotations for just recently) sleeping is
> allowed, but any memory allocations with GFP_RECLAIM set are no-go.

Completely agreed! Any allocation on any free path must be handled
-extremely- carefully. To that end...

First, there is always a fallback in case the allocation fails. Which
might have performance or corner-case robustness issues, but which will
at least allow forward progress. Second, we consulted with a number of
MM experts to arrive at appropriate GFP_* flags (and their patience is
greatly appreciated). Third, the paths that can allocate will do so about
one time of 500, so any issues should be spotted sooner rather than later.

So you are quite right to be concerned, but I believe we will be doing the
right things. And based on his previous track record, I am also quite
certain that Mr. Murphy will be on hand to provide me any additional
education that I might require.

Finally, I have noted down your point about fs_reclaim_acquire() and
fs_reclaim_release(). Whether or not they prove to be needed, I do
appreciate your calling them to my attention.

Thanx, Paul

2020-09-16 22:24:36

by Daniel Vetter

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > > > >
> > > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > > allocation modes or other things.
> > > > >
> > > > > No. I think that those kinds of decisions about actual behavior are
> > > > > always simply fundamentally wrong.
> > > > >
> > > > > Note that this is very different from having warnings about invalid
> > > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > > doesn't matter: what matters is that it warns in common enough
> > > > > configurations that developers will catch it.
> > > > >
> > > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > > because you have a limited configuration that can't even detect the
> > > > > situation, that's fine and dandy and intentional.
> > > > >
> > > > > But having code like
> > > > >
> > > > > if (can_schedule())
> > > > > .. do something different ..
> > > > >
> > > > > is fundamentally complete and utter garbage.
> > > > >
> > > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > > Those tests aren't great either, but at least they make sense.
> > > > >
> > > > > But a driver - or some library routine - making a difference based on
> > > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > > >
> > > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > > of that code.
> > > > >
> > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > > do_something_atomic()" is pure shite.
> > > > >
> > > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > > fixed.
> > > >
> > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > > Code that tries to cleverly adjust its behaviour depending upon the
> > > > context it's running in is harder to understand and blows up in more
> > > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > > used for debug code, and I've largely ended up just deleting
> > > > everything that used it because when you're driver is blowing up the
> > > > last thing you want is to realize your debug code and output can't be
> > > > relied upon. Or worse, that the only Oops you have is the one in the
> > > > debug code, because the real one scrolled away - the original idea
> > > > behind drm_can_sleep was to make all the modeset code work
> > > > automagically both in normal ioctl/kworker context and in the panic
> > > > handlers or kgdb callbacks. Wishful thinking at best.
> > > >
> > > > Also at least for me that extends to everything, e.g. I much prefer
> > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > > locks shared with interrupt handlers, since the former two gives me
> > > > clear information from which contexts such function can be called.
> > > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > > made a real big fool of myself because I didn't realize how much that
> > > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > > stuff never fails" is wrong everywhere.
> > > >
> > > > It's all great for debugging and sanity checks (and we run with all
> > > > that stuff enabled in our CI), but really semantic changes depending
> > > > upon magic context checks freak my out :-)
> > >
> > > All fair, but some of us need to write code that must handle being
> > > invoked from a wide variety of contexts. Now perhaps you like the idea of
> > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > > are held, and so on. However, from what I can see, most people instead
> > > consistently prefer that the RCU API instead be consolidated.
> > >
> > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > > needs to be able to allocate memory occasionally. It can do that when
> > > invoked from some contexts, but not when invoked from others. Right now,
> > > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > > memory allocators that some of the MM hate or must unnecessarily invoke
> > > workqueues. Thomas's patches would allow the code to just allocate in
> > > the common case when these primitives are invoked from contexts where
> > > allocation is permitted.
> > >
> > > If we want to restrict access to the can_schedule() or whatever primitive,
> > > fine and good. We can add a check to checkpatch.pl, for example. Maybe
> > > we can go back to the old brlock approach of requiring certain people's
> > > review for each addition to the kernel.
> > >
> > > But there really are use cases that it would greatly help.
> >
> > We can deadlock in random fun places if random stuff we're calling
> > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> > make it extra fun to reproduce. Maybe most driver subsystems are less
> > brittle, but gpu drivers definitely need to know about the details for
> > exactly this example. And yes gpu drivers use rcu for freeing
> > dma_fence structures, and that tends to happen in code that we only
> > recently figured out should really not allocate memory.
> >
> > I think minimally you need to throw in an unconditional
> > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> > with full debugging knows what might happen. It's kinda like
> > might_sleep, but a lot more specific. might_sleep() alone is not
> > enough, because in the specific code paths I'm thinking of (and
> > created special lockdep annotations for just recently) sleeping is
> > allowed, but any memory allocations with GFP_RECLAIM set are no-go.
>
> Completely agreed! Any allocation on any free path must be handled
> -extremely- carefully. To that end...
>
> First, there is always a fallback in case the allocation fails. Which
> might have performance or corner-case robustness issues, but which will
> at least allow forward progress. Second, we consulted with a number of
> MM experts to arrive at appropriate GFP_* flags (and their patience is
> greatly appreciated). Third, the paths that can allocate will do so about
> one time of 500, so any issues should be spotted sooner rather than later.
>
> So you are quite right to be concerned, but I believe we will be doing the
> right things. And based on his previous track record, I am also quite
> certain that Mr. Murphy will be on hand to provide me any additional
> education that I might require.
>
> Finally, I have noted down your point about fs_reclaim_acquire() and
> fs_reclaim_release(). Whether or not they prove to be needed, I do
> appreciate your calling them to my attention.

I just realized that since these dma_fence structs are refcounted and
userspace can hold references (directly, it can pass them around
behind file descriptors) we might never hit such a path until slightly
unusual or evil userspace does something interesting. Do you have
links to those patches? Some googling didn't turn up anything. I can
then figure out whether it's better to risk not spotting issues with
call_rcu vs slapping a memalloc_noio_save/restore around all these
critical section which force-degrades any allocation to GFP_ATOMIC at
most, but has the risk that we run into code that assumes "GFP_KERNEL
never fails for small stuff" and has a decidedly less tested fallback
path than rcu code.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-09-16 22:43:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote:
> On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > > > > >
> > > > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > > > allocation modes or other things.
> > > > > >
> > > > > > No. I think that those kinds of decisions about actual behavior are
> > > > > > always simply fundamentally wrong.
> > > > > >
> > > > > > Note that this is very different from having warnings about invalid
> > > > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > > > doesn't matter: what matters is that it warns in common enough
> > > > > > configurations that developers will catch it.
> > > > > >
> > > > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > > > because you have a limited configuration that can't even detect the
> > > > > > situation, that's fine and dandy and intentional.
> > > > > >
> > > > > > But having code like
> > > > > >
> > > > > > if (can_schedule())
> > > > > > .. do something different ..
> > > > > >
> > > > > > is fundamentally complete and utter garbage.
> > > > > >
> > > > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > > > Those tests aren't great either, but at least they make sense.
> > > > > >
> > > > > > But a driver - or some library routine - making a difference based on
> > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > > > >
> > > > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > > > of that code.
> > > > > >
> > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > > > do_something_atomic()" is pure shite.
> > > > > >
> > > > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > > > fixed.
> > > > >
> > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > > > Code that tries to cleverly adjust its behaviour depending upon the
> > > > > context it's running in is harder to understand and blows up in more
> > > > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > > > used for debug code, and I've largely ended up just deleting
> > > > > everything that used it because when you're driver is blowing up the
> > > > > last thing you want is to realize your debug code and output can't be
> > > > > relied upon. Or worse, that the only Oops you have is the one in the
> > > > > debug code, because the real one scrolled away - the original idea
> > > > > behind drm_can_sleep was to make all the modeset code work
> > > > > automagically both in normal ioctl/kworker context and in the panic
> > > > > handlers or kgdb callbacks. Wishful thinking at best.
> > > > >
> > > > > Also at least for me that extends to everything, e.g. I much prefer
> > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > > > locks shared with interrupt handlers, since the former two gives me
> > > > > clear information from which contexts such function can be called.
> > > > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > > > made a real big fool of myself because I didn't realize how much that
> > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > > > stuff never fails" is wrong everywhere.
> > > > >
> > > > > It's all great for debugging and sanity checks (and we run with all
> > > > > that stuff enabled in our CI), but really semantic changes depending
> > > > > upon magic context checks freak my out :-)
> > > >
> > > > All fair, but some of us need to write code that must handle being
> > > > invoked from a wide variety of contexts. Now perhaps you like the idea of
> > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > > > are held, and so on. However, from what I can see, most people instead
> > > > consistently prefer that the RCU API instead be consolidated.
> > > >
> > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > > > needs to be able to allocate memory occasionally. It can do that when
> > > > invoked from some contexts, but not when invoked from others. Right now,
> > > > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > > > memory allocators that some of the MM hate or must unnecessarily invoke
> > > > workqueues. Thomas's patches would allow the code to just allocate in
> > > > the common case when these primitives are invoked from contexts where
> > > > allocation is permitted.
> > > >
> > > > If we want to restrict access to the can_schedule() or whatever primitive,
> > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe
> > > > we can go back to the old brlock approach of requiring certain people's
> > > > review for each addition to the kernel.
> > > >
> > > > But there really are use cases that it would greatly help.
> > >
> > > We can deadlock in random fun places if random stuff we're calling
> > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> > > make it extra fun to reproduce. Maybe most driver subsystems are less
> > > brittle, but gpu drivers definitely need to know about the details for
> > > exactly this example. And yes gpu drivers use rcu for freeing
> > > dma_fence structures, and that tends to happen in code that we only
> > > recently figured out should really not allocate memory.
> > >
> > > I think minimally you need to throw in an unconditional
> > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> > > with full debugging knows what might happen. It's kinda like
> > > might_sleep, but a lot more specific. might_sleep() alone is not
> > > enough, because in the specific code paths I'm thinking of (and
> > > created special lockdep annotations for just recently) sleeping is
> > > allowed, but any memory allocations with GFP_RECLAIM set are no-go.
> >
> > Completely agreed! Any allocation on any free path must be handled
> > -extremely- carefully. To that end...
> >
> > First, there is always a fallback in case the allocation fails. Which
> > might have performance or corner-case robustness issues, but which will
> > at least allow forward progress. Second, we consulted with a number of
> > MM experts to arrive at appropriate GFP_* flags (and their patience is
> > greatly appreciated). Third, the paths that can allocate will do so about
> > one time of 500, so any issues should be spotted sooner rather than later.
> >
> > So you are quite right to be concerned, but I believe we will be doing the
> > right things. And based on his previous track record, I am also quite
> > certain that Mr. Murphy will be on hand to provide me any additional
> > education that I might require.
> >
> > Finally, I have noted down your point about fs_reclaim_acquire() and
> > fs_reclaim_release(). Whether or not they prove to be needed, I do
> > appreciate your calling them to my attention.
>
> I just realized that since these dma_fence structs are refcounted and
> userspace can hold references (directly, it can pass them around
> behind file descriptors) we might never hit such a path until slightly
> unusual or evil userspace does something interesting. Do you have
> links to those patches? Some googling didn't turn up anything. I can
> then figure out whether it's better to risk not spotting issues with
> call_rcu vs slapping a memalloc_noio_save/restore around all these
> critical section which force-degrades any allocation to GFP_ATOMIC at
> most, but has the risk that we run into code that assumes "GFP_KERNEL
> never fails for small stuff" and has a decidedly less tested fallback
> path than rcu code.

Here is the previous early draft version, which will change considerably
for the next version:

lore.kernel.org/lkml/[email protected]

This does kvfree_rcu(), but we expect to handle call_rcu() similarly.

The version in preparation will use workqueues to do the allocation in a
known-safe environment and also use lockless access to certain portions
of the allocator caches (as noted earlier, this last is not much loved
by some of the MM guys). Given Thomas's patch, we could with high
probability allocate directly, perhaps even not needing memory-allocator
modifications.

Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the
allocator to do anything that the calling context prohibits. So what
types of bugs are you looking for? Where reclaim calls back into the
driver or some such?

Thanx, Paul

2020-09-17 07:54:26

by Daniel Vetter

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Thu, Sep 17, 2020 at 12:39 AM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote:
> > On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <[email protected]> wrote:
> > >
> > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> > > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
> > > > >
> > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > > > > > >
> > > > > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > > > > allocation modes or other things.
> > > > > > >
> > > > > > > No. I think that those kinds of decisions about actual behavior are
> > > > > > > always simply fundamentally wrong.
> > > > > > >
> > > > > > > Note that this is very different from having warnings about invalid
> > > > > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > > > > doesn't matter: what matters is that it warns in common enough
> > > > > > > configurations that developers will catch it.
> > > > > > >
> > > > > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > > > > because you have a limited configuration that can't even detect the
> > > > > > > situation, that's fine and dandy and intentional.
> > > > > > >
> > > > > > > But having code like
> > > > > > >
> > > > > > > if (can_schedule())
> > > > > > > .. do something different ..
> > > > > > >
> > > > > > > is fundamentally complete and utter garbage.
> > > > > > >
> > > > > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > > > > Those tests aren't great either, but at least they make sense.
> > > > > > >
> > > > > > > But a driver - or some library routine - making a difference based on
> > > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > > > > >
> > > > > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > > > > of that code.
> > > > > > >
> > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > > > > do_something_atomic()" is pure shite.
> > > > > > >
> > > > > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > > > > fixed.
> > > > > >
> > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > > > > Code that tries to cleverly adjust its behaviour depending upon the
> > > > > > context it's running in is harder to understand and blows up in more
> > > > > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > > > > used for debug code, and I've largely ended up just deleting
> > > > > > everything that used it because when you're driver is blowing up the
> > > > > > last thing you want is to realize your debug code and output can't be
> > > > > > relied upon. Or worse, that the only Oops you have is the one in the
> > > > > > debug code, because the real one scrolled away - the original idea
> > > > > > behind drm_can_sleep was to make all the modeset code work
> > > > > > automagically both in normal ioctl/kworker context and in the panic
> > > > > > handlers or kgdb callbacks. Wishful thinking at best.
> > > > > >
> > > > > > Also at least for me that extends to everything, e.g. I much prefer
> > > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > > > > locks shared with interrupt handlers, since the former two gives me
> > > > > > clear information from which contexts such function can be called.
> > > > > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > > > > made a real big fool of myself because I didn't realize how much that
> > > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > > > > stuff never fails" is wrong everywhere.
> > > > > >
> > > > > > It's all great for debugging and sanity checks (and we run with all
> > > > > > that stuff enabled in our CI), but really semantic changes depending
> > > > > > upon magic context checks freak my out :-)
> > > > >
> > > > > All fair, but some of us need to write code that must handle being
> > > > > invoked from a wide variety of contexts. Now perhaps you like the idea of
> > > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > > > > are held, and so on. However, from what I can see, most people instead
> > > > > consistently prefer that the RCU API instead be consolidated.
> > > > >
> > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > > > > needs to be able to allocate memory occasionally. It can do that when
> > > > > invoked from some contexts, but not when invoked from others. Right now,
> > > > > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > > > > memory allocators that some of the MM hate or must unnecessarily invoke
> > > > > workqueues. Thomas's patches would allow the code to just allocate in
> > > > > the common case when these primitives are invoked from contexts where
> > > > > allocation is permitted.
> > > > >
> > > > > If we want to restrict access to the can_schedule() or whatever primitive,
> > > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe
> > > > > we can go back to the old brlock approach of requiring certain people's
> > > > > review for each addition to the kernel.
> > > > >
> > > > > But there really are use cases that it would greatly help.
> > > >
> > > > We can deadlock in random fun places if random stuff we're calling
> > > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> > > > make it extra fun to reproduce. Maybe most driver subsystems are less
> > > > brittle, but gpu drivers definitely need to know about the details for
> > > > exactly this example. And yes gpu drivers use rcu for freeing
> > > > dma_fence structures, and that tends to happen in code that we only
> > > > recently figured out should really not allocate memory.
> > > >
> > > > I think minimally you need to throw in an unconditional
> > > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> > > > with full debugging knows what might happen. It's kinda like
> > > > might_sleep, but a lot more specific. might_sleep() alone is not
> > > > enough, because in the specific code paths I'm thinking of (and
> > > > created special lockdep annotations for just recently) sleeping is
> > > > allowed, but any memory allocations with GFP_RECLAIM set are no-go.
> > >
> > > Completely agreed! Any allocation on any free path must be handled
> > > -extremely- carefully. To that end...
> > >
> > > First, there is always a fallback in case the allocation fails. Which
> > > might have performance or corner-case robustness issues, but which will
> > > at least allow forward progress. Second, we consulted with a number of
> > > MM experts to arrive at appropriate GFP_* flags (and their patience is
> > > greatly appreciated). Third, the paths that can allocate will do so about
> > > one time of 500, so any issues should be spotted sooner rather than later.
> > >
> > > So you are quite right to be concerned, but I believe we will be doing the
> > > right things. And based on his previous track record, I am also quite
> > > certain that Mr. Murphy will be on hand to provide me any additional
> > > education that I might require.
> > >
> > > Finally, I have noted down your point about fs_reclaim_acquire() and
> > > fs_reclaim_release(). Whether or not they prove to be needed, I do
> > > appreciate your calling them to my attention.
> >
> > I just realized that since these dma_fence structs are refcounted and
> > userspace can hold references (directly, it can pass them around
> > behind file descriptors) we might never hit such a path until slightly
> > unusual or evil userspace does something interesting. Do you have
> > links to those patches? Some googling didn't turn up anything. I can
> > then figure out whether it's better to risk not spotting issues with
> > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > critical section which force-degrades any allocation to GFP_ATOMIC at
> > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > never fails for small stuff" and has a decidedly less tested fallback
> > path than rcu code.
>
> Here is the previous early draft version, which will change considerably
> for the next version:
>
> lore.kernel.org/lkml/[email protected]
>
> This does kvfree_rcu(), but we expect to handle call_rcu() similarly.
>
> The version in preparation will use workqueues to do the allocation in a
> known-safe environment and also use lockless access to certain portions
> of the allocator caches (as noted earlier, this last is not much loved
> by some of the MM guys). Given Thomas's patch, we could with high
> probability allocate directly, perhaps even not needing memory-allocator
> modifications.
>
> Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the
> allocator to do anything that the calling context prohibits. So what
> types of bugs are you looking for? Where reclaim calls back into the
> driver or some such?

Yeah pretty much. It's a problem for gpu, fs, block drivers and really
anything else that's remotely involved in memory reclaim somehow.
Generally this is all handled explicitly by passing gfp_t flags down
any call chain, but in some cases it's instead solved with the
memalloc_no* functions. E.g. sunrpc uses that to make sure the network
stack (which generally just assumes it can allocate memory) doesn't,
to avoid recursions back into nfs/sunrpc. To my knowledge there's no
way to check at runtime with which gfp flags you're allowed to
allocate memory, a preemptible check is definitely not enough.
Disabled preemption implies only GFP_ATOMIC is allowed (ignoring nmi
and stuff like that), but the inverse is not true.

So if you want the automagic in call_rcu I think either
- we need to replace all explicit gfp flags with the context marking
memalloc_no* across the entire kernel, or at least anywhere rcu might
be used.
- audit all callchains and make sure a call_rcu_noalloc is used
anywhere there might be a problem. probably better to have a
call_rcu_gfp with explicit gfp flags parameter, since generally that
needs to be passed down.

But at least to me the lockless magic in mm sounds a lot safer, since
it contains the complexity and doesn't leak it out to callers of
call_rcu.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-09-17 16:34:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Thu, Sep 17, 2020 at 09:52:30AM +0200, Daniel Vetter wrote:
> On Thu, Sep 17, 2020 at 12:39 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Sep 16, 2020 at 11:43:02PM +0200, Daniel Vetter wrote:
> > > On Wed, Sep 16, 2020 at 10:58 PM Paul E. McKenney <[email protected]> wrote:
> > > >
> > > > On Wed, Sep 16, 2020 at 10:29:06PM +0200, Daniel Vetter wrote:
> > > > > On Wed, Sep 16, 2020 at 5:29 PM Paul E. McKenney <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Sep 16, 2020 at 09:37:17AM +0200, Daniel Vetter wrote:
> > > > > > > On Tue, Sep 15, 2020 at 7:35 PM Linus Torvalds
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Tue, Sep 15, 2020 at 1:39 AM Thomas Gleixner <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > OTOH, having a working 'preemptible()' or maybe better named
> > > > > > > > > 'can_schedule()' check makes tons of sense to make decisions about
> > > > > > > > > allocation modes or other things.
> > > > > > > >
> > > > > > > > No. I think that those kinds of decisions about actual behavior are
> > > > > > > > always simply fundamentally wrong.
> > > > > > > >
> > > > > > > > Note that this is very different from having warnings about invalid
> > > > > > > > use. THAT is correct. It may not warn in all configurations, but that
> > > > > > > > doesn't matter: what matters is that it warns in common enough
> > > > > > > > configurations that developers will catch it.
> > > > > > > >
> > > > > > > > So having a warning in "might_sleep()" that doesn't always trigger,
> > > > > > > > because you have a limited configuration that can't even detect the
> > > > > > > > situation, that's fine and dandy and intentional.
> > > > > > > >
> > > > > > > > But having code like
> > > > > > > >
> > > > > > > > if (can_schedule())
> > > > > > > > .. do something different ..
> > > > > > > >
> > > > > > > > is fundamentally complete and utter garbage.
> > > > > > > >
> > > > > > > > It's one thing if you test for "am I in hardware interrupt context".
> > > > > > > > Those tests aren't great either, but at least they make sense.
> > > > > > > >
> > > > > > > > But a driver - or some library routine - making a difference based on
> > > > > > > > some nebulous "can I schedule" is fundamentally and basically WRONG.
> > > > > > > >
> > > > > > > > If some code changes behavior, it needs to be explicit to the *caller*
> > > > > > > > of that code.
> > > > > > > >
> > > > > > > > So this is why GFP_ATOMIC is fine, but "if (!can_schedule())
> > > > > > > > do_something_atomic()" is pure shite.
> > > > > > > >
> > > > > > > > And I am not IN THE LEAST interested in trying to help people doing
> > > > > > > > pure shite. We need to fix them. Like the crypto code is getting
> > > > > > > > fixed.
> > > > > > >
> > > > > > > Just figured I'll throw my +1 in from reading too many (gpu) drivers.
> > > > > > > Code that tries to cleverly adjust its behaviour depending upon the
> > > > > > > context it's running in is harder to understand and blows up in more
> > > > > > > interesting ways. We still have drm_can_sleep() and it's mostly just
> > > > > > > used for debug code, and I've largely ended up just deleting
> > > > > > > everything that used it because when you're driver is blowing up the
> > > > > > > last thing you want is to realize your debug code and output can't be
> > > > > > > relied upon. Or worse, that the only Oops you have is the one in the
> > > > > > > debug code, because the real one scrolled away - the original idea
> > > > > > > behind drm_can_sleep was to make all the modeset code work
> > > > > > > automagically both in normal ioctl/kworker context and in the panic
> > > > > > > handlers or kgdb callbacks. Wishful thinking at best.
> > > > > > >
> > > > > > > Also at least for me that extends to everything, e.g. I much prefer
> > > > > > > explicit spin_lock and spin_lock_irq vs magic spin_lock_irqsave for
> > > > > > > locks shared with interrupt handlers, since the former two gives me
> > > > > > > clear information from which contexts such function can be called.
> > > > > > > Other end is the memalloc_no*_save/restore functions, where I recently
> > > > > > > made a real big fool of myself because I didn't realize how much that
> > > > > > > impacts everything that's run within - suddenly "GFP_KERNEL for small
> > > > > > > stuff never fails" is wrong everywhere.
> > > > > > >
> > > > > > > It's all great for debugging and sanity checks (and we run with all
> > > > > > > that stuff enabled in our CI), but really semantic changes depending
> > > > > > > upon magic context checks freak my out :-)
> > > > > >
> > > > > > All fair, but some of us need to write code that must handle being
> > > > > > invoked from a wide variety of contexts. Now perhaps you like the idea of
> > > > > > call_rcu() for schedulable contexts, call_rcu_nosched() when preemption
> > > > > > is disabled, call_rcu_irqs_are_disabled() when interrupts are disabled,
> > > > > > call_rcu_raw_atomic() from contexts where (for example) raw spinlocks
> > > > > > are held, and so on. However, from what I can see, most people instead
> > > > > > consistently prefer that the RCU API instead be consolidated.
> > > > > >
> > > > > > Some in-flight cache-efficiency work for kvfree_rcu() and call_rcu()
> > > > > > needs to be able to allocate memory occasionally. It can do that when
> > > > > > invoked from some contexts, but not when invoked from others. Right now,
> > > > > > in !PREEMPT kernels, it cannot tell, and must either do things to the
> > > > > > memory allocators that some of the MM hate or must unnecessarily invoke
> > > > > > workqueues. Thomas's patches would allow the code to just allocate in
> > > > > > the common case when these primitives are invoked from contexts where
> > > > > > allocation is permitted.
> > > > > >
> > > > > > If we want to restrict access to the can_schedule() or whatever primitive,
> > > > > > fine and good. We can add a check to checkpatch.pl, for example. Maybe
> > > > > > we can go back to the old brlock approach of requiring certain people's
> > > > > > review for each addition to the kernel.
> > > > > >
> > > > > > But there really are use cases that it would greatly help.
> > > > >
> > > > > We can deadlock in random fun places if random stuff we're calling
> > > > > suddenly starts allocating. Sometimes. Maybe once in a blue moon, to
> > > > > make it extra fun to reproduce. Maybe most driver subsystems are less
> > > > > brittle, but gpu drivers definitely need to know about the details for
> > > > > exactly this example. And yes gpu drivers use rcu for freeing
> > > > > dma_fence structures, and that tends to happen in code that we only
> > > > > recently figured out should really not allocate memory.
> > > > >
> > > > > I think minimally you need to throw in an unconditional
> > > > > fs_reclaim_acquire();fs_reclaim_release(); so that everyone who runs
> > > > > with full debugging knows what might happen. It's kinda like
> > > > > might_sleep, but a lot more specific. might_sleep() alone is not
> > > > > enough, because in the specific code paths I'm thinking of (and
> > > > > created special lockdep annotations for just recently) sleeping is
> > > > > allowed, but any memory allocations with GFP_RECLAIM set are no-go.
> > > >
> > > > Completely agreed! Any allocation on any free path must be handled
> > > > -extremely- carefully. To that end...
> > > >
> > > > First, there is always a fallback in case the allocation fails. Which
> > > > might have performance or corner-case robustness issues, but which will
> > > > at least allow forward progress. Second, we consulted with a number of
> > > > MM experts to arrive at appropriate GFP_* flags (and their patience is
> > > > greatly appreciated). Third, the paths that can allocate will do so about
> > > > one time of 500, so any issues should be spotted sooner rather than later.
> > > >
> > > > So you are quite right to be concerned, but I believe we will be doing the
> > > > right things. And based on his previous track record, I am also quite
> > > > certain that Mr. Murphy will be on hand to provide me any additional
> > > > education that I might require.
> > > >
> > > > Finally, I have noted down your point about fs_reclaim_acquire() and
> > > > fs_reclaim_release(). Whether or not they prove to be needed, I do
> > > > appreciate your calling them to my attention.
> > >
> > > I just realized that since these dma_fence structs are refcounted and
> > > userspace can hold references (directly, it can pass them around
> > > behind file descriptors) we might never hit such a path until slightly
> > > unusual or evil userspace does something interesting. Do you have
> > > links to those patches? Some googling didn't turn up anything. I can
> > > then figure out whether it's better to risk not spotting issues with
> > > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > > critical section which force-degrades any allocation to GFP_ATOMIC at
> > > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > > never fails for small stuff" and has a decidedly less tested fallback
> > > path than rcu code.
> >
> > Here is the previous early draft version, which will change considerably
> > for the next version:
> >
> > lore.kernel.org/lkml/[email protected]
> >
> > This does kvfree_rcu(), but we expect to handle call_rcu() similarly.
> >
> > The version in preparation will use workqueues to do the allocation in a
> > known-safe environment and also use lockless access to certain portions
> > of the allocator caches (as noted earlier, this last is not much loved
> > by some of the MM guys). Given Thomas's patch, we could with high
> > probability allocate directly, perhaps even not needing memory-allocator
> > modifications.
> >
> > Either way, kvfree_rcu(), and later call_rcu(), will avoid asking the
> > allocator to do anything that the calling context prohibits. So what
> > types of bugs are you looking for? Where reclaim calls back into the
> > driver or some such?
>
> Yeah pretty much. It's a problem for gpu, fs, block drivers and really
> anything else that's remotely involved in memory reclaim somehow.
> Generally this is all handled explicitly by passing gfp_t flags down
> any call chain, but in some cases it's instead solved with the
> memalloc_no* functions. E.g. sunrpc uses that to make sure the network
> stack (which generally just assumes it can allocate memory) doesn't,
> to avoid recursions back into nfs/sunrpc. To my knowledge there's no
> way to check at runtime with which gfp flags you're allowed to
> allocate memory, a preemptible check is definitely not enough.
> Disabled preemption implies only GFP_ATOMIC is allowed (ignoring nmi
> and stuff like that), but the inverse is not true.

Thank you for the confirmation!

> So if you want the automagic in call_rcu I think either
> - we need to replace all explicit gfp flags with the context marking
> memalloc_no* across the entire kernel, or at least anywhere rcu might
> be used.
> - audit all callchains and make sure a call_rcu_noalloc is used
> anywhere there might be a problem. probably better to have a
> call_rcu_gfp with explicit gfp flags parameter, since generally that
> needs to be passed down.
>
> But at least to me the lockless magic in mm sounds a lot safer, since
> it contains the complexity and doesn't leak it out to callers of
> call_rcu.

Agreed, I greatly prefer Peter Zijlstra's lockless-allocation patch
myself.

In the meantime, it looks like we will start by causing the allocation to
happen in a safe environment. That may have issues with delays, but is
at least something that can be done entirely within the confines of RCU.

Thanx, Paul

2020-09-29 08:21:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> I can
> then figure out whether it's better to risk not spotting issues with
> call_rcu vs slapping a memalloc_noio_save/restore around all these
> critical section which force-degrades any allocation to GFP_ATOMIC at

did you mean memalloc_noreclaim_* here?

> most, but has the risk that we run into code that assumes "GFP_KERNEL
> never fails for small stuff" and has a decidedly less tested fallback
> path than rcu code.

Even if the above then please note that memalloc_noreclaim_* or
PF_MEMALLOC should be used with an extreme care. Essentially only for
internal memory reclaimers. It grants access to _all_ the available
memory so any abuse can be detrimental to the overall system operation.
Allocation failure in this mode means that we are out of memory and any
code relying on such an allocation has to carefuly consider failure.
This is not a random allocation mode.

--
Michal Hocko
SUSE Labs

2020-09-29 09:02:52

by Daniel Vetter

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote:
> On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> > I can
> > then figure out whether it's better to risk not spotting issues with
> > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > critical section which force-degrades any allocation to GFP_ATOMIC at
>
> did you mean memalloc_noreclaim_* here?

Yeah I picked the wrong one of that family of functions.

> > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > never fails for small stuff" and has a decidedly less tested fallback
> > path than rcu code.
>
> Even if the above then please note that memalloc_noreclaim_* or
> PF_MEMALLOC should be used with an extreme care. Essentially only for
> internal memory reclaimers. It grants access to _all_ the available
> memory so any abuse can be detrimental to the overall system operation.
> Allocation failure in this mode means that we are out of memory and any
> code relying on such an allocation has to carefuly consider failure.
> This is not a random allocation mode.

Agreed, that's why I don't like having these kind of automagic critical
sections. It's a bit a shotgun approach. Paul said that the code would
handle failures, but the problem is that it applies everywhere.

Anyway my understanding is that call_rcu will be reworked and gain a pile
of tricks so that these problems for the callchains leading to call_rcu
all disappear.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

2020-09-29 14:56:28

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 00/13] preempt: Make preempt count unconditional

On Tue 29-09-20 11:00:03, Daniel Vetter wrote:
> On Tue, Sep 29, 2020 at 10:19:38AM +0200, Michal Hocko wrote:
> > On Wed 16-09-20 23:43:02, Daniel Vetter wrote:
> > > I can
> > > then figure out whether it's better to risk not spotting issues with
> > > call_rcu vs slapping a memalloc_noio_save/restore around all these
> > > critical section which force-degrades any allocation to GFP_ATOMIC at
> >
> > did you mean memalloc_noreclaim_* here?
>
> Yeah I picked the wrong one of that family of functions.
>
> > > most, but has the risk that we run into code that assumes "GFP_KERNEL
> > > never fails for small stuff" and has a decidedly less tested fallback
> > > path than rcu code.
> >
> > Even if the above then please note that memalloc_noreclaim_* or
> > PF_MEMALLOC should be used with an extreme care. Essentially only for
> > internal memory reclaimers. It grants access to _all_ the available
> > memory so any abuse can be detrimental to the overall system operation.
> > Allocation failure in this mode means that we are out of memory and any
> > code relying on such an allocation has to carefuly consider failure.
> > This is not a random allocation mode.
>
> Agreed, that's why I don't like having these kind of automagic critical
> sections. It's a bit a shotgun approach. Paul said that the code would
> handle failures, but the problem is that it applies everywhere.

Ohh, in the ideal world we wouldn't need anything like that. But then
the reality fires:
* PF_MEMALLOC (resp memalloc_noreclaim_* for that matter) is primarily used
to make sure that allocations from inside the memory reclaim - yeah that
happens - will not recurse.
* PF_MEMALLOC_NO{FS,IO} (resp memalloc_no{fs,io}*) are used to mark no
fs/io reclaim recursion critical sections because controling that for
each allocation inside fs transaction (or other sensitive) or IO
contexts turned out to be unmaintainable and people simply fallen into
using NOFS/NOIO unconditionally which is causing reclaim imbalance
problems.
* PF_MEMALLOC_NOCMA (resp memalloc_nocma*) is used for long term pinning
when CMA pages cannot be pinned because that would break the CMA
guarantees. Communicating this to all potential allocations during
pinning is simply unfeasible.

So you are absolutely right that these critical sections with side
effects on all allocations are far from ideal from the API point of view
but they are mostly mirroring a demand for functionality which is
_practically_ impossible to achieve with our current code base. Not that
we couldn't get back to drawing board and come up with a saner thing and
rework the world...
--
Michal Hocko
SUSE Labs