2005-11-23 23:37:56

by Chandra Seetharaman

[permalink] [raw]
Subject: [PATCH 0/7]: Fix for unsafe notifier chain

Andrew,

I posted this set of patch to lkml last Friday as an RFC. Can you
consider these for -mm inclusion.

These patches apply on 2.6.15-rc2.

Thanks,

chandra

Here are the details:
In 2.6.14, notifier chains are unsafe. notifier_call_chain() walks through
the list of a call chain without any protection.

Alan Stern <[email protected]> brought the issue and suggested a fix
in lkml on Oct 24 2005:
http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2

There was a lot of discussion on that thread regarding the issue, and
following were the conclusions about the requirements of the notifier
call mechanism:

- The chain list has to be protected in all the places where the
list is accessed.
- We need a new notifier_head data structure to encompass the head
of the notifier chain and a semaphore that protects the list.
- There should be two types of call chains: one that is called in
a process context and another that is called in atomic/interrupt
context.
- No lock should be acquired in notifier_call_chain() for an
atomic-type chain.
- notifier_chain_register() and notifier_chain_unregister() should
be called only from process context.
- notifier_chain_unregister() should _not_ be called from a
callout routine.

I posted an RFC that meets the above listed requirements last Friday:
- http://marc.theaimsgroup.com/?l=linux-kernel&m=113175279131346&w=2

Paul McKenney provided some suggestions w.r.t RCU usage. This patchset fixes
the issues he raised. Keith Owens posted some changes to the diechain for
various architectures; his changes are included here.

This is posted as an RFC as we want to get a green signal from the owners of
the files that our classification of chains as ATOMIC or BLOCKING is ok.
Please comment.

This patchset has 7 patches:

1 of 7: Changes the definition of the heads. Same as what was posted last
Friday with changes w.r.t Paul's comments.
2 of 7: Changes that affected only the notifier_head definition.
3 of 7: Changes in which we removed some protection (it's no longer needed
as the basic infrastructure itself provides the protection).
4 of 7: changes for diechain for different architectures.
5 of 7: changes removing calls to notifier_unregister in the callout.
6 of 7: changes to dcdbas.c (requires special handling).
7 of 7: changes to make usb_notify to use the notify chain infrastructure
instead of its own.

----------------------------------------

Here are the list of chains and their classification:

BLOCKING:
+++++++++
arch/powerpc/platforms/pseries/reconfig.c: pSeries_reconfig_chain
arch/s390/kernel/process.c: idle_chain
drivers/base/memory.c: memory_chain
drivers/cpufreq/cpufreq.c: cpufreq_policy_notifier_list
drivers/cpufreq/cpufreq.c: cpufreq_transition_notifier_list
drivers/macintosh/adb.c: adb_client_list
drivers/macintosh/via-pmu.c: sleep_notifier_list
drivers/macintosh/via-pmu68k.c: sleep_notifier_list
drivers/macintosh/windfarm_core.c: wf_client_list
drivers/usb/core/notify.c: usb_notifier_list
drivers/video/fbmem.c: fb_notifier_list
kernel/cpu.c: cpu_chain
kernel/module.c: module_notify_list
kernel/profile.c: munmap_notifier
kernel/profile.c: task_exit_notifier
kernel/sys.c: reboot_notifier_list
net/core/dev.c: netdev_chain
net/decnet/dn_dev.c: dnaddr_chain
net/ipv4/devinet.c: inetaddr_chain

ATOMIC:
+++++++
arch/i386/kernel/traps.c: i386die_chain
arch/ia64/kernel/traps.c: ia64die_chain
arch/powerpc/kernel/traps.c: powerpc_die_chain
arch/sparc64/kernel/traps.c: sparc64die_chain
arch/x86_64/kernel/traps.c: die_chain
drivers/char/ipmi/ipmi_si_intf.c: xaction_notifier_list
kernel/panic.c: panic_notifier_list
kernel/profile.c: task_free_notifier
net/bluetooth/hci_core.c: hci_notifier
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_chain
net/ipv4/netfilter/ip_conntrack_core.c: ip_conntrack_expect_chain
net/ipv6/addrconf.c: inet6addr_chain
net/netfilter/nf_conntrack_core.c: nf_conntrack_chain
nen/netfilter/nf_conntrack_core.c: nf_conntrack_expect_chain
net/netlink/af_netlink.c: netlink_chain


--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------



2005-11-27 04:08:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/7]: Fix for unsafe notifier chain

Chandra Seetharaman <[email protected]> wrote:
>
> Andrew,
>
> I posted this set of patch to lkml last Friday as an RFC. Can you
> consider these for -mm inclusion.

This all looks exotically complex.

> ...
>
> Here are the details:
> In 2.6.14, notifier chains are unsafe. notifier_call_chain() walks through
> the list of a call chain without any protection.
>
> Alan Stern <[email protected]> brought the issue and suggested a fix
> in lkml on Oct 24 2005:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
>
> There was a lot of discussion on that thread regarding the issue, and
> following were the conclusions about the requirements of the notifier
> call mechanism:
>
> - The chain list has to be protected in all the places where the
> list is accessed.
> - We need a new notifier_head data structure to encompass the head
> of the notifier chain and a semaphore that protects the list.
> - There should be two types of call chains: one that is called in
> a process context and another that is called in atomic/interrupt
> context.
> - No lock should be acquired in notifier_call_chain() for an
> atomic-type chain.
> - notifier_chain_register() and notifier_chain_unregister() should
> be called only from process context.
> - notifier_chain_unregister() should _not_ be called from a
> callout routine.
>
> I posted an RFC that meets the above listed requirements last Friday:
> - http://marc.theaimsgroup.com/?l=linux-kernel&m=113175279131346&w=2
>
> Paul McKenney provided some suggestions w.r.t RCU usage. This patchset fixes
> the issues he raised. Keith Owens posted some changes to the diechain for
> various architectures; his changes are included here.

- You don't state _why_ a callback cannot call
notifier_chain_unregister(). I assume that's because of the use of
write_lock() locking?

We could do this with a new callback function return code and do it in
the core, or just change the code so it is permitted.

- You don't explain why RCU has been introduced into this subsystem.
Seems overkillish, or was it done as a way to solve the correctness
problems?

- Generally, please don't put so much stuff into the [patch 0/N] email.
We never put empty patches into git so some sucker has to move all the
[0/N] content into [1/N]. Better that it's you than me.

- Overall take on the patches: the problem here is that notifier chains
try to provide their own locking. Each time we design a container which
does that, we screw it up and we regret it.

Please consider removing all locking from the notifer chains and moving
it into the callers.

A migration path would be:

- Introduce a new notifier API which is wholly unlocked

- Migrate all callers over to that

- Remove the current implementation

Note that with this scheme, callbacks which wish to call the unregister
function can do that, as long as they are not using read_lock() or
down_read() during the chain traversal.

2005-11-27 13:47:39

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

> - Introduce a new notifier API which is wholly unlocked

The old notifiers were already wholly unlocked. So it wouldn't
even need any changes. Just additional locks everywhere.

I agree it's the cleaner implementation.

-Andi

2005-11-27 15:59:18

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, 27 Nov 2005 14:47:36 +0100,
Andi Kleen <[email protected]> wrote:
>akpm wrote
>> - Introduce a new notifier API which is wholly unlocked
>
>The old notifiers were already wholly unlocked. So it wouldn't
>even need any changes. Just additional locks everywhere.

Wrong. The existing implementation is racy as hell. There is NO
locking on the existing chains, these patches make the notifier chains
race free.

Some of the notifier callbacks are used in weird contexts, including
NMI, so the only option for those chains is RCU. Obviously those
callbacks cannot sleep. Other chains are used in more normal context
_AND_ the callbacks want to sleep, so those chains need to use sleeping
locks.

2005-11-27 17:27:36

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Mon, Nov 28, 2005 at 02:59:05AM +1100, Keith Owens wrote:
> On Sun, 27 Nov 2005 14:47:36 +0100,
> Andi Kleen <[email protected]> wrote:
> >akpm wrote
> >> - Introduce a new notifier API which is wholly unlocked
> >
> >The old notifiers were already wholly unlocked. So it wouldn't
> >even need any changes. Just additional locks everywhere.
>
> Wrong.

Did you actually read what I wrote?

-Andi

2005-11-27 17:39:55

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, 27 Nov 2005 18:27:25 +0100,
Andi Kleen <[email protected]> wrote:
>On Mon, Nov 28, 2005 at 02:59:05AM +1100, Keith Owens wrote:
>> On Sun, 27 Nov 2005 14:47:36 +0100,
>> Andi Kleen <[email protected]> wrote:
>> >akpm wrote
>> >> - Introduce a new notifier API which is wholly unlocked
>> >
>> >The old notifiers were already wholly unlocked. So it wouldn't
>> >even need any changes. Just additional locks everywhere.
>>
>> Wrong.
>
>Did you actually read what I wrote?

Of course I did. The whole point is that _ALL_ of the existing
notifier chain callback code is racy[*]. Saying that the code can be
left without any changes is simply ignoring the existing races. They
_ALL_ need to be fixed.

[*] Except for one bit of hand crafted locking in USB.

2005-11-27 19:57:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

Keith Owens <[email protected]> wrote:
>
> On Sun, 27 Nov 2005 18:27:25 +0100,
> Andi Kleen <[email protected]> wrote:
> >On Mon, Nov 28, 2005 at 02:59:05AM +1100, Keith Owens wrote:
> >> On Sun, 27 Nov 2005 14:47:36 +0100,
> >> Andi Kleen <[email protected]> wrote:
> >> >akpm wrote
> >> >> - Introduce a new notifier API which is wholly unlocked
> >> >
> >> >The old notifiers were already wholly unlocked.

The register and unregister functions take a write_lock on notifier_lock.
notifier_call_chain() runs unlocked.

> So it wouldn't
> >> >even need any changes. Just additional locks everywhere.
> >>
> >> Wrong.
> >
> >Did you actually read what I wrote?
>
> Of course I did.

No you didn't!

> The whole point is that _ALL_ of the existing
> notifier chain callback code is racy[*].

yup.

> Saying that the code can be
> left without any changes is simply ignoring the existing races. They
> _ALL_ need to be fixed.

We're saying that kernel/sys.c:notifier_lock should be removed and that all
callers of notifier_chain_register(), notifier_chain_unregister() and
notifier_call_chain() should be changed to define and use their own lock.

So the _callers_ get to decide whether they're going to use down(),
spin_lock(), down_read(), read_lock(), preempt_disable(), local_irq_disable()
or whatever.

Furthermore we should alter notifier_call_chain() so that a callback may
safely perform notifier_chain_unregister() - that's sane and easy enough.

2005-11-27 22:03:56

by Greg KH

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, Nov 27, 2005 at 11:56:40AM -0800, Andrew Morton wrote:
> We're saying that kernel/sys.c:notifier_lock should be removed and that all
> callers of notifier_chain_register(), notifier_chain_unregister() and
> notifier_call_chain() should be changed to define and use their own lock.
>
> So the _callers_ get to decide whether they're going to use down(),
> spin_lock(), down_read(), read_lock(), preempt_disable(), local_irq_disable()
> or whatever.

I completely agree. I just watched in mute horror as Chandra and Alan
spun off into the rcu blackhole trying to create one-size-fits-all
notifiers.

Making the user do the locking it needs to do is simple and sane.

And the reason USB's notifiers are implemented correctly, is they don't
use the notifier core, but rather, reimplemented their own, due to the
locking mess.

thanks,

greg k-h

2005-11-28 01:20:26

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, 27 Nov 2005 11:56:40 -0800,
Andrew Morton <[email protected]> wrote:
>We're saying that kernel/sys.c:notifier_lock should be removed and that all
>callers of notifier_chain_register(), notifier_chain_unregister() and
>notifier_call_chain() should be changed to define and use their own lock.

We are arguing past each other's assumptions, so I am laying out my
assumptions in the hope that it will clear the air.

* Traversal of the notifier callback chain can race against
registration and unregistration on the same chain, some form of
protection is required to remove this race. The only reason that it
has not bitten us so far is that registration and unregistration of
notifier callbacks are rare events.

* There are two major classes of notifier callbacks. One class calls
routines that need to sleep, the other class is called in a context
where it cannot sleep. There is no one size that fits all types of
notifier callbacks, nor do the suggested patches claim that one size
fits all.

* If the callbacks can sleep then clearly the lock type for that class
must be a sleeping lock. This class is not a problem.

* The callbacks that are called from non-sleeping contexts vary in
their requirements, but the hardest problem is callbacks from
non-maskable interrupts. By definition, no amount of locking can
protect against NMI, so the list traversal for this class must be
lock free. RCU is only one example of lock free code, any lock free
algorithm would do the job.

* Lock free list traversal has to be different from normal list
traversal. IOW it is not enough to push the responsibility for
locking onto the caller of notifier_chain_{un,}register, we need a
different version of notifier_call_chain() as well, therefore
kernel/sys.c has to know what type of chain it is traversing.

* Once all the work has been done for the hard case of NMI, it makes
sense to reuse that lock free code for all the other chains which are
traversed in non-sleeping contexts. IOW, adding list traversal code
for different variants of spinlock, preempt disable etc. is just
adding more code for no benefit. This applies whether the locking
(if any) is done in the caller or in kernel/sys.c, in either case it
just duplicates code.

* If you accept that the lock free list traversal should not be
duplicated across several callers (and several architectures) then
the same argument applies to the class of callbacks that can sleep.
Why duplicate that code or distribute the locks when a single lock in
one place will do? I have not seen any performance data that says
that the lock for the sleeping notifier chains is any sort of
bottleneck, which would be the only justfication for splitting the
sleeping lock.

* You could argue (and I might even agree) that folding the two classes
of callback into a single set of registration routines is confusing
and there should be separate routines for
notifier_chain_blocking_register, notifier_chain_blocking_unregister,
notifier_call_blocking_chain,
notifier_chain_atomic_register, notifier_chain_atomic_unregister,
notifier_call_atomic_chain.
However that is an implementation detail. Adding the type flag to
the notifier chain head minimizes the changes to the existing code.
I could live with separate routines for blocking and atomic notifier
chains, even though it requires more work to patch it.

2005-11-28 02:43:06

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, Nov 27, 2005 at 02:03:29PM -0800, Greg KH wrote:
> On Sun, Nov 27, 2005 at 11:56:40AM -0800, Andrew Morton wrote:
> > We're saying that kernel/sys.c:notifier_lock should be removed and that all
> > callers of notifier_chain_register(), notifier_chain_unregister() and
> > notifier_call_chain() should be changed to define and use their own lock.
> >
> > So the _callers_ get to decide whether they're going to use down(),
> > spin_lock(), down_read(), read_lock(), preempt_disable(), local_irq_disable()
> > or whatever.
>
> I completely agree. I just watched in mute horror as Chandra and Alan
> spun off into the rcu blackhole trying to create one-size-fits-all
> notifiers.

RCU as blackhole? I am impressed -- even -I- don't find RCU to have the
same limiting-case attraction that a black hole does. ;-)

(My apologies, Greg, but you did leave yourself open for this!)

> Making the user do the locking it needs to do is simple and sane.
>
> And the reason USB's notifiers are implemented correctly, is they don't
> use the notifier core, but rather, reimplemented their own, due to the
> locking mess.

The locking mess is due to the fact that the notifier chain itself
must be protected by something or another, right? Here are the options
I have come across:

1. The notifier chain is guarded by a separate notifier-chain lock.
We then have the following deadlock situation:

o The subsystem requesting the notifier likely has
to hold one of its own locks when registering and
unregistering, which means that the notifier lock
must be subordinate to the subsystem lock.

o But when invoking the notifiers, the notifier lock
must be held while walking the chain. Since the
subsystem being notified likely has to acquire one
of its own locks, the subsystem lock must be subordinate
to the notifier lock.

There are a number of ways of breaking this deadlock, some of
which are listed below. Note that this deadlock situation can
arise both for spinlocks and for sleeplocks.

I believe that this deadlock situation is the core reason why
notifier locking is so difficult to get right.

Even ignoring the deadlock, this does not work for NMIs.

2. Each element of the notifier chain is guarded by reference
counts, and the chain itself is guarded by a lock. Each
element of the chain holds a reference to the next element
in the chain, and new elements are always added to the end
of the chain. The traversal code looks something like the
following:

spin_lock(&chain_lock);
p = head;
atomic_inc(&p->refcnt);
while (p != NULL) {
spin_unlock(&chain_lock);
p->func(p->arg);
spin_lock(&chain_lock);
q = p->next;
atomic_inc(&q->refcnt);
if (atomic_dec_and_test(&p->refcnt)) {
kfree(p);
}
p = q;
}
spin_unlock(&chain_lock);

Note that all actual deletion is done under chain_lock.

This works (I think...), and the subsystem code can use a
single lock that the notifier lock (chain_lock in this case)
is subordinate to. But the notifier traversal loop is quite
heavyweight. And it cannot be invoked from NMI handlers without
risk of self-deadlock.

3. Like #1, but require that the subsystem have at least two locks,
one of which is subordinate to the notifier lock (and is acquired
from the notifier callback function), and the other of which is
superior to the notifier lock, and is held when registering and
deregistering callbacks. This can work, but could significantly
complicate the subsystem locking, and also will require that some
operations acquire two locks instead of just one, since one must
hold both to exclude both notifications and register/unregister
operations.

This approach also fails to handle notifications from NMIs.

4. "Just say no" to a separate notifier-chain lock, and guard
the hole thing with whatever mechanism the subsystem
deems appropriate. This can be made to work with NMI-based
notification, since such subsystems can use RCU or whatever
other lock-free mechanism they desire.

I believe that Greg took this "just say no" approach in USB,
but could easily be missing something.

This works with minimal added complexity to the subsystem,
but requires that each subsystem have its own notifier
chain, since it does not make sense to have a single chain
guarded by multiple locks. But it means that notifier
code must be replicated in a number of subsystems, which
seems sub-optimal.

This might be the best we can do, but in the interest of
completeness...

5. Use RCU to traverse the notifier chain and use simple locking
to guard updates, similar to Alan's and Chandra's patch.
This avoids the deadlock, and handles NMIs nicely, but does
not tolerate synchronous notification callbacks that sleep.
(Cases that can tolerate asynchronous notification can make use
of workqueues or similar mechanisms to defer the sleeping code
so that it is not executed in the scope of the rcu_read_lock()
protecting the notifier-chain traversal.)

6. Have separate mechanisms for the non-sleeping and the synchronous
sleeping cases. For example, #5 for non-sleeping and either #2,
#3, or #4 for the synchronous sleeping case.

This works in all cases, and achieves a high degree of code
commonality, but does require two separate APIs.

7. Use wait-free synchronization everywhere. This has some issues
with complexity, last I checked.

8. Come up with a safe and sane RCU implementation that tolerates
general blocking in read-side critical sections. Note that
although some realtime implementations of RCU do tolerate
blocking, they do so only in the special case that priority
inheritance applies to. Might happen,
but I would not recommend holding your breath.

Any options I missed?

Thanx, Paul

2005-11-28 04:58:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

"Paul E. McKenney" <[email protected]> wrote:
>
> Any options I missed?

Stop using the notifier chains from NMI context - it's too hard. Use a
fixed-size array in the NMI code instead.

2005-11-28 04:59:31

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
> "Paul E. McKenney" <[email protected]> wrote:
> >
> > Any options I missed?
>
> Stop using the notifier chains from NMI context - it's too hard. Use a
> fixed-size array in the NMI code instead.

Or just don't unregister. That is what I did for the debug notifiers.

-Andi

2005-11-28 05:05:18

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Mon, Nov 28, 2005 at 05:59:22AM +0100, Andi Kleen wrote:
> On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
> > "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > Any options I missed?
> >
> > Stop using the notifier chains from NMI context - it's too hard. Use a
> > fixed-size array in the NMI code instead.
>
> Or just don't unregister. That is what I did for the debug notifiers.

So the thought is to replicate the notifier code in all non-NMI subsystems
that require it?

Thanx, Paul

2005-11-28 05:15:42

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, Nov 27, 2005 at 09:05:22PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 28, 2005 at 05:59:22AM +0100, Andi Kleen wrote:
> > On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
> > > "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > Any options I missed?
> > >
> > > Stop using the notifier chains from NMI context - it's too hard. Use a
> > > fixed-size array in the NMI code instead.
> >
> > Or just don't unregister. That is what I did for the debug notifiers.
>
> So the thought is to replicate the notifier code in all non-NMI subsystems
> that require it?

The non NMI subsystems just need to get appropiate locking
(or preferable just use RCU if the notifiers can't sleep)

-Andi

2005-11-28 08:32:04

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Mon, 28 Nov 2005 05:59:22 +0100,
Andi Kleen <[email protected]> wrote:
>On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
>> "Paul E. McKenney" <[email protected]> wrote:
>> >
>> > Any options I missed?
>>
>> Stop using the notifier chains from NMI context - it's too hard. Use a
>> fixed-size array in the NMI code instead.
>
>Or just don't unregister. That is what I did for the debug notifiers.

Unregister is not the only problem. Chain traversal races with
register as well.

2005-11-28 12:07:20

by Andi Kleen

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Mon, Nov 28, 2005 at 07:31:36PM +1100, Keith Owens wrote:
> On Mon, 28 Nov 2005 05:59:22 +0100,
> Andi Kleen <[email protected]> wrote:
> >On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
> >> "Paul E. McKenney" <[email protected]> wrote:
> >> >
> >> > Any options I missed?
> >>
> >> Stop using the notifier chains from NMI context - it's too hard. Use a
> >> fixed-size array in the NMI code instead.
> >
> >Or just don't unregister. That is what I did for the debug notifiers.
>
> Unregister is not the only problem. Chain traversal races with
> register as well.

Either it follows the old next or the new next. Both are valid.
The only problem is that there isn't a write barrier between

n->next = *list;
*list=n;

in notifier_chain_register, which might hit on non i386 architectures.

-Andi

2005-11-28 18:58:15

by Chandra Seetharaman

[permalink] [raw]
Subject: Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sat, 2005-11-26 at 20:07 -0800, Andrew Morton wrote:
> Chandra Seetharaman <[email protected]> wrote:
> >
> > Andrew,
> >
> > I posted this set of patch to lkml last Friday as an RFC. Can you
> > consider these for -mm inclusion.
>
> This all looks exotically complex.
>
> > ...
> >
> > Here are the details:
> > In 2.6.14, notifier chains are unsafe. notifier_call_chain() walks through
> > the list of a call chain without any protection.
> >
> > Alan Stern <[email protected]> brought the issue and suggested a fix
> > in lkml on Oct 24 2005:
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=113018709002036&w=2
> >
> > There was a lot of discussion on that thread regarding the issue, and
> > following were the conclusions about the requirements of the notifier
> > call mechanism:
> >
> > - The chain list has to be protected in all the places where the
> > list is accessed.
> > - We need a new notifier_head data structure to encompass the head
> > of the notifier chain and a semaphore that protects the list.
> > - There should be two types of call chains: one that is called in
> > a process context and another that is called in atomic/interrupt
> > context.
> > - No lock should be acquired in notifier_call_chain() for an
> > atomic-type chain.
> > - notifier_chain_register() and notifier_chain_unregister() should
> > be called only from process context.
> > - notifier_chain_unregister() should _not_ be called from a
> > callout routine.
> >
> > I posted an RFC that meets the above listed requirements last Friday:
> > - http://marc.theaimsgroup.com/?l=linux-kernel&m=113175279131346&w=2
> >
> > Paul McKenney provided some suggestions w.r.t RCU usage. This patchset fixes
> > the issues he raised. Keith Owens posted some changes to the diechain for
> > various architectures; his changes are included here.
>
> - You don't state _why_ a callback cannot call
> notifier_chain_unregister(). I assume that's because of the use of
> write_lock() locking?

Yes. Also, for both type of notifiers we use semaphore to protect the
list, so we cannot call unregister from atomic context.

>
> We could do this with a new callback function return code and do it in
> the core, or just change the code so it is permitted.

That is a good idea, we can do that.
>
> - You don't explain why RCU has been introduced into this subsystem.
> Seems overkillish, or was it done as a way to solve the correctness
> problems?

After discussions in lkml (and looking at usages of notifier chains in
code), the consensus was that we needed two types of notifier chains,
atomic and blocking.

During the discussions we concluded that blocking notifiers can be
easily protected with a semaphore. Locks is not good enough for atomic
notifiers as they are called from NMI context too. So, we needed a
lockless way to protect the list traversal, hence RCU.

>
> - Generally, please don't put so much stuff into the [patch 0/N] email.
> We never put empty patches into git so some sucker has to move all the
> [0/N] content into [1/N]. Better that it's you than me.

Will do.
>
> - Overall take on the patches: the problem here is that notifier chains
> try to provide their own locking. Each time we design a container which
> does that, we screw it up and we regret it.
>

The problem is that the current notifier chain mechanism gives the
notion of protection (lock acquired in register/unregister), but does
not protect list traversal.
> Please consider removing all locking from the notifer chains and moving
> it into the callers.
>
> A migration path would be:
>
> - Introduce a new notifier API which is wholly unlocked
>
> - Migrate all callers over to that
>
> - Remove the current implementation

We can just remove the write_lock from register/unregister of current
implementation and it will do what you are suggesting.

Also, with the current implementation, subsystems can assume that there
is no locks, but none of them do. They work around the locking/race
issues different ways like (1) not unregistering at all (most of them)
or having (2) their own _complete_ notify chain logic (usb).

(1) is a workaround and (2) is redundant code.

>
> Note that with this scheme, callbacks which wish to call the unregister
> function can do that, as long as they are not using read_lock() or
> down_read() during the chain traversal.
>
--

----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------


2005-11-28 19:55:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Mon, Nov 28, 2005 at 01:07:11PM +0100, Andi Kleen wrote:
> On Mon, Nov 28, 2005 at 07:31:36PM +1100, Keith Owens wrote:
> > On Mon, 28 Nov 2005 05:59:22 +0100,
> > Andi Kleen <[email protected]> wrote:
> > >On Sun, Nov 27, 2005 at 08:57:45PM -0800, Andrew Morton wrote:
> > >> "Paul E. McKenney" <[email protected]> wrote:
> > >> >
> > >> > Any options I missed?
> > >>
> > >> Stop using the notifier chains from NMI context - it's too hard. Use a
> > >> fixed-size array in the NMI code instead.
> > >
> > >Or just don't unregister. That is what I did for the debug notifiers.
> >
> > Unregister is not the only problem. Chain traversal races with
> > register as well.
>
> Either it follows the old next or the new next. Both are valid.
> The only problem is that there isn't a write barrier between
>
> n->next = *list;
> *list=n;
>
> in notifier_chain_register, which might hit on non i386 architectures.

Coding as follows:

n->next = *list;
rcu_assign_pointer(*list, n);

will provide memory barriers as needed, even if you are never removing
elements.

Thanx, Paul

2005-12-06 15:37:29

by Alan

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Llu, 2005-11-28 at 19:31 +1100, Keith Owens wrote:
> >Or just don't unregister. That is what I did for the debug notifiers.
>
> Unregister is not the only problem. Chain traversal races with
> register as well.

There are some NMI handler registration functions and attempts at safe
code for it in the unmerged experimental part of the bluesmoke
(bluesmoke.sf.net) project that may be useful perhaps ?

2005-12-06 23:39:07

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Sun, 04 Dec 2005 16:19:57 +0000,
Alan Cox <[email protected]> wrote:
>On Llu, 2005-11-28 at 19:31 +1100, Keith Owens wrote:
>> >Or just don't unregister. That is what I did for the debug notifiers.
>>
>> Unregister is not the only problem. Chain traversal races with
>> register as well.
>
>There are some NMI handler registration functions and attempts at safe
>code for it in the unmerged experimental part of the bluesmoke
>(bluesmoke.sf.net) project that may be useful perhaps ?

Thanks Alan, the bluesmoke NMI handlers look very similar to the code
that I have just written. However bluesmoke only handles a single
notifier chain, it has only one walking_handler_list array. The kernel
is getting to the stage where it needs multiple notifier chains that
can be traversed without locks. The patch below against 2.6.15-rc5
gives us lockfree traversal of notifier chains and supports multiple
chains.

The thing that I like about this approach is that the rest of the
kernel is barely affected. We only have to change the function calls
(adding suffix '_lockfree' and removing any locks in the callers) for
code that needs lockfree traversal. Other notifier chains are left
alone and there is no need to embed the type of chain in struct
notifier_block. Even the change to add '_lockfree' can be incremental,
converting chains as required.

Note: This patch has been compiled but not tested yet. Included for
review and discussion while I debug it.

include/linux/notifier.h | 3
kernel/sys.c | 166 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 169 insertions(+)

Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h 2005-10-28 10:02:08.000000000 +1000
+++ linux/include/linux/notifier.h 2005-12-06 14:28:29.413350932 +1100
@@ -24,6 +24,9 @@ struct notifier_block
extern int notifier_chain_register(struct notifier_block **list, struct notifier_block *n);
extern int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n);
extern int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v);
+extern int notifier_chain_register_lockfree(struct notifier_block **list, struct notifier_block *n);
+extern int notifier_chain_unregister_lockfree(struct notifier_block **nl, struct notifier_block *n);
+extern int notifier_call_chain_lockfree(struct notifier_block **n, unsigned long val, void *v);

#define NOTIFY_DONE 0x0000 /* Don't care */
#define NOTIFY_OK 0x0001 /* Suits me */
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c 2005-12-06 12:23:38.864816047 +1100
+++ linux/kernel/sys.c 2005-12-06 18:29:50.899123936 +1100
@@ -187,6 +187,172 @@ int notifier_call_chain(struct notifier_

EXPORT_SYMBOL(notifier_call_chain);

+/* The notifier_chain_*_lockfree functions below are based on the formal
+ * notifier_chain_* functions above, but allow the notifier chain to be
+ * traversed in situations where locks cannot be taken to protect the list,
+ * typically in the various notifier_die() handlers. The 'lockfree' suffix
+ * only refers to the list traversal; register and unregister still take locks
+ * to protect against concurrent list update. Register and unregister can only
+ * be called from contexts that can sleep.
+ *
+ * Without a lock on list traversal, list entries cannot be physically deleted,
+ * instead they are logically deleted. The normal notifier_chain_* functions
+ * above use the supplied struct notifier_block in place. This introduces
+ * another race because the notifier_block is in caller storage that the caller
+ * can delete (e.g. module unload). So copy the notifier_block into local
+ * storage which is never deleted, although it can be reused.
+ *
+ * Critical property of these lists - once notifier_block_copy.nb.next is set,
+ * its value never changes. Not even if the block is unregistered or reused.
+ */
+
+struct notifier_block_copy {
+ struct notifier_block nb; /* must be at the start of the copy */
+ atomic_t deleted;
+ atomic_t calls; /* outstanding calls to the function */
+ struct notifier_block *orig;
+};
+
+/**
+ * notifier_chain_register_lockfree - Add notifier to a lockfree traversal
+ * notifier chain
+ * @list: Pointer to root list pointer
+ * @n: New entry in notifier chain
+ *
+ * Adds a notifier to a lockfree traversal notifier chain, reusing deleted
+ * entries where possible.
+ *
+ * Returns zero on success, or %-ENOMEM on failure.
+ */
+
+int notifier_chain_register_lockfree(struct notifier_block **list,
+ struct notifier_block *n)
+{
+ struct notifier_block_copy *nbc, *c, *prev = NULL;
+ nbc = kmalloc(sizeof(*nbc), GFP_KERNEL);
+ if (!nbc)
+ return -ENOMEM;
+ nbc->nb = *n;
+ atomic_set(&nbc->deleted, 0);
+ atomic_set(&nbc->calls, 0);
+ nbc->orig = n;
+ write_lock(&notifier_lock);
+ while (*list) {
+ c = (struct notifier_block_copy *)*list;
+ if (nbc->nb.priority > c->nb.priority
+ && atomic_read(&c->deleted) == 0)
+ break;
+ prev = c;
+ list = &(c->nb.next);
+ }
+ if (prev && atomic_read(&prev->deleted) != 0) {
+ nbc->nb.next = prev->nb.next;
+ prev->nb = nbc->nb;
+ prev->orig = n;
+ smp_wmb();
+ /* clearing the deleted flag must be done last */
+ atomic_set(&prev->deleted, 0);
+ kfree(nbc);
+ } else {
+ nbc->nb.next = *list;
+ smp_wmb();
+ *list = &nbc->nb;
+ }
+ write_unlock(&notifier_lock);
+ return 0;
+}
+
+EXPORT_SYMBOL(notifier_chain_register_lockfree);
+
+/**
+ * notifier_chain_unregister_lockfree - Remove notifier from a lockfree
+ * traversal notifier chain
+ * @list: Pointer to root list pointer
+ * @n: New entry in notifier chain
+ *
+ * Removes a notifier from a lockfree traversal notifier chain.
+ * Unregistered entries are logically deleted, not physically deleted.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ *
+ * If the notifier exists in the chain and is in use, spin until there are
+ * no outstanding callbacks on that function. Only then is it safe to
+ * return to the caller, the caller may be about to free the storage that
+ * contains the function.
+ */
+
+int notifier_chain_unregister_lockfree(struct notifier_block **list,
+ struct notifier_block *n)
+{
+ struct notifier_block_copy *c;
+ write_lock(&notifier_lock);
+ while (*list) {
+ c = (struct notifier_block_copy *)*list;
+ if (c->orig == n && atomic_read(&c->deleted) == 0) {
+ atomic_set(&c->deleted, 1);
+ smp_wmb();
+ while (atomic_read(&c->calls))
+ cpu_relax();
+ write_unlock(&notifier_lock);
+ return 0;
+ }
+ list = &(c->nb.next);
+ }
+ write_unlock(&notifier_lock);
+ return -ENOENT;
+}
+
+EXPORT_SYMBOL(notifier_chain_unregister_lockfree);
+
+/**
+ * notifier_call_chain_lockfree - Call functions in a lockfree traversal
+ * notifier chain
+ * @list: Pointer to root pointer of notifier chain
+ * @val: Value passed unmodified to notifier function
+ * @v: Pointer passed unmodified to notifier function
+ *
+ * Calls each function in a lockfree traversal notifier chain in turn.
+ *
+ * If the return value of the notifier can be and'd with
+ * %NOTIFY_STOP_MASK, then notifier_call_chain will return immediately,
+ * with the return value of the notifier function which halted execution.
+ * Otherwise, the return value is the return value of the last notifier
+ * function called.
+ *
+ * The callback is passed the address of the original notifier block
+ * rather than the copy. Just in case some code decides to embed the
+ * notifier block in a larger structure then use offsetof() to get at the
+ * enclosing structure.
+ *
+ * The list traversal checks calls before deleted, unregister does the
+ * checks in the reverse order. That closes the race between traversal
+ * using an entry and unregister deleting it and ensures that unregister
+ * will not return as long as the callback is still being used.
+ */
+
+int notifier_call_chain_lockfree(struct notifier_block **list,
+ unsigned long val, void *v)
+{
+ int ret = NOTIFY_DONE;
+ struct notifier_block_copy *c;
+ c = (struct notifier_block_copy *)*list;
+ while (c) {
+ smp_read_barrier_depends();
+ atomic_inc(&c->calls);
+ smp_mb__after_atomic_inc();
+ if (atomic_read(&c->deleted) == 0)
+ ret = c->nb.notifier_call(c->orig, val, v);
+ atomic_dec(&c->calls);
+ smp_mb__after_atomic_dec();
+ if (ret & NOTIFY_STOP_MASK)
+ break;
+ c = (struct notifier_block_copy *)c->nb.next;
+ }
+ return ret;
+}
+
+EXPORT_SYMBOL(notifier_call_chain_lockfree);
+
/**
* register_reboot_notifier - Register function to be called at reboot time
* @nb: Info about notifier function to be called

2005-12-07 02:46:49

by Keith Owens

[permalink] [raw]
Subject: Re: [Lse-tech] Re: [PATCH 0/7]: Fix for unsafe notifier chain

On Wed, 07 Dec 2005 10:38:44 +1100,
Keith Owens <[email protected]> wrote:
>On Sun, 04 Dec 2005 16:19:57 +0000,
>Alan Cox <[email protected]> wrote:
>>On Llu, 2005-11-28 at 19:31 +1100, Keith Owens wrote:
>>> >Or just don't unregister. That is what I did for the debug notifiers.
>>>
>>> Unregister is not the only problem. Chain traversal races with
>>> register as well.
>>
>>There are some NMI handler registration functions and attempts at safe
>>code for it in the unmerged experimental part of the bluesmoke
>>(bluesmoke.sf.net) project that may be useful perhaps ?
>
>Thanks Alan, the bluesmoke NMI handlers look very similar to the code
>that I have just written. However bluesmoke only handles a single
>notifier chain, it has only one walking_handler_list array. The kernel
>is getting to the stage where it needs multiple notifier chains that
>can be traversed without locks. The patch below against 2.6.15-rc5
>gives us lockfree traversal of notifier chains and supports multiple
>chains.

My previous patch was way too complicated, this is much simpler. Based
on Corey Minyard's patch of http://lkml.org/lkml/2004/8/19/140,
generalized to support multiple lockfree notifier chains, with a few
extra synchronization calls added.

Again, for review only. Compiled but not tested yet.

include/linux/notifier.h | 7 +++
kernel/sys.c | 105 +++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 112 insertions(+)

Index: linux/include/linux/notifier.h
===================================================================
--- linux.orig/include/linux/notifier.h 2005-12-07 12:00:57.569908107 +1100
+++ linux/include/linux/notifier.h 2005-12-07 13:07:18.643526153 +1100
@@ -24,6 +24,13 @@ struct notifier_block
extern int notifier_chain_register(struct notifier_block **list, struct notifier_block *n);
extern int notifier_chain_unregister(struct notifier_block **nl, struct notifier_block *n);
extern int notifier_call_chain(struct notifier_block **n, unsigned long val, void *v);
+static inline int
+notifier_chain_register_lockfree(struct notifier_block **list, struct notifier_block *n)
+{
+ return notifier_chain_register(list, n);
+}
+extern int notifier_chain_unregister_lockfree(struct notifier_block **nl, struct notifier_block *n);
+extern int notifier_call_chain_lockfree(struct notifier_block **n, unsigned long val, void *v);

#define NOTIFY_DONE 0x0000 /* Don't care */
#define NOTIFY_OK 0x0001 /* Suits me */
Index: linux/kernel/sys.c
===================================================================
--- linux.orig/kernel/sys.c 2005-12-07 12:00:57.570884536 +1100
+++ linux/kernel/sys.c 2005-12-07 13:37:53.773534284 +1100
@@ -116,6 +116,7 @@ int notifier_chain_register(struct notif
list= &((*list)->next);
}
n->next = *list;
+ smp_wmb();
*list=n;
write_unlock(&notifier_lock);
return 0;
@@ -187,6 +188,110 @@ int notifier_call_chain(struct notifier_

EXPORT_SYMBOL(notifier_call_chain);

+/* The notifier_chain_*_lockfree functions below are based on the formal
+ * notifier_chain_* functions above, but allow the notifier chain to be
+ * traversed in situations where locks cannot be taken to protect the list,
+ * typically in the various notifier_die() handlers. The 'lockfree' suffix
+ * only refers to the list traversal; register and unregister still take locks
+ * to protect against concurrent list update. Register and unregister can only
+ * be called from contexts that can sleep.
+ */
+
+/* Array notifier_chain_lockfree_inuse is shared between all lockfree notifier
+ * chains. Unregistration of any chain entry must be delayed if any cpu is
+ * executing a lockfree callback, even if that callback is on a different
+ * chain. No big deal, unregister is a rare event.
+ *
+ * Each element is only updated from one cpu so the elements do not need to be
+ * atomic. This avoids problems on architectures that use a hash of spinlocks
+ * to implement atomic variables.
+ *
+ * This array could be replaced by a per cpu variable, but cpu hotplug may want
+ * to use these functions. Before converting to a per cpu variable, review the
+ * cpu hotplug code, paying particular attention to where cpu_online() is set
+ * or cleared and where cpu hotplug runs any notify chains. For the same
+ * reason, these functions do not check cpu_online() at the moment.
+ */
+
+static int notifier_chain_lockfree_inuse[NR_CPUS];
+
+/**
+ * notifier_chain_unregister_lockfree - Remove notifier from a lockfree
+ * traversal notifier chain
+ * @list: Pointer to root list pointer
+ * @n: New entry in notifier chain
+ *
+ * Removes a notifier from a lockfree traversal notifier chain.
+ *
+ * Returns zero on success, or %-ENOENT on failure.
+ */
+
+int notifier_chain_unregister_lockfree(struct notifier_block **list,
+ struct notifier_block *n)
+{
+ int i;
+ write_lock(&notifier_lock);
+ while (*list) {
+ if (*list == n) {
+ *list = n->next;
+ smp_wmb();
+ for (i = 0; i < NR_CPUS; ++i) {
+ while (unlikely(notifier_chain_lockfree_inuse[i])) {
+ barrier();
+ cpu_relax();
+ }
+ }
+ n->next = NULL;
+ write_unlock(&notifier_lock);
+ return 0;
+ }
+ list = &((*list)->next);
+ }
+ write_unlock(&notifier_lock);
+ return -ENOENT;
+}
+
+EXPORT_SYMBOL(notifier_chain_unregister_lockfree);
+
+/**
+ * notifier_call_chain_lockfree - Call functions in a lockfree traversal
+ * notifier chain
+ * @list: Pointer to root pointer of notifier chain
+ * @val: Value passed unmodified to notifier function
+ * @v: Pointer passed unmodified to notifier function
+ *
+ * Calls each function in a lockfree traversal notifier chain in turn.
+ *
+ * If the return value of the notifier can be and'd with
+ * %NOTIFY_STOP_MASK, then notifier_call_chain will return immediately,
+ * with the return value of the notifier function which halted execution.
+ * Otherwise, the return value is the return value of the last notifier
+ * function called.
+ */
+
+int notifier_call_chain_lockfree(struct notifier_block **list,
+ unsigned long val, void *v)
+{
+ int ret = NOTIFY_DONE, cpu = smp_processor_id(), nested;
+ struct notifier_block *nb;
+ nested = notifier_chain_lockfree_inuse[cpu];
+ notifier_chain_lockfree_inuse[cpu] = 1;
+ wmb();
+ nb = *list;
+ while (nb) {
+ smp_read_barrier_depends();
+ ret = nb->notifier_call(nb, val, v);
+ if (ret & NOTIFY_STOP_MASK)
+ break;
+ nb = nb->next;
+ }
+ barrier();
+ notifier_chain_lockfree_inuse[cpu] = nested;
+ return ret;
+}
+
+EXPORT_SYMBOL(notifier_call_chain_lockfree);
+
/**
* register_reboot_notifier - Register function to be called at reboot time
* @nb: Info about notifier function to be called