2006-02-21 15:54:47

by Alan Stern

Subject: [PATCH] Register atomic_notifiers in atomic context

Some atomic notifier chains require registrations to take place in atomic
context. An example is the die_notifier, which on some architectures may
be accessed very early during the boot-up procedure, before task-switching
is legal. To accommodate these chains, this patch (as655) replaces the
mutex in the atomic_notifier_head structure with a spinlock.

Signed-off-by: Alan Stern <[email protected]>

---

Andrew:

This ought to fix the problem you were seeing with the new notifier chains
on your em64t system.

Alan Stern
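
To illustrate, here's the sort of caller this is meant to allow
(hypothetical chain and handler names, not part of the patch): a handler
registered on an atomic chain while interrupts are still disabled, which
the old mutex-based locking couldn't support.

#include <linux/notifier.h>

static ATOMIC_NOTIFIER_HEAD(my_early_chain);            /* hypothetical */

static int my_early_event(struct notifier_block *nb,
                unsigned long action, void *data)
{
        /* Must not sleep; may run in atomic context. */
        return NOTIFY_DONE;
}

static struct notifier_block my_early_nb = {
        .notifier_call = my_early_event,
};

/* May be called very early in boot, even with IRQs disabled, because
 * registration now takes a spinlock with spin_lock_irqsave(). */
static void my_early_setup(void)
{
        atomic_notifier_chain_register(&my_early_chain, &my_early_nb);
}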



Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -155,7 +155,6 @@ static int __kprobes notifier_call_chain
* @n: New entry in notifier chain
*
* Adds a notifier to an atomic notifier chain.
- * Must be called in process context.
*
* Currently always returns zero.
*/
@@ -163,11 +162,12 @@ static int __kprobes notifier_call_chain
int atomic_notifier_chain_register(struct atomic_notifier_head *nh,
struct notifier_block *n)
{
+ unsigned long flags;
int ret;

- mutex_lock(&nh->mutex);
+ spin_lock_irqsave(&nh->lock, flags);
ret = notifier_chain_register(&nh->head, n);
- mutex_unlock(&nh->mutex);
+ spin_unlock_irqrestore(&nh->lock, flags);
return ret;
}

@@ -179,18 +179,18 @@ EXPORT_SYMBOL(atomic_notifier_chain_regi
* @n: Entry to remove from notifier chain
*
* Removes a notifier from an atomic notifier chain.
- * Must be called from process context.
*
* Returns zero on success or %-ENOENT on failure.
*/
int atomic_notifier_chain_unregister(struct atomic_notifier_head *nh,
struct notifier_block *n)
{
+ unsigned long flags;
int ret;

- mutex_lock(&nh->mutex);
+ spin_lock_irqsave(&nh->lock, flags);
ret = notifier_chain_unregister(&nh->head, n);
- mutex_unlock(&nh->mutex);
+ spin_unlock_irqrestore(&nh->lock, flags);
synchronize_rcu();
return ret;
}
Index: usb-2.6/include/linux/notifier.h
===================================================================
--- usb-2.6.orig/include/linux/notifier.h
+++ usb-2.6/include/linux/notifier.h
@@ -24,9 +24,9 @@
* registration, or unregistration. All locking and protection
* must be provided by the caller.
*
- * atomic_notifier_chain_register() and blocking_notifier_chain_register()
- * may be called only from process context, and likewise for the
- * corresponding _unregister() routines.
+ * atomic_notifier_chain_register() may be called from an atomic context,
+ * but blocking_notifier_chain_register() must be called from a process
+ * context. Ditto for the corresponding _unregister() routines.
*
* atomic_notifier_chain_unregister() and blocking_notifier_chain_unregister()
* _must not_ be called from within the call chain.
@@ -39,7 +39,7 @@ struct notifier_block {
};

struct atomic_notifier_head {
- struct mutex mutex;
+ spinlock_t lock;
struct notifier_block *head;
};

@@ -53,7 +53,7 @@ struct raw_notifier_head {
};

#define ATOMIC_INIT_NOTIFIER_HEAD(name) do { \
- mutex_init(&(name)->mutex); \
+ spin_lock_init(&(name)->lock); \
(name)->head = NULL; \
} while (0)
#define BLOCKING_INIT_NOTIFIER_HEAD(name) do { \
@@ -66,7 +66,7 @@ struct raw_notifier_head {

#define ATOMIC_NOTIFIER_HEAD(name) \
struct atomic_notifier_head name = { \
- .mutex = __MUTEX_INITIALIZER((name).mutex), \
+ .lock = SPIN_LOCK_UNLOCKED, \
.head = NULL }
#define BLOCKING_NOTIFIER_HEAD(name) \
struct blocking_notifier_head name = { \


2006-02-21 23:30:05

by Andrew Morton

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

Alan Stern <[email protected]> wrote:
>
> Some atomic notifier chains require registrations to take place in atomic
> context. An example is the die_notifier, which on some architectures may
> be accessed very early during the boot-up procedure, before task-switching
> is legal. To accommodate these chains, this patch (as655) replaces the
> mutex in the atomic_notifier_head structure with a spinlock.

I think that's a good change, however x86_64 still crashes.

At great personal expense (ie: using winxp hyperterminal (I now understand
why some of the traces we get are so crappy)) I have a trace. It's still
bugging out in the BUG_ON(!irqs_disabled());


Initializing CPU#0
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0
kernel profiling enabled (shift: 1)
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.160 MHz processor.
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at kernel/posix-cpu-timers.c:1279
invalid opcode: 0000 [1] PREEMPT SMP
last sysfs file:
CPU 0
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.16-rc4-mm1 #147
RIP: 0010:[<ffffffff801401a2>] <ffffffff801401a2>{run_posix_cpu_timers+42}
RSP: 0018:ffffffff805a3dd8 EFLAGS: 00010202
RAX: ffffffff805a3e18 RBX: ffffffff804d6300 RCX: 0000000000000000
RDX: ffffffff80617000 RSI: 0000000000000000 RDI: ffffffff804d6300
RBP: ffffffff805a3e58 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffffffff804d6300 R14: 0000000000000000 R15: ffffffff8062be88
FS: 0000000000000000(0000) GS:ffffffff80617000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000101000 CR4: 00000000000006a0
Process swapper (pid: 0, threadinfo ffffffff8062a000, task ffffffff804d6300)
Stack: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
ffff810007042840 0000000000000000 0000000000000000 0000000000000296
ffffffff805a3e18 ffffffff805a3e18
Call Trace: <IRQ> <ffffffff80134211>{update_process_times+110}
<ffffffff80114f3c>{smp_local_timer_interrupt+40} <ffffffff8010d39e>{main_timer_handler+527}
<ffffffff8010d554>{timer_interrupt+21} <ffffffff8014c44f>{handle_IRQ_event+48}
<ffffffff8014c528>{__do_IRQ+163} <ffffffff8010c073>{do_IRQ+50}
<ffffffff80109e36>{ret_from_intr+0} <EOI> <ffffffff803cb47a>{_spin_unlock_irqrestore+17}
<ffffffff8014c85a>{setup_irq+226} <ffffffff80630b5a>{time_init+698}
<ffffffff8062d6f9>{start_kernel+218} <ffffffff8062d286>{_sinittext+646}
Code: 0f 0b 68 42 34 3f 80 c2 ff 04 49 8b 95 50 02 00 00 48 85 d2
RIP <ffffffff801401a2>{run_posix_cpu_timer
RIP <ffffffff801401a2>{run_posix_cpu_timer
BUG: warning at kernel/panic.c:138/panic()
Call Trace: <IRQ> <ffffffff8012ab9d>{panic+557} <ffffffff803cb484>{_spin_unlock_irqrestore+27}
<ffffffff80231aaa>{__up_read+193} <ffffffff8013827d>{blocking_notifier_call_chain+71}
<ffffffff8012d565>{do_exit+158} <ffffffff803cb484>{_spin_unlock_irqrestore+27}
<ffffffff8010b34f>{do_divide_error+0} <ffffffff803cbff1>{do_trap+235}
<ffffffff8010b596>{do_invalid_op+172} <ffffffff801401a2>{run_posix_cpu_timers+42}
<ffffffff8010a821>{error_exit+0} <ffffffff801401a2>{run_posix_cpu_timers+42}
<ffffffff80134211>{update_process_times+110} <ffffffff80114f3c>{smp_local_timer_interrupt+40}
<ffffffff8010d39e>{main_timer_handler+527} <ffffffff8010d554>{timer_interrupt+21}
<ffffffff8014c44f>{handle_IRQ_event+48} <ffffffff8014c528>{__do_IRQ+163}
<ffffffff8010c073>{do_IRQ+50} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff803cb47a>{_spin_unlock_irqrestore+17} <ffffffff8014c85a>{setup_irq+226}
<ffffffff80630b5a>{time_init+698} <ffffffff8062d6f9>{start_kernel+218}
<ffffffff8062d286>{_sinittext+646}

2006-02-22 16:08:39

by Alan Stern

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

On Tue, 21 Feb 2006, Andrew Morton wrote:

> Alan Stern <[email protected]> wrote:
> >
> > Some atomic notifier chains require registrations to take place in atomic
> > context. An example is the die_notifier, which on some architectures may
> > be accessed very early during the boot-up procedure, before task-switching
> > is legal. To accommodate these chains, this patch (as655) replaces the
> > mutex in the atomic_notifier_head structure with a spinlock.
>
> I think that's a good change, however x86_64 still crashes.
>
> At great personal expense (ie: using winxp hyperterminal (I now understand
> why some of the traces we get are so crappy)) I have a trace. It's still
> bugging out in the BUG_ON(!irqs_disabled());

I hate to keep asking you to test this since you're so busy. If you know
anyone else with an x86_64 who could investigate instead, don't hesitate
to pass this on.

The only reason this patch set could cause interrupts to get enabled (when
they weren't enabled before) would be if some code was using one of the
blocking notifier chains. If one of the down_read() or down_write() calls
blocked, the scheduler would enable interrupts. On the other hand, if
there is only a single task running, how could a down_read() or
down_write() call manage to block?

The diagnostic patch below is a heavy-handed approach, but it ought to
indicate the source of the problem. Doing anything to a blocking notifier
chain at a time when task switching is not legal should be a no-no.
Maybe this means that a chain got misclassified as blocking when it
really should be atomic -- or maybe it means there has always been a more
serious problem that has never been detected before.

Alan Stern

P.S. (off-topic): Would it be possible to make the -mm series visible
through the web interface at <http://www.kernel.org/git/>?



Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -249,7 +249,11 @@ int blocking_notifier_chain_register(str
{
int ret;

- down_write(&nh->rwsem);
+ if (!down_write_trylock(&nh->rwsem)) {
+ printk(KERN_WARNING "%s\n", __FUNCTION__);
+ dump_stack();
+ down_write(&nh->rwsem);
+ }
ret = notifier_chain_register(&nh->head, n);
up_write(&nh->rwsem);
return ret;
@@ -272,7 +276,11 @@ int blocking_notifier_chain_unregister(s
{
int ret;

- down_write(&nh->rwsem);
+ if (!down_write_trylock(&nh->rwsem)) {
+ printk(KERN_WARNING "%s\n", __FUNCTION__);
+ dump_stack();
+ down_write(&nh->rwsem);
+ }
ret = notifier_chain_unregister(&nh->head, n);
up_write(&nh->rwsem);
return ret;
@@ -302,7 +310,11 @@ int blocking_notifier_call_chain(struct
{
int ret;

- down_read(&nh->rwsem);
+ if (!down_read_trylock(&nh->rwsem)) {
+ printk(KERN_WARNING "%s\n", __FUNCTION__);
+ dump_stack();
+ down_read(&nh->rwsem);
+ }
ret = notifier_call_chain(&nh->head, val, v);
up_read(&nh->rwsem);
return ret;

2006-02-22 16:13:00

by Randy Dunlap

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

On Wed, 22 Feb 2006, Alan Stern wrote:

>
> P.S. (off-topic): Would it be possible to make the -mm series visible
> through the web interface at <http://www.kernel.org/git/>?

see
http://www.kernel.org/git/?p=linux/kernel/git/smurf/linux-trees.git;a=summary
and scroll down to tags

--
~Randy

2006-02-23 02:26:37

by Andrew Morton

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

Alan Stern <[email protected]> wrote:
>
> On Tue, 21 Feb 2006, Andrew Morton wrote:
>
> > Alan Stern <[email protected]> wrote:
> > >
> > > Some atomic notifier chains require registrations to take place in atomic
> > > context. An example is the die_notifier, which on some architectures may
> > > be accessed very early during the boot-up procedure, before task-switching
> > > is legal. To accommodate these chains, this patch (as655) replaces the
> > > mutex in the atomic_notifier_head structure with a spinlock.
> >
> > I think that's a good change, however x86_64 still crashes.
> >
> > At great personal expense (ie: using winxp hyperterminal (I now understand
> > why some of the traces we get are so crappy)) I have a trace. It's still
> > bugging out in the BUG_ON(!irqs_disabled());
>
> I hate to keep asking you to test this since you're so busy. If you know
> anyone else with an x86_64 who could investigate instead, don't hesitate
> to pass this on.

dood, I spend probably half my time weeding out bugs people sent me and
another 25% finding bugs we already merged...

> The diagnostic patch below is a heavy-handed approach, but it ought to
> indicate the source of the problem. Doing anything to a blocking notifier
> chain at a time when task switching is not legal should be a no-no.
> Maybe this means that a chain got misclassified as blocking when it
> really should be atomic -- or maybe it means there has always been a more
> serious problem that has never been detected before.

heh, it fixed the bug.

> Index: usb-2.6/kernel/sys.c
> ===================================================================
> --- usb-2.6.orig/kernel/sys.c
> +++ usb-2.6/kernel/sys.c
> @@ -249,7 +249,11 @@ int blocking_notifier_chain_register(str
> {
> int ret;
>
> - down_write(&nh->rwsem);
> + if (!down_write_trylock(&nh->rwsem)) {
> + printk(KERN_WARNING "%s\n", __FUNCTION__);
> + dump_stack();
> + down_write(&nh->rwsem);
> + }

See, we avoid doing down_write() if the lock is uncontended. And x86_64's
uncontended down_write() unconditionally enables interrupts. This evaded
notice because might_sleep() warnings are disabled early in boot due to
various horrid things.

Maybe we should enable the might_sleep() warnings because of this nasty
rwsem trap. We tried to do that a couple of weeks ago but we needed a pile
of nasty workarounds to avoid false positives.


Applying this:

--- 25/kernel/sys.c~notifier-sleep-debug-update 2006-02-22 18:23:26.000000000 -0800
+++ 25-akpm/kernel/sys.c 2006-02-22 18:25:45.000000000 -0800
@@ -249,6 +249,8 @@ int blocking_notifier_chain_register(str
{
int ret;

+ if (irqs_disabled())
+ dump_stack();
if (!down_write_trylock(&nh->rwsem)) {
printk(KERN_WARNING "%s\n", __FUNCTION__);
dump_stack();
@@ -276,6 +278,8 @@ int blocking_notifier_chain_unregister(s
{
int ret;

+ if (irqs_disabled())
+ dump_stack();
if (!down_write_trylock(&nh->rwsem)) {
printk(KERN_WARNING "%s\n", __FUNCTION__);
dump_stack();
@@ -310,6 +314,8 @@ int blocking_notifier_call_chain(struct
{
int ret;

+ if (irqs_disabled())
+ dump_stack();
if (!down_read_trylock(&nh->rwsem)) {
printk(KERN_WARNING "%s\n", __FUNCTION__);
dump_stack();
_

Gives:

Bootdata ok (command line is root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0)
Linux version 2.6.16-rc4-mm1 (akpm@x) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #151 SMP PREEMPT Wed Feb 22 18:25:50 PST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000ebbd0 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ffd0000 (usable)
BIOS-e820: 000000007ffd0000 - 000000007ffdf000 (ACPI data)
BIOS-e820: 000000007ffdf000 - 0000000080000000 (ACPI NVS)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
kernel direct mapping tables up to 180000000 @ 8000-8000
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 20
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec81000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47
ACPI: IOAPIC (id[0x0a] address[0xfec81400] gsi_base[48])
IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to flat
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:60000000)
Checking aperture...
Built 1 zonelists
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0
kernel profiling enabled (shift: 1)
Initializing CPU#0

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639482>{rcu_init+40}
<ffffffff806296f8>{start_kernel+217} <ffffffff80629286>{_sinittext+646}
PID hash table entries: 4096 (order: 12, 131072 bytes)

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639259>{init_timers+40}
<ffffffff80629707>{start_kernel+232} <ffffffff80629286>{_sinittext+646}

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639821>{hrtimers_init+40}
<ffffffff8062970c>{start_kernel+237} <ffffffff80629286>{_sinittext+646}
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.166 MHz processor.
disabling early console
Bootdata ok (command line is root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0)
Linux version 2.6.16-rc4-mm1 (akpm@x) (gcc version 3.4.2 20041017 (Red Hat 3.4.2-6.fc3)) #151 SMP PREEMPT Wed Feb 22 18:25:50 PST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000ebbd0 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007ffd0000 (usable)
BIOS-e820: 000000007ffd0000 - 000000007ffdf000 (ACPI data)
BIOS-e820: 000000007ffdf000 - 0000000080000000 (ACPI NVS)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000ffc00000 - 0000000100000000 (reserved)
BIOS-e820: 0000000100000000 - 0000000180000000 (usable)
ACPI: PM-Timer IO Port: 0x408
ACPI: LAPIC (acpi_id[0x01] lapic_id[0x00] enabled)
Processor #0 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x02] lapic_id[0x06] enabled)
Processor #6 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x03] lapic_id[0x01] enabled)
Processor #1 15:3 APIC version 20
ACPI: LAPIC (acpi_id[0x04] lapic_id[0x07] enabled)
Processor #7 15:3 APIC version 20
ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
IOAPIC[0]: apic_id 8, version 32, address 0xfec00000, GSI 0-23
ACPI: IOAPIC (id[0x09] address[0xfec81000] gsi_base[24])
IOAPIC[1]: apic_id 9, version 32, address 0xfec81000, GSI 24-47
ACPI: IOAPIC (id[0x0a] address[0xfec81400] gsi_base[48])
IOAPIC[2]: apic_id 10, version 32, address 0xfec81400, GSI 48-71
ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
Setting APIC routing to flat
ACPI: HPET id: 0x8086a202 base: 0xfed00000
Using ACPI (MADT) for SMP configuration information
Allocating PCI resources starting at 88000000 (gap: 80000000:60000000)
Checking aperture...
Built 1 zonelists
Kernel command line: root=/dev/sda3 profile=1 ro rhgb earlyprintk=serial,ttyS0,9600 console=ttyS0
kernel profiling enabled (shift: 1)
Initializing CPU#0

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639482>{rcu_init+40}
<ffffffff806296f8>{start_kernel+217} <ffffffff80629286>{_sinittext+646}
PID hash table entries: 4096 (order: 12, 131072 bytes)

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639259>{init_timers+40}
<ffffffff80629707>{start_kernel+232} <ffffffff80629286>{_sinittext+646}

Call Trace: <ffffffff80137eda>{blocking_notifier_chain_register+30}
<ffffffff80143b9b>{register_cpu_notifier+19} <ffffffff80639821>{hrtimers_init+40}
<ffffffff8062970c>{start_kernel+237} <ffffffff80629286>{_sinittext+646}
time.c: Using 14.318180 MHz HPET timer.
time.c: Detected 3400.166 MHz processor.
disabling early console
Console: colour VGA+ 80x25

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8063b983>{vfs_caches_init_early+0} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8063a155>{__alloc_bootmem_core+773} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063aeae>{alloc_large_system_hash+220}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}

Call Trace: <IRQ> <ffffffff80137fd8>{blocking_notifier_call_chain+35}
<ffffffff80107caf>{__exit_idle+46} <ffffffff80107cd3>{exit_idle+34}
<ffffffff8010c055>{do_IRQ+20} <ffffffff80109e36>{ret_from_intr+0} <EOI>
<ffffffff8012b3f0>{release_console_sem+320} <ffffffff8012b793>{vprintk+811}
<ffffffff8012b7bc>{vprintk+852} <ffffffff8012b7bc>{vprintk+852}
<ffffffff8012b88b>{printk+162} <ffffffff8012b0fa>{__call_console_drivers+75}
<ffffffff8063a466>{__alloc_bootmem+56} <ffffffff8063af69>{alloc_large_system_hash+407}
<ffffffff8063b9c1>{vfs_caches_init_early+62} <ffffffff80629740>{start_kernel+289}
<ffffffff80629286>{_sinittext+646}


We're using blocking notifier chains in places where it's really risky, such as
__exit_idle(). Time for a rethink, methinks.

I'd suggest that in further development, you enable might_sleep() in early
boot - that would have caught such things..

2006-02-23 17:15:12

by Alan Stern

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

On Wed, 22 Feb 2006, Andrew Morton wrote:

> See, we avoid doing down_write() if the lock is uncontended. And x86_64's
> uncontended down_write() unconditionally enables interrupts.

Not just x86_64; that's part of the general rw-semaphore implementation in
lib/rwsem-spinlock.c. down_write() and down_read() always enable
interrupts, whether contended or not.
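
From memory, the fast path there looks roughly like this (abridged
sketch, not a verbatim copy of lib/rwsem-spinlock.c):

/* Abridged sketch of the generic __down_write() fast path: the
 * unconditional spin_lock_irq()/spin_unlock_irq() pair is what turns
 * interrupts back on even when the semaphore is free. */
void __down_write(struct rw_semaphore *sem)
{
        spin_lock_irq(&sem->wait_lock);

        if (sem->activity == 0 && list_empty(&sem->wait_list)) {
                /* granted: no active readers or writers */
                sem->activity = -1;
                spin_unlock_irq(&sem->wait_lock);
                return;
        }

        /* ... contended case: queue a waiter, drop the lock, sleep ... */
}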

> This evaded
> notice because might_sleep() warnings are disabled early in boot due to
> various horrid things.
>
> Maybe we should enable the might_sleep() warnings because of this nasty
> rwsem trap. We tried to do that a couple of weeks ago but we needed a pile
> of nasty workarounds to avoid false positives.

The situation is prone to bugs. Special-case situations (like not
allowing task switching during system start-up) are always hard to handle.

> Applying this:
>
> --- 25/kernel/sys.c~notifier-sleep-debug-update 2006-02-22 18:23:26.000000000 -0800
> +++ 25-akpm/kernel/sys.c 2006-02-22 18:25:45.000000000 -0800
> @@ -249,6 +249,8 @@ int blocking_notifier_chain_register(str
> {
> int ret;
>
> + if (irqs_disabled())
> + dump_stack();
> if (!down_write_trylock(&nh->rwsem)) {
> printk(KERN_WARNING "%s\n", __FUNCTION__);
> dump_stack();
> @@ -276,6 +278,8 @@ int blocking_notifier_chain_unregister(s
> {
> int ret;
>
> + if (irqs_disabled())
> + dump_stack();
> if (!down_write_trylock(&nh->rwsem)) {
> printk(KERN_WARNING "%s\n", __FUNCTION__);
> dump_stack();
> @@ -310,6 +314,8 @@ int blocking_notifier_call_chain(struct
> {
> int ret;
>
> + if (irqs_disabled())
> + dump_stack();
> if (!down_read_trylock(&nh->rwsem)) {
> printk(KERN_WARNING "%s\n", __FUNCTION__);
> dump_stack();
> _
>
> Gives:

... a bunch of calls to register_cpu_notifier and __exit_idle ...

> We're using blocking notifier chains in places where it's really risky, such as
> __exit_idle(). Time for a rethink, methinks.

Yes, that was clearly a mistake in the notifier-chain patch. In x86_64,
the idle_notifier chain should be atomic, not blocking. I missed the fact
that it gets invoked during an interrupt handler.

On the other hand, nothing in the vanilla kernel uses that notifier chain,
and there doesn't seem to be any equivalent in other architectures.
Perhaps it should just go away entirely?

The calls to register_cpu_notifier are harder. That chain really does
need to be blocking, which means we can't avoid calling down_write. The
only solution I can think of is to use down_write_trylock in the
blocking_notifier_chain_register and unregister routines, even though
doing that is a crock.

Or else change __down_read and __down_write to use spin_lock_irqsave
instead of spin_lock_irq. What do you think would be best?

> I'd suggest that in further development, you enable might_sleep() in early
> boot - that would have caught such things..

Not a bad idea. I presume that removing the "system_state ==
SYSTEM_RUNNING" test in __might_sleep will have that effect?

Alan Stern

2006-02-23 19:04:34

by Andrew Morton

Subject: Re: [PATCH] Register atomic_notifiers in atomic context

Alan Stern <[email protected]> wrote:
>
> The calls to register_cpu_notifier are harder. That chain really does
> need to be blocking

Why?

> which means we can't avoid calling down_write. The
> only solution I can think of is to use down_write_trylock in the
> blocking_notifier_chain_register and unregister routines, even though
> doing that is a crock.
>
> Or else change __down_read and __down_write to use spin_lock_irqsave
> instead of spin_lock_irq. What do you think would be best?

Nothing's pretty. Perhaps look at system_state and not do any locking at all
in early boot?

> > I'd suggest that in further development, you enable might_sleep() in early
> > boot - that would have caught such things..
>
> Not a bad idea. I presume that removing the "system_state ==
> SYSTEM_RUNNING" test in __might_sleep will have that effect?

Yup.

2006-02-23 22:28:28

by Alan Stern

Subject: [PATCH] The idle notifier chain should be atomic

This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
s390 into atomic notifier chains rather than blocking chains. This is
necessary because they are called during IRQ handling as CPUs leave and
enter the idle state.

Signed-off-by: Alan Stern <[email protected]>

---

This patch should fix one of the problems you saw. A second patch to fix
the other problem is right behind.

Please revert those two debugging changes (the one I sent and the one you
wrote) before applying this.

Alan Stern

Index: usb-2.6/arch/x86_64/kernel/process.c
===================================================================
--- usb-2.6.orig/arch/x86_64/kernel/process.c
+++ usb-2.6/arch/x86_64/kernel/process.c
@@ -66,17 +66,17 @@ EXPORT_SYMBOL(boot_option_idle_override)
void (*pm_idle)(void);
static DEFINE_PER_CPU(unsigned int, cpu_idle_state);

-static BLOCKING_NOTIFIER_HEAD(idle_notifier);
+static ATOMIC_NOTIFIER_HEAD(idle_notifier);

void idle_notifier_register(struct notifier_block *n)
{
- blocking_notifier_chain_register(&idle_notifier, n);
+ atomic_notifier_chain_register(&idle_notifier, n);
}
EXPORT_SYMBOL_GPL(idle_notifier_register);

void idle_notifier_unregister(struct notifier_block *n)
{
- blocking_notifier_chain_unregister(&idle_notifier, n);
+ atomic_notifier_chain_unregister(&idle_notifier, n);
}
EXPORT_SYMBOL(idle_notifier_unregister);

@@ -86,13 +86,13 @@ static DEFINE_PER_CPU(enum idle_state, i
void enter_idle(void)
{
__get_cpu_var(idle_state) = CPU_IDLE;
- blocking_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
+ atomic_notifier_call_chain(&idle_notifier, IDLE_START, NULL);
}

static void __exit_idle(void)
{
__get_cpu_var(idle_state) = CPU_NOT_IDLE;
- blocking_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
+ atomic_notifier_call_chain(&idle_notifier, IDLE_END, NULL);
}

/* Called from interrupts to signify idle end */
Index: usb-2.6/arch/s390/kernel/process.c
===================================================================
--- usb-2.6.orig/arch/s390/kernel/process.c
+++ usb-2.6/arch/s390/kernel/process.c
@@ -76,17 +76,17 @@ unsigned long thread_saved_pc(struct tas
/*
* Need to know about CPUs going idle?
*/
-static BLOCKING_NOTIFIER_HEAD(idle_chain);
+static ATOMIC_NOTIFIER_HEAD(idle_chain);

int register_idle_notifier(struct notifier_block *nb)
{
- return blocking_notifier_chain_register(&idle_chain, nb);
+ return atomic_notifier_chain_register(&idle_chain, nb);
}
EXPORT_SYMBOL(register_idle_notifier);

int unregister_idle_notifier(struct notifier_block *nb)
{
- return blocking_notifier_chain_unregister(&idle_chain, nb);
+ return atomic_notifier_chain_unregister(&idle_chain, nb);
}
EXPORT_SYMBOL(unregister_idle_notifier);

@@ -95,7 +95,7 @@ void do_monitor_call(struct pt_regs *reg
/* disable monitor call class 0 */
__ctl_clear_bit(8, 15);

- blocking_notifier_call_chain(&idle_chain, CPU_NOT_IDLE,
+ atomic_notifier_call_chain(&idle_chain, CPU_NOT_IDLE,
(void *)(long) smp_processor_id());
}

@@ -116,7 +116,7 @@ void default_idle(void)
return;
}

- rc = blocking_notifier_call_chain(&idle_chain,
+ rc = atomic_notifier_call_chain(&idle_chain,
CPU_IDLE, (void *)(long) cpu);
if (rc != NOTIFY_OK && rc != NOTIFY_DONE)
BUG();

2006-02-23 22:36:58

by Alan Stern

Subject: [PATCH] Avoid calling down_read and down_write during startup

This patch (as660) changes the registration and unregistration routines
for blocking notifier chains. During system startup, when task switching
is illegal, the routines will avoid calling down_write().

Signed-off-by: Alan Stern <[email protected]>

---

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Alan Stern <[email protected]> wrote:
> >
> > The calls to register_cpu_notifier are harder. That chain really does
> > need to be blocking
>
> Why?

Because its callouts invoke blocking functions. For example,
drivers/base/topology.c:topology_cpu_callback() calls
sysfs_create_group(). In drivers/cpufreq/cpufreq.c,
cpufreq_cpu_callback() calls cpufreq_add_dev(), which does kzalloc with
GFP_KERNEL.

> > which means we can't avoid calling down_write. The
> > only solution I can think of is to use down_write_trylock in the
> > blocking_notifier_chain_register and unregister routines, even though
> > doing that is a crock.
> >
> > Or else change __down_read and __down_write to use spin_lock_irqsave
> > instead of spin_lock_irq. What do you think would be best?
>
> Nothing's pretty. Perhaps look at system_state and not do any locking at all
> in early boot?

Okay. Here's a patch to do that. It only affects the registration
routines; the actual call-out function still acquires the lock. That's
because (1) I didn't want to add extra overhead to a frequently-used
routine, and (2) if this notifier chain gets called while interrupts are
disabled then there's something badly wrong.

Combined with the previous patch, maybe you'll find that everything works
perfectly now... :-)

Please revert those two debugging patches (the one I sent you and the one
you wrote) before applying this.

Alan Stern

Index: usb-2.6/kernel/sys.c
===================================================================
--- usb-2.6.orig/kernel/sys.c
+++ usb-2.6/kernel/sys.c
@@ -249,6 +249,14 @@ int blocking_notifier_chain_register(str
{
int ret;

+ /*
+ * This code gets used during boot-up, when task switching is
+ * not yet working and interrupts must remain disabled. At
+ * such times we must not call down_write().
+ */
+ if (unlikely(system_state == SYSTEM_BOOTING))
+ return notifier_chain_register(&nh->head, n);
+
down_write(&nh->rwsem);
ret = notifier_chain_register(&nh->head, n);
up_write(&nh->rwsem);
@@ -272,6 +280,14 @@ int blocking_notifier_chain_unregister(s
{
int ret;

+ /*
+ * This code gets used during boot-up, when task switching is
+ * not yet working and interrupts must remain disabled. At
+ * such times we must not call down_write().
+ */
+ if (unlikely(system_state == SYSTEM_BOOTING))
+ return notifier_chain_unregister(&nh->head, n);
+
down_write(&nh->rwsem);
ret = notifier_chain_unregister(&nh->head, n);
up_write(&nh->rwsem);

2006-02-23 22:42:33

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
> This patch (as660) changes the registration and unregistration routines
> for blocking notifier chains. During system startup, when task switching
> is illegal, the routines will avoid calling down_write().

Why is that necessary? The down_write() will immediately succeed as no
other process can possibly be holding the lock when the system is booting,
so the special casing doesn't fix anything.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-23 23:49:38

by Andi Kleen

Subject: Re: [PATCH] The idle notifier chain should be atomic

Alan Stern <[email protected]> writes:

> This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
> s390 into atomic notifier chains rather than blocking chains. This is
> necessary because they are called during IRQ handling as CPUs leave and
> enter the idle state.

Actually they aren't. While the code is called from the interrupt
handler, logically it belongs to the idle thread, not the interrupt handler.
They are only called when the interrupt directly interrupts the idle
thread, so no atomicity needed.

-Andi

P.S.: Please cc maintainers in the future.

2006-02-24 00:17:24

by Andrew Morton

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

Benjamin LaHaise <[email protected]> wrote:
>
> On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
> > This patch (as660) changes the registration and unregistration routines
> > for blocking notifier chains. During system startup, when task switching
> > is illegal, the routines will avoid calling down_write().
>
> Why is that necessary? The down_write() will immediately succeed as no
> other process can possibly be holding the lock when the system is booting,
> so the special casing doesn't fix anything.

down_write() unconditionally (and reasonably) does local_irq_enable() in
the uncontended case. And enabling local interrupts early in boot can
cause crashes.

2006-02-24 03:18:21

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Thu, 23 Feb 2006, Andrew Morton wrote:

> Benjamin LaHaise <[email protected]> wrote:
> >
> > On Thu, Feb 23, 2006 at 05:36:56PM -0500, Alan Stern wrote:
> > > This patch (as660) changes the registration and unregistration routines
> > > for blocking notifier chains. During system startup, when task switching
> > > is illegal, the routines will avoid calling down_write().
> >
> > Why is that necessary? The down_write() will immediately succeed as no
> > other process can possibly be holding the lock when the system is booting,
> > so the special casing doesn't fix anything.
>
> down_write() unconditionally (and reasonably) does local_irq_enable() in
> the uncontended case. And enabling local interrupts early in boot can
> cause crashes.

Ben, earlier you expressed concern about the extra overhead due to
cache-line contention (on SMP) in the down_read() call added to
blocking_notifier_call_chain. I don't remember which notifier chain in
particular you were worried about; something to do with networking.

Does this still bother you? I can see a couple of ways around it.

One is to make that notifier chain atomic instead of blocking. Of course,
this is feasible only if the chain's callout routines never block. If
they do, then perhaps there's no point worrying about cache misses.

Another possibility is to write a variant implementation of rw-semaphores,
one that wouldn't cause a cache miss in the common case of an uncontested
read lock. It could be done. I can write a somewhat inefficient
version easily enough; no doubt someone more experienced in this sort of
thing could do a better job.

What do you think?

Alan Stern

2006-02-24 03:24:55

by Alan Stern

Subject: Re: [PATCH] The idle notifier chain should be atomic

On 24 Feb 2006, Andi Kleen wrote:

> Alan Stern <[email protected]> writes:
>
> > This patch (as658) makes the idle_notifier in x86_64 and idle_chain in
> > s390 into atomic notifier chains rather than blocking chains. This is
> > necessary because they are called during IRQ handling as CPUs leave and
> > enter the idle state.
>
> Actually they aren't. While the code is called from the interrupt
> handler, logically it belongs to the idle thread, not the interrupt handler.
> They are only called when the interrupt directly interrupts the idle
> thread, so no atomicity needed.

In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(),
which runs the idle_notifier call chain. Surely you're not saying that we
can do a down_read() in this pathway?

And actually the chain's type doesn't seem to make much difference, since
at the moment there's nothing in the vanilla kernel that registers for the
idle_notifier chain.

> -Andi
>
> P.S.: Please cc maintainers in the future.

Yes, I should have sent the patch to you too. I apologize.

Alan Stern

2006-02-24 03:27:39

by Andi Kleen

Subject: Re: [PATCH] The idle notifier chain should be atomic

On Friday 24 February 2006 04:24, Alan Stern wrote:

> In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(),
> which runs the idle_notifier call chain. Surely you're not saying that we
> can do a down_read() in this pathway?

No, but not because it's in an interrupt but because sleeping in the idle
task is illegal.

> And actually the chain's type doesn't seem to make much difference, since
> at the moment there's nothing in the vanilla kernel that registers for the
> idle_notifier chain.

Will come eventually.

-Andi

2006-02-24 04:04:07

by Alan Stern

Subject: Re: [PATCH] The idle notifier chain should be atomic

On Fri, 24 Feb 2006, Andi Kleen wrote:

> On Friday 24 February 2006 04:24, Alan Stern wrote:
>
> > In do_IRQ() there's a call to exit_idle(), which calls __exit_idle(),
> > which runs the idle_notifier call chain. Surely you're not saying that we
> > can do a down_read() in this pathway?
>
> No, but not because it's in an interrupt but because sleeping in the idle
> task is illegal.

Well, either reason is sufficient justification for making idle_notifier
an atomic chain.

> > And actually the chain's type doesn't seem to make much difference, since
> > at the moment there's nothing in the vanilla kernel that registers for the
> > idle_notifier chain.
>
> Will come eventually.

Will that be just for x86_64 or for all architectures?

Alan Stern

2006-02-24 14:45:41

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Thu, Feb 23, 2006 at 10:18:18PM -0500, Alan Stern wrote:
> Ben, earlier you expressed concern about the extra overhead due to
> cache-line contention (on SMP) in the down_read() call added to
> blocking_notifier_call_chain. I don't remember which notifier chain in
> particular you were worried about; something to do with networking.
>
> Does this still bother you? I can see a couple of ways around it.

Yes it's a problem. Any read lock is going to act as a memory barrier,
and we need fewer of those in hot paths, not more to slow things down.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-24 14:44:12

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Thu, Feb 23, 2006 at 04:16:31PM -0800, Andrew Morton wrote:
> down_write() unconditionally (and reasonably) does local_irq_enable() in
> the uncontended case. And enabling local interrupts early in boot can
> cause crashes.

Why not do a down_write_trylock() instead first? Then the code doesn't
have the dependency on system_state, which seems horribly fragile.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-24 15:03:45

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Thu, Feb 23, 2006 at 04:16:31PM -0800, Andrew Morton wrote:
> > down_write() unconditionally (and reasonably) does local_irq_enable() in
> > the uncontended case. And enabling local interrupts early in boot can
> > cause crashes.
>
> Why not do a down_write_trylock() instead first? Then the code doesn't
> have the dependency on system_state, which seems horribly fragile.

I suggested this to Andrew. His reply was as follows:

> > which means we can't avoid calling down_write. The
> > only solution I can think of is to use down_write_trylock in the
> > blocking_notifier_chain_register and unregister routines, even though
> > doing that is a crock.
> >
> > Or else change __down_read and __down_write to use spin_lock_irqsave
> > instead of spin_lock_irq. What do you think would be best?
>
> Nothing's pretty. Perhaps look at system_state and not do any locking at all
> in early boot?

I admit that the whole thing is fragile. IMO the safest approach would
be for __down_read and __down_write not to assume that interrupts are
currently enabled. But that would introduce more overhead as well; at
least this way the overhead is confined to the notifier chain registration
routines.
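
Concretely, the __down_write() fast path would become something like this
(rough, untested sketch; __down_read() would change the same way):

void __down_write(struct rw_semaphore *sem)
{
        unsigned long flags;

        /* Preserve the caller's IRQ state instead of assuming IRQs are on. */
        spin_lock_irqsave(&sem->wait_lock, flags);

        if (sem->activity == 0 && list_empty(&sem->wait_list)) {
                /* granted */
                sem->activity = -1;
                spin_unlock_irqrestore(&sem->wait_lock, flags);
                return;
        }

        /* ... contended case: queue a waiter, drop the lock, sleep ... */
}

In the uncontended case -- which is all that matters during early boot --
this would leave the caller's interrupt state untouched.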

Alan Stern

2006-02-24 15:04:25

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Thu, Feb 23, 2006 at 10:18:18PM -0500, Alan Stern wrote:
> > Ben, earlier you expressed concern about the extra overhead due to
> > cache-line contention (on SMP) in the down_read() call added to
> > blocking_notifier_call_chain. I don't remember which notifier chain in
> > particular you were worried about; something to do with networking.
> >
> > Does this still bother you? I can see a couple of ways around it.
>
> Yes it's a problem. Any read lock is going to act as a memory barrier,
> and we need fewer of those in hot paths, not more to slow things down.

What do you think of the two suggestions in my previous message?

Alan Stern

2006-02-24 15:20:14

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, Feb 24, 2006 at 10:04:23AM -0500, Alan Stern wrote:
> What do you think of the two suggestions in my previous message?

Even if the read version of the lock only touches a cacheline local to
the cpu, you'd still have to use the lock prefix to allow for correctness
when a writer comes along. It is not cacheline bouncing that worries me,
it is serialising instructions and memory barriers as those hurt immensely
when the data is in the cache. I've been looking at a lot of profiles on
P4s of late, and every single locked instruction is painful as it means
all of the memory ordering rules come into play. Neither suggestion
addresses that overhead that has been introduced.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-24 16:44:25

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Fri, Feb 24, 2006 at 10:04:23AM -0500, Alan Stern wrote:
> > What do you think of the two suggestions in my previous message?
>
> Even if the read version of the lock only touches a cacheline local to
> the cpu, you'd still have to use the lock prefix to allow for correctness
> when a writer comes along. It is not cacheline bouncing that worries me,
> it is serialising instructions and memory barriers as those hurt immensely
> when the data is in the cache. I've been looking at a lot of profiles on
> P4s of late, and every single locked instruction is painful as it means
> all of the memory ordering rules come into play. Neither suggestion
> addresses that overhead that has been introduced.

In that case you should be worried not about acquiring and releasing the
rwsem at the beginning and end of blocking_notifier_call_chain; you should
be worried about all the RCU serialization in the core
notifier_call_chain routine.

In fact, that could be changed for blocking notifier chains. Since we own
the readlock, we know that the list won't get changed while we're using
it. So the blocking version of the call-chain routine could avoid using
the RCU dereferencing mechanism.
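
Untested sketch of what I mean (the NOTIFY_STOP_MASK handling follows the
existing notifier_call_chain):

int blocking_notifier_call_chain(struct blocking_notifier_head *nh,
                unsigned long val, void *v)
{
        struct notifier_block *nb;
        int ret = NOTIFY_DONE;

        down_read(&nh->rwsem);
        /* The rwsem is held for reading, so the list cannot change
         * underneath us; plain loads are enough, no rcu_dereference(). */
        for (nb = nh->head; nb; nb = nb->next) {
                ret = nb->notifier_call(nb, val, v);
                if ((ret & NOTIFY_STOP_MASK) == NOTIFY_STOP_MASK)
                        break;
        }
        up_read(&nh->rwsem);
        return ret;
}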

The atomic chains are a different matter. The ones that don't run in NMI
context could use an rw-spinlock for protection, allowing them also to
avoid memory barriers while going through the list. The notifier chains
that do run in NMI don't have this luxury. Fortunately I don't think
there are very many of them.

Alan Stern

2006-02-24 16:49:19

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, Feb 24, 2006 at 11:44:23AM -0500, Alan Stern wrote:
> In that case you should be worried not about acquiring and releasing the
> rwsem at the beginning and end of blocking_notifier_call_chain; you should
> be worried about all the RCU serialization in the core
> notifier_call_chain routine.

RCU doesn't synchronize readers.

> The atomic chains are a different matter. The ones that don't run in NMI
> context could use an rw-spinlock for protection, allowing them also to
> avoid memory barriers while going through the list. The notifier chains
> that do run in NMI don't have this luxury. Fortunately I don't think
> there are very many of them.

A read lock is a memory barrier. That's why I'm opposed to using non-rcu
style locking for them.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-24 17:59:20

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> On Fri, Feb 24, 2006 at 11:44:23AM -0500, Alan Stern wrote:
> > In that case you should be worried not about acquiring and releasing the
> > rwsem at the beginning and end of blocking_notifier_call_chain; you should
> > be worried about all the RCU serialization in the core
> > notifier_call_chain routine.
>
> RCU doesn't synchronize readers.

It does on architectures where smp_read_barrier_depends() expands into
something nontrivial. Maybe that doesn't include any of the machines
you're interested in.
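
Roughly, each step of the list walk goes through something like this
(paraphrasing the macro, not quoting it exactly):

/* Paraphrase of rcu_dereference(): the barrier is a no-op on most
 * architectures and a real read barrier only where data-dependent
 * loads can be reordered (i.e. Alpha). */
#define rcu_dereference(p)      ({                              \
                typeof(p) _p = (p);                             \
                smp_read_barrier_depends();                     \
                (_p);                                           \
        })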

> > The atomic chains are a different matter. The ones that don't run in NMI
> > context could use an rw-spinlock for protection, allowing them also to
> > avoid memory barriers while going through the list. The notifier chains
> > that do run in NMI don't have this luxury. Fortunately I don't think
> > there are very many of them.
>
> A read lock is a memory barrier. That's why I'm opposed to using non-rcu
> style locking for them.

But RCU-style locking can't be used in situations where the reader may
block. So it's not possible to use it with blocking notifier chains.

Alan Stern

2006-02-24 18:42:12

by Benjamin LaHaise

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, Feb 24, 2006 at 12:59:18PM -0500, Alan Stern wrote:
> It does on architectures where smp_read_barrier_depends() expands into
> something nontrivial. Maybe that doesn't include any of the machines
> you're interested in.

Which includes all of about 1, I think -- alpha. In other words, it's
free.

> > > The atomic chains are a different matter. The ones that don't run in NMI
> > > context could use an rw-spinlock for protection, allowing them also to
> > > avoid memory barriers while going through the list. The notifier chains
> > > that do run in NMI don't have this luxury. Fortunately I don't think
> > > there are very many of them.
> >
> > A read lock is a memory barrier. That's why I'm opposed to using non-rcu
> > style locking for them.
>
> But RCU-style locking can't be used in situations where the reader may
> block. So it's not possible to use it with blocking notifier chains.

Then we shouldn't have non-atomic notifier chains in performance critical
codepaths. The original implementation's hooks into critical paths held
these characteristics. If that property has been broken, please fix it
instead of adding more locking.

-ben
--
"Ladies and gentlemen, I'm sorry to interrupt, but the police are here
and they've asked us to stop the party." Don't Email: <[email protected]>.

2006-02-24 20:21:43

by Alan Stern

Subject: Re: [PATCH] Avoid calling down_read and down_write during startup

On Fri, 24 Feb 2006, Benjamin LaHaise wrote:

> > > A read lock is a memory barrier. That's why I'm opposed to using non-rcu
> > > style locking for them.
> >
> > But RCU-style locking can't be used in situations where the reader may
> > block. So it's not possible to use it with blocking notifier chains.
>
> Then we shouldn't have non-atomic notifier chains in performance critical
> codepaths. The original implementation's hooks into critical paths held
> these characteristics. If that property has been broken, please fix it
> instead of adding more locking.

Sorry, no can do. You'll have to complain to the people who put blocking
code into critical paths in the first place. I don't know which paths are
critical nor do I know how to change the code to make it non-blocking.

Or you could write some patches yourself instead of asking other people to
do it for you.

Alan Stern