2005-03-19 19:17:20

by Ingo Molnar

Subject: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00


i have released the -V0.7.41-00 Real-Time Preemption patch (merged to
2.6.12-rc1), which can be downloaded from the usual place:

http://redhat.com/~mingo/realtime-preempt/

the biggest change in this patch is the merge of Paul E. McKenney's
preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
is still quite experimental at this stage, it allowed the removal of
locking cruft (mainly in the networking code), so it could solve some of
the longstanding netfilter/networking deadlocks/crashes reported by a
number of people. Be careful nevertheless.

there are a couple of minor changes relative to Paul's latest
preemptable-RCU code drop:

- made the two variants two #ifdef blocks - this is sufficient for now
and we'll see what the best way is in the longer run.

- moved rcu_check_callbacks() from the timer IRQ to ksoftirqd. (the
timer IRQ still runs in hardirq context on PREEMPT_RT.)

- changed the irq-flags method to a preempt_disable()-based method, and
moved the lock taking outside of the critical sections. (due to locks
potentially sleeping on PREEMPT_RT).
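
(for illustration, a rough sketch of that last change, with a
hypothetical lock name - not the actual patch code:)

static void example_rcu_update(void)
{
        /* before: the irq-flags method */
        unsigned long flags;

        local_irq_save(flags);
        /* ... touch per-CPU RCU state ... */
        local_irq_restore(flags);

        /* after: the preempt_disable()-based method; the lock is
         * taken outside the critical section, because on PREEMPT_RT
         * spinlocks are mutexes and may sleep: */
        spin_lock(&some_rcu_lock);      /* hypothetical lock */
        preempt_disable();
        /* ... touch per-CPU RCU state ... */
        preempt_enable();
        spin_unlock(&some_rcu_lock);
}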

to create a -V0.7.41-00 tree from scratch, the patching order is:

http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.11.tar.bz2
http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.12-rc1.bz2
http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.12-rc1-V0.7.41-00

Ingo


2005-03-20 00:25:24

by Lee Revell

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
> i have released the -V0.7.41-00 Real-Time Preemption patch (merged to
> 2.6.12-rc1), which can be downloaded from the usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>

3ms latency in the NFS client code. Workload was a kernel compile over
NFS.

preemption latency trace v1.1.4 on 2.6.12-rc1-RT-V0.7.41-00
--------------------------------------------------------------------
latency: 3178 µs, #4095/14224, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1 #P:1)
-----------------
| task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
-----------------

_------=> CPU#
/ _-----=> irqs-off
| / _----=> need-resched
|| / _---=> hardirq/softirq
||| / _--=> preempt-depth
|||| /
||||| delay
cmd pid ||||| time | caller
\ / ||||| \ | /
(T1/#0) <...> 32105 0 3 00000004 00000000 [0011939614227867] 0.000ms (+4137027.445ms): <6500646c> (<61000000>)
(T1/#2) <...> 32105 0 3 00000004 00000002 [0011939614228097] 0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c013150a> (try_to_wake_up+0x94/0x140 <c0110474>)
(T1/#3) <...> 32105 0 3 00000003 00000003 [0011939614228436] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0x94/0x140 <c0110474>)
(T3/#4) <...>-32105 0dn.3 0µs : try_to_wake_up+0x11e/0x140 <c01104fe> <<...>-2> (69 76):
(T1/#5) <...> 32105 0 3 00000002 00000005 [0011939614228942] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0xf8/0x140 <c01104d8>)
(T1/#6) <...> 32105 0 3 00000002 00000006 [0011939614229130] 0.000ms (+0.000ms): wake_up_process+0x35/0x40 <c0110555> (do_softirq+0x3f/0x50 <c011b05f>)
(T6/#7) <...>-32105 0dn.1 1µs < (1)
(T1/#8) <...> 32105 0 2 00000001 00000008 [0011939614229782] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#9) <...> 32105 0 2 00000001 00000009 [0011939614229985] 0.001ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#10) <...> 32105 0 2 00000001 0000000a [0011939614230480] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#11) <...> 32105 0 2 00000001 0000000b [0011939614230634] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#12) <...> 32105 0 2 00000001 0000000c [0011939614230889] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#13) <...> 32105 0 2 00000001 0000000d [0011939614231034] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
(T1/#14) <...> 32105 0 2 00000001 0000000e [0011939614231302] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
(T1/#15) <...> 32105 0 2 00000001 0000000f [0011939614231419] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)

(last two lines just repeat)

This is probably not a regression; I had never tested this with NFS
before.

Lee

2005-03-20 01:33:19

by Lee Revell

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
> the biggest change in this patch is the merge of Paul E. McKenney's
> preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
> is still quite experimental at this stage, it allowed the removal of
> locking cruft (mainly in the networking code), so it could solve some of
> the longstanding netfilter/networking deadlocks/crashes reported by a
> number of people. Be careful nevertheless.

With PREEMPT_RT my machine deadlocked within 20 minutes of boot.
"apt-get dist-upgrade" seemed to trigger the crash. I did not see any
Oops unfortunately.

Lee



2005-03-20 01:50:25

by K.R. Foley

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

Lee Revell wrote:
> On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
>
>>the biggest change in this patch is the merge of Paul E. McKenney's
>>preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
>>is still quite experimental at this stage, it allowed the removal of
>>locking cruft (mainly in the networking code), so it could solve some of
>>the longstanding netfilter/networking deadlocks/crashes reported by a
>>number of people. Be careful nevertheless.
>
>
> With PREEMPT_RT my machine deadlocked within 20 minutes of boot.
> "apt-get dist-upgrade" seemed to trigger the crash. I did not see any
> Oops unfortunately.
>
> Lee
>

Lee,

Just curious. Is this with UP or SMP? I currently have my UP box running
PREEMPT_RT, with no problems thus far. However, my SMP box dies while
booting (with an oops). I am working on trying to get set up to capture
the oops, although it might be tomorrow before I get that done.

--
kr

2005-03-20 04:33:31

by Lee Revell

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

On Sat, 2005-03-19 at 19:50 -0600, K.R. Foley wrote:
> Lee Revell wrote:
> > On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
> >
> >>the biggest change in this patch is the merge of Paul E. McKenney's
> >>preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
> >>is still quite experimental at this stage, it allowed the removal of
> >>locking cruft (mainly in the networking code), so it could solve some of
> >>the longstanding netfilter/networking deadlocks/crashes reported by a
> >>number of people. Be careful nevertheless.
> >
> >
> > With PREEMPT_RT my machine deadlocked within 20 minutes of boot.
> > "apt-get dist-upgrade" seemed to trigger the crash. I did not see any
> > Oops unfortunately.
> >
> > Lee
> >
>
> Lee,
>
> Just curious. Is this with UP or SMP? I currently have my UP box running
> PREEMPT_RT, with no problems thus far. However, my SMP box dies while
> booting (with an oops). I am working on trying to get set up to capture
> the oops, although it might be tomorrow before I get that done.
>

UP. It's 100% reproducible; this machine locks up over and over. It seems
to be associated with network activity by multiple processes.

Lee

2005-03-20 22:27:05

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

On Sat, Mar 19, 2005 at 08:16:58PM +0100, Ingo Molnar wrote:
>
> i have released the -V0.7.41-00 Real-Time Preemption patch (merged to
> 2.6.12-rc1), which can be downloaded from the usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> the biggest change in this patch is the merge of Paul E. McKenney's
> preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
> is still quite experimental at this stage, it allowed the removal of
> locking cruft (mainly in the networking code), so it could solve some of
> the longstanding netfilter/networking deadlocks/crashes reported by a
> number of people. Be careful nevertheless.
>
> there are a couple of minor changes relative to Paul's latest
> preemptable-RCU code drop:
>
> - made the two variants two #ifdef blocks - this is sufficient for now
> and we'll see what the best way is in the longer run.
>
> - moved rcu_check_callbacks() from the timer IRQ to ksoftirqd. (the
> timer IRQ still runs in hardirq context on PREEMPT_RT.)
>
> - changed the irq-flags method to a preempt_disable()-based method, and
> moved the lock taking outside of the critical sections. (due to locks
> potentially sleeping on PREEMPT_RT).
>
> to create a -V0.7.41-00 tree from scratch, the patching order is:
>
> http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.11.tar.bz2
> http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.12-rc1.bz2
> http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.12-rc1-V0.7.41-00

Some proposed fixes from a quick scan (untested, probably does not even
compile). These proposed fixes fall into the following categories:

o Some functions that should be static.

o Introduced a synchronize_kernel_barrier() for a number of
uses of synchronize_kernel() that are broken by the new
implementation. Note that synchronize_kernel_barrier() is
the same as synchronize_kernel() in non-CONFIG_PREEMPT_RT
kernels. Not clear that synchronize_kernel_barrier()
is strong enough for some uses, may need another API
(synchronize_kernel_barrier_voluntary()???) that waits for all
tasks to -voluntarily- context switch or be executing in user
space (these are marked with FIXME in the attached patch).
Dipankar and/or Rusty put out a patch that did this some time
back -- this was when we were trying to make an RCU that worked
in CONFIG_PREEMPT kernels, but did not want preempt_disable()
on the read side.

That said, some of the synchronize_kernel_barrier()s marked
with FIXME may be fixable more simply by inserting
rcu_read_lock()/rcu_read_unlock() pairs in appropriate
places (see the sketch after this list).

o Merged the two identical implementations of each of
rcu_dereference() and rcu_assign_pointer().

o Added an rcu_read_lock() or two. Clearly need to be searching
for patches containing "synchronize_kernel" in addition to
patches containing "rcu"...
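
For illustration, a minimal sketch of that simpler reader-side fix
(the handler pointer and its type here are hypothetical, standing in
for e.g. the adapter_handler pointer in the s390 patch below):

typedef void (*my_handler_t)(void);     /* hypothetical */
static my_handler_t my_handler;         /* published via rcu_assign_pointer() */

static void invoke_my_handler(void)
{
        my_handler_t h;

        rcu_read_lock();                        /* read-side critical section */
        h = rcu_dereference(my_handler);        /* fetch RCU-protected pointer */
        if (h)
                h();
        rcu_read_unlock();              /* lets the updater's grace period end */
}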

Thoughts?

Thanx, Paul

Signed-off-by: <[email protected]>

diff -urpN -X dontdiff linux-2.6.11/arch/i386/oprofile/nmi_timer_int.c linux-2.6.11.fixes/arch/i386/oprofile/nmi_timer_int.c
--- linux-2.6.11/arch/i386/oprofile/nmi_timer_int.c Tue Mar 1 23:37:52 2005
+++ linux-2.6.11.fixes/arch/i386/oprofile/nmi_timer_int.c Sun Mar 20 08:40:31 2005
@@ -36,7 +36,7 @@ static void timer_stop(void)
{
enable_timer_nmi_watchdog();
unset_nmi_callback();
- synchronize_kernel();
+ synchronize_kernel_barrier();
}


diff -urpN -X dontdiff linux-2.6.11/arch/ppc64/kernel/ItLpQueue.c linux-2.6.11.fixes/arch/ppc64/kernel/ItLpQueue.c
--- linux-2.6.11/arch/ppc64/kernel/ItLpQueue.c Tue Mar 1 23:37:48 2005
+++ linux-2.6.11.fixes/arch/ppc64/kernel/ItLpQueue.c Sun Mar 20 08:48:29 2005
@@ -142,7 +142,9 @@ unsigned ItLpQueue_process( struct ItLpQ
lpQueue->xLpIntCountByType[nextLpEvent->xType]++;
if ( nextLpEvent->xType < HvLpEvent_Type_NumTypes &&
-     lpEventHandler[nextLpEvent->xType] )
+     lpEventHandler[nextLpEvent->xType] ) {
+     rcu_read_lock();
lpEventHandler[nextLpEvent->xType](nextLpEvent, regs);
+     rcu_read_unlock();
- else
+ } else
printk(KERN_INFO "Unexpected Lp Event type=%d\n", nextLpEvent->xType );

diff -urpN -X dontdiff linux-2.6.11/arch/x86_64/kernel/mce.c linux-2.6.11.fixes/arch/x86_64/kernel/mce.c
--- linux-2.6.11/arch/x86_64/kernel/mce.c Tue Mar 1 23:37:52 2005
+++ linux-2.6.11.fixes/arch/x86_64/kernel/mce.c Sun Mar 20 08:49:45 2005
@@ -392,7 +392,7 @@ static ssize_t mce_read(struct file *fil
memset(mcelog.entry, 0, next * sizeof(struct mce));
mcelog.next = 0;

- synchronize_kernel();
+ synchronize_kernel_barrier();

/* Collect entries that were still getting written before the synchronize. */

diff -urpN -X dontdiff linux-2.6.11/drivers/acpi/processor_idle.c linux-2.6.11.fixes/drivers/acpi/processor_idle.c
--- linux-2.6.11/drivers/acpi/processor_idle.c Tue Mar 1 23:38:25 2005
+++ linux-2.6.11.fixes/drivers/acpi/processor_idle.c Sun Mar 20 09:01:44 2005
@@ -838,7 +838,7 @@ int acpi_processor_cst_has_changed (stru

/* Fall back to the default idle loop */
pm_idle = pm_idle_save;
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: strong enough? */

pr->flags.power = 0;
result = acpi_processor_get_power_info(pr);
diff -urpN -X dontdiff linux-2.6.11/drivers/char/ipmi/ipmi_si_intf.c linux-2.6.11.fixes/drivers/char/ipmi/ipmi_si_intf.c
--- linux-2.6.11/drivers/char/ipmi/ipmi_si_intf.c Sat Mar 19 14:04:13 2005
+++ linux-2.6.11.fixes/drivers/char/ipmi/ipmi_si_intf.c Sun Mar 20 08:39:49 2005
@@ -2194,7 +2194,7 @@ static int init_one_smi(int intf_num, st
/* Wait until we know that we are out of any interrupt
handlers might have been running before we freed the
interrupt. */
- synchronize_kernel();
+ synchronize_kernel_barrier();

if (new_smi->si_sm) {
if (new_smi->handlers)
@@ -2307,7 +2307,7 @@ static void __exit cleanup_one_si(struct
/* Wait until we know that we are out of any interrupt
handlers might have been running before we freed the
interrupt. */
- synchronize_kernel();
+ synchronize_kernel_barrier();

/* Wait for the timer to stop. This avoids problems with race
conditions removing the timer here. */
diff -urpN -X dontdiff linux-2.6.11/drivers/input/keyboard/atkbd.c linux-2.6.11.fixes/drivers/input/keyboard/atkbd.c
--- linux-2.6.11/drivers/input/keyboard/atkbd.c Sat Mar 19 14:04:16 2005
+++ linux-2.6.11.fixes/drivers/input/keyboard/atkbd.c Sun Mar 20 09:02:33 2005
@@ -678,7 +678,7 @@ static void atkbd_disconnect(struct seri
atkbd_disable(atkbd);

/* make sure we don't have a command in flight */
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */
flush_scheduled_work();

device_remove_file(&serio->dev, &atkbd_attr_extra);
diff -urpN -X dontdiff linux-2.6.11/drivers/input/serio/i8042.c linux-2.6.11.fixes/drivers/input/serio/i8042.c
--- linux-2.6.11/drivers/input/serio/i8042.c Sat Mar 19 14:04:16 2005
+++ linux-2.6.11.fixes/drivers/input/serio/i8042.c Sun Mar 20 09:27:35 2005
@@ -396,7 +396,7 @@ static void i8042_stop(struct serio *ser
struct i8042_port *port = serio->port_data;

port->exists = 0;
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */
port->serio = NULL;
}

diff -urpN -X dontdiff linux-2.6.11/drivers/net/r8169.c linux-2.6.11.fixes/drivers/net/r8169.c
--- linux-2.6.11/drivers/net/r8169.c Sat Mar 19 14:04:19 2005
+++ linux-2.6.11.fixes/drivers/net/r8169.c Sun Mar 20 09:09:06 2005
@@ -2385,7 +2385,7 @@ core_down:
}

/* Give a racing hard_start_xmit a few cycles to complete. */
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */

/*
* And now for the 50k$ question: are IRQ disabled or not ?
diff -urpN -X dontdiff linux-2.6.11/drivers/s390/cio/airq.c linux-2.6.11.fixes/drivers/s390/cio/airq.c
--- linux-2.6.11/drivers/s390/cio/airq.c Tue Mar 1 23:38:17 2005
+++ linux-2.6.11.fixes/drivers/s390/cio/airq.c Sun Mar 20 09:11:57 2005
@@ -45,7 +45,7 @@ s390_register_adapter_interrupt (adapter
else
ret = (cmpxchg(&adapter_handler, NULL, handler) ? -EBUSY : 0);
if (!ret)
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */

sprintf (dbf_txt, "ret:%d", ret);
CIO_TRACE_EVENT (4, dbf_txt);
@@ -65,7 +65,7 @@ s390_unregister_adapter_interrupt (adapt
ret = -EINVAL;
else {
adapter_handler = NULL;
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */
ret = 0;
}
sprintf (dbf_txt, "ret:%d", ret);
diff -urpN -X dontdiff linux-2.6.11/include/linux/rcupdate.h linux-2.6.11.fixes/include/linux/rcupdate.h
--- linux-2.6.11/include/linux/rcupdate.h Sat Mar 19 14:09:52 2005
+++ linux-2.6.11.fixes/include/linux/rcupdate.h Sun Mar 20 09:24:20 2005
@@ -222,6 +222,8 @@ static inline int rcu_pending(int cpu)
*/
#define rcu_read_unlock_bh() local_bh_enable()

+#endif /* CONFIG_PREEMPT_RT */
+
/**
* rcu_dereference - fetch an RCU-protected pointer in an
* RCU read-side critical section. This pointer may later
@@ -256,6 +258,22 @@ static inline int rcu_pending(int cpu)
(p) = (v); \
})

+#ifndef CONFIG_PREEMPT_RT
+
+/**
+ * synchronize_kernel_barrier - block until each CPU executes a
+ * context switch, appears in the idle loop, or otherwise exits
+ * kernel execution. This is synonymous with synchronize_kernel()
+ * in the classic RCU implementation, but not in some RCU
+ * implementations optimized for realtime use. In these realtime
+ * uses, synchronize_kernel() can potentially return immediately,
+ * even on SMP systems.
+ *
+ * NMI-related uses of RCU need to use synchronize_kernel_barrier().
+ */
+
+#define synchronize_kernel_barrier() synchronize_kernel()
+
extern void rcu_init(void);
extern void rcu_check_callbacks(int cpu, int user);
extern void rcu_restart_cpu(int cpu);
@@ -275,40 +293,6 @@ extern void synchronize_kernel(void);
#define rcu_bh_qsctr_inc(cpu)
#define rcu_qsctr_inc(cpu)

-/**
- * rcu_dereference - fetch an RCU-protected pointer in an
- * RCU read-side critical section. This pointer may later
- * be safely dereferenced.
- *
- * Inserts memory barriers on architectures that require them
- * (currently only the Alpha), and, more importantly, documents
- * exactly which pointers are protected by RCU.
- */
-
-#define rcu_dereference(p) ({ \
- typeof(p) _________p1 = p; \
- smp_read_barrier_depends(); \
- (_________p1); \
- })
-
-/**
- * rcu_assign_pointer - assign (publicize) a pointer to a newly
- * initialized structure that will be dereferenced by RCU read-side
- * critical sections. Returns the value assigned.
- *
- * Inserts memory barriers on architectures that require them
- * (pretty much all of them other than x86), and also prevents
- * the compiler from reordering the code that initializes the
- * structure after the pointer assignment. More importantly, this
- * call documents which pointers will be dereferenced by RCU read-side
- * code.
- */
-
-#define rcu_assign_pointer(p, v) ({ \
- smp_wmb(); \
- (p) = (v); \
- })
-
extern void rcu_init(void);

/* Exported interfaces */
@@ -317,6 +301,7 @@ extern void FASTCALL(call_rcu(struct rcu
extern void rcu_read_lock(void);
extern void rcu_read_unlock(void);
extern void synchronize_kernel(void);
+extern void synchronize_kernel_barrier(void);
extern int rcu_pending(int cpu);
extern void rcu_check_callbacks(int cpu, int user);

diff -urpN -X dontdiff linux-2.6.11/kernel/module.c linux-2.6.11.fixes/kernel/module.c
--- linux-2.6.11/kernel/module.c Sat Mar 19 14:09:51 2005
+++ linux-2.6.11.fixes/kernel/module.c Sun Mar 20 09:13:23 2005
@@ -1812,7 +1812,7 @@ sys_init_module(void __user *umod,
/* Init routine failed: abort. Try to protect us from
buggy refcounters. */
mod->state = MODULE_STATE_GOING;
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */
if (mod->unsafe)
printk(KERN_ERR "%s: module is now stuck!\n",
mod->name);
diff -urpN -X dontdiff linux-2.6.11/kernel/profile.c linux-2.6.11.fixes/kernel/profile.c
--- linux-2.6.11/kernel/profile.c Sat Mar 19 14:09:51 2005
+++ linux-2.6.11.fixes/kernel/profile.c Sun Mar 20 09:18:05 2005
@@ -194,7 +194,7 @@ void unregister_timer_hook(int (*hook)(s
WARN_ON(hook != timer_hook);
timer_hook = NULL;
/* make sure all CPUs see the NULL hook */
- synchronize_kernel();
+ synchronize_kernel_barrier(); /* FIXME: Strong enough? */
}

EXPORT_SYMBOL_GPL(register_timer_hook);
diff -urpN -X dontdiff linux-2.6.11/kernel/rcupdate.c linux-2.6.11.fixes/kernel/rcupdate.c
--- linux-2.6.11/kernel/rcupdate.c Sat Mar 19 14:09:51 2005
+++ linux-2.6.11.fixes/kernel/rcupdate.c Sun Mar 20 09:32:13 2005
@@ -548,7 +548,37 @@ void synchronize_kernel(void)
}
}

-void rcu_advance_callbacks(void)
+/*
+ * FIXME: Note that this implementation might not be strong enough
+ * for a number of driver uses of synchronize_kernel. Some of these
+ * uses seem to assume a non-CONFIG_PREEMPT kernel, so may need
+ * to come up with a different approach. Note that these uses
+ * are -not- waiting to free memory, but rather to ensure that
+ * a change is seen by all future driver invocations.
+ *
+ * The correct implementation is likely to be a tasklist scan,
+ * which blocks until all tasks encounter a voluntary context switch.
+ * If so, this implementation is required in CONFIG_PREEMPT
+ * kernels as well as CONFIG_PREEMPT_RT kernels.
+ */
+
+void synchronize_kernel_barrier(void)
+{
+        cpumask_t oldmask;
+        cpumask_t curmask;
+        int cpu;
+
+        if (sched_getaffinity(0, &oldmask) < 0) {
+                oldmask = cpu_possible_map;
+        }
+        for_each_cpu(cpu) {
+                sched_setaffinity(0, cpumask_of_cpu(cpu));
+                schedule();
+        }
+        sched_setaffinity(0, oldmask);
+}
+
+static void rcu_advance_callbacks(void)
{
struct rcu_data *rdp;

@@ -578,7 +608,7 @@ void fastcall call_rcu(struct rcu_head *
put_cpu_var(rcu_data);
}

-void rcu_process_callbacks(void)
+static void rcu_process_callbacks(void)
{
struct rcu_head *next, *list;
struct rcu_data *rdp;

2005-03-20 22:41:06

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

On Sat, Mar 19, 2005 at 11:32:59PM -0500, Lee Revell wrote:
> On Sat, 2005-03-19 at 19:50 -0600, K.R. Foley wrote:
> > Lee Revell wrote:
> > > On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
> > >
> > >>the biggest change in this patch is the merge of Paul E. McKenney's
> > >>preemptable RCU code. The new RCU code is active on PREEMPT_RT. While it
> > >>is still quite experimental at this stage, it allowed the removal of
> > >>locking cruft (mainly in the networking code), so it could solve some of
> > >>the longstanding netfilter/networking deadlocks/crashes reported by a
> > >>number of people. Be careful nevertheless.
> > >
> > >
> > > With PREEMPT_RT my machine deadlocked within 20 minutes of boot.
> > > "apt-get dist-upgrade" seemed to trigger the crash. I did not see any
> > > Oops unfortunately.
> > >
> > > Lee
> > >
> >
> > Lee,
> >
> > Just curious. Is this with UP or SMP? I currently have my UP box running
> > PREEMPT_RT, with no problems thus far. However, my SMP box dies while
> booting (with an oops). I am working on trying to get set up to capture
> > the oops, although it might be tomorrow before I get that done.
> >
>
> UP. It's 100% reproducible; this machine locks up over and over. It seems
> to be associated with network activity by multiple processes.

OK, guess I need to go inspect the uses of synchronize_net() in addition
to synchronize_kernel...

If you do manage to get any additional info, please let me know...

Thanx, Paul

2005-03-21 08:54:40

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00


got this early-bootup crash on an SMP box:

BUG: Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0131aec
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in:
CPU: 1
EIP: 0060:[<c0131aec>] Not tainted VLI
EFLAGS: 00010293 (2.6.12-rc1-RT-V0.7.41-00)
EIP is at rcu_advance_callbacks+0x3c/0x80
eax: 00000000 ebx: c050f280 ecx: c12191e0 edx: 00000000
esi: cfd2e560 edi: cfd2e4e0 ebp: cfd31dd0 esp: cfd31dc8
ds: 007b es: 007b ss: 0068 preempt: 00000003
Process khelper (pid: 60, threadinfo=cfd30000 task=cfd106a0)
Stack: 00000001 c12191e0 cfd31de4 c0131b67 00000001 cfd2e4d8 c13004d8 cfd31e00
c017e449 cfd2e4d8 c04d6e80 cfd32006 fffffffe cfd31e54 cfd31e70 c01749cc
cfd2e4d8 cfd31e50 cfd31e4c 00000001 cfd32001 cfd2e4d8 c03dd41f c04cf920
Call Trace:
[<c010412f>] show_stack+0x7f/0xa0 (28)
[<c01042da>] show_registers+0x16a/0x1e0 (56)
[<c0104511>] die+0x101/0x190 (64)
[<c0115862>] do_page_fault+0x442/0x680 (216)
[<c0103d9b>] error_code+0x2b/0x30 (68)
[<c0131b67>] call_rcu+0x37/0x70 (20)
[<c017e449>] dput+0x139/0x210 (28)
[<c01749cc>] __link_path_walk+0x9fc/0xf80 (112)
[<c0174f9a>] link_path_walk+0x4a/0x130 (100)
[<c017538e>] path_lookup+0x9e/0x1c0 (32)
[<c01707e8>] open_exec+0x28/0x100 (100)
[<c0171a04>] do_execve+0x44/0x220 (36)
[<c0101da2>] sys_execve+0x42/0xa0 (36)
[<c0103315>] syscall_call+0x7/0xb (-8096)
---------------------------
| preempt count: 00000004 ]
| 4-level deep critical section nesting:
----------------------------------------
.. [<c0131b4f>] .... call_rcu+0x1f/0x70
.....[<c017e449>] .. ( <= dput+0x139/0x210)
.. [<c0131ac3>] .... rcu_advance_callbacks+0x13/0x80
.....[<c0131b67>] .. ( <= call_rcu+0x37/0x70)
.. [<c03dddca>] .... _raw_spin_lock_irqsave+0x1a/0xa0
.....[<c010444f>] .. ( <= die+0x3f/0x190)
.. [<c013b9e6>] .... print_traces+0x16/0x50
.....[<c010412f>] .. ( <= show_stack+0x7f/0xa0)

Code: 00 00 e8 78 2d 0a 00 8b 0c 85 20 20 51 c0 bb 80 f2 50 c0 01 d9 f0 83 44 24 00 00 a1 88 19 52 c0 39 41 40 74 23 8b 41 44 8b 51 50 <89> 02 8b 41 48 c7 41 44 00 00 00 00 89 41 50 8d 41 44 89 41 48
<6>note: khelper[60] exited with preempt_count 2

(gdb) list *0xc0131aec
0xc0131aec is in rcu_advance_callbacks (kernel/rcupdate.c:558).

553 struct rcu_data *rdp;
554
555 rdp = &get_cpu_var(rcu_data);
556 smp_mb(); /* prevent sampling batch # before list removal. */
557 if (rdp->batch != rcu_ctrlblk.batch) {
558 *rdp->donetail = rdp->waitlist;
559 rdp->donetail = rdp->waittail;
560 rdp->waitlist = NULL;
561 rdp->waittail = &rdp->waitlist;
562 rdp->batch = rcu_ctrlblk.batch;
(gdb)

2005-03-21 09:01:36

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00


* Ingo Molnar <[email protected]> wrote:

> got this early-bootup crash on an SMP box:

the same kernel image boots fine on an UP box, so it's an SMP bug.

note that the same occurs with your latest (synchronization barrier)
fixes applied as well.

Ingo

2005-03-21 09:07:19

by Ingo Molnar

Subject: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > got this early-bootup crash on an SMP box:
>
> the same kernel image boots fine on an UP box, so it's an SMP bug.
>
> note that the same occurs with your latest (synchronization barrier)
> fixes applied as well.

i've uploaded my current tree (-V0.7.41-01) to:

http://redhat.com/~mingo/realtime-preempt/

it includes the latest round of RCU fixes - but doesnt solve the SMP
bootup crash.

Ingo

2005-03-21 15:42:58

by K.R. Foley

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

Lee Revell wrote:
> On Sat, 2005-03-19 at 20:16 +0100, Ingo Molnar wrote:
>
>>i have released the -V0.7.41-00 Real-Time Preemption patch (merged to
>>2.6.12-rc1), which can be downloaded from the usual place:
>>
>> http://redhat.com/~mingo/realtime-preempt/
>>
>
>
> 3ms latency in the NFS client code. Workload was a kernel compile over
> NFS.
>
> preemption latency trace v1.1.4 on 2.6.12-rc1-RT-V0.7.41-00
> --------------------------------------------------------------------
> latency: 3178 µs, #4095/14224, CPU#0 | (M:preempt VP:0, KP:1, SP:1 HP:1 #P:1)
> -----------------
> | task: ksoftirqd/0-2 (uid:0 nice:-10 policy:0 rt_prio:0)
> -----------------
>
> _------=> CPU#
> / _-----=> irqs-off
> | / _----=> need-resched
> || / _---=> hardirq/softirq
> ||| / _--=> preempt-depth
> |||| /
> ||||| delay
> cmd pid ||||| time | caller
> \ / ||||| \ | /
> (T1/#0) <...> 32105 0 3 00000004 00000000 [0011939614227867] 0.000ms (+4137027.445ms): <6500646c> (<61000000>)
> (T1/#2) <...> 32105 0 3 00000004 00000002 [0011939614228097] 0.000ms (+0.000ms): __trace_start_sched_wakeup+0x9a/0xd0 <c013150a> (try_to_wake_up+0x94/0x140 <c0110474>)
> (T1/#3) <...> 32105 0 3 00000003 00000003 [0011939614228436] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0x94/0x140 <c0110474>)
> (T3/#4) <...>-32105 0dn.3 0µs : try_to_wake_up+0x11e/0x140 <c01104fe> <<...>-2> (69 76):
> (T1/#5) <...> 32105 0 3 00000002 00000005 [0011939614228942] 0.000ms (+0.000ms): preempt_schedule+0x11/0x80 <c02b57c1> (try_to_wake_up+0xf8/0x140 <c01104d8>)
> (T1/#6) <...> 32105 0 3 00000002 00000006 [0011939614229130] 0.000ms (+0.000ms): wake_up_process+0x35/0x40 <c0110555> (do_softirq+0x3f/0x50 <c011b05f>)
> (T6/#7) <...>-32105 0dn.1 1µs < (1)
> (T1/#8) <...> 32105 0 2 00000001 00000008 [0011939614229782] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
> (T1/#9) <...> 32105 0 2 00000001 00000009 [0011939614229985] 0.001ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
> (T1/#10) <...> 32105 0 2 00000001 0000000a [0011939614230480] 0.001ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
> (T1/#11) <...> 32105 0 2 00000001 0000000b [0011939614230634] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
> (T1/#12) <...> 32105 0 2 00000001 0000000c [0011939614230889] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
> (T1/#13) <...> 32105 0 2 00000001 0000000d [0011939614231034] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
> (T1/#14) <...> 32105 0 2 00000001 0000000e [0011939614231302] 0.002ms (+0.000ms): radix_tree_gang_lookup+0xe/0x70 <c01e05ee> (nfs_wait_on_requests+0x6d/0x110 <c01c744d>)
> (T1/#15) <...> 32105 0 2 00000001 0000000f [0011939614231419] 0.002ms (+0.000ms): __lookup+0xe/0xd0 <c01e051e> (radix_tree_gang_lookup+0x52/0x70 <c01e0632>)
>
> (last two lines just repeat)
>
> This is probably not a regression; I had never tested this with NFS
> before.
>
> Lee
>

Lee,

I did some testing with NFS quite a while ago. Actually it was the NFS
compile within the stress-kernel pkg. I had similar crappy performance
problems. Ingo pointed out that there were serious locking issues with
NFS and suggested that I report the problems to the NFS folks, which I
did. The reports seemed to fall mostly on deaf ears, at least that was
my perspective. I decided to move on and took the NFS compile out of my
stress testing to be revisited at a later time.

--
kr

2005-03-21 16:45:26

by Paul McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-00

> got this early-bootup crash on an SMP box:
>
> BUG: Unable to handle kernel NULL pointer dereference at virtual address 00000000
> printing eip:
> c0131aec
> *pde = 00000000
> Oops: 0002 [#1]
> PREEMPT SMP
> Modules linked in:
> CPU: 1
> EIP: 0060:[<c0131aec>] Not tainted VLI
> EFLAGS: 00010293 (2.6.12-rc1-RT-V0.7.41-00)
> EIP is at rcu_advance_callbacks+0x3c/0x80
> eax: 00000000 ebx: c050f280 ecx: c12191e0 edx: 00000000
> esi: cfd2e560 edi: cfd2e4e0 ebp: cfd31dd0 esp: cfd31dc8
> ds: 007b es: 007b ss: 0068 preempt: 00000003
> Process khelper (pid: 60, threadinfo=cfd30000 task=cfd106a0)
> Stack: 00000001 c12191e0 cfd31de4 c0131b67 00000001 cfd2e4d8 c13004d8 cfd31e00
> c017e449 cfd2e4d8 c04d6e80 cfd32006 fffffffe cfd31e54 cfd31e70 c01749cc
> cfd2e4d8 cfd31e50 cfd31e4c 00000001 cfd32001 cfd2e4d8 c03dd41f c04cf920
> Call Trace:
> [<c010412f>] show_stack+0x7f/0xa0 (28)
> [<c01042da>] show_registers+0x16a/0x1e0 (56)
> [<c0104511>] die+0x101/0x190 (64)
> [<c0115862>] do_page_fault+0x442/0x680 (216)
> [<c0103d9b>] error_code+0x2b/0x30 (68)
> [<c0131b67>] call_rcu+0x37/0x70 (20)
> [<c017e449>] dput+0x139/0x210 (28)
> [<c01749cc>] __link_path_walk+0x9fc/0xf80 (112)
> [<c0174f9a>] link_path_walk+0x4a/0x130 (100)
> [<c017538e>] path_lookup+0x9e/0x1c0 (32)
> [<c01707e8>] open_exec+0x28/0x100 (100)
> [<c0171a04>] do_execve+0x44/0x220 (36)
> [<c0101da2>] sys_execve+0x42/0xa0 (36)
> [<c0103315>] syscall_call+0x7/0xb (-8096)
> ---------------------------
> | preempt count: 00000004 ]
> | 4-level deep critical section nesting:
> ----------------------------------------
> .. [<c0131b4f>] .... call_rcu+0x1f/0x70
> .....[<c017e449>] .. ( <= dput+0x139/0x210)
> .. [<c0131ac3>] .... rcu_advance_callbacks+0x13/0x80
> .....[<c0131b67>] .. ( <= call_rcu+0x37/0x70)
> .. [<c03dddca>] .... _raw_spin_lock_irqsave+0x1a/0xa0
> .....[<c010444f>] .. ( <= die+0x3f/0x190)
> .. [<c013b9e6>] .... print_traces+0x16/0x50
> .....[<c010412f>] .. ( <= show_stack+0x7f/0xa0)
> Code: 00 00 e8 78 2d 0a 00 8b 0c 85 20 20 51 c0 bb 80 f2 50 c0 01 d9 f0 83 44 24 00 00 a1 88 19 52 c0 39 41 40 74 23 8b 41 44 8b 51 50 <89> 02 8b 41 48 c7 41 44 00 00 00 00 89 41 50 8d 41 44 89 41 48
> <6>note: khelper[60] exited with preempt_count 2
>
> (gdb) list *0xc0131aec
> 0xc0131aec is in rcu_advance_callbacks (kernel/rcupdate.c:558).
>
> 553 struct rcu_data *rdp;
> 554
> 555 rdp = &get_cpu_var(rcu_data);
> 556 smp_mb(); /* prevent sampling batch # before list removal. */
> 557 if (rdp->batch != rcu_ctrlblk.batch) {
> 558 *rdp->donetail = rdp->waitlist;
> 559 rdp->donetail = rdp->waittail;
> 560 rdp->waitlist = NULL;
> 561 rdp->waittail = &rdp->waitlist;
> 562 rdp->batch = rcu_ctrlblk.batch;
> (gdb)

Does the following help?

Thanx, Paul

diff -urpN -X dontdiff linux-2.6.11.fixes/kernel/rcupdate.c linux-2.6.11.fixes2/kernel/rcupdate.c
--- linux-2.6.11.fixes/kernel/rcupdate.c Mon Mar 21 08:14:47 2005
+++ linux-2.6.11.fixes2/kernel/rcupdate.c Mon Mar 21 08:17:00 2005
@@ -620,7 +620,7 @@ static void rcu_process_callbacks(void)
return;
}
rdp->donelist = NULL;
- rdp->donetail = &rdp->waitlist;
+ rdp->donetail = &rdp->donelist;
put_cpu_var(rcu_data);
while (list) {
next = list->next;

2005-03-21 23:16:56

by Magnus Naeslund(f)

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

Ingo Molnar wrote:
>
> i've uploaded my current tree (-V0.7.41-01) to:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> it includes the latest round of RCU fixes - but doesnt solve the SMP
> bootup crash.
>
> Ingo

With this kernel I can run for some 20 minutes, then the ip_dst_cache
overflows.
I gather it has something to do with RCU...

If I just let it run and grep ip_dst_cache /proc/slab it goes up to 4096
(max), then the printk's start and the network stops.
After I double the limit (echo "8192" >
/proc/sys/net/ipv4/route/max_size) the network starts to work again.
But of course, after a while it overflows again:

# grep ip_dst_cache /proc/slabinfo
ip_dst_cache 8192 8205 256 15 1 : tunables 16 8 0 : slabdata 547 547 0

I haven't tried the vanilla 2.6.12-rc2 kernel, and I don't have the time
to do that right now, but I figured it would have something to do with
your patch. Older 2.6.8 works just fine.

Regards,
Magnus

2005-03-22 05:45:30

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

On Tue, Mar 22, 2005 at 12:10:14AM +0100, Magnus Naeslund(t) wrote:
> Ingo Molnar wrote:
> >
> >i've uploaded my current tree (-V0.7.41-01) to:
> >
> > http://redhat.com/~mingo/realtime-preempt/
> >
> >it includes the latest round of RCU fixes - but doesnt solve the SMP
> >bootup crash.
> >
> > Ingo
>
> With this kernel I can run for some 20 minutes, then the ip_dst_cache
> overflows.
> I gather it has something to do with RCU...
>
> If I just let it run and grep ip_dst_cache /proc/slab it goes up to 4096
> (max), then the printk's start and the network stops.
> After I double the limit (echo "8192" >
> /proc/sys/net/ipv4/route/max_size) the network starts to work again.
> But of course, after a while it overflows again:
>
> # grep ip_dst_cache /proc/slabinfo
> ip_dst_cache 8192 8205 256 15 1 : tunables 16 8 0 : slabdata 547 547 0
>
> I haven't tried the vanilla 2.6.12-rc2 kernel, and I don't have the time
> to do that right now, but I figured it would have something to do with
> your patch. Older 2.6.8 works just fine.

Hello, Magnus,

I believe that my earlier patch might take care of this (included again
for convenience).

Thanx, Paul

Signed-off-by: <[email protected]>

diff -urpN -X dontdiff linux-2.6.11.fixes/kernel/rcupdate.c linux-2.6.11.fixes2/kernel/rcupdate.c
--- linux-2.6.11.fixes/kernel/rcupdate.c Mon Mar 21 08:14:47 2005
+++ linux-2.6.11.fixes2/kernel/rcupdate.c Mon Mar 21 08:17:00 2005
@@ -620,7 +620,7 @@ static void rcu_process_callbacks(void)
return;
}
rdp->donelist = NULL;
- rdp->donetail = &rdp->waitlist;
+ rdp->donetail = &rdp->donelist;
put_cpu_var(rcu_data);
while (list) {
next = list->next;

2005-03-22 05:50:05

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

On Mon, Mar 21, 2005 at 10:06:22AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > got this early-bootup crash on an SMP box:
> >
> > the same kernel image boots fine on an UP box, so it's an SMP bug.
> >
> > note that the same occurs with your latest (synchronization barrier)
> > fixes applied as well.
>
> i've uploaded my current tree (-V0.7.41-01) to:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> it includes the latest round of RCU fixes - but doesnt solve the SMP
> bootup crash.

Hello, Ingo,

Does the following help with the SMP problem? This fix and the earlier
one make my old patch survive a few rounds of kernbench on a 4-CPU x86
box. (Yes, I am still being cowardly! But happy that the test system
is alive again!) Without these fixes, it too dies during boot.

Thanx, Paul

diff -urpN -X dontdiff linux-2.6.11.fixes2/kernel/rcupdate.c linux-2.6.11.fixes3/kernel/rcupdate.c
--- linux-2.6.11.fixes2/kernel/rcupdate.c Mon Mar 21 08:17:00 2005
+++ linux-2.6.11.fixes3/kernel/rcupdate.c Mon Mar 21 20:00:00 2005
@@ -633,7 +633,7 @@ void rcu_check_callbacks(int cpu, int us
{
if ((unsigned long)(jiffies - rcu_ctrlblk.last_sk) >
HZ/GRACE_PERIODS_PER_SEC) {
- synchronize_kernel();
+ _synchronize_kernel();
rcu_advance_callbacks();
rcu_process_callbacks();
}

2005-03-22 07:30:11

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Paul E. McKenney <[email protected]> wrote:

> > it includes the latest round of RCU fixes - but doesnt solve the SMP
> > bootup crash.
>
> Hello, Ingo,
>
> Does the following help with the SMP problem? This fix and the
> earlier one make my old patch survive a few rounds of kernbench on a
> 4-CPU x86 box. [...]

does not seem to fix my testbox (see the crash log below). I've uploaded
the 40-02 patch with both of your fixes (maybe i mismerged something
somewhere). Does it boot on your box with PREEMPT_RT enabled? The patch
order is:

http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.11.tar.bz2
http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.12-rc1.bz2
http://redhat.com/~mingo/realtime-preempt/realtime-preempt-2.6.12-rc1-V0.7.41-02

Ingo

NET: Registered protocol family 16
BUG: Unable to handle kernel NULL pointer dereference at virtual address 00000000
printing eip:
c0131bcc
*pde = 00000000
Oops: 0002 [#1]
PREEMPT SMP
Modules linked in:
CPU: 1
EIP: 0060:[<c0131bcc>] Not tainted VLI
EFLAGS: 00010297 (2.6.12-rc1-RT-V0.7.41-01)
EIP is at rcu_advance_callbacks+0x3c/0x80
eax: 00000000 ebx: c050f280 ecx: c12191e0 edx: 00000000
esi: c1341c64 edi: c1341be4 ebp: c1355dd0 esp: c1355dc8
ds: 007b es: 007b ss: 0068 preempt: 00000003
Process khelper (pid: 17, threadinfo=c1354000 task=c13538d0)
Stack: 00000001 c12191e0 c1355de4 c0131c47 00000001 c1341bdc c13004d8 c1355e00
c017e529 c1341bdc c04d6e80 c1358006 fffffffe c1355e54 c1355e70 c0174aac
c1341bdc c1355e50 c1355e4c c1355e54 c1358001 c1341bdc c04cf920 00000286
Call Trace:
[<c010412f>] show_stack+0x7f/0xa0 (28)
[<c01042da>] show_registers+0x16a/0x1e0 (56)
[<c0104511>] die+0x101/0x190 (64)
[<c0115862>] do_page_fault+0x442/0x680 (216)
[<c0103d9b>] error_code+0x2b/0x30 (68)
[<c0131c47>] call_rcu+0x37/0x70 (20)
[<c017e529>] dput+0x139/0x210 (28)
[<c0174aac>] __link_path_walk+0x9fc/0xf80 (112)
[<c017507a>] link_path_walk+0x4a/0x130 (100)
[<c017546e>] path_lookup+0x9e/0x1c0 (32)
[<c01708c8>] open_exec+0x28/0x100 (100)
[<c0171ae4>] do_execve+0x44/0x220 (36)
[<c0101da2>] sys_execve+0x42/0xa0 (36)
[<c0103315>] syscall_call+0x7/0xb (-8096)
---------------------------
| preempt count: 00000004 ]
| 4-level deep critical section nesting:
----------------------------------------
.. [<c0131c2f>] .... call_rcu+0x1f/0x70
.....[<c017e529>] .. ( <= dput+0x139/0x210)
.. [<c0131ba3>] .... rcu_advance_callbacks+0x13/0x80
.....[<c0131c47>] .. ( <= call_rcu+0x37/0x70)
.. [<c03ddeaa>] .... _raw_spin_lock_irqsave+0x1a/0xa0
.....[<c010444f>] .. ( <= die+0x3f/0x190)
.. [<c013bac6>] .... print_traces+0x16/0x50
.....[<c010412f>] .. ( <= show_stack+0x7f/0xa0)

Code: 00 00 e8 78 2d 0a 00 8b 0c 85 20 20 51 c0 bb 80 f2 50 c0 01 d9 f0 83 44 24 00 00 a1 88 19 52 c0 39 41 40 74 23 8b 41 44 8b 51 50 <89> 02 8b 41 48 c7 41 44 00 00 00 00 89 41 50 8d 41 44 89 41 48

(gdb) list *0xc0131bcc
0xc0131bcc is in rcu_advance_callbacks (kernel/rcupdate.c:586).
581 struct rcu_data *rdp;
582
583 rdp = &get_cpu_var(rcu_data);
584 smp_mb(); /* prevent sampling batch # before list removal. */
585 if (rdp->batch != rcu_ctrlblk.batch) {
586 *rdp->donetail = rdp->waitlist;
587 rdp->donetail = rdp->waittail;
588 rdp->waitlist = NULL;
589 rdp->waittail = &rdp->waitlist;
590 rdp->batch = rcu_ctrlblk.batch;
(gdb)

2005-03-22 08:51:00

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Paul E. McKenney <[email protected]> wrote:

> return;
> }
> rdp->donelist = NULL;
> - rdp->donetail = &rdp->waitlist;
> + rdp->donetail = &rdp->donelist;
> put_cpu_var(rcu_data);
> while (list) {
> next = list->next;

seems like the RCU code should use list.h primitives, to avoid bugs like
this and to make the code more readable.
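
(for illustration, a rough sketch of what that could look like,
assuming the callbacks were chained with a struct list_head - this is
not the actual code:)

struct rcu_data {
        long batch;
        struct list_head waitlist;      /* waiting for a grace period */
        struct list_head donelist;      /* grace period complete */
};

static void rcu_advance_callbacks(struct rcu_data *rdp)
{
        if (rdp->batch != rcu_ctrlblk.batch) {
                /* one splice, no head/tail pointers to mix up: */
                list_splice_init(&rdp->waitlist, &rdp->donelist);
                rdp->batch = rcu_ctrlblk.batch;
        }
}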

Ingo

2005-03-22 09:00:06

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


regarding this code:

void synchronize_kernel(void)
{
        long oldbatch;

        smp_mb();
        oldbatch = rcu_ctrlblk.batch;
        schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);

what is the intent of that schedule_timeout() - a fixed-period delay? Is
that acceptable? In any case, it wont do what you think it does, you
should use mdelay(), or do current->state = TASK_UNINTERRUPTIBLE -
otherwise the call will just return when we are in TASK_RUNNING.
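
i.e. if a fixed-period delay really is intended, something like:

        set_current_state(TASK_UNINTERRUPTIBLE);
        schedule_timeout(HZ/GRACE_PERIODS_PER_SEC);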

Ingo

2005-03-22 09:23:45

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Ingo Molnar <[email protected]> wrote:

> > Does the following help with the SMP problem? This fix and the
> > earlier one make my old patch survive a few rounds of kernbench on a
> > 4-CPU x86 box. [...]
>
> does not seem to fix my testbox (see the crash log below). [...]

seems to be a true SMP race: when i boot with 1 CPU it doesnt trigger,
the same kernel image and 2 CPUs triggers it on CPU#1. (CPU#0 is the
boot CPU) Note that the timing of the crash is not deterministic
(sometimes i get it during net startup, sometimes during ACPI startup),
but it always crashes within rcu_advance_callbacks().

one difference between your tests and mine is that your kernel is doing
_synchronize_kernel() from preempt-off sections (correct?), while my
kernel with PREEMPT_RT does it on preemptable sections.

Ingo

2005-03-22 09:32:25

by Ingo Molnar

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Ingo Molnar <[email protected]> wrote:

> seems to be a true SMP race: when i boot with 1 CPU it doesnt trigger,
> the same kernel image and 2 CPUs triggers it on CPU#1. (CPU#0 is the
> boot CPU) Note that the timing of the crash is not deterministic
> (sometimes i get it during net startup, sometimes during ACPI
> startup), but it always crashes within rcu_advance_callbacks().
>
> one difference between your tests and mine is that your kernel is
> doing _synchronize_kernel() from preempt-off sections (correct?),
> while my kernel with PREEMPT_RT does it on preemptable sections.

hm, another thing: i think call_rcu() needs to take the read-lock. Right
now it assumes that it has the data structure private, but that's only
statistically true on PREEMPT_RT: another CPU may have this CPU's RCU
control structure in use. So IRQs-off (or preempt-off) is not a
guarantee to have the data structure, the read lock has to be taken.

Ingo

2005-03-22 10:02:12

by Ingo Molnar

Subject: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-05


* Ingo Molnar <[email protected]> wrote:

> hm, another thing: i think call_rcu() needs to take the read-lock.
> Right now it assumes that it has the data structure private, but
> that's only statistically true on PREEMPT_RT: another CPU may have
> this CPU's RCU control structure in use. So IRQs-off (or preempt-off)
> is not a guarantee to have the data structure, the read lock has to be
> taken.

i've reworked the code to use the read-lock to access the per-CPU data
RCU structures, and it boots with 2 CPUs and PREEMPT_RT now. The -40-05
patch can be downloaded from the usual place:

http://redhat.com/~mingo/realtime-preempt/

had to add two hacks though:

static void rcu_advance_callbacks(struct rcu_data *rdp)
{
        if (rdp->batch != rcu_ctrlblk.batch) {
                if (rdp->donetail) // HACK
                        *rdp->donetail = rdp->waitlist;
        ...

void fastcall call_rcu(struct rcu_head *head,
                       void (*func)(struct rcu_head *rcu))
[...]
        rcu_advance_callbacks(rdp);
        if (rdp->waittail) // HACK
                *rdp->waittail = head;
        ...

without them it crashes during bootup.

maybe we are better off with the completely unlocked read path and the
long grace periods.

Ingo

2005-03-22 11:32:45

by Ingo Molnar

Subject: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Ingo Molnar <[email protected]> wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > hm, another thing: i think call_rcu() needs to take the read-lock.
> > Right now it assumes that it has the data structure private, but
> > that's only statistically true on PREEMPT_RT: another CPU may have
> > this CPU's RCU control structure in use. So IRQs-off (or preempt-off)
> > is not a guarantee to have the data structure, the read lock has to be
> > taken.
>
> i've reworked the code to use the read-lock to access the per-CPU data
> RCU structures, and it boots with 2 CPUs and PREEMPT_RT now. The
> -40-05 patch can be downloaded from the usual place:

bah, it's leaking dentries at a massive scale. I'm giving up on this
variant for the time being and have gone towards a much simpler variant,
implemented in the -40-07 patch at:

http://redhat.com/~mingo/realtime-preempt/

it's along the lines of Esben's patch, but with the conceptual bug fixed
via the rcu_read_lock_nesting code from Paul's patch.

there's a new CONFIG_PREEMPT_RCU option. (always-enabled on PREEMPT_RT)
It builds & boots fine on my 2-way box, doesnt leak dentries and
networking is up and running.

first question, (ignoring the grace period problem) is this a correct RCU
implementation? The critical scenario is when a task gets migrated to
another CPU, so that current->rcu_data is that of another CPU's. That is
why ->active_readers is an atomic variable now. [ Note that while
->active_readers may be decreased from another CPU, it's always
increased on the current CPU, so when a preemption-off section
determines that a quiescent state has passed that determination stays
true up until it enables preemption again. This is needed for correct
callback processing. ]

this implementation has the 'long grace periods' problem. Starvation
should only be possible if a system has zero idle time for a long period
of time, and even then it needs the permanent starvation of
involuntarily preempted rcu-read-locked tasks. Is there any way to force
such a situation? (which would turn this into a DoS)

[ in OOM situations we could force quiescent state by walking all tasks
and checking for nonzero ->rcu_read_lock_nesting values and priority
boosting affected tasks (to RT prio 99 or RT prio 1), which they'd
automatically drop when they decrease their rcu_read_lock_nesting
counter to zero. ]
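
(a rough sketch of that boosting idea, with a hypothetical helper;
the corresponding de-boost when rcu_read_lock_nesting drops to zero
is omitted:)

static void rcu_boost_readers(void)
{
        struct task_struct *g, *p;
        struct sched_param param = { .sched_priority = 1 };

        read_lock(&tasklist_lock);
        do_each_thread(g, p) {
                /* boost tasks preempted inside a read-side section */
                if (p->rcu_read_lock_nesting > 0)
                        sched_setscheduler(p, SCHED_FIFO, &param);
        } while_each_thread(g, p);
        read_unlock(&tasklist_lock);
}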

Ingo

2005-03-22 13:57:06

by Magnus Naeslund(f)

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

Paul E. McKenney wrote:
>
> Hello, Magnus,
>
> I believe that my earlier patch might take care of this (included again
> for convenience).
>
> Thanx, Paul
>

I just tested this patch, and it did not solve my problem.
The dst cache still grows to the maximum and starts spitting errors via
printk.

I'll do a make mrproper build too to make sure I didn't make any mistakes.

Regards,
Magnus

2005-03-22 15:06:46

by K.R. Foley

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>>* Ingo Molnar <[email protected]> wrote:
>>
>>
>>>hm, another thing: i think call_rcu() needs to take the read-lock.
>>>Right now it assumes that it has the data structure private, but
>>>that's only statistically true on PREEMPT_RT: another CPU may have
>>>this CPU's RCU control structure in use. So IRQs-off (or preempt-off)
>>>is not a guarantee to have the data structure, the read lock has to be
>>>taken.
>>
>>i've reworked the code to use the read-lock to access the per-CPU data
>>RCU structures, and it boots with 2 CPUs and PREEMPT_RT now. The
>>-40-05 patch can be downloaded from the usual place:
>
>
> bah, it's leaking dentries at a massive scale. I'm giving up on this
> variant for the time being and have gone towards a much simpler variant,
> implemented in the -40-07 patch at:
>
> http://redhat.com/~mingo/realtime-preempt/
>

Actually this more correctly relates to -RT-2.6.12-rc1-V0.7.41-08.
This one boots on one of my SMP systems (dual 2.6 GHz Xeon) but
doesn't on the other (dual PIII 933 MHz). Unfortunately I'm not close
to the one that doesn't boot right now, so I can't debug it at all.

I was still building on the UP box, but now that one seems to have gone
down in flames. This system was running -RT-2.6.12-rc1-V0.7.41-00 while
trying to build -RT-2.6.12-rc1-V0.7.41-07. :(

--
kr

2005-03-22 18:09:13

by Magnus Naeslund(f)

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

Ingo Molnar wrote:
>
> bah, it's leaking dentries at a massive scale. I'm giving up on this
> variant for the time being and have gone towards a much simpler variant,
> implemented in the -40-07 patch at:
>
> http://redhat.com/~mingo/realtime-preempt/
>

I downloaded your V0.7.41-08 patch and it seems to have solved my
problem, at least at first glance.
The routing cache is now shrinkable with:
echo "0" > /proc/sys/net/ipv4/route/flush
That didn't work before.
I'll leave the server for a while and see if it overflows again...

Regards,
Magnus

2005-03-23 04:48:49

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

On Tue, Mar 22, 2005 at 10:32:01AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > seems to be a true SMP race: when i boot with 1 CPU it doesnt trigger,
> > the same kernel image and 2 CPUs triggers it on CPU#1. (CPU#0 is the
> > boot CPU) Note that the timing of the crash is not deterministic
> > (sometimes i get it during net startup, sometimes during ACPI
> > startup), but it always crashes within rcu_advance_callbacks().
> >
> > one difference between your tests and mine is that your kernel is
> > doing _synchronize_kernel() from preempt-off sections (correct?),
> > while my kernel with PREEMPT_RT does it on preemptable sections.
>
> hm, another thing: i think call_rcu() needs to take the read-lock. Right
> now it assumes that it has the data structure private, but that's only
> statistically true on PREEMPT_RT: another CPU may have this CPU's RCU
> control structure in use. So IRQs-off (or preempt-off) is not a
> guarantee to have the data structure, the read lock has to be taken.

!!! The difference is that in the stock kernel, rcu_check_callbacks()
is invoked from irq. In PREEMPT_RT, it is invoked from process context
and appears to be preemptible. This means that rcu_advance_callbacks()
can be preempted, resulting in all sorts of problems. Need to disable
preemption over this.
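
E.g., something like this in the process-context path (sketch only):

        preempt_disable();      /* exclude other tasks on this CPU */
        rcu_advance_callbacks();
        rcu_process_callbacks();
        preempt_enable();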

There are probably other bugs of this sort, I will track them down.

But, just to make sure I understand -- if I have preempt disabled over
all accesses to a per-CPU variable, and that variable is -not- accessed
from a real interrupt handler, then I am safe without a lock, right?

Thanx, Paul

PS. Fixing my stupid bugs that you pointed out in earlier email,
as well. :-/

2005-03-23 05:23:15

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-05

On Tue, Mar 22, 2005 at 11:01:53AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > hm, another thing: i think call_rcu() needs to take the read-lock.
> > Right now it assumes that it has the data structure private, but
> > that's only statistically true on PREEMPT_RT: another CPU may have
> > this CPU's RCU control structure in use. So IRQs-off (or preempt-off)
> > is not a guarantee to have the data structure, the read lock has to be
> > taken.
>
> i've reworked the code to use the read-lock to access the per-CPU data
> RCU structures, and it boots with 2 CPUs and PREEMPT_RT now. The -40-05
> patch can be downloaded from the usual place:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> had to add two hacks though:
>
> static void rcu_advance_callbacks(struct rcu_data *rdp)
> {
>         if (rdp->batch != rcu_ctrlblk.batch) {
>                 if (rdp->donetail) // HACK
>                         *rdp->donetail = rdp->waitlist;
>         ...
>
> void fastcall call_rcu(struct rcu_head *head,
>                        void (*func)(struct rcu_head *rcu))
> [...]
>         rcu_advance_callbacks(rdp);
>         if (rdp->waittail) // HACK
>                 *rdp->waittail = head;
>         ...
>
> without them it crashes during bootup.

Probably also the cause of the memory leakage. More evidence that
list macros are useful, just in case anyone needed any more.

> maybe we are better off with the completely unlocked read path and the
> long grace periods.

Might well be... But doesn't there need to be an smp_mb() right after
the increment in rcu_read_lock() and before the rcu_qsctr_inc(cpu)
in rcu_read_unlock(), at least for non-x86 CPUs?

On the long grace periods...

It should be possible to force the grace periods to end by providing a
pair of active_readers counters per rcu_data structure, one "current",
the other "expiring". New rcu_read_lock() invocations get a pointer
to the "current" active_readers counter for their current CPU,
and rcu_read_unlock() decrements whichever one the corresponding
rcu_read_lock() got a pointer to.

When it is necessary to force a grace period, the roles of the "current"
and the "expiring" counters are reversed. Once all CPUs' "expiring"
counters have gone to zero, then we know that all RCU read-side critical
sections that were executing at the time of the role reversal have
completed. It is necessary to wait for all the "expiring" counters to
go to zero before doing a second role reversal.

This is very similar to a mechanism in K42 and in Dipankar's
RCU. See page 18 of http://www.rdrop.com/users/paulmck/RCU/rcu.2002.07.08.pdf
for a better description. There is a patch against a very early
2.5 that I can dig up if anyone is interested.
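
A rough sketch of the counter flipping (rcu_read_lock_nesting is from
the -07 patch; the per-task reader_ctr pointer is hypothetical, and
the memory barriers discussed earlier are omitted):

struct rcu_data {
        atomic_t active_readers[2];     /* the two counters alternate roles */
        int current_idx;                /* which one new readers increment */
};
DEFINE_PER_CPU(struct rcu_data, rcu_data);

void rcu_read_lock(void)
{
        if (current->rcu_read_lock_nesting++ == 0) {
                struct rcu_data *rdp = &get_cpu_var(rcu_data);

                /* remember which counter we incremented */
                current->reader_ctr =
                        &rdp->active_readers[rdp->current_idx];
                atomic_inc(current->reader_ctr);
                put_cpu_var(rcu_data);
        }
}

void rcu_read_unlock(void)
{
        if (--current->rcu_read_lock_nesting == 0)
                atomic_dec(current->reader_ctr); /* maybe another CPU's */
}

To force a grace period, reverse the roles (flip current_idx on each
CPU) and wait for every CPU's "expiring" counter to drain to zero
before the next reversal.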

Once I get the lock-based method working, I might have the confidence
to take this on. If someone wants to try it out in the meantime, I
would of course be happy to review their code.

Thanx, Paul

2005-03-23 05:46:31

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01

On Tue, Mar 22, 2005 at 02:56:56PM +0100, Magnus Naeslund(t) wrote:
> Paul E. McKenney wrote:
> >
> >Hello, Magnus,
> >
> >I believe that my earlier patch might take care of this (included again
> >for convenience).
> >
> > Thanx, Paul
> >
>
> I just tested this patch, and it did not solve my problem.
> The dst cache still grows to the maximum and starts spitting errors via
> printk.
>
> I'll do a make mrproper build too to make sure I didn't make any mistakes.

Hello, Magnus,

I have at least two other bugs that would be fatal. Having learned a
bit more about PREEMPT_RT in this thread, I need to go off and dig through
my code for similar problems.

Thanx, Paul

2005-03-23 06:19:19

by Paul E. McKenney

Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Tue, Mar 22, 2005 at 12:28:56PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Ingo Molnar <[email protected]> wrote:
> >
> > > hm, another thing: i think call_rcu() needs to take the read-lock.
> > > Right now it assumes that it has the data structure private, but
> > > that's only statistically true on PREEMPT_RT: another CPU may have
> > > this CPU's RCU control structure in use. So IRQs-off (or preempt-off)
> > > is not a guarantee to have the data structure, the read lock has to be
> > > taken.
> >
> > i've reworked the code to use the read-lock to access the per-CPU data
> > RCU structures, and it boots with 2 CPUs and PREEMPT_RT now. The
> > -40-05 patch can be downloaded from the usual place:
>
> bah, it's leaking dentries at a massive scale. I'm giving up on this
> variant for the time being and have gone towards a much simpler variant,
> implemented in the -40-07 patch at:
>
> http://redhat.com/~mingo/realtime-preempt/
>
> it's along the lines of Esben's patch, but with the conceptual bug fixed
> via the rcu_read_lock_nesting code from Paul's patch.

Fair enough!

> there's a new CONFIG_PREEMPT_RCU option. (always-enabled on PREEMPT_RT)
> It builds & boots fine on my 2-way box, doesnt leak dentries and
> networking is up and running.
>
> first question, (ignoring the grace period problem) is this a correct RCU
> implementation? The critical scenario is when a task gets migrated to
> another CPU, so that current->rcu_data is that of another CPU's. That is
> why ->active_readers is an atomic variable now. [ Note that while
> ->active_readers may be decreased from another CPU, it's always
> increased on the current CPU, so when a preemption-off section
> determines that a quiescent state has passed that determination stays
> true up until it enables preemption again. This is needed for correct
> callback processing. ]

+#ifdef CONFIG_PREEMPT_RCU
+
+void rcu_read_lock(void)
+{
+	if (current->rcu_read_lock_nesting++ == 0) {
+		current->rcu_data = &get_cpu_var(rcu_data);
+		atomic_inc(&current->rcu_data->active_readers);
+		put_cpu_var(rcu_data);

Need an smp_mb() here for non-x86 CPUs. Otherwise, the CPU can
re-order parts of the critical section to precede the rcu_read_lock().
Could precede the put_cpu_var(), but why increase latency?

+	}
+}
+EXPORT_SYMBOL(rcu_read_lock);
+
+void rcu_read_unlock(void)
+{
+	int cpu;
+
+	if (--current->rcu_read_lock_nesting == 0) {
+		atomic_dec(&current->rcu_data->active_readers);
+		/*
+		 * Check whether we have reached quiescent state.
+		 * Note! This is only for the local CPU, not for
+		 * current->rcu_data's CPU [which typically is the
+		 * current CPU, but may also be another CPU].
+		 */
+		cpu = get_cpu();

And need an smp_mb() here, again for non-x86 CPUs.

+		rcu_qsctr_inc(cpu);
+		put_cpu();
+	}
+}
+EXPORT_SYMBOL(rcu_read_unlock);
+
+#endif

Assuming that the memory barriers are added, I can see a bunch of ways for
races to extend grace periods, but none so far that result in the fatal
too-short grace period. Since rcu_qsctr_inc() refuses to increment
the quiescent-state counter on any CPU that started an RCU read-side
critical section that has not yet completed, any long critical section
will have a corresponding CPU that will refuse to go through a quiescent
state. And that will prevent the grace period from completing.

> this implementation has the 'long grace periods' problem. Starvation
> should only be possible if a system has zero idle time for a long period
> of time, and even then it needs the permanent starvation of
> involuntarily preempted rcu-read-locked tasks. Is there any way to force
> such a situation? (which would turn this into a DoS)

If the sum of all the RCU read-side critical sections consumed much more
than 1/N of the CPU time, where N is the number of CPUs, then you would
see infinite grace periods just by statistics. But my guess is that
you would be in the tens of CPUs before hitting this. That said, I have
never measured this, since there has not been a reason to care.

> [ in OOM situations we could force quiescent state by walking all tasks
> and checking for nonzero ->rcu_read_lock_nesting values and priority
> boosting affected tasks (to RT prio 99 or RT prio 1), which they'd
> automatically drop when they decrease their rcu_read_lock_nesting
> counter to zero. ]

Makes sense to me. If there are too-long RCU read-side critical sections,
they could cause other tasks to miss deadlines once boosted.
Of course, if RCU read-side critical sections are all short enough,
you would already just be disabling preemption across all of them. ;-)
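
As a minimal sketch of that boosting walk (illustrative only -- the
automatic drop back to the old priority when ->rcu_read_lock_nesting
reaches zero would need additional scheduler support):

#include <linux/sched.h>

/* force quiescence: boost every task inside an RCU read section */
static void rcu_boost_readers(void)
{
	struct task_struct *p;
	struct sched_param param = { .sched_priority = 1 };

	read_lock(&tasklist_lock);
	for_each_process(p) {
		if (p->rcu_read_lock_nesting > 0)
			sched_setscheduler(p, SCHED_FIFO, &param);
	}
	read_unlock(&tasklist_lock);
}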

Thanx, Paul

2005-03-23 06:21:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-01


* Paul E. McKenney <[email protected]> wrote:

> !!! The difference is that in the stock kernel, rcu_check_callbacks()
> is invoked from irq. In PREEMPT_RT, it is invoked from process
> context and appears to be preemptible. This means that
> rcu_advance_callbacks() can be preempted, resulting in all sorts of
> problems. Need to disable preemption over this.

we can disable preemption in the PREEMPT_RT kernel too, but only for
code we know the maximum execution length of. I.e. we dont want to
disable preemption while processing the callback _list_, but we can use
preemption-off to protect the per-cpu data structures.

> There are probably other bugs of this sort, I will track them down.
>
> But, just to make sure I understand -- if I have preempt disabled over
> all accesses to a per-CPU variable, and that variable is -not-
> accessed from a real interrupt handler, then I am safe without a lock,
> right?

correct, assuming that the pointer you get to the per-CPU data is truly
pointing to the current CPU's data. (i.e. the current->rcu_data pointer
approach breaks this assumption.)

Ingo

2005-03-23 06:33:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Paul E. McKenney <[email protected]> wrote:

> +#ifdef CONFIG_PREEMPT_RCU
> +
> +void rcu_read_lock(void)
> +{
> + if (current->rcu_read_lock_nesting++ == 0) {
> + current->rcu_data = &get_cpu_var(rcu_data);
> + atomic_inc(&current->rcu_data->active_readers);
> + put_cpu_var(rcu_data);
>
> Need an smp_mb() here for non-x86 CPUs. Otherwise, the CPU can
> re-order parts of the critical section to precede the rcu_read_lock().
> Could precede the put_cpu_var(), but why increase latency?

ok. It's enough to put a barrier into the else branch here, because the
atomic op in the main branch is a barrier by itself.

> +void rcu_read_unlock(void)
> +{
[...]
> And need an smp_mb() here, again for non-x86 CPUs.

ok.

> Assuming that the memory barriers are added, I can see a bunch of ways
> for races to extend grace periods, but none so far that result in the
> fatal too-short grace period. Since rcu_qsctr_inc() refuses to
> increment the quiescent-state counter on any CPU that started an RCU
> read-side critical section that has not yet completed, any long
> critical section will have a corresponding CPU that will refuse to go
> through a quiescent state. And that will prevent the grace period
> from completing.

i'm worried about the following scenario: what happens when a task is
migrated from CPU#1 to CPU#2, while in an RCU read section that it
acquired on CPU#1, and queues a callback. E.g. d_free() does a
call_rcu(), to queue the freeing of the dentry.

That callback will be queued on CPU#2 - while the task still keeps
current->rcu_data of CPU#1. It also means that CPU#2's read counter did
_not_ get increased - and a too short grace period may occur.

it seems to me that the only safe method is to pick an 'RCU CPU' when
first entering the read section, and then stick to it, no matter
where the task gets migrated to. Or to 'migrate' the +1 read count from
one CPU to the other, within the scheduler.

the 'migrate read count' solution seems more promising, as it would keep
other parts of the RCU code unchanged. [ But it seems to break the nice
'flip pointers' method you found to force a grace period. If a 'read
section' can migrate from one CPU to another then it can migrate back as
well, at which point it cannot have the 'old' pointer. Maybe it would
still work better than no flip pointers. ]

Ingo

2005-03-23 06:38:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Ingo Molnar <[email protected]> wrote:

> the 'migrate read count' solution seems more promising, as it would
> keep other parts of the RCU code unchanged. [ But it seems to break
> the nice 'flip pointers' method you found to force a grace period. If
> a 'read section' can migrate from one CPU to another then it can
> migrate back as well, at which point it cannot have the 'old' pointer.
> Maybe it would still work better than no flip pointers. ]

the flip pointer method could be made to work if we had a NR_CPUS array
of 'current RCU pointer' values attached to the task - and that array
would be cleared if the task exits the read section. But this has memory
usage worries with large NR_CPUS. (full clearing of the array can be
avoided by using some sort of 'read section generation' counter attached
to each pointer)

Ingo

2005-03-23 07:16:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Ingo Molnar <[email protected]> wrote:

> That callback will be queued on CPU#2 - while the task still keeps
> current->rcu_data of CPU#1. It also means that CPU#2's read counter
> did _not_ get increased - and a too short grace period may occur.
>
> it seems to me that the only safe method is to pick an 'RCU CPU' when
> first entering the read section, and then stick to it, no matter
> where the task gets migrated to. Or to 'migrate' the +1 read count
> from one CPU to the other, within the scheduler.

i think the 'migrate read-count' method is not adequate either, because
all callbacks queued within an RCU read section must be called after the
lock has been dropped - while with the migration method CPU#1 would be
free to process callbacks queued in the RCU read section still active on
CPU#2.

i'm wondering how much of a problem this is though. Can there be stale
pointers at that point? Yes in theory, because code like:

rcu_read_lock();
call_rcu(&dentry->d_rcu, d_callback);
func(dentry->whatever);
rcu_read_unlock();

would be unsafe because the pointer is still accessed within the RCU
read section, and if we get migrated from CPU#1 to CPU#2 after call_rcu
but before dentry->whatever dereference, the callback may be processed
early by CPU#1, making the dentry->whatever read operation unsafe.

the question is, does this occur in practice? Does existing RCU-using
code use pointers it has queued for freeing, relying on the fact that
the callback wont be processed until we drop the RCU read lock?

Ingo

2005-03-23 07:57:19

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07



On Wed, 23 Mar 2005, Ingo Molnar wrote:

>
> * Ingo Molnar <[email protected]> wrote:
>
> > That callback will be queued on CPU#2 - while the task still keeps
> > current->rcu_data of CPU#1. It also means that CPU#2's read counter
> > did _not_ get increased - and a too short grace period may occur.
> >
> > it seems to me that the only safe method is to pick an 'RCU CPU' when
> > first entering the read section, and then stick to it, no matter
> > where the task gets migrated to. Or to 'migrate' the +1 read count
> > from one CPU to the other, within the scheduler.
>
> i think the 'migrate read-count' method is not adequate either, because
> all callbacks queued within an RCU read section must be called after the
> lock has been dropped - while with the migration method CPU#1 would be
> free to process callbacks queued in the RCU read section still active on
> CPU#2.
>

Hi Ingo,

Although you can't disable preemption for the duration of
rcu_read_lock(), what about pinning the process to a CPU while it holds
the lock? Would this help solve the migration issue?

-- Steve

2005-03-23 08:00:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > i think the 'migrate read-count' method is not adequate either, because
> > all callbacks queued within an RCU read section must be called after the
> > lock has been dropped - while with the migration method CPU#1 would be
> > free to process callbacks queued in the RCU read section still active on
> > CPU#2.
>
> Although you can't disable preemption for the duration of
> rcu_read_lock(), what about pinning the process to a CPU while it holds
> the lock? Would this help solve the migration issue?

yes, but that's rather gross. PREEMPT_BKL was opposed by Linus precisely
because it used such a method - once that was fixed, PREEMPT_BKL got
accepted. It also limits the execution flow quite a bit. I'd rather not
do it if there's any other method.

Ingo

2005-03-23 08:31:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, 2005-03-23 at 08:16 +0100, Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
> > That callback will be queued on CPU#2 - while the task still keeps
> > current->rcu_data of CPU#1. It also means that CPU#2's read counter
> > did _not_ get increased - and a too short grace period may occur.
> >
> > it seems to me that the only safe method is to pick an 'RCU CPU' when
> > first entering the read section, and then stick to it, no matter
> > where the task gets migrated to. Or to 'migrate' the +1 read count
> > from one CPU to the other, within the scheduler.
>
> i think the 'migrate read-count' method is not adequate either, because
> all callbacks queued within an RCU read section must be called after the
> lock has been dropped - while with the migration method CPU#1 would be
> free to process callbacks queued in the RCU read section still active on
> CPU#2.
>
how about keeping the rcu callback list in process context and only
splicing it to a global (per-cpu) list on rcu_read_unlock?

Kind regards,

Peter Zijlstra

--
Peter Zijlstra <[email protected]>

2005-03-23 09:03:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Peter Zijlstra <[email protected]> wrote:

> > i think the 'migrate read-count' method is not adequate either, because
> > all callbacks queued within an RCU read section must be called after the
> > lock has been dropped - while with the migration method CPU#1 would be
> > free to process callbacks queued in the RCU read section still active on
> > CPU#2.
>
> how about keeping the rcu callback list in process context and only
> splicing it to a global (per-cpu) list on rcu_read_unlock?

hm, that would indeed solve this problem. It would also solve the grace
period problem: all callbacks in the global (per-CPU) list are
immediately processable. Paul?

Ingo

2005-03-23 09:39:35

by Herbert Xu

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

Ingo Molnar <[email protected]> wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
>> +#ifdef CONFIG_PREEMPT_RCU
>> +
>> +void rcu_read_lock(void)
>> +{
>> + if (current->rcu_read_lock_nesting++ == 0) {
>> + current->rcu_data = &get_cpu_var(rcu_data);
>> + atomic_inc(&current->rcu_data->active_readers);
>> + put_cpu_var(rcu_data);
>>
>> Need an smp_mb() here for non-x86 CPUs. Otherwise, the CPU can
>> re-order parts of the critical section to precede the rcu_read_lock().
>> Could precede the put_cpu_var(), but why increase latency?
>
> ok. It's enough to put a barrier into the else branch here, because the
> atomic op in the main branch is a barrier by itself.

Since the else branch is only taken when rcu_read_lock_nesting > 0, do
we need the barrier at all?

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-23 09:41:38

by Herbert Xu

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

Paul E. McKenney <[email protected]> wrote:
>
> +void rcu_read_unlock(void)
> +{
> + int cpu;
> +
> + if (--current->rcu_read_lock_nesting == 0) {
> + atomic_dec(&current->rcu_data->active_readers);
> + /*
> + * Check whether we have reached quiescent state.
> + * Note! This is only for the local CPU, not for
> + * current->rcu_data's CPU [which typically is the
> + * current CPU, but may also be another CPU].
> + */
> + cpu = get_cpu();
>
> And need an smp_mb() here, again for non-x86 CPUs.

The atomic_dec is already a barrier.

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-23 09:48:53

by Herbert Xu

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 08:40:58PM +1100, Herbert Xu wrote:
> Paul E. McKenney <[email protected]> wrote:
> >
> > +void rcu_read_unlock(void)
> > +{
> > + int cpu;
> > +
> > + if (--current->rcu_read_lock_nesting == 0) {
> > + atomic_dec(&current->rcu_data->active_readers);
> > + /*
> > + * Check whether we have reached quiescent state.
> > + * Note! This is only for the local CPU, not for
> > + * current->rcu_data's CPU [which typically is the
> > + * current CPU, but may also be another CPU].
> > + */
> > + cpu = get_cpu();
> >
> > And need an smp_mb() here, again for non-x86 CPUs.
>
> The atomic_dec is already a barrier.

Sorry, I was confused. atomic_dec is not a barrier of course.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-23 09:51:05

by Herbert Xu

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 08:38:11PM +1100, Herbert Xu wrote:
>
> > ok. It's enough to put a barrier into the else branch here, because the
> > > atomic op in the main branch is a barrier by itself.
>
> Since the else branch is only taken when rcu_read_lock_nesting > 0, do
> we need the barrier at all?

Actually, since atomic_inc isn't a barrier, we do need that mb.
However, it should only be necessary in the main branch and we
can use smp_mb__after_atomic_inc which is optimised away on a
number of architectures (i386 in particular).
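
Applied to the rcu_read_lock() quoted above, that would look roughly
like this (a sketch, not a tested patch):

void rcu_read_lock(void)
{
	if (current->rcu_read_lock_nesting++ == 0) {
		current->rcu_data = &get_cpu_var(rcu_data);
		atomic_inc(&current->rcu_data->active_readers);
		/*
		 * Keep the critical section from being reordered
		 * before the increment; compiles away on i386,
		 * where the locked inc already implies ordering.
		 */
		smp_mb__after_atomic_inc();
		put_cpu_var(rcu_data);
	}
	/* nested case: no barrier needed, we are already inside
	   an enclosing read-side critical section */
}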

Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

2005-03-23 21:47:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Ingo Molnar <[email protected]> wrote:

> i think the 'migrate read-count' method is not adequate either,
> because all callbacks queued within an RCU read section must be called
> after the lock has been dropped - while with the migration method
> CPU#1 would be free to process callbacks queued in the RCU read
> section still active on CPU#2.
>
> i'm wondering how much of a problem this is though. Can there be stale
> pointers at that point? Yes in theory, because code like:
>
> rcu_read_lock();
> call_rcu(&dentry->d_rcu, d_callback);
> func(dentry->whatever);
> rcu_read_unlock();

but, this cannot happen, because call_rcu() is used by RCU-write code.

so the important property seems to be that any active RCU-read section
should keep at least one CPU's active_readers count elevated
permanently, for the duration of the RCU-read section. It doesnt matter
that the reader migrates between CPUs - because the RCU code itself
guarantees that no callbacks will be executed until _all_ CPUs have been
in quiescent state. I.e. all CPUs have to go through a zero
active_readers value before the callback is done.

Ingo

2005-03-24 05:29:21

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 07:33:17AM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > +#ifdef CONFIG_PREEMPT_RCU
> > +
> > +void rcu_read_lock(void)
> > +{
> > + if (current->rcu_read_lock_nesting++ == 0) {
> > + current->rcu_data = &get_cpu_var(rcu_data);
> > + atomic_inc(&current->rcu_data->active_readers);
> > + put_cpu_var(rcu_data);
> >
> > Need an smp_mb() here for non-x86 CPUs. Otherwise, the CPU can
> > re-order parts of the critical section to precede the rcu_read_lock().
> > Could precede the put_cpu_var(), but why increase latency?
>
> ok. It's enough to put a barrier into the else branch here, because the
> atomic op in the main branch is a barrier by itself.

For x86, the atomic op is a barrier, but not for other architectures.
You don't need a barrier in the else branch, because in that case
you are already in an enclosing RCU read-side critical section, so
any bleeding of code will be into this enclosing section, thus still
safe.

> > +void rcu_read_unlock(void)
> > +{
> [...]
> > And need an smp_mb() here, again for non-x86 CPUs.
>
> ok.
>
> > Assuming that the memory barriers are added, I can see a bunch of ways
> > for races to extend grace periods, but none so far that result in the
> > fatal too-short grace period. Since rcu_qsctr_inc() refuses to
> > increment the quiescent-state counter on any CPU that started an RCU
> > read-side critical section that has not yet completed, any long
> > critical section will have a corresponding CPU that will refuse to go
> > through a quiescent state. And that will prevent the grace period
> > from completing.
>
> i'm worried about the following scenario: what happens when a task is
> migrated from CPU#1 to CPU#2, while in an RCU read section that it
> acquired on CPU#1, and queues a callback. E.g. d_free() does a
> call_rcu(), to queue the freeing of the dentry.
>
> That callback will be queued on CPU#2 - while the task still keeps
> current->rcu_data of CPU#1. It also means that CPU#2's read counter did
> _not_ get increased - and a too short grace period may occur.

Let me make sure I understand the sequence of events:

CPU#1                                   CPU#2

task1: rcu_read_lock()

task1: migrate to CPU#2

                                        task1: call_rcu()

                                        task1: rcu_read_unlock()

This sequence will be safe because CPU#1's active_readers field will
be non-zero throughout, so that CPU#1 will refuse to record any
quiescent state from the time that task1 does the rcu_read_lock()
on CPU#1 until the time that it does the rcu_read_unlock() on CPU#2.

Now, it is true that CPU#2 might record a quiescent state during this
time, but this will have no effect because -all- CPUs must pass through
a quiescent state before any callbacks will be invoked. Since CPU#1
is refusing to record a quiescent state, grace periods will be blocked
for the full extent of task 1's RCU read-side critical section.

Or am I misunderstanding your scenario? Or, for that matter, your code?

> it seems to me that the only safe method is to pick an 'RCU CPU' when
> first entering the read section, and then stick to it, no matter
> where the task gets migrated to. Or to 'migrate' the +1 read count from
> one CPU to the other, within the scheduler.

This would work too, but I don't believe it is necessary given what
you are already doing.

> the 'migrate read count' solution seems more promising, as it would keep
> other parts of the RCU code unchanged. [ But it seems to break the nice
> 'flip pointers' method you found to force a grace period. If a 'read
> section' can migrate from one CPU to another then it can migrate back as
> well, at which point it cannot have the 'old' pointer. Maybe it would
> still work better than no flip pointers. ]

Well, I do believe that suppressing migration of tasks during RCU read-side
critical sections would simplify the flip-pointers/counters code. But
that is easy to say in advance of actually producing this code. ;-)

Thanx, Paul

2005-03-24 05:35:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Paul E. McKenney <[email protected]> wrote:

> Now, it is true that CPU#2 might record a quiescent state during this
> time, but this will have no effect because -all- CPUs must pass
> through a quiescent state before any callbacks will be invoked. Since
> CPU#1 is refusing to record a quiescent state, grace periods will be
> blocked for the full extent of task 1's RCU read-side critical
> section.

ok, great. So besides the barriers issue (and the long grace period time
issue), the current design is quite ok. And i think your original flip
pointers suggestion can be used to force synchronization.

Ingo

2005-03-24 06:08:55

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 07:37:27AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > the 'migrate read count' solution seems more promising, as it would
> > keep other parts of the RCU code unchanged. [ But it seems to break
> > the nice 'flip pointers' method you found to force a grace period. If
> > a 'read section' can migrate from one CPU to another then it can
> > migrate back as well, at which point it cannot have the 'old' pointer.
> > Maybe it would still work better than no flip pointers. ]
>
> the flip pointer method could be made to work if we had a NR_CPUS array
> of 'current RCU pointer' values attached to the task - and that array
> would be cleared if the task exits the read section. But this has memory
> usage worries with large NR_CPUS. (full clearing of the array can be
> avoided by using some sort of 'read section generation' counter attached
> to each pointer)

The only per-task data you need to maintain is a single pointer to
the per-CPU counter that was incremented by the outermost rcu_read_lock().
You also need a global index, call it "rcu_current_ctr_set" (or come
up with a better name!) along with a per-CPU pair of counters:

struct rcu_ctr_set {		/* or maybe there is a way to drop array */
	atomic_t ctr[2];	/* into DECLARE_PER_CPU() without a struct */
};				/* wrapper... */
int rcu_curset = 0;
DEFINE_PER_CPU(struct rcu_ctr_set, rcu_ctr_set) =
	{ { ATOMIC_INIT(0), ATOMIC_INIT(0) } };

You need two fields in the task structure:

	atomic_t *rcu_preempt_ctr;
	int rcu_nesting;

Then you have something like:

void rcu_read_lock(void)
{
	if (current->rcu_nesting++ == 0) {
		preempt_disable();
		current->rcu_preempt_ctr =
			&__get_cpu_var(rcu_ctr_set).ctr[rcu_curset];
		atomic_inc(current->rcu_preempt_ctr);
		preempt_enable();
		smp_mb();
	}
}

void rcu_read_unlock(void)
{
	if (--current->rcu_nesting == 0) {
		smp_mb();	/* might only need smp_wmb()... */
		atomic_dec(current->rcu_preempt_ctr);
		current->rcu_preempt_ctr = NULL;	/* for debug */
	}
}

One can then force a grace period via something like the following,
but only if you know that all of the rcu_ctr_set.ctr[!rcu_curset] are zero:

void _synchronize_kernel(void)
{
	int cpu;

	spin_lock(&rcu_mutex);
	rcu_curset = !rcu_curset;
	for_each_cpu(cpu) {
		/* wait for this CPU's "expiring" counter to drain */
		while (atomic_read(&per_cpu(rcu_ctr_set, cpu).ctr[!rcu_curset]) != 0)
			yield();	/* yield CPU for a bit */
	}
	spin_unlock(&rcu_mutex);
}

In real life, you need a way to multiplex multiple calls to
_synchronize_kernel() into a single counter-flip event, by
setting up callbacks. And so on...

The above stolen liberally from your patch and from my memories of
Dipankar's RCU patch for CONFIG_PREEMPT kernels. Guaranteed to have
horrible bugs. ;-)

Thanx, Paul

2005-03-24 06:38:22

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 08:16:04AM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > That callback will be queued on CPU#2 - while the task still keeps
> > current->rcu_data of CPU#1. It also means that CPU#2's read counter
> > did _not_ get increased - and a too short grace period may occur.
> >
> > it seems to me that the only safe method is to pick an 'RCU CPU' when
> > first entering the read section, and then stick to it, no matter
> > where the task gets migrated to. Or to 'migrate' the +1 read count
> > from one CPU to the other, within the scheduler.
>
> i think the 'migrate read-count' method is not adequate either, because
> all callbacks queued within an RCU read section must be called after the
> lock has been dropped - while with the migration method CPU#1 would be
> free to process callbacks queued in the RCU read section still active on
> CPU#2.

Right -- the limitation is that you cannot declare a grace period
over until -all- RCU read-side critical sections that started before
the call_rcu() have completed.

> i'm wondering how much of a problem this is though. Can there be stale
> pointers at that point? Yes in theory, because code like:
>
> rcu_read_lock();
> call_rcu(&dentry->d_rcu, d_callback);
> func(dentry->whatever);
> rcu_read_unlock();

In this code segment, a correct RCU will not invoke the callback
until after the rcu_read_unlock(). Ugly code, though. But see
below...

> would be unsafe because the pointer is still accessed within the RCU
> read section, and if we get migrated from CPU#1 to CPU#2 after call_rcu
> but before dentry->whatever dereference, the callback may be processed
> early by CPU#1, making the dentry->whatever read operation unsafe.
>
> the question is, does this occur in practice? Does existing RCU-using
> code use pointers it has queued for freeing, relying on the fact that
> the callback wont be processed until we drop the RCU read lock?

Using a pointer that had been passed to call_rcu() would be in theory
safe, but I would usually object to it. In most cases, it would cause
confusion at the very least. However, there are exceptions:

rcu_read_lock();
list_for_each_entry_rcu(p, head, list) {
	if (unlikely(p->status == DELETE_ME)) {
		spin_lock(&p->mutex);
		if (likely(p->status == DELETE_ME)) {
			p->status = DELETED;
			list_del_rcu(&p->list);
			call_rcu(&p->rcu, sublist_finalize_rcu);
		}
		spin_unlock(&p->mutex);
	}
}
rcu_read_unlock();

You can drop the spinlock before the call_rcu() if you want, but doing
so makes the code uglier. Besides, you have to allow for the fact that
another instance of this same code might be running on the same list
on some other CPU. You cannot invoke the callback until -both- CPUs/tasks
have exited their RCU read-side critical sections.

Thanx, Paul

2005-03-24 06:45:48

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 10:03:41AM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > > i think the 'migrate read-count' method is not adequate either, because
> > > all callbacks queued within an RCU read section must be called after the
> > > lock has been dropped - while with the migration method CPU#1 would be
> > > free to process callbacks queued in the RCU read section still active on
> > > CPU#2.
> >
> > how about keeping the rcu callback list in process context and only
> > splicing it to a global (per-cpu) list on rcu_read_unlock?
>
> hm, that would indeed solve this problem. It would also solve the grace
> period problem: all callbacks in the global (per-CPU) list are
> immediately processable. Paul?

If I understand the proposal, it would break in the following situation
(lifted from earlier email):

rcu_read_lock();
list_for_each_entry_rcu(p, head, list) {
	if (unlikely(p->status == DELETE_ME)) {
		spin_lock(&p->mutex);
		if (likely(p->status == DELETE_ME)) {
			p->status = DELETED;
			list_del_rcu(&p->list);
			call_rcu(&p->rcu, sublist_finalize_rcu);
		}
		spin_unlock(&p->mutex);
	}
}
rcu_read_unlock();

Here, sublist_finalize_rcu() just finds the front of the block and
kfree()s it.

Here is the scenario:

CPU 1                                   CPU 2

task 1 does rcu_read_lock()

                                        task 2 does rcu_read_lock()

task 1 sees DELETE_ME

                                        task 2 sees DELETE_ME

task 1 acquires the lock

                                        task 2 blocks/spins on lock

task 1 does call_rcu()

task 1 releases lock

task 1 does rcu_read_unlock()

                                        task 2 acquires lock

RCU puts the callback on global list

RCU invokes callback, kfree()!!!

                                        task 2 now sees garbage!!!


Callbacks cannot be invoked until all RCU read-side critical sections
that were in progress at the time of the call_rcu() have
completed.

Thanx, Paul

2005-03-24 06:53:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 08:49:54PM +1100, Herbert Xu wrote:
> On Wed, Mar 23, 2005 at 08:38:11PM +1100, Herbert Xu wrote:
> >
> > > ok. It's enough to put a barrier into the else branch here, because the
> > > atomic op in the main branch is a barrier by itself.
> >
> > Since the else branch is only taken when rcu_read_lock_nesting > 0, do
> > we need the barrier at all?
>
> Actually, since atomic_inc isn't a barrier, we do need that mb.
> However, it should only be necessary in the main branch and we
> can use smp_mb__after_atomic_inc which is optimised away on a
> number of architectures (i386 in particular).

You are right, I should have said smp_mb__after_atomic_inc() instead of
smp_mb() in the rcu_read_lock() case and smp_mb__after_atomic_dec()
in the rcu_read_unlock() case.
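
The unlock side would then become something like this (again only a
sketch on top of the code quoted earlier in the thread):

void rcu_read_unlock(void)
{
	int cpu;

	if (--current->rcu_read_lock_nesting == 0) {
		atomic_dec(&current->rcu_data->active_readers);
		cpu = get_cpu();
		/*
		 * Order the critical section before the
		 * quiescent-state report; compiles away on i386.
		 */
		smp_mb__after_atomic_dec();
		rcu_qsctr_inc(cpu);
		put_cpu();
	}
}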

Thanx, Paul

2005-03-24 06:59:45

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, Mar 23, 2005 at 10:46:45PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > i think the 'migrate read-count' method is not adequate either,
> > because all callbacks queued within an RCU read section must be called
> > after the lock has been dropped - while with the migration method
> > CPU#1 would be free to process callbacks queued in the RCU read
> > section still active on CPU#2.
> >
> > i'm wondering how much of a problem this is though. Can there be stale
> > pointers at that point? Yes in theory, because code like:
> >
> > rcu_read_lock();
> > call_rcu(&dentry->d_rcu, d_callback);
> > func(dentry->whatever);
> > rcu_read_unlock();
>
> but, this cannot happen, because call_rcu() is used by RCU-write code.

The code would not look exactly like this, but acquiring the update-side
lock inside an RCU read-side critical section can be thought of as
a reader-to-writer lock upgrade. RCU can do this unconditionally,
which was one of the walls I was banging my head against when trying
to think up a realtime-safe RCU implementation.

So something like the following would be legitimate RCU code:

rcu_read_lock();
spin_lock(&dcache_lock);
call_rcu(&dentry->d_rcu, d_callback);
spin_unlock(&dcache_lock);
rcu_read_unlock();

The spin_lock() call upgrades from a read-side to a write-side critical
section.

> so the important property seems to be that any active RCU-read section
> should keep at least one CPU's active_readers count elevated
> permanently, for the duration of the RCU-read section.

Yep!

> It doesnt matter
> that the reader migrates between CPUs - because the RCU code itself
> guarantees that no callbacks will be executed until _all_ CPUs have been
> in quiescent state. I.e. all CPUs have to go through a zero
> active_readers value before the callback is done.

Yep again!

Thanx, Paul

2005-03-24 07:46:08

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, Mar 24, 2005 at 06:34:56AM +0100, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > Now, it is true that CPU#2 might record a quiescent state during this
> > time, but this will have no effect because -all- CPUs must pass
> > through a quiescent state before any callbacks will be invoked. Since
> > CPU#1 is refusing to record a quiescent state, grace periods will be
> > blocked for the full extent of task 1's RCU read-side critical
> > section.
>
> ok, great. So besides the barriers issue (and the long grace period time
> issue), the current design is quite ok. And i think your original flip
> pointers suggestion can be used to force synchronization.

The thing I am currently struggling with on the flip-pointers approach is
handling races between rcu_read_lock() and the flipping. In the earlier
implementations that used this trick, you were guaranteed that if you were
executing concurrently with one flip, you would do a voluntary context
switch before the next flip happened, so that the race was harmless.
This guarantee does not work in the PREEMPT_RT case, so more thought
will be required. :-/

Thanx, Paul

2005-03-24 08:31:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


Ingo,

I've noticed the following situation which is caused by __up_mutex
assigning an owner right away.

Given 3 processes, with priorities 1, 2, 3, where 3 is highest and 1 is
lowest, and call them process A, B, C respectively.

1. Process A runs and grabs Lock L.
2. Process B preempts A, tries to grab Lock L.
3. Process A inherits process B's priority and starts to run.
4. Process C preempts A, tries to grab Lock L.
5. Process A inherits process C's priority and starts to run.
6. Process A releases Lock L and loses its priority.
7. Process C gets Lock L.
8. Process C runs and releases Lock L.
9. With __up_mutex, Process B automatically gets Lock L.
10. Process C tries to grab Lock L again, but is now blocked by B.

So we have needless latency for Process C, since it should be able to get
lock L again. The problem arises because process B grabbed the lock
without actually running.

Since I agree with the rule that a lock can't have waiters while not being
owned, I believe that this problem can be solved by adding a flag that
states that the lock is "partially owned". That is, the ownership of the
lock would still be transferred at step 9, but a flag is set to indicate
that it has not yet been grabbed. This flag would be cleared when Process B wakes up and
leaves the __down call.

This way, when process C tries to get the lock again, it sees that the
lock is owned but that B hasn't executed yet. So it would be safe for C
to take the lock back from B if C has higher priority than B; otherwise
it would act the same as it does now.

If you agree with this approach, I would be willing to write a patch for
you.

-- Steve

2005-03-24 08:48:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 24 Mar 2005, Steven Rostedt wrote:
>
> I've noticed the following situation which is caused by __up_mutex
> assigning an owner right away.
>

I also see this with non-RT tasks causing a burst of schedules.

1. Process A runs and grabs lock L, then finishes its time slice.
2. Process B runs and tries to grab Lock L.
3. Process A runs and releases lock L.
4. __up_mutex gives process B lock L.
5. Process A tries to grab lock L.
6. Process B runs and releases lock L.
7. __up_mutex gives lock L to process A.
8. Process B tries to grab lock L again.
9. Process A runs...

Here we have more unnecessary schedules. So the condition to grab a lock
should be:

1. not owned.
2. partially owned, and the owner is not RT.
3. partially owned but the owner is RT and so is the grabber, and the
grabber's priority is >= the owner's priority.
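
As a rough sketch of these rules (the lock type and the lock_pending()
test are hypothetical, and remember the kernel's internal ->prio scale
is inverted -- a lower value means higher priority):

/* may 'task' take a lock whose ownership is still pending? */
static int lock_grabbable(struct rt_mutex *lock,
			  struct task_struct *task)
{
	struct task_struct *owner = lock->owner;

	if (!owner)
		return 1;		/* 1. not owned */
	if (!lock_pending(lock))
		return 0;		/* fully owned */
	if (!rt_task(owner))
		return 1;		/* 2. pending, owner is not RT */
	/* 3. both RT: steal only at equal or higher priority */
	return rt_task(task) && task->prio <= owner->prio;
}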

-- Steve

2005-03-24 10:43:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> This way when process C tries to get the lock again, it sees that it's
> owned, but B hasn't executed yet. So it would be safe for C to take
> the lock back from B, that is if C is greater priority than B,
> otherwise it would act the same.

agreed. In particular this would be a nice optimization for cases where
tasks are delayed for a longer time due to CPU congestion (e.g. lots of
tasks on the same SCHED_FIFO priority). So if a higher prio task comes
along while the pending owner is still waiting for a CPU, it could take
the lock right away.

> If you agree with this approach, I would be willing to write a patch
> for you.

yeah - please do.

Ingo

2005-03-24 10:46:09

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> I also see this with non-RT tasks causing a burst of schedules.
>
> 1. Process A runs and grabs lock L, then finishes its time slice.
> 2. Process B runs and tries to grab Lock L.
> 3. Process A runs and releases lock L.
> 4. __up_mutex gives process B lock L.
> 5. Process A tries to grab lock L.
> 6. Process B runs and releases lock L.
> 7. __up_mutex gives lock L to process A.
> 8. Process B tries to grab lock L again.
> 9. Process A runs...
>
> Here we have more unnecessary schedules. So the condition to grab a lock
> should be:
>
> 1. not owned.
> 2. partially owned, and the owner is not RT.
> 3. partially owned but the owner is RT and so is the grabber, and the
> grabber's priority is >= the owner's priority.

yeah, sounds good - but i'd not make any distinction between RT and
non-RT tasks - just make the rule #3 distinction based on ->prio. In
particular on UP a task should only run when its higher prio, so if a
lock is 'partially owned' then the priority rule should always be true.
(On SMP it's a bit more complex, there the priority rule could make a
real difference.)

Ingo

2005-03-24 11:40:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> Here we have more unnecessary schedules. So the condition to grab a
> lock should be:
>
> 1. not owned.
> 2. partially owned, and the owner is not RT.
> 3. partially owned but the owner is RT and so is the grabber, and the
> grabber's priority is >= the owner's priority.

there's another approach that could solve this problem: let the
scheduler sort it all out. Esben Nielsen had this suggestion a couple of
months ago - i didnt follow it because i thought that technique would
create too many runnable tasks, but maybe that was a mistake. If we do
the owning of the lock once the wakee hits the CPU we avoid the 'partial
owner' problem, and we have the scheduler sort out priorities and
policies.

but i think i like the 'partial owner' (or rather 'owner pending')
technique a bit better, because it controls concurrency explicitly, and
it would thus e.g. allow another trick: when a new owner 'steals' a lock
from another in-flight task, then we could 'unwakeup' that in-flight
thread which could thus avoid two more context-switches on e.g. SMP
systems: hitting the CPU and immediately blocking on the lock. (But this
is a second-phase optimization which needs some core scheduler magic as
well, i guess i'll be the one to code it up.)

Ingo

2005-03-24 14:33:42

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


On Thu, 24 Mar 2005, Ingo Molnar wrote:
>
> there's another approach that could solve this problem: let the
> scheduler sort it all out. Esben Nielsen had this suggestion a couple of
> months ago - i didnt follow it because i thought that technique would
> create too many runnable tasks, but maybe that was a mistake. If we do
> the owning of the lock once the wakee hits the CPU we avoid the 'partial
> owner' problem, and we have the scheduler sort out priorities and
> policies.

I've thought about this too, and came to the conclusion that this was
too messy. You have to give up the information about which processes are
waiting on the lock when you release it. Or keep the information about
the waiters (waking only one, "the wakee"), but then you have a lock with
waiters and no owner, which is the messy part.

On an SMP machine, there may even be a chance that a lower priority
process gets it. That would be possible if the low priority process on the
other CPU tries to grab the lock just after it was released but before
the just-woken high priority processes get scheduled. So there's a
window where the lock is open, and the lower priority process snags it
just before the others get in.


>
> but i think i like the 'partial owner' (or rather 'owner pending')
> technique a bit better, because it controls concurrency explicitly, and
> it would thus e.g. allow another trick: when a new owner 'steals' a lock
> from another in-flight task, then we could 'unwakeup' that in-flight
> thread which could thus avoid two more context-switches on e.g. SMP
> systems: hitting the CPU and immediately blocking on the lock. (But this
> is a second-phase optimization which needs some core scheduler magic as
> well, i guess i'll be the one to code it up.)
>

Darn! It seemed like fun to implement. I may do it myself anyway on my
kernel just to understand your implementation even better.

Later,

-- Steve

2005-03-24 17:51:49

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > but i think i like the 'partial owner' (or rather 'owner pending')
> > technique a bit better, because it controls concurrency explicitly, and
> > it would thus e.g. allow another trick: when a new owner 'steals' a lock
> > from another in-flight task, then we could 'unwakeup' that in-flight
> > thread which could thus avoid two more context-switches on e.g. SMP
> > systems: hitting the CPU and immediately blocking on the lock. (But this
> > is a second-phase optimization which needs some core scheduler magic as
> > well, i guess i'll be the one to code it up.)
>
> Darn! It seemed like fun to implement. I may do it myself anyway on my
> kernel just to understand your implementation even better.

feel free to implement the whole thing. Unscheduling a task should be
done carefully, for obvious reasons. (I've implemented it once 1-2 years
ago for a different purpose, to unschedule ksoftirqd - it ought to be
somewhere in the lkml archives.)

Ingo

2005-03-24 18:18:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> On an SMP machine, there may even be a chance that a lower priority
> process gets it. That would be possible if the low priority
> process on the other CPU tries to grab the lock just after it was
> released but before the just-woken high priority processes get
> scheduled. So there's a window where the lock is open, and the lower
> priority process snags it just before the others get in.

that's always a possibility, on UP too: if a lower priority task manages
to acquire a lock 'just before' a highprio thread got interested in it,
there's no way to undo that.

but for the other reasons the explicit approach looks better.

Ingo

2005-03-24 23:06:00

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 24 Mar 2005, Ingo Molnar wrote:

>
> * Steven Rostedt <[email protected]> wrote:
>
> > Here we have more unnecessary schedules. So the condition to grab a
> > lock should be:
> >
> > 1. not owned.
> > 2. partially owned, and the owner is not RT.
> > 3. partially owned but the owner is RT and so is the grabber, and the
> > grabber's priority is >= the owner's priority.
>
> there's another approach that could solve this problem: let the
> scheduler sort it all out. Esben Nielsen had this suggestion a couple of
> months ago - i didnt follow it because i thought that technique would
> create too many runnable tasks, but maybe that was a mistake. If we do
> the owning of the lock once the wakee hits the CPU we avoid the 'partial
> owner' problem, and we have the scheduler sort out priorities and
> policies.
>
> but i think i like the 'partial owner' (or rather 'owner pending')
> technique a bit better, because it controls concurrency explicitly, and
> it would thus e.g. allow another trick: when a new owner 'steals' a lock
> from another in-flight task, then we could 'unwakeup' that in-flight
> thread which could thus avoid two more context-switches on e.g. SMP
> systems: hitting the CPU and immediately blocking on the lock. (But this
> is a second-phase optimization which needs some core scheduler magic as
> well, i guess i'll be the one to code it up.)
>

I checked the implementation of a mutex I sent in last fall. The unlock
operation does give ownership explicitly to the highest priority waiter,
as Ingo's implementation does.

Originally I planned for unlock to just wake up the highest
priority waiter and set lock->owner = NULL. The lock operation would be
something like

	while (lock->owner != NULL)
		schedule();

and then grab the lock.

Then the first task, i.e. the one with highest priority on UP, will get it
first. On SMP a low priority task on another CPU might get in and take it.

I like the idea of having the scheduler take care of it - it is a very
optimally coded queue system after all. That will work on UP but not on SMP.
Having the unlock operation set the mutex to a "partially owned" state
will work better. The only problem I see, relative to Ingo's
implementation, is that the awoken task then has to go in and
change the state of the mutex, i.e. it has to lock the wait_lock again.
Will the extra schedulings that are the problem here happen often enough
in practice to justify the extra overhead?


> Ingo

Esben

2005-03-25 06:22:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Esben Nielsen <[email protected]> wrote:

> I like the idea of having the scheduler take care of it - it is a very
> optimally coded queue system after all. That will work on UP but not on
> SMP. Having the unlock operation set the mutex to a "partially
> owned" state will work better. The only problem I see, relative to
> Ingo's implementation, is that the awoken task then has to go in and
> change the state of the mutex, i.e. it has to lock the wait_lock
> again. Will the extra schedulings that are the problem here happen often
> enough in practice to justify the extra overhead?

i think this should be covered by the 'unschedule/unwakeup' feature,
mentioned in the latest mails.

Ingo

2005-03-26 16:04:39

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07



On Fri, 25 Mar 2005, Esben Nielsen wrote:
> >
>
> I checked the implementation of a mutex I sent in last fall. The unlock
> operation does give ownership explicitly to the highest priority waiter,
> as Ingo's implementation does.
>
> Originally I planned for unlock to just wake up the highest
> priority waiter and set lock->owner = NULL. The lock operation would be
> something like
>
>	while (lock->owner != NULL)
>		schedule();
>
> and then grab the lock.
>
> Then the first task, i.e. the one with highest priority on UP, will get it
> first. On SMP a low priority task on another CPU might get in and take it.

This could be dangerous on SMP then, because if a higher priority process
on the CPU of the process that grabbed the lock preempted it, the
other CPU could spin on this, since the spinner would be the highest
priority process for its CPU. Not to mention that you have to make sure
priority inheritance is still implemented for the higher priority process
that was woken up but had the lock stolen by the lower priority process.


>
> I like the idea of having the scheduler take care of it - it is a very
> optimally coded queue system after all. That will work on UP but not on SMP.
> Having the unlock operation set the mutex to a "partially owned" state
> will work better. The only problem I see, relative to Ingo's
> implementation, is that the awoken task then has to go in and
> change the state of the mutex, i.e. it has to lock the wait_lock again.
> Will the extra schedulings that are the problem here happen often enough
> in practice to justify the extra overhead?

Another answer is to have the "pending owner" bit be part of the task
structure - a flag, maybe. If a higher priority process comes in and
decides to grab the lock from this owner, it does a test_and_clear on
this flag on the pending owner task. When the pending owner task wakes
up, it does the test_and_clear on its own bit. Whoever had the bit set
on the test wins. If the higher prio task clears it first, it
takes the ownership away from the pending owner. If the pending owner
clears the bit first, it won and would continue as though it got the
lock. The higher priority tasks would do this test within the wait_lock
to keep more than one of them from trying to grab the lock from the
pending owner, but the pending owner won't need to take the wait_lock,
since it will know whether it is the new owner just by testing its own bit.
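
A rough sketch of that (the flag bit and the pending_flags field are
invented for illustration; note that a later mail in this thread
concludes the wakeup side needs the wait_lock after all):

	#define PF_PENDING_OWNER	0	/* hypothetical bit index */

	/* higher-prio task, called with lock->wait_lock held: */
	if (test_and_clear_bit(PF_PENDING_OWNER, &owner->pending_flags)) {
		/* we cleared the bit first: steal the lock */
		lock->owner = current;
	}

	/* pending owner, waking up in __down(): */
	if (test_and_clear_bit(PF_PENDING_OWNER, &current->pending_flags)) {
		/* we won: continue as the real owner */
	} else {
		/* the lock was stolen: go back to waiting */
	}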

-- Steve

2005-03-26 16:32:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


On Fri, 25 Mar 2005, Ingo Molnar wrote:
> * Esben Nielsen <[email protected]> wrote:
>
> > I like the idea of having the scheduler take care of it - it is a very
> > optimal coded queue-system after all. That will work on UP but not on
> > SMP. Having the unlock operation to set the mutex in a "partially
> > owned" state will work better. The only problem I see, relative to
> > Ingo's implementation, is that then the awoken task have to go in and
> > change the state of the mutex, i.e. it has to lock the wait_lock
> > again. Will the extra schedulings being the problem happen offen
> > enough in practise to have the extra overhead?
>
> i think this should be covered by the 'unschedule/unwakeup' feature,
> mentioned in the latest mails.
>

The first implementation would probably just be the setting of a "pending
owner" bit. But the better one may be to unschedule. But what would the
overhead be for unscheduling, since you need to grab the run queue locks
for that? This might make for an interesting case study. The waking up of
a process that had the lock stolen may not happen that much. The lock
stealing would (as I see in my runs) happen quite a bit, though. But on
UP, the waking of the robbed owner would never happen, unless it also
owned a lock that a higher priority process wanted.

-- Steve

2005-03-26 19:11:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > i think this should be covered by the 'unschedule/unwakeup' feature,
> > mentioned in the latest mails.
>
> The first implementation would probably just be the setting of a
> "pending owner" bit. But the better one may be to unschedule. But
> what would the overhead be for unscheduling, since you need to grab
> the run queue locks for that? This might make for an interesting case
> study. The waking up of a process that had the lock stolen may not
> happen that much. The lock stealing would (as I see in my runs)
> happen quite a bit, though. But on UP, the waking of the robbed owner
> would never happen, unless it also owned a lock that a higher priority
> process wanted.

yeah, lets skip the unscheduling for now, the 'pending owner' bit is the
important one.

Ingo

2005-03-30 06:32:38

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Sat, 2005-03-26 at 11:04 -0500, Steven Rostedt wrote:
>
> On Fri, 25 Mar 2005, Esben Nielsen wrote:

> >
> > I like the idea of having the scheduler take care of it - it is a very
> > optimally coded queue system after all. That will work on UP but not on SMP.
> > Having the unlock operation set the mutex to a "partially owned" state
> > will work better. The only problem I see, relative to Ingo's
> > implementation, is that the awoken task then has to go in and
> > change the state of the mutex, i.e. it has to lock the wait_lock again.
> > Will the extra schedulings that are the problem here happen often enough
> > in practice to justify the extra overhead?
>
> Another answer is to have the "pending owner" bit be part of the task
> structure. A flag, maybe. If a higher priority process comes in and
> decides to grab the lock from this owner, it does a test_and_clear on
> this flag of the pending owner task. When the pending owner task wakes
> up, it does the test_and_clear on its own bit. Whoever had the bit set
> on the test wins. If the higher prio task were to clear it first, then it
> takes the ownership away from the pending owner. If the pending owner
> were to clear the bit first, it won and would continue as though it got the
> lock. The higher priority tasks would do this test within the wait_lock
> to keep from having more than one trying to grab the lock from the pending
> owner, but the pending owner won't need to do anything since it will know
> if it was the new owner just by testing its own bit.
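
In rough C, the scheme quoted above would look something like this (a
hypothetical sketch with made-up names, not the actual -RT code;
RT_PENDOWNER_BIT is invented for illustration):

/*
 * Both sides race on one bit; test_and_clear_bit() is atomic, so
 * exactly one of them sees the bit set and wins the lock.
 */
static int steal_from_pending_owner(struct task_struct *owner)
{
	/* called by the higher-prio task, under lock->wait_lock */
	return test_and_clear_bit(RT_PENDOWNER_BIT, &owner->rt_flags);
}

static int secure_pending_lock(struct task_struct *task)
{
	/* called by the woken pending owner, with no lock held */
	return test_and_clear_bit(RT_PENDOWNER_BIT, &task->rt_flags);
}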

OK, I'm declaring defeat here. I've been fighting race conditions all
day, and it's now 1 in the morning where I live. It looks like this
implementation has no other choice but to have the waking "pending
owner" take the wait_list lock once again. How heavy an overhead is
that, really?

The problem I've painfully discovered is that a task trying to take the
lock must test the pending owner for two things. One is whether the
pending owner owns the same lock as the one the task is trying to get.
Since the wakeup of the pending owner involves no synchronous locking,
the pending owner can grab the lock and then become the pending owner of
another lock after the other task decides it is still the pending owner
of the lock it's trying to get, but before that task tests it. So the
task can mistake it as still being the pending owner of this lock, when
in reality it owns the lock and is pending on another.

The other test must also do the test_and_clear_bit on the pending owner
bit. So you need to make sure not only that the owner stays the owner of
the lock the task is trying to get, but also that you can do the atomic
test_and_clear on the owner's pending owner bit.

I can't get these two in sync without grabbing a lock (in this case, the
wait_list lock).

Ingo, unless you can think of a way to do this, tomorrow (actually
today) I'll change this to have the end of __down (and friends) grab
the wait_list lock to test and clear its pending owner bit.

-- Steve


2005-03-30 06:51:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> OK, I'm declaring defeat here. I've been fighting race conditions all
> day, and it's now 1 in the morning where I live. It looks like this
> implementation has no other choice but to have the waking "pending
> owner" take the wait_list lock once again. How heavy an overhead is
> that, really?

as i mentioned it before, taking a lock is not a big issue at all. Since
you have to touch the lock data structure anyway (and all of it fits
into a single cacheline), it doesnt really matter whether it's atomic
flag setting/clearing, or raw spinlock based.

later on, once things are stable and well-understood, we can still
attempt to micro-optimize it.

Ingo

2005-03-30 16:47:05

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, 2005-03-30 at 08:50 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > OK, I'm declaring defeat here. I've been fighting race conditions all
> > day, and it's now 1 in the morning where I live. It looks like this
> > implementation has no other choice but to have the waking "pending
> > owner" take the wait_list lock once again. How heavy an overhead is
> > that, really?
>
> as i mentioned it before, taking a lock is not a big issue at all. Since
> you have to touch the lock data structure anyway (and all of it fits
> into a single cacheline), it doesnt really matter whether it's atomic
> flag setting/clearing, or raw spinlock based.
>

OK, I've implemented adding the write_lock, but I still have a nasty
race condition, and I finally figured out why!

The damn BKL! Here's the situation I'm having.

Process A grabs the BKL and then tries to grab another semaphore (say sem X).
But process B has sem X, so process A sleeps.
Since semaphores don't have to hold the BKL across a sleep, the BKL
gets released by process A.
Process C comes along and grabs the BKL.
Finally B releases sem X, and process A becomes the new "pending owner"
and wakes up.
When A schedules in, it tries to grab the BKL but blocks. Now it is at
the point where A doesn't actually own either the BKL or sem X.
When C releases the BKL and gives it to A, A is now the pending owner
of both the BKL and sem X.

When this occurs, all hell breaks loose.

I believe I can solve this by making the BKL a special case and not
implementing the pending owner for it at all.

> later on, once things are stable and well-understood, we can still
> attempt to micro-optimize it.
>

Heck, I'll make it bloat city till I get it working, and then tone it
down a little :-) And maybe later we can have a better solution for the
BKL.

-- Steve


2005-03-30 19:46:22

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, 30 Mar 2005, Steven Rostedt wrote:

> [...]
>
> Heck, I'll make it bloat city till I get it working, and then tone it
> down a little :-) And maybe later we can have a better solution for the
> BKL.
>
What about removing it altogether?
Seriously, how much work would it be to simply remove it, and go in and
add specific locks in all the places where the code then can't compile?

Esben

> -- Steve
>
>

2005-03-30 19:59:58

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, 2005-03-30 at 20:44 +0100, Esben Nielsen wrote:
> On Wed, 30 Mar 2005, Steven Rostedt wrote:
>
> > [...]
> >
> > Heck, I'll make it bloat city till I get it working, and then tone it
> > down a little :-) And maybe later we can have a better solution for the
> > BKL.
> >
> What about removing it altogether?
> Seriously, how much work would it be to simply remove it, and go in and
> add specific locks in all the places where the code then can't compile?

I would really love to do that! But I don't have the time or the
knowledge of the effects that would have.

Because of the stupid BKL, I'm going with a combination of your idea and
my idea for the solution of pending owners. I originally wanted the
stealer of the lock to put the task that was robbed back on the list.
But because of the BKL, you end up with a state where a task can be
blocked on two locks at the same time. This causes hell with priority
inheritance.

So finally, what I'm doing now is to have the lock still pick the
pending owner, and that owner is not blocked on the lock. If another
process comes along and steals the lock, the robbed task goes to a state
as if it were just before calling __down*. So when it wakes up, it checks
to see if it is the pending owner; if not, it tries to grab the
lock again, and if it doesn't get it, it just calls task_blocks_on_lock
again.

This is the easiest solution so far, but I would still like the stealer to put
it back on the list. But until we get rid of the BKL, that won't happen.

-- Steve


2005-03-30 21:39:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Wed, 2005-03-30 at 14:56 -0500, Steven Rostedt wrote:

> Because of the stupid BKL, I'm going with a combination of your idea and
> my idea for the solution of pending owners. I originally wanted the
> stealer of the lock to put the task that was robbed back on the list.
> But because of the BKL, you end up with a state where a task can be
> blocked on two locks at the same time. This causes hell with priority
> inheritance.
>
> So finally, what I'm doing now is to have the lock still pick the
> pending owner, and that owner is not blocked on the lock. If another
> process comes along and steals the lock, the robbed task goes to a state
> as if it were just before calling __down*. So when it wakes up, it checks
> to see if it is the pending owner; if not, it tries to grab the
> lock again, and if it doesn't get it, it just calls task_blocks_on_lock
> again.
>
> This is the easiest solution so far, but I would still like the stealer to put
> it back on the list. But until we get rid of the BKL, that won't happen.

Well, here it finally is. There's still things I don't like about it.
But it seems to work, and that's the important part.

I had to reluctantly add two items to the task_struct. I was hoping to
avoid that. But because of race conditions it seemed to be the only way.

The two are:

pending_owner - probably should be pending_lock, but it made sense with
the state of the code when I first created it. This points to the lock
that the task is pending on. I would have used blocked_on->lock, but
there existed a race condition: between testing it and finding it wasn't
the lock, the task could continue and this waiter (being on the stack)
would be garbage. The chances that it would be garbage and contain the
same lock are extremely low, but the chance exists, and talk about one
hell of a debug to find it!

rt_flags - I would have just added the flag to flags, but since this was
protected by the lock and not by the task, I was afraid of a
read-modify-write from schedule screwing this flag up.

To get a lock, all the down functions call grab_lock. This function
could probably be optimized, but right now it is easy to read. It
performs a series of tests to see if it can take the lock, including
whether the lock is free and, if not, whether the current task can
steal it.

All sleeping down functions also call capture_lock, which tests whether
the lock was stolen. If it was, the task tries to get it again, and if
that fails it just calls task_blocks_on_lock again.

It's a relatively simple patch, but it took a lot of pain since I was
trying very hard to have the stealer do the work. But the BKL proved to
be too much.

Also, Ingo, I hope you don't mind my little copyright notice :-)

-- Steve

Index: kernel/rt.c
===================================================================
--- kernel/rt.c (revision 109)
+++ kernel/rt.c (working copy)
@@ -17,6 +17,11 @@
* Copyright (c) 2001 David Howells ([email protected]).
* - Derived partially from idea by Andrea Arcangeli <[email protected]>
* - Derived also from comments by Linus
+ *
+ * Pending ownership of locks and ownership stealing:
+ *
+ * Copyright (C) 2005, Kihon Technologies Inc., Steven Rostedt
+ *
*/
#include <linux/config.h>
#include <linux/sched.h>
@@ -28,6 +33,14 @@
#include <linux/interrupt.h>

/*
+ * These flags are used for allowing of stealing of ownerships.
+ */
+#define RT_PENDOWNER 1 /* pending owner on a lock */
+
+#define TASK_PENDING(task) \
+ ((task)->rt_flags & RT_PENDOWNER)
+
+/*
* This flag is good for debugging the PI code - it makes all tasks
* in the system fall under PI handling. Normally only SCHED_FIFO/RR
* tasks are PI-handled:
@@ -791,6 +804,113 @@
}

/*
+ * Try to grab a lock, and if it is owned but the owner
+ * hasn't woken up yet, see if we can steal it.
+ *
+ * Return: 1 if task can grab lock.
+ * 0 if not.
+ */
+static int grab_lock(struct rt_mutex *lock, struct task_struct *task)
+{
+ struct task_struct *owner = lock->owner;
+
+ if (!owner)
+ return 1;
+ /*
+ * The lock is owned, but now test to see if the owner
+ * is still sleeping and hasn't woken up to get the lock.
+ */
+
+ /* Test the simple case first, is it already running? */
+ if (!TASK_PENDING(owner))
+ return 0;
+
+ /* The owner is pending on a lock, but is it this lock? */
+ if (owner->pending_owner != lock)
+ return 0;
+
+ /*
+ * There's an owner, but it hasn't woken up to take the lock yet.
+ * See if we should steal it from him.
+ */
+ if (task->prio > owner->prio)
+ return 0;
+
+ /*
+ * The BKL is a pain in the ass.
+ * Don't ever steal it.
+ */
+ if (lock == &kernel_sem.lock)
+ return 0;
+
+ /*
+ * This task is of higher priority than the current pending
+ * owner, so we may steal it.
+ */
+ owner->rt_flags &= ~RT_PENDOWNER;
+ owner->pending_owner = NULL;
+
+#ifdef CONFIG_RT_DEADLOCK_DETECT
+ /*
+ * This task will be taking the ownership away, and
+ * when it does, the lock can't be on the held list.
+ */
+ if (lock->debug) {
+ TRACE_WARN_ON(list_empty(&lock->held_list));
+ list_del_init(&lock->held_list);
+ }
+#endif
+ return 1;
+}
+
+/*
+ * Bring a task from pending ownership to owning a lock.
+ *
+ * Return 0 if we secured it, otherwise non-zero if it was
+ * stolen.
+ */
+static int capture_lock(struct rt_mutex_waiter *waiter, struct task_struct *task)
+{
+ struct rt_mutex *lock = waiter->lock;
+ unsigned long flags;
+ int ret = 0;
+
+ /*
+ * The BKL is special, we always get it.
+ */
+ if (lock == &kernel_sem.lock)
+ return 0;
+
+ trace_lock_irqsave(&trace_lock, flags);
+ spin_lock(&lock->wait_lock);
+
+ if (!(task->rt_flags & RT_PENDOWNER)) {
+ /* someone else stole it */
+ TRACE_BUG_ON(lock->owner == task);
+ if (grab_lock(lock,task)) {
+ /* we got it back! */
+ struct task_struct *old_owner = lock->owner;
+ spin_lock(&pi_lock);
+ set_new_owner(lock, old_owner, task, waiter->eip);
+ spin_unlock(&pi_lock);
+ ret = 0;
+ } else {
+ /* Add ourselves back to the list */
+ task_blocks_on_lock(waiter,task,lock,waiter->eip);
+ ret = 1;
+ }
+ } else {
+ task->rt_flags &= ~RT_PENDOWNER;
+ task->pending_owner = NULL;
+ }
+
+ spin_unlock(&lock->wait_lock);
+ trace_unlock_irqrestore(&trace_lock, flags);
+
+ return ret;
+}
+
+/*
* lock it semaphore-style: no worries about missed wakeups.
*/
static void __sched __down(struct rt_mutex *lock, unsigned long eip)
@@ -805,11 +925,12 @@

init_lists(lock);

- if (!lock->owner) {
+ if (grab_lock(lock,task)) {
/* granted */
- TRACE_WARN_ON(!list_empty(&lock->wait_list));
+ struct task_struct *old_owner = lock->owner;
+ TRACE_WARN_ON(!list_empty(&lock->wait_list) && !old_owner);
spin_lock(&pi_lock);
- set_new_owner(lock, NULL, task, eip);
+ set_new_owner(lock, old_owner, task, eip);
spin_unlock(&pi_lock);
spin_unlock(&lock->wait_lock);
trace_unlock_irqrestore(&trace_lock, flags);
@@ -834,6 +955,7 @@
#endif
current->flags &= ~PF_NOSCHED;

+ wait_again:
/* wait to be given the lock */
for (;;) {
if (!waiter.task)
@@ -841,6 +963,14 @@
schedule();
set_task_state(task, TASK_UNINTERRUPTIBLE);
}
+ /*
+ * Check to see if we didn't have ownership stolen.
+ */
+ if (capture_lock(&waiter,task)) {
+ set_task_state(task, TASK_UNINTERRUPTIBLE);
+ goto wait_again;
+ }
+
current->flags |= nosched_flag;
task->state = TASK_RUNNING;
}
@@ -894,11 +1024,12 @@

init_lists(lock);

- if (!lock->owner) {
+ if (grab_lock(lock,task)) {
/* granted */
- TRACE_WARN_ON(!list_empty(&lock->wait_list));
+ struct task_struct *old_owner = lock->owner;
+ TRACE_WARN_ON(!list_empty(&lock->wait_list) && !old_owner);
spin_lock(&pi_lock);
- set_new_owner(lock, NULL, task, eip);
+ set_new_owner(lock, old_owner, task, eip);
spin_unlock(&pi_lock);
spin_unlock(&lock->wait_lock);
trace_unlock_irqrestore(&trace_lock, flags);
@@ -933,6 +1064,7 @@
#endif
current->flags &= ~PF_NOSCHED;

+ wait_again:
/* wait to be given the lock */
for (;;) {
unsigned long saved_flags = current->flags & PF_NOSCHED;
@@ -949,6 +1081,16 @@
got_wakeup = 1;
}
/*
+ * Check to see if we didn't have ownership stolen.
+ */
+ if (capture_lock(&waiter,task)) {
+ state = xchg(&task->state, TASK_UNINTERRUPTIBLE);
+ if (state == TASK_RUNNING)
+ got_wakeup = 1;
+ goto wait_again;
+ }
+
+ /*
* Only set the task's state to TASK_RUNNING if it got
* a non-mutex wakeup. We keep the original state otherwise.
* A mutex wakeup changes the task's state to TASK_RUNNING_MUTEX,
@@ -1024,11 +1166,12 @@

init_lists(lock);

- if (!lock->owner) {
+ if (grab_lock(lock,task)) {
/* granted */
- TRACE_WARN_ON(!list_empty(&lock->wait_list));
+ struct task_struct *old_owner = lock->owner;
+ TRACE_WARN_ON(!list_empty(&lock->wait_list) && !old_owner);
spin_lock(&pi_lock);
- set_new_owner(lock, NULL, task, eip);
+ set_new_owner(lock, old_owner, task, eip);
spin_unlock(&pi_lock);
spin_unlock(&lock->wait_lock);
trace_unlock_irqrestore(&trace_lock, flags);
@@ -1054,6 +1197,7 @@
current->flags &= ~PF_NOSCHED;

ret = 0;
+ wait_again:
/* wait to be given the lock */
for (;;) {
if (signal_pending(current)) {
@@ -1084,6 +1228,16 @@
schedule();
set_task_state(task, TASK_INTERRUPTIBLE);
}
+
+ /*
+ * Check to see if we didn't have ownership stolen.
+ */
+ if (ret) {
+ if (capture_lock(&waiter,task)) {
+ set_task_state(task, TASK_INTERRUPTIBLE);
+ goto wait_again;
+ }
+ }

task->state = TASK_RUNNING;
current->flags |= nosched_flag;
@@ -1112,11 +1266,12 @@

init_lists(lock);

- if (!lock->owner) {
+ if (grab_lock(lock,task)) {
/* granted */
- TRACE_WARN_ON(!list_empty(&lock->wait_list));
+ struct task_struct *old_owner = lock->owner;
+ TRACE_WARN_ON(!list_empty(&lock->wait_list) && !old_owner);
spin_lock(&pi_lock);
- set_new_owner(lock, NULL, task, eip);
+ set_new_owner(lock, old_owner, task, eip);
spin_unlock(&pi_lock);
ret = 1;
}
@@ -1217,6 +1372,10 @@
if (prio != old_owner->prio)
pi_setprio(lock, old_owner, prio);
if (new_owner) {
+ if (lock != &kernel_sem.lock) {
+ new_owner->rt_flags |= RT_PENDOWNER;
+ new_owner->pending_owner = lock;
+ }
if (save_state)
wake_up_process_mutex(new_owner);
else
Index: include/linux/sched.h
===================================================================
--- include/linux/sched.h (revision 109)
+++ include/linux/sched.h (working copy)
@@ -831,6 +831,9 @@

/* RT deadlock detection and priority inheritance handling */
struct rt_mutex_waiter *blocked_on;
+ struct rt_mutex *pending_owner;
+ unsigned long rt_flags;
+

#ifdef CONFIG_RT_DEADLOCK_DETECT
void *last_kernel_lock;


2005-03-30 21:43:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


On Wed, 2005-03-30 at 16:39 -0500, Steven Rostedt wrote:
> On Wed, 2005-03-30 at 14:56 -0500, Steven Rostedt wrote:
>
> > Because of the stupid BKL, I'm going with a combination of your idea and
> > my idea for the solution of pending owners. I originally wanted the
> > stealer of the lock to put the task that was robbed back on the list.
> > But because of the BKL, you end up with a state where a task can be
> > blocked on two locks at the same time. This causes hell with priority
> > inheritance.
> >

[snip]

> It's a relatively simple patch, but it took a lot of pain since I was
> trying very hard to have the stealer do the work. But the BKL proved to
> be too much.

Oh, I forgot, this is patched against V0.7.41-11. Ingo, you're moving so
fast, I can't keep up. I'll download your latest later and see if this
patch still applies.

-- Steve


2005-03-31 11:03:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> Well, here it finally is. There's still things I don't like about it.
> But it seems to work, and that's the important part.
>
> I had to reluctantly add two items to the task_struct. I was hoping
> to avoid that. But because of race conditions it seemed to be the only
> way.

well it's not a big problem, and we avoided having to add flags to the
rt_lock structure, which is the important issue.

your patch looks good, i've added it to my tree and have uploaded the
-26-00 patch. It boots fine on my testbox, except for some new messages:

knodemgrd_0/902: BUG in __down_complete at kernel/rt.c:1568
[<c0103956>] dump_stack+0x23/0x25 (20)
[<c0130dcd>] down_trylock+0x1fb/0x200 (48)
[<c0364ee2>] nodemgr_host_thread+0xd0/0x17b (48)
[<c0100d4d>] kernel_thread_helper+0x5/0xb (136249364)
---------------------------
| preempt count: 00000001 ]
| 1-level deep critical section nesting:
----------------------------------------
.. [<c0133a75>] .... print_traces+0x1b/0x52
.....[<c0103956>] .. ( <= dump_stack+0x23/0x25)

this goes away if i revert your patch. It seems the reason is that
trylock hasnt been updated to use the pending-owner logic?

Ingo

2005-03-31 12:03:19

by Esben Nielsen

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 31 Mar 2005, Ingo Molnar wrote:

>
> * Steven Rostedt <[email protected]> wrote:
>
> > Well, here it finally is. There's still things I don't like about it.
> > But it seems to work, and that's the important part.
> >
> > I had to reluctantly add two items to the task_struct. I was hoping
> > to avoid that. But because of race conditions it seemed to be the only
> > way.
>
> well it's not a big problem, and we avoided having to add flags to the
> rt_lock structure, which is the important issue.
>
I was going to say the opposite. I know that there are many more rt_locks
around, and the fields thus will take more memory when put there, but
I believe it is more logical to have the fields there.

Esben

2005-03-31 12:14:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 13:03 +0100, Esben Nielsen wrote:
> On Thu, 31 Mar 2005, Ingo Molnar wrote:
>
> >
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > Well, here it finally is. There's still things I don't like about it.
> > > But it seems to work, and that's the important part.
> > >
> > > I had to reluctantly add two items to the task_struct. I was hoping
> > > to avoid that. But because of race conditions it seemed to be the only
> > > way.
> >
> > well it's not a big problem, and we avoided having to add flags to the
> > rt_lock structure, which is the important issue.
> >
> I was going to say the opposite. I know that there are many more rt_locks
> around, and the fields thus will take more memory when put there, but
> I believe it is more logical to have the fields there.

It seems logical to be there, but in practice, it's not.

The problem is that the flags represent a state of the task with respect
to a single lock. When the task loses ownership of a lock, the state of
the task changes. But the lock has a different state at that moment
(it has a new owner). Now when that owner releases the lock, it might give the
lock to another task, and that becomes the pending owner. Now the state
of the lock is the same as in the beginning. But the first task needs to
see this change.

You can still pull this off by testing the state of the lock and comparing
it to the current owner, but I too like the fact that you don't increase
the size of the kernel statically. There are a lot more locks in the
kernel than tasks on most systems. And those systems that have more
tasks than locks need a lot of memory anyway. So we only punish the
big systems (which expect to be punished) and keep the little guys safe.

-- Steve


2005-03-31 12:22:16

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 13:03 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > Well, here it finally is. There's still things I don't like about it.
> > But it seems to work, and that's the important part.
> >
> > I had to reluctantly add two items to the task_struct. I was hoping
> > to avoid that. But because of race conditions it seemed to be the only
> > way.
>
> well it's not a big problem, and we avoided having to add flags to the
> rt_lock structure, which is the important issue.
>
> your patch looks good, i've added it to my tree and have uploaded the
> -26-00 patch. It boots fine on my testbox, except for some new messages:
>
> knodemgrd_0/902: BUG in __down_complete at kernel/rt.c:1568
> [<c0103956>] dump_stack+0x23/0x25 (20)
> [<c0130dcd>] down_trylock+0x1fb/0x200 (48)
> [<c0364ee2>] nodemgr_host_thread+0xd0/0x17b (48)
> [<c0100d4d>] kernel_thread_helper+0x5/0xb (136249364)
> ---------------------------
> | preempt count: 00000001 ]
> | 1-level deep critical section nesting:
> ----------------------------------------
> .. [<c0133a75>] .... print_traces+0x1b/0x52
> .....[<c0103956>] .. ( <= dump_stack+0x23/0x25)
>
> this goes away if i revert your patch. It seems the reason is that
> trylock hasnt been updated to use the pending-owner logic?

Hmm, the pending owner logic in __down_trylock uses the grab_lock
function. It doesn't need the capture_lock since it never sleeps. I'm
downloading your 42-00-experimental now and installing it to see if I
can get the same message. Did you try the patch against 41-11? Maybe the
patch didn't go in so smoothly.

Anyway, I'll take a look at it now and let you know what I find.

Thanks,

-- Steve


2005-03-31 12:37:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 13:03 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:

>
> your patch looks good, i've added it to my tree and have uploaded the
> -26-00 patch. It boots fine on my testbox, except for some new messages:
>
> knodemgrd_0/902: BUG in __down_complete at kernel/rt.c:1568
> [<c0103956>] dump_stack+0x23/0x25 (20)
> [<c0130dcd>] down_trylock+0x1fb/0x200 (48)
> [<c0364ee2>] nodemgr_host_thread+0xd0/0x17b (48)
> [<c0100d4d>] kernel_thread_helper+0x5/0xb (136249364)
> ---------------------------
> | preempt count: 00000001 ]
> | 1-level deep critical section nesting:
> ----------------------------------------
> .. [<c0133a75>] .... print_traces+0x1b/0x52
> .....[<c0103956>] .. ( <= dump_stack+0x23/0x25)
>
> this goes away if i revert your patch. It seems the reason is that
> trylock hasnt been updated to use the pending-owner logic?

Damn, now that I have a comparable kernel to look at, I see where that
1568 is. I did see these, but they went away when I changed the logic to
handle the BKL, and I thought it was related.

Since this happened with the trylock, do you see any way that a pending
owner can cause problems? Maybe this has to do with is_locked: a
pending owner makes it ambiguous. The lock has an owner, and a
task can't get it if it is of lower priority than the pending owner, but
it can get it if it is higher. So is it locked? My implementation was
to be safe and say that it is locked.
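
In sketch form (hypothetical code, not what's in the tree), the
ambiguity looks like this:

/*
 * With a pending owner, "is it locked?" depends on who is asking.
 * Illustrative only: TASK_PENDING/pending_owner are from the patch
 * above, the helper itself is made up.
 */
static int rt_mutex_is_locked_for(struct rt_mutex *lock,
				  struct task_struct *asker)
{
	struct task_struct *owner = lock->owner;

	if (!owner)
		return 0;		/* clearly free */
	if (!TASK_PENDING(owner) || owner->pending_owner != lock)
		return 1;		/* really held */
	/*
	 * Pending owner: a higher-prio asker could still steal it,
	 * so it is "locked" only for lower-prio tasks (a higher
	 * ->prio value means lower priority here).
	 */
	return asker->prio > owner->prio;
}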

I'll play around some more with this.

-- Steve


2005-03-31 12:50:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 13:03 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:

> knodemgrd_0/902: BUG in __down_complete at kernel/rt.c:1568
> [<c0103956>] dump_stack+0x23/0x25 (20)
> [<c0130dcd>] down_trylock+0x1fb/0x200 (48)
> [<c0364ee2>] nodemgr_host_thread+0xd0/0x17b (48)
> [<c0100d4d>] kernel_thread_helper+0x5/0xb (136249364)
> ---------------------------
> | preempt count: 00000001 ]
> | 1-level deep critical section nesting:
> ----------------------------------------
> .. [<c0133a75>] .... print_traces+0x1b/0x52
> .....[<c0103956>] .. ( <= dump_stack+0x23/0x25)

One more thing. Was this on SMP or UP? I haven't tested this on SMP
yet. When my laptop (HT) gets done with its work, I'll give it a try
there. Of course I need to disable NVidia on it first.

-- Steve


2005-03-31 12:59:54

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 07:36 -0500, Steven Rostedt wrote:
> On Thu, 2005-03-31 at 13:03 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
>

> Since this happened with the trylock, do you see any way that a pending
> owner can cause problems? Maybe this has to do with is_locked: a
> pending owner makes it ambiguous. The lock has an owner, and a
> task can't get it if it is of lower priority than the pending owner, but
> it can get it if it is higher. So is it locked? My implementation was
> to be safe and say that it is locked.
>
> I'll play around some more with this.

Oops! Found a little bug. Ingo, see if this fixes it.

-- Steve

--- ./kernel/rt.c.orig 2005-03-31 07:27:59.000000000 -0500
+++ ./kernel/rt.c 2005-03-31 07:53:14.913072893 -0500
@@ -1244,7 +1244,7 @@
/*
* Check to see if we didn't have ownership stolen.
*/
- if (ret) {
+ if (!ret) {
if (capture_lock(&waiter,task)) {
set_task_state(task, TASK_INTERRUPTIBLE);
goto wait_again;


2005-03-31 13:29:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > I'll play around some more with this.
>
> Oops! Found a little bug. Ingo, see if this fixes it.

yeah, that was it. I've uploaded -42-02 with the fix included.

Ingo

2005-03-31 13:33:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > I was going to say the opposite. I know that there are many more rt_locks
> > around, and the fields thus will take more memory when put there, but
> > I believe it is more logical to have the fields there.
>
> It seems logical to be there, but in practice, it's not.
>
> The problem is that the flags represent a state of the task with
> respect to a single lock. When the task loses ownership of a lock,
> the state of the task changes. But the lock has a different state
> at that moment (it has a new owner). Now when that owner releases the
> lock, it might give the lock to another task, and that becomes the pending
> owner. Now the state of the lock is the same as in the beginning. But
> the first task needs to see this change.
>
> You can still pull this off by testing the state of the lock and
> comparing it to the current owner, but I too like the fact that you
> don't increase the size of the kernel statically. There are a lot
> more locks in the kernel than tasks on most systems. And those systems
> that have more tasks than locks need a lot of memory anyway. So
> we only punish the big systems (which expect to be punished) and keep
> the little guys safe.

no system is punished. Since task_struct embeds 2 locks already, moving
the field(s) into task_struct is already a win.

Ingo

2005-03-31 14:11:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> One more thing. Was this on SMP or UP? I haven't tested this on SMP
> yet. When my laptop (HT) gets done with its work, I'll give it a try
> there. Of course I need to disable NVidia on it first.

i've booted the latest tree on a 4-way testbox and everything seems ok.

Ingo

2005-03-31 17:42:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 16:10 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > One more thing. Was this on SMP or UP? I haven't tested this on SMP
> > yet. When my laptop (HT) gets done with its work, I'll give it a try
> > there. Of course I need to disable NVidia on it first.
>
> i've booted the latest tree on a 4-way testbox and everything seems ok.

Thanks Ingo.

Oh, and did you really want to assign debug = .1?

-- Steve

Here you go:


--- ./include/linux/rt_lock.h.orig 2005-03-31 12:38:31.583913080 -0500
+++ ./include/linux/rt_lock.h 2005-03-31 12:38:35.499061576 -0500
@@ -125,7 +125,7 @@
# ifdef CONFIG_RT_DEADLOCK_DETECT
# define __RW_LOCK_UNLOCKED \
.wait_lock = __RAW_SPIN_LOCK_UNLOCKED, .save_state = 1, \
- .debug = .1, .file = __FILE__, .line = __LINE__
+ .debug = 1, .file = __FILE__, .line = __LINE__
# define _RW_LOCK_UNLOCKED(lock) \
(rwlock_t) { { { __RW_LOCK_UNLOCKED, .name = #lock } } }
# define RW_LOCK_UNLOCKED \


2005-03-31 17:53:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> Oh, and did you really want to assign debug = .1?

> - .debug = .1, .file = __FILE__, .line = __LINE__
> + .debug = 1, .file = __FILE__, .line = __LINE__

doh - indeed. This means rwlocks were nondebug all along? Ouch. I've
uploaded -42-08 with the fix.

Ingo

2005-03-31 18:18:09

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thursday 31 March 2005 12:49, Ingo Molnar wrote:
>* Steven Rostedt <[email protected]> wrote:
>> Oh, and did you really want to assign debug = .1?
>>
>> - .debug = .1, .file = __FILE__, .line = __LINE__
>> + .debug = 1, .file = __FILE__, .line = __LINE__
>
>doh - indeed. This means rwlocks were nondebug all along? Ouch. I've
>uploaded -42-08 with the fix.
>
> Ingo

It wasn't there when I looked yet, so I just built 42-05, Ingo.
Almost everything works except I can't get any video out of tvtime,
and the audio quality is still intermittently tinny, sometimes cured
by an internal tvtime restart with earlier versions but not this
version. Tinny sound seems to be forever now.

Also, lots of logging from tvtime, as if it was getting the video but
missing a frame occasionally, and it spit this out when I quit it.

Mar 31 13:05:00 coyote kernel: rtc latency histogram of {tvtime/5251, 1303 samples}:
Mar 31 13:05:00 coyote kernel: 4 1
Mar 31 13:05:00 coyote kernel: 5 190
Mar 31 13:05:00 coyote kernel: 6 455
Mar 31 13:05:00 coyote kernel: 7 264
Mar 31 13:05:00 coyote kernel: 8 84
Mar 31 13:05:00 coyote kernel: 9 42
Mar 31 13:05:00 coyote kernel: 10 22
Mar 31 13:05:00 coyote kernel: 11 34
Mar 31 13:05:00 coyote kernel: 12 15
Mar 31 13:05:00 coyote kernel: 13 21
Mar 31 13:05:00 coyote kernel: 14 19
Mar 31 13:05:00 coyote kernel: 15 18
Mar 31 13:05:00 coyote kernel: 16 10
Mar 31 13:05:00 coyote kernel: 17 6
Mar 31 13:05:00 coyote kernel: 18 8
Mar 31 13:05:00 coyote kernel: 19 6
Mar 31 13:05:00 coyote kernel: 20 8
Mar 31 13:05:00 coyote kernel: 21 1
Mar 31 13:05:00 coyote kernel: 22 2
Mar 31 13:05:00 coyote kernel: 23 2
Mar 31 13:05:00 coyote kernel: 25 1
Mar 31 13:05:00 coyote kernel: 26 2
Mar 31 13:05:00 coyote kernel: 29 1
Mar 31 13:05:00 coyote kernel: 30 2
Mar 31 13:05:00 coyote kernel: 47 1
Mar 31 13:05:00 coyote kernel: 9999 88


I have -08 now, so I'll go build that one next. Other than these,
this one 'feels' good.


--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.

2005-03-31 20:22:50

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Thu, 2005-03-31 at 19:49 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > Oh, and did you really want to assign debug = .1?
>
> > - .debug = .1, .file = __FILE__, .line = __LINE__
> > + .debug = 1, .file = __FILE__, .line = __LINE__
>
> doh - indeed. This means rwlocks were nondebug all along? Ouch. I've
> uploaded -42-08 with the fix.

I noticed it because starting with V0.7.41-25 (although I only actually
noticed it with 42-07) not only were they nondebug, but they also didn't
have any priority inheritance.

-- Steve


2005-03-31 21:02:44

by Gene Heskett

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07 (update)

On Thursday 31 March 2005 13:17, Gene Heskett wrote:
>On Thursday 31 March 2005 12:49, Ingo Molnar wrote:
>>* Steven Rostedt <[email protected]> wrote:
>>> Oh, and did you really want to assign debug = .1?
>>>
>>> - .debug = .1, .file = __FILE__, .line = __LINE__
>>> + .debug = 1, .file = __FILE__, .line = __LINE__
>>
>>doh - indeed. This means rwlocks were nondebug all along? Ouch.
>> I've uploaded -42-08 with the fix.
>>
>> Ingo
>
>It wasn't there when I looked yet, so I just built 42-05, Ingo.
>Almost everything works except I can't get any video out of tvtime,
>and the audio quality is still intermittently tinny, sometimes
> cured by an internal tvtime restart with earlier versions but not
> this version. Tinny sound seems to be forever now.
>
>Also, lots of logging from tvtime, as if it was getting the video
> but missing a frame occasionally, and it spit this out when I quit it.
>
>Mar 31 13:05:00 coyote kernel: rtc latency histogram of {tvtime/5251, 1303 samples}:
>Mar 31 13:05:00 coyote kernel: 4 1
>Mar 31 13:05:00 coyote kernel: 5 190
>Mar 31 13:05:00 coyote kernel: 6 455
>Mar 31 13:05:00 coyote kernel: 7 264
>Mar 31 13:05:00 coyote kernel: 8 84
>Mar 31 13:05:00 coyote kernel: 9 42
>Mar 31 13:05:00 coyote kernel: 10 22
>Mar 31 13:05:00 coyote kernel: 11 34
>Mar 31 13:05:00 coyote kernel: 12 15
>Mar 31 13:05:00 coyote kernel: 13 21
>Mar 31 13:05:00 coyote kernel: 14 19
>Mar 31 13:05:00 coyote kernel: 15 18
>Mar 31 13:05:00 coyote kernel: 16 10
>Mar 31 13:05:00 coyote kernel: 17 6
>Mar 31 13:05:00 coyote kernel: 18 8
>Mar 31 13:05:00 coyote kernel: 19 6
>Mar 31 13:05:00 coyote kernel: 20 8
>Mar 31 13:05:00 coyote kernel: 21 1
>Mar 31 13:05:00 coyote kernel: 22 2
>Mar 31 13:05:00 coyote kernel: 23 2
>Mar 31 13:05:00 coyote kernel: 25 1
>Mar 31 13:05:00 coyote kernel: 26 2
>Mar 31 13:05:00 coyote kernel: 29 1
>Mar 31 13:05:00 coyote kernel: 30 2
>Mar 31 13:05:00 coyote kernel: 47 1
>Mar 31 13:05:00 coyote kernel: 9999 88
>
>
>I have -08 now, so I'll go build that one next. Other than these,
>this one 'feels' good.
>
Now I'm on 42-08, and it still feels good, but somebody's done something
bad for tvtime when it's accessing a pcHDTV-3000 card for the src. No
video, blue screen.

Thinking that loading it in rc.local was out of order (that works for
2.6.12-rc1), I've rmmod-ed the whole thing and reloaded it after
starting X; no difference.

From the logs when I executed a modprobe cx88-dvb:
-----------------------------
Mar 31 15:18:53 coyote kernel: Linux video capture interface: v1.00
Mar 31 15:18:53 coyote kernel: cx2388x v4l2 driver version 0.0.4 loaded
Mar 31 15:18:53 coyote kernel: ACPI: PCI interrupt 0000:01:07.0[A] -> GSI 11 (level, low) -> IRQ 11
Mar 31 15:18:53 coyote kernel: cx88[0]: subsystem: 7063:3000, board: pcHDTV HD3000 HDTV [card=22,autodetected]
Mar 31 15:18:53 coyote kernel: i2c-algo-bit.o: cx88[0] passed test.
Mar 31 15:18:53 coyote kernel: tuner 3-0061: chip found @ 0xc2 (cx88[0])
Mar 31 15:18:53 coyote kernel: tuner 3-0061: type set to 52 (Thomson DDT 7610 (ATSC/NTSC))
Mar 31 15:18:53 coyote kernel: cx88[0]/0: found at 0000:01:07.0, rev: 5, irq: 11, latency: 32, mmio: 0xea000000
Mar 31 15:18:53 coyote kernel: cx88[0]/0: registered device video0 [v4l2]
Mar 31 15:18:53 coyote kernel: cx88[0]/0: registered device vbi0
Mar 31 15:18:53 coyote kernel: cx88[0]/0: registered device radio0
Mar 31 15:19:03 coyote kernel: cx2388x dvb driver version 0.0.4 loaded
Mar 31 15:19:03 coyote kernel: ACPI: PCI interrupt 0000:01:07.2[A] -> GSI 11 (level, low) -> IRQ 11
Mar 31 15:19:03 coyote kernel: cx88[0]/2: found at 0000:01:07.2, rev: 5, irq: 11, latency: 32, mmio: 0xeb000000
Mar 31 15:19:03 coyote kernel: cx88[0]/2: cx2388x based dvb card
Mar 31 15:19:03 coyote kernel: DVB: registering new adapter (cx88[0]).
Mar 31 15:19:03 coyote kernel: DVB: registering frontend 0 (pcHDTV HD3000 HDTV)...
----------------------------

An lsmod shows this for that card:

Module          Size  Used by
cx88_dvb        5764  0
cx8802          8068  1 cx88_dvb
mt352           6532  1 cx88_dvb
or51132         9348  1 cx88_dvb
dvb_pll         3972  2 cx88_dvb,or51132
video_buf_dvb   4740  1 cx88_dvb
dvb_core       76072  1 video_buf_dvb
cx8800         27276  0
cx88xx         48416  3 cx88_dvb,cx8802,cx8800
i2c_algo_bit    8456  1 cx88xx
video_buf      17796  5 cx88_dvb,cx8802,video_buf_dvb,cx8800,cx88xx
ir_common       6404  1 cx88xx
tveeprom       11800  1 cx88xx
v4l1_compat    12676  1 cx8800
v4l2_common     4992  1 cx8800
btcx_risc       4104  3 cx8802,cx8800,cx88xx
tuner          25768  0
videodev        7680  2 cx8800,cx88xx
-------------------
I've snipped the webcam and firewire stuff as it's not applicable here.

From the logs when I fired up tvtime:
-----------------------------
Mar 31 15:22:12 coyote kernel: cx88[0]: video y / packed - dma channel status dump
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: initial risc: 0x1ecce000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: cdt base : 0x00180440
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: cdt size : 0x0000000c
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: iq base : 0x00180400
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: iq size : 0x00000010
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: risc pc : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: iq wr ptr : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: iq rd ptr : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: cdt current : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: pci target : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: cmds: line / byte : 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]: risc0: 0x00000000 [ INVALID count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: risc1: 0x00000000 [ INVALID count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: risc2: 0x00000000 [ INVALID count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: risc3: 0x00000000 [ INVALID count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 0: 0x80008000 [ sync resync count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 1: 0x1c000500 [ write sol eol count=1280 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 2: 0x1b9a3000 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 3: 0x1c000500 [ write sol eol count=1280 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 4: 0x1b9a3a00 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 5: 0x1c000500 [ write sol eol count=1280 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 6: 0x194d4400 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 7: 0x18000200 [ write sol count=512 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 8: 0x194d4e00 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq 9: 0x14000300 [ write eol count=768 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq a: 0x194d5000 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq b: 0x1c000500 [ write sol eol count=1280 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq c: 0x194d5800 [ arg #1 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq d: 0x0031c040 [ INVALID 21 20 cnt0 resync 14 count=64 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq e: 0x00000000 [ INVALID count=0 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: iq f: 0x00000011 [ INVALID count=17 ]
Mar 31 15:22:12 coyote kernel: cx88[0]: fifo: 0x00180c00 -> 0x183400
Mar 31 15:22:12 coyote kernel: cx88[0]: ctrl: 0x00180400 -> 0x180460
Mar 31 15:22:12 coyote kernel: cx88[0]: ptr1_reg: 0x00181d20
Mar 31 15:22:12 coyote kernel: cx88[0]: ptr2_reg: 0x00180478
Mar 31 15:22:12 coyote kernel: cx88[0]: cnt1_reg: 0x0000005a
Mar 31 15:22:12 coyote kernel: cx88[0]: cnt2_reg: 0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]/0: [f3aa07e0/0] timeout - dma=0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]/0: [f3aa01e0/1] timeout - dma=0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]/0: [edbc8a60/2] timeout - dma=0x00000000
Mar 31 15:22:12 coyote kernel: cx88[0]/0: [f7391540/3] timeout - dma=0x1ecce000
------------------------------
and this appears to repeat for every frame it should display. Very
voluminous logging.

It works great when running 2.6.12-rc1, FWIW. Which I'll be doing
again shortly, but first I'll check out kino & my movie camera, and
spca50x & my webcam.

kino works, including sound; this is ieee1394/firewire stuff.

And, after a make clean; make; make install in the spca50x src dir, that
works. My scanner works, and while I haven't tried to print, the web
interface says it's all ready and waiting to go. In other words, the
USB appears to be fine.

Networking seems to be ok, mail is coming in more or less normally.

So does anyone have any suggestions as to what to check out in the
tvtime/pcHDTV area?

Hmm, in the FWIW category: under the video stuff in a make xconfig,
under DVB broadcasting-->DVB for linux-->Core Support, way down at the
bottom of the list is "Customize DVB Frontends", and again clear at
the bottom of that list is "nxt2002 based", which when you check it
should (I think) show a checkbox in front of the OR51132 Based (pcHDTV)
entry. But it does not. However, the command to 'modprobe cx88-dvb' does
load it, as can be seen in the lsmod output above.
--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
99.34% setiathome rank, not too shabby for a WV hillbilly
Yahoo.com and AOL/TW attorneys please note, additions to the above
message by Gene Heskett are:
Copyright 2005 by Maurice Eugene Heskett, all rights reserved.

2005-04-01 00:59:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

Hi Ingo,

I was wondering if the bit_spin_lock issue has been put on the back
burner? I understand that it is a major problem to change it if you
want to get into the mainline kernel. But I still believe it to be a
problem. kjournald spinning on a bit lock until it finishes its
quota can be bad for latencies.

Although it only deadlocks on your system if kjournald is a real-time
task, it still has the ability to become one. Since it can hold two
other locks at the time of the spin, if an rt task tries to grab one of
them, priority inheritance makes kjournald rt; and if it then happens to
be the highest priority task on the system, you just created a deadlock.
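
Roughly what bit_spin_lock() does (a simplified paraphrase, not the
exact kernel source): it spins on an anonymous bit with no owner
recorded anywhere, so priority inheritance has nobody to boost:

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	preempt_disable();
	while (test_and_set_bit(bitnum, addr)) {
		/* no owner field: PI cannot boost whoever holds the bit */
		while (test_bit(bitnum, addr))
			cpu_relax();
	}
}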

It's not so much of an issue for me, since I'm working with a parallel
kernel and have implemented my earlier fixes to it. I just wanted to
know if there's any plan to deal with them on your end. I do see this
causing a random lockup once in a while, and wanted the user to beware.
Of course, if you just avoid ext3, you won't have a problem.

-- Steve


2005-04-01 04:44:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> Hi Ingo,
>
> I was wondering if the bit_spin_lock issue has been put on the back
> burner? [...]

could you send me your latest patch for the bit-spin issue? My main
issue was cleanliness, so that the patch doesnt get stuck in the -RT
tree forever.

Ingo

2005-04-01 05:13:53

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Fri, 2005-04-01 at 06:43 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > Hi Ingo,
> >
> > I was wondering if the bit_spin_lock issue has been put on the back
> > burner? [...]
>
> could you send me your latest patch for the bit-spin issue? My main
> issue was cleanliness, so that the patch doesnt get stuck in the -RT
> tree forever.

I think that's the main problem. Without changing the design of the ext3
system, I don't think there is a clean patch. The implementation that I
finally settled down with was to make the j_state and j_journal_head
locks two global locks. I had to make a few modifications to some spots
to avoid deadlocks, but this worked out well. The problem I was worried
about was this causing too much overhead. So the ext3 buffers would have
to contend with each other. I don't have any tests to see if this had
too much of an impact.

If you are still interested, then let me know and I'll pull it out and
send it to you. I preferred this method over the other wait_on_bit,
since using normal spin_locks gives priority inheritance, but to put
this into the buffer head seemed too much of an overhead.

There was also that inverted_lock crap in commit.c that caused
problems. I just used the expensive wait_queue fix: instead of
just calling schedule, I put the task on a wait queue to be woken when
the lock was released, and had the unlock of j_state_lock wake it up. But
this is expensive, since every time j_state_lock is unlocked you need to
try to wake up a task that is most likely not there. This I still need
to optimize.
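
The shape of that fix, roughly (a hypothetical sketch, not the actual
change; the names here are made up):

static DECLARE_WAIT_QUEUE_HEAD(j_state_wait);
static spinlock_t j_state_lock = SPIN_LOCK_UNLOCKED;

/* instead of calling schedule() blindly when we must back off: */
static void wait_on_j_state_lock(void)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&j_state_wait, &wait, TASK_UNINTERRUPTIBLE);
	/* recheck after queueing so a wakeup can't be lost */
	if (spin_is_locked(&j_state_lock))
		schedule();
	finish_wait(&j_state_wait, &wait);
}

static void unlock_j_state_lock(void)
{
	spin_unlock(&j_state_lock);
	/* the expensive part: usually nobody is waiting here */
	wake_up(&j_state_wait);
}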

-- Steve


2005-04-01 05:23:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > could you send me your latest patch for the bit-spin issue? My main
> > issue was cleanliness, so that the patch doesnt get stuck in the -RT
> > tree forever.
>
> I think that's the main problem. Without changing the design of the
> ext3 system, I don't think there is a clean patch. The implementation
> that I finally settled down with was to make the j_state and
> j_journal_head locks two global locks. I had to make a few
> modifications to some spots to avoid deadlocks, but this worked out
> well. The problem I was worried about was this causing too much
> overhead. So the ext3 buffers would have to contend with each other. I
> don't have any tests to see if this had too much of an impact.

yeah - i think Andrew said that a global lock at that particular place
might not be that much of an issue.

Ingo

2005-04-01 12:27:33

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Fri, 2005-04-01 at 07:19 +0200, Ingo Molnar wrote:
> * Steven Rostedt <[email protected]> wrote:
>
> > > could you send me your latest patch for the bit-spin issue? My main
> > > issue was cleanliness, so that the patch doesnt get stuck in the -RT
> > > tree forever.
> >
> > I think that's the main problem. Without changing the design of the
> > ext3 system, I don't think there is a clean patch. The implementation
> > that I finally settled down with was to make the j_state and
> > j_journal_head locks two global locks. I had to make a few
> > modifications to some spots to avoid deadlocks, but this worked out
> > well. The problem I was worried about was this causing too much
> > overhead. So the ext3 buffers would have to contend with each other. I
> > don't have any tests to see if this had too much of an impact.
>
> yeah - i think Andrew said that a global lock at that particular place
> might not be that much of an issue.
>

OK, I'll start stripping it out of my kernel today and make a clean
patch for you.

-- Steve

2005-04-07 21:22:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Fri, 2005-04-01 at 07:27 -0500, Steven Rostedt wrote:
> On Fri, 2005-04-01 at 07:19 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <[email protected]> wrote:
> >
> > > > could you send me your latest patch for the bit-spin issue? My main
> > > > issue was cleanliness, so that the patch doesnt get stuck in the -RT
> > > > tree forever.
> > >
> > > I think that's the main problem. Without changing the design of the
> > > ext3 system, I don't think there is a clean patch. The implementation
> > > that I finally settled down with was to make the j_state and
> > > j_journal_head locks two global locks. I had to make a few
> > > modifications to some spots to avoid deadlocks, but this worked out
> > > well. The problem I was worried about was this causing too much
> > > overhead. So the ext3 buffers would have to contend with each other. I
> > > don't have any tests to see if this had too much of an impact.
> >
> > yeah - i think Andrew said that a global lock at that particular place
> > might not be that much of an issue.
> >
>
> OK, I'll start stripping it out of my kernel today and make a clean
> patch for you.
>

Ingo, I haven't forgotten about this, I've just been heavily bug whacking
lately and haven't had the time to do this.

I've pulled out both the lock_bh_state and lock_bh_journal_head and made
them two global locks. I haven't noticed any slowing down here, but
then again I haven't run any real benchmarks. There's a BH flag set to
know when the lock is pending on a specific buffer head.

I don't know how acceptable this patch is. Take a look and if you have
any better ideas then let me know. I prefer this patch over the
wait_on_bit patch I sent you earlier since this patch still accounts for
priority inheritance, as the wait_on_bits don't.

-- Steve

Index: include/linux/jbd.h
===================================================================
--- include/linux/jbd.h (revision 113)
+++ include/linux/jbd.h (working copy)
@@ -313,47 +313,100 @@
BUFFER_FNS(RevokeValid, revokevalid)
TAS_BUFFER_FNS(RevokeValid, revokevalid)
BUFFER_FNS(Freed, freed)
+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT)
+BUFFER_FNS(State,state)

-static inline struct buffer_head *jh2bh(struct journal_head *jh)
+extern spinlock_t journal_bh_state_lock;
+extern spinlock_t journal_bh_journal_lock;
+
+static inline void jbd_lock_bh_state(struct buffer_head *bh)
{
- return jh->b_bh;
+ spin_lock(&journal_bh_state_lock);
+ BUG_ON(buffer_state(bh));
+ set_buffer_state(bh);
+ __acquire(journal_bh_state_lock);
}

-static inline struct journal_head *bh2jh(struct buffer_head *bh)
+static inline int jbd_trylock_bh_state(struct buffer_head *bh)
{
- return bh->b_private;
+ if (!spin_trylock(&journal_bh_state_lock))
+ return 0;
+
+ BUG_ON(buffer_state(bh));
+ set_buffer_state(bh);
+ __acquire(journal_bh_state_lock);
+ return 1;
}

+static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
+{
+ return buffer_state(bh);
+}
+
+static inline void jbd_unlock_bh_state(struct buffer_head *bh)
+{
+ BUG_ON(!buffer_state(bh));
+ clear_buffer_state(bh);
+ spin_unlock(&journal_bh_state_lock);
+ __release(journal_bh_state_lock);
+}
+
+static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
+{
+ spin_lock(&journal_bh_journal_lock);
+ __acquire(journal_bh_journal_lock);
+}
+
+static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
+{
+ spin_unlock(&journal_bh_journal_lock);
+ __release(journal_bh_journal_lock);
+}
+
+#else /* ! (CONFIG_SMP || CONFIG_DEBUG_SPINLOCK || CONFIG_PREEMPT) */
+
static inline void jbd_lock_bh_state(struct buffer_head *bh)
{
- bit_spin_lock(BH_State, &bh->b_state);
+ __acquire(journal_bh_state_lock);
}

static inline int jbd_trylock_bh_state(struct buffer_head *bh)
{
- return bit_spin_trylock(BH_State, &bh->b_state);
+ __acquire(journal_bh_state_lock);
+ return 1;
}

static inline int jbd_is_locked_bh_state(struct buffer_head *bh)
{
- return bit_spin_is_locked(BH_State, &bh->b_state);
+ return 1;
}

static inline void jbd_unlock_bh_state(struct buffer_head *bh)
{
- bit_spin_unlock(BH_State, &bh->b_state);
+ __release(journal_bh_state_lock);
}

static inline void jbd_lock_bh_journal_head(struct buffer_head *bh)
{
- bit_spin_lock(BH_JournalHead, &bh->b_state);
+ __acquire(journal_bh_journal_lock);
}

static inline void jbd_unlock_bh_journal_head(struct buffer_head *bh)
{
- bit_spin_unlock(BH_JournalHead, &bh->b_state);
+ __release(journal_bh_journal_lock);
}
+#endif /* (CONFIG_SMP || CONFIG_DEBUG_SPINLOCK || CONFIG_PREEMPT) */

+static inline struct buffer_head *jh2bh(struct journal_head *jh)
+{
+ return jh->b_bh;
+}
+
+static inline struct journal_head *bh2jh(struct buffer_head *bh)
+{
+ return bh->b_private;
+}
+
struct jbd_revoke_table_s;

/**
Index: fs/jbd/journal.c
===================================================================
--- fs/jbd/journal.c (revision 113)
+++ fs/jbd/journal.c (working copy)
@@ -82,6 +82,14 @@

static int journal_convert_superblock_v1(journal_t *, journal_superblock_t *);

+#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK) || defined(CONFIG_PREEMPT)
+spinlock_t journal_bh_state_lock = SPIN_LOCK_UNLOCKED;
+spinlock_t journal_bh_journal_lock = SPIN_LOCK_UNLOCKED;
+
+EXPORT_SYMBOL(journal_bh_state_lock);
+EXPORT_SYMBOL(journal_bh_journal_lock);
+#endif
+
/*
* Helper function used to manage commit timeouts
*/


2005-04-10 10:32:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07


* Steven Rostedt <[email protected]> wrote:

> > > yeah - i think Andrew said that a global lock at that particular place
> > > might not be that much of an issue.
> >
> > OK, I'll start stripping it out of my kernel today and make a clean
> > patch for you.
>
> Ingo, I haven't forgotten about this, I've just been heavily bug whacking
> lately and haven't had the time to do this.
>
> I've pulled out both the lock_bh_state and lock_bh_journal_head and
> made them two global locks. I haven't noticed any slowing down here,
> but then again I haven't run any real benchmarks. There's a BH flag
> set to know when the lock is pending on a specific buffer head.
>
> I don't know how acceptable this patch is. Take a look and if you have
> any better ideas then let me know. I prefer this patch over the
> wait_on_bit patch I sent you earlier since this patch still accounts
> for priority inheritance, as the wait_on_bits don't.

looks much cleaner than earlier ones. Would it be possible to make the
locks per journal? I've applied it to the -44-05 kernel so that it gets
some testing.

Ingo

2005-04-10 15:06:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [patch] Real-Time Preemption, -RT-2.6.12-rc1-V0.7.41-07

On Sun, 2005-04-10 at 12:31 +0200, Ingo Molnar wrote:

> looks much cleaner than earlier ones. Would it be possible to make the
> locks per journal? [...]

I've already looked into doing this, but it would be much more intrusive
to implement. The problem is that these locks are taken with only
the buffer head as a reference. The locks are used to attach or detach
the buffer head from a journal, or just to see if it is already attached.
So keeping the lock with the journal is difficult, since you sometimes
need to take the lock before you know which journal is involved. I'm sure
this is possible, but it would mean modifying the code where the locks are
taken instead of just replacing the contents of the lock functions.
Maybe with the help of Stephen Tweedie this can be done. But what I
gave you is the cleanest and most reliable solution for now, without
changing anything but the functions that take the locks.
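
To illustrate the chicken-and-egg (a hypothetical helper; b_transaction
and t_journal are the jbd.h fields, if memory serves): finding the
journal means following a chain that the bh_state lock itself is
supposed to protect:

static journal_t *journal_of(struct buffer_head *bh)
{
	struct journal_head *jh = bh2jh(bh);	/* needs the bh_state lock */

	if (!jh || !jh->b_transaction)		/* bh may be detached */
		return NULL;
	return jh->b_transaction->t_journal;
}

So a per-journal lock would have to be found via a chain you can only
walk safely while already holding it.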

-- Steve