2006-09-29 16:12:13

by Badari Pulavarty

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers

On Fri, 2006-09-29 at 11:02 +0200, Jan Kara wrote:
...
> > >+		}
> > >+		/* Someone already cleaned up the buffer? */
> > >+		if (!buffer_jbd(bh)
> > >+		    || jh->b_transaction != commit_transaction
> > >+		    || jh->b_jlist != BJ_SyncData) {
> > >+			jbd_unlock_bh_state(bh);
> > >+			if (locked)
> > >+				unlock_buffer(bh);
> > >+			BUFFER_TRACE(bh, "already cleaned up");
> > >+			put_bh(bh);
> > >+			continue;
> ---> Here the buffer was refiled by someone else

I am a little concerned about this particular code. We know that
someone else will do the unfile/remove - but we will keep
spinning on it until that happens, won't we? Why don't we
skip it and move on to the next one?

We are seeing a few of the following messages while running tests,
and are wondering if your patch is causing them:

BUG: spinlock lockup on CPU#1, scp/30189, c00000000fb503d8 (Not tainted)
Call Trace:
[C000000018FDB320] [C0000000000102E0] .show_stack+0x68/0x1b0 (unreliable)
[C000000018FDB3C0] [C0000000001734F4] ._raw_spin_lock+0x138/0x184
[C000000018FDB460] [C00000000025AD24] ._spin_lock+0x10/0x24
[C000000018FDB4E0] [D000000000172E14] .journal_dirty_data+0xa4/0x2c0 [jbd]
[C000000018FDB580] [D000000000205BAC] .ext3_journal_dirty_data+0x28/0x70 [ext3]
[C000000018FDB610] [D0000000002048BC] .walk_page_buffers+0xb0/0x134 [ext3]
[C000000018FDB6D0] [D000000000208280] .ext3_ordered_commit_write+0x74/0x114 [ext3]
[C000000018FDB780] [C00000000007E710] .generic_file_buffered_write+0x51c/0x794
[C000000018FDB930] [C00000000007ECB8] .__generic_file_aio_write_nolock+0x330/0x3bc
[C000000018FDBA20] [C00000000007EDBC] .generic_file_aio_write+0x78/0x104
[C000000018FDBAE0] [D000000000202EFC] .ext3_file_write+0x2c/0xd4 [ext3]
[C000000018FDBB70] [C0000000000A8090] .do_sync_write+0xd4/0x130
[C000000018FDBCF0] [C0000000000A8E14] .vfs_write+0x118/0x200
[C000000018FDBD90] [C0000000000A9584] .sys_write+0x4c/0x8c
[C000000018FDBE30] [C000000000008434] syscall_exit+0x0/0x40

Thanks,
Badari



2006-09-29 19:22:20

by Andrew Morton

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers

On Fri, 29 Sep 2006 09:11:46 -0700
Badari Pulavarty <[email protected]> wrote:

> On Fri, 2006-09-29 at 11:02 +0200, Jan Kara wrote:
> ...
> > > >+		}
> > > >+		/* Someone already cleaned up the buffer? */
> > > >+		if (!buffer_jbd(bh)
> > > >+		    || jh->b_transaction != commit_transaction
> > > >+		    || jh->b_jlist != BJ_SyncData) {
> > > >+			jbd_unlock_bh_state(bh);
> > > >+			if (locked)
> > > >+				unlock_buffer(bh);
> > > >+			BUFFER_TRACE(bh, "already cleaned up");
> > > >+			put_bh(bh);
> > > >+			continue;
> > ---> Here the buffer was refiled by someone else
>
> I am a little concerned about this particular code. We know that
> someone else will do the unfile/remove - but we will keep
> spinning on it until that happens, won't we? Why don't we
> skip it and move on to the next one?

That check looks OK to me. Either

a) the buffer has been taken off the journal altogether or

b) it's been moved to the running transaction or

c) it's journalled and on the committing transaction, but it's not on
t_sync_datalist any more.

So it shouldn't be possible for that buffer to be on
commit_transaction->t_sync_datalist.
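
Mapped back onto the check in the patch, that's one leg of the test per
case (the annotations are mine, not in the patch):

	if (!buffer_jbd(bh)				/* case a) */
	    || jh->b_transaction != commit_transaction	/* case b) */
	    || jh->b_jlist != BJ_SyncData) {		/* case c) */
		...
		put_bh(bh);	/* drop our ref and carry on */
		continue;
	}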

> We are seeing a few of the following messages while running tests,
> and are wondering if your patch is causing them:
>
> BUG: spinlock lockup on CPU#1, scp/30189, c00000000fb503d8 (Not tainted)
> Call Trace:
> [C000000018FDB320] [C0000000000102E0] .show_stack+0x68/0x1b0 (unreliable)
> [C000000018FDB3C0] [C0000000001734F4] ._raw_spin_lock+0x138/0x184
> [C000000018FDB460] [C00000000025AD24] ._spin_lock+0x10/0x24
> [C000000018FDB4E0] [D000000000172E14] .journal_dirty_data+0xa4/0x2c0 [jbd]
> [C000000018FDB580] [D000000000205BAC] .ext3_journal_dirty_data+0x28/0x70 [ext3]
> [C000000018FDB610] [D0000000002048BC] .walk_page_buffers+0xb0/0x134 [ext3]
> [C000000018FDB6D0] [D000000000208280] .ext3_ordered_commit_write+0x74/0x114 [ext3]

Presumably j_list_lock got stuck. What we really need to see here are
the backtraces from the other CPUs.

gad, there have been so many all-CPU-backtrace patches over the years.

<optimistically cc's Ingo>

Ingo, do you think that's something which we should have in the spinlock
debugging code? A trace to let us see which CPU is holding that lock,
and where from? I guess if the other CPU is stuck in spin_lock_irqsave()
then we'll get stuck delivering the IPI, so it'd need to be async.



2006-09-29 19:26:46

by Ingo Molnar

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers


* Andrew Morton <[email protected]> wrote:

> gad, there have been so many all-CPU-backtrace patches over the years.
>
> <optimistically cc's Ingo>
>
> Ingo, do you think that's something which we should have in the
> spinlock debugging code? A trace to let us see which CPU is holding
> that lock, and where from? I guess if the other CPU is stuck in
> spin_lock_irqsave() then we'll get stuck delivering the IPI, so it'd
> need to be async.

I used to have this in -rt for i686 and x86_64, for the NMI watchdog tick
to print on all CPUs on the next tick (i.e. no need to actually
initiate an IPI) - but it was all a bit hacky [though it worked]. It fell
victim to some recent flux in that area.

< optimistically cc's Andi B-) >

Ingo

2006-09-29 19:54:40

by Andi Kleen

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers

On Friday 29 September 2006 21:18, Ingo Molnar wrote:
>
> * Andrew Morton <[email protected]> wrote:
>
> > gad, there have been so many all-CPU-backtrace patches over the years.
> >
> > <optimistically cc's Ingo>
> >
> > Ingo, do you think that's something which we should have in the
> > spinlock debugging code? A trace to let us see which CPU is holding
> > that lock, and where from? I guess if the other CPU is stuck in
> > spin_lock_irqsave() then we'll get stuck delivering the IPI, so it'd
> > need to be async.
>
> I used to have this in -rt for i686 and x86_64, for the NMI watchdog tick
> to print on all CPUs on the next tick (i.e. no need to actually
> initiate an IPI) - but it was all a bit hacky [though it worked]. It fell
> victim to some recent flux in that area.

You mean spinlock debugging setting a global variable and the NMI
watchdog testing that? Makes sense. I can put it on my todo list.

-Andi

2006-09-29 20:25:07

by Andrew Morton

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers

On Fri, 29 Sep 2006 21:54:30 +0200
Andi Kleen <[email protected]> wrote:

> On Friday 29 September 2006 21:18, Ingo Molnar wrote:
> >
> > * Andrew Morton <[email protected]> wrote:
> >
> > > gad, there have been so many all-CPU-backtrace patches over the years.
> > >
> > > <optimistically cc's Ingo>
> > >
> > > Ingo, do you think that's something which we should have in the
> > > spinlock debugging code? A trace to let us see which CPU is holding
> > > that lock, and where from? I guess if the other CPU is stuck in
> > > spin_lock_irqsave() then we'll get stuck delivering the IPI, so it'd
> > > need to be async.
> >
> > I used to have this in -rt for i686 and x86_64, for the NMI watchdog tick
> > to print on all CPUs on the next tick (i.e. no need to actually
> > initiate an IPI) - but it was all a bit hacky [though it worked]. It fell
> > victim to some recent flux in that area.
>
> You mean spinlock debugging setting a global variable and the NMI
> watchdog testing that? Makes sense. I can put it on my todo list.

It does make sense.

<type, type>

Something like this? (compile-tested only)

From: Andrew Morton <[email protected]>

When a spinlock lockup occurs, arrange for the NMI code to emit an all-cpu
backtrace, so we get to see which CPU is holding the lock, and where.

Cc: Andi Kleen <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Badari Pulavarty <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

 arch/i386/kernel/nmi.c   |   14 ++++++++++++++
 arch/x86_64/kernel/nmi.c |   17 ++++++++++++++++-
 include/asm-i386/nmi.h   |    3 +++
 include/asm-x86_64/nmi.h |    3 +++
 include/linux/nmi.h      |    5 +++++
 lib/spinlock_debug.c     |    4 ++++
 6 files changed, 45 insertions(+), 1 deletion(-)

diff -puN lib/spinlock_debug.c~spinlock-debug-all-cpu-backtrace lib/spinlock_debug.c
--- a/lib/spinlock_debug.c~spinlock-debug-all-cpu-backtrace
+++ a/lib/spinlock_debug.c
@@ -7,6 +7,7 @@
 */

#include <linux/spinlock.h>
+#include <linux/nmi.h>
#include <linux/interrupt.h>
#include <linux/debug_locks.h>
#include <linux/delay.h>
@@ -116,6 +117,9 @@ static void __spin_lock_debug(spinlock_t
				raw_smp_processor_id(), current->comm,
				current->pid, lock);
			dump_stack();
+#ifdef CONFIG_SMP
+			trigger_all_cpu_backtrace();
+#endif
		}
	}
}
diff -puN arch/i386/kernel/nmi.c~spinlock-debug-all-cpu-backtrace arch/i386/kernel/nmi.c
--- a/arch/i386/kernel/nmi.c~spinlock-debug-all-cpu-backtrace
+++ a/arch/i386/kernel/nmi.c
@@ -23,6 +23,7 @@
#include <linux/percpu.h>
#include <linux/dmi.h>
#include <linux/kprobes.h>
+#include <linux/cpumask.h>

#include <asm/smp.h>
#include <asm/nmi.h>
@@ -40,6 +41,8 @@
static DEFINE_PER_CPU(unsigned long, perfctr_nmi_owner);
static DEFINE_PER_CPU(unsigned long, evntsel_nmi_owner[3]);

+static cpumask_t backtrace_mask = CPU_MASK_NONE;
+
/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
 * offset from MSR_P4_BSU_ESCR0. It will be the max for all platforms (for now)
 */
@@ -905,6 +908,12 @@ __kprobes int nmi_watchdog_tick(struct p
		touched = 1;
	}

+	if (cpu_isset(cpu, backtrace_mask)) {
+		cpu_clear(cpu, backtrace_mask);
+		printk("NMI backtrace for cpu %d\n", cpu);
+		dump_stack();
+	}
+
	sum = per_cpu(irq_stat, cpu).apic_timer_irqs;

	/* if the apic timer isn't firing, this cpu isn't doing much */
@@ -1031,6 +1040,11 @@ int proc_nmi_enabled(struct ctl_table *t

#endif

+void __trigger_all_cpu_backtrace(void)
+{
+	backtrace_mask = CPU_MASK_ALL;
+}
+
EXPORT_SYMBOL(nmi_active);
EXPORT_SYMBOL(nmi_watchdog);
EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
diff -puN arch/x86_64/kernel/nmi.c~spinlock-debug-all-cpu-backtrace arch/x86_64/kernel/nmi.c
--- a/arch/x86_64/kernel/nmi.c~spinlock-debug-all-cpu-backtrace
+++ a/arch/x86_64/kernel/nmi.c
@@ -12,14 +12,15 @@
 * Mikael Pettersson : PM converted to driver model. Disable/enable API.
 */

+#include <linux/nmi.h>
#include <linux/mm.h>
#include <linux/delay.h>
#include <linux/interrupt.h>
#include <linux/module.h>
#include <linux/sysdev.h>
-#include <linux/nmi.h>
#include <linux/sysctl.h>
#include <linux/kprobes.h>
+#include <linux/cpumask.h>

#include <asm/smp.h>
#include <asm/nmi.h>
@@ -37,6 +38,8 @@
static DEFINE_PER_CPU(unsigned, perfctr_nmi_owner);
static DEFINE_PER_CPU(unsigned, evntsel_nmi_owner[2]);

+static cpumask_t backtrace_mask = CPU_MASK_NONE;
+
/* this number is calculated from Intel's MSR_P4_CRU_ESCR5 register and it's
 * offset from MSR_P4_BSU_ESCR0. It will be the max for all platforms (for now)
 */
@@ -778,6 +781,7 @@ int __kprobes nmi_watchdog_tick(struct p
{
	int sum;
	int touched = 0;
+	int cpu = smp_processor_id();
	struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
	u64 dummy;
	int rc=0;
@@ -795,6 +799,12 @@ int __kprobes nmi_watchdog_tick(struct p
		touched = 1;
	}

+	if (cpu_isset(cpu, backtrace_mask)) {
+		cpu_clear(cpu, backtrace_mask);
+		printk("NMI backtrace for cpu %d\n", cpu);
+		dump_stack();
+	}
+
#ifdef CONFIG_X86_MCE
	/* Could check oops_in_progress here too, but it's safer
	   not too */
@@ -927,6 +937,11 @@ int proc_nmi_enabled(struct ctl_table *t

#endif

+void __trigger_all_cpu_backtrace(void)
+{
+	backtrace_mask = CPU_MASK_ALL;
+}
+
EXPORT_SYMBOL(nmi_active);
EXPORT_SYMBOL(nmi_watchdog);
EXPORT_SYMBOL(avail_to_resrv_perfctr_nmi);
diff -puN include/linux/nmi.h~spinlock-debug-all-cpu-backtrace include/linux/nmi.h
--- a/include/linux/nmi.h~spinlock-debug-all-cpu-backtrace
+++ a/include/linux/nmi.h
@@ -14,9 +14,14 @@
 * disables interrupts for a long time. This call is stateless.
 */
#ifdef ARCH_HAS_NMI_WATCHDOG
+#include <asm/nmi.h>
extern void touch_nmi_watchdog(void);
#else
# define touch_nmi_watchdog() do { } while(0)
#endif

+#ifndef trigger_all_cpu_backtrace
+#define trigger_all_cpu_backtrace() do { } while (0)
+#endif
+
#endif
diff -puN include/asm-i386/nmi.h~spinlock-debug-all-cpu-backtrace include/asm-i386/nmi.h
--- a/include/asm-i386/nmi.h~spinlock-debug-all-cpu-backtrace
+++ a/include/asm-i386/nmi.h
@@ -36,4 +36,7 @@ extern unsigned int nmi_watchdog;
#define NMI_LOCAL_APIC 2
#define NMI_INVALID 3

+void __trigger_all_cpu_backtrace(void);
+#define trigger_all_cpu_backtrace() __trigger_all_cpu_backtrace()
+
#endif /* ASM_NMI_H */
diff -puN include/asm-x86_64/nmi.h~spinlock-debug-all-cpu-backtrace include/asm-x86_64/nmi.h
--- a/include/asm-x86_64/nmi.h~spinlock-debug-all-cpu-backtrace
+++ a/include/asm-x86_64/nmi.h
@@ -70,4 +70,7 @@ extern unsigned int nmi_watchdog;
#define NMI_LOCAL_APIC 2
#define NMI_INVALID 3

+void __trigger_all_cpu_backtrace(void);
+#define trigger_all_cpu_backtrace() __trigger_all_cpu_backtrace()
+
#endif /* ASM_NMI_H */
_



2006-09-29 21:30:53

by Ingo Molnar

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers


* Andrew Morton <[email protected]> wrote:

> +	if (cpu_isset(cpu, backtrace_mask)) {
> +		cpu_clear(cpu, backtrace_mask);
> +		printk("NMI backtrace for cpu %d\n", cpu);
> +		dump_stack();
> +	}

Yeah. There's just one issue: this will mix the printks from all
CPUs at once (the NMIs, if IO-APIC triggered, fire at once). The way I
solved this was to add a private spinlock around the printk and
dump_stack().
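
Something like this, say (untested sketch - the function-local static
lock is just one way to do it):

	if (cpu_isset(cpu, backtrace_mask)) {
		static DEFINE_SPINLOCK(lock);	/* serialise the printks */

		spin_lock(&lock);
		printk("NMI backtrace for cpu %d\n", cpu);
		dump_stack();
		spin_unlock(&lock);
		cpu_clear(cpu, backtrace_mask);
	}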

Ingo

2006-09-29 21:34:55

by Ingo Molnar

Subject: Re: [patch 003/152] jbd: fix commit of ordered data buffers


* Andrew Morton <[email protected]> wrote:

> +void __trigger_all_cpu_backtrace(void)
> +{
> +	backtrace_mask = CPU_MASK_ALL;
> +}

I'd suggest cpu_online_map; I also added a busy-loop to wait for the
backtrace_mask to go back to 0. That way we get a consistent dump all
across (as the NMI will dump on this CPU too).
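
I.e. something like this (equally untested - the 10-second cap is an
arbitrary number of mine):

	void __trigger_all_cpu_backtrace(void)
	{
		int i;

		backtrace_mask = cpu_online_map;
		/* wait up to 10 seconds for the other CPUs to dump too */
		for (i = 0; i < 10 * 1000; i++) {
			if (cpus_empty(backtrace_mask))
				break;
			mdelay(1);
		}
	}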

If the NMI isn't working on some CPU, the wait might be quite long ;-)
But it's not a big issue I think, as the system is likely hosed already.

Ingo