2010-07-20 06:41:56

by Stephen Rothwell

[permalink] [raw]
Subject: linux-next: OOPS at boot time

Hi Al,

All of my PowerPC boot tests are getting this after getting to user mode
(sometimes several times):

------------[ cut here ]------------
kernel BUG at fs/inode.c:1244!
Oops: Exception in kernel mode, sig: 5 [#1]
SMP NR_CPUS=128 NUMA pSeries
last sysfs file: /sys/kernel/uevent_seqnum
Modules linked in: ehea xt_tcpudp iptable_filter ip_tables nfnetlink nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT x_tables dm_mirror dm_region_hash dm_log dm_zero dm_snapshot parport_pc parport dm_multipath autofs4
NIP: c000000000168074 LR: c000000000168064 CTR: 0000000000000000
REGS: c0000000063d39f0 TRAP: 0700 Not tainted (2.6.35-rc5-autokern1)
MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 22008422 XER: 20000001
TASK = c000000003a25840[4995] 'rm' THREAD: c0000000063d0000 CPU: 2
GPR00: 0000000000000001 c0000000063d3c70 c000000000bfdea0 c0000000015f2090
GPR04: 0000000000000003 0000000000000001 c0000000063d3bd0 c000000000c5a388
GPR08: 80c0000000000000 0000000000000006 6000000000000000 8000000000000000
GPR12: c0000000056c0598 c0000000074d6400 0000000003100000 0000000000779b68
GPR16: 00000000007782f0 0000000010018875 00000000ff9c74b8 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000010010000 0000000010010000
GPR24: 0000000000000001 00000000fffba994 0000000000000002 000000001001a038
GPR28: c000000006617800 c000000000aefd40 c000000000b586b8 c0000000056c0748
NIP [c000000000168074] .iput+0x288/0x2d0
LR [c000000000168064] .iput+0x278/0x2d0
Call Trace:
[c0000000063d3c70] [c000000000168064] .iput+0x278/0x2d0 (unreliable)
[c0000000063d3d00] [c00000000015e964] .do_unlinkat+0x124/0x1b8
[c0000000063d3e30] [c000000000008554] syscall_exit+0x0/0x40
Instruction dump:
38000000 f81f0008 f81f0000 e87e8090 48457d8d 60000000 7fe3fb78 4bfff339
e81f03b0 68000060 3120ffff 7c090110 <0b000000> 38210090 7fe3fb78 e8010010
---[ end trace 9ace9d3884bc0aac ]---
Trace/breakpoint trap
------------[ cut here ]------------

This is:
BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
in iput_final().

That BUG_ON was added by commit c0ae81f2 ("Make ->drop_inode() just
return whether inode needs to be dropped").

Has anyone seen this or something similar?
--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (2.24 kB)
(No filename) (490.00 B)
Download all attachments

2010-07-20 09:12:08

by Milton Miller

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

> Hi Al,
>
> All of my PowerPC boot tests are getting this after getting to user mode
> (sometimes several times):

>
> This is:
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> in iput_final().
>
> That BUG_ON was added by commit c0ae81f2 ("Make ->drop_inode() just
> return whether inode needs to be dropped").
>
> Has anyone seen this or something similar?

http://kerneltrap.org/mailarchive/linux-kernel/2010/7/8/4591460/thread

Looks like it died due to unable to reproduce ...

I don't need to be cc'd.

milton

2010-07-20 10:37:29

by Andrew Morton

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Tue, 20 Jul 2010 16:41:45 +1000 Stephen Rothwell <[email protected]> wrote:

> Hi Al,
>
> All of my PowerPC boot tests are getting this after getting to user mode
> (sometimes several times):
>
> ------------[ cut here ]------------
> kernel BUG at fs/inode.c:1244!
> Oops: Exception in kernel mode, sig: 5 [#1]
> SMP NR_CPUS=128 NUMA pSeries
> last sysfs file: /sys/kernel/uevent_seqnum
> Modules linked in: ehea xt_tcpudp iptable_filter ip_tables nfnetlink nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT x_tables dm_mirror dm_region_hash dm_log dm_zero dm_snapshot parport_pc parport dm_multipath autofs4
> NIP: c000000000168074 LR: c000000000168064 CTR: 0000000000000000
> REGS: c0000000063d39f0 TRAP: 0700 Not tainted (2.6.35-rc5-autokern1)
> MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 22008422 XER: 20000001
> TASK = c000000003a25840[4995] 'rm' THREAD: c0000000063d0000 CPU: 2
> GPR00: 0000000000000001 c0000000063d3c70 c000000000bfdea0 c0000000015f2090
> GPR04: 0000000000000003 0000000000000001 c0000000063d3bd0 c000000000c5a388
> GPR08: 80c0000000000000 0000000000000006 6000000000000000 8000000000000000
> GPR12: c0000000056c0598 c0000000074d6400 0000000003100000 0000000000779b68
> GPR16: 00000000007782f0 0000000010018875 00000000ff9c74b8 0000000000000000
> GPR20: 0000000000000000 0000000000000000 0000000010010000 0000000010010000
> GPR24: 0000000000000001 00000000fffba994 0000000000000002 000000001001a038
> GPR28: c000000006617800 c000000000aefd40 c000000000b586b8 c0000000056c0748
> NIP [c000000000168074] .iput+0x288/0x2d0
> LR [c000000000168064] .iput+0x278/0x2d0
> Call Trace:
> [c0000000063d3c70] [c000000000168064] .iput+0x278/0x2d0 (unreliable)
> [c0000000063d3d00] [c00000000015e964] .do_unlinkat+0x124/0x1b8
> [c0000000063d3e30] [c000000000008554] syscall_exit+0x0/0x40
> Instruction dump:
> 38000000 f81f0008 f81f0000 e87e8090 48457d8d 60000000 7fe3fb78 4bfff339
> e81f03b0 68000060 3120ffff 7c090110 <0b000000> 38210090 7fe3fb78 e8010010
> ---[ end trace 9ace9d3884bc0aac ]---
> Trace/breakpoint trap
> ------------[ cut here ]------------
>
> This is:
> BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> in iput_final().
>
> That BUG_ON was added by commit c0ae81f2 ("Make ->drop_inode() just
> return whether inode needs to be dropped").
>
> Has anyone seen this or something similar?

I get it all the time. See the thread "Subject: Re: linux-next: Tree for
July 7".

2010-07-20 22:45:49

by Dave Chinner

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Tue, Jul 20, 2010 at 03:36:56AM -0700, Andrew Morton wrote:
> On Tue, 20 Jul 2010 16:41:45 +1000 Stephen Rothwell <[email protected]> wrote:
>
> > Hi Al,
> >
> > All of my PowerPC boot tests are getting this after getting to user mode
> > (sometimes several times):
> >
> > ------------[ cut here ]------------
> > kernel BUG at fs/inode.c:1244!
> > Oops: Exception in kernel mode, sig: 5 [#1]
> > SMP NR_CPUS=128 NUMA pSeries
> > last sysfs file: /sys/kernel/uevent_seqnum
> > Modules linked in: ehea xt_tcpudp iptable_filter ip_tables nfnetlink nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT x_tables dm_mirror dm_region_hash dm_log dm_zero dm_snapshot parport_pc parport dm_multipath autofs4
> > NIP: c000000000168074 LR: c000000000168064 CTR: 0000000000000000
> > REGS: c0000000063d39f0 TRAP: 0700 Not tainted (2.6.35-rc5-autokern1)
> > MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 22008422 XER: 20000001
> > TASK = c000000003a25840[4995] 'rm' THREAD: c0000000063d0000 CPU: 2
> > GPR00: 0000000000000001 c0000000063d3c70 c000000000bfdea0 c0000000015f2090
> > GPR04: 0000000000000003 0000000000000001 c0000000063d3bd0 c000000000c5a388
> > GPR08: 80c0000000000000 0000000000000006 6000000000000000 8000000000000000
> > GPR12: c0000000056c0598 c0000000074d6400 0000000003100000 0000000000779b68
> > GPR16: 00000000007782f0 0000000010018875 00000000ff9c74b8 0000000000000000
> > GPR20: 0000000000000000 0000000000000000 0000000010010000 0000000010010000
> > GPR24: 0000000000000001 00000000fffba994 0000000000000002 000000001001a038
> > GPR28: c000000006617800 c000000000aefd40 c000000000b586b8 c0000000056c0748
> > NIP [c000000000168074] .iput+0x288/0x2d0
> > LR [c000000000168064] .iput+0x278/0x2d0
> > Call Trace:
> > [c0000000063d3c70] [c000000000168064] .iput+0x278/0x2d0 (unreliable)
> > [c0000000063d3d00] [c00000000015e964] .do_unlinkat+0x124/0x1b8
> > [c0000000063d3e30] [c000000000008554] syscall_exit+0x0/0x40
> > Instruction dump:
> > 38000000 f81f0008 f81f0000 e87e8090 48457d8d 60000000 7fe3fb78 4bfff339
> > e81f03b0 68000060 3120ffff 7c090110 <0b000000> 38210090 7fe3fb78 e8010010
> > ---[ end trace 9ace9d3884bc0aac ]---
> > Trace/breakpoint trap
> > ------------[ cut here ]------------
> >
> > This is:
> > BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> > in iput_final().
> >
> > That BUG_ON was added by commit c0ae81f2 ("Make ->drop_inode() just
> > return whether inode needs to be dropped").
> >
> > Has anyone seen this or something similar?
>
> I get it all the time. See the thread "Subject: Re: linux-next: Tree for
> July 7".

Yet nobody else seems to be able to reproduce it. Given that powerPC
is good at triggering reace conditions, maybe there is one that
only you are unlucky eough to trigger.

Rather than just commenting out the BUG_ON() and ignoring the
problem, can you print out the inode state (and enough information
to identify the filesystem the inode belongs to) before triggering
the BUG_ON() so we can get some idea of how this is triggering?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-21 00:45:23

by Andrew Morton

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, 21 Jul 2010 08:45:25 +1000
Dave Chinner <[email protected]> wrote:

> On Tue, Jul 20, 2010 at 03:36:56AM -0700, Andrew Morton wrote:
> > On Tue, 20 Jul 2010 16:41:45 +1000 Stephen Rothwell <[email protected]> wrote:
> >
> > > Hi Al,
> > >
> > > All of my PowerPC boot tests are getting this after getting to user mode
> > > (sometimes several times):
> > >
> > > ------------[ cut here ]------------
> > > kernel BUG at fs/inode.c:1244!
> > > Oops: Exception in kernel mode, sig: 5 [#1]
> > > SMP NR_CPUS=128 NUMA pSeries
> > > last sysfs file: /sys/kernel/uevent_seqnum
> > > Modules linked in: ehea xt_tcpudp iptable_filter ip_tables nfnetlink nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT x_tables dm_mirror dm_region_hash dm_log dm_zero dm_snapshot parport_pc parport dm_multipath autofs4
> > > NIP: c000000000168074 LR: c000000000168064 CTR: 0000000000000000
> > > REGS: c0000000063d39f0 TRAP: 0700 Not tainted (2.6.35-rc5-autokern1)
> > > MSR: 8000000000029032 <EE,ME,CE,IR,DR> CR: 22008422 XER: 20000001
> > > TASK = c000000003a25840[4995] 'rm' THREAD: c0000000063d0000 CPU: 2
> > > GPR00: 0000000000000001 c0000000063d3c70 c000000000bfdea0 c0000000015f2090
> > > GPR04: 0000000000000003 0000000000000001 c0000000063d3bd0 c000000000c5a388
> > > GPR08: 80c0000000000000 0000000000000006 6000000000000000 8000000000000000
> > > GPR12: c0000000056c0598 c0000000074d6400 0000000003100000 0000000000779b68
> > > GPR16: 00000000007782f0 0000000010018875 00000000ff9c74b8 0000000000000000
> > > GPR20: 0000000000000000 0000000000000000 0000000010010000 0000000010010000
> > > GPR24: 0000000000000001 00000000fffba994 0000000000000002 000000001001a038
> > > GPR28: c000000006617800 c000000000aefd40 c000000000b586b8 c0000000056c0748
> > > NIP [c000000000168074] .iput+0x288/0x2d0
> > > LR [c000000000168064] .iput+0x278/0x2d0
> > > Call Trace:
> > > [c0000000063d3c70] [c000000000168064] .iput+0x278/0x2d0 (unreliable)
> > > [c0000000063d3d00] [c00000000015e964] .do_unlinkat+0x124/0x1b8
> > > [c0000000063d3e30] [c000000000008554] syscall_exit+0x0/0x40
> > > Instruction dump:
> > > 38000000 f81f0008 f81f0000 e87e8090 48457d8d 60000000 7fe3fb78 4bfff339
> > > e81f03b0 68000060 3120ffff 7c090110 <0b000000> 38210090 7fe3fb78 e8010010
> > > ---[ end trace 9ace9d3884bc0aac ]---
> > > Trace/breakpoint trap
> > > ------------[ cut here ]------------
> > >
> > > This is:
> > > BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
> > > in iput_final().
> > >
> > > That BUG_ON was added by commit c0ae81f2 ("Make ->drop_inode() just
> > > return whether inode needs to be dropped").

That patch simply moved the BUG_ON around. It still triggers with that
patch reverted.


> > > Has anyone seen this or something similar?
> >
> > I get it all the time. See the thread "Subject: Re: linux-next: Tree for
> > July 7".
>
> Yet nobody else seems to be able to reproduce it. Given that powerPC
> is good at triggering reace conditions, maybe there is one that
> only you are unlucky eough to trigger.
>
> Rather than just commenting out the BUG_ON() and ignoring the
> problem, can you print out the inode state (and enough information
> to identify the filesystem the inode belongs to) before triggering
> the BUG_ON() so we can get some idea of how this is triggering?

Already did. ext3. I_DIRTY_SYNC, I_DIRTY_DATASYNC and I_DIRTY_PAGES
are set (i_state=0x67).

A bit of poking around indicates that these inodes always have zero
attached pages, and they were dirtied within dquot_free_space().

[ 304.460188] iput_final: 67
[ 304.460357] type: ext3
[ 304.460518] pages:0
[ 304.460679] dirtied:include/linux/quotaops.h:330
[ 304.460842] ------------[ cut here ]------------
[ 304.461006] WARNING: at fs/inode.c:1250 iput+0x285/0x292()
[ 304.467256] Modules linked in: autofs4 sunrpc ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video output sbs sbshc battery ac lp parport sg ide_cd_mod cdrom option usb_wwan usbserial serio_raw floppy snd_hda_intel snd_hda_codec button snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss i2c_i801 i2c_core snd_pcm snd_timer snd soundcore snd_page_alloc shpchp pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 304.470029] Pid: 4297, comm: rm Tainted: G W 2.6.35-rc5 #3
[ 304.470196] Call Trace:
[ 304.470360] [<ffffffff810ceb59>] ? iput+0x1fe/0x292
[ 304.470527] [<ffffffff810cebe0>] ? iput+0x285/0x292
[ 304.470695] [<ffffffff81036283>] warn_slowpath_common+0x7e/0x97
[ 304.470864] [<ffffffff810362b1>] warn_slowpath_null+0x15/0x17
[ 304.471030] [<ffffffff810cebe0>] iput+0x285/0x292
[ 304.471198] [<ffffffff810c587d>] do_unlinkat+0x104/0x15a
[ 304.471364] [<ffffffff81376ee9>] ? retint_swapgs+0xe/0x13
[ 304.471531] [<ffffffff8107225e>] ? audit_syscall_entry+0x183/0x1b6
[ 304.471700] [<ffffffff810c58e4>] sys_unlink+0x11/0x13
[ 304.471869] [<ffffffff810029eb>] system_call_fastpath+0x16/0x1b
[ 304.472035] ---[ end trace 750102f49dbda08a ]---



fs/anon_inodes.c | 2 ++
fs/buffer.c | 7 +++++++
fs/fs-writeback.c | 8 ++++++++
fs/inode.c | 9 ++++++++-
fs/pipe.c | 2 ++
include/linux/fs.h | 11 ++++++++++-
mm/page-writeback.c | 5 +++++
7 files changed, 42 insertions(+), 2 deletions(-)

diff -puN fs/inode.c~a fs/inode.c
--- a/fs/inode.c~a
+++ a/fs/inode.c
@@ -1241,7 +1241,14 @@ static void iput_final(struct inode *ino
hlist_del_init(&inode->i_hash);
spin_unlock(&inode_lock);
wake_up_inode(inode);
- BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
+ if (inode->i_state != (I_FREEING | I_CLEAR)) {
+ printk("iput_final: %lx\n", inode->i_state);
+ printk("type: %s\n", inode->i_sb->s_type->name);
+ printk("pages:%lu\n", inode->i_mapping->nrpages);
+ printk("dirtied:%s:%d\n", inode->i_where_file,
+ inode->i_where_line);
+ WARN_ON(1);
+ }
destroy_inode(inode);
}

diff -puN fs/anon_inodes.c~a fs/anon_inodes.c
--- a/fs/anon_inodes.c~a
+++ a/fs/anon_inodes.c
@@ -205,6 +205,8 @@ static struct inode *anon_inode_mkinode(
* that it already _is_ on the dirty list.
*/
inode->i_state = I_DIRTY;
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
inode->i_mode = S_IRUSR | S_IWUSR;
inode->i_uid = current_fsuid();
inode->i_gid = current_fsgid();
diff -puN fs/buffer.c~a fs/buffer.c
--- a/fs/buffer.c~a
+++ a/fs/buffer.c
@@ -663,6 +663,8 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
static void __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ struct inode *inode;
+
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -671,6 +673,11 @@ static void __set_page_dirty(struct page
page_index(page), PAGECACHE_TAG_DIRTY);
}
spin_unlock_irq(&mapping->tree_lock);
+ inode = mapping->host;
+ if (!(inode->i_state & I_DIRTY_PAGES)) {
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
+ }
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}

diff -puN fs/fs-writeback.c~a fs/fs-writeback.c
--- a/fs/fs-writeback.c~a
+++ a/fs/fs-writeback.c
@@ -399,6 +399,10 @@ writeback_single_inode(struct inode *ino
* to b_more_io so it will get more writeout as
* soon as the queue becomes uncongested.
*/
+ if (!(inode->i_state & I_DIRTY_PAGES)) {
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
+ }
inode->i_state |= I_DIRTY_PAGES;
select_queue:
if (wbc->nr_to_write <= 0) {
@@ -420,6 +424,10 @@ select_queue:
* file would indefinitely suspend writeout of
* all the other files.
*/
+ if (!(inode->i_state & I_DIRTY_PAGES)) {
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
+ }
inode->i_state |= I_DIRTY_PAGES;
redirty_tail(inode);
}
diff -puN fs/libfs.c~a fs/libfs.c
diff -puN fs/pipe.c~a fs/pipe.c
--- a/fs/pipe.c~a
+++ a/fs/pipe.c
@@ -968,6 +968,8 @@ static struct inode * get_pipe_inode(voi
* list because "mark_inode_dirty()" will think
* that it already _is_ on the dirty list.
*/
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
inode->i_state = I_DIRTY;
inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
inode->i_uid = current_fsuid();
diff -puN mm/page-writeback.c~a mm/page-writeback.c
--- a/mm/page-writeback.c~a
+++ a/mm/page-writeback.c
@@ -1137,7 +1137,12 @@ int __set_page_dirty_nobuffers(struct pa
}
spin_unlock_irq(&mapping->tree_lock);
if (mapping->host) {
+ struct inode *inode = mapping->host;
/* !PageAnon && !swapper_space */
+ if (!(inode->i_state & I_DIRTY_PAGES)) {
+ inode->i_where_file = __FILE__;
+ inode->i_where_line = __LINE__;
+ }
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
return 1;
diff -puN include/linux/fs.h~a include/linux/fs.h
--- a/include/linux/fs.h~a
+++ a/include/linux/fs.h
@@ -775,6 +775,8 @@ struct inode {

unsigned long i_state;
unsigned long dirtied_when; /* jiffies of first dirtying */
+ char *i_where_file;
+ int i_where_line;

unsigned int i_flags;

@@ -1640,11 +1642,18 @@ struct super_operations {
#define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)

extern void __mark_inode_dirty(struct inode *, int);
-static inline void mark_inode_dirty(struct inode *inode)
+static inline void xxmark_inode_dirty(struct inode *inode, char *file, int line)
{
+ if (!(inode->i_state & I_DIRTY)) {
+ inode->i_where_file = file;
+ inode->i_where_line = line;
+ }
+
__mark_inode_dirty(inode, I_DIRTY);
}

+#define mark_inode_dirty(inode) xxmark_inode_dirty(inode, __FILE__, __LINE__)
+
static inline void mark_inode_dirty_sync(struct inode *inode)
{
__mark_inode_dirty(inode, I_DIRTY_SYNC);
_


This isn't necessarily a problem in the quota code (setting aside the
question: why the heck does dquot_free_space() set I_DIRTY_PAGES??).
If the vfs is asked to kill off a dirty inode, it should at least clean
the thing first.

I dunno. That fs/inode.c patch series from Viro looks fishy. I guess
I get to bisect it tomorrow.

2010-07-21 05:20:50

by Dave Chinner

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Tue, Jul 20, 2010 at 05:44:24PM -0700, Andrew Morton wrote:
> On Wed, 21 Jul 2010 08:45:25 +1000 Dave Chinner <[email protected]> wrote:
> > On Tue, Jul 20, 2010 at 03:36:56AM -0700, Andrew Morton wrote:
> > > On Tue, 20 Jul 2010 16:41:45 +1000 Stephen Rothwell <[email protected]> wrote:
> > > > Has anyone seen this or something similar?
> > >
> > > I get it all the time. See the thread "Subject: Re: linux-next: Tree for
> > > July 7".
> >
> > Yet nobody else seems to be able to reproduce it. Given that powerPC
> > is good at triggering reace conditions, maybe there is one that
> > only you are unlucky eough to trigger.
> >
> > Rather than just commenting out the BUG_ON() and ignoring the
> > problem, can you print out the inode state (and enough information
> > to identify the filesystem the inode belongs to) before triggering
> > the BUG_ON() so we can get some idea of how this is triggering?
>
> Already did. ext3. I_DIRTY_SYNC, I_DIRTY_DATASYNC and I_DIRTY_PAGES
> are set (i_state=0x67).
>
> A bit of poking around indicates that these inodes always have zero
> attached pages,

They should, because by the time that bug fires they should have had
all their pages stripped away.

> and they were dirtied within dquot_free_space().

AFAICT dquot_free_space() is called deep in the guts of
ext3_truncate() via dquot_free_block(), which is called directly
before end_writeback(). That should overwrite any state changes made
inside ext3_truncate. I wonder if iput_final() is racing with
something else here?

> This isn't necessarily a problem in the quota code (setting aside the
> question: why the heck does dquot_free_space() set I_DIRTY_PAGES??).
> If the vfs is asked to kill off a dirty inode, it should at least clean
> the thing first.
>
> I dunno. That fs/inode.c patch series from Viro looks fishy. I guess
> I get to bisect it tomorrow.

I suspect that is the only way to get to the bottom of this, short
of a reliable reproducer being discovered. I'm still trying to
reproduce it - I've even turned quota on - but I'm not having any
more luck than over the weekend, though...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-21 07:30:13

by Andrew Morton

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, 21 Jul 2010 15:20:07 +1000 Dave Chinner <[email protected]> wrote:

> > and they were dirtied within dquot_free_space().
>
> AFAICT dquot_free_space() is called deep in the guts of
> ext3_truncate() via dquot_free_block(), which is called directly
> before end_writeback(). That should overwrite any state changes made
> inside ext3_truncate. I wonder if iput_final() is racing with
> something else here?
>

This isn't a race. I type `make' and the warnings spew out at hundreds
per second - every unlink, I'd say.

Did you try my .config?

2010-07-21 07:48:22

by Stephen Rothwell

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, 21 Jul 2010 00:29:07 -0700 Andrew Morton <[email protected]> wrote:
>
> On Wed, 21 Jul 2010 15:20:07 +1000 Dave Chinner <[email protected]> wrote:
>
> > > and they were dirtied within dquot_free_space().
> >
> > AFAICT dquot_free_space() is called deep in the guts of
> > ext3_truncate() via dquot_free_block(), which is called directly
> > before end_writeback(). That should overwrite any state changes made
> > inside ext3_truncate. I wonder if iput_final() is racing with
> > something else here?
> >
>
> This isn't a race. I type `make' and the warnings spew out at hundreds
> per second - every unlink, I'd say.

Bisected to:

commit 8bfe4a06746e5f03c02afe3ceb97b5364c099f63
Author: Al Viro <[email protected]>
Date: Sun Jun 6 07:08:19 2010 -0400

convert ext3 to ->evict_inode()

Signed-off-by: Al Viro <[email protected]>

--
Cheers,
Stephen Rothwell [email protected]
http://www.canb.auug.org.au/~sfr/


Attachments:
(No filename) (983.00 B)
(No filename) (490.00 B)
Download all attachments

2010-07-21 12:11:46

by Jan Kara

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed 21-07-10 17:48:09, Stephen Rothwell wrote:
> On Wed, 21 Jul 2010 00:29:07 -0700 Andrew Morton <[email protected]> wrote:
> >
> > On Wed, 21 Jul 2010 15:20:07 +1000 Dave Chinner <[email protected]> wrote:
> >
> > > > and they were dirtied within dquot_free_space().
> > >
> > > AFAICT dquot_free_space() is called deep in the guts of
> > > ext3_truncate() via dquot_free_block(), which is called directly
> > > before end_writeback(). That should overwrite any state changes made
> > > inside ext3_truncate. I wonder if iput_final() is racing with
> > > something else here?
> > >
> >
> > This isn't a race. I type `make' and the warnings spew out at hundreds
> > per second - every unlink, I'd say.
>
> Bisected to:
>
> commit 8bfe4a06746e5f03c02afe3ceb97b5364c099f63
> Author: Al Viro <[email protected]>
> Date: Sun Jun 6 07:08:19 2010 -0400
>
> convert ext3 to ->evict_inode()
>
> Signed-off-by: Al Viro <[email protected]>
Thanks for bisecting this. The patch series indeed seems to uncover
some discrepancies.
Ext3 has always dirtied inode in it's ->delete_inode method (via quota
code). But previously clear_inode() just overwrote the state with I_CLEAR
and thus we never saw the BUG_ON. After Al's patches, i_state is set in
end_writeback() which happens earlier. In particular it happens before
ext3_free_inode() which dirties the inode through quota code while freeing
xattrs - they are accounted in i_blocks, so i_blocks are updated during
freeing and inode is dirtied.
Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
call writes the inode state to the journal so the dirty flag in the inode
state is in fact stale and overwriting it with I_CLEAR never mattered. In
this sense, the BUG_ON triggered is a false positive. But I believe this is
a separate story.
I'm not sure how to really fix this. It seems a bit premature to me to
mark inode as I_CLEAR before the filesystem is actually done with it. So
maybe the line
inode->i_state = I_FREEING | I_CLEAR;
should be moved to evict() fuction?

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-07-21 12:20:05

by Jan Kara

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Tue 20-07-10 17:44:24, Andrew Morton wrote:
> On Wed, 21 Jul 2010 08:45:25 +1000
...
> This isn't necessarily a problem in the quota code (setting aside the
> question: why the heck does dquot_free_space() set I_DIRTY_PAGES??).
Because sometime in the dark past (2.4 days I believe), I used
mark_inode_dirty in quota functions (not sure whether there even were
different inode dirty flags back then) and it stayed this way upto now.
mark_inode_dirty_sync() is of course more appropriate for quota code these
days. Cleanup is on its way...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-07-21 17:50:21

by Al Viro

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> Thanks for bisecting this. The patch series indeed seems to uncover
> some discrepancies.
> Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> code). But previously clear_inode() just overwrote the state with I_CLEAR
> and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> end_writeback() which happens earlier. In particular it happens before
> ext3_free_inode() which dirties the inode through quota code while freeing
> xattrs - they are accounted in i_blocks, so i_blocks are updated during
> freeing and inode is dirtied.
> Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> call writes the inode state to the journal so the dirty flag in the inode
> state is in fact stale and overwriting it with I_CLEAR never mattered. In
> this sense, the BUG_ON triggered is a false positive. But I believe this is
> a separate story.
> I'm not sure how to really fix this. It seems a bit premature to me to
> mark inode as I_CLEAR before the filesystem is actually done with it. So
> maybe the line
> inode->i_state = I_FREEING | I_CLEAR;
> should be moved to evict() fuction?

Nope. I_CLEAR is "no async calls from vfs anymore; it's under complete
fs control and is about to die now".

In any case, I'll post a dumb replacement for ext3 after I verify it on
the laptop I have with me. Should be in an hour or so (the damn thing is
_slow_).

2010-07-21 21:40:30

by Al Viro

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> Thanks for bisecting this. The patch series indeed seems to uncover
> some discrepancies.
> Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> code). But previously clear_inode() just overwrote the state with I_CLEAR
> and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> end_writeback() which happens earlier. In particular it happens before
> ext3_free_inode() which dirties the inode through quota code while freeing
> xattrs - they are accounted in i_blocks, so i_blocks are updated during
> freeing and inode is dirtied.
> Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> call writes the inode state to the journal so the dirty flag in the inode
> state is in fact stale and overwriting it with I_CLEAR never mattered. In
> this sense, the BUG_ON triggered is a false positive. But I believe this is
> a separate story.

Could you please explain why the hell does ext2_xattr_delete_inode() call the
dirtying variant of dquot_free_block()? Note that the inode is well beyond
the point where writeback would consider touching it; this mark_inode_dirty()
will do nothing useful whatsoever at that place.

Anyway, I've dealt with ext3 and ext2 (both b0rken with quota) and AFAICS
the rest of quota-supporting filesystems had been OK. Changes:

ext3_evict_inode() hd end_writeback() shifted downstream (needed anyway)
ext2 switched to nodirty variant of dquot_free_block()
ext2_xattr_delete_inode() doesn't try to dirty inode anymore (always
had been pointless).

It's in for-next, should propagate to git.kernel.org in a few.

Diff against the buggy version follows, feel free to try.
diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index e8766a3..c6c684b 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -571,7 +571,7 @@ do_more:
error_return:
brelse(bitmap_bh);
release_blocks(sb, freed);
- dquot_free_block(inode, freed);
+ dquot_free_block_nodirty(inode, freed);
}

/**
@@ -1418,7 +1418,8 @@ allocated:

*errp = 0;
brelse(bitmap_bh);
- dquot_free_block(inode, *count-num);
+ dquot_free_block_nodirty(inode, *count-num);
+ mark_inode_dirty(inode);
*count = num;
return ret_block;

@@ -1428,8 +1429,10 @@ out:
/*
* Undo the block allocation
*/
- if (!performed_allocation)
- dquot_free_block(inode, *count);
+ if (!performed_allocation) {
+ dquot_free_block_nodirty(inode, *count);
+ mark_inode_dirty(inode);
+ }
brelse(bitmap_bh);
return 0;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 7e4a455..940c961 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -439,6 +439,8 @@ static int ext2_alloc_blocks(struct inode *inode,
failed_out:
for (i = 0; i <index; i++)
ext2_free_blocks(inode, new_blocks[i], 1);
+ if (index)
+ mark_inode_dirty(inode);
return ret;
}

@@ -1009,8 +1011,8 @@ static inline void ext2_free_data(struct inode *inode, __le32 *p, __le32 *q)
else if (block_to_free == nr - count)
count++;
else {
- mark_inode_dirty(inode);
ext2_free_blocks (inode, block_to_free, count);
+ mark_inode_dirty(inode);
free_this:
block_to_free = nr;
count = 1;
@@ -1018,8 +1020,8 @@ static inline void ext2_free_data(struct inode *inode, __le32 *p, __le32 *q)
}
}
if (count > 0) {
- mark_inode_dirty(inode);
ext2_free_blocks (inode, block_to_free, count);
+ mark_inode_dirty(inode);
}
}

diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 7c39157..5ab87e6 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -674,6 +674,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
new_bh = sb_getblk(sb, block);
if (!new_bh) {
ext2_free_blocks(inode, block, 1);
+ mark_inode_dirty(inode);
error = -EIO;
goto cleanup;
}
@@ -703,8 +704,10 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
* written (only some dirty data were not) so we just proceed
* as if nothing happened and cleanup the unused block */
if (error && error != -ENOSPC) {
- if (new_bh && new_bh != old_bh)
- dquot_free_block(inode, 1);
+ if (new_bh && new_bh != old_bh) {
+ dquot_free_block_nodirty(inode, 1);
+ mark_inode_dirty(inode);
+ }
goto cleanup;
}
} else
@@ -727,6 +730,7 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
mb_cache_entry_free(ce);
ea_bdebug(old_bh, "freeing");
ext2_free_blocks(inode, old_bh->b_blocknr, 1);
+ mark_inode_dirty(inode);
/* We let our caller release old_bh, so we
* need to duplicate the buffer before. */
get_bh(old_bh);
@@ -736,7 +740,8 @@ ext2_xattr_set2(struct inode *inode, struct buffer_head *old_bh,
le32_add_cpu(&HDR(old_bh)->h_refcount, -1);
if (ce)
mb_cache_entry_release(ce);
- dquot_free_block(inode, 1);
+ dquot_free_block_nodirty(inode, 1);
+ mark_inode_dirty(inode);
mark_buffer_dirty(old_bh);
ea_bdebug(old_bh, "refcount now=%d",
le32_to_cpu(HDR(old_bh)->h_refcount));
@@ -799,7 +804,7 @@ ext2_xattr_delete_inode(struct inode *inode)
mark_buffer_dirty(bh);
if (IS_SYNC(inode))
sync_dirty_buffer(bh);
- dquot_free_block(inode, 1);
+ dquot_free_block_nodirty(inode, 1);
}
EXT2_I(inode)->i_file_acl = 0;

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 7edd529..cc55cec 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -230,7 +230,6 @@ void ext3_evict_inode (struct inode *inode)
inode->i_size = 0;
if (inode->i_blocks)
ext3_truncate(inode);
- end_writeback(inode);
/*
* Kill off the orphan record which ext3_truncate created.
* AKPM: I think this can be inside the above `if'.
@@ -252,10 +251,12 @@ void ext3_evict_inode (struct inode *inode)
if (ext3_mark_inode_dirty(handle, inode)) {
/* If that failed, just dquot_drop() and be done with that */
dquot_drop(inode);
+ end_writeback(inode);
} else {
ext3_xattr_delete_inode(handle, inode);
dquot_free_inode(inode);
dquot_drop(inode);
+ end_writeback(inode);
ext3_free_inode(handle, inode);
}
ext3_journal_stop(handle);

2010-07-21 23:20:01

by Dave Chinner

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Wed, Jul 21, 2010 at 12:29:07AM -0700, Andrew Morton wrote:
> On Wed, 21 Jul 2010 15:20:07 +1000 Dave Chinner <[email protected]> wrote:
>
> > > and they were dirtied within dquot_free_space().
> >
> > AFAICT dquot_free_space() is called deep in the guts of
> > ext3_truncate() via dquot_free_block(), which is called directly
> > before end_writeback(). That should overwrite any state changes made
> > inside ext3_truncate. I wonder if iput_final() is racing with
> > something else here?
> >
>
> This isn't a race. I type `make' and the warnings spew out at hundreds
> per second - every unlink, I'd say.
>
> Did you try my .config?

Yes, I did - it was the second thing I tried after using my usual
.config (*).

I'm not an ext3 expert, so I might be missing something, but I
cannot see (from code inspection) where those flags are being set
after end_writeback is called. My experience with inode flag
corruptions on XFS that can only be reprodued by a small number of
machines is that it is usually the result of a race condition.
It looks and smells similar to me.

Anyway, I'm not sure I can help much more at this point - until I
can reproduce it I'm just making SWAGs...

Cheers,

Dave.

(*) I eventually worked out that the ext3 corruption I was chasing
on that one filesystem was a result of using a kernel built with
your config because it didn't have CONFIG_EXT3_DEFAULTS_TO_ORDERED=y
set and I did the equivalent of yanking the power cord to the VM
(I do that all the time) while that kernel was running...

--
Dave Chinner
[email protected]

2010-07-23 10:04:42

by Jan Kara

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

Hi Al,

On Wed 21-07-10 22:40:16, Al Viro wrote:
> On Wed, Jul 21, 2010 at 02:11:17PM +0200, Jan Kara wrote:
> > Thanks for bisecting this. The patch series indeed seems to uncover
> > some discrepancies.
> > Ext3 has always dirtied inode in it's ->delete_inode method (via quota
> > code). But previously clear_inode() just overwrote the state with I_CLEAR
> > and thus we never saw the BUG_ON. After Al's patches, i_state is set in
> > end_writeback() which happens earlier. In particular it happens before
> > ext3_free_inode() which dirties the inode through quota code while freeing
> > xattrs - they are accounted in i_blocks, so i_blocks are updated during
> > freeing and inode is dirtied.
> > Actually, ext3_mark_inode_dirty() called during each mark_inode_dirty()
> > call writes the inode state to the journal so the dirty flag in the inode
> > state is in fact stale and overwriting it with I_CLEAR never mattered. In
> > this sense, the BUG_ON triggered is a false positive. But I believe this is
> > a separate story.
>
> Could you please explain why the hell does ext2_xattr_delete_inode() call the
> dirtying variant of dquot_free_block()? Note that the inode is well beyond
> the point where writeback would consider touching it; this mark_inode_dirty()
> will do nothing useful whatsoever at that place.
Yes, I'm aware that dirtying inode does nothing useful at that point.
But OTOH it does no harm so why not call the standard variant of quota
function? But I have no strong opinion on this particular callsite so
using dquot_free_block_nodirty() at that place is OK with me.

> Anyway, I've dealt with ext3 and ext2 (both b0rken with quota) and AFAICS
> the rest of quota-supporting filesystems had been OK. Changes:
>
> ext3_evict_inode() hd end_writeback() shifted downstream (needed anyway)
I like this.

> ext2 switched to nodirty variant of dquot_free_block()
But I don't like this - see comments in the patch. I'd be much more happy
if ext2 did the same thing as ext3 - pulled ext2_xattr_delete_inode()
call out of ext2_free_inode called end_writeback() after it.

> ext2_xattr_delete_inode() doesn't try to dirty inode anymore (always
> had been pointless).
As I wrote above I don't care about this very much.

> It's in for-next, should propagate to git.kernel.org in a few.
>
> Diff against the buggy version follows, feel free to try.
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index e8766a3..c6c684b 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -571,7 +571,7 @@ do_more:
> error_return:
> brelse(bitmap_bh);
> release_blocks(sb, freed);
> - dquot_free_block(inode, freed);
> + dquot_free_block_nodirty(inode, freed);
> }
The above call in ext2_free_blocks() modifies the inode and
ext2_free_blocks is enough "black-box" function that it should do the right
thing and dirty the inode if it modified it.

> /**
> @@ -1418,7 +1418,8 @@ allocated:
>
> *errp = 0;
> brelse(bitmap_bh);
> - dquot_free_block(inode, *count-num);
> + dquot_free_block_nodirty(inode, *count-num);
> + mark_inode_dirty(inode);
> *count = num;
> return ret_block;
>
> @@ -1428,8 +1429,10 @@ out:
> /*
> * Undo the block allocation
> */
> - if (!performed_allocation)
> - dquot_free_block(inode, *count);
> + if (!performed_allocation) {
> + dquot_free_block_nodirty(inode, *count);
> + mark_inode_dirty(inode);
> + }
> brelse(bitmap_bh);
> return 0;
> }
Sorry, but the above two changes look stupid... Why call _nodirty variant
and dirty the inode immediately after that? It happens in two other places
in your patch as well...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2010-07-24 12:28:08

by Al Viro

[permalink] [raw]
Subject: Re: linux-next: OOPS at boot time

On Fri, Jul 23, 2010 at 12:04:08PM +0200, Jan Kara wrote:

> > @@ -1428,8 +1429,10 @@ out:
> > /*
> > * Undo the block allocation
> > */
> > - if (!performed_allocation)
> > - dquot_free_block(inode, *count);
> > + if (!performed_allocation) {
> > + dquot_free_block_nodirty(inode, *count);
> > + mark_inode_dirty(inode);
> > + }
> > brelse(bitmap_bh);
> > return 0;
> > }
> Sorry, but the above two changes look stupid... Why call _nodirty variant
> and dirty the inode immediately after that? It happens in two other places
> in your patch as well...

Frankly, I don't believe that dquot_free_block() is _ever_ the right interface;
please, leave dirtying the inode to the filesystem and let it be explicit.