2013-07-12 02:39:41

by Dave Jones

[permalink] [raw]
Subject: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

Just saw this during boot after an unclean shutdown. It hung afterwards.

[ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
[ 97.164646] ------------[ cut here ]------------
[ 97.165312] kernel BUG at fs/xfs/xfs_message.c:108!
[ 97.165986] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 97.166947] Modules linked in: snd_hda_codec_realtek pcspkr usb_debug snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm e1000e ptp snd_page_alloc pps_core snd_timer snd soundcore xfs libcrc32c
[ 97.170782] CPU: 3 PID: 341 Comm: rpcbind Not tainted 3.10.0+ #49
[ 97.173684] task: ffff880235d14a40 ti: ffff880235dae000 task.ti: ffff880235dae000
[ 97.173701] RIP: 0010:[<ffffffffa002a362>] [<ffffffffa002a362>] assfail+0x22/0x30 [xfs]
[ 97.173702] RSP: 0018:ffff880235dafa98 EFLAGS: 00010296
[ 97.173703] RAX: 0000000000000060 RBX: ffff880235dafb20 RCX: 0000000000000006
[ 97.173703] RDX: 00000000000031e0 RSI: ffff880235d152a0 RDI: ffff880235d14a40
[ 97.173704] RBP: ffff880235dafa98 R08: 0000000000000002 R09: 0000000000000000
[ 97.173704] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880235e54000
[ 97.173705] R13: 0000000000035e92 R14: ffff880235dafc88 R15: ffff880235dafc28
[ 97.173706] FS: 00007fa6aac79840(0000) GS:ffff880244e00000(0000) knlGS:0000000000000000
[ 97.173706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 97.173707] CR2: 00007fffb1905f18 CR3: 0000000235ce5000 CR4: 00000000001407e0
[ 97.173707] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 97.173708] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 97.173708] Stack:
[ 97.173712] ffff880235dafb00 ffffffffa0076953 ffffe8ffffc066a0 ffff880240e04ca8
[ 97.173713] 0000000000000250 0000000000000020 ffff88023d05af40 ffff880235dafb20
[ 97.173715] ffff880235a14a40 ffff880235e54000 0000000000035e92 ffff880235dafc88
[ 97.173715] Call Trace:
[ 97.173730] [<ffffffffa0076953>] xfs_dir2_sf_addname+0x43/0x760 [xfs]
[ 97.173743] [<ffffffffa0067cfc>] xfs_dir_createname+0x15c/0x1b0 [xfs]
[ 97.173754] [<ffffffffa002f2dc>] xfs_create+0x4cc/0x710 [xfs]
[ 97.173764] [<ffffffffa00278ca>] xfs_vn_mknod+0x9a/0x1c0 [xfs]
[ 97.173773] [<ffffffffa0027a03>] xfs_vn_create+0x13/0x20 [xfs]
[ 97.173776] [<ffffffff811d100d>] vfs_create+0x9d/0x100
[ 97.173778] [<ffffffff811d1995>] do_last+0x925/0xe00
[ 97.173780] [<ffffffff811d1f2e>] path_openat+0xbe/0x6f0
[ 97.173783] [<ffffffff8109e33f>] ? local_clock+0x3f/0x50
[ 97.173785] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
[ 97.173787] [<ffffffff811d2c3a>] do_filp_open+0x3a/0x90
[ 97.173789] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
[ 97.173790] [<ffffffff811c0ddb>] do_sys_open+0x10b/0x200
[ 97.173792] [<ffffffff81010578>] ? syscall_trace_enter+0x18/0x290
[ 97.173794] [<ffffffff811c0eee>] SyS_open+0x1e/0x20
[ 97.173796] [<ffffffff81717c14>] tracesys+0xdd/0xe2
[ 97.173813] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 f1 41 89 d0 48 89 e5 48 89 fa 48 c7 c6 a8 98 0b a0 31 ff 31 c0 e8 ce fb ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48

This trace repeated a few times, then the same assertion was triggered from sys_renameat.

Dave


2013-07-13 02:00:37

by Dave Chinner

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Thu, Jul 11, 2013 at 10:39:30PM -0400, Dave Jones wrote:
> Just saw this during boot after an unclean shutdown. It hung afterwards.
>
> [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
....
> [ 97.173730] [<ffffffffa0076953>] xfs_dir2_sf_addname+0x43/0x760 [xfs]
> [ 97.173743] [<ffffffffa0067cfc>] xfs_dir_createname+0x15c/0x1b0 [xfs]
> [ 97.173754] [<ffffffffa002f2dc>] xfs_create+0x4cc/0x710 [xfs]
> [ 97.173764] [<ffffffffa00278ca>] xfs_vn_mknod+0x9a/0x1c0 [xfs]
> [ 97.173773] [<ffffffffa0027a03>] xfs_vn_create+0x13/0x20 [xfs]
> [ 97.173776] [<ffffffff811d100d>] vfs_create+0x9d/0x100
> [ 97.173778] [<ffffffff811d1995>] do_last+0x925/0xe00
> [ 97.173780] [<ffffffff811d1f2e>] path_openat+0xbe/0x6f0
> [ 97.173783] [<ffffffff8109e33f>] ? local_clock+0x3f/0x50
> [ 97.173785] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
> [ 97.173787] [<ffffffff811d2c3a>] do_filp_open+0x3a/0x90
> [ 97.173789] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
> [ 97.173790] [<ffffffff811c0ddb>] do_sys_open+0x10b/0x200
> [ 97.173792] [<ffffffff81010578>] ? syscall_trace_enter+0x18/0x290
> [ 97.173794] [<ffffffff811c0eee>] SyS_open+0x1e/0x20
>
> This trace repeated a few times, then the same assertion was triggered from sys_renameat.

That's rather curious. What this means is that there is either an
EIO or EEXIST error being returned from xfs_dir2_sf_lookup() when a
we're about to add the new entry. There are two things here - EIO
can only be returned if a shutdown has occurred - are there any
signs of a shutdown in the logs? If there is a shutdown in progress,
then this is just unlucky to shutdown with an inode in an
inconsistent state in memory that triggers this validity check
failure.

And EEXIST means that the initial lookup of the name during the open
failed to find the entry we are now trying to create. i.e. the
initial path walk failed to do the correct lookup on the directory,
and so never got down to xfs_dir2_sf_lookup() to find the directory
entry (perhaps a problem with a cached negative dentry?). Hence it
was decided during the open(O_CREATE) call that the directory entry
needed to be created, we get down to XFS to create it, and then get
EEXIST because the name already exists...

So, it's not clear what has caused this yet. Is it reproducable? If
would be good to get a trace of lookup vs addname events from XFS,
too (i.e. all the xfs_dir* and xfs_da* events) so we can see if the
correct lookups were done prior to the failing addname operation...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-13 15:07:07

by Dave Jones

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Sat, Jul 13, 2013 at 12:00:30PM +1000, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 10:39:30PM -0400, Dave Jones wrote:
> > Just saw this during boot after an unclean shutdown. It hung afterwards.
> >
> > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> ....
> > [ 97.173730] [<ffffffffa0076953>] xfs_dir2_sf_addname+0x43/0x760 [xfs]
> > [ 97.173743] [<ffffffffa0067cfc>] xfs_dir_createname+0x15c/0x1b0 [xfs]
> > [ 97.173754] [<ffffffffa002f2dc>] xfs_create+0x4cc/0x710 [xfs]
> > [ 97.173764] [<ffffffffa00278ca>] xfs_vn_mknod+0x9a/0x1c0 [xfs]
> > [ 97.173773] [<ffffffffa0027a03>] xfs_vn_create+0x13/0x20 [xfs]
> > [ 97.173776] [<ffffffff811d100d>] vfs_create+0x9d/0x100
> > [ 97.173778] [<ffffffff811d1995>] do_last+0x925/0xe00
> > [ 97.173780] [<ffffffff811d1f2e>] path_openat+0xbe/0x6f0
> > [ 97.173783] [<ffffffff8109e33f>] ? local_clock+0x3f/0x50
> > [ 97.173785] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
> > [ 97.173787] [<ffffffff811d2c3a>] do_filp_open+0x3a/0x90
> > [ 97.173789] [<ffffffff811e1b5f>] ? __alloc_fd+0xaf/0x200
> > [ 97.173790] [<ffffffff811c0ddb>] do_sys_open+0x10b/0x200
> > [ 97.173792] [<ffffffff81010578>] ? syscall_trace_enter+0x18/0x290
> > [ 97.173794] [<ffffffff811c0eee>] SyS_open+0x1e/0x20
> >
> > This trace repeated a few times, then the same assertion was triggered from sys_renameat.
>
> That's rather curious. What this means is that there is either an
> EIO or EEXIST error being returned from xfs_dir2_sf_lookup() when a
> we're about to add the new entry. There are two things here - EIO
> can only be returned if a shutdown has occurred - are there any
> signs of a shutdown in the logs?

not that I recall (machine is off, and remote right now).
1st time it was during boot (above), and 2nd time I hit it
the machine kept fuzzing for a while afterwards.

I'll see if I can reproduce it again on monday, and if so I'll
grab an xfsdump for you to stare at.

Dave

2013-07-15 19:58:16

by Dave Jones

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Sat, Jul 13, 2013 at 12:00:30PM +1000, Dave Chinner wrote:

> > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> ....
>
> So, it's not clear what has caused this yet. Is it reproducable?

Just hit it again on the same machine.

> would be good to get a trace of lookup vs addname events from XFS,
> too (i.e. all the xfs_dir* and xfs_da* events) so we can see if the
> correct lookups were done prior to the failing addname operation...

<- xfs tracing noob. Got a recipe I can cut-n-paste ?
Or a patch if that's necessary..

Dave

2013-07-16 00:27:29

by Dave Chinner

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Mon, Jul 15, 2013 at 03:58:00PM -0400, Dave Jones wrote:
> On Sat, Jul 13, 2013 at 12:00:30PM +1000, Dave Chinner wrote:
>
> > > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> > ....
> >
> > So, it's not clear what has caused this yet. Is it reproducable?
>
> Just hit it again on the same machine.
>
> > would be good to get a trace of lookup vs addname events from XFS,
> > too (i.e. all the xfs_dir* and xfs_da* events) so we can see if the
> > correct lookups were done prior to the failing addname operation...
>
> <- xfs tracing noob. Got a recipe I can cut-n-paste ?
> Or a patch if that's necessary..

Just use your usual method of pulling tracepoints out of the kernel,
be it with perf or (my preferred method) trace-cmd:

# trace-cmd record -e xfs_dir\* -e xfs_da\* <command that dies>
....
# trace-cmd report > trace.report.out

The trace.report.out file is what I'd like to see. The report file
could be large, so compressing it might be an idea...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-16 01:42:56

by Dave Jones

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Tue, Jul 16, 2013 at 10:27:03AM +1000, Dave Chinner wrote:
> On Mon, Jul 15, 2013 at 03:58:00PM -0400, Dave Jones wrote:
> > On Sat, Jul 13, 2013 at 12:00:30PM +1000, Dave Chinner wrote:
> >
> > > > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> > > ....
> > >
> > > So, it's not clear what has caused this yet. Is it reproducable?
> >
> > Just hit it again on the same machine.
> >
> > > would be good to get a trace of lookup vs addname events from XFS,
> > > too (i.e. all the xfs_dir* and xfs_da* events) so we can see if the
> > > correct lookups were done prior to the failing addname operation...
> >
> > <- xfs tracing noob. Got a recipe I can cut-n-paste ?
> > Or a patch if that's necessary..
>
> Just use your usual method of pulling tracepoints out of the kernel,
> be it with perf or (my preferred method) trace-cmd:
>
> # trace-cmd record -e xfs_dir\* -e xfs_da\* <command that dies>
> ....
> # trace-cmd report > trace.report.out
>
> The trace.report.out file is what I'd like to see. The report file
> could be large, so compressing it might be an idea...

Hmm, will that only log traces from the command that I run ?

I might need to do some kind of system-wide tracing here,
because every time this has happened so far, it's happened in
a different process. (rpcbind the first [wtf?], and chrony the 2nd)
Can trace-cmd do that ?

Dave

2013-07-16 02:02:07

by Dave Chinner

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Mon, Jul 15, 2013 at 09:42:44PM -0400, Dave Jones wrote:
> On Tue, Jul 16, 2013 at 10:27:03AM +1000, Dave Chinner wrote:
> > On Mon, Jul 15, 2013 at 03:58:00PM -0400, Dave Jones wrote:
> > > On Sat, Jul 13, 2013 at 12:00:30PM +1000, Dave Chinner wrote:
> > >
> > > > > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> > > > ....
> > > >
> > > > So, it's not clear what has caused this yet. Is it reproducable?
> > >
> > > Just hit it again on the same machine.
> > >
> > > > would be good to get a trace of lookup vs addname events from XFS,
> > > > too (i.e. all the xfs_dir* and xfs_da* events) so we can see if the
> > > > correct lookups were done prior to the failing addname operation...
> > >
> > > <- xfs tracing noob. Got a recipe I can cut-n-paste ?
> > > Or a patch if that's necessary..
> >
> > Just use your usual method of pulling tracepoints out of the kernel,
> > be it with perf or (my preferred method) trace-cmd:
> >
> > # trace-cmd record -e xfs_dir\* -e xfs_da\* <command that dies>
> > ....
> > # trace-cmd report > trace.report.out
> >
> > The trace.report.out file is what I'd like to see. The report file
> > could be large, so compressing it might be an idea...
>
> Hmm, will that only log traces from the command that I run ?

No, it will capture kernel wide. The command being run just gives it
a defined end-point to the capture. i.e. I often just use "sleep 30"
as the command.

$ sudo ~/trace-cmd/trace-cmd record -e xfs\* sleep 30 &
[1] 4180
$ /sys/kernel/debug/tracing/events/xfs*/filter
/sys/kernel/debug/tracing/events/*/xfs*/filter

$ sudo mount /dev/vda /mnt/test
$ sync
$ sudo umount /mnt/test
$ fg
sudo ~/trace-cmd/trace-cmd record -e xfs\* sleep 30
trace-cmd: Interrupted system call
recorder error in splice input
offset=3eb000
Kernel buffer statistics:
Note: "entries" are the entries left in the kernel ring buffer and are not
recorded in the trace data. They should all be zero.

CPU: 0
entries: 55
overrun: 0
commit overrun: 0
bytes: 3420
oldest event ts: 48589.221789
now ts: 48607.261966
dropped events: 0
read events: 352

$ sudo ~/trace-cmd/trace-cmd report > ~/t.t
$ head -10 ~/t.t
version = 6
cpus=1
mount-4185 [000] 48582.777997: xfs_buf_init: dev 253:0 bno 0xffffffffffffffff nblks 0x1 hold 1 pincount 0 lock 0 flags caller xfs_buf_get_uncached
mount-4185 [000] 48582.778001: xfs_buf_get_uncached: dev 253:0 bno 0xffffffffffffffff nblks 0x1 hold 1 pincount 0 lock 0 flags PAGES caller xfs_buf_read_uncached
mount-4185 [000] 48582.778002: xfs_buf_iorequest: dev 253:0 bno 0x0 nblks 0x1 hold 1 pincount 0 lock 0 flags READ|PAGES caller xfsbdstrat
mount-4185 [000] 48582.778002: xfs_buf_hold: dev 253:0 bno 0x0 nblks 0x1 hold 1 pincount 0 lock 0 flags READ|PAGES caller xfs_buf_iorequest
mount-4185 [000] 48582.778017: xfs_buf_rele: dev 253:0 bno 0x0 nblks 0x1 hold 2 pincount 0 lock 0 flags READ|PAGES caller xfs_buf_iorequest
mount-4185 [000] 48582.778017: xfs_buf_iowait: dev 253:0 bno 0x0 nblks 0x1 hold 1 pincount 0 lock 0 flags READ|PAGES caller xfs_buf_read_uncached
<idle>-0 [000] 48582.778692: xfs_buf_ioerror: dev 253:0 bno 0x0 len 0x200 hold 1 pincount 0 lock 0 error 0 flags READ|PAGES caller xfs_buf_bio_end_io
<idle>-0 [000] 48582.778693: xfs_buf_iodone: dev 253:0 bno 0x0 nblks 0x1 hold 1 pincount 0 lock 0 flags READ|PAGES caller _xfs_buf_ioend
$

> I might need to do some kind of system-wide tracing here,
> because every time this has happened so far, it's happened in
> a different process. (rpcbind the first [wtf?], and chrony the 2nd)
> Can trace-cmd do that ?

Yup. trace-cmd will trace global events unless you give it a
specific process filter to limit what PIDs it gathers events from.
Basically, trace-cmd is a much nicer front end to perf...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-25 04:16:18

by Dave Chinner

[permalink] [raw]
Subject: Re: XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358

On Thu, Jul 11, 2013 at 10:39:30PM -0400, Dave Jones wrote:
> Just saw this during boot after an unclean shutdown. It hung afterwards.
>
> [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> [ 97.164646] ------------[ cut here ]------------

I just reproduced this by chance. I know exactly what I was doing
when it went bang, so I might be able to reduce it from scattergun
blast testing to be able to reproduce it on demand.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2013-07-30 20:26:45

by Dave Jones

[permalink] [raw]
Subject: XFS undeletable files. (3.11rc3)

On Thu, Jul 25, 2013 at 02:15:38PM +1000, Dave Chinner wrote:
> On Thu, Jul 11, 2013 at 10:39:30PM -0400, Dave Jones wrote:
> > Just saw this during boot after an unclean shutdown. It hung afterwards.
> >
> > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> > [ 97.164646] ------------[ cut here ]------------
>
> I just reproduced this by chance. I know exactly what I was doing
> when it went bang, so I might be able to reduce it from scattergun
> blast testing to be able to reproduce it on demand.

I xfs_repair'd that disk a week or so back, and have been fine until today when..

(16:22:57:root@zerosum:tmp.6)# ll
ls: cannot access : No such file or directory
total 0
?????????? ? ? ? ? ? ?
(16:22:58:root@zerosum:tmp.6)# cd ..
(16:23:06:root@zerosum:tmp)# rm -rf tmp.6
rm: cannot remove ‘tmp.6’: Directory not empty
(16:23:14:root@zerosum:tmp)# cd -
/home/davej/src/trinity/tmp/tmp.6
(16:23:23:root@zerosum:tmp.6)# stat *
stat: cannot stat ‘\020’: No such file or directory

I don't know if this is residual crap from that first problem that never got fixed up,
or something new.

Any ideas ?

Dave

2013-07-31 06:11:33

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: XFS undeletable files. (3.11rc3)

On 2013.07.30 at 16:26 -0400, Dave Jones wrote:
> On Thu, Jul 25, 2013 at 02:15:38PM +1000, Dave Chinner wrote:
> > On Thu, Jul 11, 2013 at 10:39:30PM -0400, Dave Jones wrote:
> > > Just saw this during boot after an unclean shutdown. It hung afterwards.
> > >
> > > [ 97.162665] XFS: Assertion failed: xfs_dir2_sf_lookup(args) == ENOENT, file: fs/xfs/xfs_dir2_sf.c, line: 358
> > > [ 97.164646] ------------[ cut here ]------------
> >
> > I just reproduced this by chance. I know exactly what I was doing
> > when it went bang, so I might be able to reduce it from scattergun
> > blast testing to be able to reproduce it on demand.
>
> I xfs_repair'd that disk a week or so back, and have been fine until today when..
>
> (16:22:57:root@zerosum:tmp.6)# ll
> ls: cannot access : No such file or directory
> total 0
> ?????????? ? ? ? ? ? ?
> (16:22:58:root@zerosum:tmp.6)# cd ..
> (16:23:06:root@zerosum:tmp)# rm -rf tmp.6
> rm: cannot remove ‘tmp.6’: Directory not empty
> (16:23:14:root@zerosum:tmp)# cd -
> /home/davej/src/trinity/tmp/tmp.6
> (16:23:23:root@zerosum:tmp.6)# stat *
> stat: cannot stat ‘\020’: No such file or directory
>
> I don't know if this is residual crap from that first problem that never got fixed up,
> or something new.
>
> Any ideas ?

If you have crashed your test machine running an rc1 or rc2 kernel, you
must run xfs_repair again to clean up possible corruption (that happened
during log replay). The fix (commit e1b4271ac26) is in rc3.

--
Markus