2009-01-04 01:23:59

by Adam Nielsen

Subject: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

Hi all,

I'm having a recurring problem with XFS which started about a day ago. All of
a sudden when reading a certain part of the disk (not sure where, but my
nightly backups trigger it) I get an infinite loop of these messages appearing
in my logs:

xfs_da_do_buf: bno 8388608
dir: inode 3087268096
Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2015 of file
fs/xfs/xfs_da_btree.c. Caller 0xffffffff802eba63
Pid: 4445, comm: metalog Tainted: P 2.6.28-rc2 #3
Call Trace:
[<ffffffff802eba63>] xfs_da_read_buf+0x24/0x29
[<ffffffff802eb6aa>] xfs_da_do_buf+0x2d2/0x621
[<ffffffff80267fb0>] balance_dirty_pages_ratelimited_nr+0x300/0x329
[<ffffffff802a52af>] block_write_end+0x4a/0x54
[<ffffffff802eba63>] xfs_da_read_buf+0x24/0x29
[<ffffffff802ecd0a>] xfs_da_node_lookup_int+0x5b/0x225
[<ffffffff802ecd0a>] xfs_da_node_lookup_int+0x5b/0x225
[<ffffffff802f2908>] xfs_dir2_node_lookup+0x43/0xe7
[<ffffffff802edbff>] xfs_dir2_isleaf+0x19/0x4a
[<ffffffff802ee357>] xfs_dir_lookup+0x10f/0x14f
[<ffffffff80297a74>] __d_lookup+0x11a/0x143
[<ffffffff80314b4b>] xfs_lookup+0x48/0xa5
[<ffffffff8031d06e>] xfs_vn_lookup+0x3c/0x78
[<ffffffff8028fafa>] __lookup_hash+0xfa/0x11e
[<ffffffff8029288d>] do_filp_open+0x159/0x7d7
[<ffffffff80507856>] _spin_unlock+0x10/0x2a
[<ffffffff8029a608>] alloc_fd+0x112/0x123
[<ffffffff8028701e>] do_sys_open+0x48/0xcc
[<ffffffff8020b3bb>] system_call_fastpath+0x16/0x1b
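
The `bno` in the first line is less arbitrary than it looks. The trace fails in `xfs_da_node_lookup_int` while reading the root block of a directory's lookup index. If we assume 4 KiB filesystem blocks (the block size of md0 is not stated in the thread) and the XFS v2 directory layout, where a directory's leaf/node blocks start at a logical offset of 32 GiB inside the directory, then bno 8388608 is exactly the first of those blocks. The constant and function names below are hypothetical; this is only the arithmetic, sketched under those assumptions:

```python
# Sketch under stated assumptions: 4 KiB filesystem blocks, one fs block per
# directory block, and the XFS v2 directory layout in which leaf/node blocks
# start at a logical byte offset of 32 GiB (2**35) within the directory.
LEAF_SPACE_OFFSET = 1 << 35  # bytes; assumed layout constant


def first_leaf_block(block_size: int) -> int:
    """Logical directory block number (the 'bno') of the first leaf/node block."""
    return LEAF_SPACE_OFFSET // block_size


print(first_leaf_block(4096))  # 8388608
```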

It doesn't seem to interfere with filesystem use, but metalog is logging
thousands of these messages into the system log files (the log files grow at
about 1MB/minute).

Does anyone know what this error means? Do I need to reformat the filesystem?

I've restarted a few times and the error goes away until the next nightly
backup triggers it again. Killing metalog does seem to stop the messages, so
perhaps one of the log files is the culprit? I'm not sure how to map that
inode or bno back to a filename. It's always the same bno/inode and always
reports metalog as the offending program.

Any suggestions how to go about diagnosing this problem?

Many thanks,
Adam.


2009-01-04 07:47:17

by Christoph Hellwig

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

On Sun, Jan 04, 2009 at 11:16:22AM +1000, Adam Nielsen wrote:
> Hi all,
>
> I'm having a recurring problem with XFS which started about a day ago.
> All of a sudden when reading a certain part of the disk (not sure where,
> but my nightly backups trigger it) I get an infinite loop of these
> messages appearing in my logs:
>
> xfs_da_do_buf: bno 8388608
> dir: inode 3087268096
> Filesystem "md0": XFS internal error xfs_da_do_buf(1) at line 2015 of
> file fs/xfs/xfs_da_btree.c. Caller 0xffffffff802eba63
> Pid: 4445, comm: metalog Tainted: P 2.6.28-rc2 #3

This is a typical result of a power loss scenario with write caches
enabled and without barriers. Given that md can't pass through barriers,
did you disable the write caches on your disk?
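
Why that combination is dangerous can be shown with a toy model (all names hypothetical, and far simpler than the real XFS log). A journalled update writes a log body and then a commit record. A volatile write cache may put those on the platter in either order, so after a power cut the commit record can survive while the body it vouches for does not, and recovery replays garbage. A cache flush between the two writes (what a barrier provides), or turning the cache off so each write goes straight to the media, closes that window:

```python
from dataclasses import dataclass, field


@dataclass
class ToyDisk:
    """Hypothetical model: `media` survives power loss, `cache` does not."""
    write_cache: bool = True
    media: dict = field(default_factory=dict)
    cache: list = field(default_factory=list)

    def write(self, block: str, data: str) -> None:
        if self.write_cache:
            self.cache.append((block, data))  # acknowledged, but only in volatile cache
        else:
            self.media[block] = data          # cache off: write goes straight to media

    def flush(self) -> None:
        """Barrier / cache flush: everything queued so far reaches the media."""
        for block, data in self.cache:
            self.media[block] = data
        self.cache.clear()

    def destage_newest(self) -> None:
        """The drive may write cached blocks in any order; model the worst case."""
        if self.cache:
            block, data = self.cache.pop()
            self.media[block] = data

    def power_loss(self) -> None:
        self.cache.clear()


def commit_without_body(write_cache: bool, use_barrier: bool) -> bool:
    """True if recovery would find a commit record whose log body never hit the media."""
    disk = ToyDisk(write_cache=write_cache)
    disk.write("log", "directory update")
    if use_barrier:
        disk.flush()              # body must be stable before the commit record
    disk.write("commit", "valid")
    disk.destage_newest()         # the drive writes back one block on its own
    disk.power_loss()
    return disk.media.get("commit") == "valid" and "log" not in disk.media
```

Only the first configuration (cache on, no barrier) can end up with a commit record pointing at a missing log body.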

> Does anyone know what this error means? Do I need to reformat the filesystem?

Run xfs_repair over it to fix up the directory, and make sure to
configure your disks properly so that it doesn't happen again.
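
A cautious way to follow that advice is a dry run first. `xfs_repair` must be run on an unmounted filesystem, and its `-n` (no-modify) option only reports what it would change; per its manual page, in that mode it exits with status 1 if it detected corruption and 0 if it did not. A small wrapper along those lines (function names hypothetical; it assumes the device appears in `/proc/mounts` under the same name you pass in, e.g. `/dev/md0`):

```python
import subprocess


def mounted_devices(mounts_text: str) -> set[str]:
    """Source devices listed in /proc/mounts-style text (first field of each line)."""
    return {line.split()[0] for line in mounts_text.splitlines() if line.strip()}


def xfs_dry_run(device: str) -> int:
    """Run `xfs_repair -n` on an unmounted device and return its exit status.

    In -n mode, status 1 is documented to mean corruption was detected.
    """
    with open("/proc/mounts") as f:
        if device in mounted_devices(f.read()):
            raise RuntimeError(f"{device} is mounted; unmount it before running xfs_repair")
    result = subprocess.run(["xfs_repair", "-n", device])  # -n: report only, change nothing
    return result.returncode
```

If `xfs_dry_run("/dev/md0")` returns 1, running `xfs_repair /dev/md0` without `-n` performs the actual repair.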

2009-01-04 09:03:45

by Adam Nielsen

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

> This is a typical result of a power loss scenario with write caches
> enabled and without barriers. Given that md can't pass through barriers,
> did you disable the write caches on your disk?

No, I didn't realise I had to do that... in fact I didn't even realise SATA
disks *had* write caches; I thought the cache was for reading only...

> Run xfs_repair over it to fix up the directory, and make sure to
> configure your disks properly so that it doesn't happen again.

Will do, thanks for the advice! Is there any standard way to disable write
caching on a SATA disk? hdparm -W seems to do the trick, but then I can't run
that until the system is up and running, leaving a small window of opportunity
for something to go wrong.

Thanks again,
Adam.

2009-01-04 09:24:04

by Christoph Hellwig

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

On Sun, Jan 04, 2009 at 07:03:23PM +1000, Adam Nielsen wrote:
> No, I didn't realise I had to do that...in fact I didn't even realise
> SATA disks *had* write caches, I thought the cache was for reading
> only...

Which would be the better default (it's what high-end disks generally
do by default). I've been wondering for a while how we can make default
setups in the presence of lvm/dm more secure, but there hasn't been
any progress yet.

>> Run xfs_repair over it to fix up the directory, and make sure to
>> configure your disks properly so that it doesn't happen again.
>
> Will do, thanks for the advice! Is there any standard way to disable
> write caching on a SATA disk? hdparm -W seems to do the trick, but then
> I can't run that until the system is up and running, leaving a small
> window of opportunity for something to go wrong.

On Debian-based systems you can add -W0 to /etc/default/hdparm and
it gets executed before the root filesystem is remounted read-write.
I'm not sure how other distributions handle it.
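
Whichever way the setting is applied at boot, it is worth checking afterwards that it took. On kernels much newer than the 2.6.28 in this thread (the attribute appeared around 4.7, so treat its availability as an assumption for your system), `/sys/block/<disk>/queue/write_cache` reads `write back` when the kernel believes the device has a volatile cache and `write through` otherwise. Writing to that file only changes how the kernel treats the device; it does not switch the drive's own cache, so `hdparm -W0` is still the tool for that. A sketch (function name hypothetical):

```python
from pathlib import Path


def write_back_cache_reported(disk: str, sysfs_block: str = "/sys/block") -> bool:
    """True if sysfs reports a volatile write-back cache for `disk` (e.g. "sda")."""
    mode = (Path(sysfs_block) / disk / "queue" / "write_cache").read_text().strip()
    return mode == "write back"
```

For an md array you would check each member disk, e.g. `write_back_cache_reported("sda")`, rather than `md0` itself.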

2009-01-04 11:44:33

by Alan

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

> On Debian-based systems you can add -W0 to /etc/default/hdparm and
> it gets executed before the root filesystem is remounted read-write.
> I'm not sure how other distributions handle it.

Generally they avoid setting -W0 because it ruins performance and can be
very bad for disk lifetime. The barriers code is there for a reason.

Of course certain distributions default to using LVM for all their file
systems which is completely and mindbogglingly bogus. That both messes up
barriers in some cases and takes a good 10-20% off performance when I've
benched it.

LVM is cool if you need it; most people don't.

Alan

2009-01-04 15:34:18

by Christoph Hellwig

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

On Sun, Jan 04, 2009 at 11:44:25AM +0000, Alan Cox wrote:
> Generally they avoid setting -W0 because it ruins performance and can be
> very bad for disk lifetime. The barriers code is there for a reason.

We've done measurements, and for modern NCQ/TCQ disks the performance
of cache off vs. cache on + barriers is close. For ext3, barriers are
generally slightly faster; for XFS the two are even, or cache off is
sometimes faster, depending on the workload.

> Of course certain distributions default to using LVM for all their file
> systems which is completely and mindbogglingly bogus. That both messes up
> barriers in some cases and takes a good 10-20% off performance when I've
> benched it.

The thing is that there's no reason for that at all with just a single
underlying disk. There is absolutely no reason for not passing through
barriers, and there's also no reason why it should be any slower than
our most trivial volume manager, the partition remapping code. In
fact there's no reason trivial device mapper tables couldn't be handled
by the partition remapping code.
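
The point is that a single-segment linear mapping needs the same arithmetic as a partition: one bounds check and one addition, with nothing that should stop a barrier from being handed straight to the one underlying disk. A sketch of what a device-mapper `linear` table line means (format `<start> <length> linear <device> <offset>`, all in 512-byte sectors; the class is hypothetical and is not how the kernel implements it):

```python
from dataclasses import dataclass


@dataclass
class LinearTarget:
    """One device-mapper 'linear' table line, in 512-byte sectors (hypothetical model)."""
    start: int
    length: int
    device: str
    offset: int

    @classmethod
    def parse(cls, line: str) -> "LinearTarget":
        start, length, target, device, offset = line.split()
        if target != "linear":
            raise ValueError(f"not a linear target: {target}")
        return cls(int(start), int(length), device, int(offset))

    def remap(self, sector: int) -> tuple[str, int]:
        """Map a sector of the mapped device to (underlying device, sector)."""
        if not self.start <= sector < self.start + self.length:
            raise ValueError("sector outside this target")
        # Same as a partition: shift by a fixed offset on a single device.
        return self.device, sector - self.start + self.offset
```

With `LinearTarget.parse("0 409600 linear /dev/sda2 2048")`, sector 100 maps to sector 2148 of `/dev/sda2`, just as a partition starting at sector 2048 would map it.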

2009-01-04 15:49:20

by Andi Kleen

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

Christoph Hellwig writes:
>
>> Of course certain distributions default to using LVM for all their file
>> systems which is completely and mindbogglingly bogus. That both messes up
>> barriers in some cases and takes a good 10-20% off performance when I've
>> benched it.
>
> The thing is that there's no reason for that at all with just a single
> underlying disk.

I submitted patches to do exactly that in DM some time ago.
Unfortunately, for unknown reasons, they still haven't made it in (as of 2.6.29 git).

-Andi

2009-01-05 05:13:15

by markus reichelt

Subject: Re: XFS internal error xfs_da_do_buf(1) at line 2015 of file fs/xfs/xfs_da_btree.c

* Alan Cox wrote:

> > On Debian-based systems you can add -W0 to /etc/default/hdparm
> > and it gets executed before the root filesystem is remounted
> > read-write. I'm not sure how other distributions handle it.
>
> Generally they avoid setting -W0 because it ruins performance and
> can be very bad for disk lifetime. The barriers code is there for a
> reason.

This is the first time I've read about a possible negative impact of a
disabled write cache on disk lifetime. Do you know of an article for
further reading?

--
left blank, right bald

