2006-11-28 15:49:04

by Jesper Juhl

Subject: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

Hi,

One of my NFS servers just gave me a nasty surprise that I think is
relevant to tell you about:

Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e

Call Trace:
[<ffffffff8020b122>] show_trace+0xb2/0x380
[<ffffffff8020b405>] dump_stack+0x15/0x20
[<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
[<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
[<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
[<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
[<ffffffff803566eb>] xfs_vn_create+0xb/0x10
[<ffffffff80284b2c>] vfs_create+0x8c/0xd0
[<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
[<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
[<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
[<ffffffff8049d617>] svc_process+0x437/0x6e0
[<ffffffff802e176d>] nfsd+0x1cd/0x360
[<ffffffff8020ab1c>] child_rip+0xa/0x12
xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa
Filesystem "dm-1": Corruption of in-memory data detected. Shutting
down filesystem: dm-1
Please umount the filesystem, and rectify the problem(s)
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
nfsd: non-standard errno: 5
(the above message repeats 1670 times, then the following)
xfs_force_shutdown(dm-1,0x1) called from line 424 of file
fs/xfs/xfs_rw.c. Return address = 0xffffffff80359daa

I unmounted the filesystem and ran xfs_repair, which told me to try and
mount it first to replay the log; so I did, unmounted it again, ran
xfs_repair (which didn't find any problems), and finally mounted it, and
everything is good - the filesystem seems intact.

Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Starting XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Ending XFS recovery on filesystem: dm-1 (logdev: /dev/Log1/ws22_log)
Filesystem "dm-1": Disabling barriers, not supported with external log device
XFS mounting filesystem dm-1
Ending clean XFS mount for filesystem: dm-1


The server in question is running kernel 2.6.18.1


--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html


2006-11-29 01:32:35

by David Chinner

Subject: Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> Hi,
>
> One of my NFS servers just gave me a nasty surprise that I think is
> relevant to tell you about:

Thanks, Jesper.

> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e
>
> Call Trace:
> [<ffffffff8020b122>] show_trace+0xb2/0x380
> [<ffffffff8020b405>] dump_stack+0x15/0x20
> [<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
> [<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
> [<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
> [<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
> [<ffffffff803566eb>] xfs_vn_create+0xb/0x10
> [<ffffffff80284b2c>] vfs_create+0x8c/0xd0
> [<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
> [<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
> [<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
> [<ffffffff8049d617>] svc_process+0x437/0x6e0
> [<ffffffff802e176d>] nfsd+0x1cd/0x360
> [<ffffffff8020ab1c>] child_rip+0xa/0x12
> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa

We shut down the filesystem because we cancelled a dirty transaction.
Once we start to dirty the incore objects, we can't roll back to
an unchanged state if a subsequent fatal error occurs during the
transaction and we have to abort it.
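
In code terms, the check being tripped looks roughly like this - a
sketch from memory, not the verbatim 2.6.18 source, so treat the exact
names as illustrative:

    /*
     * Sketch: a transaction that has already dirtied in-core objects
     * cannot be rolled back, so cancelling it reports the error and
     * forces the shutdown seen in the log above.
     */
    void
    xfs_trans_cancel(xfs_trans_t *tp, int flags)
    {
            xfs_mount_t *mp = tp->t_mountp;

            if ((tp->t_flags & XFS_TRANS_DIRTY) &&
                !XFS_FORCED_SHUTDOWN(mp)) {
                    /* the "internal error" line in the report */
                    XFS_ERROR_REPORT("xfs_trans_cancel",
                                     XFS_ERRLEVEL_LOW, mp);
                    /* 0x8 is the corrupt-in-core flag: "(dm-1,0x8)" */
                    xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
            }
            /* ... then release the log reservation and free the items */
    }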

If I understand historic occurrences of this correctly, there is
a possibility that it can be triggered in ENOMEM situations. Was your
machine running out of memory when this occurred?

> Filesystem "dm-1": Corruption of in-memory data detected. Shutting
> down filesystem: dm-1
> Please umount the filesystem, and rectify the problem(s)
> nfsd: non-standard errno: 5

EIO gets returned in certain locations once the filesystem has
been shut down.
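
Most of the XFS entry points check for that shutdown state early on;
the pattern is along these lines (illustrative only, not verbatim
source):

    /* e.g. near the top of xfs_create(): bail out once shut down */
    if (XFS_FORCED_SHUTDOWN(mp))
            return XFS_ERROR(EIO);  /* EIO is errno 5, as nfsd logged */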

> I unmounted the filesystem and ran xfs_repair, which told me to try and
> mount it first to replay the log; so I did, unmounted it again, ran
> xfs_repair (which didn't find any problems), and finally mounted it, and
> everything is good - the filesystem seems intact.

Yeah, the above error report is typically due to an in-memory
problem, not an on-disk issue.

> The server in question is running kernel 2.6.18.1

Can happen to XFS on any kernel version - got a report of this from
someone running a 2.4 kernel a couple of weeks ago....

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-29 09:17:27

by Jesper Juhl

Subject: Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

On 29/11/06, David Chinner <[email protected]> wrote:
> On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > Hi,
> >
> > One of my NFS servers just gave me a nasty surprise that I think is
> > relevant to tell you about:
>
> Thanks, Jesper.
>
> > Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e
> >
> > Call Trace:
> > [<ffffffff8020b122>] show_trace+0xb2/0x380
> > [<ffffffff8020b405>] dump_stack+0x15/0x20
> > [<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
> > [<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
> > [<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
> > [<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
> > [<ffffffff803566eb>] xfs_vn_create+0xb/0x10
> > [<ffffffff80284b2c>] vfs_create+0x8c/0xd0
> > [<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
> > [<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
> > [<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
> > [<ffffffff8049d617>] svc_process+0x437/0x6e0
> > [<ffffffff802e176d>] nfsd+0x1cd/0x360
> > [<ffffffff8020ab1c>] child_rip+0xa/0x12
> > xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa
>
> We shut down the filesystem because we cancelled a dirty transaction.
> Once we start to dirty the incore objects, we can't roll back to
> an unchanged state if a subsequent fatal error occurs during the
> transaction and we have to abort it.
>
So you are saying that there's nothing I can do to prevent this from
happening in the future?

> If I understand historic occurrences of this correctly, there is
> a possibility that it can be triggered in ENOMEM situations. Was your
> machine running out of memory when this occurred?
>
Not really. I just checked my monitoring software and, at the time
this happened, the box had ~5.9G RAM free (of 8G total) and no swap
used (but 11G available).


> > Filesystem "dm-1": Corruption of in-memory data detected. Shutting
> > down filesystem: dm-1
> > Please umount the filesystem, and rectify the problem(s)
> > nfsd: non-standard errno: 5
>
> EIO gets returned in certain locations once the filesystem has
> been shut down.
>
Makes sense.


> > I unmounted the filesystem and ran xfs_repair, which told me to try and
> > mount it first to replay the log; so I did, unmounted it again, ran
> > xfs_repair (which didn't find any problems), and finally mounted it, and
> > everything is good - the filesystem seems intact.
>
> Yeah, the above error report is typically due to an in-memory
> problem, not an on-disk issue.
>
Good to know.


> > The server in question is running kernel 2.6.18.1
>
> Can happen to XFS on any kernel version - got a report of this from
> someone running a 2.4 kernel a couple of weeks ago....
>

Ok. Thank you for your reply, David.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-11-30 02:07:54

by David Chinner

Subject: Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> On 29/11/06, David Chinner <[email protected]> wrote:
> >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> >> file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e
> >>
> >> Call Trace:
> >> [<ffffffff8020b122>] show_trace+0xb2/0x380
> >> [<ffffffff8020b405>] dump_stack+0x15/0x20
> >> [<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
> >> [<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
> >> [<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
> >> [<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
> >> [<ffffffff803566eb>] xfs_vn_create+0xb/0x10
> >> [<ffffffff80284b2c>] vfs_create+0x8c/0xd0
> >> [<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
> >> [<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
> >> [<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
> >> [<ffffffff8049d617>] svc_process+0x437/0x6e0
> >> [<ffffffff802e176d>] nfsd+0x1cd/0x360
> >> [<ffffffff8020ab1c>] child_rip+0xa/0x12
> >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> >> fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa
> >
> >We shut down the filesystem because we cancelled a dirty transaction.
> >Once we start to dirty the incore objects, we can't roll back to
> >an unchanged state if a subsequent fatal error occurs during the
> >transaction and we have to abort it.
> >
> So you are saying that there's nothing I can do to prevent this from
> happening in the future?

Pretty much - we need to work out what is going wrong, and
we can't from the shutdown message above - the error has
occurred in a path that doesn't have error report traps
in it.

Is this reproducible?

> >If I understand historic occurrences of this correctly, there is
> >a possibility that it can be triggered in ENOMEM situations. Was your
> machine running out of memory when this occurred?
> >
> Not really. I just checked my monitoring software and, at the time
> this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> used (but 11G available).

Ok. Sounds like we need more error reporting points inserted
into that code so we dump an error earlier and hence have some
hope of working out what went wrong next time.....
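
By "error reporting points" I mean traps like the following at each
step inside xfs_create() that can fail - a hypothetical sketch, with
some_create_step() standing in for a real step such as the directory
entry insert:

    error = some_create_step(tp);   /* hypothetical placeholder */
    if (error) {
            XFS_ERROR_REPORT("xfs_create: step failed",
                             XFS_ERRLEVEL_LOW, mp);
            goto abort_return;      /* ends in the xfs_trans_cancel() above */
    }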

OOC, there weren't any I/O errors reported before this shutdown?

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group

2006-11-30 05:51:45

by Jesper Juhl

Subject: Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

On 30/11/06, David Chinner <[email protected]> wrote:
> On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> > On 29/11/06, David Chinner <[email protected]> wrote:
> > >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > >> file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e
> > >>
> > >> Call Trace:
> > >> [<ffffffff8020b122>] show_trace+0xb2/0x380
> > >> [<ffffffff8020b405>] dump_stack+0x15/0x20
> > >> [<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
> > >> [<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
> > >> [<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
> > >> [<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
> > >> [<ffffffff803566eb>] xfs_vn_create+0xb/0x10
> > >> [<ffffffff80284b2c>] vfs_create+0x8c/0xd0
> > >> [<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
> > >> [<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
> > >> [<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
> > >> [<ffffffff8049d617>] svc_process+0x437/0x6e0
> > >> [<ffffffff802e176d>] nfsd+0x1cd/0x360
> > >> [<ffffffff8020ab1c>] child_rip+0xa/0x12
> > >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > >> fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa
> > >
> > >We shut down the filesystem because we cancelled a dirty transaction.
> > >Once we start to dirty the incore objects, we can't roll back to
> > >an unchanged state if a subsequent fatal error occurs during the
> > >transaction and we have to abort it.
> > >
> > So you are saying that there's nothing I can do to prevent this from
> > happening in the future?
>
> Pretty much - we need to work out what is going wrong, and
> we can't from the shutdown message above - the error has
> occurred in a path that doesn't have error report traps
> in it.
>
> Is this reproducible?
>
Not on demand, no. It has happened only this once as far as I know and
for unknown reasons.


> > >If I understand historic occurrences of this correctly, there is
> > >a possibility that it can be triggered in ENOMEM situations. Was your
> > >machine running out of memory when this occurred?
> > >
> > Not really. I just checked my monitoring software and, at the time
> > this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> > used (but 11G available).
>
> Ok. Sounds like we need more error reporting points inserted
> into that code so we dump an error earlier and hence have some
> hope of working out what went wrong next time.....
>
> OOC, there weren't any I/O errors reported before this shutdown?
>
No. I looked but found none.

Let me know if there's anything I can do to help.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2006-12-10 15:48:24

by Jesper Juhl

Subject: Re: XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c (kernel 2.6.18.1)

On 30/11/06, Jesper Juhl <[email protected]> wrote:
> On 30/11/06, David Chinner <[email protected]> wrote:
> > On Wed, Nov 29, 2006 at 10:17:25AM +0100, Jesper Juhl wrote:
> > > On 29/11/06, David Chinner <[email protected]> wrote:
> > > >On Tue, Nov 28, 2006 at 04:49:00PM +0100, Jesper Juhl wrote:
> > > >> Filesystem "dm-1": XFS internal error xfs_trans_cancel at line 1138 of
> > > >> file fs/xfs/xfs_trans.c. Caller 0xffffffff8034b47e
> > > >>

For your information: I just got this again on a different
filesystem (see below).

> > > >> Call Trace:
> > > >> [<ffffffff8020b122>] show_trace+0xb2/0x380
> > > >> [<ffffffff8020b405>] dump_stack+0x15/0x20
> > > >> [<ffffffff80327b4c>] xfs_error_report+0x3c/0x50
> > > >> [<ffffffff803435ae>] xfs_trans_cancel+0x6e/0x130
> > > >> [<ffffffff8034b47e>] xfs_create+0x5ee/0x6a0
> > > >> [<ffffffff80356556>] xfs_vn_mknod+0x156/0x2e0
> > > >> [<ffffffff803566eb>] xfs_vn_create+0xb/0x10
> > > >> [<ffffffff80284b2c>] vfs_create+0x8c/0xd0
> > > >> [<ffffffff802e734a>] nfsd_create_v3+0x31a/0x560
> > > >> [<ffffffff802ec838>] nfsd3_proc_create+0x148/0x170
> > > >> [<ffffffff802e19f9>] nfsd_dispatch+0xf9/0x1e0
> > > >> [<ffffffff8049d617>] svc_process+0x437/0x6e0
> > > >> [<ffffffff802e176d>] nfsd+0x1cd/0x360
> > > >> [<ffffffff8020ab1c>] child_rip+0xa/0x12
> > > >> xfs_force_shutdown(dm-1,0x8) called from line 1139 of file
> > > >> fs/xfs/xfs_trans.c. Return address = 0xffffffff80359daa
> > > >
> > > >We shut down the filesystem because we cancelled a dirty transaction.
> > > >Once we start to dirty the incore objects, we can't roll back to
> > > >an unchanged state if a subsequent fatal error occurs during the
> > > >transaction and we have to abort it.
> > > >
> > > So you are saying that there's nothing I can do to prevent this from
> > > happening in the future?
> >
> > Pretty much - we need to work out what is going wrong, and
> > we can't from the shutdown message above - the error has
> > occurred in a path that doesn't have error report traps
> > in it.
> >
> > Is this reproducible?
> >
> Not on demand, no. It has happened only this once as far as I know and
> for unknown reasons.
>
This time it *is* reproducible, so if you want me to try something,
let me know fast, since I have to delete the fs as soon as I have
copied all the data to a new one.


> > > >If I understand historic occurrences of this correctly, there is
> > > >a possibility that it can be triggered in ENOMEM situations. Was your
> > > >machine running out of memory when this occurred?
> > > >
> > > Not really. I just checked my monitoring software and, at the time
> > > this happened, the box had ~5.9G RAM free (of 8G total) and no swap
> > > used (but 11G available).
> >
> > Ok. Sounds like we need more error reporting points inserted
> > into that code so we dump an error earlier and hence have some
> > hope of working out what went wrong next time.....
> >
> > OOC, there weren't any I/O errors reported before this shutdown?
> >
> No. I looked but found none.
>

This time the server was running kernel 2.6.19:

Dec 10 15:09:21 nfsserver2 kernel: Filesystem "dm-6": XFS internal error xfs_trans_cancel at line 1138 of file fs/xfs/xfs_trans.c. Caller 0xffffffff8034263c
Dec 10 15:09:21 nfsserver2 kernel:
Dec 10 15:09:21 nfsserver2 kernel: Call Trace:
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8020aefc>] dump_trace+0xb3/0x42e
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8020b2b3>] show_trace+0x3c/0x55
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8020b2e1>] dump_stack+0x15/0x17
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8031e91f>] xfs_error_report+0x3c/0x3e
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8033975b>] xfs_trans_cancel+0x65/0x109
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8034263c>] xfs_create+0x5bb/0x613
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8034b43e>] xfs_vn_mknod+0x141/0x2b6
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8034b5ce>] xfs_vn_create+0xb/0xd
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff80278721>] vfs_create+0x7a/0xb1
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff802e0a9f>] nfsd_create_v3+0x300/0x548
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff802e69f9>] nfsd3_proc_create+0x152/0x164
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff802daf4a>] nfsd_dispatch+0xea/0x1bd
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8049cdb3>] svc_process+0x3ee/0x6fb
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff802db54b>] nfsd+0x198/0x2bd
Dec 10 15:09:21 nfsserver2 kernel: [<ffffffff8020a7a8>] child_rip+0xa/0x12
Dec 10 15:09:21 nfsserver2 kernel:
Dec 10 15:09:21 nfsserver2 kernel: xfs_force_shutdown(dm-6,0x8) called from line 1139 of file fs/xfs/xfs_trans.c. Return address = 0xffffffff8034e9ed
Dec 10 15:09:21 nfsserver2 kernel: Filesystem "dm-6": Corruption of in-memory data detected. Shutting down filesystem: dm-6
Dec 10 15:09:21 nfsserver2 kernel: Please umount the filesystem, and rectify the problem(s)
Dec 10 15:12:23 nfsserver2 kernel: nfsd: last server has exited
Dec 10 15:12:23 nfsserver2 kernel: nfsd: unexporting all filesystems
Dec 10 15:12:26 nfsserver2 kernel: xfs_force_shutdown(dm-6,0x1) called from line 424 of file fs/xfs/xfs_rw.c. Return address = 0xffffffff8034e9ed

If I unmount and then remount the filesystem, the log gets replayed OK.
I can then unmount it and run xfs_repair on it, and it finds no
problems, *but* when I then mount it again and the webserver that uses
the filesystem accesses it via NFS, it explodes again - every single
time; it's quite reproducible.
I'm currently in the process of copying all the data to a new XFS
filesystem in the hope that the new filesystem will be OK - the copy
seems to be proceeding fine.
Unfortunately, I can't keep the current filesystem around for
diagnostic work, since it's a production server and I don't have space
available to let the old and new copies co-exist, so I have to delete
the current one as soon as I have copied all the data off.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html