2008-05-14 12:32:18

by Alexander Borghgraef

Subject: Nfs filesystem corruption(?) after kmail crash

Hi all,

I've been posting this on every linux support forum I thought
relevant, with no useful response. I suspect it's an nfs problem, so
now I'm trying this list. The problem is this: ever since upgrading
from Fedora 6 to 8, I've been experiencing periodic kmail/kontact
crashes. After running for some time, kmail pops an error window:

Error opening /home/myhome/Mail/sent-mail/cur; either this is not a
valid maildir folder, or you do not have
sufficient access permissions.

Sent-mail is just an example, it happens to random mail directories.
Shortly after this, kmail crashes with the error message:

KMail encountered a fatal error and will terminate now. The error
was: Could not sync maildir folder.

So I did an ls of an affected mail directory, which resides on an nfs
filesystem on a fileserver, and I found this rather disquieting
result:

aborghgr@mypc~$ ls -l Mail/inbox/
ls: cannot access Mail/inbox/cur: No such file or directory
total 8
d????????? ? ? ? ? ? cur
drwx------ 2 aborghgr slocate 4096 2005-07-08 13:23 new
drwx------ 2 aborghgr slocate 4096 2008-04-24 10:24 tmp

The cur directory (or other files to which this has happened) is not
permanently lost: moving its parent directory for example restores all
of the files, and simply waiting also solves the problem. Still, it
makes working with kmail impossible, and also raises concerns about
the stability of the filesystem in general. The people at the kdepim
list said that a user-space program such as kmail could not possibly
corrupt the filesystem like that, so it has to be a bug in nfs. Any
ideas? I'm using Fedora 8, the 2.6.24.4-64.fc8 kernel and kmail 1.9.9
on kde 3.5.9-5.fc8. The nfs fileserver runs Fedora Core 3 with the
2.6.9-1.667 kernel (old machine).

--
Alex Borghgraef


2008-05-23 14:44:15

by Alexander Borghgraef

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Mon, May 19, 2008 at 4:48 PM, J. Bruce Fields <[email protected]> wrote:

> Out of curiosity--what's the filesystem on the server? (I just wonder
> if this could be due to poor time resolution, so if e.g. switching from
> ext3 to xfs would work around the problem.)

Is ext3 known for time resolution issues? Switching to a different fs
could prove problematic, but I could always ask the sysadmin to move
my home dir to my client machine. There should be enough space, and
that would rule out any synchronization problems.

The main thing here is that I'd like to understand why this is
happening. What does it mean when ls returns something like:

d????????? ? ? ? ? ? cur

Why is this triggered by an app like kmail? Why does the problem
disappear when I cd into the oddly behaving directory, but not when
kmail checks its availability? Why doesn't this always work? Things
like that.

--
Alex Borghgraef

2008-05-21 08:15:02

by Alexander Borghgraef

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Mon, May 19, 2008 at 4:48 PM, J. Bruce Fields <[email protected]> wrote:
> You're doing the "ls" from the client, though, not the server, right?

Yes, from the client. I'll check what ls on the server does if it happens again.

> Out of curiosity--what's the filesystem on the server? (I just wonder
> if this could be due to poor time resolution, so if e.g. switching from
> ext3 to xfs would work around the problem.)

Just checked, it's ext3.

--
Alex Borghgraef

2008-05-19 14:48:08

by J. Bruce Fields

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Wed, May 14, 2008 at 02:32:16PM +0200, Alexander Borghgraef wrote:
> Hi all,
>
> I've been posting this on every linux support forum I thought
> relevant, with no useful response. I suspect it's an nfs problem, so
> now I'm trying this list. The problem is this: ever since upgrading
> from Fedora 6 to 8, I've been experiencing periodic kmail/kontact
> crashes. After running for some time, kmail pops an error window:
>
> Error opening /home/myhome/Mail/sent-mail/cur; either this is not a
> valid maildir folder, or you do not have
> sufficient access permissions.
>
> Sent-mail is just an example, it happens to random mail directories.
> Shortly after this, kmail crashes with the error message:
>
> KMail encountered a fatal error and will terminate now. The error
> was: Could not sync maildir folder.
>
> So I did an ls of an affected mail directory, which resides on an nfs
> filesystem on a fileserver, and I found this rather disquieting
> result:

You're doing the "ls" from the client, though, not the server, right?

>
> aborghgr@mypc~$ ls -l Mail/inbox/
> ls: cannot access Mail/inbox/cur: No such file or directory
> total 8
> d????????? ? ? ? ? ? cur
> drwx------ 2 aborghgr slocate 4096 2005-07-08 13:23 new
> drwx------ 2 aborghgr slocate 4096 2008-04-24 10:24 tmp
>
> The cur directory (or other files to which this has happened) is not
> permanently lost: moving its parent directory for example restores all
> of the files, and simply waiting also solves the problem. Still, it
> makes working with kmail impossible, and also raises concerns about
> the stability of the filesystem in general. The people at the kdepim
> list said that a user-space program such as kmail could not possibly
> corrupt the filesystem like that, so it has to be a bug in nfs. Any
> ideas? I'm using Fedora 8, the 2.6.24.4-64.fc8 kernel and kmail 1.9.9
> on kde 3.5.9-5.fc8. The nfs fileserver runs Fedora Core 3 with the
> 2.6.9-1.667 kernel (old machine).

Out of curiosity--what's the filesystem on the server? (I just wonder
if this could be due to poor time resolution, so if e.g. switching from
ext3 to xfs would work around the problem.)

--b.

2008-06-02 13:05:19

by Alexander Borghgraef

Subject: Re: Nfs filesystem corruption(?) after kmail crash

Nobody? Anyone care to tell me how to interpret the strace stat cur output?

--
Alex Borghgraef

2008-06-02 13:43:25

by Jeff Layton

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Mon, 2 Jun 2008 15:05:17 +0200
"Alexander Borghgraef" <[email protected]> wrote:

> Nobody? Anyone care to tell me how to interpret the strace stat cur output?
>

> lstat64("cur", 0xbfb81cb4) = -1 ENOENT (No such file or directory)

File doesn't exist...

If this is from "ls -l" or something like that, that means that the
client did a READDIR or READDIRPLUS and saw a "cur" entry in the
directory with a particular filehandle. It then went back and did a
stat() against that filehandle and it was gone. The two possibilities
are that something removed that directory in the interim (possibly
replacing it with a new "cur" directory), or that the filehandle was
bad for some reason. I'm not aware of any bugs causing the latter, so
the former is the most likely.
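
The two-step sequence described above (list the names, then look up
each one's attributes separately) can be sketched in a few lines of
Python. This is only an illustration of why a listing can show a name
with question marks in place of its attributes; it is not the code of
ls or of the NFS client. The function name long_listing and the
injectable lstat parameter are made up for this sketch, and the output
format is a simplified imitation of "ls -l".

```python
import os
import stat


def long_listing(directory, lstat=os.lstat):
    """Build `ls -l`-style lines in two steps: read names, then stat each one."""
    # Step 1: read the directory entries (name plus entry type, like readdir).
    with os.scandir(directory) as it:
        entries = sorted((e.name, e.is_dir(follow_symlinks=False)) for e in it)

    lines = []
    for name, is_dir in entries:
        try:
            # Step 2: a separate per-entry attribute lookup, like lstat64("cur").
            st = lstat(os.path.join(directory, name))
        except FileNotFoundError:
            # The name was listed, but the object behind it is gone (ENOENT),
            # so only the entry type from step 1 is known.
            kind = "d" if is_dir else "-"
            lines.append(f"{kind}????????? ? ? ? ? ? {name}")
        else:
            lines.append(
                f"{stat.filemode(st.st_mode)} {st.st_nlink} {st.st_size} {name}"
            )
    return lines
```

Passing an lstat replacement that raises FileNotFoundError for one
name reproduces the "d????????? ? ? ? ? ? cur" line locally, without
needing NFS at all: the name came from step 1, the failure from step 2.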

You would think that the client would just use the info returned by
READDIRPLUS to fill out the stat() info, but it doesn't because stat()
calls generate an on-the-wire getattr (unless noatime is specified).
Peter S. and I were talking about this the other day. IMO, this
probably ought to be changed.

Most likely, this is the race that Tom described. ext3 has 1s
granularity on timestamps. It's easy to do *many* NFS operations within
1s. You might consider switching to a local filesystem on the server w/
more granular timestamps if you have a lot of concurrent activity like
this.
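
To make the granularity argument concrete, here is a toy model in
Python. It assumes (this is an assumption of the sketch, not a
description of the actual client code) that a client decides whether
its cached view of a directory is still current by comparing the
directory's modification time with the one it saw earlier. The
timeline values and function names are hypothetical.

```python
NS_PER_SEC = 1_000_000_000


def stored_mtime(event_ns, granularity_ns):
    """Timestamp as a filesystem would record it: rounded down to its granularity."""
    return event_ns - (event_ns % granularity_ns)


def client_misses_second_change(granularity_ns):
    """Return True if a client that revalidates by mtime keeps a stale listing."""
    # Hypothetical timeline, in nanoseconds since an arbitrary start.
    first_change = 10 * NS_PER_SEC + 200_000_000   # t = 10.2 s
    second_change = 10 * NS_PER_SEC + 700_000_000  # t = 10.7 s

    # The client lists the directory after the first change and remembers its mtime.
    cached_mtime = stored_mtime(first_change, granularity_ns)
    # The directory changes again; later the client only re-checks the mtime.
    current_mtime = stored_mtime(second_change, granularity_ns)
    # Equal timestamps look like "nothing changed", so the stale cache is kept.
    return cached_mtime == current_mtime
```

With a granularity of one second (NS_PER_SEC) both changes are
recorded as t = 10 s and the function returns True; with nanosecond
granularity (1) the timestamps differ and it returns False.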

Cheers,
--
Jeff Layton <[email protected]>

2008-06-04 12:10:05

by Alexander Borghgraef

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Mon, Jun 2, 2008 at 3:43 PM, Jeff Layton <[email protected]> wrote:
> On Mon, 2 Jun 2008 15:05:17 +0200
> "Alexander Borghgraef" <[email protected]> wrote:
>
>> Nobody? Anyone care to tell me how to interpret the strace stat cur output?
>>
>
>> lstat64("cur", 0xbfb81cb4) = -1 ENOENT (No such file or directory)
>
> File doesn't exist...
>
> If this is from "ls -l" or something like that, that means that the
> client did a READDIR or READDIRPLUS and saw a "cur" entry in the
> directory with a particular filehandle. It then went back and did a
> stat() against that filehandle and it was gone. The two possibilities
> are that something removed that directory in the interim (possibly
> replacing it with a new "cur" directory), or that the filehandle was
> bad for some reason. I'm not aware of any bugs causing the latter, so
> the former is the most likely.

So it's possible that kmail in syncing accesses the cur directory,
reads it, and then removes and replaces the directory before all of
the read operation's actions are executed due to the difference in
time granularity between nfs and ext3? If so, should I file this as a
bug report to the kdepim people? I've looked a bit into the kmail
code, and I traced the error message to an access() call (from unistd.h)
on the directory's path which fails, but that probably just notices
the problem instead of causing it. I haven't really figured out how
their syncing process works.


--
Alex Borghgraef

2008-06-04 12:39:29

by Jeff Layton

Subject: Re: Nfs filesystem corruption(?) after kmail crash

On Wed, 4 Jun 2008 14:10:01 +0200
"Alexander Borghgraef" <[email protected]> wrote:

> On Mon, Jun 2, 2008 at 3:43 PM, Jeff Layton <[email protected]> wrote:
> > On Mon, 2 Jun 2008 15:05:17 +0200
> > "Alexander Borghgraef" <[email protected]> wrote:
> >
> >> Nobody? Anyone care to tell me how to interpret the strace stat cur output?
> >>
> >
> >> lstat64("cur", 0xbfb81cb4) = -1 ENOENT (No such file or directory)
> >
> > File doesn't exist...
> >
> > If this is from "ls -l" or something like that, that means that the
> > client did a READDIR or READDIRPLUS and saw a "cur" entry in the
> > directory with a particular filehandle. It then went back and did a
> > stat() against that filehandle and it was gone. The two possibilities
> > are that something removed that directory in the interim (possibly
> > replacing it with a new "cur" directory), or that the filehandle was
> > bad for some reason. I'm not aware of any bugs causing the latter, so
> > the former is the most likely.
>
> So it's possible that kmail in syncing accesses the cur directory,
> reads it, and then removes and replaces the directory before all of
> the read operation's actions are executed due to the difference in
> time granularity between nfs and ext3? If so, should I file this as a
> bug report to the kdepim people? I've looked a bit into the kmail
> code, and I traced the error message to an access (from unistd.h) call
> on the directories path which fails, but that probably just notices
> the problem instead of causing it. I haven't really figured out how
> their syncing process works.
>

My suspicion would be rather that this directory is being removed by a
process on a different client (or maybe the server). If this directory
is only being changed by the client itself, then something is definitely
not working right. The client should generally be aware of changes that
it makes itself.

I doubt this is a userspace bug, per se, though there are certainly
ways to write userspace code that are more friendly to NFS. My
suggestion would be to see about getting some network captures and
determine at what point the filehandle is changing when this happens.

An even better thing would be to track down a way to reliably reproduce
this. With that we could offer a more comprehensive explanation.
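
As a rough userspace aid while hunting for a reproducer, something
like the following Python sketch could be left running on the client
next to kmail. It polls a path and records when the (device, inode)
pair behind it changes or disappears. The inode number is only a loose
stand-in for the NFS filehandle, and the client's attribute caching
may delay what it sees, so this does not replace a network capture.
All function names here are made up for the sketch.

```python
import os
import time


def identity(path, lstat=os.lstat):
    """Return (st_dev, st_ino) for path, or None if it does not exist right now."""
    try:
        st = lstat(path)
    except FileNotFoundError:
        return None
    return (st.st_dev, st.st_ino)


def transitions(samples):
    """Given [(timestamp, identity), ...], keep only samples where identity changed."""
    changes = []
    previous = object()  # sentinel that never equals a real sample
    for when, ident in samples:
        if ident != previous:
            changes.append((when, ident))
            previous = ident
    return changes


def watch(path, count, interval=0.1):
    """Poll path `count` times and return the points where its identity changed."""
    samples = []
    for _ in range(count):
        samples.append((time.time(), identity(path)))
        time.sleep(interval)
    return transitions(samples)
```

For example, watch("Mail/inbox/cur", 600) would sample for about a
minute. A result with more than one entry means the directory vanished
or was replaced during that window, and the timestamps can be lined up
against a capture taken at the same time.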

--
Jeff Layton <[email protected]>