2008-05-20 13:02:10

by Trond Myklebust

Subject: Re: [NFS] async vs. sync. corrupted fs

On Tue, 2008-05-20 at 10:00 +0200, Chris Fanning wrote:
> On Mon, May 19, 2008 at 4:54 PM, J. Bruce Fields <[email protected]> wrote:
> > On Fri, May 16, 2008 at 12:55:59PM +0200, Chris Fanning wrote:
> >> Hello all,
> >>
> >> Could someone help me out?
> >>
> >> I'm exporting a whole partition to a client.
> >> Using (rw,no_root_squash,sync, no_subtree_check), when I shut down the
> >> server _before_ the client, the filesystem on the partition becomes
> >> corrupted.
> >
> > How do you know it becomes corrupted?
> >
> When I boot the server it reports the partition as corrupted and I
> need to perform an fsck.
>
> >> When I use async, it doesn't happen.
> >
> > "async" the client-side mount option, or "async" the server-side export
> > option?
> server export.
> (rw,no_root_squash,sync, no_subtree_check)
>
> client mount (fstab)
> nfs_home_server:/home /home nfs tcp,rw 0 0
>
> >
> >> Is this normal?
> >
> > No.
> I didn't think so ;)
>
> Perhaps I can shed some more light on this. The client is diskless.
> The problems I'm reporting are with its /home mount. Its root
> filesystem is also an NFS mount (but that is exported from a
> _different_ NFS server).
> I modify the nfsroot using chroot on the server.
>
> I *think* this might be because /proc, mounted in the chroot, gets
> exported to the client? Any insight?
>
> I tried all sorts of combinations yesterday (/proc mounted
> and not mounted in the chroot) and I haven't been able to reproduce
> the error. Strange, because it definitely happened when I was shutting
> down the nfs_home_server while everything else was up and running.
>
> Perhaps it is also worth noting that I'm booting the client with
> live-initramfs. /home gets mounted after the real
> filesystem (aufs, nfs+ram) comes up.
> Once the client has booted I can use /home. But it doesn't show up
> using 'df', and it is not present in /etc/mtab.
> However, it does show up in /proc/mounts:
> nfs_home_server:/home /home nfs rw,vers=3,rsize=32768,wsize=23768,hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.2.2 0 0
>
> Sorry for the convoluted reply.
> Thanks.
> Chris.
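
On the df and /etc/mtab point in the quoted text: /proc/mounts is the
kernel's own view of the mount table, while /etc/mtab is a file kept up
to date by the userspace mount tools, and df on systems of that era
typically reads the latter. A mount done from an initramfs can therefore
be live without ever being recorded in /etc/mtab. The following is only
a minimal sketch of how the two tables could be compared. The function
names and the sample /etc/mtab line are made up for illustration; the
/proc/mounts entry for /home is the one quoted above.

```python
def parse_mounts(text):
    """Parse mounts-style text into {mount_point: (device, fstype, options)}.

    `options` is a dict; flags without a value (e.g. "hard") map to "".
    """
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank, commented, or incomplete lines.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        device, mount_point, fstype, options = fields[:4]
        opts = {}
        for item in options.split(","):
            key, _, value = item.partition("=")
            opts[key] = value
        table[mount_point] = (device, fstype, opts)
    return table


def missing_from_mtab(proc_mounts_text, mtab_text):
    """Return mount points the kernel knows about but /etc/mtab does not list."""
    kernel_view = parse_mounts(proc_mounts_text)
    userspace_view = parse_mounts(mtab_text)
    return sorted(mp for mp in kernel_view if mp not in userspace_view)


# Sample data. The root filesystem lines are hypothetical; the /home
# entry is copied from the message above.
PROC_MOUNTS_SAMPLE = (
    "none / aufs rw 0 0\n"
    "nfs_home_server:/home /home nfs rw,vers=3,rsize=32768,wsize=23768,"
    "hard,nointr,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.2.2 0 0\n"
)
MTAB_SAMPLE = "none / aufs rw 0 0\n"

if __name__ == "__main__":
    for mount_point in missing_from_mtab(PROC_MOUNTS_SAMPLE, MTAB_SAMPLE):
        print("mounted but not in mtab:", mount_point)
```

On a real client the two strings would instead be read from /proc/mounts
and /etc/mtab. A missing mtab entry only explains why df does not list
/home; it says nothing about the corruption on the server.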

The NFS client is incapable of physically corrupting the disk on the
server. The reason is simple: there are no NFS RPC calls that can access
the raw disk.
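
To make that concrete for the vers=3 mount shown in the quoted text,
here is a small sketch listing the NFS version 3 procedures as numbered
in RFC 1813. Every procedure other than NULL takes a file handle as an
argument and operates on a file, directory, or filesystem object; none
takes a block or sector address. The Python names below
(NFS3_PROCEDURES, MODIFYING, procedure_number) are chosen for this
illustration only and are not part of any NFS implementation.

```python
# NFSv3 procedures, indexed by procedure number (RFC 1813).
NFS3_PROCEDURES = [
    "NULL", "GETATTR", "SETATTR", "LOOKUP", "ACCESS", "READLINK",
    "READ", "WRITE", "CREATE", "MKDIR", "SYMLINK", "MKNOD",
    "REMOVE", "RMDIR", "RENAME", "LINK", "READDIR", "READDIRPLUS",
    "FSSTAT", "FSINFO", "PATHCONF", "COMMIT",
]

# Procedures that change file data or namespace on the server. All of
# them go through the server's filesystem, never around it.
MODIFYING = {
    "SETATTR", "WRITE", "CREATE", "MKDIR", "SYMLINK", "MKNOD",
    "REMOVE", "RMDIR", "RENAME", "LINK", "COMMIT",
}


def procedure_number(name):
    """Return the RFC 1813 procedure number for a procedure name."""
    return NFS3_PROCEDURES.index(name)


if __name__ == "__main__":
    for number, name in enumerate(NFS3_PROCEDURES):
        kind = "modifies" if name in MODIFYING else "read-only"
        print(f"{number:2d} {name:<12} {kind}")
```

Whatever the client sends, the server-side filesystem code is what
actually writes to the disk.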

You should therefore rather be taking a long hard look at the server
side, to see if there might be anything going on there that can explain
it.
One issue to look at would be stack corruption: we know that some
combinations of soft raid + lvm + xfs can produce very deep call stacks,
which can lead to stack overflows which might be causing corruption.
You could also try running something like memtest86 to ensure that you
don't have faulty memory.

Cheers
Trond


-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that [email protected] is being discontinued.
Please subscribe to [email protected] instead.
http://vger.kernel.org/vger-lists.html#linux-nfs



2008-05-20 14:05:40

by Marcelo Leal

Subject: Re: [NFS] async vs. sync. corrupted fs

We have already seen a full XFS filesystem on an NFS server (2.4 kernel)
produce disk corruption (the partition table goes away).
But believe me, the corruption is not on the full disk but on another
one (we can reproduce it). It is always the same disk that gets
corrupted (the first one, I guess).

Leal.

--
pOSix rules
