2005-03-14 21:25:38

by Ara.T.Howard

Subject: binaries becoming corrupt on nfs


we are seeing some really bizarre behaviour on our nfs systems.
essentially a system will hum along nicely, running binaries from our nfs
server without issue. for no apparent reason these binaries suddenly become
corrupt on the client side and stop working. running md5sum on the affected
binary on a 'good' host and a 'bad' one shows them to, in fact, be different.
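
A minimal sketch of that checksum comparison, assuming GNU md5sum and assuming a copy of the binary from a 'good' host and one from a 'bad' host are both readable as local files. The helper name same_md5 and the demo file names are hypothetical.

```shell
#!/bin/sh
# same_md5 FILE_A FILE_B (hypothetical helper)
# Exit 0 when both files have the same md5 checksum, 1 when they differ,
# 2 when either file cannot be read.
same_md5() {
    [ -r "$1" ] && [ -r "$2" ] || return 2
    sum_a=$(md5sum < "$1" | cut -d' ' -f1)
    sum_b=$(md5sum < "$2" | cut -d' ' -f1)
    [ "$sum_a" = "$sum_b" ]
}

# Demo with two throwaway files standing in for the good and bad copies.
demo_dir=$(mktemp -d)
printf 'same bytes\n' > "$demo_dir/good"
printf 'same bytes\n' > "$demo_dir/bad"
if same_md5 "$demo_dir/good" "$demo_dir/bad"; then
    echo "checksums match"
else
    echo "checksums differ"
fi
rm -rf "$demo_dir"
```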

doing an unmount and remount fixes the issue. obviously so does a reboot.
both are temporary fixes though - eventually a node will start getting corrupt
binaries - or perhaps not.

the server is not under undue stress as it serves only code and no data
traffic is hitting it (we use vsftp to move data around). none of the
machines seems to be logging any errors - neither server nor client. all of
our systems are the same:

~ > uname -srm
Linux 2.4.21-27.0.2.EL i686

~ > cat /etc/redhat-release
Red Hat Enterprise Linux WS release 3 (Taroon Update 4)

~ > cat /proc/cpuinfo | grep model
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz
model : 2
model name : Intel(R) Xeon(TM) CPU 2.80GHz

~ > free -b
             total       used       free     shared    buffers     cached
Mem:    4082057216 4040855552   41201664          0   16977920 3698454528
-/+ buffers/cache:  325423104 3756634112
Swap:   6325055488   96333824 6228721664

~ > rpm -qa | grep nfs
redhat-config-nfs-1.0.13-6
nfs-utils-1.0.6-33EL

all the machines are on the same subnet with one hop to the nfs server.

has anyone seen this behaviour? any ideas what the issue might be? we cannot
be certain but think the issue is associated with the latest kernel. the
reason we cannot be certain is that we've not been running much for the last
few weeks and just started seeing the problem - we booted to the latest kernel
about a month ago.

i'm not even sure where to start looking here but the symptoms seem to point
to some sort of client side caching issue... any input appreciated.

kind regards.

-a
--
===============================================================================
| EMAIL :: Ara [dot] T [dot] Howard [at] noaa [dot] gov
| PHONE :: 303.497.6469
| When you do something, you should burn yourself completely, like a good
| bonfire, leaving no trace of yourself. --Shunryu Suzuki
===============================================================================


-------------------------------------------------------
SF email is sponsored by - The IT Product Guide
Read honest & candid reviews on hundreds of IT Products from real users.
Discover which products truly live up to the hype. Start reading now.
http://ads.osdn.com/?ad_id=6595&alloc_id=14396&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2005-03-14 21:31:55

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 14:25 (-0700), Ara.T.Howard wrote:
> we are seeing some really bizarre behaviour on our nfs systems.
> essentially a system will hum along nicely, running binaries from our nfs
> server without issue. for no apparent reason these binaries suddenly become
> corrupt on the client side and stop working. running md5sum on the affected
> binary on a 'good' host and a 'bad' one shows them to, in fact, be different.
>
> doing an unmount and remount fixes the issue. obviously so does a reboot.
> both are temporary fixes though - eventually a node will start getting corrupt
> binaries - or perhaps not.

Do you perhaps have some cronjob or something that is updating the
binaries on the server?

Cheers,
Trond
--
Trond Myklebust <[email protected]>




2005-03-14 21:35:15

by Neil Horman

Subject: Re: binaries becoming corrupt on nfs

On Mon, Mar 14, 2005 at 02:25:30PM -0700, Ara.T.Howard wrote:
>
> we are seeing some really bizarre behaviour on our nfs systems.
> essentially a system will hum along nicely, running binaries from our nfs
> server without issue. for no apparent reason these binaries suddenly become
> corrupt on the client side and stop working. running md5sum on the affected
> binary on a 'good' host and a 'bad' one shows them to, in fact, be
> different.
>
> doing an unmount and remount fixes the issue. obviously so does a reboot.
> both are temporary fixes though - eventually a node will start getting
> corrupt binaries - or perhaps not.
>
> the server is not under undue stress as it serves only code and no data
> traffic is hitting it (we use vsftp to move data around). none of the
> machines seems to be logging any errors - neither server nor client. all of
> our systems are the same:
>
> ~ > uname -srm
> Linux 2.4.21-27.0.2.EL i686
>
> ~ > cat /etc/redhat-release
> Red Hat Enterprise Linux WS release 3 (Taroon Update 4)
>
If you're only serving code can you try mounting the share Read Only from all
your clients?

Neil


--
/***************************************************
*Neil Horman
*Software Engineer
*Red Hat, Inc.
*[email protected]
*gpg keyid: 1024D / 0x92A74FA1
*http://pgp.mit.edu
***************************************************/



2005-03-14 21:40:17

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005, Trond Myklebust wrote:

> On Mon, 14 Mar 2005 at 14:25 (-0700), Ara.T.Howard wrote:
>> we are seeing some really bizarre behaviour on our nfs systems.
>> essentially a system will hum along nicely, running binaries from our nfs
>> server without issue. for no apparent reason these binaries suddenly become
>> corrupt on the client side and stop working. running md5sum on the affected
>> binary on a 'good' host and a 'bad' one shows them to, in fact, be different.
>>
>> doing an unmount and remount fixes the issue. obviously so does a reboot.
>> both are temporary fixes though - eventually a node will start getting corrupt
>> binaries - or perhaps not.
>
> Do you perhaps have some cronjob or something that is updating the binaries
> on the server?

absolutely nothing. bear in mind we are not seeing stale file handles - the
binaries are truly corrupt. very, very weird things will happen:

* maybe the binaries core dump on startup
* maybe it runs, but errors in strange ways
* maybe it runs, but core dumps
* sometimes it can be loaded into a debugger - sometimes not

never is it given stale file handles though...

btw. i forgot to show our mount options:

nfs bg,rw,hard,intr,rsize=8192,wsize=8192

and that's it.

ps. i just had to point out to our sysads, who are big fans of rhn, that you
just got back to me in under two minutes! amazing ;-) !

kind regards.

-a

2005-03-14 21:44:00

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 [email protected] wrote:

> On Mon, Mar 14, 2005 at 02:25:30PM -0700, Ara.T.Howard wrote:
>>
>> we are seeing some really bizarre behaviour on our nfs systems.
>> essentially a system will hum along nicely, running binaries from our nfs
>> server without issue. for no apparent reason these binaries suddenly become
>> corrupt on the client side and stop working. running md5sum on the affected
>> binary on a 'good' host and a 'bad' one shows them to, in fact, be
>> different.
>>
>> doing an unmount and remount fixes the issue. obviously so does a reboot.
>> both are temporary fixes though - eventually a node will start getting
>> corrupt binaries - or perhaps not.
>>
>> the server is not under undue stress as it serves only code and no data
>> traffic is hitting it (we use vsftp to move data around). none of the
>> machines seems to be logging any errors - neither server nor client. all of
>> our systems are the same:
>>
>> ~ > uname -srm
>> Linux 2.4.21-27.0.2.EL i686
>>
>> ~ > cat /etc/redhat-release
>> Red Hat Enterprise Linux WS release 3 (Taroon Update 4)
>>

> If you're only serving code can you try mounting the share Read Only from
> all your clients?

well - almost only code. sometimes we have a little sqlite database which is
being used as a job queue and this is being written to. it's not at the
moment however so we could try this in the short term for testing...

one other tidbit. last week i froze one of our boxes a few times in a row by
simply doing a compile in an nfs mounted directory (writing lots of files). i
could reproduce this at will - but on no other boxes that i tested. now that
i see that boxes are failing randomly i think i should have tested this on
more boxes - i only tried two - but it's a pain because it causes the box to
hang on any operation related to nfs and then it needs to be rebooted....

cheers.

-a

2005-03-14 21:47:50

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 14:40 (-0700), Ara.T.Howard wrote:

> > Do you perhaps have some cronjob or something that is updating the binaries
> > on the server?
>
> absolutely nothing. bear in mind we are not seeing stale file handles - the
> binaries are truly corrupt. very, very weird things will happen:


That was why I asked. If you update the binaries by copying into them
(not renaming + creating new file), then strange things will happen: you
will not see ESTALE, but you will usually see cache corruption.

The obvious and easy way to detect if this is the case, is to look at
the ctime on the file in question.
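
One way to look at the ctime, sketched under the assumption of GNU coreutils stat (its -c format option). The helper name show_times is hypothetical. A ctime later than the last intended install suggests that something has rewritten the file, or changed its inode metadata, since then.

```shell
#!/bin/sh
# show_times FILE... (hypothetical helper, GNU stat assumed)
# Print mtime and ctime as seconds since the epoch, followed by the file name.
show_times() {
    stat -c 'mtime=%Y ctime=%Z %n' "$@"
}

# Demo on a throwaway file: overwriting it in place moves ctime forward
# (or leaves it equal if both writes land within the same second).
demo_file=$(mktemp)
show_times "$demo_file"
printf 'new contents\n' > "$demo_file"
show_times "$demo_file"
rm -f "$demo_file"
```

Running the helper against the same path on the server and on an affected client would show whether the two sides agree about when the file last changed.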

> * maybe the binaries core dump on startup
> * maybe it runs, but errors in strange ways
> * maybe it runs, but core dumps
> * sometimes it can be loaded into a debugger - sometimes not

Have you looked at a hexdump of bad copy vs. good copy and done a diff?
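
A rough sketch of such a comparison using only cmp, od and diff, assuming the good and bad copies have both been saved as local files. The helper names count_diff_bytes and hexdiff are hypothetical.

```shell
#!/bin/sh
# count_diff_bytes GOOD BAD (hypothetical helper)
# Print how many byte positions differ. cmp -l prints one line per differing
# byte; bytes past the end of the shorter file are not counted.
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l
}

# hexdiff GOOD BAD (hypothetical helper)
# Diff the hex dumps of two files. Exit status is diff's: 0 same, 1 different.
hexdiff() {
    dump_a=$(mktemp) || return 2
    dump_b=$(mktemp) || { rm -f "$dump_a"; return 2; }
    od -A x -t x1 -v "$1" > "$dump_a"
    od -A x -t x1 -v "$2" > "$dump_b"
    diff "$dump_a" "$dump_b" && status=0 || status=$?
    rm -f "$dump_a" "$dump_b"
    return "$status"
}
```

The offsets printed by hexdiff show whether the damage is scattered or falls on page-sized runs, which may help tell a stale cached page from random memory corruption.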

Cheers,
Trond
--
Trond Myklebust <[email protected]>




2005-03-14 21:59:56

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005, Trond Myklebust wrote:

> That was why I asked. If you update the binaries by copying into them (not
> renaming + creating new file), then strange things will happen: you will not
> see ESTALE, but you will usually see cache corruption.

hmmm. i HAVE compiled these binaries and copied them up - but last week.
could that make the cache so sick that it would not recover?

> The obvious and easy way to detect if this is the case, is to look at the
> ctime on the file in question.

then why should it corrupt the cache? i mean - if it's easy to see, why
wouldn't the nfs code see this and invalidate its cache? i understand this
could only happen based on validity of the inode cache - but this goes stale
every 60 seconds (or something) so it seems this should sort itself out in
time. the problem we are seeing persists forever until remount...

>> * maybe the binaries core dump on startup
>> * maybe it runs, but errors in strange ways
>> * maybe it runs, but core dumps
>> * sometimes it can be loaded into a debugger - sometimes not
>
> Have you looked at a hexdump of bad copy vs. good copy and done a diff?

not yet - we just noticed that the md5sums were actually different!

cheers.

-a

2005-03-14 22:05:04

by Bernd Schubert

Subject: Re: binaries becoming corrupt on nfs

On Monday 14 March 2005 22:47, Trond Myklebust wrote:
> On Mon, 14 Mar 2005 at 14:40 (-0700), Ara.T.Howard wrote:
> > > Do you perhaps have some cronjob or something that is updating the
> > > binaries on the server?
> >
> > absolutely nothing. bear in mind we are not seeing stale file handles -
> > the binaries are truly corrupt. very, very weird things will happen:
>
> That was why I asked. If you update the binaries by copying into them
> (not renaming + creating new file), then strange things will happen: you
> will not see ESTALE, but you will usually see cache corruption.

Shouldn't this be prevented by inode generation numbers?

>
> The obvious and easy way to detect if this is the case, is to look at
> the ctime on the file in question.
>
> > * maybe the binaries core dump on startup
> > * maybe it runs, but errors in strange ways
> > * maybe it runs, but core dumps
> > * sometimes it can be loaded into a debugger - sometimes not

Does it happen on many machines or only on one of them? We also had this once
and in the end it was bad memory (though we were using ECC memory). I would
suggest running memtest86.


Cheers,
Bernd



2005-03-14 22:13:56

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 14:59 (-0700), Ara.T.Howard wrote:
> On Mon, 14 Mar 2005, Trond Myklebust wrote:
>
> > That was why I asked. If you update the binaries by copying into them (not
> > renaming + creating new file), then strange things will happen: you will not
> > see ESTALE, but you will usually see cache corruption.
>
> hmmm. i HAVE compiled these binaries and copied them up - but last week.
> could that make the cache so sick that it would not recover?

On Linux-2.4.x, mmap() will prevent the NFS client from clearing the
cache in a timely fashion, and so pages that should have been thrown out
of cache may become "frozen in" to the cache.

In Linux-2.6.x, that problem should hopefully have been fixed due to a
combination of updates to the memory management layer + NFS client
fixes.

Cheers,
Trond

--
Trond Myklebust <[email protected]>




2005-03-14 22:15:38

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 23:05 (+0100), Bernd Schubert wrote:
> On Monday 14 March 2005 22:47, Trond Myklebust wrote:
> > On Mon, 14 Mar 2005 at 14:40 (-0700), Ara.T.Howard wrote:
> > > > Do you perhaps have some cronjob or something that is updating the
> > > > binaries on the server?
> > >
> > > absolutely nothing. bear in mind we are not seeing stale file handles -
> > > the binaries are truly corrupt. very, very weird things will happen:
> >
> > That was why I asked. If you update the binaries by copying into them
> > (not renaming + creating new file), then strange things will happen: you
> > will not see ESTALE, but you will usually see cache corruption.
>
> Shouldn't this be prevented by inode generation numbers?

No. Inode generation numbers are updated only on file creation. They are
not updated if you just open the existing file and update it (otherwise
you would see filehandles change every time you write to a file
<shudder>).
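
A local illustration of that difference, sketched with GNU stat -c %i on a scratch directory rather than on NFS. The generation number itself is not visible from the shell, so the inode number stands in here for "same file object". Copying into an existing file keeps the inode, while writing a new file and renaming it over the old name installs a different one. All file names are hypothetical.

```shell
#!/bin/sh
# Print the inode number of a file (GNU stat assumed).
inode_of() { stat -c '%i' "$1"; }

dir=$(mktemp -d)
printf 'version 1\n' > "$dir/a.out"
printf 'version 2\n' > "$dir/new.out"
before=$(inode_of "$dir/a.out")

# Update in place: cp opens the existing destination and rewrites it.
cp "$dir/new.out" "$dir/a.out"
after_cp=$(inode_of "$dir/a.out")

# Update by rename: write a new file, then rename it over the old name.
cp "$dir/new.out" "$dir/a.out.tmp"
mv "$dir/a.out.tmp" "$dir/a.out"
after_mv=$(inode_of "$dir/a.out")

echo "before=$before after_cp=$after_cp after_mv=$after_mv"
rm -rf "$dir"
```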

Cheers,
Trond
--=20
Trond Myklebust <[email protected]>




2005-03-14 22:24:45

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005, Trond Myklebust wrote:

> On Mon, 14 Mar 2005 at 14:59 (-0700), Ara.T.Howard wrote:
>> On Mon, 14 Mar 2005, Trond Myklebust wrote:
>>
>>> That was why I asked. If you update the binaries by copying into them (not
>>> renaming + creating new file), then strange things will happen: you will not
>>> see ESTALE, but you will usually see cache corruption.
>>
>> hmmm. i HAVE compiled these binaries and copied them up - but last week.
>> could that make the cache so sick that it would not recover?
>
> On Linux-2.4.x, mmap() will prevent the NFS client from clearing the cache
> in a timely fashion, and so pages that should have been thrown out of cache
> may become "frozen in" to the cache.

you mean ANY mmap - or one of the file in question? i do a LOT of memory
mapping - sometimes of nfs mounted files, though generally only when testing
some command or something. our production jobs always copy data locally
before working on it... now i'm worried that i could cause this by mmap'ing
nfs mounted files!

> In Linux-2.6.x, that problem should hopefully have been fixed due to a
> combination of updates to the memory management layer + NFS client fixes.

man - why can't the redhat guys send out a 'modern' kernel - it's not like we
aren't paying for it... arghh.

thanks for helping out (again.)

-a

2005-03-14 22:49:45

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 15:24 (-0700), Ara.T.Howard wrote:
> > On Linux-2.4.x, mmap() will prevent the NFS client from clearing the cache
> > in a timely fashion, and so pages that should have been thrown out of cache
> > may become "frozen in" to the cache.
>
> you mean ANY mmap - or one of the file in question? i do a LOT of memory
> mapping - sometimes of nfs mounted files, though generally only when testing
> some command or something. our production jobs always copy data locally
> before working on it... now i'm worried that i could cause this by mmap'ing
> nfs mounted files!

I mean an mmap of the file in question. In the case of an executable
then you should note that the kernel will mmap() the file whenever you
run it.
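
To see which processes on a client currently have a given file mapped, one option on Linux is to scan the per-process maps files under /proc. A rough sketch, assuming GNU readlink -f is available. The helper name pids_mapping is hypothetical, it uses a loose substring match, and it only sees processes whose maps files the caller is allowed to read.

```shell
#!/bin/sh
# pids_mapping FILE (hypothetical helper, Linux /proc assumed)
# Print the PID of every visible process that has FILE mmap'd.
pids_mapping() {
    target=$(readlink -f "$1") || return 2
    for maps in /proc/[0-9]*/maps; do
        # Loose substring match on the pathname column; unreadable or
        # vanished maps files are skipped silently.
        if grep -qF " $target" "$maps" 2>/dev/null; then
            pid=${maps#/proc/}
            echo "${pid%/maps}"
        fi
    done
}
```

Running it against a binary on the NFS mount before reinstalling that binary would show whether any process on the node still has it mapped.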

Cheers,
Trond
--
Trond Myklebust <[email protected]>




2005-03-14 23:00:31

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005, Trond Myklebust wrote:

> On Mon, 14 Mar 2005 at 15:24 (-0700), Ara.T.Howard wrote:
>>> On Linux-2.4.x, mmap() will prevent the NFS client from clearing the cache
>>> in a timely fashion, and so pages that should have been thrown out of cache
>>> may become "frozen in" to the cache.
>>
>> you mean ANY mmap - or one of the file in question? i do a LOT of memory
>> mapping - sometimes of nfs mounted files, though generally only when testing
>> some command or something. our production jobs always copy data locally
>> before working on it... now i'm worried that i could cause this by mmap'ing
>> nfs mounted files!
>
> I mean an mmap of the file in question. In the case of an executable
> then you should note that the kernel will mmap() the file whenever you
> run it.

o.k. - so my mmaps of data are at most stupid, and this is out of my control.
i put together what you meant while i was waiting for your response. it seems
the right answer then is to install using

cp a.out bindir/a.out.tmp && mv bindir/a.out.tmp bindir/a.out

or perhaps 'install' is good enough since it unlinks the dest first. boy -
you learn something every day.
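
The copy-then-rename install can be wrapped in a small helper. A sketch, assuming the temporary file lives in the same directory as the destination so that mv is a plain rename. The name install_atomic, the .tmp.$$ suffix and the 755 mode are hypothetical choices.

```shell
#!/bin/sh
# install_atomic SRC DEST (hypothetical helper)
# Copy SRC next to DEST under a temporary name, then rename it into place.
install_atomic() {
    src=$1
    dest=$2
    tmp="$dest.tmp.$$"
    # Write the new contents under a temporary name first ...
    cp "$src" "$tmp" || { rm -f "$tmp"; return 1; }
    chmod 755 "$tmp" || { rm -f "$tmp"; return 1; }
    # ... then rename it into place; the old file is never written to.
    mv -f "$tmp" "$dest"
}
```

A call might look like `install_atomic subset/subset "$bindir/subset"`, with bindir standing for whatever directory the binaries are installed into.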

of course the best thing would be to move to a 2.6 kernel - gotta love the
fact that we are paying for a 2.4 one...

thanks for the help - it must be quite late there.

cheers.

-a

2005-03-14 23:07:41

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005 at 16:00 (-0700), Ara.T.Howard wrote:

> cp a.out bindir/a.out.tmp && mv bindir/a.out.tmp bindir/a.out
>

"mv -b" is probably better. That won't disturb any clients that are
running the current version of a.out.
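
A sketch of the install step with a backup kept, assuming GNU mv, where -b renames an existing destination to a backup name (by default with a ~ suffix) before moving the new file in. The name install_keep_old is hypothetical.

```shell
#!/bin/sh
# install_keep_old SRC DEST (hypothetical helper, GNU mv assumed)
# Install SRC as DEST while keeping the previous DEST as a backup file.
install_keep_old() {
    tmp="$2.tmp.$$"
    cp "$1" "$tmp" || { rm -f "$tmp"; return 1; }
    # -b: keep the previous file under a backup name instead of replacing it
    mv -b "$tmp" "$2"
}
```

With the default settings the previous binary stays on the server as a.out~, so the file that running clients have mapped still exists after the install.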

Cheers,
Trond
--
Trond Myklebust <[email protected]>




2005-03-16 22:41:03

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Mon, 14 Mar 2005, Trond Myklebust wrote:

> On Linux-2.4.x, mmap() will prevent the NFS client from clearing the cache
> in a timely fashion, and so pages that should have been thrown out of cache
> may become "frozen in" to the cache.
>
> In Linux-2.6.x, that problem should hopefully have been fixed due to a
> combination of updates to the memory management layer + NFS client fixes.

trond-

i'm still seeing this issue even though NO copying is occurring on mmap'd
binaries. the process used is now the built-in install program

install: all
$(install_prog) grid_ols/grid_ols $(bindir)
$(install_prog) subset/subset $(bindir)

install does not copy, it unlinks the dest and then writes a new file:

jib:~/shared/dmspnl_new > strace install a b 2>&1 | tail -13
unlink("b") = 0
open("a", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
open("b", O_WRONLY|O_CREAT|O_LARGEFILE, 0100664) = 4
fstat64(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
read(3, "", 8192) = 0
close(4) = 0
close(3) = 0
chmod("b", 0600) = 0
chown32("b", -1, -1) = 0
chmod("b", 0755) = 0
exit_group(0) = ?

if i run 'make install' while these binaries are running on our cluster
(almost ensuring more than one of them has the file mmap'd) i will see some
small random number of nodes with corrupt caches begin to have every
subsequent run of the binary fail.

this should not be - should it?

-a

2005-03-16 22:57:02

by Trond Myklebust

Subject: Re: binaries becoming corrupt on nfs

On Wed, 16 Mar 2005 at 15:40 (-0700), Ara.T.Howard wrote:
> i'm still seeing this issue even though NO copying is occurring on mmap'd
> binaries. the process used is now the built-in install program
>
> install: all
> $(install_prog) grid_ols/grid_ols $(bindir)
> $(install_prog) subset/subset $(bindir)
>
> install does not copy, it unlinks the dest and then writes a new file:
>
> jib:~/shared/dmspnl_new > strace install a b 2>&1 | tail -13
> unlink("b") = 0
> open("a", O_RDONLY|O_LARGEFILE) = 3
> fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
> open("b", O_WRONLY|O_CREAT|O_LARGEFILE, 0100664) = 4
> fstat64(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
> fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
> read(3, "", 8192) = 0
> close(4) = 0
> close(3) = 0
> chmod("b", 0600) = 0
> chown32("b", -1, -1) = 0
> chmod("b", 0755) = 0
> exit_group(0) = ?
>
> if i run 'make install' while these binaries are running on our cluster
> (almost ensuring more than one of them has the file mmap'd) i will see some
> small random number of nodes with corrupt caches begin to have every
> subsequent run of the binary fail.

How are they corrupt?

Cheers,
Trond

--
Trond Myklebust <[email protected]>




2005-03-17 01:19:16

by Ara.T.Howard

Subject: Re: binaries becoming corrupt on nfs

On Wed, 16 Mar 2005, Trond Myklebust wrote:

> On Wed, 16 Mar 2005 at 15:40 (-0700), Ara.T.Howard wrote:
>> i'm still seeing this issue even though NO copying is occurring on mmap'd
>> binaries. the process used is now the built-in install program
>>
>> install: all
>> $(install_prog) grid_ols/grid_ols $(bindir)
>> $(install_prog) subset/subset $(bindir)
>>
>> install does not copy, it unlinks the dest and then writes a new file:
>>
>> jib:~/shared/dmspnl_new > strace install a b 2>&1 | tail -13
>> unlink("b") = 0
>> open("a", O_RDONLY|O_LARGEFILE) = 3
>> fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
>> open("b", O_WRONLY|O_CREAT|O_LARGEFILE, 0100664) = 4
>> fstat64(4, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
>> fstat64(3, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
>> read(3, "", 8192) = 0
>> close(4) = 0
>> close(3) = 0
>> chmod("b", 0600) = 0
>> chown32("b", -1, -1) = 0
>> chmod("b", 0755) = 0
>> exit_group(0) = ?
>>
>> if i run 'make install' while these binaries are running on our cluster
>> (almost ensuring more than one of them has the file mmap'd) i will see some
>> small random number of nodes with corrupt caches begin to have every
>> subsequent run of the binary fail.
>
> How are they corrupt?

in 'impossible' and random ways - impossible values on the stack, corrupt
strings, core dumps - the md5sums are not right...

i think perhaps i am being stupid - after i sent this i realized

>> unlink("b") = 0

b is gone on server

>> open("b", O_WRONLY|O_CREAT|O_LARGEFILE, 0100664) = 4

b is mmap'd on client

>> write(4, "...", 8192) = 8192

b is being copied while mmap'd -> corruption!

does this make sense and i was just silly to think that install should work?

if i install using

cp a.out nfs/a.out.tmp && mv nfs/a.out.tmp nfs/a.out

it works - which leads me to believe so. so maybe i was just being dumb and
install never should have worked.

cheers.

-a