Hi all; we've been doing some examination of NFS client performance, and
have seen this apparently sub-optimal behavior: anyone here have any
comments on this observation or thoughts about it?
Thanks!
> I was looking at the code (which doesn't appear to be fixed in 2.6),
> and here's what I think it's doing.
>
> The Linux NFS client code is capable of only remembering a write to a
> single contiguous chunk in each 4KB page of a file. If a second
> non-contiguous write occurs to the page, the first write has to be
> flushed to the server. So if the view server is seeking around in a
> file writing a few bytes here and a few there, whenever it does a
> write to a page that has already been written, the first write has to
> be flushed to the server. The code that does most of this is in
> fs/nfs/write.c, function nfs_update_request(). That's the routine
> that, given a new write request, searches for an existing one that is
> contiguous with the new one. If it finds a contiguous request, the
> new one is coalesced with it and no NFS activity is required. If
> instead it finds a pending write to the same page, it returns EBUSY to
> its caller, which tells the caller (nfs_updatepage()) to synchronously
> write the existing request to the server.
>
> The Solaris NFS client (and probably most other NFS client
> implementations) doesn't work this way. Whenever a small write is
> made to a block of a file, the block is read from the server, and then
> the write is applied to the cached block. If the block is already in
> cache, the write is applied to the block without any NFS transactions
> being required. When the file is closed or fsync()ed, the entire
> block is written to the server. So the client code doesn't need to
> record each individual small write to a block, it just modifies the
> cached block as necessary.
>
> I don't know why the Linux NFS client doesn't work this way, but I
> think it wouldn't be difficult to make it work that way. In some
> scenarios, this might be a performance hit (because an entire block of
> the file has to be read from the server just to write a few bytes to
> it), but I think that in most cases, doing it the Solaris way would be
> a performance win. I'd guess that if any sort of database, such as a
> GDBM database, is accessed over NFS, the Linux method would result in
> much poorer performance than the Solaris method.
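For anyone who wants to see this for themselves, here is a minimal
user-space sketch of the access pattern described above (my own sketch,
not from ClearCase; it assumes a file called "testfile" on an NFS mount,
and you can watch "nfsstat -c" while it runs to count the WRITE RPCs):

    /* scatter_write.c -- writes a few bytes at several non-contiguous
     * offsets inside the same 4KB page of "testfile", then fsync()s.
     * Assumes "testfile" lives on an NFS mount. */
    #define _XOPEN_SOURCE 500   /* for pwrite() */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    int main(void)
    {
        off_t offsets[] = { 0, 1024, 2048, 3072 };  /* all within page 0 */
        int fd, i;

        fd = open("testfile", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (i = 0; i < 4; i++)
            if (pwrite(fd, "xx", 2, offsets[i]) != 2)
                perror("pwrite");
        fsync(fd);    /* push whatever is still cached out to the server */
        close(fd);
        return 0;
    }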
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT: HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
On Mon, 05/01/2004 at 17:11, Paul Smith wrote:
> Hi all; we've been doing some examination of NFS client performance, and
> have seen this apparently sub-optimal behavior: anyone here have any
> comments on this observation or thoughts about it?
Does your observation include numbers, or is it just conjecture?
If we actually are doing contiguous sequential writes, the Linux
implementation has the obvious advantage that it doesn't issue any
read requests to the server at all.
The Solaris approach is only a win in the particular case where you
are doing several non-contiguous writes into the same page (and
without any byte-range locking).
Note that, as with Linux, two processes that do not share the same RPC
credentials still cannot merge their writes, since you cannot rely on
them having the same file write permissions.
Looking at your particular example of GDBM, you should recall that
Solaris is forced to revert to uncached synchronous reads and writes
when doing byte range locking precisely because their page cache
writes back entire pages (the ugly alternative would be to demand that
byte range locks must be page-aligned).
OTOH Linux can continue to do cached asynchronous reads and writes
right up until the user forces an fsync()+cache invalidation by
changing the locking range because our page cache writes are not
required to be page aligned.
However, GDBM is a pretty poor example of a database. Large
professional databases will tend to want to use their own custom
locking protocols, and manage their caching entirely on their own
(using O_DIRECT in order to circumvent the kernel page cache). The
asynchronous read/write code doesn't apply at all in this case.
Finally, please note that you can actually obtain the Solaris
behaviour on the existing Linux NFS client, if you so desire, by
replacing write() with mmap().
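Roughly along these lines (just a sketch to illustrate the idea,
untested and with minimal error handling; it assumes the region you are
updating already exists in the file):

    /* Sketch: scattered small updates through mmap() instead of write().
     * Assumes the file is already at least 4096 bytes long. */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int update_page(const char *path)
    {
        int fd = open(path, O_RDWR);
        char *p;

        if (fd < 0)
            return -1;
        p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        /* These hit the cached page; no WRITE RPC per update. */
        memcpy(p + 10, "aa", 2);
        memcpy(p + 2000, "bb", 2);
        msync(p, 4096, MS_SYNC);  /* flush the whole page back once */
        munmap(p, 4096);
        close(fd);
        return 0;
    }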
Cheers,
Trond
paul-
large commercial databases write whole pages, and never
parts of pages, at once, to their data files. and they
write log files by extending them in a single write
request.
thus the single-write-request per-page limit is not a
problem for them.
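to illustrate the i/o pattern i mean, something like this (a sketch
only, not code from any real database engine):

    /* illustrative only -- whole-page, page-aligned writes to the data
     * file, and a single appending write to extend the log. */
    #define _XOPEN_SOURCE 500   /* for pwrite() */
    #include <sys/types.h>
    #include <unistd.h>

    #define DB_PAGE_SIZE 4096

    /* always a full page at a page-aligned offset, never a partial update */
    ssize_t db_write_page(int datafd, off_t pageno, const char *page)
    {
        return pwrite(datafd, page, DB_PAGE_SIZE, pageno * DB_PAGE_SIZE);
    }

    /* the log only grows: one contiguous write at the end of the file
     * (logfd is assumed to be open with O_APPEND) */
    ssize_t db_log_append(int logfd, const char *rec, size_t len)
    {
        return write(logfd, rec, len);
    }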
%% "Lever, Charles" <[email protected]> writes:
lc> large commercial databases write whole pages, and never
lc> parts of pages, at once, to their data files. and they
lc> write log files by extending them in a single write
lc> request.
lc> thus the single-write-request per-page limit is not a
lc> problem for them.
I'm sure you're correct, but in our environment (ClearCase) the usage
characteristics are very different.
I'm working on getting you some hard numbers but I think we do all agree
that for this particular use case as I've described it, the Linux method
would result in lower performance than the Sun method. I'm not saying
the Sun method is better in all cases, or even in most cases, I'm just
saying that for this particular usage we are seeing a performance
penalty on Linux.
The question is, is there anything to be done about this? Or is this
too much of a niche situation for the folks on this list to worry much
about?
I took Trond's comments on using mmap() to heart: in retrospect it
surprises me that they don't already use mmap() because I would think
that would give better performance. But in any case all we can do is
suggest this to IBM/Rational and a major change like that will be a long
time coming, even if they do accept it is a good idea.
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
%% Trond Myklebust <[email protected]> writes:
tm> On Mon, 05/01/2004 at 17:11, Paul Smith wrote:
>> Hi all; we've been doing some examination of NFS client performance, and
>> have seen this apparently sub-optimal behavior: anyone here have any
>> comments on this observation or thoughts about it?
tm> Does your observation include numbers, or is it just conjecture?
Note I'm only forwarding comments I've received from another party,
who's actually doing the investigation.
The application in question is not GDBM but rather ClearCase; I guess
there's some concern that the performance of ClearCase over NFS from a
Linux client seems to be noticeably less than from a Solaris client.
ClearCase uses a (heavily) customized version of the Raima database to
store its internal information.
I'll see if I can get any numbers.
tm> The Solaris approach is only a win in the particular case where
tm> you are doing several non-contiguous writes into the same page
tm> (and without any byte-range locking). Note that like Linux, two
tm> processes that do not share the same RPC credentials still cannot
tm> merge their writes since you cannot rely on them having the same
tm> file write permissions.
In this particular case that's fine since the ClearCase database is only
ever modified by one process on one system.
tm> Looking at your particular example of GDBM, you should recall that
tm> Solaris is forced to revert to uncached synchronous reads and writes
tm> when doing byte range locking precisely because their page cache
tm> writes back entire pages (the ugly alternative would be to demand that
tm> byte range locks must be page-aligned).
tm> OTOH Linux can continue to do cached asynchronous reads and writes
tm> right up until the user forces an fsync()+cache invalidation by
tm> changing the locking range because our page cache writes are not
tm> required to be page aligned.
Of course I don't have the source and I haven't sniffed the packets
personally so I don't know exactly what goes on, but it's quite possible
that the ClearCase implementation doesn't attempt to do any byte-range
locking at all, since they pretty much know that there is only one
process reading and writing the DB.
The reason for using NFS involves maintenance, backup, reliability,
etc. etc. (the NFS server is actually an EMC or NetApp NAS solution),
not necessarily file sharing, at least insofar as the DB itself is
concerned.
Thanks for the reply, Trond; I'll see if I can get more concrete
details and let you know.
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
ClearCase is a unique situation.
i would love an opportunity to work directly with the
Rational folks to make their products work well on
Linux NFS. my (limited) experience with ClearCase
is that it is not terribly NFS friendly.
Hi all;
Here are some numbers which show the difference we're talking about.
Any comments anyone has are welcome; I'll be happy to discuss the
details of ClearCase NFS usage insofar as I understand it:
Note that the Linux build system is also using the same kernel
(2.4.18-27 from Red Hat).
> I did two identical small builds on my Linux desktop machine, wcary472,
> using NAS viewstore on scareac0, one with the view server running on
> Linux machine zcard0pf, and the other with the view server on Solaris
> machine zcars0z4. I ran nfsstat on the view servers before and after
> the builds, after making sure that there was no (or very little) other
> activity on the machine that might have screwed up the numbers. The raw
> nfsstat output is below. Here's a summary of the important numbers:
>
> View server on Linux 2.4.18-27 (zcard0pf):
>
> Build time: 35.75s user 31.68s system 33% cpu 3:21.02 total
> RPC calls: 94922
> RPC retrans: 0
> NFS V3 WRITE: 63317
> NFS V3 COMMIT: 28916
> NFS V3 LOOKUP: 1067
> NFS V3 READ: 458
> NFS V3 GETATTR: 406
> NFS V3 ACCESS 0
> NFS V3 REMOVE 5
>
> View server on Solaris 5.8 (zcars0z4)
>
> Build time: 35.50s user 32.09s system 46% cpu 2:26.36 total
> NFS calls: 3785
> RPC retrans: 0
> NFS V3 WRITE: 612
> NFS V3 COMMIT: 7
> NFS V3 LOOKUP: 1986
> NFS V3 READ: 0
> NFS V3 GETATTR: 532
> NFS V3 ACCESS 291
> NFS V3 REMOVE 291
>
> NOTES:
>
> - Viewstore on Linux was mounted using UDP with rsize=wsize=4096
> Viewstore on Solaris was mounted using TCP with rsize=wsize=32768
> I would have changed these to be the same except that that would
> have required root access on one or the other view server machine,
> which I didn't have. Other experiments have shown that Linux does
> fewer WRITES with TCP/32K mounts, but it's still considerably worse
> than Solaris. As I mentioned, I don't know why increasing [rw]size
> and/or switching to TCP improves matters on Linux.
>
> - I don't know where those NFS REMOVEs are coming from, but I vaguely
> remember seeing the view server doing some weird things when I was
> able to strace it. It does bother me that there was a big difference
> in the number of REMOVEs done from Solaris vs. Linux; that might imply
> that there are important implementation differences between Linux
> and Solaris.
>
> So there were nearly 100K WRITES+COMMITS on Linux, but only a few hundred
> on Solaris. I didn't doublecheck it this time, but I know from past
> experience that most of the NFS I/O from the view server would have
> been against files in the view database directory, which looks like this
> after the build has finished (both view databases are similar in size):
>
> (zcars0z4) ~>> ls -l .../db
> -rw-r--r-- 1 dgraham fwptools 278528 Jan 8 05:30 view_db.d01
> -rw-r--r-- 1 dgraham fwptools 106496 Jan 8 05:30 view_db.d02
> -rw-rw-r-- 1 dgraham fwptools 7143 Jun 13 2001 view_db.dbd
> -rw-r--r-- 1 dgraham fwptools 114688 Jan 8 05:30 view_db.k01
> -r--r--r-- 1 dgraham fwptools 3 Jun 13 2001 view_db_schema_version
> -rw-r--r-- 1 dgraham fwptools 17519 Jan 8 05:30 vista.log
> -rw-r--r-- 1 dgraham fwptools 5146 Jan 8 05:30 vista.taf
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
On Thu, 08/01/2004 at 10:26, Paul Smith wrote:
> > View server on Linux 2.4.18-27 (zcard0pf):
> >
> > Build time: 35.75s user 31.68s system 33% cpu 3:21.02 total
> > RPC calls: 94922
> > RPC retrans: 0
> > NFS V3 WRITE: 63317
> > NFS V3 COMMIT: 28916
> > NFS V3 LOOKUP: 1067
> > NFS V3 READ: 458
> > NFS V3 GETATTR: 406
> > NFS V3 ACCESS 0
> > NFS V3 REMOVE 5
> >
> > View server on Solaris 5.8 (zcars0z4)
> >
> > Build time: 35.50s user 32.09s system 46% cpu 2:26.36 total
> > NFS calls: 3785
> > RPC retrans: 0
> > NFS V3 WRITE: 612
> > NFS V3 COMMIT: 7
> > NFS V3 LOOKUP: 1986
> > NFS V3 READ: 0
> > NFS V3 GETATTR: 532
> > NFS V3 ACCESS 291
> > NFS V3 REMOVE 291
All you are basically showing here is that our write caching sucks
badly. There's nothing there to pinpoint merging vs not merging requests
as the culprit.
3 things that will affect those numbers, and cloud the issue:
1) Linux 2.4.x has a hard limit of 256 outstanding read+write nfs_page
structs per mountpoint in order to deal with the fact that the VM does
not have the necessary support to notify us when we are low on memory
(This limit has been removed in 2.6.x...).
2) Linux immediately puts the write on the wire once there are more
than wsize bytes to write out. This explains why bumping wsize results
in fewer writes.
3) There are accounting errors in Linux 2.4.18 that cause
retransmitted requests to be added to the total number of transmitted
ones. That explains why switching to TCP improves matters.
Note: Try doing this with mmap(), and you will get very different
numbers, since mmap() can cache the entire database in memory, and only
flush it out when you msync() (or when memory pressure forces it to do
so).
One further criticism: there are no READ requests on the Sun machine.
That suggests that it had the database entirely in cache when you
started your test.
Cheers,
Trond
> 2) Linux immediately puts the write on the wire once there are more
> than wsize bytes to write out. This explains why bumping wsize results
> in fewer writes.
Clarification.
The actual condition is really "put all outstanding writes on the wire as
soon as you have wsize/PAGE_SIZE requests, and the last page to be
updated was completely filled".
This flushes out writes on pages that are not currently being accessed
in the background.
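Expressed as code, the condition is roughly the following (a simplified
model only, not the actual fs/nfs code; it assumes 4K pages):

    /* Simplified model of the flush trigger described above -- not the
     * real kernel code.  Assumes 4K pages for the sake of the sketch. */
    #define PAGE_SIZE 4096

    static int should_flush(unsigned int outstanding_requests,
                            unsigned int wsize,
                            int last_page_completely_filled)
    {
            return outstanding_requests >= wsize / PAGE_SIZE &&
                   last_page_completely_filled;
    }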
Cheers,
Trond
%% <[email protected]> writes:
tm> All you are basically showing here is that our write caching sucks
tm> badly. There's nothing there to pinpoint merging vs not merging
tm> requests as the culprit.
Good point. I think that was "intuited" from other info, but I'll have
to check.
tm> 3 things that will affect those numbers, and cloud the issue:
tm> 1) Linux 2.4.x has a hard limit of 256 outstanding read+write nfs_page
tm> structs per mountpoint in order to deal with the fact that the VM does
tm> not have the necessary support to notify us when we are low on memory
tm> (This limit has been removed in 2.6.x...).
OK.
tm> 2) Linux immediately puts the write on the wire once there are more
tm> than wsize bytes to write out. This explains why bumping wsize results
tm> in fewer writes.
OK.
tm> 3) There are accounting errors in Linux 2.4.18 that cause
tm> retransmitted requests to be added to the total number of transmitted
tm> ones. That explains why switching to TCP improves matters.
Do you know when those accounting errors were fixed?
ClearCase implements its own virtual filesystem type, and so is heavily
tied to specific kernels (the kernel module is not open source of course
:( ). We basically can move to any kernel that has been released as
part of an official Red Hat release (say, 2.4.20-8 from RH9 would work),
but no other kernels can be used (the ClearCase kernel module has checks
on the sizes of various kernel structures and won't load if they're not
what it thinks they should be--and since it's a filesystem it cares
deeply about structures that have tended to change a lot. It won't even
work with vanilla kernel.org kernels of the same version.)
tm> Note: Try doing this with mmap(), and you will get very different
tm> numbers, since mmap() can cache the entire database in memory, and only
tm> flush it out when you msync() (or when memory pressure forces it to do
tm> so).
OK... except since we don't have the source we can't switch to mmap()
without doing something very hacky like introducing some kind of shim
shared library to remap some read/write calls to mmap(). Ouch.
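(By a "shim" I mean an LD_PRELOAD interposer along the lines of the
sketch below -- purely an illustration of the idea, not something we've
built; a real one would have to track which fds belong to the database
and redirect those writes into mmap()ed regions instead of simply
forwarding them.)

    /* write_shim.c -- skeleton of the "hacky" interposer idea; as
     * written it only forwards write() unchanged.
     *
     *   gcc -shared -fPIC -o write_shim.so write_shim.c -ldl
     *   LD_PRELOAD=./write_shim.so <view server binary>
     */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/types.h>
    #include <unistd.h>

    ssize_t write(int fd, const void *buf, size_t count)
    {
        static ssize_t (*real_write)(int, const void *, size_t);

        if (!real_write)
            real_write = (ssize_t (*)(int, const void *, size_t))
                    dlsym(RTLD_NEXT, "write");
        /* A real shim would check whether fd is a database file and,
         * if so, copy buf into an mmap()ed window and return count
         * here instead of forwarding the call. */
        return real_write(fd, buf, count);
    }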
Also I think that ClearCase _does_ force sync fairly regularly to be
sure the database is consistent.
tm> One further criticism: there are no READ requests on the Sun
tm> machine. That suggests that it had the database entirely in cache
tm> when you started your test.
Good point.
Thanks Trond!
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
On Thu, 08/01/2004 at 12:47, Paul Smith wrote:
> Do you know when those accounting errors were fixed?
In the official kernels? 2.4.22, I believe... However there are patches
going back to 2.4.19...
See http://www.fys.uio.no/~trondmy/src/Linux-2.4.x/2.4.19/
The main patch you want will be linux-2.4.19-14-call_start.dif
>
> ClearCase implements its own virtual filesystem type, and so is
> heavily tied to specific kernels (the kernel module is not open source
> of course :( ). We basically can move to any kernel that has been
> released as part of an official Red Hat release (say, 2.4.20-8 from
> RH9 would work), but no other kernels can be used (the ClearCase
> kernel module has checks on the sizes of various kernel structures and
> won't load if they're not what it thinks they should be--and since
> it's a filesystem it cares deeply about structures that have tended to
> change a lot. It won't even work with vanilla kernel.org kernels of
> the same version.)
Blech...
Note: if you want to try implementing a scheme like what you propose,
then the simplest way to do it would be to add something like the
following patch. It disables nfs_strategy(), then causes
nfs_updatepage() to extend the request size if it sees that we're not
using byte-range locking, and the complete page is in cache.
Cheers,
Trond
-----Original Message-----
From: Paul Smith [mailto:[email protected]]
Sent: 8 January 2004 18:47
To: [email protected]
Subject: Re: [NFS] NFS client write performance issue ... thoughts?
%% <[email protected]> writes:
tm> All you are basically showing here is that our write caching sucks
tm> badly. There's nothing there to pinpoint merging vs not merging
tm> requests as the culprit.
Good point. I think that was "intuited" from other info, but I'll have
to check.
tm> 3 things that will affect those numbers, and cloud the issue:
tm> 1) Linux 2.4.x has a hard limit of 256 outstanding read+write
nfs_page
tm> struct per mountpoint in order to deal with the fact that the VM
does
tm> not have the necessary support to notify us when we are low on
memory
tm> (This limit has been removed in 2.6.x...).
OK.
tm> 2) Linux immediately puts the write on the wire once there are
more
tm> than wsize bytes to write out. This explains why bumping wsize
results
tm> in fewer writes.
OK.
tm> 3) There are accounting errors in Linux 2.4.18 that cause
tm> retransmitted requests to be added to the total number of
transmitted
tm> ones. That explains why switching to TCP improves matters.
Do you know when those accounting errors were fixed?
ClearCase implements its own virtual filesystem type, and so is heavily
tied to specific kernels (the kernel module is not open source of course
:( ). We basically can move to any kernel that has been released as
part of an official Red Hat release (say, 2.4.20-8 from RH9 would work),
but no other kernels can be used (the ClearCase kernel module has checks
on the sizes of various kernel structures and won't load if they're not
what it thinks they should be--and since it's a filesystem it cares
deeply about structures that have tended to change a lot. It won't even
work with vanilla kernel.org kernels of the same version.)
Actually it does not look like clearcase is checking for an exact kernel
version, it just depends on redhat hacks in the kernel (I have no clue
which). But taking a 2.4.20-XX redhat kernel and building it from
SRPM actually works. Furthermore, since you have the kernel in source
when building it from SRPM, you can add as many patches as you want, as
long as these patches do not screw with the same stuff clearcase mvfs
relies on. I managed to do some heavy modifying of a rh9 kernel SRPM,
patched it up to the level I needed + included support for diskless boot,
and used this on Fedora, and still got clearcase to work (I had to tweak
/etc/issue, since clearcase actually checks for the redhat(version)
string).
%% "Mikkelborg, Kjetil" <[email protected]> writes:
Hi; can you please use normal quoting when replying to email? If you
just include the entire original and then type some comments in the middle,
it's very difficult to find the comments you made. Thanks!
pds> ClearCase implements its own virtual filesystem type, and so is
pds> heavily tied to specific kernels (the kernel module is not open
pds> source of course :( ). We basically can move to any kernel that
pds> has been released as part of an official Red Hat release (say,
pds> 2.4.20-8 from RH9 would work), but no other kernels can be used
pds> (the ClearCase kernel module has checks on the sizes of various
pds> kernel structures and won't load if they're not what it thinks
pds> they should be--and since it's a filesystem it cares deeply about
pds> structures that have tended to change a lot. It won't even work
pds> with vanilla kernel.org kernels of the same version.)
mk> Actually it does not look like clearcase is checking for an exact
mk> kernel version, it just depends on redhat hacks in the kernel (I
mk> have no clue which).
I didn't say it was checking for an exact kernel version. I said it
checks the sizes of various kernel structures. This is done dynamically
when the kernel module is loaded.
The way it works is this: the MVFS filesystem loadable module comes in
two parts: a precompiled part which you don't get the source to, and a
.c file which is a "wrapper". The wrapper is recompiled against your
current kernel, headers, etc. then the two are linked together to form
the real mvfs.o module. The wrapper provides a buffer against some
kinds of changes to the kernel.
But, the wrapper also examines about 30 different kernel data structures
and compares their sizes in your kernel against the ones expected in the
prebuilt .o file. If they're different then the module won't load.
Some of these are pretty static, like "struct timeval", but some are
much more dynamic, like "struct file", etc. For some structures they
care about the size of the whole structure, for some they care about the
offset into the structure of a given field.
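Conceptually the check is nothing more than a table of sizeof() and
offsetof() values compared at module load time; something like the sketch
below (not the actual MVFS code, and the "expected" numbers are made up
purely for illustration):

    /* Sketch of the technique only -- not the real STRUCT_CHECK_INIT
     * code.  In MVFS the expected values are baked into the prebuilt
     * object; here they are invented for illustration. */
    #include <stddef.h>
    #include <stdio.h>

    /* stand-in for a kernel structure the module cares about */
    struct file_like {
        void *f_op;
        long  f_pos;
        int   f_flags;
    };

    struct layout_check {
        const char   *what;
        unsigned long expected;   /* from the prebuilt part */
        unsigned long actual;     /* from the headers we compiled against */
    };

    int main(void)
    {
        struct layout_check checks[] = {
            { "sizeof(struct file_like)",          24,
              sizeof(struct file_like) },
            { "offsetof(struct file_like, f_pos)",  8,
              offsetof(struct file_like, f_pos) },
        };
        unsigned int i;
        int bad = 0;

        for (i = 0; i < sizeof(checks) / sizeof(checks[0]); i++)
            if (checks[i].expected != checks[i].actual) {
                printf("mismatch: %s expected %lu got %lu\n",
                       checks[i].what, checks[i].expected,
                       checks[i].actual);
                bad = 1;
            }
        return bad;   /* non-zero means "refuse to load" */
    }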
mk> But taking a 2.4.20-XX redhat kernel and building it from SRPM
mk> actually works. Furthermore, since you have the kernel in source
mk> when building it from SRPM,
I never bother to build from the SRPM. It's much more straightforward
to build from the kernel-source RPM. YMMV of course.
mk> you can add as many patches as you want, as long as these patches
mk> do not screw with the same stuff clearcase mvfs relies on. I
mk> managed to do some heavy modifying of a rh9 kernel SRPM, patched it
mk> up to the level I needed + included support for diskless boot.
Yes; as long as you don't mess with the structures ClearCase cares
about, you win. You can find the exact structures in question in the
STRUCT_CHECK_INIT macro in the mvfs_param.h file in the ClearCase
distribution directory.
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.
Thanks Trond.
We tried out your patch and it really did make a difference in terms of
both the nfsstat numbers (or rather the Ethereal numbers, in this case) AND
the performance the user sees, at least from a ClearCase perspective
(gnurr.diff is with your patch):
> This time, I used my desktop box (running 2.4.18-27.7.x + gnurr.diff)
> as view server, while running the small build on a different Linux box.
> I captured NFS traffic with Ethereal, so as to rule out any possible
> accounting errors in nfsstat. Here's what the numbers look like:
>
>              nopatch  gnurr.diff  gnurr.diff  Solaris
>               UDP/4K      UDP/4K     UDP/32K  TCP/32K
>              -------  ----------  ----------  -------
> WRITE          65017        1298         538      612
> COMMIT         29268         994         882        7
> LOOKUP          1082        1020        1016     1986
> GETATTR          548         490         404      532
> READ             480          75          89        0
> FSSTAT            60           3           3       60
> FSINFO            60           3           3        0
> REMOVE            19          15           8      291
> ACCESS             0           0           0      291
> NULL              10           6           5        0
> SETATTR            4           2           3        2
> RENAME             4           2           3        2
> CREATE             4           2           3        2
> ======        ======      ======      ======   ======
> Total          96556        3910        2957     3785
>
> CPU              85s         85s         85s      85s
> Elapsed         202s        121s        121s     139s
>
> Bytes read:   1.97MB       0.3MB       0.7MB       ??
> Bytes writ:   90.3MB       5.2MB       4.9MB       ??
>
> The Solaris numbers are from the same capture I did a few days ago.
> For the build times, all builds were done on a machine in our Linux
> pool (which was not patched), using my desktop in various
> configurations as view server. For the Solaris view server, I used a
> Sun Ultra-80 running Solaris 8. CPU use was that used by the build
> itself, not the view server, so we'd expect it to be constant.
>
> So Linux with the gnurr.diff patch stacks up pretty well against Solaris.
> Note that there are no numbers here for Linux with TCP mounts. That's
> because I was analyzing the data with ethereal, and it often can't
> find the RPC boundaries properly when TCP is in use, so the numbers
> are meaningless.
>
> One noticeable difference between Linux and Solaris is that Linux
> generates a lot more COMMITS than Solaris.
However, then we discovered something concerning, so we are not using
the patch as-is right now, at least pending further investigation:
> Oops. With this patch applied, I didn't see a problem when using
> my desktop as view server, but when I use it to do builds on, I'm
> occasionally seeing the following:
>
> ldsimpc: final link failed: Message too long
>
> Ldsimpc is the GNU linker built as a cross-linker for VxSim (ie: for
> MinGW). The linker is getting EMSGSIZE somehow, and at this point, I
> can only guess that it's getting it on a call to write(). Probably NFS
> is trying to write an overly large UDP packet to the server. This is
> with UDP mounts and rsize=wsize=4KB. Write failures are a little scary,
> so I think I'm going to back out this patch for now.
--
-------------------------------------------------------------------------------
Paul D. Smith <[email protected]> HASMAT--HA Software Mthds & Tools
"Please remain calm...I may be mad, but I am a professional." --Mad Scientist
-------------------------------------------------------------------------------
These are my opinions---Nortel Networks takes no responsibility for them.