Hi,
I posted an earlier version of the attached patch to
the kernel mailing list a few days back but did not
receive any feedback. Hopefully, I'll have better
luck here. ;-)
The stock NFS client does not regulate the number of async
write requests, which causes other accesses to block in the
presence of streaming writes. This patch adds such
support. For instance, a single dd running in the
background writing to a file in my home directory
causes my X session to hang until dd exits. With this
patch the session does not experience such hangs.
Please test it out and let me know if you see any
problems. I'd like to see this integrated soon.
Thanks,
Shantanu
Apologies for replying to my own message. I realized
there was a bug in the patch shortly after sending it.
Attached is a fixed one.
Thanks,
Shantanu
Shantanu Goel wrote:
>
> The stock NFS client does not regulate the number of async
> write requests, which causes other accesses to block in the
> presence of streaming writes.
This is a real problem; the same problem in IRIX was fixed at
6.5.17 (about 2 years ago).
> This patch adds such
> support. For instance, a single dd running in the
> background writing to a file in my home directory
> causes my X session to hang until dd exits. With this
> patch the session does not experience such hangs.
> Please test it out and let me know if you see any
> problems. I'd like to see this integrated soon.
I discussed this with Dave Chinner (who implemented IRIX' write
throttling), and our conclusion was that this is implementing
throttling at the wrong level. A better approach would be to
change the allocation algorithm of rpc_rqsts to divide the
RPC_MAXREQS space into three fixed spaces by class: sync calls,
async (writes & commits), and all other async calls. Attempts to
allocate a request slot can then be classified into one of
these classes in the struct rpc_procinfo, and when a class'
space is full the allocation fails; the rpc_task already knows
how to sleep and retry request slot allocation.
Alternatively, the allocation could treat the classes as priority
classes with sync calls given the highest priority; allocations
would succeed only if there were slots available at the same or
less important priority. For example, there were patches to do
similar things to the block device layer.
The exact numbers chosen for the class sizes will be a subtle
and important tuning decision. It will probably pay to bump up
RPC_MAXREQS to compensate for the subdivision; you'll want the
class for async writes to still be 16 slots in size.
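To make the idea concrete, a very rough sketch (the class enum, the
per-class counters and the helpers below are invented names for
illustration only, not existing sunrpc code):

enum rpc_slot_class {
	RPC_CLASS_SYNC,		/* synchronous calls */
	RPC_CLASS_AWRITE,	/* async WRITEs and COMMITs */
	RPC_CLASS_AOTHER,	/* all other async calls */
	RPC_NR_CLASSES
};

/* fixed subdivision of the request slot space */
static const unsigned int rpc_class_max[RPC_NR_CLASSES] = { 8, 16, 8 };

struct rpc_rqst *xprt_alloc_slot_classed(struct rpc_xprt *xprt,
					 enum rpc_slot_class class)
{
	/* xprt->class_count[] is an invented per-transport counter */
	if (xprt->class_count[class] >= rpc_class_max[class])
		return NULL;	/* class full: the rpc_task sleeps and retries */
	xprt->class_count[class]++;
	return xprt_pull_free_slot(xprt);	/* invented helper */
}

The class a request belongs to would come from its struct rpc_procinfo,
and the counter would be dropped again when the slot is released.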
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Thu, Feb 26, 2004 at 02:34:05PM +1100, Greg Banks wrote:
> change the allocation algorithm of rpc_rqsts to divide the
> RPC_MAXREQS space into three fixed spaces by class: sync calls,
> async (writes & commits), and all other async calls. Attempts to
I don't think this is a good solution for UDP. If you bump the number
of slots at the same time, you may still end up with the same symptoms,
as your async write requests may eat your congestion window, leaving just
the crumbs for sync requests. If you don't bump the number of slots,
write throughput will suffer.
What about modifying __rpc_sleep_on to put sync tasks always
at the head of the queue? This way sync tasks get priority when
it comes to call reservation.
(for fairness reasons, you may want to walk the queue and insert
your sync task before the first async task at least)
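As a rough sketch of that insertion (field names below are written from
memory and are purely illustrative; RPC_IS_ASYNC() does exist, the rest
may not match the tree exactly):

static void __rpc_queue_sync_before_async(struct rpc_wait_queue *q,
					  struct rpc_task *task)
{
	struct rpc_task *t;

	list_for_each_entry(t, &q->tasks, tk_list) {
		if (RPC_IS_ASYNC(t)) {
			/* keep FIFO order among sync tasks, but jump
			 * ahead of every queued async task */
			list_add_tail(&task->tk_list, &t->tk_list);
			return;
		}
	}
	list_add_tail(&task->tk_list, &q->tasks);
}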
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+
On Thu, 26 Feb 2004 at 00:41, Olaf Kirch wrote:
> What about modifying __rpc_sleep_on to put sync tasks always
> at the head of the queue? This way sync tasks get priority when
> it comes to call reservation.
How do you avoid the inverse problem that async tasks end up never making
any progress? I'm not a huge fan of playing with rpc task
priorities: just witness the type of tuning problems Ingo & co. had (and
still have IMO) when they played around with the task scheduler
algorithm for 2.6.x.
I think I prefer an approach like that taken by Shantanu; however, there
are 2 problems with his code as it stands today:
- It only addresses the problem of writes (yes, we sometimes do read a
bit too).
- It is a fairly heavy layering violation. It would be nice to move
some (all?) of that code into the RPC layer if possible.
Cheers,
Trond
On Thu, 26 Feb 2004 [email protected] wrote:
> On Thu, 26 Feb 2004 at 00:41, Olaf Kirch wrote:
> > What about modifying __rpc_sleep_on to put sync tasks always
> > at the head of the queue? This way sync tasks get priority when
> > it comes to call reservation.
>
> How do you avoid the inverse problem that async tasks end up never making
> any progress?
How about having a ratio (that can be changed from userland) between
sync and async operations? Let's say that you move 3 sync ops to the
head of the queue, then let 1 async one go through. However, for this
to be done efficiently it's probably better to have more than one queue
and then take one or more ops from the head of each queue
according to the ratio.
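In rough pseudo-C the pick function could look like this (every name
here is invented for illustration; a real version would keep the
counter per transport rather than in a static):

static unsigned int rpc_sync_async_ratio = 3;	/* 3 sync ops per async op */

static struct rpc_task *rpc_pick_next(struct rpc_wait_queue *sync_q,
				      struct rpc_wait_queue *async_q)
{
	static unsigned int sync_picked;

	/* serve up to "ratio" sync ops, then one async op */
	if (sync_picked < rpc_sync_async_ratio && !rpc_queue_empty(sync_q)) {
		sync_picked++;
		return rpc_dequeue_head(sync_q);	/* invented helper */
	}
	sync_picked = 0;
	if (!rpc_queue_empty(async_q))
		return rpc_dequeue_head(async_q);
	return rpc_queue_empty(sync_q) ? NULL : rpc_dequeue_head(sync_q);
}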
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]
On Thu, 26 Feb 2004 at 05:20, Bogdan Costescu wrote:
> > How do you avoid the inverse problem that async tasks end up never making
> > any progress?
>
> How about having a ratio (that can be changed from userland) between
> sync and async operations? Let's say that you move 3 sync ops to the
> head of the queue, then let 1 async one go through. However, for this
> to be done efficiently it's probably better to have more than one queue
> and then take one or more ops from the head of each queue
> according to the ratio.
I'm not convinced that is flexible enough. Imagine that the machine needs
to flush out some writes in order to reclaim memory for use by other
tasks. Is it then correct to be letting through a bunch of stat() requests
in order to satisfy some "ls" command instead of the memory reclaim?
To simply assert that "async" is somehow equivalent to "not important" is
just plain wrong.
Cheers,
Trond
On Thu, Feb 26, 2004 at 02:40:44PM +0100, Trond Myklebust wrote:
> To simply assert that "async" is somehow equivalent to "not important" is
> just plain wrong.
Agreed, and I also agree that my suggestion of always preferring sync
tasks over async tasks is simplistic. But I think the patch proposed
by Shantanu adds more complexity to an already overly complex rpc
implementation.
The general "unfairness" observed in writes is due to the fact that we
allow a writing process to dirty a large number of pages without blocking
for the actual IO. So if you write a large file, nfs_flushd can easily
saturate the transport (which is a good thing for performance) but it's
bad for sync tasks.
One problem is that currently, there seems to be no upper bound at all
on the number of write requests we schedule. So one step in the right
direction may be to check the number of backlogged tasks and refrain
from scheduling more write requests if it exceeds a certain threshold.
Applications doing lots of stats etc are punished so severely because
all these operations are synchronous. So if you have a maximum backlog
of N RPC requests, your average delay per sync rpc operation is O(N).
So I think it could help if nfs_flushd and friends check the transport
backlog before scheduling new writes. There should be tunables for
the minimum number of pending writes we're always permitted to
schedule, and a maximum backlog length.
To avoid any ugliness related to layering violations, you could use
"color" to tag different RPC requests rather than referring to "writes".
Of course, that's just changing the queuing, not real scheduling.
But that can be an advantage, too, because it keeps things relatively
simple.
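Something along these lines, purely as a sketch (the two tunables and
nfs_transport_backlog() are invented names, not existing symbols):

static unsigned int nfs_min_pending_writes = 16;   /* always allowed */
static unsigned int nfs_max_write_backlog  = 64;   /* back-off threshold */

static int nfs_may_schedule_write(struct rpc_clnt *clnt,
				  unsigned int pending_writes)
{
	/* always allow a small amount of writebehind */
	if (pending_writes < nfs_min_pending_writes)
		return 1;
	/* too many tasks already backlogged: let sync callers in first */
	if (nfs_transport_backlog(clnt) >= nfs_max_write_backlog)
		return 0;
	return 1;
}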
BTW I think reading is much less of a problem. The number of readaheads
scheduled by generic_file_read is limited, so it's entirely different
from the "lets blow the router's fuse" approach of the NFS write code.
Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+
On Thu, 26 Feb 2004 [email protected] wrote:
> To simply assert that "async" is somehow equivalent to "not important" is
> just plain wrong.
That was the point of the userland modifiable ratio. Then userland can
decide what is more important...
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]
On Thu, 26 Feb 2004, Olaf Kirch wrote:
> The general "unfairness" observed in writes is due to the fact that we
> allow a writing process to dirty a large number of pages without blocking
> for the actual IO.
Yes, and I think that this is pretty similar to the situation
encountered with block IO, where queued writes had to be serviced
before later queued reads, giving an impression of unresponsiveness.
It's easier to accept a delayed write than a delayed read.
> Applications doing lots of stats etc are punished so severely because
> all these operations are synchronous.
That's exactly the case for block IO reads.
> BTW I think reading is much less of a problem. The number of readaheads
> scheduled by generic_file_read is limited,
On the other hand, based on the block IO experience, reads should have
higher priority than writes :-)
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]
seems like we have three different kinds of writes.
1. writes that need to go now because the VM needs to reclaim
the dirty pages
2. async writebehind writes
3. writes generated by the synchronous path because of the
"sync" mount option or O_SYNC open flag.
1 and 3 are usually a bounded number of writes and fairly
urgent, whereas 2 can be effectively unbounded but not terribly
urgent at all.
likewise for reads, you have reads that are due to readahead,
and reads that are pushed out because of a sync_page (someone
is waiting for that very page).
rather than using something like the RPC_SWAPPER_TASK flag
to change queuing behavior in the RPC client, perhaps the
RPC client should have a two- (or more) level scheduler
queue. place the requests that have to go now on the top
queue, and the readahead/writebehind tasks on the bottom.
have a mechanism for boosting the priority of bottom
queue requests that suddenly become important.
then the RPC client can visit requests on the top queue
first, and when that is empty, visit requests on the
bottom queue. there is still the potential for some
unfairness here, so, after every 'n' top queue requests,
service a bottom queue request so that eventually all
bottom queue requests are completed even if there is
a steady flow of top queue requests.
this is a mechanism that has worked in the past, and
requires very little tuning (ie low maintenance).
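as a sketch, in the same spirit (every structure and helper below is
invented for illustration, not existing code):

struct rpc_twoq {
	struct list_head top;		/* must-go-now requests */
	struct list_head bottom;	/* readahead / writebehind */
	unsigned int	 top_streak;	/* top picks since last bottom pick */
};

#define N_TOP_BEFORE_BOTTOM 8

/* a bottom-queue request suddenly became important (someone is now
 * waiting on that very page): boost it into the top queue */
static void rpc_boost_task(struct rpc_twoq *q, struct rpc_task *task)
{
	list_move_tail(&task->tk_list, &q->top);
}

static struct list_head *rpc_twoq_pick(struct rpc_twoq *q)
{
	/* every N top-queue picks, let one bottom-queue request through */
	if (!list_empty(&q->bottom) &&
	    (list_empty(&q->top) || q->top_streak >= N_TOP_BEFORE_BOTTOM)) {
		q->top_streak = 0;
		return q->bottom.next;
	}
	if (!list_empty(&q->top)) {
		q->top_streak++;
		return q->top.next;
	}
	return NULL;
}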
On Thu, 26 Feb 2004, Lever, Charles wrote:
> 1 and 3 are usually a bounded number of writes and fairly urgent,
Well, the default these days is "sync", so 3 becomes less bounded, as
you can still do the "dd" write storm. Also if some process is started
and wants to allocate lots of memory, even 1 can trigger a large
amount of writes.
> perhaps the RPC client should have a two- (or more) level scheduler
> queue. place the requests that have to go now on the top queue, and
> the readahead/writebehind tasks on the bottom.
That was exactly what I had in mind, except that my mind considered
sync=urgent; with your detailed op types, it makes sense to order by
"urgency" rather than sync/async.
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]
While the solution you propose certainly sounds more
elegant, in my experience, asynchronous writes seem to
eat up most of the RPC slots in normal use and this
patch only modifies the write path without affecting
the core RPC scheduler. Surely, a solution that fixes
the common case should be acceptable until the more
elegant one is available? The NFS client in Linux has
suffered from this issue for quite a while and I don't
recall seeing a patch that addressed it. If
there is something available to test, I'd certainly
like to look at it. As it stands, the current NFS client
is basically useless for concurrent activity in the
presence of heavy writers.
Shantanu
--- Greg Banks <[email protected]> wrote:
> I discussed this with Dave Chinner (who implemented IRIX' write
> throttling), and our conclusion was that this is implementing
> throttling at the wrong level. A better approach would be to
> change the allocation algorithm of rpc_rqsts to divide the
> RPC_MAXREQS space into three fixed spaces by class: sync calls,
> async (writes & commits), and all other async calls.
> [...]
> The exact numbers chosen for the class sizes will be a subtle
> and important tuning decision.
On Sun, 29 Feb 2004 at 20:33, Shantanu Goel wrote:
> I have been experimenting with the RPC scheduler and
> found the changes I am about to describe to be quite
> effective. A flag has been added to rpc_wait_queue
> called "batch". This changes the way sleep/wakeup are
> implemented. When the flag is set, tasks are grouped
> by the id of the initiating process and a batch count
> is set for each process. The scheduler will keep
> submitting tasks for the same process until the batch
> count goes to zero. At that point, the scheduler
> switches to another process. The transport layer uses
> this feature for the backlog queue. This has the
> following advantages.
>
> - It is independent of the type of request.
> - It prevents any process from completely hogging
> the request slots.
> - It prevents any process from starving completely.
> - It does not require any changes to NFS layer so that
> takes care of layering violations.
> - The number of async requests is limited only by the amount
> of memory available.
> - The batching allows large read/write requests to be
> sent back to back.
>
> Patch is attached.
>
> Comments?
Woah.... What's all this struct rpc_pid stuff?
If you need to sort everything by pid (...and I'm not convinced that is
such a great idea) then you can do it by adding 1 extra linked list in the
struct rpc_task to act as the head of the pid list... No need for any
fragile allocation of extra structures from inside
__rpc_add_wait_queue().
The question is, though, why are you sorting stuff by pid? That isn't even
defined for most of the asynchronous requests. As I said before: in a low
memory situation it is silly to be treating a new readdir() or a stat() in
preference to a bunch of queued pdflush or memory allocator requests.
Why not sort them by inode number instead?
...or have well defined priority levels for each type of request (as Chuck
suggested) so that we can schedule them properly. Synchronous requests
already have all the properties that you listed. All you need to do is to
schedule them independently of the asynchronous tasks.
Cheers,
Trond
> If you need to sort everything by pid (...and I'm not
> convinced that is such a great idea) then you can do it by adding 1 extra
> linked list in the struct rpc_task to act as the head of the pid list...
> No need for any fragile allocation of extra structures from inside
> __rpc_add_wait_queue().
i agree.
i'd rather keep __rpc_add_wait_queue and __rpc_remove_wait_queue
as simple as possible. more complexity here will significantly
impact the per-op overhead, especially if we're holding the BKL
and/or other spinlocks when these are called.
On Sun, 29 Feb 2004 at 22:55, Shantanu Goel wrote:
> You must have missed the tk_proc field I added to
> rpc_task. It is explicitly initialized with the pid
> of the initiating process in rpc_init_task.
...and my point is "what makes you think that the process that calls
rpc_init_task has anything to do with the process that scheduled the read
or write?"
> Did I miss some other place where the tasks are created?
I repeat: most asynchronous rpc_tasks are NOT scheduled by the kernel
thread that created them. See the pdflush task. It doesn't create *ANY*
reads or writes, but it schedules a lot of them.
> Isn't the RPC layer supposed to be generic and not
> privy to request internals? Wouldn't knowledge of
> inodes qualify as such? If we wish to preserve this
> quality, other than the process id, how else would you
> distinguish one request from another? The only way I
> can think of is if the upper layers can pass down a
> cookie in rpc_task. In the NFS case it would be the
> inode. Is that what you have in mind?
Yes, or alternatively, just give the tasks a numeric priority. That's all
we do for real tasks after all...
> You do raise an interesting issue with regards to
> memory pressure. I'll run some tests with low memory
> to see if this patch survives it. Keep in mind
> though, any writebacks initiated by pdflush/kswapd
> will be tagged with their respective pid's and run in
> parallel. They will not block behind those of the
> original process.
EXACTLY, and that's the whole problem...
They will all be tagged with the pid of pdflush/kswapd, and hence will all
be in a single queue. That means that those tasks have a de-facto lower
priority than competing synchronous tasks (which will have different pids,
and hence only have to compete with the head of the queue) whereas we'd
actually like them to have a *higher* priority 'cos they are needed to
deal with memory pressure.
Cheers,
Trond
On Mon, 1 Mar 2004 at 00:14, Shantanu Goel wrote:
> I did not make that assumption. It does not appear to
> be a problem in practice. I ran 4 dd's each writing
> 256MB in parallel on a machine with 2 CPUs and 64MB
> memory and it chugged along fine and still allowed
> other NFS activity on the same mount point. However, I
> believe this reflects the way the NFS layer is
> implemented (most requests ARE submitted by the writing
> process except in the case of less than a page worth
> of writes). However, if that were to ever change,
> your argument will probably be valid.
Sigh... See the patch linux-2.6.3-02-strategy.dif on the usual patch site.
People are asking that we cache more than is currently done in order to
allow for better random write performance.
> > Yes, or alternatively, just give the tasks a numeric
> > priority. That's all
> > we do for real tasks after all...
> >
>
> Just assigning a priority to requests will not fix the
> issue. The whole reason I started looking at this was
> heavy writing to a single inode blocks out all other
> async write activity. To give a concrete example.
> Start a dd in the background and then start Mozilla.
> Even if you give priority to sync requests, Mozilla
> will hang when it writes its cache files because to
> the RPC layer, one async write will look the same as
> the next. The RPC layer must segregate tasks based on
> some criteria. I chose to use the process id which
> seems to work pretty well in practice with the current
> implementation of NFS. So I see no way around
> batching tasks based on either a cookie or process id.
Huh? Just have the asynchronous tasks sit in a different priority bucket
to the synchronous ones and schedule, say 4 asynchronous tasks for every 1
synchronous one. Why would this be so difficult?
Cheers,
Trond
On Mon, 1 Mar 2004, Shantanu Goel wrote:
> It won't fix the underlying issue that a single heavy writer will
> block other async writes for a looong time.
But how is the kernel to know which process is the "bad" one which
should be punished? Maybe the process that is writing a lot right now
does so because it's just about to finish (or dumping core :-)), so it's
more advantageous to give it a bit of priority because it won't bother
anyone afterwards. On the other hand, if it just keeps writing and
writing and is not going away soon, it should be punished. So how can
the kernel tell the difference between these 2 situations?
> Another process comes along and writes 128K then closes the file.
> Before the close, it will call nfs_strategy() which will queue async
> writes after all the ones from dd. So, effectively, how quickly dd
> completes will determine how quickly this process can return from
> close.
If the second process is writing to the same file, then I think that
this behaviour is actually wanted (of course, without locking, the
data will be messed up...).
If they are writing to different files, it probably makes sense to
make sure that both files advance a bit, so queueing based on inode
might also make sense.
Queueing based on PID is probably bad also from another point of view:
a process can have more than one file open at the same time and can do
various operations on them. If it writes a lot to one file and then
wants to read a small piece from another file, should it have to wait
until the writes are done to perform the read?
--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: [email protected]
On Mon, 1 Mar 2004 at 10:03, Shantanu Goel wrote:
> Not a matter of being difficult. It won't fix the underlying issue
> that a single heavy writer will block other async writes for a looong
> time. Here's the scenario. dd makes a huge bunch of nfs_strategy()
> calls and fills up the async queue. Another process comes along and
> writes 128K then closes the file. Before the close, it will call
> nfs_strategy() which will queue async writes after all the ones from
> dd. So, effectively, how quickly dd completes will determine how
> quickly this process can return from close. What you need is a way to
> tell the scheduler that the async writes from the latter process are not
> the same as the ones from dd. Using the pid is one such approach,
> though as you pointed out probably not the best one. Another would be
> the cookie approach. Each rpc_message has a cookie field. The NFS
> layer sets the cookie to be the inode. The scheduler implements the
> multiple priority queues but at the same priority level, round-robins
> between different cookies. Does that clear it up?
Not entirely.
Simple round robin is not going to play well with NFSv2 style write
gathering on the server. Nor will it really work too efficiently with
standard readahead.
How about instead doing round-robin on blocks of, say, 16 requests per
cookie (which is the current maximum readahead value)? That should allow
the server some efficiency on block operations without sacrificing your
fairness criterion.
Note: to avoid having to revamp rpc_call_sync() to take a cookie all the
time, you could have rpc_init_task() set the default cookie to be a
pointer to the thread's task_struct. For the particular case of
read/write ops, the NFS layer can modify that cookie to be a pointer to
the inode.
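A sketch of what I mean (tk_cookie and the helpers below are
illustrative only and need not match the final patch):

#define RPC_BATCH_PER_COOKIE 16

/* default cookie: the submitting thread */
static inline void rpc_task_set_default_cookie(struct rpc_task *task)
{
	task->tk_cookie = (unsigned long)current;
}

/* NFS read/write paths override it with the inode */
static inline void nfs_set_rpc_cookie(struct rpc_task *task,
				      struct inode *inode)
{
	task->tk_cookie = (unsigned long)inode;
}

/* scheduler side: keep handing out requests with the current cookie
 * until the batch is used up, then rotate to the next cookie so that
 * every reader/writer makes progress */
static int rpc_same_batch(struct rpc_task *task, unsigned long cur_cookie,
			  unsigned int *batch_left)
{
	if (task->tk_cookie != cur_cookie || *batch_left == 0)
		return 0;	/* rotate to the next cookie */
	(*batch_left)--;
	return 1;
}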
Cheers,
Trond
On Mon, 1 Mar 2004 at 10:58, Shantanu Goel wrote:
> > Simple round robin is not going to play well with NFSv2 style write
> > gathering on the server. Nor will it really work too efficiently with
> > standard readahead.
> >
> > How about instead doing round-robin on blocks of, say, 16 requests per
> > cookie (which is the current maximum readahead value)? That should allow
> > the server some efficiency on block operations without sacrificing your
> > fairness criterion.
> >
> We could make it static but how about we add a field "nr_batch" to
> rpc_clnt with a default of 16 but let the higher layer override it? This
> way the NFS layer can set it to a different value for different
> servers.
If you want to make it tunable, then I suggest you just make it a global
variable, and set it up as a sysctl in /proc/sys/sunrpc. I doubt there is
much call for making this a per-mountpoint variable (we already have too
many tunables in the "mount" structure as it is).
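Sketch of the sysctl side, in the style of net/sunrpc/sysctl.c (the
variable name and the ctl_name number are made up; the exact ctl_table
layout differs between kernel versions, so treat this as illustrative):

int rpc_batch_count = 16;

static ctl_table rpc_batch_table[] = {
	{
		.ctl_name	= 99,		/* placeholder number */
		.procname	= "rpc_batch_count",
		.data		= &rpc_batch_count,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= &proc_dointvec,
	},
	{ .ctl_name = 0 }
};

The entry would then be hooked under the existing /proc/sys/sunrpc
directory, either via register_sysctl_table() or by adding it to the
table that is already registered there.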
Cheers,
Trond
On Mon, 1 Mar 2004 at 10:18, Bogdan Costescu wrote:
> On Mon, 1 Mar 2004, Shantanu Goel wrote:
> But how is the kernel to know which process is the "bad" one which
> should be punished? Maybe the process that is writing a lot right now
> does so because it's just about to finish (or dumping core :-)), so it's
> more advantageous to give it a bit of priority because it won't bother
> anyone afterwards. On the other hand, if it just keeps writing and
> writing and is not going away soon, it should be punished. So how can
> the kernel tell the difference between these 2 situations?
I'm currently working on a scheme for throttling processes that keep on
scheduling new writes while the VM is trying to free up memory.
> If the second process is writing to the same file, then I think that
> this behaviour is actually wanted (of course, without locking, the data
> will be messed up...).
Writes always take the i_sem semaphore, so no problems with mess ups. I've
already got patches for improving the caching in the above case (we
basically get rid of nfs_strategy() and leave the job of deciding when to
schedule the writes to pdflush).
> If they are writing to different files, it probably makes sense to
> make sure that both files advance a bit, so queueing based on inode
> might also make sense.
Agreed.
> Queueing based on PID is probably bad also from another point of view:
> a process can have more than one file open at the same time and can do
> various operations on them. If it writes a lot to one file and then
> wants to read a small piece from another file, should it have to wait
> until the writes are done to perform the read?
In practice you can have several processes competing for the same reads,
writes etc. I'd prefer to treat reads and writes as "belonging" to the
mapping rather than to any one process...
For synchronous operations, however, a pid (or equivalent) is fine...
Cheers,
Trond
On Mon, 1 Mar 2004 at 10:58, [email protected] wrote:
> I'm currently working on a scheme for throttling processes that keep on
> scheduling new writes while the VM is trying to free up memory.
OK... This is now ready for testing as part of the latest NFS4_ALL patch. See
http://www.fys.uio.no/~trondmy/src/Linux-2.6.x/2.6.3/linux-2.6.3-22-congestion.dif
Cheers,
Trond
On Wed, 3 Mar 2004 at 12:58, Shantanu Goel wrote:
> Attached is a patch that eliminates nfs_strategy() and just relies on
> pdflush to regulate dirty pages. It applies on top of the rpc throttle
> patch I emailed earlier.
See
http://www.fys.uio.no/~trondmy/src/Linux-2.6.x/2.6.3/linux-2.6.3-02-strategy.dif
which already does this...
Cheers,
Trond
On Wed, 3 Mar 2004 at 14:09, Shantanu Goel wrote:
> Will do. Is the rpc throttle patch acceptable now for inclusion?
I have a couple of fixups to do in order to be able to append it to the
NFS4_ALL patch, but otherwise it looks OK.
I haven't tested it yet, though...
A couple of minor gripes (which I'm in the process of fixing up):

- nfs_writepage_async() really doesn't need a "how" parameter.
  It only allocates an nfs_page().

- Several of those #defines really ought to be inline functions.

- Let's get rid of RPC_WAITQ_MARK_PRIORITY(), and instead make
  an rpc_init_priority_waitq(). Changing the value of
  queue->levels at any time after initialization looks really
  dangerous.

- Your use of task->tk_flags inside
  __rpc_remove_wait_queue_priority() looks unsafe. Recall that
  soft irqs may call that function, and if they do so while
  someone else is touching that variable, then information may get
  clobbered.
  In addition, that whole flag seems redundant. Just checking for
  list_empty(&task->tk_links) should suffice AFAICS...

- I really don't like rpc_cookie in the struct rpc_message. It
  has nothing to do with the actual RPC payload at all. Let's
  leave it in the rpc_task.

- A couple of list_del()+list_add*() combinations looked like
  candidates for list_move*() conversion.

- We really need to do something about "task_for_first()". I
  realize that isn't your fault (Neil introduced it IIRC) but it
  is very confusing, not to mention ugly, to be hiding an if
  statement inside a macro like that.

- I'm a bit worried about the RPC_PRIORITY_HIGH tasks hogging
  all the queues. I suggest we just make the high priority queue
  equivalent to the 2 other queues, but that we set queue->count
  to 8.
  If that doesn't work, we can go back to making it always be
  served.
Cheers,
Trond
On Wed, 3 Mar 2004 at 18:54, Shantanu Goel wrote:
> >
> > I haven't tested it yet, though...
>
> Trust me, it works... ;-)
ah... Famous last words 8-)
> > - nfs_writepage_async() really doesn't need
> >   a "how" parameter.
> >   It only allocates an nfs_page().
>
> I added it here because of nfs_wb_page. If the caller
> is kswapd, wouldn't we want the resultant
> nfs_sync_file to run at higher priority?
...but that call to nfs_wb_page() is in nfs_writepage(), not in
nfs_writepage_async().
> > - Several of those #defines really ought to
> >   be inline functions.
>
> Which ones?
Most of them. See Documentation/CodingStyle. The preference is for
inlined functions, with macros being acceptable only if you can really
prove that gcc makes a pig's ear of things.
The reason for this preference is that we want proper type checking of
arguments. Headers like sched.h are included in all sorts of places, so
you really do want that...
Again, a lot of the existing header files are pretty grimy, and do need
cleaning up w.r.t. this, but that's no reason to be introducing more
badness if we can avoid it.
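To illustrate the point with a trivial (made-up) example: rather than
an untyped macro like

#define RPC_WAITQ_IS_PRIORITY(q)	((q)->levels > 1)

you would write

static inline int rpc_waitq_is_priority(const struct rpc_wait_queue *q)
{
	return q->levels > 1;
}

so that the compiler checks the argument type and you get a proper
symbol when debugging.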
> You mentioned you are already in the process of
> cleaning up some of the above. Do you need me to
> submit anything else?
Nah... ...but I'll be posting a cleaned up version as soon as I'm done
integrating it with the rest. It would be nice if you could look over
it, and see if I haven't screwed up then...
Cheers,
Trond
> BTW, there was a one line nfs_rename fix I sent a
> while back. At the time 2.6.0 was in freeze so it
> couldn't be integrated. The issue was the NFS client
> does not refresh attributes after a rename causing
> some programs to spew out bogus errors. One such
> culprit is GNU tar.
that fix is appropriate for 2.4, but trond tells me
that his attribute cache fixes will address this issue
in 2.6.
On Wed, 3 Mar 2004 at 21:08, Lever, Charles wrote:
> > BTW, there was a one line nfs_rename fix I sent a
> > while back. At the time 2.6.0 was in freeze so it
> > couldn't be integrated. The issue was the NFS client
> > does not refresh attributes after a rename causing
> > some programs to spew out bogus errors. One such
> > culprit is GNU tar.
>
> that fix is appropriate for 2.4, but trond tells me
> that his attribute cache fixes will address this issue
> in 2.6.
...and if it doesn't I definitely want to know...
Note that the attribute caching patch is already queued up in Andrew's
2.6.4-rc1-mm2 patch...
Cheers,
Trond
On Wed, 3 Mar 2004 at 19:18, Trond Myklebust wrote:
> On Wed, 3 Mar 2004 at 18:54, Shantanu Goel wrote:
> > You mentioned you are already in the process of
> > cleaning up some of the above. Do you need me to
> > submit anything else?
>
> Nah... ...but I'll be posting a cleaned up version as soon as I'm done
> integrating it with the rest. It would be nice if you could look over
> it, and see if I haven't screwed up then...
>
OK. The patches have been cleaned up and should be ready for testing
now. The reason for the delay was that I had to track down a hang that
turned out to be caused by the congestion control patch that I posted.
It should be fixed now, I hope.
See http://www.fys.uio.no/~trondmy/src/Linux-2.6.x/2.6.4/
As usual, the NFS4_ALL patch contains the aggregate of all the
subpatches.
Cheers,
Trond
On Fri, 5 Mar 2004 at 21:16, Shantanu Goel wrote:
> I was looking through fs/nfs/write.c and noticed
> nfs_end_data_update() is only called when a write is
> successful. Is that intentional? I see the potential
> for the update count to never be decremented in case
> of errors. I noticed, for comparison, operations in
> dir.c call it regardless of the RPC's outcome. Am I
> missing something?
In nfs_writepage_async()? Yep, that's a bug... Thanks for spotting it!
Were there any other cases that I missed?
Cheers,
Trond
On Fri, 5 Mar 2004 at 22:56, Shantanu Goel wrote:
> > Were there any other cases that I missed?
>=20
> The only other one I can see is nfs_writepage_sync().=20
Doh... Right you are. I shouldn't be checking code on a full stomach...
> One more thing I just noticed in nfs_writepages(), the
> return status of nfs_wait_on_write_congestion() is not
> checked. Do we want to ignore signals here?
The return value will always be zero, since nfs_write_congestion() is
called with "intr" set to 0. My gut feeling here is that pdflush et al.
should ignore signals. If they don't then we will end up with a memory
leak.
I'm open to being persuaded that this might be wrong, though...
Cheers,
Trond