2006-08-25 04:36:27

by NeilBrown

Subject: Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.

On Monday August 21, [email protected] wrote:
>
> But these numbers are in no way tied to the hardware. It may be totally
> reasonable to have 3GiB of dirty data on one system, and it may be
> totally unreasonable to have 96MiB of dirty data on another. I've always
> thought that assuming any kind of reliable throttling at the queue level
> is broken and that the vm should handle this completely.

I keep changing my mind about this. Sometimes I see it that way,
sometimes it seems very sensible for throttling to happen at the
device queue.

Can I ask a question: Why do we have a 'nr_requests' maximum? Why
not just allocate request structures whenever a request is made?
Is there some reason relating to making the block layer work more
efficiently, or is it just because the VM requires it?


I'm beginning to think that the current scheme really works very well
- except for a few 'bugs'(*).
The one change that might make sense would be for the VM to be able to
tune the queue size of each backing dev. Exactly how that would work
I'm not sure, but the goal would be to get the sum of the active queue
sizes to about one half of dirty_threshold.
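
Exactly what that tuning might look like is open, but here is a very
rough sketch of the idea (every structure, name and number below is
invented for illustration; nothing like this interface exists today):

/* Hypothetical sketch only: split half of the global dirty_threshold
 * evenly across the active backing devices and convert each device's
 * byte share into a request-queue depth.  Structures, names and
 * numbers are all invented for illustration. */
#include <stdio.h>

struct bdi {
    const char *name;
    unsigned long avg_request_bytes;    /* observed mean request size */
    unsigned long nr_requests;          /* resulting queue depth */
};

static void rebalance(struct bdi *bdis, int n, unsigned long dirty_threshold)
{
    unsigned long share = (dirty_threshold / 2) / n;    /* bytes per device */

    for (int i = 0; i < n; i++) {
        unsigned long reqs = share / bdis[i].avg_request_bytes;

        bdis[i].nr_requests = reqs < 16 ? 16 : reqs;    /* keep a sane floor */
    }
}

int main(void)
{
    struct bdi devs[] = {
        { "fast-disk", 256 * 1024, 0 },    /* mostly large sequential IO */
        { "slow-usb",   16 * 1024, 0 },    /* mostly small random writes */
    };

    rebalance(devs, 2, 400UL * 1024 * 1024);    /* 400MiB dirty_threshold */

    for (int i = 0; i < 2; i++)
        printf("%s: nr_requests = %lu\n", devs[i].name, devs[i].nr_requests);
    return 0;
}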

The 'bugs' I am currently aware of are:
 - nfs doesn't put a limit on the request queue
 - the ext3 journal often writes out dirty data without clearing
   the Dirty flag on the page, so the nr_dirty count ends up wrong.
   ext3 writes the buffers out and marks them clean. So when
   the VM tries to flush a page, it finds all the buffers are clean
   and so marks the page clean, and the nr_dirty count eventually
   becomes correct again. But I think this can cause write throttling
   to be very unfair at times.

I think we need a queue limit on NFS requests.....

NeilBrown


2006-08-25 06:34:59

by Jens Axboe

Subject: Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.

On Fri, Aug 25 2006, Neil Brown wrote:
> On Monday August 21, [email protected] wrote:
> >
> > But these numbers are in no way tied to the hardware. It may be totally
> > reasonable to have 3GiB of dirty data on one system, and it may be
> > totally unreasonable to have 96MiB of dirty data on another. I've always
> > thought that assuming any kind of reliable throttling at the queue level
> > is broken and that the vm should handle this completely.
>
> I keep changing my mind about this. Sometimes I see it that way,
> sometimes it seems very sensible for throttling to happen at the
> device queue.
>
> Can I ask a question: Why do we have a 'nr_requests' maximum? Why
> not just allocate request structures whenever a request is made?
> Is there some reason relating to making the block layer work more
> efficiently, or is it just because the VM requires it?

It's by and large because the vm requires it. Historically the limit was
there because the requests were statically allocated. Later the limit
helped bound runtimes for the io scheduler, since the merge and sort
operations were O(N) each. Right now any of the io schedulers can
handle a larger number of requests without breaking a sweat, but the vm
gets pretty nasty if you set (eg) 8192 requests as your limit.

The limit is also handy for avoiding filling memory with request
structures. At some point there's little benefit to doing larger queues,
depending on the workload and hardware. 128 is usually a pretty fair
number, so...
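
For reference, nr_requests is already tunable per queue through sysfs,
so it's easy to experiment; a trivial userspace snippet (the device name
"sda" and the new value 512 are just examples) might look like:

/* Example only: read and then raise nr_requests for one block device.
 * The device name "sda" and the new value 512 are arbitrary. */
#include <stdio.h>

int main(void)
{
    const char *path = "/sys/block/sda/queue/nr_requests";
    unsigned long cur;
    FILE *f = fopen(path, "r");

    if (!f || fscanf(f, "%lu", &cur) != 1) {
        perror(path);
        return 1;
    }
    fclose(f);
    printf("current nr_requests: %lu\n", cur);

    f = fopen(path, "w");               /* needs root */
    if (!f) {
        perror(path);
        return 1;
    }
    fprintf(f, "512\n");
    fclose(f);
    return 0;
}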

> I'm beginning to think that the current scheme really works very well
> - except for a few 'bugs'(*).

It works ok, but it makes it hard to experiment with larger queue depths
when the vm falls apart :-). It's not a big deal, though, even if the
design isn't very nice - a single request is not a well defined entity: it
can be anywhere from 512b to megabyte(s) in size. So throttling on X
number of requests tends to be pretty vague and depends hugely on the
workload (random vs sequential IO).
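
To put rough numbers on that spread, a quick back-of-the-envelope
calculation with the two extremes mentioned above:

/* Rough bounds only: the same 128-request limit can represent
 * wildly different amounts of data depending on request size. */
#include <stdio.h>

int main(void)
{
    unsigned long nr_requests = 128;
    unsigned long small = 512;              /* one 512B random write */
    unsigned long large = 1024 * 1024;      /* one ~1MiB sequential request */

    printf("min: %lu KiB\n", nr_requests * small / 1024);           /* 64 KiB */
    printf("max: %lu MiB\n", nr_requests * large / (1024 * 1024));  /* 128 MiB */
    return 0;
}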

--
Jens Axboe

2006-08-25 13:17:04

by Trond Myklebust

Subject: Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.

On Fri, 2006-08-25 at 14:36 +1000, Neil Brown wrote:
> The 'bugs' I am currently aware of are:
> - nfs doesn't put a limit on the request queue
> - the ext3 journal often writes out dirty data without clearing
> the Dirty flag on the page, so the nr_dirty count ends up wrong.
> ext3 writes the buffers out and marks them clean. So when
> the VM tries to flush a page, it finds all the buffers are clean
> and so marks the page clean, and the nr_dirty count eventually
> becomes correct again. But I think this can cause write throttling
> to be very unfair at times.
>
> I think we need a queue limit on NFS requests.....

That is simply not happening until someone can give a cogent argument
for _why_ it is necessary. Such a cogent argument must, among other
things, allow us to determine what would be a sensible queue limit. It
should also point out _why_ the filesystem should be doing this instead
of the VM.

Furthermore, I'd like to point out that NFS has a "third" state for
pages: following an UNSTABLE write, the data on them is marked as
'uncommitted'. Such pages are tracked using the NR_UNSTABLE_NFS counter.
The question is: if we want to set limits on the write queue, what does
that imply for the uncommitted writes?
If you go back and look at the 2.4 NFS client, we actually had an
arbitrary queue limit. That limit covered the sum of writes+uncommitted
pages. Performance sucked, 'cos we were not able to use server side
caching efficiently. The number of COMMIT requests (which cause the server
to fsync() the client's data to disk) on the wire kept going through the
roof as we tried to free up pages in order to satisfy the hard limit.
For those reasons and others, the filesystem queue limit was removed for
2.6 in favour of allowing the VM to control the limits based on its
extra knowledge of the state of global resources.
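
For what it's worth, all three states are visible from userspace; a
small snippet that pulls the relevant counters out of /proc/meminfo
(using the field names as they appear in current kernels) could look
like:

/* Illustration: dump the Dirty, Writeback and NFS_Unstable counters
 * from /proc/meminfo to watch all three writeback states at once. */
#include <stdio.h>
#include <string.h>

int main(void)
{
    char line[128];
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f) {
        perror("/proc/meminfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "Dirty:", 6) ||
            !strncmp(line, "Writeback:", 10) ||
            !strncmp(line, "NFS_Unstable:", 13))
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}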

Trond

2006-08-28 01:28:56

by David Chinner

Subject: Re: RFC - how to balance Dirty+Writeback in the face of slow writeback.

On Fri, Aug 25, 2006 at 08:37:24AM +0200, Jens Axboe wrote:
> On Fri, Aug 25 2006, Neil Brown wrote:
>
> > I'm beginning to think that the current scheme really works very well
> > - except for a few 'bugs'(*).
>
> It works ok, but it makes it hard to experiment with larger queue depths
> when the vm falls apart :-). It's not a big deal, though, even if the
> design isn't very nice - a single request is not a well defined entity: it
> can be anywhere from 512b to megabyte(s) in size. So throttling on X
> number of requests tends to be pretty vague and depends hugely on the
> workload (random vs sequential IO).

So maybe we need a different control parameter - the amount of memory we
allow to be backed up in a queue rather than the number of requests the
queue can take...
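
In other words something like a per-queue byte budget. A very rough,
userspace-flavoured sketch of the accounting (all names here are made
up; a real version would obviously live in the block layer and the VM):

/* Hypothetical sketch: throttle on bytes queued per device rather than
 * on a request count.  All names are made up, and the pthread-based
 * blocking is only to illustrate the accounting. */
#include <pthread.h>

struct byte_queue {
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned long   queued_bytes;   /* bytes currently in flight */
    unsigned long   max_bytes;      /* per-queue memory budget */
};

/* Called by a submitter before queueing a request of 'len' bytes. */
void byte_queue_get(struct byte_queue *q, unsigned long len)
{
    pthread_mutex_lock(&q->lock);
    while (q->queued_bytes + len > q->max_bytes)
        pthread_cond_wait(&q->wait, &q->lock);  /* block until space frees */
    q->queued_bytes += len;
    pthread_mutex_unlock(&q->lock);
}

/* Called on request completion. */
void byte_queue_put(struct byte_queue *q, unsigned long len)
{
    pthread_mutex_lock(&q->lock);
    q->queued_bytes -= len;
    pthread_cond_broadcast(&q->wait);
    pthread_mutex_unlock(&q->lock);
}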

Cheers,

Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group