I have a question about the write throttling in
balance_dirty_pages in the face of slow writeback.
Suppose we have a filesystem where writeback is relatively slow -
e.g. NFS or EXTx over nbd over a slow link.
Suppose for the sake of simplicity that writeback is very slow and
doesn't progress at all for the first part of our experiment.
We write to a large file.
Balance_dirty_pages gets called periodically. Until the number of
Dirty pages reached 40% of memory it does nothing.
Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
and these Dirty pages get converted to Writeback pages. This happens
at 1.5 times the speed that dirty pages are created (due to
sync_writeback_pages()). So for every 100K that we dirty, 150K gets
converted to writeback. But balance_dirty_pages doesn't wait for anything.
This will result in the number of dirty pages going down steadily, and
the number of writeback pages increasing quickly (3 times the speed of
the drop in Dirty). The total of Dirty+Writeback will keep growing.
When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
balance_dirty_pages will no longer be able to flush the full
'write_chunk' (1.5 times number of recent dirtied pages) and so will
spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
a busy loop, but it won't progress.
Now our very slow writeback gets its act together and starts making
some progress and the Writeback number steadily drops down to 40%.
At this point balance_dirty_pages will exit, more pages will get
dirtied, and balance_dirty_pages will quickly flush them out again.
The steady state will be with Dirty at or close to 0, and Writeback at
or close to 40%.
Now obviously this is somewhat idealised, and even slow writeback will
make some progress early on, but you can still expect to get a very
large Writeback with a very small Dirty before stabilising.
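To make the rates in the experiment concrete, here is a toy user-space model of the phase where writeback makes no progress. It is only a sketch under the assumptions stated above: `wb_state`, `wb_step` and `wb_run` are made-up names, nothing here is kernel code, and throttling is modelled as starting once Dirty+Writeback reaches the threshold.

```c
#include <assert.h>

/* Toy model only: page counts in arbitrary units, no real kernel state. */
struct wb_state {
	long dirty;      /* pages flagged Dirty */
	long writeback;  /* pages flagged Writeback (never completes here) */
};

/*
 * One balance step: the writer dirties `dirtied` pages; once
 * Dirty+Writeback has reached `thresh`, up to 1.5 * dirtied pages are
 * moved from Dirty to Writeback.  `dirtied` must be even so that the
 * 1.5x chunk is an exact integer.
 */
void wb_step(struct wb_state *s, long dirtied, long thresh)
{
	long chunk, moved;

	assert(dirtied >= 0 && dirtied % 2 == 0);
	s->dirty += dirtied;
	if (s->dirty + s->writeback < thresh)
		return;			/* below the threshold: do nothing */
	chunk = dirtied + dirtied / 2;	/* the 1.5x write chunk */
	moved = s->dirty < chunk ? s->dirty : chunk;
	s->dirty -= moved;
	s->writeback += moved;
}

/* Run `steps` consecutive balance steps. */
void wb_run(struct wb_state *s, long steps, long dirtied, long thresh)
{
	while (steps-- > 0)
		wb_step(s, dirtied, thresh);
}
```

With a threshold of 400 units and 2 units dirtied per step, each step past the threshold lowers Dirty by 1 and raises Writeback by 3, so the total keeps growing even though Dirty is falling.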
I don't think we want this, but I'm not sure what we do want, so I'm
asking for opinions.
I don't think that pushing Dirty down to zero is the best thing to
do. If writeback is slow, we should be simply waiting for writeback
to progress rather than putting more work into the writeback queue.
This also allows pages to stay 'dirty' for longer which is generally
considered to be a good thing.
I think we need to have 2 numbers. One that is the limit of dirty
pages, and one that is the limit of the combined dirty+writeback.
Alternately it could simply be a limit on writeback.
Probably the latter because having a very large writeback number makes
the 'inactive_list' of pages very large and so it takes a long time
to scan.
So suppose dirty were capped at vm_dirty_ratio, and writeback were
capped at that too, though independently.
Then in our experiment, Dirty would grow up to 40%, then
balance_dirty_pages would start flushing and Writeback would grow to
40% while Dirty stayed at 40%. Then balance_dirty_pages would not
flush anything but would just wait for Writeback to drop below 40%.
You would get a very obvious steady state of 40% dirty and
40% Writeback.
Is this too much memory? 80% tied up in what are essentially dirty
blocks is more than you would expect when setting vm.dirty_ratio to
40.
Maybe 40% should limit Dirty+Writeback and when we cross the
threshold:
if Dirty > Writeback - flush and wait
if Dirty < Writeback - just wait
bdflush should get some writeback underway before we hit the 40%, so
balance_dirty_pages shouldn't find itself waiting for the pages it
just flushed.
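The rule in that last suggestion can be written as a small decision function. This is a sketch only: the enum and `throttle_decide` are hypothetical names, and the Dirty == Writeback case (which the two-line rule above leaves open) is treated as "just wait", matching the `>` comparison in the patch below.

```c
#include <assert.h>

/* Hypothetical names, for illustration only. */
enum throttle_action {
	THROTTLE_NONE,            /* under the threshold: let the writer go */
	THROTTLE_FLUSH_AND_WAIT,  /* more Dirty than Writeback: push IO, then wait */
	THROTTLE_WAIT             /* Writeback dominates: just wait for it */
};

/*
 * A single threshold limits Dirty+Writeback.  Above it, flush only
 * while Dirty exceeds Writeback; otherwise just wait.
 */
enum throttle_action throttle_decide(unsigned long dirty,
				     unsigned long writeback,
				     unsigned long thresh)
{
	assert(thresh > 0);
	if (dirty + writeback <= thresh)
		return THROTTLE_NONE;
	if (dirty > writeback)
		return THROTTLE_FLUSH_AND_WAIT;
	return THROTTLE_WAIT;	/* includes the dirty == writeback case */
}
```

With a threshold of 40, (Dirty, Writeback) = (30, 15) would flush and wait, while (5, 40) would only wait.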
Suggestions? Opinions?
The following patch demonstrates the last suggestion.
Thanks,
NeilBrown
Signed-off-by: Neil Brown <[email protected]>
### Diffstat output
./mm/page-writeback.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff .prev/mm/page-writeback.c ./mm/page-writeback.c
--- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
+++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
@@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
* written to the server's write cache, but has not yet
* been flushed to permanent storage.
*/
- if (nr_reclaimable) {
+ if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
writeback_inodes(&wbc);
get_dirty_limits(&background_thresh,
&dirty_thresh, mapping);
@@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
<= dirty_thresh)
break;
pages_written += write_chunk - wbc.nr_to_write;
- if (pages_written >= write_chunk)
- break; /* We've done our duty */
}
blk_congestion_wait(WRITE, HZ/10);
}
On Tue, 15 Aug 2006 09:40:12 +1000
Neil Brown <[email protected]> wrote:
>
> I have a question about the write throttling in
> balance_dirty_pages in the face of slow writeback.
btw, we have a problem in there at present when you're using a combination of
slow devices and fast devices. That worked OK in 2.5.x, iirc, but seems to
have gotten broken since.
> Suppose we have a filesystem where writeback is relatively slow -
> e.g. NFS or EXTx over nbd over a slow link.
>
> Suppose for the sake of simplicity that writeback is very slow and
> doesn't progress at all for the first part of our experiment.
>
> We write to a large file.
> Balance_dirty_pages gets called periodically. Until the number of
> Dirty pages reached 40% of memory it does nothing.
>
> Once we hit 40%, balance_dirty_pages starts calling writeback_inodes
> and these Dirty pages get converted to Writeback pages. This happens
> at 1.5 times the speed that dirty pages are created (due to
> sync_writeback_pages()). So for every 100K that we dirty, 150K gets
> converted to writeback. But balance_dirty_pages doesn't wait for anything.
>
> This will result in the number of dirty pages going down steadily, and
> the number of writeback pages increasing quickly (3 times the speed of
> the drop in Dirty). The total of Dirty+Writeback will keep growing.
>
> When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> balance_dirty_pages will no longer be able to flush the full
> 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> a busy loop, but it won't progress.
This assumes that the queues are unbounded. They're not - they're limited
to 128 requests, which is 60MB or so.
Per queue. The scenario you identify can happen if it's spread across
multiple disks simultaneously.
CFQ used to have 1024 requests and we did have problems with excessive
numbers of writeback pages. I fixed that in 2.6.early, but that seems to
have got lost as well.
> Now our very slow writeback gets its act together and starts making
> some progress and the Writeback number steadily drops down to 40%.
> At this point balance_dirty_pages will exit, more pages will get
> dirtied, and balance_dirty_pages will quickly flush them out again.
>
> The steady state will be with Dirty at or close to 0, and Writeback at
> or close to 40%.
>
> Now obviously this is somewhat idealised, and even slow writeback will
> make some progress early on, but you can still expect to get a very
> large Writeback with a very small Dirty before stabilising.
>
> I don't think we want this, but I'm not sure what we do want, so I'm
> asking for opinions.
>
> I don't think that pushing Dirty down to zero is the best thing to
> do. If writeback is slow, we should be simply waiting for writeback
> to progress rather than putting more work into the writeback queue.
> This also allows pages to stay 'dirty' for longer which is generally
> considered to be a good thing.
>
> I think we need to have 2 numbers. One that is the limit of dirty
> pages, and one that is the limit of the combined dirty+writeback.
> Alternately it could simply be a limit on writeback.
> Probably the latter because having a very large writeback number makes
> the 'inactive_list' of pages very large and so it takes a long time
> to scan.
> So suppose dirty were capped at vm_dirty_ratio, and writeback were
> capped at that too, though independently.
>
> Then in our experiment, Dirty would grow up to 40%, then
> balance_dirty_pages would start flushing and Writeback would grow to
> 40% while Dirty stayed at 40%. Then balance_dirty_pages would not
> flush anything but would just wait for Writeback to drop below 40%.
> You would get a very obvious steady state of 40% dirty and
> 40% Writeback.
>
> Is this too much memory? 80% tied up in what are essentially dirty
> blocks is more than you would expect when setting vm.dirty_ratio to
> 40.
>
> Maybe 40% should limit Dirty+Writeback and when we cross the
> threshold:
> if Dirty > Writeback - flush and wait
> if Dirty < Writeback - just wait
>
> bdflush should get some writeback underway before we hit the 40%, so
> balance_dirty_pages shouldn't find itself waiting for the pages it
> just flushed.
>
> Suggestions? Opinions?
>
> The following patch demonstrates the last suggestion.
>
> Thanks,
> NeilBrown
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./mm/page-writeback.c | 4 +---
> 1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff .prev/mm/page-writeback.c ./mm/page-writeback.c
> --- .prev/mm/page-writeback.c 2006-08-15 09:36:23.000000000 +1000
> +++ ./mm/page-writeback.c 2006-08-15 09:39:17.000000000 +1000
> @@ -207,7 +207,7 @@ static void balance_dirty_pages(struct a
> * written to the server's write cache, but has not yet
> * been flushed to permanent storage.
> */
> - if (nr_reclaimable) {
> + if (nr_reclaimable > global_page_state(NR_WRITEBACK)) {
> writeback_inodes(&wbc);
> get_dirty_limits(&background_thresh,
> &dirty_thresh, mapping);
> @@ -218,8 +218,6 @@ static void balance_dirty_pages(struct a
> <= dirty_thresh)
> break;
> pages_written += write_chunk - wbc.nr_to_write;
> - if (pages_written >= write_chunk)
> - break; /* We've done our duty */
> }
> blk_congestion_wait(WRITE, HZ/10);
> }
Something like that - it'll be relatively simple.
On Tue, Aug 15, 2006 at 01:06:11AM -0700, Andrew Morton wrote:
> On Tue, 15 Aug 2006 09:40:12 +1000
> Neil Brown <[email protected]> wrote:
> > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > balance_dirty_pages will no longer be able to flush the full
> > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > a busy loop, but it won't progress.
>
> This assumes that the queues are unbounded. They're not - they're limited
> to 128 requests, which is 60MB or so.
>
> Per queue. The scenario you identify can happen if it's spread across
> multiple disks simultaneously.
Though in this situation, you don't usually have slow writeback problems.
I haven't seen any recent problems with insufficient throttling on this
sort of configuration.
> CFQ used to have 1024 requests and we did have problems with excessive
> numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> have got lost as well.
CFQ still has a queue depth of 128 requests....
> > bdflush should get some writeback underway before we hit the 40%, so
> > balance_dirty_pages shouldn't find itself waiting for the pages it
> > just flushed.
balance_dirty_pages() already kicks the background writeback done
by pdflush when dirty > dirty_background_ratio (10%).
IMO, if you've got slow writeback, you should be reducing the amount
of dirty memory you allow in the machine so that you don't tie up
large amounts of memory that takes a long time to clean. Throttle earlier
and you avoid this problem entirely.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Tuesday August 15, [email protected] wrote:
> > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > balance_dirty_pages will no longer be able to flush the full
> > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > a busy loop, but it won't progress.
>
> This assumes that the queues are unbounded. They're not - they're limited
> to 128 requests, which is 60MB or so.
Ahhh... so the limit on the requests-per-queue is an important part of
write-throttling behaviour. I didn't know that, thanks.
fs/nfs doesn't seem to impose a limit. It will just allocate as many
as you ask for until you start running out of memory. I've seen 60%
of memory (10 out of 16Gig) in writeback for NFS.
Maybe I should look there to address my current issue, though imposing
a system-wide writeback limit seems safer.
>
> Per queue. The scenario you identify can happen if it's spread across
> multiple disks simultaneously.
>
> CFQ used to have 1024 requests and we did have problems with excessive
> numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> have got lost as well.
>
What would you say constitutes "excessive"? Is there any sense in
which some absolute number is excessive (as it takes too long to scan
some list) or is it just a percent-of-memory thing?
>
> Something like that - it'll be relatively simple.
Unfortunately I think it is also relatively simple to get it badly
wrong:-) Make one workload fast, and another slower.
But thanks, you've been very helpful (as usual). I'll ponder it a bit
longer and see what turns up.
NeilBrown
On Wednesday August 16, [email protected] wrote:
>
> IMO, if you've got slow writeback, you should be reducing the amount
> of dirty memory you allow in the machine so that you don't tie up
> large amounts of memory that takes a long time to clean. Throttle earlier
> and you avoid this problem entirely.
I completely agree that 'throttle earlier' is important. I'm just not
completely sure what should be throttled when.
I think I could argue that pages in 'Writeback' are really still
dirty. The difference is really just an implementation issue.
So when the dirty_ratio is set to 40%, that should apply to all
'dirty' pages, which means both those flagged as 'Dirty' and those
flagged as 'Writeback'.
So I think you need to throttle when Dirty+Writeback hits dirty_ratio
(which we don't quite get right at the moment). But the trick is to
throttle gently and fairly, rather than having a hard wall so that any
one who hits it just stops.
Thanks,
NeilBrown
On Thu, 17 Aug 2006 14:08:58 +1000
Neil Brown <[email protected]> wrote:
> So I think you need to throttle when Dirty+Writeback hits dirty_ratio
yup.
> (which we don't quite get right at the moment). But the trick is to
> throttle gently and fairly, rather than having a hard wall so that any
> one who hits it just stops.
I swear, I had all this working in 2001. Perhaps I dreamed it. But I
specifically remember testing that processes which were performing small,
occasional writes were not getting blocked behind the activity of other
processes which were doing massive write()s. Ho hum, not to worry.
I guess a robust approach would be to track, on a per-process,
per-threadgroup, per-user, etc basis the time-averaged page-dirtying rate.
If it is "low" then accept the dirtying. If it is "high" then this process
is a heavy writer and needs throttling earlier. Up to a point - at some
level we'll need to throttle everyone as a safety net if nothing else.
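A rough user-space sketch of what such rate tracking could look like follows. Everything in it is an assumption made for illustration: the names, the 1/8 smoothing weight, and the choice to throttle heavy writers at half the threshold are not taken from any existing code.

```c
#include <assert.h>

/* Hypothetical per-task state: smoothed pages dirtied per period. */
struct dirty_rate {
	unsigned long avg;
};

/*
 * Integer exponential moving average: the new sample gets weight 1/8,
 * the history gets weight 7/8.  Called once per accounting period.
 */
void dirty_rate_update(struct dirty_rate *r, unsigned long pages_this_period)
{
	r->avg = (r->avg * 7 + pages_this_period) / 8;
}

/*
 * Heavy writers (avg >= heavy_rate) are throttled at half the
 * threshold; everyone is throttled at the full threshold as the
 * safety net.  The factor of one half is an arbitrary choice.
 */
int should_throttle(const struct dirty_rate *r, unsigned long heavy_rate,
		    unsigned long global_dirty, unsigned long thresh)
{
	unsigned long limit;

	assert(thresh > 0);
	limit = (r->avg >= heavy_rate) ? thresh / 2 : thresh;
	return global_dirty > limit;
}
```

The point of the sketch is only that an occasional writer keeps a low average and so is not blocked until the global safety net is reached, while a heavy writer hits its limit sooner.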
Something like that covers the global dirty+writeback problem. The other
major problem space is the multiple-backing-device problem:
a) One device is being written to heavily, another lightly
b) One device is fast, another is slow.
Thus far, the limited size of the request queues has saved us from really,
really serious problems. But that doesn't work when lots of disks are
being used. To solve this properly we'd need to account for
dirty+writeback(+unstable?) pages on a per-backing-dev basis.
But as a first step, yes, using dirty+writeback for the throttling
threshold and continuing to rely upon limited request queue size to save us
from disaster would be a good step.
btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
which are stuck in blk_congestion_wait() when NFS has retired a bunch of
writes. It should do so, otherwise NFS write-intensive workloads might end
up sleeping for too long. I guess the amount of buffering and hysteresis
we have in there has thus far prevented any problems from being observed.
On Thu, 17 Aug 2006 13:59:41 +1000
Neil Brown <[email protected]> wrote:
> > CFQ used to have 1024 requests and we did have problems with excessive
> > numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> > have got lost as well.
> >
>
> What would you say constitutes "excessive"? Is there any sense in
> which some absolute number is excessive (as it takes too long to scan
> some list) or is it just a percent-of-memory thing?
Excessive = 100% of memory dirty or under writeback against a single disk
on a 512MB machine. Perhaps that problem just got forgotten about when CFQ
went from 1024 requests down to 128. (That 128 was actually
64-available-for-read+64-available-for-write, so it's really 64 requests).
> >
> > Something like that - it'll be relatively simple.
>
> Unfortunately I think it is also relatively simple to get it badly
> wrong:-) Make one workload fast, and another slower.
>
I think it's unlikely in this case. As long as we keep the queues
reasonably full, the disks will be running flat-out and merging will be as
good as we're going to get.
One thing one does have to watch out for is the many-disks scenario: do
concurrent dd's onto 12 disks and make sure that none of their LEDs go
out. This is actually surprisingly hard to do, but it would be very hard
to do worse than 2.4.x ;)
On Wed, Aug 16 2006, Andrew Morton wrote:
> On Thu, 17 Aug 2006 13:59:41 +1000
> Neil Brown <[email protected]> wrote:
>
> > > CFQ used to have 1024 requests and we did have problems with excessive
> > > numbers of writeback pages. I fixed that in 2.6.early, but that seems to
> > > have got lost as well.
> > >
> >
> > What would you say constitutes "excessive"? Is there any sense in
> > which some absolute number is excessive (as it takes too long to scan
> > some list) or is it just a percent-of-memory thing?
>
> Excessive = 100% of memory dirty or under writeback against a single disk
> on a 512MB machine. Perhaps that problem just got forgotten about when CFQ
> went from 1024 requests down to 128. (That 128 was actually
> 64-available-for-read+64-available-for-write, so it's really 64 requests).
That's not quite true, if you set nr_requests to 128 that's 128 for
reads and 128 for writes. With the batching you will actually typically
see 128 * 3 / 2 == 192 requests allocated. Which translates to about
96MiB of dirty data on the queue, if everything works smoothly. The 3/2
limit is quite new, before I introduced that, if you had a lot of writes
each of them would be allowed 16 requests over the limit. So you would
sometimes see huge queues, as with just eg 16 writes, you could have 128
+ 16*16 requests allocated.
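For reference, the arithmetic behind those figures can be spelled out as below. The helper names are hypothetical, and the 512KiB-per-request size is not stated anywhere above; it is inferred from 192 requests corresponding to about 96MiB.

```c
#include <assert.h>

/* With batching, allocation can run to 3/2 of nr_requests. */
unsigned long batched_request_cap(unsigned long nr_requests)
{
	return nr_requests * 3 / 2;
}

/* Older behaviour: each writer could go 16 requests over the limit. */
unsigned long old_request_worst_case(unsigned long nr_requests,
				     unsigned long writers)
{
	return nr_requests + writers * 16;
}

/* Data held on the queue in KiB, for an assumed request size in KiB. */
unsigned long queue_kib(unsigned long requests, unsigned long kib_per_request)
{
	assert(kib_per_request > 0);
	return requests * kib_per_request;
}
```

So nr_requests = 128 gives a cap of 192 requests, or 98304KiB (96MiB) at the assumed 512KiB per request, and the older scheme with 16 writers could reach 384 requests.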
I've always been of the opinion that the vm should handle all of this,
and things should not change or break if I set 10000 as the request
limit. A rate-of-dirtying throttling per process sounds like a really
good idea, we badly need to prevent the occasional write (like a process
doing sync reads, and getting stuck in slooow reclaim) from being
throttled in the presence of a heavy dirtier.
--
Jens Axboe
On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> writes. It should do so, otherwise NFS write-intensive workloads might end
> up sleeping for too long. I guess the amount of buffering and hysteresis
> we have in there has thus far prevented any problems from being observed.
Are we to understand it that you consider blk_congestion_wait() to be an
official API, and not just another block layer hack inside the VM?
'cos currently the only tools for waking up processes in
blk_congestion_wait() are the two routines:
static void clear_queue_congested(request_queue_t *q, int rw)
and
static void set_queue_congested(request_queue_t *q, int rw)
in block/ll_rw_blk.c. Hardly a model of well thought out code...
Trond
On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> On Tuesday August 15, [email protected] wrote:
> > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > balance_dirty_pages will no longer be able to flush the full
> > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > a busy loop, but it won't progress.
> >
> > This assumes that the queues are unbounded. They're not - they're limited
> > to 128 requests, which is 60MB or so.
>
> Ahhh... so the limit on the requests-per-queue is an important part of
> write-throttling behaviour. I didn't know that, thanks.
>
> fs/nfs doesn't seem to impose a limit. It will just allocate as many
> as you ask for until you start running out of memory. I've seen 60%
> of memory (10 out of 16Gig) in writeback for NFS.
>
> Maybe I should look there to address my current issue, though imposing
> a system-wide writeback limit seems safer.
Exactly how would a request limit help? All that boils down to is having
the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
Cheers,
Trond
On Thu, 17 Aug 2006 08:36:19 -0400
Trond Myklebust <[email protected]> wrote:
> On Wed, 2006-08-16 at 23:14 -0700, Andrew Morton wrote:
> > btw, one thing which afaik NFS _still_ doesn't do is to wake up processes
> > which are stuck in blk_congestion_wait() when NFS has retired a bunch of
> > writes. It should do so, otherwise NFS write-intensive workloads might end
> > up sleeping for too long. I guess the amount of buffering and hysteresis
> > we have in there has thus far prevented any problems from being observed.
>
> Are we to understand it that you consider blk_congestion_wait() to be an
> official API, and not just another block layer hack inside the VM?
>
> 'cos currently the only tools for waking up processes in
> blk_congestion_wait() are the two routines:
>
> static void clear_queue_congested(request_queue_t *q, int rw)
> and
> static void set_queue_congested(request_queue_t *q, int rw)
>
> in block/ll_rw_blk.c. Hardly a model of well thought out code...
>
We've been over this before...
Take a look at blk_congestion_wait(). It doesn't know about request
queues. We'd need a new
	void writeback_congestion_end(int rw)
	{
		wake_up(&congestion_wqh[rw]);
	}
or similar.
On Thu, 17 Aug 2006 09:21:51 -0400
Trond Myklebust <[email protected]> wrote:
> On Thu, 2006-08-17 at 13:59 +1000, Neil Brown wrote:
> > On Tuesday August 15, [email protected] wrote:
> > > > When Dirty hits 0 (and Writeback is theoretically 80% of RAM)
> > > > balance_dirty_pages will no longer be able to flush the full
> > > > 'write_chunk' (1.5 times number of recent dirtied pages) and so will
> > > > spin in a loop calling blk_congestion_wait(WRITE, HZ/10), so it isn't
> > > > a busy loop, but it won't progress.
> > >
> > > This assumes that the queues are unbounded. They're not - they're limited
> > > to 128 requests, which is 60MB or so.
> >
> > Ahhh... so the limit on the requests-per-queue is an important part of
> > write-throttling behaviour. I didn't know that, thanks.
> >
> > fs/nfs doesn't seem to impose a limit. It will just allocate as many
> > as you ask for until you start running out of memory. I've seen 60%
> > of memory (10 out of 16Gig) in writeback for NFS.
> >
> > Maybe I should look there to address my current issue, though imposing
> > a system-wide writeback limit seems safer.
>
> Exactly how would a request limit help? All that boils down to is having
> the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
>
I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
devices are doing so, we could get in a situation where NFS hogs all of the
fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
block-device-based writeback.
Perhaps. The top-level poll-the-superblocks writeback loop might tend to
prevent that from happening. But if applications were doing a lot of
superblock-specific writeback (fdatasync,
sync_file_range(SYNC_FILE_RANGE_WRITE), etc) then unfairness might occur.
On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> On Thu, 17 Aug 2006 09:21:51 -0400
> Trond Myklebust <[email protected]> wrote:
> > Exactly how would a request limit help? All that boils down to is having
> > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
> >
>
> I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> devices are doing so, we could get in a situation where NFS hogs all of the
> fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> block-device-based writeback.
Since NFS has no control over NR_DIRTY, how does controlling
NR_WRITEBACK help? The only resource that NFS shares with the block
device writeout queues is memory.
IOW: The resource that needs to be controlled is the dirty pages, not
the write-out queue. Unless you can throttle back on the creation of
dirty NFS pages in the first place, then the potential for unfairness
will exist.
Trond
On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> Take a look at blk_congestion_wait(). It doesn't know about request
> queues. We'd need a new
>
> void writeback_congestion_end(int rw)
> {
> wake_up(congestion_wqh[rw]);
> }
>
> or similar.
...and how often do you want us to call this? NFS doesn't know much
about request queues either: it writes out pages on a per-RPC call
basis. In the worst case that could mean waking up the VM every time we
write out a single page.
Cheers,
Trond
On Thu, Aug 17, 2006 at 02:08:58PM +1000, Neil Brown wrote:
> On Wednesday August 16, [email protected] wrote:
> >
> > IMO, if you've got slow writeback, you should be reducing the amount
> > of dirty memory you allow in the machine so that you don't tie up
> > large amounts of memory that takes a long time to clean. Throttle earlier
> > and you avoid this problem entirely.
>
> I completely agree that 'throttle earlier' is important. I'm just not
> completely sure what should be throttled when.
>
> I think I could argue that pages in 'Writeback' are really still
> dirty. The difference is really just an implementation issue.
No argument here - I think you're right, Neil.
> So when the dirty_ratio is set to 40%, that should apply to all
> 'dirty' pages, which means both those flagged as 'Dirty' and those
> flagged as 'Writeback'.
Don't forget NFS client unstable pages.
FWIW, with writeback not being accounted as dirty, there is a window
in the NFS client where a page during writeback is not dirty or
unstable and hence not visible to the throttle. Hence if we have
lots of outstanding async writes to NFS servers, or their I/O
completion is held off, the throttle won't activate where it should
and will potentially let too many pages get dirtied.
This may not be a major problem with the traditional small write
sizes, but with 1MB I/Os this could be a fairly large number of
pages that are unaccounted for a short period of time.
> So I think you need to throttle when Dirty+Writeback hits dirty_ratio
> (which we don't quite get right at the moment). But the trick is to
> throttle gently and fairly, rather than having a hard wall so that any
> one who hits it just stops.
I disagree with the "throttle gently" bit there. If a process is
writing faster than the underlying storage can write, then you have
to stop the process in its tracks while the storage catches up.
Especially if other processes are writing to the same device. You
may as well just hit it with a big hammer because it's simple and
pretty effective.
Besides, it is difficult to be gentle when you can dirty memory at
least an order of magnitude faster than you can clean it.
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Wed, Aug 16, 2006 at 11:14:48PM -0700, Andrew Morton wrote:
>
> I guess a robust approach would be to track, on a per-process,
> per-threadgroup, per-user, etc basis the time-averaged page-dirtying rate.
> If it is "low" then accept the dirtying. If it is "high" then this process
> is a heavy writer and needs throttling earlier. Up to a point - at some
> level we'll need to throttle everyone as a safety net if nothing else.
The problem with that approach is that throttling a large writer
forces data to disk earlier and that may be undesirable - the large
file might be a temp file that will soon be unlinked, and in this case
you don't want it throttled. Right now, you set dirty*ratio high enough
that this doesn't happen, and the file remains memory resident until
unlink.
> Something like that covers the global dirty+writeback problem. The other
> major problem space is the multiple-backing-device problem:
>
> a) One device is being written to heavily, another lightly
>
> b) One device is fast, another is slow.
Once we are past the throttling threshold, the only thing that
matters is whether we can write more data to the backing device(s).
We should not really be allowing the input rate to exceed the output
rate once we are past the throttle threshold.
> Thus far, the limited size of the request queues has saved us from really,
> really serious problems. But that doesn't work when lots of disks are
> being used.
Mainly because it increases the number of pages under writeback that
currently aren't accounted as dirty and the throttle doesn't
kick in when it should.
> To solve this properly we'd need to account for
> dirty+writeback(+unstable?) pages on a per-backing-dev basis.
We'd still need to account for them globally because we still need
to be able to globally limit the amount of dirty data in the
machine.
FYI, I implemented a complex two-stage throttle on Irix a couple of
years ago - it uses a per-device soft throttle threshold that is not
enforced until the global dirty state passes a configurable limit.
At that point, the per-device limits are enforced.
This meant that devices with no dirty state attached to them could
continue to dirty pages up to their soft-threshold, whereas heavy
writers would be stopped until their backing devices fell back below
the soft thresholds.
Because the amount of dirty pages could continue to grow past safe
limits if you had enough devices, there is also a global hard limit
that cannot be exceeded and this throttles all incoming write
requests regardless of the state of the device it was being written
to.
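In rough outline, the decision that throttle makes looks something like the sketch below. This is not the Irix code: the struct, the function name and the exact boundary comparisons are assumptions made for illustration only.

```c
#include <assert.h>

/* Hypothetical per-device accounting. */
struct dev_state {
	unsigned long dirty;        /* dirty pages against this device */
	unsigned long soft_thresh;  /* per-device soft throttle threshold */
};

/*
 * Returns 1 if a writer to `dev` should be blocked.
 *  - At or above the global hard limit, every writer is blocked.
 *  - Below the global soft limit, per-device thresholds are not enforced.
 *  - In between, only writers to devices over their soft threshold block.
 */
int two_stage_should_block(const struct dev_state *dev,
			   unsigned long global_dirty,
			   unsigned long global_soft,
			   unsigned long global_hard)
{
	assert(global_soft <= global_hard);
	if (global_dirty >= global_hard)
		return 1;
	if (global_dirty < global_soft)
		return 0;
	return dev->dirty >= dev->soft_thresh;
}
```

A device with little dirty state keeps accepting writes between the two global limits, while a heavily written device is held until it drops back under its soft threshold.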
The problem with this approach is that the code was complex and
difficult to test properly. Also, working out the default config
values was an exercise in trial, error, workload measurement and
guesswork that took some time to get right.
The current linux code works as well as that two-stage throttle
(better in some cases!) because of one main thing - bounded request
queue depth with feedback into the throttling control loop. Irix
has neither of these so the throttle had to provide this accounting
and limiting (soft throttle threshold).
Hence I'm not sure that per-backing-device accounting and making
decisions based on that accounting is really going to buy us much
apart from additional complexity....
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
On Thu, 17 Aug 2006 12:18:52 -0400
Trond Myklebust <[email protected]> wrote:
> On Thu, 2006-08-17 at 08:30 -0700, Andrew Morton wrote:
> > On Thu, 17 Aug 2006 09:21:51 -0400
> > Trond Myklebust <[email protected]> wrote:
> > > Exactly how would a request limit help? All that boils down to is having
> > > the VM monitor global_page_state(NR_FILE_DIRTY) versus monitoring
> > > global_page_state(NR_FILE_DIRTY)+global_page_state(NR_WRITEBACK).
> > >
> >
> > I assume that if NFS is not limiting its NR_WRITEBACK consumption and block
> > devices are doing so, we could get in a situation where NFS hogs all of the
> > fixed-size NR_DIRTY+NR_WRITEBACK resource at the expense of concurrent
> > block-device-based writeback.
>
> Since NFS has no control over NR_DIRTY, how does controlling
> NR_WRITEBACK help? The only resource that NFS shares with the block
> device writeout queues is memory.
Block devices have a limit on the amount of IO which they will queue. NFS
doesn't.
> IOW: The resource that needs to be controlled is the dirty pages, not
> the write-out queue. Unless you can throttle back on the creation of
> dirty NFS pages in the first place, then the potential for unfairness
> will exist.
Please read the whole thread - we're violently agreeing.
On Thu, 17 Aug 2006 12:22:59 -0400
Trond Myklebust <[email protected]> wrote:
> On Thu, 2006-08-17 at 08:14 -0700, Andrew Morton wrote:
> > Take a look at blk_congestion_wait(). It doesn't know about request
> > queues. We'd need a new
> >
> > void writeback_congestion_end(int rw)
> > {
> > 	wake_up(&congestion_wqh[rw]);
> > }
> >
> > or similar.
>
> ...and how often do you want us to call this? NFS doesn't know much
> about request queues either: it writes out pages on a per-RPC call
> basis. In the worst case that could mean waking up the VM every time we
> write out a single page.
>
Once per page would work OK, but we'd save some CPU by making it less
frequent.
This stuff isn't very precise. We could make it precise, but it would
require a really large amount of extra locking, extra locks, etc.
The way this code all works is pretty crude and simple: a process comes
in to do some writeback and it enters a polling loop:
	while (we need to do writeback) {
		for (each superblock) {
			if (the superblock's backing_dev isn't congested) {
				stuff some more IO down it()
			}
		}
		take_a_nap();
	}
so the process remains captured in that polling loop until the
dirty-memory-exceed condition subsides. The reason why we avoid
congested queues is so that one thread can keep multiple queues busy: we
don't want to allow writing threads to get stuck on a single queue and
we don't want to have to provision one pdflush per spindle (or, more
precisely, per backing_dev_info).
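One pass of that polling loop can be sketched in userspace C as follows. This is only an illustration: fake_bdi and writeback_pass are hypothetical names, the "congested" flag is set by hand rather than by a real queue, and the nap between passes is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a backing device; not the kernel's backing_dev_info. */
struct fake_bdi {
    bool congested;       /* queue is currently too full to accept more IO */
    unsigned long queued; /* pages handed to this device so far */
};

/*
 * One pass over all devices: push up to 'chunk' pages at each uncongested
 * device and skip congested ones. Returns the number of pages still to write.
 */
unsigned long writeback_pass(struct fake_bdi *devs, size_t ndevs,
                             unsigned long to_write, unsigned long chunk)
{
    assert(chunk > 0);
    for (size_t i = 0; i < ndevs && to_write > 0; i++) {
        if (devs[i].congested)
            continue; /* do not get stuck behind a full queue */
        unsigned long n = to_write < chunk ? to_write : chunk;
        devs[i].queued += n;
        to_write -= n;
    }
    return to_write;
}
```

A caller would repeat the pass, sleeping in between, until the return value reaches zero or the dirty-memory condition clears.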
So the question is: how do we "take a nap"? That's blk_congestion_wait().
The process goes to sleep in there and gets woken up when someone thinks
that a queue might be able to take some more writeout.
A caller into blk_congestion_wait() is _supposed_ to be woken by writeback
completion. If the timeout actually expires, something isn't right. If we
had all the new locking in place and correct, the timeout wouldn't actually
be needed. In theory, the timeout is only there as a fallback to handle
certain races for which we don't want to implement all that new locking to
fix.
It would be good if NFS were to implement a fixed-size "request queue",
so we can't fill all memory with NFS requests. Then, NFS can implement
a congestion threshold at "75% full" (via its backing_dev_info) and
everything is in place.
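A rough sketch of such a fixed-size queue with a 75% congestion threshold, in plain C. The names (req_queue, rq_submit, rq_complete, rq_congested) are hypothetical and this is not how NFS is structured; it only shows the accounting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bounded queue of in-flight write requests. */
struct req_queue {
    unsigned int capacity;  /* fixed maximum number of in-flight requests */
    unsigned int in_flight; /* requests currently queued or on the wire */
};

/* Congested once the queue is at least 75% full. */
bool rq_congested(const struct req_queue *q)
{
    assert(q->capacity > 0);
    return 4u * q->in_flight >= 3u * q->capacity;
}

/* Try to add a request; fails when the queue is full, which bounds memory use. */
bool rq_submit(struct req_queue *q)
{
    if (q->in_flight >= q->capacity)
        return false;
    q->in_flight++;
    return true;
}

/*
 * Complete one request. Returns true if this completion took the queue from
 * congested to uncongested, which is when a sleeping writer is worth waking.
 */
bool rq_complete(struct req_queue *q)
{
    assert(q->in_flight > 0);
    bool was_congested = rq_congested(q);
    q->in_flight--;
    return was_congested && !rq_congested(q);
}
```

With a capacity of 8, the queue reports congestion from the sixth in-flight request onward and refuses a ninth.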
As a halfway step it might provide benefit for NFS to poke the
congestion_wqh[] every quarter megabyte or so, to kick any processes out
of their sleep so they go back to poll all the superblocks again,
earlier than they otherwise would have. It might not make any
difference - one would need to get in there and understand the dynamic
behaviour.
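The halfway step amounts to a byte counter that fires once per 256KB of completed writeback. A small sketch, with hypothetical names (poke_state, writeback_done) and with the actual wake-up left to the caller:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define POKE_INTERVAL ((size_t)256 * 1024) /* "every quarter megabyte or so" */

/* Hypothetical counter of bytes completed since the last wake-up. */
struct poke_state {
    size_t since_poke;
};

/*
 * Account 'bytes' of completed writeback. Returns true when at least
 * POKE_INTERVAL bytes have completed since the last wake-up, so the caller
 * should wake any sleepers; the remainder is carried forward.
 */
bool writeback_done(struct poke_state *s, size_t bytes)
{
    assert(s != NULL);
    s->since_poke += bytes;
    if (s->since_poke < POKE_INTERVAL)
        return false;
    s->since_poke %= POKE_INTERVAL;
    return true;
}
```

Assuming 4096-byte pages, this turns one wake-up per page into one wake-up per 64 pages.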
On Fri, 18 Aug 2006 10:11:02 +1000
David Chinner <[email protected]> wrote:
>
> > Something like that covers the global dirty+writeback problem. The other
> > major problem space is the multiple-backing-device problem:
> >
> > a) One device is being written to heavily, another lightly
> >
> > b) One device is fast, another is slow.
>
> Once we are past the throttling threshold, the only thing that
> matters is whether we can write more data to the backing device(s).
> We should not really be allowing the input rate to exceed the output
> rate once we are past the throttle threshold.
True.
But it seems really sad to block some process which is doing a really small
dirtying (say, some dopey atime update) just because some other process is
doing a huge write.
Now, things _usually_ work out all right, if only because of
balance_dirty_pages_ratelimited()'s logic. But it's more by happenstance
than by intent, and these sorts of interferences can happen.
> > To solve this properly we'd need to account for
> > dirty+writeback(+unstable?) pages on a per-backing-dev basis.
>
> We'd still need to account for them globally because we still need
> to be able to globally limit the amount of dirty data in the
> machine.
>
> FYI, I implemented a complex two-stage throttle on Irix a couple of
> years ago - it uses a per-device soft throttle threshold that is not
> enforced until the global dirty state passes a configurable limit.
> At that point, the per-device limits are enforced.
>
> This meant that devices with no dirty state attached to them could
> continue to dirty pages up to their soft-threshold, whereas heavy
> writers would be stopped until their backing devices fell back below
> the soft thresholds.
>
> Because the amount of dirty pages could continue to grow past safe
> limits if you had enough devices, there is also a global hard limit
> that cannot be exceeded and this throttles all incoming write
> requests regardless of the state of the device it was being written
> to.
>
> The problem with this approach is that the code was complex and
> difficult to test properly. Also, working out the default config
> values was an exercise in trial, error, workload measurement and
> guesswork that took some time to get right.
>
> The current linux code works as well as that two-stage throttle
> (better in some cases!) because of one main thing - bound request
> queue depth with feedback into the throttling control loop. Irix
> has neither of these so the throttle had to provide this accounting
> and limiting (soft throttle threshold).
>
> Hence I'm not sure that per-backing-device accounting and making
> decisions based on that accounting is really going to buy us much
> apart from additional complexity....
>
hm, interesting.
It seems that the many-writers-to-different-disks workloads don't happen
very often. We know this because
a) The 2.4 performance is utterly awful, and I never saw anybody
complain and
b) 2.6 has the risk of filling all memory with under-writeback pages,
and nobody has complained about that either (iirc).
Relying on that observation and the request-queue limits has got us this
far but yeah, we should plug that PageWriteback windup scenario.
btw, Neil, has the Pagewriteback windup actually been demonstrated? If so,
how?
On Thu, Aug 17 2006, Andrew Morton wrote:
> It seems that the many-writers-to-different-disks workloads don't happen
> very often. We know this because
>
> a) The 2.4 performance is utterly awful, and I never saw anybody
> complain and
Talk to some of the people that used DVD-RAM devices (or other
excruciatingly slow writers) on their system, and they would disagree
violently :-)
It's been discussed here on lkml many times in the past, but that's
years behind us now. Thankfully your pdflush work got rid of that
embarrassment. But it definitely does matter, to real ordinary users.
--
Jens Axboe
On Thursday August 17, [email protected] wrote:
>
> btw, Neil, has the Pagewriteback windup actually been demonstrated? If so,
> how?
Yes.
On large machines (e.g. 16G) just writing to large files (I think. I
don't have precise details of the application, but I think in one case
it was just iozone). By "large files" I mean larger than memory.
This has happened on both SLES9 (2.6.5 based) and SLES10 (2.6.16
based). We do have an extra patch in balance_dirty_pages which I
haven't tracked down the reason for yet. It has the effect of
breaking out of the loop once nr_dirty hits 0, which makes the problem
hard to recover from. It may even be making it occur more quickly -
I'm not sure.
What we see is Pagewriteback at about 10G out of 16G, and Dirty at 0.
The whole machine pretty much slows to a halt. There is little free
memory so lots of processes end up in 'reclaim' walking the inactive
list looking for pages to free up. Most of what they find are in
Writeback and so they just skip over them. Skipping 2.6 million pages
seems to take a little while.
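The 2.6 million figure is consistent with about 10G of Writeback if pages are 4096 bytes, which is an assumption here since the page size is not stated. As a quick check (pages_for is a hypothetical helper):

```c
#include <assert.h>

/* Number of whole pages in 'bytes', for a given page size in bytes. */
unsigned long long pages_for(unsigned long long bytes,
                             unsigned long long page_size)
{
    assert(page_size > 0);
    return bytes / page_size;
}
```

For 10 GiB and 4096-byte pages this gives 2621440 pages.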
And there is a kmalloc call in the NFS writeout path (it is actually a
mempool_alloc so it will succeed, but, partly because mempool uses the
reserve last instead of first, it always looks for free memory first).
So Pagewriteback is at 60%, memory is tight, nfs write is progressing
very slowly and (because of our SuSE specific patch)
balance_dirty_pages isn't throttling anymore so as soon as nfs does
manage to write out a page another appears to replace it. I suspect
it is making forward progress, but not very much.
We have a fairly hackish patch in place to limit the NFS writeback on a
per-file basis (sysctl tunable) but I am trying to understand the
real problem so that a real solution can be found.
NeilBrown
On Fri, 18 Aug 2006 09:03:15 +0200
Jens Axboe <[email protected]> wrote:
> On Thu, Aug 17 2006, Andrew Morton wrote:
> > It seems that the many-writers-to-different-disks workloads don't happen
> > very often. We know this because
> >
> > a) The 2.4 performance is utterly awful, and I never saw anybody
> > complain and
>
> Talk to some of the people that used DVD-RAM devices (or other
> excruciatingly slow writers) on their system, and they would disagree
> violently :-)
umm, OK, I guess that has the same cause: buffer_heads from different
devices all on the same single queue. In this case the problem is that one
device is slow. In the same-speed-devices case the problem is that all
writeback threads get stuck on the same device, allowing others to go idle.
Andrew Morton writes:
[...]
>
> The way this code all works is pretty crude and simple: a process comes
> in to do some writeback and it enters a polling loop:
>
> 	while (we need to do writeback) {
> 		for (each superblock) {
> 			if (the superblock's backing_dev isn't congested) {
> 				stuff some more IO down it()
> 			}
> 		}
> 		take_a_nap();
> 	}
>
> so the process remains captured in that polling loop until the
> dirty-memory-exceed condition subsides. The reason why we avoid
Hm... wbc->nr_to_write is checked all the way down
(balance_dirty_pages(), writeback_inodes(), sync_sb_inodes(),
mpage_writepages()), so "occasional writer" cannot be stuck for more
than 32 + 16 pages, it seems.
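One reading of the "32 + 16" figure is the write chunk mentioned at the start of the thread: 1.5 times a batch of recently dirtied pages, with the batch assumed to be 32 pages. That interpretation and the names below are assumptions, not something stated in the message itself.

```c
#include <assert.h>

/* Assumed batch size in pages, matching the "32" in the figure above. */
#define RATELIMIT_PAGES 32L

/* 1.5 times the batch, in integer arithmetic: the batch plus half of it. */
long write_chunk(long ratelimit_pages)
{
    assert(ratelimit_pages >= 0);
    return ratelimit_pages + ratelimit_pages / 2;
}
```

Under that assumption, write_chunk(RATELIMIT_PAGES) is 48 pages, so an occasional writer is asked to push out at most 48 pages before it can leave the loop.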
Nikita.
Jens Axboe <[email protected]> writes:
> On Thu, Aug 17 2006, Andrew Morton wrote:
> > It seems that the many-writers-to-different-disks workloads don't happen
> > very often. We know this because
> >
> > a) The 2.4 performance is utterly awful, and I never saw anybody
> > complain and
>
> Talk to some of the people that used DVD-RAM devices (or other
> excruciatingly slow writers) on their system, and they would disagree
> violently :-)
I hit this recently while doing backups to a slow external USB disk.
The system was quite unusable (some commands blocked for over a minute).
-Andi