Still untested, but posting the concept here anyway, since this
could explain a lot...
OOM kills have been observed with 70% of the pages in lowmem being
in the writeback state. If we count those pages in sc->nr_scanned,
the VM should throttle and wait for IO completion, instead of OOM
killing.
Signed-off-by: Rik van Riel <[email protected]>
--- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500
+++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500
@@ -376,10 +376,10 @@
BUG_ON(PageActive(page));
+ sc->nr_scanned++;
if (PageWriteback(page))
goto keep_locked;
- sc->nr_scanned++;
/* Double the slab pressure for mapped and swapcache pages */
if (page_mapped(page) || PageSwapCache(page))
sc->nr_scanned++;
Rik van Riel <[email protected]> wrote:
>
> Still untested, but posting the concept here anyway, since this
> could explain a lot...
>
> OOM kills have been observed with 70% of the pages in lowmem being
> in the writeback state. If we count those pages in sc->nr_scanned,
> the VM should throttle and wait for IO completion, instead of OOM
> killing.
>
> Signed-off-by: Rik van Riel <[email protected]>
>
> --- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500
> +++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500
> @@ -376,10 +376,10 @@
>
> BUG_ON(PageActive(page));
>
> + sc->nr_scanned++;
> if (PageWriteback(page))
> goto keep_locked;
>
> - sc->nr_scanned++;
Patch looks very sane. It in fact restores that which we were doing until
12 June 2004, when the rampant `struct scan_control' depredations violated
the tree.
On Wed, Jan 05, 2005 at 02:08:59AM -0800, Andrew Morton wrote:
> Rik van Riel <[email protected]> wrote:
> >
> > Still untested, but posting the concept here anyway, since this
> > could explain a lot...
> >
> > OOM kills have been observed with 70% of the pages in lowmem being
> > in the writeback state. If we count those pages in sc->nr_scanned,
> > the VM should throttle and wait for IO completion, instead of OOM
> > killing.
> >
> > Signed-off-by: Rik van Riel <[email protected]>
> >
> > --- linux-2.6.9/mm/vmscan.c.screclaim 2005-01-03 12:17:56.547148905 -0500
> > +++ linux-2.6.9/mm/vmscan.c 2005-01-03 12:18:16.855965416 -0500
> > @@ -376,10 +376,10 @@
> >
> > BUG_ON(PageActive(page));
> >
> > + sc->nr_scanned++;
> > if (PageWriteback(page))
> > goto keep_locked;
> >
> > - sc->nr_scanned++;
>
> Patch looks very sane. It in fact restores that which we were doing until
> 12 June 2004, when the rampant `struct scan_control' depredations violated
> the tree.
Agreed.
Another unrelated problem I have in this same area and that can explain
VM troubles at least theoretically, is that blk_congestion_wait is
broken by design. First we cannot wait on random I/O not related to
write back. Second blk_congestion_wait gets trivially fooled by
direct-io for example. Plus the timeout may cause it to return too early
with slow blkdev.
blk_congestion_wait is a fundamental piece to get oom detection right
during writeback and unfortunately it's fundamentally fragile in 2.6
(this as usual wasn't the case in 2.4).
On Wed, 5 Jan 2005, Andrea Arcangeli wrote:
> Another unrelated problem I have in this same area and that can explain
> VM troubles at least theoretically, is that blk_congestion_wait is
> broken by design. First we cannot wait on random I/O not related to
> write back. Second blk_congestion_wait gets trivially fooled by
> direct-io for example. Plus the timeout may cause it to return too early
> with slow blkdev.
Or the IO that just finished, finished for pages in
another memory zone, or pages we won't scan again in
our current go-around through the VM...
> blk_congestion_wait is a fundamental piece to get oom detection right
> during writeback and unfortunately it's fundamentally fragile in 2.6
> (this as usual wasn't the case in 2.4).
Indeed ;(
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
On Wed, Jan 05, 2005 at 01:50:51PM -0500, Rik van Riel wrote:
> On Wed, 5 Jan 2005, Andrea Arcangeli wrote:
>
> >Another unrelated problem I have in this same area and that can explain
> >VM troubles at least theoretically, is that blk_congestion_wait is
> >broken by design. First we cannot wait on random I/O not related to
> >write back. Second blk_congestion_wait gets trivially fooled by
> >direct-io for example. Plus the timeout may cause it to return too early
> >with slow blkdev.
>
> Or the IO that just finished, finished for pages in
> another memory zone, or pages we won't scan again in
> our current go-around through the VM...
Thing is there is no distinction between pages which have been written out
for what purpose at the block level.
One can conjecture the following: per-zone waitqueue to be awakened from
end_page_writeback() (for PG_reclaim pages only of course), and a function
to wait on the perzone waitqueue:
wait_vm_writeback (zone, timeout);
Instead of the current blk_congestion_wait() on try_to_free_pages/balance_pgdat.
It will probably slow the reclaiming procedure in general, though, and has
other side effects, but its the only way of strictly following VM writeback
progress from VM reclaiming routines.
Does it make any sense?
> >blk_congestion_wait is a fundamental piece to get oom detection right
> >during writeback and unfortunately it's fundamentally fragile in 2.6
> >(this as usual wasn't the case in 2.4).
>
> Indeed ;(
Marcelo Tosatti <[email protected]> wrote:
>
> On Wed, Jan 05, 2005 at 01:50:51PM -0500, Rik van Riel wrote:
> > On Wed, 5 Jan 2005, Andrea Arcangeli wrote:
> >
> > >Another unrelated problem I have in this same area and that can explain
> > >VM troubles at least theoretically, is that blk_congestion_wait is
> > >broken by design. First we cannot wait on random I/O not related to
> > >write back. Second blk_congestion_wait gets trivially fooled by
> > >direct-io for example. Plus the timeout may cause it to return too early
> > >with slow blkdev.
That's true, as we discussed a couple of months back. But the current code
is nice and simple and has been there for a couple of years with no
observed problems.
> > Or the IO that just finished, finished for pages in
> > another memory zone, or pages we won't scan again in
> > our current go-around through the VM...
>
> Thing is there is no distinction between pages which have been written out
> for what purpose at the block level.
>
> One can conjecture the following: per-zone waitqueue to be awakened from
> end_page_writeback() (for PG_reclaim pages only of course), and a function
> to wait on the perzone waitqueue:
>
> wait_vm_writeback (zone, timeout);
>
> Instead of the current blk_congestion_wait() on try_to_free_pages/balance_pgdat.
>
The caller would need to wait on all the zones which can satisfy the
caller's allocation request. A bit messy, although not rocket science.
One would have to be careful to avoid additional CPU consumption due to
delivery of multiple wakeups at each I/O completion.
We should be able to demonstrate that such a change really fixes some
problem though. Otherwise, why bother?
> The caller would need to wait on all the zones which can satisfy the
> caller's allocation request. A bit messy, although not rocket science.
> One would have to be careful to avoid additional CPU consumption due to
> delivery of multiple wakeups at each I/O completion.
>
> We should be able to demonstrate that such a change really fixes some
> problem though. Otherwise, why bother?
Agreed. The current scheme works well enough, we dont have spurious OOM kills
anymore, which is the only "problem" such change ought to fix.
You might have performance increase in some situations I believe (because you
have perzone waitqueues), but I agree its does not seem to be worth the
trouble.
Rik van Riel <[email protected]> wrote:
>
> OOM kills have been observed with 70% of the pages in lowmem being
> in the writeback state. If we count those pages in sc->nr_scanned,
> the VM should throttle and wait for IO completion, instead of OOM
> killing.
I'll queue this up:
From: Rik van Riel <[email protected]>
OOM kills have been observed with 70% of the pages in lowmem being in the
writeback state. If we count those pages in sc->nr_scanned, the VM should
throttle and wait for IO completion, instead of OOM killing.
(akpm: this is how the code was designed to work - we broke it six months
ago).
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
25-akpm/mm/vmscan.c | 6 +++---
1 files changed, 3 insertions(+), 3 deletions(-)
diff -puN mm/vmscan.c~vmscan-count-writeback-pages-in-nr_scanned mm/vmscan.c
--- 25/mm/vmscan.c~vmscan-count-writeback-pages-in-nr_scanned 2005-01-05 15:24:53.730874336 -0800
+++ 25-akpm/mm/vmscan.c 2005-01-05 15:25:48.286580608 -0800
@@ -369,14 +369,14 @@ static int shrink_list(struct list_head
BUG_ON(PageActive(page));
- if (PageWriteback(page))
- goto keep_locked;
-
sc->nr_scanned++;
/* Double the slab pressure for mapped and swapcache pages */
if (page_mapped(page) || PageSwapCache(page))
sc->nr_scanned++;
+ if (PageWriteback(page))
+ goto keep_locked;
+
referenced = page_referenced(page, 1, sc->priority <= 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
_
Marcelo Tosatti wrote:
>>The caller would need to wait on all the zones which can satisfy the
>>caller's allocation request. A bit messy, although not rocket science.
>>One would have to be careful to avoid additional CPU consumption due to
>>delivery of multiple wakeups at each I/O completion.
>>
>>We should be able to demonstrate that such a change really fixes some
>>problem though. Otherwise, why bother?
>
>
> Agreed. The current scheme works well enough, we dont have spurious OOM kills
> anymore, which is the only "problem" such change ought to fix.
>
> You might have performance increase in some situations I believe (because you
> have perzone waitqueues), but I agree its does not seem to be worth the
> trouble.
I think what Andrea is worried about is that blk_congestion_wait is
fairly vague, and can be a source of instability in the scanning
implementation.
For example, if you have a heavy IO workload that is saturating your
disks, blk_congestion_wait may do the right thing and sleep until
they become uncongested and writeout can continue.
But at 2:00 am, when your backup job is trickling writes into another
block device, blk_congestion_wait returns much earlier, and before
many pages have been cleaned.
Bad example? Yeah maybe, but I think this is what Andrea is getting
at. Would it be a problem to replace those blk_congestion_waits with
unconditional io_schedule_timeout()s? That would be the dumb-but-more
-deterministic solution.
On Thu, 6 Jan 2005, Nick Piggin wrote:
> I think what Andrea is worried about is that blk_congestion_wait is
> fairly vague, and can be a source of instability in the scanning
> implementation.
The recent OOM kill problem has been happening:
1) with cache pressure on lowmem only, due to a block device write
2) with no block congestion at all
3) with pretty much all pageable lowmme pages in writeback state
It appears the VM has trouble dealing with the situation where
there is no block congestion to wait on...
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel wrote:
> On Thu, 6 Jan 2005, Nick Piggin wrote:
>
>> I think what Andrea is worried about is that blk_congestion_wait is
>> fairly vague, and can be a source of instability in the scanning
>> implementation.
>
>
> The recent OOM kill problem has been happening:
> 1) with cache pressure on lowmem only, due to a block device write
> 2) with no block congestion at all
> 3) with pretty much all pageable lowmme pages in writeback state
>
> It appears the VM has trouble dealing with the situation where
> there is no block congestion to wait on...
>
Try, together with your nr_scanned patch, to replace blk_congestion_wait
with io_schedule_timeout.
Rik van Riel <[email protected]> wrote:
>
> On Thu, 6 Jan 2005, Nick Piggin wrote:
>
> > I think what Andrea is worried about is that blk_congestion_wait is
> > fairly vague, and can be a source of instability in the scanning
> > implementation.
>
> The recent OOM kill problem has been happening:
> 1) with cache pressure on lowmem only, due to a block device write
> 2) with no block congestion at all
> 3) with pretty much all pageable lowmme pages in writeback state
You must have a wild number of requests configured in the queue. Is this
CFQ?
I've done testing with "all of memory under writeback" before and it went
OK. It's certainly a design objective to handle this well. But that
testing was before we broke it.
> It appears the VM has trouble dealing with the situation where
> there is no block congestion to wait on...
It's misnamed. We don't "wait for the queue to come out of congestion".
We simply throttle until a write completes, or, rarely, timeout.
The bug which you fixed would cause the VM to scan itself to death without
throttling. Did the fix fix things?
Nick Piggin <[email protected]> wrote:
>
> Rik van Riel wrote:
> > On Thu, 6 Jan 2005, Nick Piggin wrote:
> >
> >> I think what Andrea is worried about is that blk_congestion_wait is
> >> fairly vague, and can be a source of instability in the scanning
> >> implementation.
> >
> >
> > The recent OOM kill problem has been happening:
> > 1) with cache pressure on lowmem only, due to a block device write
> > 2) with no block congestion at all
> > 3) with pretty much all pageable lowmme pages in writeback state
> >
> > It appears the VM has trouble dealing with the situation where
> > there is no block congestion to wait on...
> >
>
> Try, together with your nr_scanned patch, to replace blk_congestion_wait
> with io_schedule_timeout.
Why? Is the nr_scanned fix insufficient?
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>Rik van Riel wrote:
>>
>>>On Thu, 6 Jan 2005, Nick Piggin wrote:
>>>
>>>
>>>>I think what Andrea is worried about is that blk_congestion_wait is
>>>>fairly vague, and can be a source of instability in the scanning
>>>>implementation.
>>>
>>>
>>>The recent OOM kill problem has been happening:
>>>1) with cache pressure on lowmem only, due to a block device write
>>>2) with no block congestion at all
>>>3) with pretty much all pageable lowmme pages in writeback state
>>>
>>>It appears the VM has trouble dealing with the situation where
>>>there is no block congestion to wait on...
>>>
>>
>>Try, together with your nr_scanned patch, to replace blk_congestion_wait
>>with io_schedule_timeout.
>
>
> Why? Is the nr_scanned fix insufficient?
>
I thought it sounded like he implied that nr_scanned was insufficient
(otherwise he might have said "to wait on ... but my patch fixes it").
On Thu, Jan 06, 2005 at 12:40:47PM +1100, Nick Piggin wrote:
> I thought it sounded like he implied that nr_scanned was insufficient
> (otherwise he might have said "to wait on ... but my patch fixes it").
BTW, from my part I can't reproduce it (even without the nr_scanned,
that I'm going to apply too anyway just in case), but my hardware may
have different timings dunno.
All I could reproduce were the swap-token issues (I had a flood of
reports about that too, all fixed by now with my 1-6/4 patches). The
other patches agreed here looks good but I don't have actual pending
bugs on that side and I can't reproduce problems either in basic
testing, so they're applied now for correctness.
Thanks.
On Wed, 5 Jan 2005, Andrew Morton wrote:
> Rik van Riel <[email protected]> wrote:
>> The recent OOM kill problem has been happening:
>> 1) with cache pressure on lowmem only, due to a block device write
>> 2) with no block congestion at all
>> 3) with pretty much all pageable lowmme pages in writeback state
>
> You must have a wild number of requests configured in the queue. Is
> this CFQ?
Yes, it is with CFQ. Around 650MB of lowmem is in writeback
stage, which is over 99% of the active and inactive lowmem
pages...
> I've done testing with "all of memory under writeback" before and it
> went OK. It's certainly a design objective to handle this well. But
> that testing was before we broke it.
I suspect something might still be broken. It may take a few
days of continuous testing to trigger the bug, though ...
> The bug which you fixed would cause the VM to scan itself to death
> without throttling. Did the fix fix things?
Mostly fixed. Things are down to the point where it often
takes over a day to bring out the bug.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan
Rik van Riel wrote:
> On Wed, 5 Jan 2005, Andrew Morton wrote:
>
>> Rik van Riel <[email protected]> wrote:
>
>
>>> The recent OOM kill problem has been happening:
>>> 1) with cache pressure on lowmem only, due to a block device write
>>> 2) with no block congestion at all
>>> 3) with pretty much all pageable lowmme pages in writeback state
>>
>>
>> You must have a wild number of requests configured in the queue. Is
>> this CFQ?
>
>
> Yes, it is with CFQ. Around 650MB of lowmem is in writeback
> stage, which is over 99% of the active and inactive lowmem
> pages...
>
>> I've done testing with "all of memory under writeback" before and it
>> went OK. It's certainly a design objective to handle this well. But
>> that testing was before we broke it.
>
>
> I suspect something might still be broken. It may take a few
> days of continuous testing to trigger the bug, though ...
>
It is possible to be those blk_congestion_wait paths, because
the queue simply won't be congested. So doing io_schedule_timeout
might help.
I wonder if reducing the size of the write queue in CFQ would help
too? IIRC, it only really wants a huge read queue.
Nick Piggin <[email protected]> wrote:
>
> >
> > I suspect something might still be broken. It may take a few
> > days of continuous testing to trigger the bug, though ...
> >
>
> It is possible to be those blk_congestion_wait paths, because
> the queue simply won't be congested. So doing io_schedule_timeout
> might help.
If the queue is not congested, blk_congestion_wait() will still sleep. See
freed_request().
> I wonder if reducing the size of the write queue in CFQ would help
> too? IIRC, it only really wants a huge read queue.
Surely it will help - but we need to be able to handle the situation
because memory can still become full of PageWriteback pages if there are
many disks.
Andrew Morton wrote:
> Nick Piggin <[email protected]> wrote:
>
>>>I suspect something might still be broken. It may take a few
>>>days of continuous testing to trigger the bug, though ...
>>>
>>
>>It is possible to be those blk_congestion_wait paths, because
>>the queue simply won't be congested. So doing io_schedule_timeout
>>might help.
>
>
> If the queue is not congested, blk_congestion_wait() will still sleep. See
> freed_request().
>
Hmm... doesn't look like it to me:
if (rl->count[rw] < queue_congestion_off_threshold(q))
clear_queue_congested(q, rw);
And clear_queue_congested does an unconditional wakeup (if there
is someone sleeping on the congestion queue).
>
>>I wonder if reducing the size of the write queue in CFQ would help
>>too? IIRC, it only really wants a huge read queue.
>
>
> Surely it will help - but we need to be able to handle the situation
> because memory can still become full of PageWriteback pages if there are
> many disks.
>
Yep.
Nick Piggin <[email protected]> wrote:
>
> > If the queue is not congested, blk_congestion_wait() will still sleep. See
> > freed_request().
> >
>
> Hmm... doesn't look like it to me:
>
> if (rl->count[rw] < queue_congestion_off_threshold(q))
> clear_queue_congested(q, rw);
>
> And clear_queue_congested does an unconditional wakeup (if there
> is someone sleeping on the congestion queue).
That's my point. blk_congestion_wait() will always sleep, regardless of
the queue's congestion state.
This memory barrier is not needed because the waitqueue will only get
waiters on it in the following situations:
rq->count has exceeded the threshold - however all manipulations of ->count
are performed under the runqueue lock, and so we will correctly pick up any
waiter.
Memory allocation for the request fails. In this case, there is no additional
help provided by the memory barrier. We are guaranteed to eventually wake
up waiters because the request allocation mempool guarantees that if the mem
allocation for a request fails, there must be some requests in flight. They
will wake up waiters when they are retired.
---
linux-2.6-npiggin/drivers/block/ll_rw_blk.c | 1 -
1 files changed, 1 deletion(-)
diff -puN drivers/block/ll_rw_blk.c~blk-no-mb drivers/block/ll_rw_blk.c
--- linux-2.6/drivers/block/ll_rw_blk.c~blk-no-mb 2005-01-06 15:37:05.000000000 +1100
+++ linux-2.6-npiggin/drivers/block/ll_rw_blk.c 2005-01-06 15:37:12.000000000 +1100
@@ -1630,7 +1630,6 @@ static void freed_request(request_queue_
if (rl->count[rw] < queue_congestion_off_threshold(q))
clear_queue_congested(q, rw);
if (rl->count[rw]+1 <= q->nr_requests) {
- smp_mb();
if (waitqueue_active(&rl->wait[rw]))
wake_up(&rl->wait[rw]);
blk_clear_queue_full(q, rw);
_
On Wed, Jan 05, 2005 at 08:47:06PM -0800, Andrew Morton wrote:
> That's my point. blk_congestion_wait() will always sleep, regardless of
Since I've no pending bugs at all with the mainline codebase I rate this
just a theoretical issue: but even sleeping for a fixed amount of time
is unreliable there, for example if the storage is very slow. That's why
using io_schedule_timeout for that isn't going to be a fix.
The fix is very simple and it is to call wait_on_page_writeback on one
of the pages under writeback. That guarantees some _writeback_ progress
has been made before retrying. That way some random direct-io or a too
short timeout, can't cause malfunction.
On Thu, Jan 06, 2005 at 03:55:34PM +1100, Nick Piggin wrote:
> However, if you had a plain io_schedule_timeout there, at least you
> would sleep for the full extend of the specified timeout.
I agree it sure would be safer but OTOH it may screwup performance by
waiting for unnecessary long times on fast stroage.
So it's ok for a test, but still it wouldn't be a final fix since the
timeout may be still too short in some case.
Waiting on one (or more) PG_writeback bitflags to go away should fix it
completely. This is how 2.4 throttles on the writeback I/O too of
course. How we choose which random page to pick may vary though.
Andrea Arcangeli <[email protected]> wrote:
>
> The fix is very simple and it is to call wait_on_page_writeback on one
> of the pages under writeback.
eek, no. That was causing waits of five seconds or more. Fixing this
caused the single greatest improvement in page allocator latency in early
2.5. We're totally at the mercy of the elevator algorithm this way.
If we're to improve things in there we want to wait on _any_ eligible page
becoming reclaimable, not on a particular page.
Andrea Arcangeli wrote:
> On Wed, Jan 05, 2005 at 08:47:06PM -0800, Andrew Morton wrote:
>
>>That's my point. blk_congestion_wait() will always sleep, regardless of
>
>
> Since I've no pending bugs at all with the mainline codebase I rate this
> just a theoretical issue: but even sleeping for a fixed amount of time
> is unreliable there, for example if the storage is very slow. That's why
> using io_schedule_timeout for that isn't going to be a fix.
>
> The fix is very simple and it is to call wait_on_page_writeback on one
> of the pages under writeback. That guarantees some _writeback_ progress
> has been made before retrying. That way some random direct-io or a too
> short timeout, can't cause malfunction.
>
Yeah I guess that isn't a bad idea.
Doesn't that have the theoretical problem of slowing down reclaim
of dirty pages backed by fast devices when you have slow devices
writing?
But presumably you'd usually have pdflush working away anyway.
On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > The fix is very simple and it is to call wait_on_page_writeback on one
> > of the pages under writeback.
>
> eek, no. That was causing waits of five seconds or more. Fixing this
> caused the single greatest improvement in page allocator latency in early
> 2.5. We're totally at the mercy of the elevator algorithm this way.
>
> If we're to improve things in there we want to wait on _any_ eligible page
> becoming reclaimable, not on a particular page.
I told you one way to fix it. I didn't guarantee it was the most
efficient one.
I sure agree waiting on any page to complete writeback is going to fix
it too. Exactly because this page was a "random" page anyway.
Still my point is that this is a bug, and I prefer to be slow and safe
like 2.4, than fast and unreliable like 2.6.
The slight improvement you suggested of waiting on _any_ random
PG_writeback to go away (instead of one particular one as I did in 2.4)
is going to fix the write throttling equally too as well as the 2.4
logic, but without introducing slowdown that 2.4 had.
It's easy to demonstrate: exactly because the page we pick is random
anyway, we can pick the first random one that has seen PG_writeback
transitioning from 1 to 0. The guarantee we get is the same in terms of
safety of the write throttling, but we also guarantee the best possible
latency this way. And the HZ/x hacks to avoid deadlocks will magically
go away too.
Andrea Arcangeli wrote:
> On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote:
>
>>Andrea Arcangeli <[email protected]> wrote:
>>
>>>The fix is very simple and it is to call wait_on_page_writeback on one
>>> of the pages under writeback.
>>
>>eek, no. That was causing waits of five seconds or more. Fixing this
>>caused the single greatest improvement in page allocator latency in early
>>2.5. We're totally at the mercy of the elevator algorithm this way.
>>
>>If we're to improve things in there we want to wait on _any_ eligible page
>>becoming reclaimable, not on a particular page.
>
>
> I told you one way to fix it. I didn't guarantee it was the most
> efficient one.
>
> I sure agree waiting on any page to complete writeback is going to fix
> it too. Exactly because this page was a "random" page anyway.
>
> Still my point is that this is a bug, and I prefer to be slow and safe
> like 2.4, than fast and unreliable like 2.6.
>
> The slight improvement you suggested of waiting on _any_ random
> PG_writeback to go away (instead of one particular one as I did in 2.4)
> is going to fix the write throttling equally too as well as the 2.4
> logic, but without introducing slowdown that 2.4 had.
>
> It's easy to demonstrate: exactly because the page we pick is random
> anyway, we can pick the first random one that has seen PG_writeback
> transitioning from 1 to 0. The guarantee we get is the same in terms of
> safety of the write throttling, but we also guarantee the best possible
> latency this way. And the HZ/x hacks to avoid deadlocks will magically
> go away too.
>
This is practically what blk_congestion_wait does when the queue
isn't congested though, isn't it?
On Thu, Jan 06, 2005 at 04:06:34PM +1100, Nick Piggin wrote:
> Yeah I guess that isn't a bad idea.
>
> Doesn't that have the theoretical problem of slowing down reclaim
> of dirty pages backed by fast devices when you have slow devices
> writing?
Andrew just suggested to wait on any random page. That's a minor
modification to that logic. Effectively it's stupid to wait on a
specific random page, when we can wait wait on _any_ random page to
complete the writeback I/O.
With locking done right by setting the state to uninterruptible and
then registering in the new waitqueue when we know at least one page has
PG_writeback set, we can then drop the HZ anti-deadlock hacks too.
On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote:
> This is practically what blk_congestion_wait does when the queue
> isn't congested though, isn't it?
The fundamental difference that makes it reliable is that:
1) only the I/O we're throttling against will be considered for the
wakeup event, which means only clearing PG_writeback will be
considered eligible for wakeup
Currently _all_ unrelated write I/O was considered eligible
for wakeup events and that could cause spurious oom kills.
2) we won't need unreliable anti-deadlock timeouts anymore
Nick Piggin <[email protected]> wrote:
>
> Andrea Arcangeli wrote:
> > On Wed, Jan 05, 2005 at 09:05:39PM -0800, Andrew Morton wrote:
> >
> >>Andrea Arcangeli <[email protected]> wrote:
> >>
> >>>The fix is very simple and it is to call wait_on_page_writeback on one
> >>> of the pages under writeback.
> >>
> >>eek, no. That was causing waits of five seconds or more. Fixing this
> >>caused the single greatest improvement in page allocator latency in early
> >>2.5. We're totally at the mercy of the elevator algorithm this way.
> >>
> >>If we're to improve things in there we want to wait on _any_ eligible page
> >>becoming reclaimable, not on a particular page.
> >
> >
> > I told you one way to fix it. I didn't guarantee it was the most
> > efficient one.
> >
And I've already described the efficient way to "fix" it. Twice.
> > I sure agree waiting on any page to complete writeback is going to fix
> > it too. Exactly because this page was a "random" page anyway.
> >
> > Still my point is that this is a bug, and I prefer to be slow and safe
> > like 2.4, than fast and unreliable like 2.6.
> >
> > The slight improvement you suggested of waiting on _any_ random
> > PG_writeback to go away (instead of one particular one as I did in 2.4)
It's a HUGE improvement.
"Example: with `mem=512m', running 4 instances of `dbench 100', 2.5.34
took 35 minutes to compile a kernel. With this patch, it took three
minutes, 45 seconds."
Plus this was the change which precipitated all the I/O scheduler
development, because it caused us to keep the queues full all the time and
the old I/O scheduler collapsed.
> > is going to fix the write throttling equally too as well as the 2.4
> > logic, but without introducing slowdown that 2.4 had.
> >
> > It's easy to demonstrate: exactly because the page we pick is random
> > anyway, we can pick the first random one that has seen PG_writeback
> > transitioning from 1 to 0. The guarantee we get is the same in terms of
> > safety of the write throttling, but we also guarantee the best possible
> > latency this way. And the HZ/x hacks to avoid deadlocks will magically
> > go away too.
> >
>
> This is practically what blk_congestion_wait does when the queue
> isn't congested though, isn't it?
Pretty much. Except:
- Doing a wakeup when a write request is retired corresponds to releasing
a batch of pages, not a single page. Usually.
- direct-io writes could confuse it.
For the third time: "fixing" this involves delivering a wakeup to all zones
in the page's classzone in end_page_writeback(), and passing the zone* into
blk_congestion_wait(). Only deliver the wakeup on every Nth page to get a
bit of batching and to reduce CPU consumption. Then demonstrating that the
change actually improves something.
Andrea Arcangeli wrote:
> On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote:
>
>>This is practically what blk_congestion_wait does when the queue
>>isn't congested though, isn't it?
>
>
> The fundamental difference that makes it reliable is that:
>
> 1) only the I/O we're throttling against will be considered for the
> wakeup event, which means only clearing PG_writeback will be
> considered eligible for wakeup
> Currently _all_ unrelated write I/O was considered eligible
> for wakeup events and that could cause spurious oom kills.
I'm not entirely convinced. In Rik's case it didn't matter, because
all his writeout was in the same zone that reclaim was happening
against (ZONE_NORMAL), so in that case, PG_writeback throttling
will do exactly the same thing as blk_congestion_wait.
I do like your PG_writeback throttling idea for the other reason
that it should behave better on NUMA systems with lots of zones
and lots of disks.
> 2) we won't need unreliable anti-deadlock timeouts anymore
>
Well I think you do need *something*. If you wake up each time a
single page (or request) has completed, you only complete what,
12 (DEF_PRIORITY) requests before going OOM? In the worst possible
case scenario, which looks like what Rik's running into.
Nick
Andrea Arcangeli <[email protected]> wrote:
>
> 2) we won't need unreliable anti-deadlock timeouts anymore
The timeouts are for:
a) A fallback for backing stores which don't wake up waiters in
blk_congestion_wait() (eg: NFS).
b) handling the race case where the request queue suddenly goes empty
before the sleeper gets onto the waitqueue.
It can probably be removed with some work, and additional locking.
Nick Piggin wrote:
> Andrea Arcangeli wrote:
>
>> On Thu, Jan 06, 2005 at 04:19:36PM +1100, Nick Piggin wrote:
>>
>>> This is practically what blk_congestion_wait does when the queue
>>> isn't congested though, isn't it?
>>
>>
>>
>> The fundamental difference that makes it reliable is that:
>>
>> 1) only the I/O we're throttling against will be considered for the
>> wakeup event, which means only clearing PG_writeback will be
>> considered eligible for wakeup
>> Currently _all_ unrelated write I/O was considered eligible
>> for wakeup events and that could cause spurious oom kills.
>
>
> I'm not entirely convinced. In Rik's case it didn't matter, because
> all his writeout was in the same zone that reclaim was happening
> against (ZONE_NORMAL), so in that case, PG_writeback throttling
> will do exactly the same thing as blk_congestion_wait.
>
> I do like your PG_writeback throttling idea for the other reason
> that it should behave better on NUMA systems with lots of zones
> and lots of disks.
>
... or Andrew's described fix. I think both would result in pretty
similar behaviour, but Andrew's is probably a bit nicer because it
doesn't require the scanner to have initiated the write.
On Wed, Jan 05, 2005 at 09:32:07PM -0800, Andrew Morton wrote:
> > > The slight improvement you suggested of waiting on _any_ random
> > > PG_writeback to go away (instead of one particular one as I did in 2.4)
>
> It's a HUGE improvement.
I didn't want to question the improvement in wall clock time terms.
> For the third time: "fixing" this involves delivering a wakeup to all zones
> in the page's classzone in end_page_writeback(), and passing the zone* into
> blk_congestion_wait(). Only deliver the wakeup on every Nth page to get a
> bit of batching and to reduce CPU consumption. Then demonstrating that the
> change actually improves something.
Since I cannot reproduce oom kills with writeback, I sure can't
demonstrate it on bare hardware with unmodified kernel.
But I dislike code that works by luck, and sure I could demonstrate it
if I bothered to write an artificial testcase on simulated hardware.
This is the only reason I mentioned this bug in the first place, not
because I'm reproducing it.
On Wed, Jan 05, 2005 at 09:37:04PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > 2) we won't need unreliable anti-deadlock timeouts anymore
>
> The timeouts are for:
>
> a) A fallback for backing stores which don't wake up waiters in
> blk_congestion_wait() (eg: NFS).
that anti-deadlock will be unnecessary too with the new logic.
> b) handling the race case where the request queue suddenly goes empty
> before the sleeper gets onto the waitqueue.
as I mentioned with proper locking setting task in uninterruptible and
then registering into the new per classzone waitqueue, the timeout will
be unnecessary even for this.
> It can probably be removed with some work, and additional locking.
The additional locking will then remove the current locking in
blk_congestion_wait so it's new locking but it will replace the current
locking. But I agree registering in the waitqueue inside the
blk_congestion_wait was simpler. It's just I've an hard time to like the
timeout. Timeout is always wrong when it triggers: if it triggers it
always triggers either too late (wasted resources) or too early (early
oom kills). So unless it messes everything up, it'd be nice to the
locking strict. anyway point 1 and 2 can be implemented separately, at
first we can leave the timeout if the race is too hard to handle.
Ideally if we keep the total number of oustanding writebacks
per-classzone (not sure if we account for it already somewhere, I guess
if something we've the global number and not the per-classzone one), we
could remove the timeout without having to expose the locking outside
blk_congestion_wait. With the number of oustanding writebacks
per-classzone we could truly fix the race and obsolete the timeout in a
self contained manner. Though it will require a proper amount of memory
barriers around the account increment/decrement/read.
Andrea Arcangeli <[email protected]> wrote:
>
> But I dislike code that works by luck,
yup, it is code which works by luck. And it's a bit inaccurate - there is
potential for (small) further latency improvement in there.
But we do need to watch the CPU consumption in there - end_page_writeback()
already figures fairly high sometimes.
> and sure I could demonstrate it
> if I bothered to write an artificial testcase on simulated hardware.
Could well be so.
On Wed, Jan 05, 2005 at 09:59:09PM -0800, Andrew Morton wrote:
> But we do need to watch the CPU consumption in there - end_page_writeback()
> already figures fairly high sometimes.
The check for waitqueue_active will avoid all wake_up calls if nobody is
waiting in blk_congestion_wait. Plus it won't need to wakeup on all
classzones, if we have the other side to reigster on all the zones
composing the classzone.
The wakeup is going to be more frequent than the waiter registration.
So we can register the waiter on _all_ the "interested" zones that
composes the classzone, instead of the other way around that you
suggested in the previous email.
That will reduce the wakeup to a single waitqueue_active, and it
shouldn't be too bad.
Same goes for the number of outstanding writebacks, we can collect that
number per zone and have blk_congestion_wait returning immediatly if it
went to zero on all zones composing the classzone after registering.
On Thu, Jan 06 2005, Nick Piggin wrote:
> Andrew Morton wrote:
> >Nick Piggin <[email protected]> wrote:
> >
> >>> If the queue is not congested, blk_congestion_wait() will still sleep.
> >>See
> >>> freed_request().
> >>>
> >>
> >>Hmm... doesn't look like it to me:
> >>
> >> if (rl->count[rw] < queue_congestion_off_threshold(q))
> >> clear_queue_congested(q, rw);
> >>
> >>And clear_queue_congested does an unconditional wakeup (if there
> >>is someone sleeping on the congestion queue).
> >
> >
> >That's my point. blk_congestion_wait() will always sleep, regardless of
> >the queue's congestion state.
> >
>
> Oh yes, but it will return as soon as a single request is finished.
> Which is probably a couple of milliseconds, rather than the 100 we
> had hoped for. So the allocators will wake up again and go around
> the loop and still make no progress.
>
> However, if you had a plain io_schedule_timeout there, at least you
> would sleep for the full extend of the specified timeout.
>
> BTW. Jens, now that I look at freed_request, is the memory barrier
> required? If so, what is it protecting?
>
>
>
> This memory barrier is not needed because the waitqueue will only get
> waiters on it in the following situations:
>
> rq->count has exceeded the threshold - however all manipulations of ->count
> are performed under the runqueue lock, and so we will correctly pick up any
> waiter.
>
> Memory allocation for the request fails. In this case, there is no additional
> help provided by the memory barrier. We are guaranteed to eventually wake
> up waiters because the request allocation mempool guarantees that if the mem
> allocation for a request fails, there must be some requests in flight. They
> will wake up waiters when they are retired.
Not sure I agree completely. Yes it will work, but only because it tests
<= q->nr_requests and I don't think that 'eventually' is good enough :-)
The actual waitqueue manipulation doesn't happen under the queue lock,
so the memory barrier is needed to pickup the change on SMP. So I'd like
to keep the barrier.
I'd prefer to add smp_mb() to waitqueue_active() actually!
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Jan 06 2005, Nick Piggin wrote:
>>
>>This memory barrier is not needed because the waitqueue will only get
>>waiters on it in the following situations:
>>
>>rq->count has exceeded the threshold - however all manipulations of ->count
>>are performed under the runqueue lock, and so we will correctly pick up any
>>waiter.
>>
>>Memory allocation for the request fails. In this case, there is no additional
>>help provided by the memory barrier. We are guaranteed to eventually wake
>>up waiters because the request allocation mempool guarantees that if the mem
>>allocation for a request fails, there must be some requests in flight. They
>>will wake up waiters when they are retired.
>
>
> Not sure I agree completely. Yes it will work, but only because it tests
> <= q->nr_requests and I don't think that 'eventually' is good enough :-)
>
> The actual waitqueue manipulation doesn't happen under the queue lock,
> so the memory barrier is needed to pickup the change on SMP. So I'd like
> to keep the barrier.
>
No that's right... but between the prepare_to_wait and the io_schedule,
get_request takes the lock and checks nr_requests. I think we are safe?
> I'd prefer to add smp_mb() to waitqueue_active() actually!
>
That may be a good idea (I haven't really taken much notice of how other
code uses it).
I'm not worried about any possible performance advantages of removing it,
rather just having a memory barrier without comments can be perplexing.
On Thu, Jan 06 2005, Nick Piggin wrote:
> Jens Axboe wrote:
> >On Thu, Jan 06 2005, Nick Piggin wrote:
>
> >>
> >>This memory barrier is not needed because the waitqueue will only get
> >>waiters on it in the following situations:
> >>
> >>rq->count has exceeded the threshold - however all manipulations of
> >>->count
> >>are performed under the runqueue lock, and so we will correctly pick up
> >>any
> >>waiter.
> >>
> >>Memory allocation for the request fails. In this case, there is no
> >>additional
> >>help provided by the memory barrier. We are guaranteed to eventually wake
> >>up waiters because the request allocation mempool guarantees that if the
> >>mem
> >>allocation for a request fails, there must be some requests in flight.
> >>They
> >>will wake up waiters when they are retired.
> >
> >
> >Not sure I agree completely. Yes it will work, but only because it tests
> ><= q->nr_requests and I don't think that 'eventually' is good enough :-)
> >
> >The actual waitqueue manipulation doesn't happen under the queue lock,
> >so the memory barrier is needed to pickup the change on SMP. So I'd like
> >to keep the barrier.
> >
>
> No that's right... but between the prepare_to_wait and the io_schedule,
> get_request takes the lock and checks nr_requests. I think we are safe?
It looks like it, yes you are right. But it looks to be needed a few
lines further down instead, though :-)
===== drivers/block/ll_rw_blk.c 1.281 vs edited =====
--- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00
+++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00
@@ -1630,11 +1630,11 @@
if (rl->count[rw] < queue_congestion_off_threshold(q))
clear_queue_congested(q, rw);
if (rl->count[rw]+1 <= q->nr_requests) {
- smp_mb();
if (waitqueue_active(&rl->wait[rw]))
wake_up(&rl->wait[rw]);
blk_clear_queue_full(q, rw);
}
+ smp_mb();
if (unlikely(waitqueue_active(&rl->drain)) &&
!rl->count[READ] && !rl->count[WRITE])
wake_up(&rl->drain);
> >I'd prefer to add smp_mb() to waitqueue_active() actually!
> >
>
> That may be a good idea (I haven't really taken much notice of how other
> code uses it).
>
> I'm not worried about any possible performance advantages of removing it,
> rather just having a memory barrier without comments can be perplexing.
I fully agree, subtle things like that should always be commented.
--
Jens Axboe
Jens Axboe wrote:
> On Thu, Jan 06 2005, Nick Piggin wrote:
>
>>No that's right... but between the prepare_to_wait and the io_schedule,
>>get_request takes the lock and checks nr_requests. I think we are safe?
>
>
> It looks like it, yes you are right. But it looks to be needed a few
> lines further down instead, though :-)
>
> ===== drivers/block/ll_rw_blk.c 1.281 vs edited =====
> --- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00
> +++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00
> @@ -1630,11 +1630,11 @@
> if (rl->count[rw] < queue_congestion_off_threshold(q))
> clear_queue_congested(q, rw);
> if (rl->count[rw]+1 <= q->nr_requests) {
> - smp_mb();
> if (waitqueue_active(&rl->wait[rw]))
> wake_up(&rl->wait[rw]);
> blk_clear_queue_full(q, rw);
> }
> + smp_mb();
> if (unlikely(waitqueue_active(&rl->drain)) &&
> !rl->count[READ] && !rl->count[WRITE])
> wake_up(&rl->drain);
>
Yes, looks like you're right there.
Any point in doing it like this
if (!rl->count[READ] && !rl->count[WRITE]) {
smb_mb();
if (unlikely(waitqueue_active(...)))
wake_up()
}
I wonder? I don't have any feeling of how memory barriers impact performance
on a very parallel system with CPUs that do lots of memory reordering like
POWER5.
On Thu, Jan 06 2005, Nick Piggin wrote:
> Jens Axboe wrote:
> >On Thu, Jan 06 2005, Nick Piggin wrote:
> >
>
> >>No that's right... but between the prepare_to_wait and the io_schedule,
> >>get_request takes the lock and checks nr_requests. I think we are safe?
> >
> >
> >It looks like it, yes you are right. But it looks to be needed a few
> >lines further down instead, though :-)
> >
> >===== drivers/block/ll_rw_blk.c 1.281 vs edited =====
> >--- 1.281/drivers/block/ll_rw_blk.c 2004-12-01 09:13:57 +01:00
> >+++ edited/drivers/block/ll_rw_blk.c 2005-01-06 09:32:19 +01:00
> >@@ -1630,11 +1630,11 @@
> > if (rl->count[rw] < queue_congestion_off_threshold(q))
> > clear_queue_congested(q, rw);
> > if (rl->count[rw]+1 <= q->nr_requests) {
> >- smp_mb();
> > if (waitqueue_active(&rl->wait[rw]))
> > wake_up(&rl->wait[rw]);
> > blk_clear_queue_full(q, rw);
> > }
> >+ smp_mb();
> > if (unlikely(waitqueue_active(&rl->drain)) &&
> > !rl->count[READ] && !rl->count[WRITE])
> > wake_up(&rl->drain);
> >
>
> Yes, looks like you're right there.
>
> Any point in doing it like this
>
> if (!rl->count[READ] && !rl->count[WRITE]) {
> smb_mb();
> if (unlikely(waitqueue_active(...)))
> wake_up()
> }
>
> I wonder? I don't have any feeling of how memory barriers impact performance
> on a very parallel system with CPUs that do lots of memory reordering like
> POWER5.
I had the same idea - I think it's a good idea, it's definitely an
optimization worth doing.
--
Jens Axboe
On Wed, 5 Jan 2005, Andrew Morton wrote:
> The timeouts are for:
>
> a) A fallback for backing stores which don't wake up waiters in
> blk_congestion_wait() (eg: NFS).
This is buggy, btw. NFS has a "fun" deadlock where it can
send so many writes over the wire at once that the PF_MEMALLOC
allocations eat up all memory and there's no memory left to
receive ACKs from the NFS server ...
NFS will need to act congested when memory is low and there
are already some writeouts underway.
--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan