2005-03-11 06:10:46

by Andrea Arcangeli

Subject: 2.4 fix for write throttling on x86 >1G

Hello Marcelo,

I've got a fix for you on 2.4. I got reports of stalls with heavy writes
on 2.4. There was a mistake in nr_free_buffer_pages(). That function is
definitely meant _not_ to take highmem into account (dirty cache cannot
spread over highmem in 2.4 [even when on top of fs]). For unknown
reasons it was actually taking highmem into account. The code was
obviously meant not to take it into account (see the GFP_USER and
zonelist), except it wasn't using the zonelist. That is a severe problem
because there will be no write throttling at all, and no bdflush wakeup
either.

This should fix it, though my compiler fails to compile 2.4, so I can't
verify it right away. If any problem shows up I'll post a followup.

This is a noop for all systems <800M (1G shouldn't be noticeable
either). That is why most people never notice the problem.

Thanks.

Signed-off-by: Andrea Arcangeli <[email protected]>

--- 2.4.23aa3/mm/page_alloc.c.~1~ 2004-07-04 02:09:42.000000000 +0200
+++ 2.4.23aa3/mm/page_alloc.c 2005-03-11 07:00:23.000000000 +0100
@@ -656,7 +656,7 @@ unsigned int nr_free_buffer_pages (void)
 		class_idx = zone_idx(zone);
 
 		sum += zone->nr_cache_pages;
-		for (zone = pgdat->node_zones; zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
+		for (; zone; zone = *zonep++) {
 			int free = zone->free_pages - zone->watermarks[class_idx].high;
 			if (free <= 0)
 				continue;


2005-03-11 20:35:42

by Marcelo Tosatti

Subject: Re: 2.4 fix for write throttling on x86 >1G

Hi Andrea!

On Fri, Mar 11, 2005 at 07:10:35AM +0100, Andrea Arcangeli wrote:
> Hello Marcelo,
>
> I've got a fix for you on 2.4. I got reports of stalls with heavy writes
> on 2.4.

Out of curiosity, that was SuSE not mainline ?

> There was a mistake in nr_free_buffer_pages. That function is
> definitely meant _not_ to take highmem into account (dirty cache cannot
> spread over highmem in 2.4 [even when on top of fs]). For unknown
> reasons it was actually taking highmem into account. The code was
> obviously meant not to take it into account (see the GFP_USER and zonelist),
> except it wasn't using the zonelist.

True, initialization of "zone" variable in nr_free_buffer_pages() is
un-nice.

> That is a severe problem because
> there will be no write throttling at all, and no bdflush wakeup either.
>
> This should fix it, though my compiler fails to compile 2.4, so I can't
> verify it right away. If any problem shows up I'll post a followup.
>
> This is a noop for all systems <800M (1G shouldn't be noticeable
> either). This is why most people can't notice.

Do we really want to limit dirty cache to low mem on HIGHIO capable
machines? I'm afraid doing so might hurt performance on such systems.

I think it might be wise to have nr_free_buffer_pages() take highmem
into account if CONFIG_HIGHIO is set ?

> --- 2.4.23aa3/mm/page_alloc.c.~1~ 2004-07-04 02:09:42.000000000 +0200
> +++ 2.4.23aa3/mm/page_alloc.c 2005-03-11 07:00:23.000000000 +0100
> @@ -656,7 +656,7 @@ unsigned int nr_free_buffer_pages (void)
>  		class_idx = zone_idx(zone);
>
>  		sum += zone->nr_cache_pages;
> -		for (zone = pgdat->node_zones; zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
> +		for (; zone; zone = *zonep++) {
>  			int free = zone->free_pages - zone->watermarks[class_idx].high;
>  			if (free <= 0)
>  				continue;

2005-03-11 20:56:21

by Andrea Arcangeli

Subject: Re: 2.4 fix for write throttling on x86 >1G

Hello Marcelo,

On Fri, Mar 11, 2005 at 01:04:13PM -0300, Marcelo Tosatti wrote:
> Out of curiosity, that was SuSE not mainline ?

yep.

> Do we really want to limit dirty cache to low mem on HIGHIO capable
> machines? I'm afraid doing so might hurt performance on such systems.
>
> I think it might be wise to have nr_free_buffer_pages() take highmem
> into account if CONFIG_HIGHIO is set ?

The problem is the buffercache/blkdev-pagecache: it simply can't go in
highmem. A similar fix happened recently in 2.6 for the same reasons,
but in 2.6 we allowed it with some logic specific for the
blkdev-pagecache.

nr_free_buffer_pages() was never intended to take highmem into account,
that's why there's the GFP_USER thing already, except we didn't loop
into the zonelist, so I didn't try to make a fix similar to 2.6.

2005-03-11 21:19:28

by Marcelo Tosatti

Subject: Re: 2.4 fix for write throttling on x86 >1G

On Fri, Mar 11, 2005 at 09:53:09PM +0100, Andrea Arcangeli wrote:
> Hello Marcelo,
>
> On Fri, Mar 11, 2005 at 01:04:13PM -0300, Marcelo Tosatti wrote:
> > Out of curiosity, that was SuSE not mainline ?
>
> yep.
>
> > Do we really want to limit dirty cache to low mem on HIGHIO capable
> > machines? I'm afraid doing so might hurt performance on such systems.
> >
> > I think it might be wise to have nr_free_buffer_pages() take highmem
> > into account if CONFIG_HIGHIO is set ?
>
> The problem is the buffercache/blkdev-pagecache: it simply can't go in
> highmem. A similar fix happened recently in 2.6 for the same reasons,
> but in 2.6 we allowed it with some logic specific for the
> blkdev-pagecache.

Right, I don't think it is easy or desirable to make that distinction in v2.4.

> nr_free_buffer_pages() was never intended to take highmem into account,
> that's why there's the GFP_USER thing already, except we didn't loop
> into the zonelist, so I didn't try to make a fix similar to 2.6.

Hopefully it is not a big deal to disallow >1GB of dirty pagecache on v2.4.

Applied, thanks.