2003-08-29 15:04:13

by Shantanu Goel

Subject: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

Hi kernel hackers,

The VM subsystem in Linux 2.4.22 can cause spurious
swapouts in the presence of lots of dirty pages.
Presently, as dirty pages are encountered,
shrink_cache() schedules a writepage() and moves the
page to the head of the inactive list. When a lot of
dirty pages are present, this can break the FIFO
ordering of the inactive list because clean pages
further down the list will be reclaimed first. The
following patch records the pages being laundered, and
once SWAP_CLUSTER_MAX pages have been accumulated or
the scan is complete, goes back and attempts to move
them back to the tail of the list.
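To make the reordering concrete, here is a minimal user-space sketch of the idea, not the patch itself. It assumes a toy `struct page`, a hand-rolled circular list, and a hypothetical `LAUNDRY_BATCH` constant standing in for SWAP_CLUSTER_MAX; starting writeback is modelled by simply clearing the dirty flag.

```c
#include <assert.h>
#include <stddef.h>

#define LAUNDRY_BATCH 4   /* hypothetical stand-in for SWAP_CLUSTER_MAX */

struct page {
    struct page *prev, *next;
    int id;
    int dirty;
};

/* Circular list with a sentinel: head->next is the head, head->prev the tail. */
static void list_init(struct page *head) { head->prev = head->next = head; }

static void list_del(struct page *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
}

static void list_add_head(struct page *head, struct page *p)
{
    p->next = head->next; p->prev = head;
    head->next->prev = p; head->next = p;
}

static void list_add_tail(struct page *head, struct page *p)
{
    p->prev = head->prev; p->next = head;
    head->prev->next = p; head->prev = p;
}

/* Move the recorded pages back to the tail. Walking the batch newest-first
 * leaves the oldest laundered page closest to the tail, i.e. first in line. */
static void flush_laundry(struct page *head, struct page **laundry, int *nr)
{
    int i;
    for (i = *nr - 1; i >= 0; i--) {
        list_del(laundry[i]);
        list_add_tail(head, laundry[i]);
    }
    *nr = 0;
}

/* Scan up to max_scan pages from the tail; returns how many were reclaimed. */
static int scan_inactive(struct page *head, int max_scan)
{
    struct page *laundry[LAUNDRY_BATCH];
    int nr_laundry = 0, reclaimed = 0;

    while (max_scan-- > 0 && head->prev != head) {
        struct page *p = head->prev;
        list_del(p);
        if (!p->dirty) {            /* clean page: reclaim it */
            reclaimed++;
            continue;
        }
        p->dirty = 0;               /* pretend writepage() was started */
        list_add_head(head, p);     /* what the stock code does */
        assert(nr_laundry < LAUNDRY_BATCH);
        laundry[nr_laundry++] = p;  /* what the patch adds: remember it */
        if (nr_laundry == LAUNDRY_BATCH)
            flush_laundry(head, laundry, &nr_laundry);
    }
    flush_laundry(head, laundry, &nr_laundry);
    return reclaimed;
}
```

In this model, two dirty pages at the tail end up at the tail again after the scan, ahead of the clean pages behind them, so the next pass reclaims them in their original order.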

The second part of the patch reclaims unused
inode/dentry/dquot entries more aggressively. I have
observed that on an NFS server where swap out activity
is low, the VM can shrink the page cache to the point
where most pages are used up by unused inode/dentry
entries. This is because page cache reclamation
succeeds most of the time except when a swap_out()
happens.

Feedback and comments are welcome.

Thanks,
Shantanu Goel



Attachments:
2.4.22-vm-writeback-inode.patch (3.25 kB)

2003-08-29 17:29:58

by Antonio Vargas

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

On Fri, Aug 29, 2003 at 08:01:11AM -0700, Shantanu Goel wrote:
> Hi kernel hackers,
>
> The VM subsystem in Linux 2.4.22 can cause spurious
> swapouts in the presence of lots of dirty pages.
> Presently, as dirty pages are encountered,
> shrink_cache() schedules a writepage() and moves the
> page to the head of the inactive list. When a lot of
> dirty pages are present, this can break the FIFO
> ordering of the inactive list because clean pages
> further down the list will be reclaimed first. The
> following patch records the pages being laundered, and
> once SWAP_CLUSTER_MAX pages have been accumulated or
> the scan is complete, goes back and attempts to move
> them back to the tail of the list.
>
> The second part of the patch reclaims unused
> inode/dentry/dquot entries more aggressively. I have
> observed that on an NFS server where swap out activity
> is low, the VM can shrink the page cache to the point
> where most pages are used up by unused inode/dentry
> entries. This is because page cache reclamation
> succeeds most of the time except when a swap_out()
> happens.
>
> Feedback and comments are welcome.

Microoptimization (which helps on x86 a lot):

- for (i = nr_pages - 1; i >= 0; i--) {
- page = laundry[i];
+ laundry += nr_pages;
+ for (i = -nr_pages; i; i++) {
+ page = laundry[i];

Your original code reads from higher to lower addresses,
while the new one reads from lower to higher addresses.

The new code changes and then tests the loop counter,
so it's a little bit faster :)

Both check against zero, so both can use result flags
directly and do no intervening "cmp ecx,CONSTANT".
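As a sanity check on the negative-index idiom, here is a small user-space sketch (plain `int` arrays stand in for `laundry[]`; the function names are made up for illustration). One thing to watch: the counter has to be tested before the first access. A pre-increment inside the loop condition would skip the first element and would run off the array when nr_pages is 0.

```c
#include <assert.h>

/* Count-down form: visits indices nr_pages-1 .. 0. */
static long visit_down(const int *laundry, int nr_pages)
{
    long sum = 0;
    int i;

    for (i = nr_pages - 1; i >= 0; i--)
        sum += laundry[i];
    return sum;
}

/* Negative-index form: visits the same elements from low to high addresses.
 * The base pointer is moved one past the end and i runs from -nr_pages to 0,
 * so the loop test is still a comparison against zero. */
static long visit_up(const int *laundry, int nr_pages)
{
    long sum = 0;
    int i;

    laundry += nr_pages;
    for (i = -nr_pages; i != 0; i++)
        sum += laundry[i];
    return sum;
}
```

Both forms visit every element exactly once, but in opposite order, so the swap is only safe where the order of processing does not matter.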

To the "powers that be": would more micro-optimizations
of this type be welcome?

Greets, Antonio.

2003-08-29 17:55:48

by Shantanu Goel

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

I am not very knowledgeable about micro-optimizations.
I'll take your word for it. ;-) What interests me
more is whether others notice any performance
improvement under swapout with this patch given that
is on the order of milliseconds.


2003-08-29 18:07:20

by Mike Fedyk

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

On Fri, Aug 29, 2003 at 10:55:44AM -0700, Shantanu Goel wrote:
> I am not very knowledgeable about micro-optimizations.
> I'll take your word for it. ;-) What interests me
> more is whether others notice any performance
> improvement under swapout with this patch given that
> is on the order of milliseconds.

But have you compared your patch with the VM patches in -aa? Will your
patch apply on -aa and make improvements there too?

In other words: Why would I want to use this patch when I could use -aa?

2003-08-29 18:46:56

by Shantanu Goel

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

I prefer to run stock kernels so I don't have as much
experience with the -aa patches. However, I took a
look at the relevant code in 2.4.22pre7aa1 and I
believe my patch should help there as well. The
writepage() and page rotation behaviour is similar to
stock 2.4.22 though the inactive_list is per-classzone
in -aa. I am less sure about the inode/dcache part
though under -aa.


2003-08-29 18:57:37

by Mike Fedyk

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

[CCing AA & MCP]

> --- Mike Fedyk <[email protected]> wrote:
> > But have you compared your patch with the VM patches
> > in -aa? Will your
> > patch apply on -aa and make improvements there too?
> >
> > In other words: Why would I want to use this patch
> > when I could use -aa?

On Fri, Aug 29, 2003 at 11:46:44AM -0700, Shantanu Goel wrote:
> I prefer to run stock kernels so I don't have as much
> experience with the -aa patches. However, I took a
> look at the relevant code in 2.4.22pre7aa1 and I
> believe my patch should help there as well. The
> writepage() and page rotation behaviour is similar to
> stock 2.4.22 though the inactive_list is per-classzone
> in -aa. I am less sure about the inode/dcache part
> though under -aa.

You need to integrate with -aa on the VM. It has been hard enough for
Andrea to get his stuff in, I doubt you will fare any better.

If your patch shows improvements when applied on -aa Andrea will probably
integrate it.

Marc/Andrea, what do you think? Any holes to poke in this here patch?

2003-08-29 19:11:53

by Rahul Karnik

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

Shantanu Goel wrote:

> Feedback and comments are welcome.

This is an optimization patch. Since the VM is all voodoo :), there are
no obvious improvements. Numbers and/or test scripts would go a long way
in getting acceptance.

-Rahul
--
Rahul Karnik
[email protected]
http://www.genebrew.com/

2003-08-29 19:28:59

by Andrea Arcangeli

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

On Fri, Aug 29, 2003 at 11:57:28AM -0700, Mike Fedyk wrote:
> [CCing AA & MCP]
>
> > --- Mike Fedyk <[email protected]> wrote:
> > > But have you compared your patch with the VM patches
> > > in -aa? Will your
> > > patch apply on -aa and make improvements there too?
> > >
> > > In other words: Why would I want to use this patch
> > > when I could use -aa?
>
> On Fri, Aug 29, 2003 at 11:46:44AM -0700, Shantanu Goel wrote:
> > I prefer to run stock kernels so I don't have as much
> > experience with the -aa patches. However, I took a
> > look at the relevant code in 2.4.22pre7aa1 and I
> > believe my patch should help there as well. The
> > writepage() and page rotation behaviour is similar to
> > stock 2.4.22 though the inactive_list is per-classzone
> > in -aa. I am less sure about the inode/dcache part
> > though under -aa.
>
> You need to integrate with -aa on the VM. It has been hard enough for
> Andrea to get his stuff in, I doubt you will fare any better.
>
> If your patch shows improvements when applied on -aa Andrea will probably
> integrate it.

yes, at this point in time I'm willing to merge only things that are an
obvious improvement. More dubious things would better go in 2.6.

I haven't seen the patch in question. Shantanu, if you're interested in
merging it into -aa, could you submit it against 2.4.22pre7aa1? Otherwise
I'll check it and possibly merge it myself later (I have quite a backlog
to merge already for the short term, but it can go in the queue).

> Marc/Andrea, what do you think? Any holes to poke in this here patch?

didn't check it yet.

Andrea

2003-08-29 19:46:40

by Shantanu Goel

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

Andrea,

I'll test and submit a patch against -aa. Also, is
there a common benchmark that you use to test for
regression?

Thanks,
Shantanu


2003-08-29 19:59:51

by Andrea Arcangeli

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

On Fri, Aug 29, 2003 at 12:46:36PM -0700, Shantanu Goel wrote:
> Andrea,
>
> I'll test and submit a patch against -aa. Also, is
> there a common benchmark that you use to test for
> regression?

bonnie, tiobench, dbench would be a very good start for the basics (note:
dbench can be misleading, but at the same fairness levels, it's
interesting too, it's just that dbench doesn't measure the fairness
level itself [like tiobench started doing relatively recently]).

(I'm assuming the patch makes a difference not only for mmapped dirty
pages; if it helps only those, the above would not be interesting.)

thanks,

Andrea

2003-08-29 20:22:23

by Shantanu Goel

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

Thanks for the pointer to the benchmarks.

The patch I posted only helps the mmap case so it
won't help (or hurt hopefully ;-) any program that
primarily does read/write instead of mmap. The
extreme case where I observed this was a perl script
that created a gigantic hash and tried to populate it.
The perl in question uses mmap for malloc. The
difference in execution time between stock 2.4.22 and
one with the patch was insignificant because it is
primarily I/O bound, however the other apps I was
running, Mozilla and several xterm's, were paged out
much less frequently in the latter case. The machine
has 256MB of memory and perl grew to about 1 GB.

I have written another patch that more aggressively
tries to free pages with dirty buffers which should
help with the buffer I/O case. It essentially changes
try_to_free_buffers() so it immediately starts and
waits for I/O to complete if the gfp_mask allows it.
It does not do any clustering so its performance is
questionable at the moment.


2003-08-30 07:33:04

by Antonio Vargas

Subject: Re: [VM PATCH] Faster reclamation of dirty pages and unused inode/dcache entries in 2.4.22

On Fri, Aug 29, 2003 at 01:20:01PM -0700, Shantanu Goel wrote:
> Thanks for the pointer to the benchmarks.
>
> The patch I posted only helps the mmap case so it
> won't help (or hurt hopefully ;-) any program that
> primarily does read/write instead of mmap. The
> extreme case where I observed this was a perl script
> that created a gigantic hash and tried to populate it.

I've experienced this workload and it's easily reproducible:
get the lxr tools and try to build the indexes.

> The perl in question uses mmap for malloc. The
> difference in execution time between stock 2.4.22 and
> one with the patch was insignificant because it is
> primarily I/O bound, however the other apps I was
> running, Mozilla and several xterm's, were paged out
> much less frequently in the latter case. The machine
> has 256MB of memory and perl grew to about 1 GB.
>
> I have written another patch that more aggressively
> tries to free pages with dirty buffers which should
> help with the buffer I/O case. It essentially changes
> try_to_free_buffers() so it immediately starts and
> waits for I/O to complete if the gfp_mask allows it.
> It does not do any clustering so its performance is
> questionable at the moment.

--
winden/network

1. Given a program, it always has at least one bug.
2. Given several lines of code, they can always be shortened to fewer lines.
3. By induction, all programs can be
reduced to a single line that doesn't work.

2003-09-20 13:39:33

by Shantanu Goel

Subject: A couple of 2.4.23-pre4 VM nits

Hi Andrea,

The VM fixes perform rather well in my testing
(thanks!), but I noticed a couple of glitches that the
attached patch addresses.

1. max_scan is never decremented in shrink_cache(). I
am assuming this is a typo.

2. The second part of the patch makes sure that
inode/dentry caches are shrunk at least once every 5
secs. On a machine with a heavy inode stat/directory
lookup load (e.g. NFS server), most of the memory
winds up sitting idle in unused inodes/dentries. The
present code only reclaims these when a swap_out()
happens or shrink_caches() fails. This can take a
while on a machine with very few mapped pages such as
an NFS server.

Thanks,
Shantanu



Attachments:
vmscan.patch (2.23 kB)

2003-09-21 02:46:40

by Andrea Arcangeli

Subject: Re: A couple of 2.4.23-pre4 VM nits

Hi Shantanu,

On Sat, Sep 20, 2003 at 06:39:30AM -0700, Shantanu Goel wrote:
> Hi Andrea,
>
> The VM fixes perform rather well in my testing
> (thanks!), but I noticed a couple of glitches that the
> attached patch addresses.
>
> 1. max_scan is never decremented in shrink_cache(). I
> am assuming this is a typo.

this is a huge half-merging error; no surprise Andi ran into VM
troubles with pre4. My tree has looked like this for years:

	if (unlikely(!page_count(page)))
		continue;

	only_metadata = 0;
	if (!memclass(page_zone(page), classzone)) {
		/*
		 * Hack to address an issue found by Rik. The problem is that
		 * highmem pages can hold buffer headers allocated
		 * from the slab on lowmem, and so if we are working
		 * on the NORMAL classzone here, it is correct not to
		 * try to free the highmem pages themselves (that would be useless)
		 * but we must make sure to drop any lowmem metadata related to those
		 * highmem pages.
		 */
		if (page->buffers && page->mapping) { /* fast path racy check */
			if (unlikely(TryLockPage(page)))
				continue;
			if (page->buffers && page->mapping && memclass_related_bhs(page, classzone)) { /* non racy check */
				only_metadata = 1;
				goto free_bhs;
			}
			UnlockPage(page);
		}
		continue;
	}

	max_scan--;

max_scan-- should happen only after the memclass check, to avoid
failing too early on GFP_KERNEL/DMA allocations (i.e. no highmem)
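A toy model of why the placement matters, under simplifying assumptions: zones are a plain enum, the memclass() test is reduced to an ordering comparison, and `scan_budget()` and its `charge_all` flag are hypothetical names used only for this sketch.

```c
#include <assert.h>

enum zone { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM };

/* Returns how many pages inside the classzone were examined before the
 * scan budget ran out. With charge_all set, every page on the list costs
 * budget; otherwise only pages that pass the zone check do. */
static int scan_budget(const enum zone *pages, int n, enum zone classzone,
                       int max_scan, int charge_all)
{
    int examined = 0;
    int i;

    for (i = 0; i < n && max_scan > 0; i++) {
        if (charge_all)
            max_scan--;
        if (pages[i] > classzone)   /* simplified stand-in for !memclass() */
            continue;
        if (!charge_all)
            max_scan--;
        examined++;
    }
    return examined;
}
```

With four highmem pages ahead of two lowmem pages and a budget of 4, charging every page exhausts the budget before a single usable page is looked at, while charging only after the zone check still reaches both lowmem pages.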

This is the right fix:

--- 2.4.23pre4/mm/vmscan.c.~1~ 2003-09-13 00:08:04.000000000 +0200
+++ 2.4.23pre4/mm/vmscan.c 2003-09-21 04:40:12.000000000 +0200
@@ -401,6 +401,8 @@ static int shrink_cache(int nr_pages, zo
 		if (!memclass(page_zone(page), classzone))
 			continue;
 
+		max_scan--;
+
 		/* Racy check to avoid trylocking when not worthwhile */
 		if (!page->buffers && (page_count(page) != 1 || !page->mapping))
 			goto page_mapped;

so your fix is slightly wrong in non-highmem terms (and in scheduling
terms too: you would decrement it even when there's a schedule). We need
to finish the merge ASAP with Marcelo. I didn't send specific patches to
Marcelo myself yet; I only pointed out the URL and the list of them, plus
I described the details he wanted to know. I understand he merged what
he found most interesting, so I hadn't noticed this half-merge problem
yet, but I will start looking into this with the highest priority on
Monday. (I was going to look into pre4 very soon for -aa, which will now
reject big time, and the watermarks too, but I had some trouble with the
RAM and the scheduler on my fast amd64 this week that kept me busy. The
scheduler fix will be in the next -aa and improves ppc as well by 11%; on
some of my workloads it's a 99% improvement [this isn't relevant for
mainline though, and apparently it was already fixed in 2.6] ;).)

> 2. The second part of the patch makes sure that
> inode/dentry caches are shrunk at least once every 5
> secs. On a machine with a heavy inode stat/directory
> lookup load (e.g. NFS server), most of the memory
> winds up sitting idle in unused inodes/dentry. The
> present code only reclaims these when a swap_out()
> happens or shrink_caches() fails. This can take a
> while on a machine with very few mapped pages such as
> an NFS server.

For lots of workloads this will nuke the dcache way too fast, and
rebuilding it with the fs and buffercache is costly. However, if you make
the delay a sysctl set to MAX_SCHEDULE_TIMEOUT, I would definitely be fine
with merging it. That way it won't make any difference by default and it
can help in the corner cases where hacks like this may provide benefits,
as you noted.

thanks!

Andrea - If you prefer relying on open source software, check these links:
rsync.kernel.org::pub/scm/linux/kernel/bkcvs/linux-2.[45]/
http://www.cobite.com/cvsps/
svn://svn.kernel.org/linux-2.[46]/trunk

2003-09-21 05:32:22

by Shantanu Goel

Subject: Re: A couple of 2.4.23-pre4 VM nits

Agreed on all counts. I just blindly copied the
max_scan decrement from 2.4.22. Even there your
suggestion would make sense. Attached is a new patch
which adds support for vm_vfs_scan_interval sysctl and
also fixes the location of max_scan decrement.
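For readers without the attachment, the interval check could look roughly like the following user-space sketch. This is an assumption about the shape of the logic, not the attached patch: `struct vfs_scan_state`, `vfs_scan_due()` and `NEVER` are invented names, a plain counter plays the role of jiffies, and `NEVER` plays the role of a MAX_SCHEDULE_TIMEOUT default.

```c
#include <assert.h>
#include <limits.h>

#define NEVER LONG_MAX   /* stand-in for a MAX_SCHEDULE_TIMEOUT default */

struct vfs_scan_state {
    unsigned long last_scan;   /* tick of the last inode/dentry shrink */
    long interval;             /* ticks between shrinks, or NEVER */
};

/* Returns 1 (and records the time) when a periodic shrink is due. */
static int vfs_scan_due(struct vfs_scan_state *s, unsigned long now)
{
    /* Unsigned subtraction stays correct across counter wraparound. */
    unsigned long elapsed = now - s->last_scan;

    if (s->interval == NEVER || elapsed < (unsigned long)s->interval)
        return 0;
    s->last_scan = now;
    return 1;
}
```

With the interval left at `NEVER`, the check never fires, which matches the goal of changing nothing by default.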

Thanks,
Shantanu



Attachments:
vfs-interval.patch (3.95 kB)

2003-09-21 14:03:42

by Andrea Arcangeli

Subject: Re: A couple of 2.4.23-pre4 VM nits

On Sat, Sep 20, 2003 at 10:32:09PM -0700, Shantanu Goel wrote:
> Agreed on all counts. I just blindly copied the
> max_scan decrement from 2.4.22. Even there your
> suggestion would make sense. Attached is a new patch
> which adds support for vm_vfs_scan_interval sysctl and
> also fixes the location of max_scan decrement.

this patch looks fine to me thanks. Marcelo, feel free to merge this one
instead of my one liner fix.


2003-09-21 14:51:42

by Shantanu Goel

Subject: Re: A couple of 2.4.23-pre4 VM nits

I just realized the original code had one desirable
behaviour that my patch is missing, namely, it
reclaimed memory from dcache/inode every time swap_out
is called. Please use the attached patch that
restores the original behaviour. Otherwise, if the
interval is very long, no reclamation will happen
until swap_out() fails which in the common case is
unlikely.

Thanks,
Shantanu



Attachments:
vfs-interval2.patch (3.94 kB)

2003-09-21 15:13:50

by Andrea Arcangeli

Subject: Re: A couple of 2.4.23-pre4 VM nits

On Sun, Sep 21, 2003 at 07:51:27AM -0700, Shantanu Goel wrote:
> I just realized the original code had one desirable
> behaviour that my patch is missing, namely, it
> reclaimed memory from dcache/inode every time swap_out
> is called. Please use the attached patch that
> restores the original behaviour. Otherwise, if the
> interval is very long, no reclamation will happen
> until swap_out() fails which in the common case is
> unlikely.

I overlooked the *failed_swapout, I thought you used only 0 and 1 as
parameters, the new version is fine.

BTW, it would also be cleaner to add a __ in front of the function name,
and to #define a _force version that will pass 1, so you don't have less
readable 0/1 in the caller, but I don't mind even with the status in
version 2 (it's simple enough to understand the semantics of the 0/1).
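A sketch of the naming pattern I mean, with made-up names (`__shrink_vfs_caches` and the two wrappers are hypothetical, and the counters exist only so the effect can be observed in user space):

```c
#include <assert.h>

static int forced_calls, normal_calls;

/* Hypothetical worker taking the raw flag; the leading "__" follows the
 * kernel convention for "internal variant, call the wrappers instead". */
static void __shrink_vfs_caches(int force)
{
    if (force)
        forced_calls++;
    else
        normal_calls++;
}

/* Callers use the named wrappers, so no bare 0/1 shows up at call sites. */
#define shrink_vfs_caches()        __shrink_vfs_caches(0)
#define shrink_vfs_caches_force()  __shrink_vfs_caches(1)
```

A call site then reads `shrink_vfs_caches_force();` instead of passing a literal 1.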


2003-09-21 15:29:00

by Shantanu Goel

Subject: Re: A couple of 2.4.23-pre4 VM nits

Great suggestion. Attached is a fixed patch.



Attachments:
vfs-interval3.patch (4.09 kB)

2003-09-21 15:56:28

by Andrea Arcangeli

Subject: Re: A couple of 2.4.23-pre4 VM nits

On Sun, Sep 21, 2003 at 08:28:55AM -0700, Shantanu Goel wrote:
> Great suggestion. Attached is a fixed patch.

this looks even better now! :) Thanks.


2003-09-21 18:35:17

by Marcelo Tosatti

Subject: Re: A couple of 2.4.23-pre4 VM nits



On Sun, 21 Sep 2003, Andrea Arcangeli wrote:

> Hi Shantanu,
>
> On Sat, Sep 20, 2003 at 06:39:30AM -0700, Shantanu Goel wrote:
> > Hi Andrea,
> >
> > The VM fixes perform rather well in my testing
> > (thanks!), but I noticed a couple of glitches that the
> > attached patch addresses.
> >
> > 1. max_scan is never decremented in shrink_cache(). I
> > am assuming this is a typo.
>
> this is a huge half-merging error; no surprise Andi ran into VM
> troubles with pre4. My tree has looked like this for years:
>
> 	if (unlikely(!page_count(page)))
> 		continue;
>
> 	only_metadata = 0;
> 	if (!memclass(page_zone(page), classzone)) {
> 		/*
> 		 * Hack to address an issue found by Rik. The problem is that
> 		 * highmem pages can hold buffer headers allocated
> 		 * from the slab on lowmem, and so if we are working
> 		 * on the NORMAL classzone here, it is correct not to
> 		 * try to free the highmem pages themselves (that would be useless)
> 		 * but we must make sure to drop any lowmem metadata related to those
> 		 * highmem pages.
> 		 */
> 		if (page->buffers && page->mapping) { /* fast path racy check */
> 			if (unlikely(TryLockPage(page)))
> 				continue;
> 			if (page->buffers && page->mapping && memclass_related_bhs(page, classzone)) { /* non racy check */
> 				only_metadata = 1;
> 				goto free_bhs;
> 			}
> 			UnlockPage(page);
> 		}
> 		continue;
> 	}
>
> 	max_scan--;
>
> max_scan-- should happen only after the memclass check, to avoid
> failing too early on GFP_KERNEL/DMA allocations (i.e. no highmem)
>
> This is the right fix:
>
> --- 2.4.23pre4/mm/vmscan.c.~1~ 2003-09-13 00:08:04.000000000 +0200
> +++ 2.4.23pre4/mm/vmscan.c 2003-09-21 04:40:12.000000000 +0200
> @@ -401,6 +401,8 @@ static int shrink_cache(int nr_pages, zo
>  		if (!memclass(page_zone(page), classzone))
>  			continue;
>
> +		max_scan--;
> +
>  		/* Racy check to avoid trylocking when not worthwhile */
>  		if (!page->buffers && (page_count(page) != 1 || !page->mapping))
>  			goto page_mapped;

Right! Thanks Andrea.

Sorry for the merge mistake, people. Shame on me.

I'm going to release pre5 now with this.