LinuxLists.cc - Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

2010-11-26 16:06:59

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote:
><snip>
> To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> is introduced that takes a more accurate reading of NR_FREE_PAGES when
> called from wakeup_kswapd, when deciding whether it is really safe to go
> back to sleep in sleeping_prematurely() and when deciding if a zone is
> really balanced or not in balance_pgdat(). We are still using an expensive
> function but limiting how often it is called.
><snip>
> Reported-by: Shaohua Li <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Hi Mel,

I notice these aren't flagged for stable, should they be? (They fairly
trivially apply and compile on 2.6.36 barring the trace_ points which
changed.) I've got a few bug reports against .36/.37 where kswapd has
been sleeping for 60s+.

I built them some kernels with these patches, but haven't heard back yet
as to whether it fixes things for them.

Thanks for any insight,
Kyle

2010-11-29 09:56:37

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Fri, Nov 26, 2010 at 11:06:19AM -0500, Kyle McMartin wrote:
> On Wed, Oct 27, 2010 at 09:47:35AM +0100, Mel Gorman wrote:
> ><snip>
> > To ensure that kswapd wakes up, a safe version of zone_watermark_ok()
> > is introduced that takes a more accurate reading of NR_FREE_PAGES when
> > called from wakeup_kswapd, when deciding whether it is really safe to go
> > back to sleep in sleeping_prematurely() and when deciding if a zone is
> > really balanced or not in balance_pgdat(). We are still using an expensive
> > function but limiting how often it is called.
> ><snip>
> > Reported-by: Shaohua Li <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Hi Mel,
>
> I notice these aren't flagged for stable, should they be? (They fairly
> trivially apply and compile on 2.6.36 barring the trace_ points which
> changed.)

They were not flagged for stable because they were performance rather than
function bugs that affected a limited number of machines. Should that decision
be revisited?

> I've got a few bug reports against .36/.37 where kswapd has
> been sleeping for 60s+.
>

I do not believe these patches would affect kswapd sleeping for 60s.

> I built them some kernels with these patches, but haven't heard back yet
> as to whether it fixes things for them.
>
> Thanks for any insight,

Can you point me at a relevant bugzilla entry or forward me the bug report
and I'll take a look?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-29 13:17:09

by Kyle McMartin

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote:
> Can you point me at a relevant bugzilla entry or forward me the bug report
> and I'll take a look?
>

https://bugzilla.redhat.com/show_bug.cgi?id=649694

Thanks,
Kyle

2010-11-29 15:08:41

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, Nov 29, 2010 at 08:16:26AM -0500, Kyle McMartin wrote:
> On Mon, Nov 29, 2010 at 09:56:19AM +0000, Mel Gorman wrote:
> > Can you point me at a relevant bugzilla entry or forward me the bug report
> > and I'll take a look?
> >
>
> https://bugzilla.redhat.com/show_bug.cgi?id=649694
>

Ouch! I have been unable to create an exact copy of your kernel source as
I'm not running Fedora. From a partial conversion of a source RPM, I saw no
changes related to mm/vmscan.c. Is this accurate? I'm trying to establish
if this is a mainline bug as well.

Second, I see all the stack traces are marked with "?" making them
unreliable. Is that anything to be concerned about?

I see that one user has reported that the patches fixed the problem for him
but I fear that this might be a co-incidence or that the patches close a
race of some description. Specifically, I'm trying to identify if there is
a situation where kswapd() constantly loops checking watermarks and never
calling cond_resched(). This could conceivably happen if kswapd() is always
checking sleeping_prematurely() at a higher order where as balance_pgdat()
is always checks the watermarks at the lower order. I'm not seeing how this
could happen in 2.6.35.6 though. If Fedora doesn't have special changes,
it might mean that these patches do need to go into -stable as the
cost of zone_page_state_snapshot() is far higher on larger machines than
previously reported.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-11-29 15:23:04

by Kyle McMartin

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote:
> Ouch! I have been unable to create an exact copy of your kernel source as
> I'm not running Fedora. From a partial conversion of a source RPM, I saw no
> changes related to mm/vmscan.c. Is this accurate? I'm trying to establish
> if this is a mainline bug as well.
>

Sorry, if you extract the source rpm you should get the patched
sources... Aside from a few patches to mm/mmap for execshield, mm/* is
otherwise untouched from the latest stable 2.6.35 kernels.

If you git clone git://pkgs.fedoraproject.org/kernel and check out the
origin/f14/master branch, it has all the patches we apply (based on the
'ApplyPatch' lines in kernel.spec

> Second, I see all the stack traces are marked with "?" making them
> unreliable. Is that anything to be concerned about?
>

Hrm, I don't think it is, I think the ones with '?' are just artifacts
because we don't have a proper unwinder. Oh! Thanks! I just found a bug
in our configs... We don't have CONFIG_FRAME_POINTER set because
CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix
that up.

> I see that one user has reported that the patches fixed the problem for him
> but I fear that this might be a co-incidence or that the patches close a
> race of some description. Specifically, I'm trying to identify if there is
> a situation where kswapd() constantly loops checking watermarks and never
> calling cond_resched(). This could conceivably happen if kswapd() is always
> checking sleeping_prematurely() at a higher order where as balance_pgdat()
> is always checks the watermarks at the lower order. I'm not seeing how this
> could happen in 2.6.35.6 though. If Fedora doesn't have special changes,
> it might mean that these patches do need to go into -stable as the
> cost of zone_page_state_snapshot() is far higher on larger machines than
> previously reported.
>

Yeah, I am a bit surprised as well. Luke seems to have quite a large
machine... I haven't seen any kswapd lockups there on my 18G machine
using the same kernel. :< (Possibly it's just not stressed enough
though.)

Thanks for looking into this!
Kyle

2010-11-29 15:27:06

by Kyle McMartin

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote:
> Hrm, I don't think it is, I think the ones with '?' are just artifacts
> because we don't have a proper unwinder. Oh! Thanks! I just found a bug
> in our configs... We don't have CONFIG_FRAME_POINTER set because
> CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix
> that up.
>

Oops, no, misdiagnosed that by accidentally grepping for FRAME_POINTERS
instead of FRAME_POINTER... it's set in our configs.

2010-11-29 15:58:18

by Mel Gorman

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, Nov 29, 2010 at 10:22:30AM -0500, Kyle McMartin wrote:
> On Mon, Nov 29, 2010 at 03:08:24PM +0000, Mel Gorman wrote:
> > Ouch! I have been unable to create an exact copy of your kernel source as
> > I'm not running Fedora. From a partial conversion of a source RPM, I saw no
> > changes related to mm/vmscan.c. Is this accurate? I'm trying to establish
> > if this is a mainline bug as well.
> >
>
> Sorry, if you extract the source rpm you should get the patched
> sources... Aside from a few patches to mm/mmap for execshield, mm/* is
> otherwise untouched from the latest stable 2.6.35 kernels.
>

Perfect, that correlates with what I saw so this is probably a
mainline issue.

> If you git clone git://pkgs.fedoraproject.org/kernel and check out the
> origin/f14/master branch, it has all the patches we apply (based on the
> 'ApplyPatch' lines in kernel.spec
>
> > Second, I see all the stack traces are marked with "?" making them
> > unreliable. Is that anything to be concerned about?
> >
>
> Hrm, I don't think it is, I think the ones with '?' are just artifacts
> because we don't have a proper unwinder. Oh! Thanks! I just found a bug
> in our configs... We don't have CONFIG_FRAME_POINTER set because
> CONFIG_DEBUG_KERNEL got unset in the 'production' configs... I'll fix
> that up.
>

Ordinarily I'd expect it to be from the lack of a unwinder but if FRAME_POINTER
is there (which you say in a follow-up mail that is), it can be a bit of
a concern. There is some real weirness as it is. Take on of Luke's
examples where it appears to be locked up in

[ 5015.448127] Pid: 185, comm: kswapd1 Tainted: P 2.6.35.6-48.fc14.x86_64 #1 X8DA3/X8DA3
[ 5015.448127] RIP: 0010:[<ffffffff81469130>] [<ffffffff81469130>] _raw_spin_unlock_irqrestore+0x18/0x19

I am at a loss to explain under what circumstances that can even happen!
Is there any possibility RIP is being translated to the wrong symbol possibly
via an userspace decoder of the oops or similar? Is there any possibility
the stack is being corrupted if the swap subsystem is on a complicated
software stack?

> > I see that one user has reported that the patches fixed the problem for him
> > but I fear that this might be a co-incidence or that the patches close a
> > race of some description. Specifically, I'm trying to identify if there is
> > a situation where kswapd() constantly loops checking watermarks and never
> > calling cond_resched(). This could conceivably happen if kswapd() is always
> > checking sleeping_prematurely() at a higher order where as balance_pgdat()
> > is always checks the watermarks at the lower order. I'm not seeing how this
> > could happen in 2.6.35.6 though. If Fedora doesn't have special changes,
> > it might mean that these patches do need to go into -stable as the
> > cost of zone_page_state_snapshot() is far higher on larger machines than
> > previously reported.
> >
>
> Yeah, I am a bit surprised as well. Luke seems to have quite a large
> machine... I haven't seen any kswapd lockups there on my 18G machine
> using the same kernel. :< (Possibly it's just not stressed enough
> though.)
>

He reports his machine as 24-way but with his model of CPU it could
still be 2-socket which I ordinarily would not have expected to suffer
so badly from zone_page_state_snapshot(). I would have predicted that
the patches being thrown about in the thread "Free memory never fully
used, swapping" to be more relevant to kswapd failing to go to sleep :/

Andrew, this patch was a performance fix but is a report saying that it
fixes a functional regression in Fedora enough to push a patch torwards
stable even though an explanation as to *why* it fixes the problem is missing?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-12-23 22:18:53

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Mon, 29 Nov 2010, Mel Gorman wrote:

> Andrew, this patch was a performance fix but is a report saying that it
> fixes a functional regression in Fedora enough to push a patch torwards
> stable even though an explanation as to *why* it fixes the problem is missing?
>

We had to pull aa454840 "mm: page allocator: calculate a better estimate
of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
internally because tests showed that it would cause the machine to stall
as the result of heavy kswapd activity. I merged it back with this fix as
it is pending in the -mm tree and it solves the issue we were seeing, so I
definitely think this should be pushed to -stable (and I would seriously
consider it for 2.6.37 inclusion even at this late date).

2010-12-23 22:36:12

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Thu, 23 Dec 2010 14:18:38 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Mon, 29 Nov 2010, Mel Gorman wrote:
>
> > Andrew, this patch was a performance fix but is a report saying that it
> > fixes a functional regression in Fedora enough to push a patch torwards
> > stable even though an explanation as to *why* it fixes the problem is missing?
> >
>
> We had to pull aa454840 "mm: page allocator: calculate a better estimate
> of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
> internally because tests showed that it would cause the machine to stall
> as the result of heavy kswapd activity. I merged it back with this fix as
> it is pending in the -mm tree and it solves the issue we were seeing, so I
> definitely think this should be pushed to -stable (and I would seriously
> consider it for 2.6.37 inclusion even at this late date).

How's about I send
mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch
in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x?
That way it'll get a bit of 2.6.38-rc testing before being merged into
2.6.37.x.

2010-12-23 23:00:54

by Kyle McMartin

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Thu, Dec 23, 2010 at 02:35:21PM -0800, Andrew Morton wrote:
> > > Andrew, this patch was a performance fix but is a report saying that it
> > > fixes a functional regression in Fedora enough to push a patch torwards
> > > stable even though an explanation as to *why* it fixes the problem is missing?
> > We had to pull aa454840 "mm: page allocator: calculate a better estimate
> > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
> > internally because tests showed that it would cause the machine to stall
> > as the result of heavy kswapd activity. I merged it back with this fix as
> > it is pending in the -mm tree and it solves the issue we were seeing, so I
> > definitely think this should be pushed to -stable (and I would seriously
> > consider it for 2.6.37 inclusion even at this late date).
>
> How's about I send
> mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch
> in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x?
> That way it'll get a bit of 2.6.38-rc testing before being merged into
> 2.6.37.x.
>

That sounds fine to me. (Thanks very much for the update, David!) I
don't mind carrying a few extra patches here and there in Fedora to get
them some exposure if they're low risk... I've been carrying Mel's
patches for a month or so now and it hasn't turned up any obvious
problems in testing.

regards, Kyle

2010-12-23 23:07:10

by David Rientjes

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Thu, 23 Dec 2010, Andrew Morton wrote:

> > We had to pull aa454840 "mm: page allocator: calculate a better estimate
> > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
> > internally because tests showed that it would cause the machine to stall
> > as the result of heavy kswapd activity. I merged it back with this fix as
> > it is pending in the -mm tree and it solves the issue we were seeing, so I
> > definitely think this should be pushed to -stable (and I would seriously
> > consider it for 2.6.37 inclusion even at this late date).
>
> How's about I send
> mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch
> in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x?
> That way it'll get a bit of 2.6.38-rc testing before being merged into
> 2.6.37.x.
>

I don't think anyone would be able to answer that judgment call other than
you or Linus, it's a trade-off on whether 2.6.37 should be released with
the knowledge that it regresses just like 2.6.36 does (rendering both
unusable on some of our machines out of the box) because we're late in the
cycle.

I personally think the testing is already sufficient since it's been
sitting in -mm for two months, it's been suggested as stable material by a
couple different parties, it was a prerequisite for the transparent
hugepage series, and we've tested and merged it as fixing the regression
in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix
internally, though, so it's not for selfish reasons :)

2010-12-23 23:19:44

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH 1/2] mm: page allocator: Adjust the per-cpu counter threshold when memory is low

On Thu, 23 Dec 2010 15:07:02 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Thu, 23 Dec 2010, Andrew Morton wrote:
>
> > > We had to pull aa454840 "mm: page allocator: calculate a better estimate
> > > of NR_FREE_PAGES when memory is low and kswapd is awake" from 2.6.36
> > > internally because tests showed that it would cause the machine to stall
> > > as the result of heavy kswapd activity. I merged it back with this fix as
> > > it is pending in the -mm tree and it solves the issue we were seeing, so I
> > > definitely think this should be pushed to -stable (and I would seriously
> > > consider it for 2.6.37 inclusion even at this late date).
> >
> > How's about I send
> > mm-page-allocator-adjust-the-per-cpu-counter-threshold-when-memory-is-low.patch
> > in for 2.6.38 and tag it for backporting into 2.6.37.1 and 2.6.36.x?
> > That way it'll get a bit of 2.6.38-rc testing before being merged into
> > 2.6.37.x.
> >
>
> I don't think anyone would be able to answer that judgment call other than
> you or Linus, it's a trade-off on whether 2.6.37 should be released with
> the knowledge that it regresses just like 2.6.36 does (rendering both
> unusable on some of our machines out of the box) because we're late in the
> cycle.
>
> I personally think the testing is already sufficient since it's been
> sitting in -mm for two months, it's been suggested as stable material by a
> couple different parties, it was a prerequisite for the transparent
> hugepage series, and we've tested and merged it as fixing the regression
> in 2.6.36 (as Fedora has, as far as I know). We've already merged the fix
> internally, though, so it's not for selfish reasons :)

Wibble, wobble. It's good that the patch has been used in RH kernels.
otoh, the patch is really quite big and the problem was present in
2.6.36 without a lot of complaints and we're very late in -rc and not
many people will be testing over xmas/newyear, and it would be most sad
to put badness into mainline at this time.

So I'm still inclined to go with (discretion > valour).