2012-10-11 08:52:37

by Jiri Slaby

Subject: kswapd0: excessive CPU usage

Hi,

with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
minute or so. If I try to suspend to RAM, this trace appears:
kswapd0 R running task 0 577 2 0x00000000
0000000000000000 00000000000000c0 cccccccccccccccd ffff8801c4146800
ffff8801c4b15c88 ffffffff8116ee05 0000000000003e32 ffff8801c3a79000
ffff8801c4b15ca8 ffffffff8116fdf8 ffff8801c480f398 ffff8801c3a79000
Call Trace:
[<ffffffff8116ee05>] ? put_super+0x25/0x40
[<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
[<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
[<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
[<ffffffff8113452d>] ? kswapd+0x66d/0xb60
[<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
[<ffffffff810a2770>] ? kthread+0xc0/0xd0
[<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
[<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
[<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

# cat /proc/vmstat
nr_free_pages 239962
nr_inactive_anon 89825
nr_active_anon 711136
nr_inactive_file 60386
nr_active_file 46668
nr_unevictable 0
nr_mlock 0
nr_anon_pages 500678
nr_mapped 41319
nr_file_pages 319317
nr_dirty 45
nr_writeback 0
nr_slab_reclaimable 21909
nr_slab_unreclaimable 21598
nr_page_table_pages 12131
nr_kernel_stack 491
nr_unstable 0
nr_bounce 0
nr_vmscan_write 1674280
nr_vmscan_immediate_reclaim 301662
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 212263
nr_dirtied 10620227
nr_written 9260939
nr_anon_transparent_hugepages 172
nr_free_cma 0
nr_dirty_threshold 31459
nr_dirty_background_threshold 15729
pgpgin 31311778
pgpgout 38987552
pswpin 0
pswpout 0
pgalloc_dma 0
pgalloc_dma32 245169455
pgalloc_normal 279685864
pgalloc_movable 0
pgfree 537318727
pgactivate 13126755
pgdeactivate 2482953
pgfault 645947575
pgmajfault 193427
pgrefill_dma 0
pgrefill_dma32 1124272
pgrefill_normal 1998033
pgrefill_movable 0
pgsteal_kswapd_dma 0
pgsteal_kswapd_dma32 2531015
pgsteal_kswapd_normal 3403006
pgsteal_kswapd_movable 0
pgsteal_direct_dma 0
pgsteal_direct_dma32 362488
pgsteal_direct_normal 1134511
pgsteal_direct_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 2693620
pgscan_kswapd_normal 5836491
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 368374
pgscan_direct_normal 1658486
pgscan_direct_movable 0
pgscan_direct_throttle 0
pginodesteal 258410
slabs_scanned 86459392
kswapd_inodesteal 3907549
kswapd_low_wmark_hit_quickly 15408
kswapd_high_wmark_hit_quickly 23113
kswapd_skip_congestion_wait 10
pageoutrun 2165627235
allocstall 11256
pgrotated 219624
compact_blocks_moved 4862077
compact_pages_moved 1970005
compact_pagemigrate_failed 1726156
compact_stall 21275
compact_fail 6589
compact_success 14686
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2799
unevictable_pgs_scanned 0
unevictable_pgs_rescued 22563
unevictable_pgs_mlocked 22563
unevictable_pgs_munlocked 22563
unevictable_pgs_cleared 0
unevictable_pgs_stranded 0
thp_fault_alloc 18725
thp_fault_fallback 64868
thp_collapse_alloc 9216
thp_collapse_alloc_failed 2031
thp_split 2146

Any ideas what it could be?

--
js
suse labs


2012-10-11 13:45:20

by Valdis Klētnieks

Subject: Re: kswapd0: excessive CPU usage

On Thu, 11 Oct 2012 10:52:28 +0200, Jiri Slaby said:
> Hi,
>
> with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
> minute or so.


> [<ffffffff8116ee05>] ? put_super+0x25/0x40
> [<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
> [<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
> [<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
> [<ffffffff8113452d>] ? kswapd+0x66d/0xb60
> [<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
> [<ffffffff810a2770>] ? kthread+0xc0/0xd0
> [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
> [<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
> [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

I don't know what it is, I haven't finished bisecting it - but I can confirm that
I started seeing the same problem 2 or 3 weeks ago. Note that said call
trace does *NOT* require a suspend - I don't do suspend on my laptop and
I'm seeing kswapd burn CPU with similar traces.

# cat /proc/31/stack
[<ffffffff81110306>] grab_super_passive+0x44/0x76
[<ffffffff81110372>] prune_super+0x3a/0x13c
[<ffffffff810dc52a>] shrink_slab+0x95/0x301
[<ffffffff810defb7>] kswapd+0x5c8/0x902
[<ffffffff8104eea4>] kthread+0x9d/0xa5
[<ffffffff815ccfac>] ret_from_fork+0x7c/0x90
[<ffffffffffffffff>] 0xffffffffffffffff
# cat /proc/31/stack
[<ffffffff8110f5af>] put_super+0x29/0x2d
[<ffffffff8110f637>] drop_super+0x1b/0x20
[<ffffffff81110462>] prune_super+0x12a/0x13c
[<ffffffff810dc52a>] shrink_slab+0x95/0x301
[<ffffffff810defb7>] kswapd+0x5c8/0x902
[<ffffffff8104eea4>] kthread+0x9d/0xa5
[<ffffffff815ccfac>] ret_from_fork+0x7c/0x90
[<ffffffffffffffff>] 0xffffffffffffffff

So at least we know we're not hallucinating. :)





2012-10-11 15:34:31

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/11/2012 03:44 PM, [email protected] wrote:
> So at least we know we're not hallucinating. :)

Just a thought: do you have RAID?

--
js
suse labs

2012-10-11 17:57:29

by Valdis Klētnieks

Subject: Re: kswapd0: excessive CPU usage

On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
> On 10/11/2012 03:44 PM, [email protected] wrote:
> > So at least we know we're not hallucinating. :)
>
> Just a thought: do you have RAID?

Nope, just a 160G laptop spinning hard drive. Filesystems are
ext4 on LVM on a cryptoLUKS partition on /dev/sda2.



2012-10-11 17:59:40

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/11/2012 07:56 PM, [email protected] wrote:
> On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
>> On 10/11/2012 03:44 PM, [email protected] wrote:
>>> So at least we know we're not hallucinating. :)
>>
>> Just a thought: do you have RAID?
>
> Nope, just a 160G laptop spinning hard drive. Filesystems are ext4
> on LVM on a cryptoLUKS partition on /dev/sda2.

OK, maybe it's compaction. Do you have CONFIG_COMPACTION=y?


--
js
suse labs

2012-10-11 18:20:21

by Valdis Klētnieks

Subject: Re: kswapd0: excessive CPU usage

On Thu, 11 Oct 2012 19:59:33 +0200, Jiri Slaby said:
> On 10/11/2012 07:56 PM, [email protected] wrote:
> > On Thu, 11 Oct 2012 17:34:24 +0200, Jiri Slaby said:
> >> On 10/11/2012 03:44 PM, [email protected] wrote:
> >>> So at least we know we're not hallucinating. :)
> >>
> >> Just a thought: do you have RAID?
> >
> > Nope, just a 160G laptop spinning hard drive. Filesystems are ext4
> > on LVM on a cryptoLUKS partition on /dev/sda2.
>
> OK, maybe it's compaction. Do you have CONFIG_COMPACTION=y?

# zgrep COMPAC /proc/config.gz
CONFIG_COMPACTION=y

Hope that tells you something useful.




2012-10-11 22:08:20

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/11/2012 08:19 PM, [email protected] wrote:
> # zgrep COMPAC /proc/config.gz
> CONFIG_COMPACTION=y
>
> Hope that tells you something useful.

It just supports another theory of mine. This seems to fix it for me:
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1830,8 +1830,8 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 */
 	pages_for_compaction = (2UL << sc->order);
 
-	pages_for_compaction = scale_for_compaction(pages_for_compaction,
-						lruvec, sc);
+/*	pages_for_compaction = scale_for_compaction(pages_for_compaction,
+						lruvec, sc);*/
 	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

And for you?

(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".)

regards,
--
js
suse labs

2012-10-11 22:14:15

by Andrew Morton

Subject: Re: kswapd0: excessive CPU usage

On Thu, 11 Oct 2012 10:52:28 +0200
Jiri Slaby <[email protected]> wrote:

> with 3.6.0-next-20121008, kswapd0 is spinning my CPU at 100% for 1
> minute or so. If I try to suspend to RAM, this trace appears:
> kswapd0 R running task 0 577 2 0x00000000
> 0000000000000000 00000000000000c0 cccccccccccccccd ffff8801c4146800
> ffff8801c4b15c88 ffffffff8116ee05 0000000000003e32 ffff8801c3a79000
> ffff8801c4b15ca8 ffffffff8116fdf8 ffff8801c480f398 ffff8801c3a79000
> Call Trace:
> [<ffffffff8116ee05>] ? put_super+0x25/0x40
> [<ffffffff8116fdd4>] ? grab_super_passive+0x24/0xa0
> [<ffffffff8116ff99>] ? prune_super+0x149/0x1b0
> [<ffffffff81131531>] ? shrink_slab+0xa1/0x2d0
> [<ffffffff8113452d>] ? kswapd+0x66d/0xb60
> [<ffffffff81133ec0>] ? try_to_free_pages+0x180/0x180
> [<ffffffff810a2770>] ? kthread+0xc0/0xd0
> [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130
> [<ffffffff816a6c9c>] ? ret_from_fork+0x7c/0x90
> [<ffffffff810a26b0>] ? kthread_create_on_node+0x130/0x130

Could you please do a sysrq-T a few times while it's spinning, to
confirm that this trace is consistently the culprit?
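
For reference, the same task dump can be triggered from a shell as well as
from the keyboard; a minimal sketch, assuming sysrq is enabled, with the
output landing in the kernel log:

# echo 1 > /proc/sys/kernel/sysrq     (enable all sysrq functions)
# echo t > /proc/sysrq-trigger        (equivalent of sysrq-T)
# dmesg | tail -n 200                 (the task dump ends up here)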

2012-10-11 22:26:09

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/12/2012 12:14 AM, Andrew Morton wrote:
> Could you please do a sysrq-T a few times while it's spinning, to
> confirm that this trace is consistently the culprit?

For me, yes; shrink_slab is in most of the traces.

--
js
suse labs

2012-10-12 12:38:05

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/12/2012 12:08 AM, Jiri Slaby wrote:
> (It's an effective revert of "mm: vmscan: scale number of pages
> reclaimed by reclaim/compaction based on failures".)

Given kswapd had hours of runtime in ps/top output yesterday in the
morning and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.

Mel, you wrote me that it's unlikely to be the patch, but not impossible in
the end. Can you take a look, please? If you need some trace-cmd output or
anything, just let us know.

This is x86_64, 6G of RAM, no swap. FWIW EXT4, SLUB, COMPACTION all
enabled/used.

thanks,
--
js
suse labs

2012-10-12 13:57:33

by Mel Gorman

Subject: Re: kswapd0: excessive CPU usage

On Fri, Oct 12, 2012 at 02:37:58PM +0200, Jiri Slaby wrote:
> On 10/12/2012 12:08 AM, Jiri Slaby wrote:
> > (It's an effective revert of "mm: vmscan: scale number of pages
> > reclaimed by reclaim/compaction based on failures".)
>
> Given kswapd had hours of runtime in ps/top output yesterday in the
> morning and after the revert it's now 2 minutes in sum for the last 24h,
> I would say, it's gone.
>
> Mel, you wrote me that it's unlikely to be the patch, but not impossible in
> the end. Can you take a look, please? If you need some trace-cmd output or
> anything, just let us know.
>
> This is x86_64, 6G of RAM, no swap. FWIW EXT4, SLUB, COMPACTION all
> enabled/used.
>

Can you monitor the behaviour of this patch please? Please keep a particular
eye on kswapd activity and the amount of free memory. If free memory is
spiking it might indicate that kswapd is still too aggressive with the loss
of the __GFP_NO_KSWAPD flag. One way to tell is to record /proc/vmstat over
time and see what the pgsteal_* figures look like. If they are climbing
aggressively during what should be normal usage then it might show that
kswapd is still too aggressive when asked to reclaim for THP.
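
A rough way to do that recording, as a sketch (the interval and log path
are arbitrary choices, not anything prescribed here):

# while true; do date; grep -E 'pgsteal|pgscan|nr_free_pages' /proc/vmstat; sleep 60; done >> /tmp/vmstat.log

Diffing successive samples then shows how fast the pgsteal_kswapd_*
counters are climbing.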

Thanks very much.

---8<---
mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim

Jiri Slaby reported the following:

(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".)
Given kswapd had hours of runtime in ps/top output yesterday in the
morning and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.

The intention of the patch in question was to compensate for the loss of
lumpy reclaim. Part of the reason lumpy reclaim worked is because it
aggressively reclaimed pages and this patch was meant to be a
sane compromise.

When compaction fails, it gets deferred and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim. However, since
commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
and continues reclaiming, which was not taken into account when the patch
was developed.

As it is not taking deferred compaction into account in this path it scans
aggressively before falling out and making the compaction_deferred check in
compaction_ready. This patch avoids kswapd scaling pages for reclaim and
leaves the aggressive reclaim to the process attempting the THP
allocation.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..2b7edfa 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
 #ifdef CONFIG_COMPACTION
 /*
  * If compaction is deferred for sc->order then scale the number of pages
- * reclaimed based on the number of consecutive allocation failures
+ * reclaimed based on the number of consecutive allocation failures. This
+ * scaling only happens for direct reclaim as it is about to attempt
+ * compaction. If compaction fails, future allocations will be deferred
+ * and reclaim avoided. On the other hand, kswapd does not take compaction
+ * deferral into account so if it scaled, it could scan excessively even
+ * though allocations are temporarily not being attempted.
  */
 static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
 			struct lruvec *lruvec, struct scan_control *sc)
 {
 	struct zone *zone = lruvec_zone(lruvec);
 
-	if (zone->compact_order_failed <= sc->order)
+	if (zone->compact_order_failed <= sc->order &&
+	    !current_is_kswapd())
 		pages_for_compaction <<= zone->compact_defer_shift;
 	return pages_for_compaction;
 }
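
For a sense of the scan targets involved: with 4kB pages, an order-9 THP
allocation has a baseline reclaim target of 2UL << 9 = 1024 pages (4MB),
and compact_defer_shift can grow up to COMPACT_MAX_DEFER_SHIFT (6 in
kernels of this era), so the scaled target can reach 65536 pages, i.e.
256MB per reclaim pass. A small userspace sketch of the arithmetic (not
kernel code; the constants are assumptions copied from 3.6-era headers):

#include <stdio.h>

int main(void)
{
	const unsigned int order = 9;		/* HPAGE_PMD_ORDER on x86_64 */
	const unsigned int max_shift = 6;	/* COMPACT_MAX_DEFER_SHIFT */
	unsigned long pages = 2UL << order;	/* baseline: 1024 pages */
	unsigned int shift;

	for (shift = 0; shift <= max_shift; shift++)
		printf("defer_shift=%u -> target %5lu pages (%3lu MB)\n",
		       shift, pages << shift, (pages << shift) * 4 / 1024);
	return 0;
}

This is why kswapd, which never sees the deferral, can end up scanning so
much once compact_defer_shift has grown.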

2012-10-15 09:54:20

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 10/12/2012 03:57 PM, Mel Gorman wrote:
> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>
> Jiri Slaby reported the following:
>
> (It's an effective revert of "mm: vmscan: scale number of pages
> reclaimed by reclaim/compaction based on failures".)
> Given kswapd had hours of runtime in ps/top output yesterday in the
> morning and after the revert it's now 2 minutes in sum for the last 24h,
> I would say, it's gone.
>
> The intention of the patch in question was to compensate for the loss of
> lumpy reclaim. Part of the reason lumpy reclaim worked is because it
> aggressively reclaimed pages and this patch was meant to be a
> sane compromise.
>
> When compaction fails, it gets deferred and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
> and continues reclaiming, which was not taken into account when the patch
> was developed.
>
> As it is not taking deferred compaction into account in this path it scans
> aggressively before falling out and making the compaction_deferred check in
> compaction_ready. This patch avoids kswapd scaling pages for reclaim and
> leaves the aggressive reclaim to the process attempting the THP
> allocation.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2624edc..2b7edfa 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> #ifdef CONFIG_COMPACTION
> /*
> * If compaction is deferred for sc->order then scale the number of pages
> - * reclaimed based on the number of consecutive allocation failures
> + * reclaimed based on the number of consecutive allocation failures. This
> + * scaling only happens for direct reclaim as it is about to attempt
> + * compaction. If compaction fails, future allocations will be deferred
> + * and reclaim avoided. On the other hand, kswapd does not take compaction
> + * deferral into account so if it scaled, it could scan excessively even
> + * though allocations are temporarily not being attempted.
> */
> static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> struct lruvec *lruvec, struct scan_control *sc)
> {
> struct zone *zone = lruvec_zone(lruvec);
>
> - if (zone->compact_order_failed <= sc->order)
> + if (zone->compact_order_failed <= sc->order &&
> + !current_is_kswapd())
> pages_for_compaction <<= zone->compact_defer_shift;
> return pages_for_compaction;
> }

Yes, applying this instead of the revert fixes the issue as well.

thanks,
--
js
suse labs

2012-10-15 11:09:42

by Mel Gorman

Subject: Re: kswapd0: excessive CPU usage

On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> On 10/12/2012 03:57 PM, Mel Gorman wrote:
> > mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> >
> > Jiri Slaby reported the following:
> >
> > (It's an effective revert of "mm: vmscan: scale number of pages
> > reclaimed by reclaim/compaction based on failures".)
> > Given kswapd had hours of runtime in ps/top output yesterday in the
> > morning and after the revert it's now 2 minutes in sum for the last 24h,
> > I would say, it's gone.
> >
> > The intention of the patch in question was to compensate for the loss of
> > lumpy reclaim. Part of the reason lumpy reclaim worked is because it
> > aggressively reclaimed pages and this patch was meant to be a
> > sane compromise.
> >
> > When compaction fails, it gets deferred and both compaction and
> > reclaim/compaction are deferred to avoid excessive reclaim. However, since
> > commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
> > and continues reclaiming, which was not taken into account when the patch
> > was developed.
> >
> > As it is not taking deferred compaction into account in this path it scans
> > aggressively before falling out and making the compaction_deferred check in
> > compaction_ready. This patch avoids kswapd scaling pages for reclaim and
> > leaves the aggressive reclaim to the process attempting the THP
> > allocation.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 10 ++++++++--
> > 1 file changed, 8 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2624edc..2b7edfa 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> > #ifdef CONFIG_COMPACTION
> > /*
> > * If compaction is deferred for sc->order then scale the number of pages
> > - * reclaimed based on the number of consecutive allocation failures
> > + * reclaimed based on the number of consecutive allocation failures. This
> > + * scaling only happens for direct reclaim as it is about to attempt
> > + * compaction. If compaction fails, future allocations will be deferred
> > + * and reclaim avoided. On the other hand, kswapd does not take compaction
> > + * deferral into account so if it scaled, it could scan excessively even
> > + * though allocations are temporarily not being attempted.
> > */
> > static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > struct lruvec *lruvec, struct scan_control *sc)
> > {
> > struct zone *zone = lruvec_zone(lruvec);
> >
> > - if (zone->compact_order_failed <= sc->order)
> > + if (zone->compact_order_failed <= sc->order &&
> > + !current_is_kswapd())
> > pages_for_compaction <<= zone->compact_defer_shift;
> > return pages_for_compaction;
> > }
>
> Yes, applying this instead of the revert fixes the issue as well.
>

Thanks Jiri.

--
Mel Gorman
SUSE Labs

2012-10-29 11:00:24

by Thorsten Leemhuis

Subject: Re: kswapd0: excessive CPU usage

Hi!

On 15.10.2012 13:09, Mel Gorman wrote:
> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>> Jiri Slaby reported the following:
> [...]
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 2624edc..2b7edfa 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>>> #ifdef CONFIG_COMPACTION
>>> /*
>>> * If compaction is deferred for sc->order then scale the number of pages
>>> - * reclaimed based on the number of consecutive allocation failures
>>> + * reclaimed based on the number of consecutive allocation failures. This
>>> + * scaling only happens for direct reclaim as it is about to attempt
>>> + * compaction. If compaction fails, future allocations will be deferred
>>> + * and reclaim avoided. On the other hand, kswapd does not take compaction
>>> + * deferral into account so if it scaled, it could scan excessively even
>>> + * though allocations are temporarily not being attempted.
>>> */
>>> static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
>>> struct lruvec *lruvec, struct scan_control *sc)
>>> {
>>> struct zone *zone = lruvec_zone(lruvec);
>>>
>>> - if (zone->compact_order_failed <= sc->order)
>>> + if (zone->compact_order_failed <= sc->order &&
>>> + !current_is_kswapd())
>>> pages_for_compaction <<= zone->compact_defer_shift;
>>> return pages_for_compaction;
>>> }
>> Yes, applying this instead of the revert fixes the issue as well.

Just wondering, is there a reason why this patch wasn't applied to
mainline? Did it simply fall through the cracks? Or am I missing something?

I'm asking because I think I still see the issue on
3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are hitting
it, too:
https://bugzilla.redhat.com/show_bug.cgi?id=866988

Or are we seeing something different which just looks similar? I can
test the patch if it needs further testing, but from the discussion I
got the impression that everything is clear and the patch ready for merging.

CU
knurd

2012-10-30 19:18:50

by Mel Gorman

Subject: Re: kswapd0: excessive CPU usage

On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
> Hi!
>
> On 15.10.2012 13:09, Mel Gorman wrote:
> >On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> >>On 10/12/2012 03:57 PM, Mel Gorman wrote:
> >>>mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> >>>Jiri Slaby reported the following:
> > [...]
> >>>diff --git a/mm/vmscan.c b/mm/vmscan.c
> >>>index 2624edc..2b7edfa 100644
> >>>--- a/mm/vmscan.c
> >>>+++ b/mm/vmscan.c
> >>>@@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> >>> #ifdef CONFIG_COMPACTION
> >>> /*
> >>> * If compaction is deferred for sc->order then scale the number of pages
> >>>- * reclaimed based on the number of consecutive allocation failures
> >>>+ * reclaimed based on the number of consecutive allocation failures. This
> >>>+ * scaling only happens for direct reclaim as it is about to attempt
> >>>+ * compaction. If compaction fails, future allocations will be deferred
> >>>+ * and reclaim avoided. On the other hand, kswapd does not take compaction
> >>>+ * deferral into account so if it scaled, it could scan excessively even
> >>>+ * though allocations are temporarily not being attempted.
> >>> */
> >>> static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> >>> struct lruvec *lruvec, struct scan_control *sc)
> >>> {
> >>> struct zone *zone = lruvec_zone(lruvec);
> >>>
> >>>- if (zone->compact_order_failed <= sc->order)
> >>>+ if (zone->compact_order_failed <= sc->order &&
> >>>+ !current_is_kswapd())
> >>> pages_for_compaction <<= zone->compact_defer_shift;
> >>> return pages_for_compaction;
> >>> }
> >>Yes, applying this instead of the revert fixes the issue as well.
>
> Just wondering, is there a reason why this patch wasn't applied to
> mainline? Did it simply fall through the cracks? Or am I missing
> something?
>

It's because a problem was reported related to the patch (off-list,
whoops). I'm waiting to hear if a second patch fixes the problem or not.

> I'm asking because I think I still see the issue on
> 3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
> hitting it, too:
> https://bugzilla.redhat.com/show_bug.cgi?id=866988
>

I like the steps to reproduce. Is step 3 profit?

> Or are we seeing something different which just looks similar? I can
> test the patch if it needs further testing, but from the discussion
> I got the impression that everything is clear and the patch ready
> for merging.

It could be the same issue. Can you test with the "mm: vmscan: scale
number of pages reclaimed by reclaim/compaction only in direct reclaim"
patch and the following on top please?

Thanks.

---8<---
mm: page_alloc: Do not wake kswapd if the request is for THP but deferred

Since commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd gets woken
for every THP request in the slow path. If compaction has been deferred
the waker will not compact or enter direct reclaim on its own behalf
but kswapd is still woken to reclaim free pages that no one may consume.
If compaction was deferred because pages and slab were not reclaimable,
then kswapd is just consuming cycles for no gain.

This patch avoids waking kswapd if compaction has been deferred.
It'll still wake when compaction is running, to reduce the latency of
THP allocations.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 21 +++++++++++++++++++--
1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..e72674c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2378,6 +2378,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
+/* Returns true if the allocation is likely for THP */
+static bool is_thp_alloc(gfp_t gfp_mask, unsigned int order)
+{
+	if (order == pageblock_order &&
+	    (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+		return true;
+	return false;
+}
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2416,7 +2425,15 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto nopage;
 
 restart:
-	wake_all_kswapd(order, zonelist, high_zoneidx,
+	/*
+	 * kswapd is woken except when this is a THP request and compaction
+	 * is deferred. If we are backing off reclaim/compaction then kswapd
+	 * should not be awake aggressively reclaiming with no consumers of
+	 * the freed pages
+	 */
+	if (!(is_thp_alloc(gfp_mask, order) &&
+	    compaction_deferred(preferred_zone, order)))
+		wake_all_kswapd(order, zonelist, high_zoneidx,
 				zone_idx(preferred_zone));
 
 	/*
@@ -2494,7 +2511,7 @@ rebalance:
 	 * system then fail the allocation instead of entering direct reclaim.
 	 */
 	if ((deferred_compaction || contended_compaction) &&
-		(gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+		is_thp_alloc(gfp_mask, order))
 		goto nopage;
 
 	/* Try direct reclaim and then allocating */
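
For anyone wanting to check whether this is the scenario being hit, one
rough signal (a sketch, nothing authoritative) is to watch the compaction
and THP counters that already appear in the vmstat dumps in this thread:

# grep -E '^(compact|thp)' /proc/vmstat

compact_stall/compact_fail climbing together with thp_fault_fallback while
kswapd spins would match the deferred-compaction case this patch targets.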

2012-10-31 11:25:24

by Thorsten Leemhuis

Subject: Re: kswapd0: excessive CPU usage

On 30.10.2012 20:18, Mel Gorman wrote:
> On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
>> On 15.10.2012 13:09, Mel Gorman wrote:
>>> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>>>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>>>> Jiri Slaby reported the following:
> [...]
>>>> Yes, applying this instead of the revert fixes the issue as well.
>> Just wondering, is there a reason why this patch wasn't applied to
>> mainline? Did it simply fall through the cracks? Or am I missing
>> something?
> It's because a problem was reported related to the patch (off-list,
> whoops). I'm waiting to hear if a second patch fixes the problem or not.

Anything in particular I should look out for while testing?

>> I'm asking because I think I still see the issue on
>> 3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
>> hitting it, too:
>> https://bugzilla.redhat.com/show_bug.cgi?id=866988
> I like the steps to reproduce.

One of those cases where the bugzilla bug template was not very helpful
or where it was not used as intended (you decide) :-)

> Is step 3 profit?

Yes, but psst, don't tell anyone; step 4 (world domination! for real!)
is also hidden to keep that part of the big plan a secret for now ;-)

>> Or are we seeing something different which just looks similar? I can
>> test the patch if it needs further testing, but from the discussion
>> I got the impression that everything is clear and the patch ready
>> for merging.
> It could be the same issue. Can you test with the "mm: vmscan: scale
> number of pages reclaimed by reclaim/compaction only in direct reclaim"
> patch and the following on top please?

Built a vanilla mainline kernel with those two patches and installed it
on the machine where I was seeing problems with high kswapd0 load on
3.7-rc3. Ran it for an hour yesterday and a few hours today; seems the
patches fix the issue for me, as kswapd behaves:

$ LC_ALL=C ps -aux | grep 'kswapd'
root 62 0.0 0.0 0 0 ? S Oct30 0:05 [kswapd0]

So everything is looking fine again so far, thanks to the two patches --
hopefully it stays that way even after hitting "send" in my mailer in a
few seconds.

CU
knurd

2012-10-31 15:04:47

by Mel Gorman

Subject: Re: kswapd0: excessive CPU usage

On Wed, Oct 31, 2012 at 12:25:13PM +0100, Thorsten Leemhuis wrote:
> On 30.10.2012 20:18, Mel Gorman wrote:
> >On Mon, Oct 29, 2012 at 11:52:03AM +0100, Thorsten Leemhuis wrote:
> >>On 15.10.2012 13:09, Mel Gorman wrote:
> >>>On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
> >>>>On 10/12/2012 03:57 PM, Mel Gorman wrote:
> >>>>>mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
> >>>>>Jiri Slaby reported the following:
> >[...]
> >>>>Yes, applying this instead of the revert fixes the issue as well.
> >>Just wondering, is there a reason why this patch wasn't applied to
> >>mainline? Did it simply fall through the cracks? Or am I missing
> >>something?
> >It's because a problem was reported related to the patch (off-list,
> >whoops). I'm waiting to hear if a second patch fixes the problem or not.
>
> Anything in particular I should look out for while testing?
>

Excessive reclaim, high CPU usage by kswapd, processes getting stuck in
isolate_migratepages or isolate_freepages.

> >>I'm asking because I think I still see the issue on
> >>3.7-rc2-git-checkout-from-friday. Seems Fedora rawhide users are
> >>hitting it, too:
> >>https://bugzilla.redhat.com/show_bug.cgi?id=866988
> >I like the steps to reproduce.
>
> One of those cases where the bugzilla bug template was not very
> helpful or where it was not used as intended (you decide) :-)
>

It wins at entertainment value if nothing else :)

> >Is step 3 profit?
>
> Yes, but psst, don't tell anyone; step 4 (world domination! for
> real!) is also hidden to keep that part of the big plan a secret for
> now ;-)
>

No doubt it's the default private comment #1 !

> >>Or are we seeing something different which just looks similar? I can
> >>test the patch if it needs further testing, but from the discussion
> >>I got the impression that everything is clear and the patch ready
> >>for merging.
> >It could be the same issue. Can you test with the "mm: vmscan: scale
> >number of pages reclaimed by reclaim/compaction only in direct reclaim"
> >patch and the following on top please?
>
> Built a vanilla mainline kernel with those two patches and installed
> it on the machine where I was seeing problems with high kswapd0 load on
> 3.7-rc3. Ran it for an hour yesterday and a few hours today; seems the
> patches fix the issue for me, as kswapd behaves:
>
> $ LC_ALL=C ps -aux | grep 'kswapd'
> root 62 0.0 0.0 0 0 ? S Oct30 0:05 [kswapd0]
>
> So everything is looking fine again so far, thanks to the two patches
> -- hopefully it stays that way even after hitting "send" in my
> mailer in a few seconds.
>

Ok, great. Keep an eye on it please. If Jiri Slaby reports similar
success then I'll collapse the two patches together and resend to
Andrew.

Thanks.

--
Mel Gorman
SUSE Labs

2012-11-02 10:44:17

by Zdenek Kabelac

Subject: Re: kswapd0: excessive CPU usage

On 15.10.2012 13:09, Mel Gorman wrote:
> On Mon, Oct 15, 2012 at 11:54:13AM +0200, Jiri Slaby wrote:
>> On 10/12/2012 03:57 PM, Mel Gorman wrote:
>>> mm: vmscan: scale number of pages reclaimed by reclaim/compaction only in direct reclaim
>>>
>>> Jiri Slaby reported the following:
>>>
>>> (It's an effective revert of "mm: vmscan: scale number of pages
>>> reclaimed by reclaim/compaction based on failures".)
>>> Given kswapd had hours of runtime in ps/top output yesterday in the
>>> morning and after the revert it's now 2 minutes in sum for the last 24h,
>>> I would say, it's gone.
>>>
>>> The intention of the patch in question was to compensate for the loss of
>>> lumpy reclaim. Part of the reason lumpy reclaim worked is because it
>>> aggressively reclaimed pages and this patch was meant to be a
>>> sane compromise.
>>>
>>> When compaction fails, it gets deferred and both compaction and
>>> reclaim/compaction are deferred to avoid excessive reclaim. However, since
>>> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
>>> and continues reclaiming, which was not taken into account when the patch
>>> was developed.
>>>
>>> As it is not taking deferred compaction into account in this path it scans
>>> aggressively before falling out and making the compaction_deferred check in
>>> compaction_ready. This patch avoids kswapd scaling pages for reclaim and
>>> leaves the aggressive reclaim to the process attempting the THP
>>> allocation.
>>>
>>> Signed-off-by: Mel Gorman <[email protected]>
>>> ---
>>> mm/vmscan.c | 10 ++++++++--
>>> 1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 2624edc..2b7edfa 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1763,14 +1763,20 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>>> #ifdef CONFIG_COMPACTION
>>> /*
>>> * If compaction is deferred for sc->order then scale the number of pages
>>> - * reclaimed based on the number of consecutive allocation failures
>>> + * reclaimed based on the number of consecutive allocation failures. This
>>> + * scaling only happens for direct reclaim as it is about to attempt
>>> + * compaction. If compaction fails, future allocations will be deferred
>>> + * and reclaim avoided. On the other hand, kswapd does not take compaction
>>> + * deferral into account so if it scaled, it could scan excessively even
>>> + * though allocations are temporarily not being attempted.
>>> */
>>> static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
>>> struct lruvec *lruvec, struct scan_control *sc)
>>> {
>>> struct zone *zone = lruvec_zone(lruvec);
>>>
>>> - if (zone->compact_order_failed <= sc->order)
>>> + if (zone->compact_order_failed <= sc->order &&
>>> + !current_is_kswapd())
>>> pages_for_compaction <<= zone->compact_defer_shift;
>>> return pages_for_compaction;
>>> }
>>
>> Yes, applying this instead of the revert fixes the issue as well.
>>
>


I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive CPU
usage - mainly after suspend/resume

Here is just simple kswapd backtrace from running kernel:

kswapd0 R running task 0 30 2 0x00000000
ffff8801331ddae8 0000000000000082 ffff880135b8a340 0000000000000008
ffff880135b8a340 ffff8801331ddfd8 ffff8801331ddfd8 ffff8801331ddfd8
ffff880071db8000 ffff880135b8a340 0000000000000286 ffff8801331dc000
Call Trace:
[<ffffffff81555cd2>] preempt_schedule+0x42/0x60
[<ffffffff81557b75>] _raw_spin_unlock+0x55/0x60
[<ffffffff811929d1>] put_super+0x31/0x40
[<ffffffff81192aa2>] drop_super+0x22/0x30
[<ffffffff81193be9>] prune_super+0x149/0x1b0
[<ffffffff81141e2a>] shrink_slab+0xba/0x510
[<ffffffff81185baa>] ? mem_cgroup_iter+0x17a/0x2e0
[<ffffffff81185afa>] ? mem_cgroup_iter+0xca/0x2e0
[<ffffffff811450f9>] balance_pgdat+0x629/0x7f0
[<ffffffff81145434>] kswapd+0x174/0x620
[<ffffffff8106fd20>] ? __init_waitqueue_head+0x60/0x60
[<ffffffff811452c0>] ? balance_pgdat+0x7f0/0x7f0
[<ffffffff8106f50b>] kthread+0xdb/0xe0
[<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8155fb1c>] ret_from_fork+0x7c/0xb0
[<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140


Zdenek

2012-11-02 10:53:42

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>> Yes, applying this instead of the revert fixes the issue as well.
>
> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
> CPU usage - mainly after suspend/resume
>
> Here is just simple kswapd backtrace from running kernel:

Yup, this is what we were seeing with only the former patch applied, too.
Try applying the other one as well:
https://patchwork.kernel.org/patch/1673231/

For me, I would say it is fixed by the two patches now. I won't be able
to report later, since I'm leaving for a conference tomorrow.

> kswapd0 R running task 0 30 2 0x00000000
...
> [<ffffffff81141e2a>] shrink_slab+0xba/0x510

thanks,
--
js
suse labs

2012-11-02 19:45:14

by Jiri Slaby

Subject: Re: kswapd0: excessive CPU usage

On 11/02/2012 11:53 AM, Jiri Slaby wrote:
> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>> Yes, applying this instead of the revert fixes the issue as well.
>>
>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>> CPU usage - mainly after suspend/resume
>>
>> Here is just simple kswapd backtrace from running kernel:
>
> Yup, this is what we were seeing with the former patch only too. Try to
> apply the other one too:
> https://patchwork.kernel.org/patch/1673231/
>
> For me I would say, it is fixed by the two patches now. I won't be able
> to report later, since I'm leaving to a conference tomorrow.

Damn it. It recurred just now, with both patches applied, after I started
a Java program which consumed some more memory. Though there are still
2 gigs free, kswapd is spinning:
[<ffffffff810b00da>] __cond_resched+0x2a/0x40
[<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
[<ffffffff8113478d>] kswapd+0x66d/0xb60
[<ffffffff810a25d0>] kthread+0xc0/0xd0
[<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
[<ffffffffffffffff>] 0xffffffffffffffff

--
js
suse labs

2012-11-04 11:26:48

by Zdenek Kabelac

Subject: Re: kswapd0: excessive CPU usage

On 2.11.2012 20:45, Jiri Slaby wrote:
> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>
>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>> CPU usage - mainly after suspend/resume
>>>
>>> Here is just simple kswapd backtrace from running kernel:
>>
>> Yup, this is what we were seeing with the former patch only too. Try to
>> apply the other one too:
>> https://patchwork.kernel.org/patch/1673231/
>>
>> For me I would say, it is fixed by the two patches now. I won't be able
>> to report later, since I'm leaving to a conference tomorrow.
>
> Damn it. It recurred right now, with both patches applied. After I
> started a java program which consumed some more memory. Though there are
> still 2 gigs free, kswap is spinning:
> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> [<ffffffff8113478d>] kswapd+0x66d/0xb60
> [<ffffffff810a25d0>] kthread+0xc0/0xd0
> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff
>

Yep - I wanted to report again myself and then noticed your reply.

Yes - I now have both patches installed as well - and I still observe kswapd
eating my CPU. It seems (at least for me) that a prior suspend and resume is
a way to trigger it more frequently.

However there is a change in behaviour - while before kswapd was running
almost indefinitely, now the CPU spikes are in the range of minutes
(i.e. with an uptime of ~2 days, kswapd has over 32 minutes of CPU time).
My machine has 4GB, and no swap (disabled).

firefox (22 mins), thunderbird (3 mins) and pidgin (0.5 min) are the 3 most
memory- and CPU-hungry apps at the moment.

Zdenek

2012-11-04 16:34:26

by Rik van Riel

Subject: Re: kswapd0: excessive CPU usage

On 10/30/2012 03:18 PM, Mel Gorman wrote:

> restart:
> - wake_all_kswapd(order, zonelist, high_zoneidx,
> + /*
> + * kswapd is woken except when this is a THP request and compaction
> + * is deferred. If we are backing off reclaim/compaction then kswapd
> + * should not be awake aggressively reclaiming with no consumers of
> + * the freed pages
> + */
> + if (!(is_thp_alloc(gfp_mask, order) &&
> + compaction_deferred(preferred_zone, order)))
> + wake_all_kswapd(order, zonelist, high_zoneidx,
> zone_idx(preferred_zone));

What is special about thp allocations here?

Surely other large allocations that keep failing
should get the same treatment, of not waking up
kswapd if compaction is deferred?

2012-11-05 14:24:55

by Mel Gorman

Subject: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

Jiri Slaby reported the following:

(It's an effective revert of "mm: vmscan: scale number of pages
reclaimed by reclaim/compaction based on failures".) Given kswapd
had hours of runtime in ps/top output yesterday in the morning
and after the revert it's now 2 minutes in sum for the last 24h,
I would say, it's gone.

The intention of the patch in question was to compensate for the loss
of lumpy reclaim. Part of the reason lumpy reclaim worked is because
it aggressively reclaimed pages and this patch was meant to be a sane
compromise.

When compaction fails, it gets deferred and both compaction and
reclaim/compaction are deferred to avoid excessive reclaim. However, since
commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
and continues reclaiming, which was not taken into account when the patch
was developed.

Attempts to address the problem ended up just changing the shape of the
problem instead of fixing it. The release window is getting closer, and while a
THP allocation failing is not a major problem, kswapd chewing up a lot of
CPU is. This patch reverts "mm: vmscan: scale number of pages reclaimed
by reclaim/compaction based on failures" and will be revisited in the future.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 25 -------------------------
1 file changed, 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2624edc..e081ee8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
 	return false;
 }
 
-#ifdef CONFIG_COMPACTION
-/*
- * If compaction is deferred for sc->order then scale the number of pages
- * reclaimed based on the number of consecutive allocation failures
- */
-static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
-			struct lruvec *lruvec, struct scan_control *sc)
-{
-	struct zone *zone = lruvec_zone(lruvec);
-
-	if (zone->compact_order_failed <= sc->order)
-		pages_for_compaction <<= zone->compact_defer_shift;
-	return pages_for_compaction;
-}
-#else
-static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
-			struct lruvec *lruvec, struct scan_control *sc)
-{
-	return pages_for_compaction;
-}
-#endif
-
 /*
  * Reclaim/compaction is used for high-order allocation requests. It reclaims
  * order-0 pages before compacting the zone. should_continue_reclaim() returns
@@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
 	 * inactive lists are large enough, continue reclaiming
 	 */
 	pages_for_compaction = (2UL << sc->order);
-
-	pages_for_compaction = scale_for_compaction(pages_for_compaction,
-							lruvec, sc);
 	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
 	if (nr_swap_pages > 0)
 		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);

2012-11-06 10:25:15

by Johannes Hirte

Subject: Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

On Mon, 5 Nov 2012 14:24:49 +0000,
Mel Gorman <[email protected]> wrote:

> Jiri Slaby reported the following:
>
> (It's an effective revert of "mm: vmscan: scale number of
> pages reclaimed by reclaim/compaction based on failures".) Given
> kswapd had hours of runtime in ps/top output yesterday in the morning
> and after the revert it's now 2 minutes in sum for the last
> 24h, I would say, it's gone.
>
> The intention of the patch in question was to compensate for the loss
> of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> it aggressively reclaimed pages and this patch was meant to be a sane
> compromise.
>
> When compaction fails, it gets deferred and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each
> time and continues reclaiming, which was not taken into account when
> the patch was developed.
>
> Attempts to address the problem ended up just changing the shape of
> the problem instead of fixing it. The release window gets closer and
> while a THP allocation failing is not a major problem, kswapd chewing
> up a lot of CPU is. This patch reverts "mm: vmscan: scale number of
> pages reclaimed by reclaim/compaction based on failures" and will be
> revisited in the future.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/vmscan.c | 25 -------------------------
> 1 file changed, 25 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 2624edc..e081ee8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
>  	return false;
>  }
>  
> -#ifdef CONFIG_COMPACTION
> -/*
> - * If compaction is deferred for sc->order then scale the number of pages
> - * reclaimed based on the number of consecutive allocation failures
> - */
> -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> -			struct lruvec *lruvec, struct scan_control *sc)
> -{
> -	struct zone *zone = lruvec_zone(lruvec);
> -
> -	if (zone->compact_order_failed <= sc->order)
> -		pages_for_compaction <<= zone->compact_defer_shift;
> -	return pages_for_compaction;
> -}
> -#else
> -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> -			struct lruvec *lruvec, struct scan_control *sc)
> -{
> -	return pages_for_compaction;
> -}
> -#endif
> -
>  /*
>   * Reclaim/compaction is used for high-order allocation requests. It reclaims
>   * order-0 pages before compacting the zone. should_continue_reclaim() returns
> @@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
>  	 * inactive lists are large enough, continue reclaiming
>  	 */
>  	pages_for_compaction = (2UL << sc->order);
> -
> -	pages_for_compaction = scale_for_compaction(pages_for_compaction,
> -							lruvec, sc);
>  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
>  	if (nr_swap_pages > 0)
>  		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> --

Even with this patch I see kswapd0 very often in top, much more than
with kernel 3.6.

2012-11-09 04:22:15

by Seth Jennings

Subject: Re: kswapd0: excessive CPU usage

On 11/02/2012 02:45 PM, Jiri Slaby wrote:
> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>
>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>> CPU usage - mainly after suspend/resume
>>>
>>> Here is just simple kswapd backtrace from running kernel:
>>
>> Yup, this is what we were seeing with the former patch only too. Try to
>> apply the other one too:
>> https://patchwork.kernel.org/patch/1673231/
>>
>> For me I would say, it is fixed by the two patches now. I won't be able
>> to report later, since I'm leaving to a conference tomorrow.
>
> Damn it. It recurred right now, with both patches applied. After I
> started a java program which consumed some more memory. Though there are
> still 2 gigs free, kswap is spinning:
> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> [<ffffffff8113478d>] kswapd+0x66d/0xb60
> [<ffffffff810a25d0>] kthread+0xc0/0xd0
> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> [<ffffffffffffffff>] 0xffffffffffffffff

I'm also hitting this issue in v3.7-rc4. It appears that the last
release not affected by this issue was v3.3. Bisecting the changes
included for v3.4-rc1 showed that this commit introduced the issue:

fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
Author: Rik van Riel <[email protected]>
Date: Wed Mar 21 16:33:51 2012 -0700

vmscan: reclaim at order 0 when compaction is enabled
...

This is plausible since the issue seems to be in the kswapd + compaction
realm. I've yet to figure out exactly what about this commit results in
kswapd spinning.

I would be interested if someone can confirm this finding.

--
Seth
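
For anyone retracing this, the bisect boils down to something like the
following sketch (the hard part is the reproduction step, which is only a
placeholder comment here):

$ git bisect start v3.4-rc1 v3.3
$ # build + boot the candidate kernel, run the workload, then check e.g.:
$ ps -o cputime= -C kswapd0
$ git bisect good    # or "git bisect bad", depending on what kswapd did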

2012-11-09 08:08:05

by Zdenek Kabelac

Subject: Re: kswapd0: excessive CPU usage

On 9.11.2012 05:22, Seth Jennings wrote:
> On 11/02/2012 02:45 PM, Jiri Slaby wrote:
>> On 11/02/2012 11:53 AM, Jiri Slaby wrote:
>>> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
>>>>>> Yes, applying this instead of the revert fixes the issue as well.
>>>>
>>>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
>>>> CPU usage - mainly after suspend/resume
>>>>
>>>> Here is just simple kswapd backtrace from running kernel:
>>>
>>> Yup, this is what we were seeing with the former patch only too. Try to
>>> apply the other one too:
>>> https://patchwork.kernel.org/patch/1673231/
>>>
>>> For me I would say, it is fixed by the two patches now. I won't be able
>>> to report later, since I'm leaving to a conference tomorrow.
>>
>> Damn it. It recurred right now, with both patches applied. After I
>> started a java program which consumed some more memory. Though there are
>> still 2 gigs free, kswap is spinning:
>> [<ffffffff810b00da>] __cond_resched+0x2a/0x40
>> [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
>> [<ffffffff8113478d>] kswapd+0x66d/0xb60
>> [<ffffffff810a25d0>] kthread+0xc0/0xd0
>> [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
>> [<ffffffffffffffff>] 0xffffffffffffffff
>
> I'm also hitting this issue in v3.7-rc4. It appears that the last
> release not affected by this issue was v3.3. Bisecting the changes
> included for v3.4-rc1 showed that this commit introduced the issue:
>
> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> Author: Rik van Riel <[email protected]>
> Date: Wed Mar 21 16:33:51 2012 -0700
>
> vmscan: reclaim at order 0 when compaction is enabled
> ...
>
> This is plausible since the issue seems to be in the kswapd + compaction
> realm. I've yet to figure out exactly what about this commit results in
> kswapd spinning.
>
> I would be interested if someone can confirm this finding.
>
> --
> Seth
>


On my system (3.7-rc4) the problem seems to be effectively solved by the
revert patch: https://lkml.org/lkml/2012/11/5/308

i.e. in 2 days of uptime kswapd0 has eaten 6 seconds, which is IMHO OK - I'm
not observing any busy loops on CPU with kswapd0.


Zdenek

2012-11-09 08:36:47

by Mel Gorman

Subject: Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

On Tue, Nov 06, 2012 at 11:15:54AM +0100, Johannes Hirte wrote:
> On Mon, 5 Nov 2012 14:24:49 +0000,
> Mel Gorman <[email protected]> wrote:
>
> > Jiri Slaby reported the following:
> >
> > (It's an effective revert of "mm: vmscan: scale number of
> > pages reclaimed by reclaim/compaction based on failures".) Given
> > kswapd had hours of runtime in ps/top output yesterday in the morning
> > and after the revert it's now 2 minutes in sum for the last
> > 24h, I would say, it's gone.
> >
> > The intention of the patch in question was to compensate for the loss
> > of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> > it aggressively reclaimed pages and this patch was meant to be a sane
> > compromise.
> >
> > When compaction fails, it gets deferred and both compaction and
> > reclaim/compaction are deferred to avoid excessive reclaim. However, since
> > commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each
> > time and continues reclaiming, which was not taken into account when
> > the patch was developed.
> >
> > Attempts to address the problem ended up just changing the shape of
> > the problem instead of fixing it. The release window gets closer and
> > while a THP allocation failing is not a major problem, kswapd chewing
> > up a lot of CPU is. This patch reverts "mm: vmscan: scale number of
> > pages reclaimed by reclaim/compaction based on failures" and will be
> > revisited in the future.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/vmscan.c | 25 -------------------------
> > 1 file changed, 25 deletions(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 2624edc..e081ee8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> >  	return false;
> >  }
> >  
> > -#ifdef CONFIG_COMPACTION
> > -/*
> > - * If compaction is deferred for sc->order then scale the number of pages
> > - * reclaimed based on the number of consecutive allocation failures
> > - */
> > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control *sc)
> > -{
> > -	struct zone *zone = lruvec_zone(lruvec);
> > -
> > -	if (zone->compact_order_failed <= sc->order)
> > -		pages_for_compaction <<= zone->compact_defer_shift;
> > -	return pages_for_compaction;
> > -}
> > -#else
> > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > -			struct lruvec *lruvec, struct scan_control *sc)
> > -{
> > -	return pages_for_compaction;
> > -}
> > -#endif
> > -
> >  /*
> >   * Reclaim/compaction is used for high-order allocation requests. It reclaims
> >   * order-0 pages before compacting the zone. should_continue_reclaim() returns
> > @@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> >  	 * inactive lists are large enough, continue reclaiming
> >  	 */
> >  	pages_for_compaction = (2UL << sc->order);
> > -
> > -	pages_for_compaction = scale_for_compaction(pages_for_compaction,
> > -							lruvec, sc);
> >  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> >  	if (nr_swap_pages > 0)
> >  		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > --
>
> Even with this patch I see kswapd0 very often in top, much more than
> with kernel 3.6.

How severe is the CPU usage? The higher usage can be explained by "mm:
remove __GFP_NO_KSWAPD", which allows kswapd to compact memory to reduce
the amount of time processes spend in compaction, but will result in the
CPU cost being incurred by kswapd.

Is it really high, as the bug was reporting, with high usage over long
periods of time, or do you just see it using 2-6% of CPU for short
periods?
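
Something like the following would help put a number on it (a sketch;
pidstat comes from the sysstat package, and the 5-second/12-sample choice
is arbitrary):

# pidstat -p $(pgrep -x kswapd0) 5 12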

Thanks.

--
Mel Gorman
SUSE Labs

2012-11-09 08:40:55

by Mel Gorman

Subject: Re: kswapd0: excessive CPU usage

On Thu, Nov 08, 2012 at 10:22:05PM -0600, Seth Jennings wrote:
> On 11/02/2012 02:45 PM, Jiri Slaby wrote:
> > On 11/02/2012 11:53 AM, Jiri Slaby wrote:
> >> On 11/02/2012 11:44 AM, Zdenek Kabelac wrote:
> >>>>> Yes, applying this instead of the revert fixes the issue as well.
> >>>
> >>> I've applied this patch on 3.7.0-rc3 kernel - and I still see excessive
> >>> CPU usage - mainly after suspend/resume
> >>>
> >>> Here is just simple kswapd backtrace from running kernel:
> >>
> >> Yup, this is what we were seeing with the former patch only too. Try to
> >> apply the other one too:
> >> https://patchwork.kernel.org/patch/1673231/
> >>
> >> For me I would say, it is fixed by the two patches now. I won't be able
> >> to report later, since I'm leaving to a conference tomorrow.
> >
> > Damn it. It recurred just now, with both patches applied, after I
> > started a Java program which consumed some more memory. Though there are
> > still 2 gigs free, kswapd is spinning:
> > [<ffffffff810b00da>] __cond_resched+0x2a/0x40
> > [<ffffffff811318a0>] shrink_slab+0x1c0/0x2d0
> > [<ffffffff8113478d>] kswapd+0x66d/0xb60
> > [<ffffffff810a25d0>] kthread+0xc0/0xd0
> > [<ffffffff816aa29c>] ret_from_fork+0x7c/0xb0
> > [<ffffffffffffffff>] 0xffffffffffffffff
>
> I'm also hitting this issue in v3.7-rc4. It appears that the last
> release not affected by this issue was v3.3. Bisecting the changes
> included for v3.4-rc1 showed that this commit introduced the issue:
>
> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> Author: Rik van Riel <[email protected]>
> Date: Wed Mar 21 16:33:51 2012 -0700
>
> vmscan: reclaim at order 0 when compaction is enabled
> ...
>
> This is plausible since the issue seems to be in the kswapd + compaction
> realm. I've yet to figure out exactly what about this commit results in
> kswapd spinning.
>
> I would be interested if someone can confirm this finding.
>

I cannot confirm the actual finding as I don't see the same sort of
problems. However, this does make sense and was more or less expected.
Reclaiming at order-0 would have forced compaction to be used more instead
of lumpy reclaim (less CPU usage but greater system disruption that is
harder to measure). Shortly after, lumpy reclaim was removed entirely, so
now larger amounts of CPU time are spent compacting memory that previously
would have been reclaimed.

--
Mel Gorman
SUSE Labs

2012-11-09 09:06:43

by Mel Gorman

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On Fri, Nov 09, 2012 at 09:07:45AM +0100, Zdenek Kabelac wrote:
> >fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
> >commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
> >Author: Rik van Riel <[email protected]>
> >Date: Wed Mar 21 16:33:51 2012 -0700
> >
> > vmscan: reclaim at order 0 when compaction is enabled
> >...
> >
> >This is plausible since the issue seems to be in the kswapd + compaction
> >realm. I've yet to figure out exactly what about this commit results in
> >kswapd spinning.
> >
> >I would be interested if someone can confirm this finding.
> >
> >--
> >Seth
> >
>
>
> On my 3.7-rc4 system the problem seems to be effectively solved by the
> revert patch: https://lkml.org/lkml/2012/11/5/308
>

Ok, while there is still a question over whether it's enough, I think it's
sensible to at least start with the obvious one.

Thanks very much.

--
Mel Gorman
SUSE Labs

2012-11-09 09:13:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

On Mon, Nov 05, 2012 at 02:24:49PM +0000, Mel Gorman wrote:
> Jiri Slaby reported the following:
>
> (It's an effective revert of "mm: vmscan: scale number of pages
> reclaimed by reclaim/compaction based on failures".) Given kswapd
> had hours of runtime in ps/top output yesterday in the morning
> and after the revert it's now 2 minutes in sum for the last 24h,
> I would say, it's gone.
>
> The intention of the patch in question was to compensate for the loss
> of lumpy reclaim. Part of the reason lumpy reclaim worked is because
> it aggressively reclaimed pages and this patch was meant to be a sane
> compromise.
>
> When compaction fails, it gets deferred and both compaction and
> reclaim/compaction are deferred to avoid excessive reclaim. However, since
> commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is woken up each time
> and continues reclaiming, which was not taken into account when the patch
> was developed.
>
> Attempts to address the problem ended up just changing the shape of the
> problem instead of fixing it. The release window gets closer and while a
> THP allocation failing is not a major problem, kswapd chewing up a lot of
> CPU is. This patch reverts "mm: vmscan: scale number of pages reclaimed
> by reclaim/compaction based on failures" and will be revisited in the future.
>
> Signed-off-by: Mel Gorman <[email protected]>

Andrew, can you please pick up this patch and drop
mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-only-in-direct-reclaim.patch?

There are mixed reports on how much it helps but it comes down to "this
fixes a problem" versus "kswapd is still showing higher usage". I think
the higher kswapd usage is explained by the removal of __GFP_NO_KSWAPD
and so while higher usage is bad, it is not necessarily unjustified.
Ideally it would have been proven that having kswapd doing the work
reduced application stalls in direct reclaim but unfortunately I do not
have concrete evidence of that at this time.

--
Mel Gorman
SUSE Labs

2012-11-11 09:13:32

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On 9.11.2012 10:06, Mel Gorman wrote:
> On Fri, Nov 09, 2012 at 09:07:45AM +0100, Zdenek Kabelac wrote:
>>> fe2c2a106663130a5ab45cb0e3414b52df2fff0c is the first bad commit
>>> commit fe2c2a106663130a5ab45cb0e3414b52df2fff0c
>>> Author: Rik van Riel <[email protected]>
>>> Date: Wed Mar 21 16:33:51 2012 -0700
>>>
>>> vmscan: reclaim at order 0 when compaction is enabled
>>> ...
>>>
>>> This is plausible since the issue seems to be in the kswapd + compaction
>>> realm. I've yet to figure out exactly what about this commit results in
>>> kswapd spinning.
>>>
>>> I would be interested if someone can confirm this finding.
>>>
>>> --
>>> Seth
>>>
>>
>>
>> On my 3.7-rc4 system the problem seems to be effectively solved by the
>> revert patch: https://lkml.org/lkml/2012/11/5/308
>>
>
> Ok, while there is still a question over whether it's enough, I think it's
> sensible to at least start with the obvious one.
>


Hmm, so it just took longer to hit the problem and observe kswapd0
spinning on my CPU again - it's not as endless as before - but it still
easily eats minutes - it helps to turn off Firefox or TB (memory-hungry
apps) so kswapd0 stops soon - and restart those apps again.
(And I still have like >1GB of cached memory)

kswapd0 R running task 0 30 2 0x00000000
ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
Call Trace:
[<ffffffff81555bf2>] preempt_schedule+0x42/0x60
[<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
[<ffffffff81192971>] put_super+0x31/0x40
[<ffffffff81192a42>] drop_super+0x22/0x30
[<ffffffff81193b89>] prune_super+0x149/0x1b0
[<ffffffff81141e2a>] shrink_slab+0xba/0x510
[<ffffffff81185b4a>] ? mem_cgroup_iter+0x17a/0x2e0
[<ffffffff81185a9a>] ? mem_cgroup_iter+0xca/0x2e0
[<ffffffff81145099>] balance_pgdat+0x629/0x7f0
[<ffffffff811453d4>] kswapd+0x174/0x620
[<ffffffff8106fd20>] ? __init_waitqueue_head+0x60/0x60
[<ffffffff81145260>] ? balance_pgdat+0x7f0/0x7f0
[<ffffffff8106f50b>] kthread+0xdb/0xe0
[<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140
[<ffffffff8155fa1c>] ret_from_fork+0x7c/0xb0
[<ffffffff8106f430>] ? kthread_create_on_node+0x140/0x140


runnable tasks:
task PID tree-key switches prio exec-runtime
sum-exec sum-sleep
----------------------------------------------------------------------------------------------------------
kswapd0 30 8689943.729790 36266 120 8689943.729790
201495.640629 56609485.489414 /
kworker/0:1 14790 8689937.729790 16969 120 8689937.729790
374.385996 150405.181652 /
R bash 14855 821.749268 50 120 821.749268
24.027535 5252.291128 /autogroup-304




Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 146
CPU 1: hi: 186, btch: 31 usd: 135
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 131
CPU 1: hi: 186, btch: 31 usd: 132
active_anon:726521 inactive_anon:26442 isolated_anon:0
active_file:77765 inactive_file:76890 isolated_file:0
unevictable:12 dirty:4 writeback:0 unstable:0
free:40261 slab_reclaimable:12414 slab_unreclaimable:9694
mapped:26382 shmem:162712 pagetables:6618 bounce:0
free_cma:0
DMA free:15676kB min:272kB low:340kB high:408kB active_anon:208kB
inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:0kB shmem:208kB slab_reclaimable:8kB
slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB
free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:126072kB min:51776kB low:64720kB high:77664kB active_anon:2175104kB
inactive_anon:98976kB active_file:296252kB inactive_file:297648kB
unevictable:48kB isolated(anon):0kB isolated(file):0kB present:3021960kB
mlocked:48kB dirty:12kB writeback:0kB mapped:77664kB shmem:620388kB
slab_reclaimable:19128kB slab_unreclaimable:6292kB kernel_stack:624kB
pagetables:8900kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 885 885
Normal free:19296kB min:15532kB low:19412kB high:23296kB active_anon:730772kB
inactive_anon:6792kB active_file:14808kB inactive_file:9912kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:906664kB mlocked:0kB dirty:4kB
writeback:0kB mapped:27864kB shmem:30252kB slab_reclaimable:30520kB
slab_unreclaimable:32476kB kernel_stack:2496kB pagetables:17572kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0 0
DMA: 1*4kB 1*8kB 3*16kB 2*32kB 3*64kB 2*128kB 3*256kB 2*512kB 3*1024kB
3*2048kB 1*4096kB = 15676kB
DMA32: 730*4kB 328*8kB 223*16kB 123*32kB 182*64kB 96*128kB 172*256kB 56*512kB
12*1024kB 1*2048kB 1*4096kB = 128120kB
Normal: 600*4kB 384*8kB 164*16kB 122*32kB 40*64kB 7*128kB 1*256kB 1*512kB
1*1024kB 1*2048kB 0*4096kB = 19296kB
317367 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
642501 pages shared
869271 pages non-shared

2012-11-12 11:37:37

by Mel Gorman

[permalink] [raw]
Subject: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

Hmm, so it just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless as before -
but it still easily eats minutes - it helps to turn off Firefox
or TB (memory-hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)

kswapd0 R running task 0 30 2 0x00000000
ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
Call Trace:
[<ffffffff81555bf2>] preempt_schedule+0x42/0x60
[<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
[<ffffffff81192971>] put_super+0x31/0x40
[<ffffffff81192a42>] drop_super+0x22/0x30
[<ffffffff81193b89>] prune_super+0x149/0x1b0
[<ffffffff81141e2a>] shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
is a storm of THP requests that are simply rejected, it will still
be the case that kswapd is awake for a prolonged period of time
as pgdat->kswapd_max_order is updated each time. This is noticed by
the main kswapd() loop and it will not call kswapd_try_to_sleep().
Instead it will loop, shrinking a small number of pages and calling
shrink_slab() on each iteration.
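
(For illustration, a simplified sketch of the loop behaviour described
above; the names follow the 3.7-era kswapd() but this is not the
verbatim kernel code:

	for ( ; ; ) {
		/* each wakeup may have raised kswapd_max_order */
		new_order = pgdat->kswapd_max_order;
		pgdat->kswapd_max_order = 0;

		if (order < new_order) {
			/*
			 * Someone wants a larger order: balance again
			 * instead of calling kswapd_try_to_sleep().
			 */
			order = new_order;
		} else {
			kswapd_try_to_sleep(pgdat, order, classzone_idx);
			order = pgdat->kswapd_max_order;
		}

		balance_pgdat(pgdat, order, classzone_idx);
	}

A steady storm of rejected THP requests keeps raising
pgdat->kswapd_max_order, so the sleep branch is never taken.)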

The temptation is to supply a patch that checks if kswapd was woken for
THP and, if so, ignores pgdat->kswapd_max_order, but that would be a hack
not backed up by proper testing. As 3.7 is very close to release and this
is not a bug we should release with, a safer path is to revert "mm: remove
__GFP_NO_KSWAPD" for now and revisit it with a view to ironing out the
balance_pgdat() logic in general.

Signed-off-by: Mel Gorman <[email protected]>
---
drivers/mtd/mtdcore.c | 6 ++++--
include/linux/gfp.h | 5 ++++-
include/trace/events/gfpflags.h | 1 +
mm/page_alloc.c | 7 ++++---
4 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 374c46d..ec794a7 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -1077,7 +1077,8 @@ EXPORT_SYMBOL_GPL(mtd_writev);
* until the request succeeds or until the allocation size falls below
* the system page size. This attempts to make sure it does not adversely
* impact system performance, so when allocating more than one page, we
- * ask the memory allocator to avoid re-trying.
+ * ask the memory allocator to avoid re-trying, swapping, writing back
+ * or performing I/O.
*
* Note, this function also makes sure that the allocated buffer is aligned to
* the MTD device's min. I/O unit, i.e. the "mtd->writesize" value.
@@ -1091,7 +1092,8 @@ EXPORT_SYMBOL_GPL(mtd_writev);
*/
void *mtd_kmalloc_up_to(const struct mtd_info *mtd, size_t *size)
{
- gfp_t flags = __GFP_NOWARN | __GFP_WAIT | __GFP_NORETRY;
+ gfp_t flags = __GFP_NOWARN | __GFP_WAIT |
+ __GFP_NORETRY | __GFP_NO_KSWAPD;
size_t min_alloc = max_t(size_t, mtd->writesize, PAGE_SIZE);
void *kbuf;

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 02c1c971..d0a7967 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -31,6 +31,7 @@ struct vm_area_struct;
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
#define ___GFP_NOTRACK 0x200000u
+#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u

@@ -85,6 +86,7 @@ struct vm_area_struct;
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK) /* Don't track with kmemcheck */

+#define __GFP_NO_KSWAPD ((__force gfp_t)___GFP_NO_KSWAPD)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE) /* Allocator intends to dirty page */

@@ -114,7 +116,8 @@ struct vm_area_struct;
__GFP_MOVABLE)
#define GFP_IOFS (__GFP_IO | __GFP_FS)
#define GFP_TRANSHUGE (GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
- __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN)
+ __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN | \
+ __GFP_NO_KSWAPD)

#ifdef CONFIG_NUMA
#define GFP_THISNODE (__GFP_THISNODE | __GFP_NOWARN | __GFP_NORETRY)
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
index 9391706..d6fd8e5 100644
--- a/include/trace/events/gfpflags.h
+++ b/include/trace/events/gfpflags.h
@@ -36,6 +36,7 @@
{(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
{(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"}, \
{(unsigned long)__GFP_NOTRACK, "GFP_NOTRACK"}, \
+ {(unsigned long)__GFP_NO_KSWAPD, "GFP_NO_KSWAPD"}, \
{(unsigned long)__GFP_OTHER_NODE, "GFP_OTHER_NODE"} \
) : "GFP_NOWAIT"

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..7228260 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2416,8 +2416,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

restart:
- wake_all_kswapd(order, zonelist, high_zoneidx,
- zone_idx(preferred_zone));
+ if (!(gfp_mask & __GFP_NO_KSWAPD))
+ wake_all_kswapd(order, zonelist, high_zoneidx,
+ zone_idx(preferred_zone));

/*
* OK, we're below the kswapd watermark and have kicked background
@@ -2494,7 +2495,7 @@ rebalance:
* system then fail the allocation instead of entering direct reclaim.
*/
if ((deferred_compaction || contended_compaction) &&
- (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+ (gfp_mask & __GFP_NO_KSWAPD))
goto nopage;

/* Try direct reclaim and then allocating */

2012-11-12 12:20:03

by Mel Gorman

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
> Hmm, so it just took longer to hit the problem and observe kswapd0
> spinning on my CPU again - it's not as endless as before - but
> it still easily eats minutes - it helps to turn off Firefox or TB
> (memory-hungry apps) so kswapd0 stops soon - and restart those apps
> again.
> (And I still have like >1GB of cached memory)
>

I posted a "safe" patch that I believe explains why you are seeing what
you are seeing. It does mean that there will still be some stalls due to
THP because kswapd is not helping and it's avoiding the problem rather
than trying to deal with it.

Hence, I'm also going to post this patch even though I have not tested
it myself. If you find it fixes the problem then it would be a
preferable patch to the revert. It still is the case that the
balance_pgdat() logic is in sort need of a rethink as it's pretty
twisted right now.

Thanks

---8<---
mm: Avoid waking kswapd for THP allocations when compaction is deferred or contended

With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
based on failures" reverted, Zdenek Kabelac reported the following

Hmm, so it just took longer to hit the problem and observe
kswapd0 spinning on my CPU again - it's not as endless as before -
but it still easily eats minutes - it helps to turn off Firefox
or TB (memory-hungry apps) so kswapd0 stops soon - and restart
those apps again. (And I still have like >1GB of cached memory)

kswapd0 R running task 0 30 2 0x00000000
ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
Call Trace:
[<ffffffff81555bf2>] preempt_schedule+0x42/0x60
[<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
[<ffffffff81192971>] put_super+0x31/0x40
[<ffffffff81192a42>] drop_super+0x22/0x30
[<ffffffff81193b89>] prune_super+0x149/0x1b0
[<ffffffff81141e2a>] shrink_slab+0xba/0x510

The sysrq+m indicates the system has no swap so it'll never reclaim
anonymous pages as part of reclaim/compaction. That is one part of the
problem but not the root cause as file-backed pages could also be reclaimed.

The likely underlying problem is that kswapd is woken up or kept awake
for each THP allocation request in the page allocator slow path.

If compaction fails for the requesting process then compaction will be
deferred for a time and direct reclaim is avoided. However, if there
is a storm of THP requests that are simply rejected, it will still
be the case that kswapd is awake for a prolonged period of time
as pgdat->kswapd_max_order is updated each time. This is noticed by
the main kswapd() loop and it will not call kswapd_try_to_sleep().
Instead it will loop, shrinking a small number of pages and calling
shrink_slab() on each iteration.

This patch defers when kswapd gets woken up for THP allocations. For !THP
allocations, kswapd is always woken up. For THP allocations, kswapd is
woken up iff the process is willing to enter direct reclaim/compaction.

Signed-off-by: Mel Gorman <[email protected]>

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..0b469b4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2378,6 +2378,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
}

+/* Returns true if the allocation is likely for THP */
+static bool is_thp_alloc(gfp_t gfp_mask, unsigned int order)
+{
+ if (order == pageblock_order &&
+ (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
+ return true;
+ return false;
+}
+
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
struct zonelist *zonelist, enum zone_type high_zoneidx,
@@ -2416,7 +2425,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

restart:
- wake_all_kswapd(order, zonelist, high_zoneidx,
+ /* The decision whether to wake kswapd for THP is made later */
+ if (!is_thp_alloc(gfp_mask, order))
+ wake_all_kswapd(order, zonelist, high_zoneidx,
zone_idx(preferred_zone));

/*
@@ -2487,15 +2498,21 @@ rebalance:
goto got_pg;
sync_migration = true;

- /*
- * If compaction is deferred for high-order allocations, it is because
- * sync compaction recently failed. In this is the case and the caller
- * requested a movable allocation that does not heavily disrupt the
- * system then fail the allocation instead of entering direct reclaim.
- */
- if ((deferred_compaction || contended_compaction) &&
- (gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE)
- goto nopage;
+ if (is_thp_alloc(gfp_mask, order)) {
+ /*
+ * If compaction is deferred for high-order allocations, it is
+ * because sync compaction recently failed. In this is the case
+ * and the caller requested a movable allocation that does not
+ * heavily disrupt the system then fail the allocation instead
+ * of entering direct reclaim.
+ */
+ if (deferred_compaction || contended_compaction)
+ goto nopage;
+
+ /* If process is willing to reclaim/compact then wake kswapd */
+ wake_all_kswapd(order, zonelist, high_zoneidx,
+ zone_idx(preferred_zone));
+ }

/* Try direct reclaim and then allocating */
page = __alloc_pages_direct_reclaim(gfp_mask, order,
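
(For reference, a hedged reading of the is_thp_alloc() test above: with
the 3.7-era definition

	#define GFP_TRANSHUGE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN)

a THP fault requests the order the patch equates with pageblock_order,
with __GFP_MOVABLE set and __GFP_REPEAT clear, so the check
(gfp_mask & (__GFP_MOVABLE|__GFP_REPEAT)) == __GFP_MOVABLE matches those
requests while excluding movable callers that asked to retry - hence the
comment that the allocation is only "likely" for THP.)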

2012-11-12 13:13:36

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On 12.11.2012 13:19, Mel Gorman wrote:
> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>> Hmm, so it just took longer to hit the problem and observe kswapd0
>> spinning on my CPU again - it's not as endless as before - but
>> it still easily eats minutes - it helps to turn off Firefox or TB
>> (memory-hungry apps) so kswapd0 stops soon - and restart those apps
>> again.
>> (And I still have like >1GB of cached memory)
>>
>
> I posted a "safe" patch that I believe explains why you are seeing what
> you are seeing. It does mean that there will still be some stalls due to
> THP because kswapd is not helping; it avoids the problem rather than
> dealing with it.
>
> Hence, I'm also going to post this patch even though I have not tested
> it myself. If you find it fixes the problem then it would be preferable
> to the revert. It is still the case that the balance_pgdat() logic is in
> sore need of a rethink as it's pretty twisted right now.
>


Should I apply them all together for 3.7-rc5?

1) https://lkml.org/lkml/2012/11/5/308
2) https://lkml.org/lkml/2012/11/12/113
3) https://lkml.org/lkml/2012/11/12/151

Zdenek

2012-11-12 13:31:44

by Mel Gorman

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
> On 12.11.2012 13:19, Mel Gorman wrote:
> >On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
> >>Hmm, so it just took longer to hit the problem and observe kswapd0
> >>spinning on my CPU again - it's not as endless as before - but
> >>it still easily eats minutes - it helps to turn off Firefox or TB
> >>(memory-hungry apps) so kswapd0 stops soon - and restart those apps
> >>again.
> >>(And I still have like >1GB of cached memory)
> >>
> >
> >I posted a "safe" patch that I believe explains why you are seeing what
> >you are seeing. It does mean that there will still be some stalls due to
> >THP because kswapd is not helping; it avoids the problem rather than
> >dealing with it.
> >
> >Hence, I'm also going to post this patch even though I have not tested
> >it myself. If you find it fixes the problem then it would be preferable
> >to the revert. It is still the case that the balance_pgdat() logic is in
> >sore need of a rethink as it's pretty twisted right now.
> >
>
>
> Should I apply them all together for 3.7-rc5?
>
> 1) https://lkml.org/lkml/2012/11/5/308
> 2) https://lkml.org/lkml/2012/11/12/113
> 3) https://lkml.org/lkml/2012/11/12/151
>

Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
does nothing about THP stalls. 1+3 is a riskier version but depends on
me being correct about the root cause of the problem you are seeing.

If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
have the time to test one combination then it would be preferred that you
test the safe option of 1+2.

Thanks.

--
Mel Gorman
SUSE Labs

2012-11-12 14:50:57

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On 12.11.2012 14:31, Mel Gorman wrote:
> On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
>> On 12.11.2012 13:19, Mel Gorman wrote:
>>> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>>>> Hmm, so it just took longer to hit the problem and observe kswapd0
>>>> spinning on my CPU again - it's not as endless as before - but
>>>> it still easily eats minutes - it helps to turn off Firefox or TB
>>>> (memory-hungry apps) so kswapd0 stops soon - and restart those apps
>>>> again.
>>>> (And I still have like >1GB of cached memory)
>>>>
>>>
>>> I posted a "safe" patch that I believe explains why you are seeing what
>>> you are seeing. It does mean that there will still be some stalls due to
>>> THP because kswapd is not helping; it avoids the problem rather than
>>> dealing with it.
>>>
>>> Hence, I'm also going to post this patch even though I have not tested
>>> it myself. If you find it fixes the problem then it would be preferable
>>> to the revert. It is still the case that the balance_pgdat() logic is in
>>> sore need of a rethink as it's pretty twisted right now.
>>>
>>
>>
>> Should I apply them all together for 3.7-rc5?
>>
>> 1) https://lkml.org/lkml/2012/11/5/308
>> 2) https://lkml.org/lkml/2012/11/12/113
>> 3) https://lkml.org/lkml/2012/11/12/151
>>
>
> Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
> does nothing about THP stalls. 1+3 is a riskier version but depends on
> me being correct about the root cause of the problem you are seeing.
>
> If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
> have the time to test one combination then it would be preferred that you
> test the safe option of 1+2.
>
>

I'll go with 1+2 for a couple of days - the issue is, I've no idea how it
gets suddenly triggered - it seemed to be running fine for 2-3 days even
with just 1) - but then kswapd0 started to occupy the CPU for minutes.
It looks like some intensive workload in Firefox (Flash) may lead to that.

Anyway, it's hard to tell quickly whether it helped.

Zdenek

2012-11-14 21:44:22

by Johannes Hirte

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: vmscan: scale number of pages reclaimed by reclaim/compaction based on failures"

On Fri, 9 Nov 2012 08:36:37 +0000,
Mel Gorman <[email protected]> wrote:

> On Tue, Nov 06, 2012 at 11:15:54AM +0100, Johannes Hirte wrote:
> > On Mon, 5 Nov 2012 14:24:49 +0000,
> > Mel Gorman <[email protected]> wrote:
> >
> > > Jiri Slaby reported the following:
> > >
> > > (It's an effective revert of "mm: vmscan: scale number of
> > > pages reclaimed by reclaim/compaction based on failures".) Given
> > > kswapd had hours of runtime in ps/top output yesterday in the
> > > morning and after the revert it's now 2 minutes in sum for the
> > > last 24h, I would say, it's gone.
> > >
> > > The intention of the patch in question was to compensate for the
> > > loss of lumpy reclaim. Part of the reason lumpy reclaim worked is
> > > because it aggressively reclaimed pages and this patch was meant
> > > to be a sane compromise.
> > >
> > > When compaction fails, it gets deferred and both compaction and
> > > reclaim/compaction are deferred to avoid excessive reclaim. However,
> > > since commit c6543459 (mm: remove __GFP_NO_KSWAPD), kswapd is
> > > woken up each time and continues reclaiming, which was not taken
> > > into account when the patch was developed.
> > >
> > > Attempts to address the problem ended up just changing the shape
> > > of the problem instead of fixing it. The release window gets
> > > closer and while a THP allocation failing is not a major problem,
> > > kswapd chewing up a lot of CPU is. This patch reverts "mm:
> > > vmscan: scale number of pages reclaimed by reclaim/compaction
> > > based on failures" and will be revisited in the future.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > mm/vmscan.c | 25 -------------------------
> > > 1 file changed, 25 deletions(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index 2624edc..e081ee8 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -1760,28 +1760,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
> > >  	return false;
> > >  }
> > > 
> > > -#ifdef CONFIG_COMPACTION
> > > -/*
> > > - * If compaction is deferred for sc->order then scale the number of pages
> > > - * reclaimed based on the number of consecutive allocation failures
> > > - */
> > > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > > -			struct lruvec *lruvec, struct scan_control *sc)
> > > -{
> > > -	struct zone *zone = lruvec_zone(lruvec);
> > > -
> > > -	if (zone->compact_order_failed <= sc->order)
> > > -		pages_for_compaction <<= zone->compact_defer_shift;
> > > -	return pages_for_compaction;
> > > -}
> > > -#else
> > > -static unsigned long scale_for_compaction(unsigned long pages_for_compaction,
> > > -			struct lruvec *lruvec, struct scan_control *sc)
> > > -{
> > > -	return pages_for_compaction;
> > > -}
> > > -#endif
> > > -
> > >  /*
> > >   * Reclaim/compaction is used for high-order allocation requests. It reclaims
> > >   * order-0 pages before compacting the zone. should_continue_reclaim() returns
> > > @@ -1829,9 +1807,6 @@ static inline bool should_continue_reclaim(struct lruvec *lruvec,
> > >  	 * inactive lists are large enough, continue reclaiming
> > >  	 */
> > >  	pages_for_compaction = (2UL << sc->order);
> > > -
> > > -	pages_for_compaction = scale_for_compaction(pages_for_compaction,
> > > -						lruvec, sc);
> > >  	inactive_lru_pages = get_lru_size(lruvec, LRU_INACTIVE_FILE);
> > >  	if (nr_swap_pages > 0)
> > >  		inactive_lru_pages += get_lru_size(lruvec, LRU_INACTIVE_ANON);
> > > --
> >
> > Even with this patch I see kswapd0 very often on top. Much more than
> > with kernel 3.6.
>
> How severe is the CPU usage? The higher usage can be explained by "mm:
> remove __GFP_NO_KSWAPD", which allows kswapd to compact memory to
> reduce the amount of time processes spend in compaction, but results
> in the CPU cost being incurred by kswapd.
>
> Is it really high, with heavy usage over long periods of time as the
> bug report described, or do you just see it using 2-6% of CPU for short
> periods?

It is really high. With compile jobs (make -j4 on a dual core) I've seen
kswapd0 consuming at least 50% CPU most of the time.

2012-11-16 19:14:49

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <[email protected]> wrote:
> With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> based on failures" reverted, Zdenek Kabelac reported the following
>
> Hmm, so it just took longer to hit the problem and observe
> kswapd0 spinning on my CPU again - it's not as endless as before -
> but it still easily eats minutes - it helps to turn off Firefox
> or TB (memory-hungry apps) so kswapd0 stops soon - and restart
> those apps again. (And I still have like >1GB of cached memory)
>
> kswapd0 R running task 0 30 2 0x00000000
> ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
> ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
> ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
> Call Trace:
> [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
> [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
> [<ffffffff81192971>] put_super+0x31/0x40
> [<ffffffff81192a42>] drop_super+0x22/0x30
> [<ffffffff81193b89>] prune_super+0x149/0x1b0
> [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>
> The sysrq+m indicates the system has no swap so it'll never reclaim
> anonymous pages as part of reclaim/compaction. That is one part of the
> problem but not the root cause as file-backed pages could also be reclaimed.
>
> The likely underlying problem is that kswapd is woken up or kept awake
> for each THP allocation request in the page allocator slow path.
>
> > If compaction fails for the requesting process then compaction will be
> > deferred for a time and direct reclaim is avoided. However, if there
> > is a storm of THP requests that are simply rejected, it will still
> > be the case that kswapd is awake for a prolonged period of time
> > as pgdat->kswapd_max_order is updated each time. This is noticed by
> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
> > Instead it will loop, shrinking a small number of pages and calling
> > shrink_slab() on each iteration.
>
> > The temptation is to supply a patch that checks if kswapd was woken for
> > THP and, if so, ignores pgdat->kswapd_max_order, but that would be a hack
> > not backed up by proper testing. As 3.7 is very close to release and this
> > is not a bug we should release with, a safer path is to revert "mm: remove
> > __GFP_NO_KSWAPD" for now and revisit it with a view to ironing out the
> > balance_pgdat() logic in general.
>
> Signed-off-by: Mel Gorman <[email protected]>

Does anyone know if this is queued to go into 3.7 somewhere? I looked
a bit and can't find it in a tree. We have a few reports of Fedora
rawhide users hitting this.

josh

2012-11-16 19:51:32

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Fri, 16 Nov 2012 14:14:47 -0500
Josh Boyer <[email protected]> wrote:

> > The temptation is to supply a patch that checks if kswapd was woken for
> > THP and, if so, ignores pgdat->kswapd_max_order, but that would be a hack
> > not backed up by proper testing. As 3.7 is very close to release and this
> > is not a bug we should release with, a safer path is to revert "mm: remove
> > __GFP_NO_KSWAPD" for now and revisit it with a view to ironing out the
> > balance_pgdat() logic in general.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Does anyone know if this is queued to go into 3.7 somewhere? I looked
> a bit and can't find it in a tree. We have a few reports of Fedora
> rawhide users hitting this.

Still thinking about it. We're reverting quite a lot of material
lately.
mm-revert-mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-based-on-failures.patch
and revert-mm-fix-up-zone-present-pages.patch are queued for 3.7.

I'll toss this one in there as well, but I can't say I'm feeling
terribly confident. How is Valdis's machine nowadays?

2012-11-16 20:06:22

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <[email protected]> wrote:
> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> > based on failures" reverted, Zdenek Kabelac reported the following
> >
> > Hmm, so it just took longer to hit the problem and observe
> > kswapd0 spinning on my CPU again - it's not as endless as before -
> > but it still easily eats minutes - it helps to turn off Firefox
> > or TB (memory-hungry apps) so kswapd0 stops soon - and restart
> > those apps again. (And I still have like >1GB of cached memory)
> >
> > kswapd0 R running task 0 30 2 0x00000000
> > ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
> > ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
> > ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
> > Call Trace:
> > [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
> > [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
> > [<ffffffff81192971>] put_super+0x31/0x40
> > [<ffffffff81192a42>] drop_super+0x22/0x30
> > [<ffffffff81193b89>] prune_super+0x149/0x1b0
> > [<ffffffff81141e2a>] shrink_slab+0xba/0x510
> >
> > The sysrq+m indicates the system has no swap so it'll never reclaim
> > anonymous pages as part of reclaim/compaction. That is one part of the
> > problem but not the root cause as file-backed pages could also be reclaimed.
> >
> > The likely underlying problem is that kswapd is woken up or kept awake
> > for each THP allocation request in the page allocator slow path.
> >
> > If compaction fails for the requesting process then compaction will be
> > deferred for a time and direct reclaim is avoided. However, if there
> > is a storm of THP requests that are simply rejected, it will still
> > be the case that kswapd is awake for a prolonged period of time
> > as pgdat->kswapd_max_order is updated each time. This is noticed by
> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
> > Instead it will loop, shrinking a small number of pages and calling
> > shrink_slab() on each iteration.
> >
> > The temptation is to supply a patch that checks if kswapd was woken for
> > THP and, if so, ignores pgdat->kswapd_max_order, but that would be a hack
> > not backed up by proper testing. As 3.7 is very close to release and this
> > is not a bug we should release with, a safer path is to revert "mm: remove
> > __GFP_NO_KSWAPD" for now and revisit it with a view to ironing out the
> > balance_pgdat() logic in general.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Does anyone know if this is queued to go into 3.7 somewhere? I looked
> a bit and can't find it in a tree. We have a few reports of Fedora
> rawhide users hitting this.
>

No, because I was waiting to hear whether it worked and, preferably,
whether the alternative "less safe" option worked. This close to release
it might be better to just go with the safe option.

--
Mel Gorman
SUSE Labs

2012-11-18 19:01:00

by Zdenek Kabelac

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On 12.11.2012 14:31, Mel Gorman wrote:
> On Mon, Nov 12, 2012 at 02:13:20PM +0100, Zdenek Kabelac wrote:
>> On 12.11.2012 13:19, Mel Gorman wrote:
>>> On Sun, Nov 11, 2012 at 10:13:14AM +0100, Zdenek Kabelac wrote:
>>>> Hmm, so it just took longer to hit the problem and observe kswapd0
>>>> spinning on my CPU again - it's not as endless as before - but
>>>> it still easily eats minutes - it helps to turn off Firefox or TB
>>>> (memory-hungry apps) so kswapd0 stops soon - and restart those apps
>>>> again.
>>>> (And I still have like >1GB of cached memory)
>>>>
>>>
>>> I posted a "safe" patch that I believe explains why you are seeing what
>>> you are seeing. It does mean that there will still be some stalls due to
>>> THP because kswapd is not helping; it avoids the problem rather than
>>> dealing with it.
>>>
>>> Hence, I'm also going to post this patch even though I have not tested
>>> it myself. If you find it fixes the problem then it would be preferable
>>> to the revert. It is still the case that the balance_pgdat() logic is in
>>> sore need of a rethink as it's pretty twisted right now.
>>>
>>
>>
>> Should I apply them all together for 3.7-rc5?
>>
>> 1) https://lkml.org/lkml/2012/11/5/308
>> 2) https://lkml.org/lkml/2012/11/12/113
>> 3) https://lkml.org/lkml/2012/11/12/151
>>
>
> Not all together. Test either 1+2 or 1+3. 1+2 is the safer choice but
> does nothing about THP stalls. 1+3 is a riskier version but depends on
> me being correct about the root cause of the problem you are seeing.
>
> If both 1+2 and 1+3 work for you, I'd choose 1+3 for merging. If you only
> have the time to test one combination then it would be preferred that you
> test the safe option of 1+2.

So I've tested 1+2 for a few days - once I rebooted for another reason,
but today this happened to me (with ~2 days of uptime).

For some reason my machine ran out of memory, and the OOM killer killed
firefox and then even the whole X session.

I'm unsure whether it's related to those 2 patches - but I've never had
such an OOM failure before.

Should I experiment now with 1+3 - or is there a newer thing to test?

Zdenek

X: page allocation failure: order:0, mode:0x200da
Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
Call Trace:
[<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
[<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
[<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
[<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
[<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
[<ffffffff812ad69e>] shm_fault+0x1e/0x20
[<ffffffff811571d3>] __do_fault+0x73/0x4d0
[<ffffffff81131640>] ? generic_file_aio_write+0xb0/0x100
[<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
[<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
[<ffffffff81081711>] ? get_parent_ip+0x11/0x50
rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
rsyslogd cpuset=/ mems_allowed=0
Pid: 571, comm: rsyslogd Not tainted 3.7.0-rc5-00007-g95e21c5 #100
Call Trace:
[<ffffffff8154dfcb>] dump_header.isra.12+0x78/0x224
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff81557842>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff81317c0e>] ? ___ratelimit+0x9e/0x130
[<ffffffff81133ac3>] oom_kill_process+0x1d3/0x330
[<ffffffff81134219>] out_of_memory+0x439/0x4a0
[<ffffffff81139056>] __alloc_pages_nodemask+0x976/0xa40
[<ffffffff811304b5>] ? find_get_page+0x5/0x230
[<ffffffff811322a0>] filemap_fault+0x2d0/0x480
[<ffffffff811571d3>] __do_fault+0x73/0x4d0
[<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
[<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
[<ffffffff81081711>] ? get_parent_ip+0x11/0x50
[<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
[<ffffffff8155ae7d>] __do_page_fault+0x15d/0x4e0
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
[<ffffffff811f8d9c>] ? proc_reg_read+0x8c/0xc0
[<ffffffff815580a3>] ? error_sti+0x5/0x6
[<ffffffff8131f55d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8155b20e>] do_page_fault+0xe/0x10
[<ffffffff81557ea2>] page_fault+0x22/0x30
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 30
CPU 1: hi: 186, btch: 31 usd: 6
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 30
CPU 1: hi: 186, btch: 31 usd: 0
active_anon:900420 inactive_anon:28835 isolated_anon:0
active_file:43 inactive_file:21 isolated_file:0
unevictable:4 dirty:34 writeback:2 unstable:0
free:20731 slab_reclaimable:8641 slab_unreclaimable:10446
mapped:18325 shmem:243662 pagetables:7705 bounce:0
free_cma:0
DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:55296kB min:51776kB low:64720kB high:77664kB
active_anon:2834992kB inactive_anon:107924kB active_file:92kB
inactive_file:52kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:3021968kB mlocked:0kB dirty:88kB writeback:0kB mapped:65460kB
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:180 all_unreclaimable? yes
lowmem_reserve[]: 0 0 885 885
Normal free:15508kB min:15532kB low:19412kB high:23296kB
active_anon:763796kB inactive_anon:6544kB active_file:80kB inactive_file:32kB
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:234 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB
2*2048kB 1*4096kB = 12120kB
DMA32: 900*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 55296kB
Normal: 452*4kB 363*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB
0*1024kB 1*2048kB 0*4096kB = 17496kB
243783 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
553592 pages shared
943414 pages non-shared
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 351] 0 351 74685 1679 154 0 0
systemd-journal
[ 544] 0 544 5863 107 16 0 0 bluetoothd
[ 545] 0 545 88977 725 56 0 0
NetworkManager
[ 546] 0 546 30170 158 15 0 0 crond
[ 552] 0 552 1879 28 8 0 0 gpm
[ 557] 0 557 1092 37 8 0 0 acpid
[ 564] 81 564 6361 373 16 0 -900 dbus-daemon
[ 566] 0 566 61331 155 22 0 0 rsyslogd
[ 567] 498 567 7026 104 19 0 0
avahi-daemon
[ 568] 498 568 6994 59 17 0 0
avahi-daemon
[ 573] 0 573 1758 33 9 0 0 mcelog
[ 578] 0 578 5925 51 16 0 0 atd
[ 586] 105 586 121536 4270 56 0 0 polkitd
[ 593] 0 593 21967 205 48 0 -900
modem-manager
[ 601] 0 601 1087 26 8 0 0 thinkfan
[ 619] 0 619 122722 1085 129 0 0 libvirtd
[ 630] 32 630 4812 68 13 0 0 rpcbind
[ 633] 0 633 20080 199 43 0 -1000 sshd
[ 653] 29 653 5905 116 16 0 0 rpc.statd
[ 700] 0 700 13173 190 28 0 0
wpa_supplicant
[ 719] 0 719 4810 50 14 0 0 rpc.idmapd
[ 730] 0 730 28268 36 10 0 0 rpc.rquotad
[ 766] 0 766 6030 153 15 0 0 rpc.mountd
[ 806] 99 806 3306 45 11 0 0 dnsmasq
[ 985] 0 985 21219 150 46 0 0 login
[ 988] 0 988 260408 355 48 0 0
console-kit-dae
[ 1053] 11641 1053 28706 241 14 0 0 bash
[ 1097] 11641 1097 27972 58 10 0 0 startx
[ 1125] 11641 1125 3487 48 13 0 0 xinit
[ 1126] 11641 1126 80028 35289 154 0 0 X
[ 1138] 11641 1138 142989 930 122 0 0
gnome-session
[ 1151] 11641 1151 4013 64 12 0 0 dbus-launch
[ 1152] 11641 1152 6069 82 17 0 0 dbus-daemon
[ 1154] 11641 1154 85449 162 36 0 0
at-spi-bus-laun
[ 1158] 11641 1158 6103 116 17 0 0 dbus-daemon
[ 1161] 11641 1161 32328 174 33 0 0
at-spi2-registr
[ 1172] 11641 1172 4013 65 13 0 0 dbus-launch
[ 1173] 11641 1173 6350 265 18 0 0 dbus-daemon
[ 1177] 11641 1177 37416 416 29 0 0 gconfd-2
[ 1184] 11641 1184 117556 1203 44 0 0
gnome-keyring-d
[ 1185] 11641 1185 224829 2236 177 0 0
gnome-settings-
[ 1194] 0 1194 57227 786 46 0 0 upowerd
[ 1226] 11641 1226 77392 190 36 0 0 gvfsd
[ 1246] 11641 1246 118201 772 90 0 0 pulseaudio
[ 1247] 496 1247 41161 59 17 0 0
rtkit-daemon
[ 1252] 11641 1252 29494 205 58 0 0
gconf-helper
[ 1253] 106 1253 81296 355 46 0 0 colord
[ 1257] 11641 1257 59080 1574 60 0 0 openbox
[ 1258] 11641 1258 185569 3216 146 0 0 gnome-panel
[ 1264] 11641 1264 64102 229 27 0 0
dconf-service
[ 1268] 11641 1268 139203 858 116 0 0
gnome-user-shar
[ 1269] 11641 1269 268645 27442 334 0 0 pidgin
[ 1270] 11641 1270 142642 1064 117 0 0
bluetooth-apple
[ 1271] 11641 1271 193218 1775 175 0 0 nm-applet
[ 1272] 11641 1272 220194 1810 138 0 0
gnome-sound-app
[ 1285] 11641 1285 80914 632 45 0 0
gvfs-udisks2-vo
[ 1287] 0 1287 88101 599 41 0 0 udisksd
[ 1295] 11641 1295 177162 14140 150 0 0 wnck-applet
[ 1297] 11641 1297 281043 3161 199 0 0
clock-applet
[ 1299] 11641 1299 142537 1053 120 0 0
cpufreq-applet
[ 1302] 11641 1302 141960 986 113 0 0
notification-ar
[ 1340] 11641 1340 190026 6265 144 0 0
gnome-terminal
[ 1346] 11641 1346 2123 35 10 0 0
gnome-pty-helpe
[ 1347] 11641 1347 28719 253 11 0 0 bash
[ 1858] 11641 1858 10895 101 27 0 0 xfconfd
[ 2052] 11641 2052 28720 255 11 0 0 bash
[ 6239] 11641 6239 73437 711 88 0 0 kdeinit4
[ 6240] 11641 6240 83952 717 101 0 0 klauncher
[ 6242] 11641 6242 126497 1479 172 0 0 kded4
[ 6244] 11641 6244 2977 48 11 0 0 gam_server
[10804] 11641 10804 101320 307 47 0 0 gvfsd-http
[12175] 0 12175 27197 32 10 0 0 agetty
[12249] 11641 12249 28719 252 14 0 0 bash
[14862] 0 14862 51773 344 55 0 0 cupsd
[14868] 4 14868 18105 158 39 0 0 cups-polld
[16728] 11641 16728 28691 244 12 0 0 bash
[16975] 0 16975 9109 253 23 0 -1000
systemd-udevd
[17618] 0 17618 8245 87 22 0 0
systemd-logind
[ 3133] 11641 3133 43721 132 40 0 0 su
[ 3136] 0 3136 28564 139 12 0 0 bash
[ 3983] 11641 3983 43722 134 41 0 0 su
[ 3986] 0 3986 28564 144 13 0 0 bash
[16350] 11641 16350 28691 245 14 0 0 bash
[31228] 11641 31228 28691 245 11 0 0 bash
[31922] 11641 31922 28719 250 13 0 0 bash
[ 2340] 11641 2340 28691 245 15 0 0 bash
[12586] 38 12586 7851 150 19 0 0 ntpd
[32658] 11641 32658 41192 424 35 0 0 mc
[32660] 11641 32660 28692 245 13 0 0 bash
[29193] 11641 29193 713846 414344 1614 0 0 firefox
[10971] 11641 10971 43722 133 43 0 0 su
[10974] 0 10974 28564 132 12 0 0 bash
[11343] 0 11343 28497 66 11 0 0 ksmtuned
[11387] 11641 11387 28719 254 11 0 0 bash
[11450] 11641 11450 28691 246 13 0 0 bash
[11576] 11641 11576 43722 133 40 0 0 su
[11579] 0 11579 28564 141 13 0 0 bash
[12106] 11641 12106 28691 244 12 0 0 bash
[12141] 11641 12141 43722 132 44 0 0 su
[12144] 0 12144 28564 140 11 0 0 bash
[12264] 11641 12264 28691 245 11 0 0 bash
[12299] 11641 12299 43721 133 40 0 0 su
[12302] 0 12302 28564 137 12 0 0 bash
[26024] 11641 26024 28691 245 13 0 0 bash
[26083] 11641 26083 28691 245 13 0 0 bash
[28235] 11641 28235 43721 132 42 0 0 su
[28238] 0 28238 28564 143 13 0 0 bash
[29460] 11641 29460 43721 132 42 0 0 su
[29463] 0 29463 28564 137 12 0 0 bash
[29758] 11641 29758 28720 256 12 0 0 bash
[29864] 11641 29864 41916 1153 36 0 0 mc
[29866] 11641 29866 28728 257 11 0 0 bash
[32750] 0 32750 23164 2994 47 0 0 dhclient
[ 323] 0 323 24081 471 48 0 0 sendmail
[ 347] 51 347 20347 367 38 0 0 sendmail
[ 907] 11641 907 379562 159766 707 0 0 thunderbird
[ 6340] 11641 6340 28719 251 12 0 0 bash
[ 6790] 11641 6790 80307 620 101 0 0
xfce4-notifyd
[ 6844] 0 6844 26669 23 9 0 0 sleep
Out of memory: Kill process 29193 (firefox) score 420 or sacrifice child
Killed process 29193 (firefox) total-vm:2855384kB, anon-rss:1653868kB,
file-rss:3508kB
[<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
[<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
[<ffffffff8115b575>] get_dump_page+0x45/0x60
[<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
[<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
[<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
[<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
[<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
[<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
[<ffffffff8100236f>] do_signal+0x3f/0x9a0
[<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
[<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
[<ffffffff815580a3>] ? error_sti+0x5/0x6
[<ffffffff81557cd1>] ? retint_signal+0x11/0x90
[<ffffffff81002d70>] do_notify_resume+0x80/0xb0
[<ffffffff81557d06>] retint_signal+0x46/0x90
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 30
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 30
active_anon:900420 inactive_anon:28835 isolated_anon:0
active_file:8 inactive_file:0 isolated_file:0
unevictable:4 dirty:34 writeback:2 unstable:0
free:20724 slab_reclaimable:8641 slab_unreclaimable:10446
mapped:18325 shmem:243662 pagetables:7705 bounce:0
free_cma:0
DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:55404kB min:51776kB low:64720kB high:77664kB
active_anon:2834992kB inactive_anon:107924kB active_file:0kB
inactive_file:28kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:3021968kB mlocked:0kB dirty:0kB writeback:0kB mapped:65460kB
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:129 all_unreclaimable? yes
lowmem_reserve[]: 0 0 885 885
Normal free:15364kB min:15532kB low:19412kB high:23296kB
active_anon:763796kB inactive_anon:6544kB active_file:0kB inactive_file:24kB
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:379 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB
2*2048kB 1*4096kB = 12120kB
DMA32: 896*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 55280kB
Normal: 403*4kB 377*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB
0*1024kB 1*2048kB 0*4096kB = 17412kB
243733 total pagecache pages
rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
rsyslogd cpuset=/ mems_allowed=0
Pid: 571, comm: rsyslogd Not tainted 3.7.0-rc5-00007-g95e21c5 #100
Call Trace:
[<ffffffff8154dfcb>] dump_header.isra.12+0x78/0x224
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff81557842>] ? _raw_spin_unlock_irqrestore+0x42/0x80
[<ffffffff81317c0e>] ? ___ratelimit+0x9e/0x130
[<ffffffff81133ac3>] oom_kill_process+0x1d3/0x330
[<ffffffff81134219>] out_of_memory+0x439/0x4a0
[<ffffffff81139056>] __alloc_pages_nodemask+0x976/0xa40
[<ffffffff811304b5>] ? find_get_page+0x5/0x230
[<ffffffff811322a0>] filemap_fault+0x2d0/0x480
[<ffffffff811571d3>] __do_fault+0x73/0x4d0
[<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
[<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
[<ffffffff81081711>] ? get_parent_ip+0x11/0x50
[<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
[<ffffffff8155ae7d>] __do_page_fault+0x15d/0x4e0
[<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
[<ffffffff811f8d9c>] ? proc_reg_read+0x8c/0xc0
[<ffffffff815580a3>] ? error_sti+0x5/0x6
[<ffffffff8131f55d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
[<ffffffff8155b20e>] do_page_fault+0xe/0x10
[<ffffffff81557ea2>] page_fault+0x22/0x30
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 30
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 1
CPU 1: hi: 186, btch: 31 usd: 46
active_anon:900420 inactive_anon:28835 isolated_anon:0
active_file:0 inactive_file:7 isolated_file:0
unevictable:4 dirty:0 writeback:2 unstable:0
free:20691 slab_reclaimable:8641 slab_unreclaimable:10446
mapped:18325 shmem:243662 pagetables:7705 bounce:0
free_cma:0
DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:55280kB min:51776kB low:64720kB high:77664kB
active_anon:2834992kB inactive_anon:107924kB active_file:0kB
inactive_file:12kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:3021968kB mlocked:0kB dirty:0kB writeback:0kB mapped:65460kB
shmem:943100kB slab_reclaimable:11700kB slab_unreclaimable:8968kB
kernel_stack:592kB pagetables:11852kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:520 all_unreclaimable? yes
lowmem_reserve[]: 0 0 885 885
Normal free:15364kB min:15532kB low:19412kB high:23296kB
active_anon:763796kB inactive_anon:6544kB active_file:0kB inactive_file:16kB
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB
mlocked:16kB dirty:48kB writeback:52kB mapped:6168kB shmem:27952kB
slab_reclaimable:22864kB slab_unreclaimable:32800kB kernel_stack:2568kB
pagetables:18968kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:571 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB
2*2048kB 1*4096kB = 12120kB
DMA32: 896*4kB 1512*8kB 513*16kB 635*32kB 109*64kB 8*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 55280kB
Normal: 403*4kB 377*8kB 225*16kB 139*32kB 30*64kB 4*128kB 1*256kB 0*512kB
0*1024kB 1*2048kB 0*4096kB = 17412kB
243733 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
553579 pages shared
943538 pages non-shared
1032176 pages RAM
42789 pages reserved
553576 pages shared
943549 pages non-shared
[ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[ 351] 0 351 74685 1682 154 0 0
systemd-journal
[ 544] 0 544 5863 107 16 0 0 bluetoothd
[ 545] 0 545 88977 725 56 0 0
NetworkManager
[ 546] 0 546 30170 158 15 0 0 crond
[ 552] 0 552 1879 28 8 0 0 gpm
[ 557] 0 557 1092 37 8 0 0 acpid
[ 564] 81 564 6361 373 16 0 -900 dbus-daemon
[ 566] 0 566 61331 155 22 0 0 rsyslogd
[ 567] 498 567 7026 104 19 0 0
avahi-daemon
[ 568] 498 568 6994 59 17 0 0
avahi-daemon
[ 573] 0 573 1758 33 9 0 0 mcelog
[ 578] 0 578 5925 51 16 0 0 atd
[ 586] 105 586 121536 4270 56 0 0 polkitd
[ 593] 0 593 21967 205 48 0 -900
modem-manager
[ 601] 0 601 1087 26 8 0 0 thinkfan
[ 619] 0 619 122722 1085 129 0 0 libvirtd
[ 630] 32 630 4812 68 13 0 0 rpcbind
[ 633] 0 633 20080 199 43 0 -1000 sshd
[ 653] 29 653 5905 116 16 0 0 rpc.statd
[ 700] 0 700 13173 190 28 0 0
wpa_supplicant
[ 719] 0 719 4810 50 14 0 0 rpc.idmapd
[ 730] 0 730 28268 36 10 0 0 rpc.rquotad
[ 766] 0 766 6030 153 15 0 0 rpc.mountd
[ 806] 99 806 3306 45 11 0 0 dnsmasq
[ 985] 0 985 21219 150 46 0 0 login
[ 988] 0 988 260408 355 48 0 0
console-kit-dae
[ 1053] 11641 1053 28706 241 14 0 0 bash
[ 1097] 11641 1097 27972 58 10 0 0 startx
[ 1125] 11641 1125 3487 48 13 0 0 xinit
[ 1126] 11641 1126 80028 35379 154 0 0 X
[ 1138] 11641 1138 142989 930 122 0 0
gnome-session
[ 1151] 11641 1151 4013 64 12 0 0 dbus-launch
[ 1152] 11641 1152 6069 82 17 0 0 dbus-daemon
[ 1154] 11641 1154 85449 162 36 0 0
at-spi-bus-laun
[ 1158] 11641 1158 6103 116 17 0 0 dbus-daemon
[ 1161] 11641 1161 32328 174 33 0 0
at-spi2-registr
[ 1172] 11641 1172 4013 65 13 0 0 dbus-launch
[ 1173] 11641 1173 6350 265 18 0 0 dbus-daemon
[ 1177] 11641 1177 37416 416 29 0 0 gconfd-2
[ 1184] 11641 1184 117556 1203 44 0 0
gnome-keyring-d
[ 1185] 11641 1185 224829 2236 177 0 0
gnome-settings-
[ 1194] 0 1194 57227 786 46 0 0 upowerd
[ 1226] 11641 1226 77392 190 36 0 0 gvfsd
[ 1246] 11641 1246 118201 772 90 0 0 pulseaudio
[ 1247] 496 1247 41161 59 17 0 0
rtkit-daemon
[ 1252] 11641 1252 29494 205 58 0 0
gconf-helper
[ 1253] 106 1253 81296 355 46 0 0 colord
[ 1257] 11641 1257 59080 1574 60 0 0 openbox
[ 1258] 11641 1258 185569 3216 146 0 0 gnome-panel
[ 1264] 11641 1264 64102 229 27 0 0
dconf-service
[ 1268] 11641 1268 139203 858 116 0 0
gnome-user-shar
[ 1269] 11641 1269 268645 27442 334 0 0 pidgin
[ 1270] 11641 1270 142642 1064 117 0 0
bluetooth-apple
[ 1271] 11641 1271 193218 1775 175 0 0 nm-applet
[ 1272] 11641 1272 220194 1810 138 0 0
gnome-sound-app
[ 1285] 11641 1285 80914 632 45 0 0
gvfs-udisks2-vo
[ 1287] 0 1287 88101 599 41 0 0 udisksd
[ 1295] 11641 1295 177162 14140 150 0 0 wnck-applet
[ 1297] 11641 1297 281043 3161 199 0 0
clock-applet
[ 1299] 11641 1299 142537 1051 120 0 0
cpufreq-applet
[ 1302] 11641 1302 141960 986 113 0 0
notification-ar
[ 1340] 11641 1340 190026 6265 144 0 0
gnome-terminal
[ 1346] 11641 1346 2123 35 10 0 0
gnome-pty-helpe
[ 1347] 11641 1347 28719 253 11 0 0 bash
[ 1858] 11641 1858 10895 101 27 0 0 xfconfd
X: page allocation failure: order:0, mode:0x200da
Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
[ 2052] 11641 2052 28720 255 11 0 0 bash
[ 6239] 11641 6239 73437 711 88 0 0 kdeinit4
[ 6240] 11641 6240 83952 717 101 0 0 klauncher
Call Trace:
[<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
[<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
[<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
[<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
[<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
[<ffffffff812ad69e>] shm_fault+0x1e/0x20
[<ffffffff811571d3>] __do_fault+0x73/0x4d0
[<ffffffff81131640>] ? generic_file_aio_write+0xb0/0x100
[<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
[<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
[<ffffffff81081711>] ? get_parent_ip+0x11/0x50
[<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
[<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
[<ffffffff8115b575>] get_dump_page+0x45/0x60
[<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
[<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
[<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
[<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
[<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
[<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
[<ffffffff8100236f>] do_signal+0x3f/0x9a0
[<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
[<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
[<ffffffff815580a3>] ? error_sti+0x5/0x6
[<ffffffff81557cd1>] ? retint_signal+0x11/0x90
[<ffffffff81002d70>] do_notify_resume+0x80/0xb0
[<ffffffff81557d06>] retint_signal+0x46/0x90
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 1
CPU 1: hi: 186, btch: 31 usd: 14
active_anon:900420 inactive_anon:28978 isolated_anon:0
active_file:22 inactive_file:24 isolated_file:0
unevictable:4 dirty:5 writeback:0 unstable:0
free:20346 slab_reclaimable:8656 slab_unreclaimable:10414
mapped:18437 shmem:243751 pagetables:7717 bounce:0
free_cma:0
DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:55316kB min:51776kB low:64720kB high:77664kB
active_anon:2834992kB inactive_anon:108408kB active_file:52kB
inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:3021968kB mlocked:0kB dirty:20kB writeback:0kB mapped:65916kB
shmem:943452kB slab_reclaimable:11716kB slab_unreclaimable:8904kB
kernel_stack:488kB pagetables:11880kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:3103 all_unreclaimable? yes
lowmem_reserve[]: 0 0 885 885
Normal free:13948kB min:15532kB low:19412kB high:23296kB
active_anon:763796kB inactive_anon:6632kB active_file:36kB inactive_file:40kB
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB
mlocked:16kB dirty:0kB writeback:0kB mapped:6160kB shmem:27956kB
slab_reclaimable:22908kB slab_unreclaimable:32736kB kernel_stack:2352kB
pagetables:18988kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:602 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB
2*2048kB 1*4096kB = 12120kB
DMA32: 883*4kB 1525*8kB 513*16kB 637*32kB 109*64kB 8*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 55396kB
Normal: 269*4kB 255*8kB 227*16kB 141*32kB 30*64kB 4*128kB 1*256kB 0*512kB
0*1024kB 1*2048kB 0*4096kB = 15996kB
243797 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
553637 pages shared
943817 pages non-shared
X: page allocation failure: order:0, mode:0x200da
Pid: 1126, comm: X Not tainted 3.7.0-rc5-00007-g95e21c5 #100
Call Trace:
[<ffffffff811354e9>] warn_alloc_failed+0xe9/0x140
[<ffffffff81138eda>] __alloc_pages_nodemask+0x7fa/0xa40
[<ffffffff81148fc3>] shmem_getpage_gfp+0x603/0x9d0
[<ffffffff8100a166>] ? native_sched_clock+0x26/0x90
[<ffffffff81149d6f>] shmem_fault+0x4f/0xa0
[<ffffffff812ad69e>] shm_fault+0x1e/0x20
[<ffffffff811571d3>] __do_fault+0x73/0x4d0
[<ffffffff81159d67>] handle_pte_fault+0x97/0x9a0
[<ffffffff810aca4f>] ? __lock_is_held+0x5f/0x90
[<ffffffff81081711>] ? get_parent_ip+0x11/0x50
[<ffffffff8115ae6f>] handle_mm_fault+0x22f/0x2f0
[<ffffffff8115b12a>] __get_user_pages+0x12a/0x530
[<ffffffff815578b5>] ? _raw_spin_unlock+0x35/0x60
[<ffffffff8115b575>] get_dump_page+0x45/0x60
[<ffffffff811eec6d>] elf_core_dump+0x16bd/0x1960
[<ffffffff811edf86>] ? elf_core_dump+0x9d6/0x1960
[<ffffffff8155b529>] ? sub_preempt_count+0x79/0xd0
[<ffffffff815546ae>] ? mutex_unlock+0xe/0x10
[<ffffffff8118ed63>] ? do_truncate+0x73/0xa0
[<ffffffff811f55a1>] do_coredump+0xa21/0xeb0
[<ffffffff810b22a0>] ? debug_check_no_locks_freed+0xe0/0x170
[<ffffffff810abe8d>] ? trace_hardirqs_off+0xd/0x10
[<ffffffff8105a961>] get_signal_to_deliver+0x2e1/0x960
[<ffffffff8100236f>] do_signal+0x3f/0x9a0
[<ffffffff81540000>] ? pci_fixup_msi_k8t_onboard_sound+0x7d/0x97
[<ffffffff8154b565>] ? is_prefetch.isra.15+0x1a6/0x1fd
[<ffffffff815580a3>] ? error_sti+0x5/0x6
[<ffffffff81557cd1>] ? retint_signal+0x11/0x90
[<ffffffff81002d70>] do_notify_resume+0x80/0xb0
[<ffffffff81557d06>] retint_signal+0x46/0x90
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 0
CPU 1: hi: 186, btch: 31 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 1
CPU 1: hi: 186, btch: 31 usd: 24
active_anon:900420 inactive_anon:28978 isolated_anon:0
active_file:22 inactive_file:24 isolated_file:19
unevictable:4 dirty:5 writeback:0 unstable:0
free:20222 slab_reclaimable:8656 slab_unreclaimable:10414
mapped:18437 shmem:243751 pagetables:7717 bounce:0
free_cma:0
DMA free:12120kB min:272kB low:340kB high:408kB active_anon:2892kB
inactive_anon:872kB active_file:0kB inactive_file:0kB unevictable:0kB
isolated(anon):0kB isolated(file):0kB present:15900kB mlocked:0kB dirty:0kB
writeback:0kB mapped:1672kB shmem:3596kB slab_reclaimable:0kB
slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB
bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2951 3836 3836
DMA32 free:55316kB min:51776kB low:64720kB high:77664kB
active_anon:2834992kB inactive_anon:108408kB active_file:52kB
inactive_file:56kB unevictable:0kB isolated(anon):0kB isolated(file):0kB
present:3021968kB mlocked:0kB dirty:20kB writeback:0kB mapped:65916kB
shmem:943452kB slab_reclaimable:11716kB slab_unreclaimable:8904kB
kernel_stack:488kB pagetables:11880kB unstable:0kB bounce:0kB free_cma:0kB
writeback_tmp:0kB pages_scanned:3940 all_unreclaimable? yes
[ 6242] 11641 6242 126497 1479 172 0 0 kded4
[ 6244] 11641 6244 2977 48 11 0 0 gam_server
[10804] 11641 10804 101320 307 47 0 0 gvfsd-http
[12175] 0 12175 27197 32 10 0 0 agetty
[12249] 11641 12249 28719 252 14 0 0 bash
[14862] 0 14862 51773 344 55 0 0 cupsd
[14868] 4 14868 18105 158 39 0 0 cups-polld
[16728] 11641 16728 28691 244 12 0 0 bash
[16975] 0 16975 9109 253 23 0 -1000
systemd-udevd
[17618] 0 17618 8245 87 22 0 0
systemd-logind
[ 3133] 11641 3133 43721 132 40 0 0 su
[ 3136] 0 3136 28564 139 12 0 0 bash
[ 3983] 11641 3983 43722 134 41 0 0 su
[ 3986] 0 3986 28564 144 13 0 0 bash
[16350] 11641 16350 28691 245 14 0 0 bash
[31228] 11641 31228 28691 245 11 0 0 bash
[31922] 11641 31922 28719 250 13 0 0 bash
[ 2340] 11641 2340 28691 245 15 0 0 bash
[12586] 38 12586 7851 150 19 0 0 ntpd
[32658] 11641 32658 41192 424 35 0 0 mc
[32660] 11641 32660 28692 245 13 0 0 bash
[10971] 11641 10971 43722 133 43 0 0 su
[10974] 0 10974 28564 132 12 0 0 bash
[11343] 0 11343 28497 66 11 0 0 ksmtuned
[11387] 11641 11387 28719 254 11 0 0 bash
[11450] 11641 11450 28691 246 13 0 0 bash
[11576] 11641 11576 43722 133 40 0 0 su
[11579] 0 11579 28564 141 13 0 0 bash
[12106] 11641 12106 28691 244 12 0 0 bash
[12141] 11641 12141 43722 132 44 0 0 su
[12144] 0 12144 28564 140 11 0 0 bash
[12264] 11641 12264 28691 245 11 0 0 bash
[12299] 11641 12299 43721 133 40 0 0 su
[12302] 0 12302 28564 137 12 0 0 bash
[26024] 11641 26024 28691 245 13 0 0 bash
[26083] 11641 26083 28691 245 13 0 0 bash
[28235] 11641 28235 43721 132 42 0 0 su
[28238] 0 28238 28564 143 13 0 0 bash
[29460] 11641 29460 43721 132 42 0 0 su
[29463] 0 29463 28564 137 12 0 0 bash
[29758] 11641 29758 28720 256 12 0 0 bash
[29864] 11641 29864 41916 1153 36 0 0 mc
[29866] 11641 29866 28728 257 11 0 0 bash
[32750] 0 32750 23164 2994 47 0 0 dhclient
[ 323] 0 323 24081 471 48 0 0 sendmail
[ 347] 51 347 20347 367 38 0 0 sendmail
[ 907] 11641 907 379562 159766 707 0 0 thunderbird
[ 6340] 11641 6340 28719 251 12 0 0 bash
[ 6790] 11641 6790 80307 620 101 0 0
xfce4-notifyd
[ 6844] 0 6844 26669 23 9 0 0 sleep
Out of memory: Kill process 907 (thunderbird) score 162 or sacrifice child
Killed process 907 (thunderbird) total-vm:1518248kB, anon-rss:638476kB,
file-rss:588kB
lowmem_reserve[]: 0 0 885 885
Normal free:12832kB min:15532kB low:19412kB high:23296kB
active_anon:763796kB inactive_anon:6632kB active_file:36kB inactive_file:40kB
unevictable:16kB isolated(anon):0kB isolated(file):0kB present:906664kB
mlocked:16kB dirty:0kB writeback:0kB mapped:6160kB shmem:27956kB
slab_reclaimable:22908kB slab_unreclaimable:32736kB kernel_stack:2352kB
pagetables:18988kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:1742 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 2*4kB 2*8kB 2*16kB 1*32kB 2*64kB 1*128kB 2*256kB 2*512kB 2*1024kB
2*2048kB 1*4096kB = 12120kB
DMA32: 883*4kB 1525*8kB 513*16kB 637*32kB 109*64kB 8*128kB 0*256kB 0*512kB
1*1024kB 1*2048kB 0*4096kB = 55396kB
Normal: 270*4kB 173*8kB 198*16kB 141*32kB 30*64kB 4*128kB 1*256kB 0*512kB
0*1024kB 1*2048kB 0*4096kB = 14880kB
243797 total pagecache pages
0 pages in swap cache
Swap cache stats: add 0, delete 0, find 0/0
Free swap = 0kB
Total swap = 0kB
1032176 pages RAM
42789 pages reserved
553659 pages shared
937056 pages non-shared

SysRq : Emergency Sync
Emergency Sync complete
SysRq : Emergency Remount R/O


2012-11-18 19:08:04

by Jiri Slaby

[permalink] [raw]
Subject: Re: kswapd0: excessive CPU usage

On 11/18/2012 08:00 PM, Zdenek Kabelac wrote:
> For some reason my machine went out of memory and OOM killed
> firefox and then even whole Xsession.
>
> Unsure whether it's related to those 2 patches - but I've never had
> such OOM failure before.

As I wrote, this would be me:
https://lkml.org/lkml/2012/11/15/150

There is no -next tree for Friday yet that would contain the set.
So for now, it should be enough for you to apply:
https://lkml.org/lkml/2012/11/15/95

Or, alternatively, if you use a brand-new systemd, it likes to fork-bomb
using udev.

thanks,
--
js
suse labs

2012-11-20 01:45:01

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Fri, 16 Nov 2012 11:51:24 -0800, Andrew Morton said:
> On Fri, 16 Nov 2012 14:14:47 -0500
> Josh Boyer <[email protected]> wrote:
>
> > > The temptation is to supply a patch that checks if kswapd was woken for
> > > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> > > backed up by proper testing. As 3.7 is very close to release and this is
> > > not a bug we should release with, a safer path is to revert "mm: remove
> > > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> > > balance_pgdat() logic in general.
> > >
> > > Signed-off-by: Mel Gorman <[email protected]>
> >
> > Does anyone know if this is queued to go into 3.7 somewhere? I looked
> > a bit and can't find it in a tree. We have a few reports of Fedora
> > rawhide users hitting this.
>
> Still thinking about it. We're reverting quite a lot of material
> lately.
> mm-revert-mm-vmscan-scale-number-of-pages-reclaimed-by-reclaim-compaction-based-on-failures.patch
> and revert-mm-fix-up-zone-present-pages.patch are queued for 3.7.
>
> I'll toss this one in there as well, but I can't say I'm feeling
> terribly confident. How is Valdis's machine nowadays?

I admit possibly having lost the plot. With the two patches you mention stuck
on top of next-20121114, I'm seeing fewer kswapd issues but am still tripping
over them on occasion. It seems to be related to uptime - I don't see any for
the first few hours, but then they become more frequent. I was seeing quite a
few of them yesterday after I had a 30-hour uptime.

I'll stick Mel's "mm: remove __GFP_NO_KSWAPD" patch on this evening and let you
know what happens (might be a day or two before I have definitive results, as
usually my laptop gets rebooted twice a day).



2012-11-20 09:18:30

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On 11/12/2012 03:37 PM, Mel Gorman wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 02c1c971..d0a7967 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -31,6 +31,7 @@ struct vm_area_struct;
> #define ___GFP_THISNODE 0x40000u
> #define ___GFP_RECLAIMABLE 0x80000u
> #define ___GFP_NOTRACK 0x200000u
> +#define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u

Keep in mind that this bit has been reused in -mm.
If this patch needs to be reverted, we'll need to first change
the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
would break things.
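
For illustration, here is a minimal sketch of the collision being described,
assuming -mm had handed the 0x400000u slot to __GFP_KMEMCG at the time (that
exact value is an assumption here, not read from the -mm tree):

/* Sketch only, not the actual -mm headers. */
#define ___GFP_KMEMCG    0x400000u  /* assumed -mm definition at the time */
#define ___GFP_NO_KSWAPD 0x400000u  /* re-added by the revert: same bit */

/* Both names now alias one bit, so a kmemcg-charged allocation would also
 * be treated as "don't wake kswapd" and vice versa. The way out, visible
 * in Andrew's reply below, is to move __GFP_KMEMCG to a free slot. */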

2012-11-20 15:38:47

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <[email protected]> wrote:
> On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
>> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <[email protected]> wrote:
>> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
>> > based on failures" reverted, Zdenek Kabelac reported the following
>> >
>> > Hmm, so it just took longer to hit the problem and observe
>> > kswapd0 spinning on my CPU again - it's not as endless as before -
>> > but still it easily eats minutes - it helps to turn off Firefox
>> > or TB (memory hungry apps) so kswapd0 stops soon - and restart
>> > those apps again. (And I still have like >1GB of cached memory)
>> >
>> > kswapd0 R running task 0 30 2 0x00000000
>> > ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
>> > ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
>> > ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
>> > Call Trace:
>> > [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
>> > [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
>> > [<ffffffff81192971>] put_super+0x31/0x40
>> > [<ffffffff81192a42>] drop_super+0x22/0x30
>> > [<ffffffff81193b89>] prune_super+0x149/0x1b0
>> > [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>> >
>> > The sysrq+m indicates the system has no swap so it'll never reclaim
>> > anonymous pages as part of reclaim/compaction. That is one part of the
>> > problem but not the root cause as file-backed pages could also be reclaimed.
>> >
>> > The likely underlying problem is that kswapd is woken up or kept awake
>> > for each THP allocation request in the page allocator slow path.
>> >
>> > If compaction fails for the requesting process then compaction will be
>> > deferred for a time and direct reclaim is avoided. However, if there
>> > is a storm of THP requests that are simply rejected, it will still
>> > be the case that kswapd is awake for a prolonged period of time
>> > as pgdat->kswapd_max_order is updated each time. This is noticed by
>> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
>> > Instead it will loop, shrinking a small number of pages and calling
>> > shrink_slab() on each iteration.
>> >
>> > The temptation is to supply a patch that checks if kswapd was woken for
>> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
>> > backed up by proper testing. As 3.7 is very close to release and this is
>> > not a bug we should release with, a safer path is to revert "mm: remove
>> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
>> > balance_pgdat() logic in general.
>> >
>> > Signed-off-by: Mel Gorman <[email protected]>
>>
>> Does anyone know if this is queued to go into 3.7 somewhere? I looked
>> a bit and can't find it in a tree. We have a few reports of Fedora
>> rawhide users hitting this.
>>
>
> No, because I was waiting to hear if a) it worked and preferably if the
> alternative "less safe" option worked. This close to release it might be
> better to just go with the safe option.

We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
and people say this revert patch doesn't seem to make the issue go away
fully. Thorsten has created another kernel with the other patch applied
for testing.

At least I think that is the latest status from the bug. Hopefully the
commenters will chime in.

josh
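
A heavily simplified sketch of the kswapd() loop behaviour the changelog
above describes - this is not the real mm/vmscan.c, only its control flow
with the details condensed:

#include <linux/mmzone.h>

static int kswapd(void *p)
{
        pg_data_t *pgdat = (pg_data_t *)p;
        int order = 0;

        for ( ; ; ) {
                int new_order = pgdat->kswapd_max_order;

                pgdat->kswapd_max_order = 0;
                if (order < new_order) {
                        /* a wakeup raised the order: keep reclaiming */
                        order = new_order;
                } else {
                        /* reached only once wakeups stop arriving */
                        kswapd_try_to_sleep(pgdat, order);
                        order = 0;
                }
                /* each pass shrinks a few pages and calls shrink_slab() */
                balance_pgdat(pgdat, order);
        }
        return 0;
}

A storm of THP requests bumps pgdat->kswapd_max_order on every attempt, so
the else branch - and with it the sleep - is never reached.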

2012-11-20 16:14:16

by Bruno Wolff III

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Tue, Nov 20, 2012 at 10:38:45 -0500,
Josh Boyer <[email protected]> wrote:
>
>We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
>and people say this revert patch doesn't seem to make the issue go away
>fully. Thorsten has created another kernel with the other patch applied
>for testing.
>
>At least I think that is the latest status from the bug. Hopefully the
>commenters will chime in.

I am seeing kswapd0 hogging a CPU right now. I have two rsyncs and an md sync
running, and a couple of large-memory processes (java and firefox) sitting idle.

I haven't been seeing this happen as often as previously. Before, doing a
yum update alongside an rsync was pretty good at triggering the problem. Now,
not so much.

2012-11-20 17:43:14

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On 20.11.2012 16:38, Josh Boyer wrote:
> On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <[email protected]> wrote:
>> On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
>>> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <[email protected]> wrote:
>>>> With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
>>>> based on failures" reverted, Zdenek Kabelac reported the following
>>>>
>>>> Hmm, so it just took longer to hit the problem and observe
>>>> kswapd0 spinning on my CPU again - it's not as endless as before -
>>>> but still it easily eats minutes - it helps to turn off Firefox
>>>> or TB (memory hungry apps) so kswapd0 stops soon - and restart
>>>> those apps again. (And I still have like >1GB of cached memory)
>>>>
>>>> kswapd0 R running task 0 30 2 0x00000000
>>>> ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
>>>> ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
>>>> ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
>>>> Call Trace:
>>>> [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
>>>> [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
>>>> [<ffffffff81192971>] put_super+0x31/0x40
>>>> [<ffffffff81192a42>] drop_super+0x22/0x30
>>>> [<ffffffff81193b89>] prune_super+0x149/0x1b0
>>>> [<ffffffff81141e2a>] shrink_slab+0xba/0x510
>>>>
>>>> The sysrq+m indicates the system has no swap so it'll never reclaim
>>>> anonymous pages as part of reclaim/compaction. That is one part of the
>>>> problem but not the root cause as file-backed pages could also be reclaimed.
>>>>
>>>> The likely underlying problem is that kswapd is woken up or kept awake
>>>> for each THP allocation request in the page allocator slow path.
>>>>
>>>> If compaction fails for the requesting process then compaction will be
>>>> deferred for a time and direct reclaim is avoided. However, if there
>>>> is a storm of THP requests that are simply rejected, it will still
>>>> be the case that kswapd is awake for a prolonged period of time
>>>> as pgdat->kswapd_max_order is updated each time. This is noticed by
>>>> the main kswapd() loop and it will not call kswapd_try_to_sleep().
>>>> Instead it will loop, shrinking a small number of pages and calling
>>>> shrink_slab() on each iteration.
>>>>
>>>> The temptation is to supply a patch that checks if kswapd was woken for
>>>> THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
>>>> backed up by proper testing. As 3.7 is very close to release and this is
>>>> not a bug we should release with, a safer path is to revert "mm: remove
>>>> __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
>>>> balance_pgdat() logic in general.
>>>>
>>>> Signed-off-by: Mel Gorman <[email protected]>
>>>
>>> Does anyone know if this is queued to go into 3.7 somewhere? I looked
>>> a bit and can't find it in a tree. We have a few reports of Fedora
>>> rawhide users hitting this.
>>
>> No, because I was waiting to hear if a) it worked and preferably if the
>> alternative "less safe" option worked. This close to release it might be
>> better to just go with the safe option.
>
> We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
> and people say this revert patch doesn't seem to make the issue go away
> fully. Thorsten has created another kernel with the other patch applied
> for testing.
>
> At least I think that is the latest status from the bug. Hopefully the
> commenters will chime in.

The short story from my current point of view is:

* my main machine at home where I initially saw the issue that started
this thread seems to be running fine with rc6 and the "safe" patch Mel
posted in https://lkml.org/lkml/2012/11/12/113. Before that I ran an rc5
kernel with the revert that went into rc6 and the "safe" patch -- that
worked fine for a few days, too.

* I have a second machine where I started to use 3.7-rc kernels only
yesterday (the machine triggered a bug in the radeon driver that seems
to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac
mentions in this thread. I wasn't able to look closer at it, but simply
tried rc6 with the safe patch, which didn't help. I'm now running rc6
with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
I can't yet tell if it helps. If the problem shows up again I'll try to
capture more debugging data via sysrq -- there wasn't any time for that
when I was running rc6 with the safe patch, sorry.

Thorsten

2012-11-20 20:18:20

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Tue, 20 Nov 2012 13:18:19 +0400
Glauber Costa <[email protected]> wrote:

> On 11/12/2012 03:37 PM, Mel Gorman wrote:
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index 02c1c971..d0a7967 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -31,6 +31,7 @@ struct vm_area_struct;
> > #define ___GFP_THISNODE 0x40000u
> > #define ___GFP_RECLAIMABLE 0x80000u
> > #define ___GFP_NOTRACK 0x200000u
> > +#define ___GFP_NO_KSWAPD 0x400000u
> > #define ___GFP_OTHER_NODE 0x800000u
> > #define ___GFP_WRITE 0x1000000u
>
> Keep in mind that this bit has been reused in -mm.
> If this patch needs to be reverted, we'll need to first change
> the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
> would break things.

I presently have

/* Plain integer GFP bitmasks. Do not use this directly. */
#define ___GFP_DMA 0x01u
#define ___GFP_HIGHMEM 0x02u
#define ___GFP_DMA32 0x04u
#define ___GFP_MOVABLE 0x08u
#define ___GFP_WAIT 0x10u
#define ___GFP_HIGH 0x20u
#define ___GFP_IO 0x40u
#define ___GFP_FS 0x80u
#define ___GFP_COLD 0x100u
#define ___GFP_NOWARN 0x200u
#define ___GFP_REPEAT 0x400u
#define ___GFP_NOFAIL 0x800u
#define ___GFP_NORETRY 0x1000u
#define ___GFP_MEMALLOC 0x2000u
#define ___GFP_COMP 0x4000u
#define ___GFP_ZERO 0x8000u
#define ___GFP_NOMEMALLOC 0x10000u
#define ___GFP_HARDWALL 0x20000u
#define ___GFP_THISNODE 0x40000u
#define ___GFP_RECLAIMABLE 0x80000u
#define ___GFP_KMEMCG 0x100000u
#define ___GFP_NOTRACK 0x200000u
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u

and

#define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
#define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))

Which I think is OK?

I'd forgotten about __GFP_BITS_SHIFT. Should we do this?

--- a/include/linux/gfp.h~a
+++ a/include/linux/gfp.h
@@ -35,6 +35,7 @@ struct vm_area_struct;
#define ___GFP_NO_KSWAPD 0x400000u
#define ___GFP_OTHER_NODE 0x800000u
#define ___GFP_WRITE 0x1000000u
+/* If the above are modified, __GFP_BITS_SHIFT may need updating */

/*
* GFP bitmasks..
_
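
The arithmetic behind that question, spelled out: the highest flag above is
___GFP_WRITE = 0x1000000u = 1 << 24, so 25 bits are exactly enough and
__GFP_BITS_SHIFT = 25 still holds. An illustrative compile-time check (not
part of the patch):

#include <linux/bug.h>

/* Fails the build if the top GFP flag outgrows the bit budget, assuming
 * ___GFP_WRITE stays the highest-numbered flag. */
static inline void gfp_bits_check(void)
{
        BUILD_BUG_ON(___GFP_WRITE != (1u << (__GFP_BITS_SHIFT - 1)));
}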

2012-11-21 08:30:30

by Glauber Costa

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On 11/21/2012 12:18 AM, Andrew Morton wrote:
> On Tue, 20 Nov 2012 13:18:19 +0400
> Glauber Costa <[email protected]> wrote:
>
>> On 11/12/2012 03:37 PM, Mel Gorman wrote:
>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>> index 02c1c971..d0a7967 100644
>>> --- a/include/linux/gfp.h
>>> +++ b/include/linux/gfp.h
>>> @@ -31,6 +31,7 @@ struct vm_area_struct;
>>> #define ___GFP_THISNODE 0x40000u
>>> #define ___GFP_RECLAIMABLE 0x80000u
>>> #define ___GFP_NOTRACK 0x200000u
>>> +#define ___GFP_NO_KSWAPD 0x400000u
>>> #define ___GFP_OTHER_NODE 0x800000u
>>> #define ___GFP_WRITE 0x1000000u
>>
>> Keep in mind that this bit has been reused in -mm.
>> If this patch needs to be reverted, we'll need to first change
>> the definition of __GFP_KMEMCG (and __GFP_BITS_SHIFT as a result), or it
>> would break things.
>
> I presently have
>
> /* Plain integer GFP bitmasks. Do not use this directly. */
> #define ___GFP_DMA 0x01u
> #define ___GFP_HIGHMEM 0x02u
> #define ___GFP_DMA32 0x04u
> #define ___GFP_MOVABLE 0x08u
> #define ___GFP_WAIT 0x10u
> #define ___GFP_HIGH 0x20u
> #define ___GFP_IO 0x40u
> #define ___GFP_FS 0x80u
> #define ___GFP_COLD 0x100u
> #define ___GFP_NOWARN 0x200u
> #define ___GFP_REPEAT 0x400u
> #define ___GFP_NOFAIL 0x800u
> #define ___GFP_NORETRY 0x1000u
> #define ___GFP_MEMALLOC 0x2000u
> #define ___GFP_COMP 0x4000u
> #define ___GFP_ZERO 0x8000u
> #define ___GFP_NOMEMALLOC 0x10000u
> #define ___GFP_HARDWALL 0x20000u
> #define ___GFP_THISNODE 0x40000u
> #define ___GFP_RECLAIMABLE 0x80000u
> #define ___GFP_KMEMCG 0x100000u
> #define ___GFP_NOTRACK 0x200000u
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
>
> and
>

Hmm, I didn't realize there was also another free slot at 0x100000u.
This seems fine.

> #define __GFP_BITS_SHIFT 25 /* Room for N __GFP_FOO bits */
> #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
>
> Which I think is OK?
Yes, if we haven't increased the size of the flag-space, no need to
change it.

>
> I'd forgotten about __GFP_BITS_SHIFT. Should we do this?
>
> --- a/include/linux/gfp.h~a
> +++ a/include/linux/gfp.h
> @@ -35,6 +35,7 @@ struct vm_area_struct;
> #define ___GFP_NO_KSWAPD 0x400000u
> #define ___GFP_OTHER_NODE 0x800000u
> #define ___GFP_WRITE 0x1000000u
> +/* If the above are modified, __GFP_BITS_SHIFT may need updating */
>
This is a very helpful comment.

2012-11-21 15:08:57

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Tue, Nov 20, 2012 at 10:38:45AM -0500, Josh Boyer wrote:
> On Fri, Nov 16, 2012 at 3:06 PM, Mel Gorman <[email protected]> wrote:
> > On Fri, Nov 16, 2012 at 02:14:47PM -0500, Josh Boyer wrote:
> >> On Mon, Nov 12, 2012 at 6:37 AM, Mel Gorman <[email protected]> wrote:
> >> > With "mm: vmscan: scale number of pages reclaimed by reclaim/compaction
> >> > based on failures" reverted, Zdenek Kabelac reported the following
> >> >
> >> > Hmm, so it just took longer to hit the problem and observe
> >> > kswapd0 spinning on my CPU again - it's not as endless as before -
> >> > but still it easily eats minutes - it helps to turn off Firefox
> >> > or TB (memory hungry apps) so kswapd0 stops soon - and restart
> >> > those apps again. (And I still have like >1GB of cached memory)
> >> >
> >> > kswapd0 R running task 0 30 2 0x00000000
> >> > ffff8801331efae8 0000000000000082 0000000000000018 0000000000000246
> >> > ffff880135b9a340 ffff8801331effd8 ffff8801331effd8 ffff8801331effd8
> >> > ffff880055dfa340 ffff880135b9a340 00000000331efad8 ffff8801331ee000
> >> > Call Trace:
> >> > [<ffffffff81555bf2>] preempt_schedule+0x42/0x60
> >> > [<ffffffff81557a95>] _raw_spin_unlock+0x55/0x60
> >> > [<ffffffff81192971>] put_super+0x31/0x40
> >> > [<ffffffff81192a42>] drop_super+0x22/0x30
> >> > [<ffffffff81193b89>] prune_super+0x149/0x1b0
> >> > [<ffffffff81141e2a>] shrink_slab+0xba/0x510
> >> >
> >> > The sysrq+m indicates the system has no swap so it'll never reclaim
> >> > anonymous pages as part of reclaim/compaction. That is one part of the
> >> > problem but not the root cause as file-backed pages could also be reclaimed.
> >> >
> >> > The likely underlying problem is that kswapd is woken up or kept awake
> >> > for each THP allocation request in the page allocator slow path.
> >> >
> >> > If compaction fails for the requesting process then compaction will be
> >> > deferred for a time and direct reclaim is avoided. However, if there
> >> > is a storm of THP requests that are simply rejected, it will still
> >> > be the case that kswapd is awake for a prolonged period of time
> >> > as pgdat->kswapd_max_order is updated each time. This is noticed by
> >> > the main kswapd() loop and it will not call kswapd_try_to_sleep().
> >> > Instead it will loop, shrinking a small number of pages and calling
> >> > shrink_slab() on each iteration.
> >> >
> >> > The temptation is to supply a patch that checks if kswapd was woken for
> >> > THP and if so ignore pgdat->kswapd_max_order but it'll be a hack and not
> >> > backed up by proper testing. As 3.7 is very close to release and this is
> >> > not a bug we should release with, a safer path is to revert "mm: remove
> >> > __GFP_NO_KSWAPD" for now and revisit it with the view to ironing out the
> >> > balance_pgdat() logic in general.
> >> >
> >> > Signed-off-by: Mel Gorman <[email protected]>
> >>
> >> Does anyone know if this is queued to go into 3.7 somewhere? I looked
> >> a bit and can't find it in a tree. We have a few reports of Fedora
> >> rawhide users hitting this.
> >>
> >
> > No, because I was waiting to hear if a) it worked and preferably if the
> > alternative "less safe" option worked. This close to release it might be
> > better to just go with the safe option.
>
> We've been tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=866988
> and people say this revert patch doesn't seem to make the issue go away
> fully. Thorsten has created another kernel with the other patch applied
> for testing.
>

There is also a potential accounting bug that could be affecting this:
https://lkml.org/lkml/2012/11/20/613 . NR_FREE_PAGES affects the watermark
calculations; if it drifts too far, processes keep entering direct reclaim
and waking kswapd even when there is no need to.
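
A schematic sketch of why such a drift hurts - the real check is
__zone_watermark_ok() in mm/page_alloc.c and is more involved; only
zone_page_state() and NR_FREE_PAGES below are real kernel names, the rest
is condensed for illustration:

#include <linux/mmzone.h>
#include <linux/vmstat.h>

/* Schematic only, not the real watermark check. */
static bool watermark_ok_sketch(struct zone *z, int order, unsigned long mark)
{
        /* NR_FREE_PAGES is folded from per-cpu deltas; if the folded
         * value drifts low, this test fails even though enough pages
         * are actually free... */
        long free = zone_page_state(z, NR_FREE_PAGES);

        /* ...and every failing caller then enters direct reclaim and
         * wakes kswapd, matching the behaviour reported in this thread. */
        return free - (1L << order) >= (long)mark;
}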

--
Mel Gorman
SUSE Labs

2012-11-23 15:20:53

by Thorsten Leemhuis

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

Thorsten Leemhuis wrote on 20.11.2012 18:43:
> On 20.11.2012 16:38, Josh Boyer wrote:
>
> The short story from my current point of view is:

Quick update, in case anybody is interested:

> * my main machine at home where I initially saw the issue that started
> this thread seems to be running fine with rc6 and the "safe" patch Mel
> posted in https://lkml.org/lkml/2012/11/12/113. Before that I ran an rc5
> kernel with the revert that went into rc6 and the "safe" patch -- that
> worked fine for a few days, too.

On this machine I'm running an rc6 kernel + the fix for the accounting
bug(¹) that went into mainline ~40 hours ago + the "riskier" patch Mel
posted in https://lkml.org/lkml/2012/11/12/151

Up to now everything works fine.

(¹) https://lkml.org/lkml/2012/11/21/362

> * I have a second machine where I started to use 3.7-rc kernels only
> yesterday (the machine triggered a bug in the radeon driver that seems
> to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac
> mentions in this thread. I wasn't able to look closer at it, but simply
> tried rc6 with the safe patch, which didn't help. I'm now running rc6
> with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
> I can't yet tell if it helps. If the problem shows up again I'll try to
> capture more debugging data via sysrq -- there wasn't any time for that
> when I was running rc6 with the safe patch, sorry.

This machine is now also behaving fine with the above-mentioned rc6 kernel +
the two patches. It seems the accounting bug was the root cause for the
problems this machine showed.

CU
Thorsten

2012-11-27 11:12:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] Revert "mm: remove __GFP_NO_KSWAPD"

On Fri, Nov 23, 2012 at 04:20:48PM +0100, Thorsten Leemhuis wrote:
> Thorsten Leemhuis wrote on 20.11.2012 18:43:
> > On 20.11.2012 16:38, Josh Boyer wrote:
> >
> > The short story from my current point of view is:
>
> Quick update, in case anybody is interested:
>
> > * my main machine at home where I initially saw the issue that started
> > this thread seems to be running fine with rc6 and the "safe" patch Mel
> > posted in https://lkml.org/lkml/2012/11/12/113. Before that I ran an rc5
> > kernel with the revert that went into rc6 and the "safe" patch -- that
> > worked fine for a few days, too.
>
> On this machine I'm running an rc6 kernel + the fix for the accounting
> bug(¹) that went into mainline ~40 hours ago + the "riskier" patch Mel
> posted in https://lkml.org/lkml/2012/11/12/151
>
> Up to now everything works fine.
>
> (¹) https://lkml.org/lkml/2012/11/21/362
>

That's good news, thanks for the follow-up. Maybe 3.7 will not be a complete
disaster with respect to THP after all this.

The riskier patch was not picked up simply because it was riskier and
would still be vulnerable to the effective infinite loop Johannes found in
kswapd. It'll all need to be revisited.

> > * I have a second machine where I started to use 3.7-rc kernels only
> > yesterday (the machine triggered a bug in the radeon driver that seems
> > to be fixed in rc6) which showed symptoms like the ones Zdenek Kabelac
> > mentions in this thread. I wasn't able to look closer at it, but simply
> > tried rc6 with the safe patch, which didn't help. I'm now running rc6
> > with the "riskier" patch from https://lkml.org/lkml/2012/11/12/151
> > I can't yet tell if it helps. If the problem shows up again I'll try to
> > capture more debugging data via sysrq -- there wasn't any time for that
> > when I was running rc6 with the safe patch, sorry.
>
> This machine is now also behaving fine with the above-mentioned rc6 kernel +
> the two patches. It seems the accounting bug was the root cause for the
> problems this machine showed.
>

For some yes, for others no. Others are getting stuck in effectively
infinite loops in kswapd, and the trigger cases are different although
the symptoms look similar.

Thanks again.

--
Mel Gorman
SUSE Labs