2019-01-26 20:01:10

by Pavel Machek

Subject: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Hi!

With the modern web, 100% CPU load is no longer uncommon, but this time
chromium is not to blame:

pavel@amd:/data/l/linux-next-32$ uname -a
Linux amd 5.0.0-rc2-next-20190117 #214 SMP Fri Jan 18 09:47:18 CET 2019 i686 GNU/Linux

top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
%Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  608 root      20   0       0      0      0 R  99.6  0.0  11:34.38 kcompactd0
 9782 root      20   0       0      0      0 I   7.9  0.0   0:59.02 kworker/0:+
 2971 root      20   0   46624  23076  13576 S   4.3  0.8   2:50.22 Xorg



--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2019-01-27 02:57:48

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:

> top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 608 root 20 0 0 0 0 R 99.6 0.0 11:34.38 kcompactd0
> 9782 root 20 0 0 0 0 I 7.9 0.0 0:59.02 kworker/0:
> 2971 root 20 0 46624 23076 13576 S 4.3 0.8 2:50.22 Xorg

I've noticed this as well on earlier kernels (next-20181224 to 20190115)

Some more info:

1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.

2) Typical kcompactd traceback:

cat /proc/27/stack
[<0>] retint_kernel+0x1b/0x2d
[<0>] lock_is_held_type+0x1b/0x50
[<0>] ___might_sleep+0xad/0x220
[<0>] __might_sleep+0x113/0x130
[<0>] on_each_cpu_cond_mask+0x12a/0x140
[<0>] on_each_cpu_cond+0x18/0x20
[<0>] invalidate_bh_lrus+0x29/0x30
[<0>] __buffer_migrate_page+0x154/0x340
[<0>] buffer_migrate_page_norefs+0x14/0x20
[<0>] move_to_new_page+0x8e/0x360
[<0>] migrate_pages+0x3cc/0xfd8
[<0>] compact_zone+0xb70/0x1380
[<0>] kcompactd_do_work+0x15b/0x500
[<0>] kcompactd+0x74/0x340
[<0>] kthread+0x158/0x170
[<0>] ret_from_fork+0x3a/0x50
[<0>] 0xffffffffffffffff

I've also seen khugepaged hung up:

cat /proc/29/stack
[<0>] ___preempt_schedule+0x16/0x18
[<0>] page_vma_mapped_walk+0x60/0x840
[<0>] remove_migration_pte+0x67/0x390
[<0>] rmap_walk_file+0x186/0x380
[<0>] rmap_walk+0xa3/0xd0
[<0>] remove_migration_ptes+0x69/0x70
[<0>] migrate_pages+0xb6d/0xfd8
[<0>] compact_zone+0xb70/0x1370
[<0>] compact_zone_order+0xd8/0x120
[<0>] try_to_compact_pages+0xe5/0x550
[<0>] __alloc_pages_direct_compact+0x6d/0x1a0
[<0>] __alloc_pages_slowpath+0x6c9/0x1640
[<0>] __alloc_pages_nodemask+0x558/0x5b0
[<0>] khugepaged+0x499/0x810
[<0>] kthread+0x158/0x170
[<0>] ret_from_fork+0x3a/0x50
[<0>] 0xffffffffffffffff

Looks like something has gone astray with compact_zone.


2019-01-27 14:10:07

by Mel Gorman

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Adding Jan Kara to cc because the lockup appears to be within
buffer_migrate_page_norefs, which changed recently.

On Sat, Jan 26, 2019 at 09:56:53PM -0500, [email protected] wrote:
> On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:
>
> > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 608 root 20 0 0 0 0 R 99.6 0.0 11:34.38 kcompactd0
> > 9782 root 20 0 0 0 0 I 7.9 0.0 0:59.02 kworker/0:
> > 2971 root 20 0 46624 23076 13576 S 4.3 0.8 2:50.22 Xorg
>
> I've noticed this as well on earlier kernels (next-20181224 to 20190115)
>
> Some more info:
>
> 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
>
> 2) Typical kcompactd traceback:
>
> cat /proc/27/stack
> [<0>] retint_kernel+0x1b/0x2d
> [<0>] lock_is_held_type+0x1b/0x50
> [<0>] ___might_sleep+0xad/0x220
> [<0>] __might_sleep+0x113/0x130
> [<0>] on_each_cpu_cond_mask+0x12a/0x140
> [<0>] on_each_cpu_cond+0x18/0x20
> [<0>] invalidate_bh_lrus+0x29/0x30
> [<0>] __buffer_migrate_page+0x154/0x340
> [<0>] buffer_migrate_page_norefs+0x14/0x20
> [<0>] move_to_new_page+0x8e/0x360
> [<0>] migrate_pages+0x3cc/0xfd8
> [<0>] compact_zone+0xb70/0x1380
> [<0>] kcompactd_do_work+0x15b/0x500
> [<0>] kcompactd+0x74/0x340
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
>
> I've also seen khugepaged hung up:
>
> cat /proc/29/stack
> [<0>] ___preempt_schedule+0x16/0x18
> [<0>] page_vma_mapped_walk+0x60/0x840
> [<0>] remove_migration_pte+0x67/0x390
> [<0>] rmap_walk_file+0x186/0x380
> [<0>] rmap_walk+0xa3/0xd0
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] __alloc_pages_direct_compact+0x6d/0x1a0
> [<0>] __alloc_pages_slowpath+0x6c9/0x1640
> [<0>] __alloc_pages_nodemask+0x558/0x5b0
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
>
> Looks like something has gone astray with compact_zone.
>

--
Mel Gorman
SUSE Labs

2019-01-27 14:16:19

by Mel Gorman

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Sat, Jan 26, 2019 at 09:56:53PM -0500, [email protected] wrote:
> On Sat, 26 Jan 2019 21:00:05 +0100, Pavel Machek said:
>
> > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> >
> > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > 608 root 20 0 0 0 0 R 99.6 0.0 11:34.38 kcompactd0
> > 9782 root 20 0 0 0 0 I 7.9 0.0 0:59.02 kworker/0:
> > 2971 root 20 0 46624 23076 13576 S 4.3 0.8 2:50.22 Xorg
>
> I've noticed this as well on earlier kernels (next-20181224 to 20190115)
>
> Some more info:
>
> 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
>

This aspect is curious as it indicates that kcompactd could potentially
be looping infinitely, but it's not something I've experienced myself. By
any chance is there a predictable reproduction case for this?

> I've also seen khugepaged hung up:
>
> cat /proc/29/stack
> [<0>] ___preempt_schedule+0x16/0x18
> [<0>] page_vma_mapped_walk+0x60/0x840
> [<0>] remove_migration_pte+0x67/0x390
> [<0>] rmap_walk_file+0x186/0x380
> [<0>] rmap_walk+0xa3/0xd0
> [<0>] remove_migration_ptes+0x69/0x70
> [<0>] migrate_pages+0xb6d/0xfd8
> [<0>] compact_zone+0xb70/0x1370
> [<0>] compact_zone_order+0xd8/0x120
> [<0>] try_to_compact_pages+0xe5/0x550
> [<0>] __alloc_pages_direct_compact+0x6d/0x1a0
> [<0>] __alloc_pages_slowpath+0x6c9/0x1640
> [<0>] __alloc_pages_nodemask+0x558/0x5b0
> [<0>] khugepaged+0x499/0x810
> [<0>] kthread+0x158/0x170
> [<0>] ret_from_fork+0x3a/0x50
> [<0>] 0xffffffffffffffff
>
> Looks like something has gone astray with compact_zone.
>

It's a possibility that the buffer aspect of the trace is a red herring
and there is some corner case that prevents the migration scanner and the
free scanner from meeting and compaction from exiting. Again, a reproduction case of
some sort would be nice or an indication of how long it takes to
trigger. An update of the series is due which may or may not fix this
but if it doesn't, we'll need to start tracing this to see what's going
on at the point of failure.

--
Mel Gorman
SUSE Labs

2019-01-27 16:00:56

by Pavel Machek

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Hi!

> > > top - 13:38:51 up 1:42, 16 users, load average: 1.41, 1.93, 1.62
> > > Tasks: 182 total, 3 running, 138 sleeping, 0 stopped, 0 zombie
> > > %Cpu(s): 2.3 us, 57.8 sy, 0.0 ni, 39.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> > > KiB Mem: 3020044 total, 2429420 used, 590624 free, 27468 buffers
> > > KiB Swap: 2097148 total, 0 used, 2097148 free. 1924268 cached Mem
> > >
> > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> > > 608 root 20 0 0 0 0 R 99.6 0.0 11:34.38 kcompactd0
> > > 9782 root 20 0 0 0 0 I 7.9 0.0 0:59.02 kworker/0:
> > > 2971 root 20 0 46624 23076 13576 S 4.3 0.8 2:50.22 Xorg
> >
> > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> >
> > Some more info:
> >
> > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> >
>
> This aspect is curious as it indicates that kcompactd could potentially
> be infinite looping but it's not something I've experienced myself. By
> any chance is there a predictable reproduction case for this?

I've seen it exactly once, so I'm not sure how reproducible this is. x86-32
machine, running the chromium browser, so yes, there was some swapping
involved.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2019-01-27 21:39:14

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > Some more info:
> > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > This aspect is curious as it indicates that kcompactd could potentially
> > be infinite looping but it's not something I've experienced myself. By
> > any chance is there a predictable reproduction case for this?
>
> I seen it exactly once, so not sure how reproducible this is. x86-32
> machine, running chromium browser, so yes, there was some swapping
> involved.

I don't have a surefire replicator, but my laptop (x86_64, so it's not a 32-bit
only issue) triggers it fairly often, up to multiple times a day. Doesn't seem to
be just the Chrome browser that triggers it - usually I'm doing other stuff as
well, like a compile or similar. The fact that 'drop_caches' clears it makes me
wonder if we're hitting a corner case where cache data isn't being automatically
cleared and clogging something up.

Any particular diagnostic info you want me to get next time it hits? (Am
currently on next-20190125, if that matters).


2019-01-28 09:18:58

by Jan Kara

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Sun 27-01-19 16:36:34, [email protected] wrote:
> On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > Some more info:
> > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > This aspect is curious as it indicates that kcompactd could potentially
> > > be infinite looping but it's not something I've experienced myself. By
> > > any chance is there a predictable reproduction case for this?
> >
> > I seen it exactly once, so not sure how reproducible this is. x86-32
> > machine, running chromium browser, so yes, there was some swapping
> > involved.
>
> I don't have a surefire replicator, but my laptop (x86_64, so it's not a 32-bit
> only issue) triggers it fairly often, up to multiple times a day. Doesn't seem to
> be just the Chrome browser that triggers it - usually I'm doing other stuff as
> well, like a compile or similar. The fact that 'drop_caches' clears it makes me
> wonder if we're hitting a corner case where cache data isn't being automatically
> cleared and clogging something up.

So my buffer_migrate_page_norefs() is certainly buggy in its current
incarnation (as a result, block device page cache is not migratable at all).
I sent Andrew a patch over a week ago but so far it has been ignored. The patch
is attached; can you try it and see whether it changes anything for you?
Thanks!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
0001-mm-migrate-Make-buffer_migrate_page_norefs-actually-.patch (2.32 kB)

2019-01-28 10:57:59

by Sergey Senozhatsky

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On (01/28/19 10:16), Jan Kara wrote:
> On Sun 27-01-19 16:36:34, [email protected] wrote:
> > On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > > Some more info:
> > > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > > This aspect is curious as it indicates that kcompactd could potentially
> > > > be infinite looping but it's not something I've experienced myself. By
> > > > any chance is there a predictable reproduction case for this?
> > >
> > > I seen it exactly once, so not sure how reproducible this is. x86-32
> > > machine, running chromium browser, so yes, there was some swapping
> > > involved.
> >
> > I don't have a surefire replicator, but my laptop (x86_64, so it's not a 32-bit
> > only issue) triggers it fairly often, up to multiple times a day. Doesn't seem to
> > be just the Chrome browser that triggers it - usually I'm doing other stuff as
> > well, like a compile or similar. The fact that 'drop_caches' clears it makes me
> > wonder if we're hitting a corner case where cache data isn't being automatically
> > cleared and clogging something up.
>
> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!

Hello Jan,

Just a note:
I'm seeing the same problems on my x86 box [1]. I don't have a reproducer
for the issue yet, but I will try to test your patch.

Thanks.

[1] https://lore.kernel.org/lkml/20190128085747.GA14454@jagdpanzerIV/T/#u

-ss

2019-01-28 11:04:45

by Mel Gorman

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Mon, Jan 28, 2019 at 10:16:27AM +0100, Jan Kara wrote:
> On Sun 27-01-19 16:36:34, [email protected] wrote:
> > On Sun, 27 Jan 2019 17:00:27 +0100, Pavel Machek said:
> > > > > I've noticed this as well on earlier kernels (next-20181224 to 20190115)
> > > > > Some more info:
> > > > > 1) echo 3 > /proc/sys/vm/drop_caches unwedges kcompactd in 1-3 seconds.
> > > > This aspect is curious as it indicates that kcompactd could potentially
> > > > be infinite looping but it's not something I've experienced myself. By
> > > > any chance is there a predictable reproduction case for this?
> > >
> > > I seen it exactly once, so not sure how reproducible this is. x86-32
> > > machine, running chromium browser, so yes, there was some swapping
> > > involved.
> >
> > I don't have a surefire replicator, but my laptop (x86_64, so it's not a 32-bit
> > only issue) triggers it fairly often, up to multiple times a day. Doesn't seem to
> > be just the Chrome browser that triggers it - usually I'm doing other stuff as
> > well, like a compile or similar. The fact that 'drop_caches' clears it makes me
> > wonder if we're hitting a corner case where cache data isn't being automatically
> > cleared and clogging something up.
>
> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!
>

Definitely worth trying, and hopefully both the migration and compaction
patches sync up soon. In the event this patch does not help, I would
appreciate the following:

1) A trace while kcompactd is pegged at 100%

trace-cmd record -a -e compaction -e migrate -e kmem:mm_page_alloc -e vmscan:mm_vmscan_kswapd_wake -e vmscan:mm_vmscan_kswapd_sleep sleep 10

Compress the resulting trace.dat and email it to me. If it's too big
for a reasonable email, drop "-e kmem:mm_page_alloc" from the command
line and it should be a more reasonable size. If not, reduce the sleep
time to gather a shorter interval.

2) Sample stack traces of kcompactd while pegged at 100%

echo -n > /tmp/kcompactd-stack; for i in `seq 1 100`; do echo sample $i >> /tmp/kcompactd-stack; cat /proc/`pidof kcompactd0`/stack >> /tmp/kcompactd-stack; done; gzip -f /tmp/kcompactd-stack

And mail me the resulting /tmp/kcompactd-stack.gz
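
The sampling one-liner in 2) can also be written as a small function, which is easier to adjust. This is only a sketch: the name `sample_stack`, its arguments, and the optional delay between samples are assumptions made for illustration and are not part of the request above. Reading /proc/<pid>/stack normally requires root.

```shell
#!/bin/sh
# sample_stack SRC COUNT OUT [DELAY]
# Append COUNT labelled copies of the file SRC (for example
# /proc/<pid>/stack) to OUT. DELAY (seconds between samples, default 0)
# is an illustrative addition; the one-liner above samples back to back.
sample_stack() {
    src=$1 count=$2 out=$3 delay=${4:-0}
    : > "$out"                      # start with an empty output file
    i=1
    while [ "$i" -le "$count" ]; do
        echo "sample $i" >> "$out"
        # A failed read (thread gone, no permission) is recorded, not fatal.
        cat "$src" >> "$out" 2>/dev/null || echo "(unreadable)" >> "$out"
        [ "$delay" = 0 ] || sleep "$delay"
        i=$((i + 1))
    done
}
```

Run as root, something like `sample_stack /proc/$(pidof kcompactd0)/stack 100 /tmp/kcompactd-stack && gzip -f /tmp/kcompactd-stack` should produce the same file as the one-liner.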

Thanks.

--
Mel Gorman
SUSE Labs

2019-01-30 01:07:34

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:

> So my buffer_migrate_page_norefs() is certainly buggy in its current
> incarnation (as a result block device page cache is not migratable at all).
> I've sent Andrew a patch over week ago but so far it got ignored. The patch
> is attached, can you give it a try whether it changes something for you?
> Thanks!

Been running with the patch for about 24 hours, haven't seen kcompactd
misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a
kernel build, intentionally drove the system into swapping, and kcompactd
didn't make it into the top 10 on 'top'.

I'm willing to say put a "tested-by:" on that one, it looks fixed from here.
If there's any remaining bugs, they're ones I can't seem to trigger...


2019-01-30 04:30:04

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Tue, 29 Jan 2019 20:06:39 -0500, [email protected] said:
> On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
>
> > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > incarnation (as a result block device page cache is not migratable at all).
> > I've sent Andrew a patch over week ago but so far it got ignored. The patch
> > is attached, can you give it a try whether it changes something for you?
> > Thanks!
>
> Been running with the patch for about 24 hours, haven't seen kcompactd
> misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a
> kernel build, intentionally drove the system into swapping, and kcompactd
> didn't make it into the top 10 on 'top'.
>
> I'm willing to say put a "tested-by:" on that one, it looks fixed from here.
> If there's any remaining bugs, they're ones I can't seem to trigger...

Spoke too soon. Sitting here not stressing the laptop at all, plenty of free
memory, and ka-blam.

Will keep my eyes open and do the data gathering Mel Gorman wanted - I discovered
too late that trace-cmd wasn't installed, and things broke free by themselves (probably
not coincidence that I launched a terminal window and then it cleared....)

top - 23:24:03 up 2:19, 1 user, load average: 2.70, 2.00, 1.55
Tasks: 221 total, 3 running, 218 sleeping, 0 stopped, 0 zombie
%Cpu(s): 15.6 us, 67.3 sy, 0.0 ni, 9.5 id, 0.0 wa, 5.6 hi, 2.0 si, 0.0 st
GiB Mem : 7.6 total, 2.7 free, 3.1 used, 1.8 buff/cache
GiB Swap: 8.0 total, 8.0 free, 0.0 used. 4.1 avail Mem

  PID  PPID %MEM  PR  NI S    VIRT    RES    SHR   SWAP  UID %CPU   TIME+ COMMAND
   27     2  0.0  20   0 R    0.0m   0.0m   0.0m   0.0m    0 78.5 2:11.91 kcompactd0


2019-01-30 10:41:21

by Mel Gorman

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Tue, Jan 29, 2019 at 11:29:37PM -0500, [email protected] wrote:
> On Tue, 29 Jan 2019 20:06:39 -0500, [email protected] said:
> > On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
> >
> > > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > > incarnation (as a result block device page cache is not migratable at all).
> > > I've sent Andrew a patch over week ago but so far it got ignored. The patch
> > > is attached, can you give it a try whether it changes something for you?
> > > Thanks!
> >
> > Been running with the patch for about 24 hours, haven't seen kcompactd
> > misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a
> > kernel build, intentionally drove the system into swapping, and kcompactd
> > didn't make it into the top 10 on 'top'.
> >
> > I'm willing to say put a "tested-by:" on that one, it looks fixed from here.
> > If there's any remaining bugs, they're ones I can't seem to trigger...
>
> Spoke too soon. Sitting here not stressing the laptop at all, plenty of free
> memory, and ka-blam.
>
> Will keep my eyes open and do the data gathering Mel Gorman wanted - I discovered
> too late that trace-cmd wasn't installed, and things broke free by themselves (probably
> not coincidence that I launched a terminal window and then it cleared....)
>

That's unfortunate. I also note that linux-next still has not been
updated with the latest version of the compaction series. Nevertheless,
it might be helpful to get the output of

grep -r . /sys/kernel/mm/transparent_hugepage/*

and the trace when the system is in normal use but kcompactd has not
pegged at 100%. At minimum, I'd like to see what the sources of high-order
allocations are and the likely causes of wakeups of kcompactd in case
there are any hints there. Your Kconfig is also potentially useful.
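
The grep above prints every tunable under that directory. For a quick look at just the active mode, the bracketed word in files such as `enabled` and `defrag` can be pulled out as sketched below. The helper names `thp_active` and `thp_report` are made up for this example, and the sketch assumes the usual sysfs convention of marking the selected value with square brackets.

```shell
#!/bin/sh
# thp_active FILE
# Print the bracketed (currently selected) word from a sysfs-style
# multiple-choice file, e.g. "always [madvise] never" prints "madvise".
thp_active() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# thp_report [DIR]
# Print the active value of the "enabled" and "defrag" files in DIR
# (default: the transparent_hugepage sysfs directory named above).
thp_report() {
    dir=${1:-/sys/kernel/mm/transparent_hugepage}
    for f in enabled defrag; do
        [ -r "$dir/$f" ] && printf '%s: %s\n' "$f" "$(thp_active "$dir/$f")"
    done
    return 0
}
```

Calling `thp_report` with no argument reads the live settings; passing a directory makes it easy to try against saved copies of the files.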

Thanks.

--
Mel Gorman
SUSE Labs

2021-01-26 10:08:55

by Tibor Bana

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Greetings!

I don't know if this is still relevant, but I am struggling with this problem right now and searching the internet for solutions.
I read the thread and saw that you were struggling to reproduce the problem, and I can reproduce it almost every day.

- Install VMware Player and a Linux guest.
- Configure the virtual machine with a good amount of memory and CPU.
- Run resource-intensive tasks on the guest.
- When the host has used up almost all of its memory and starts to reuse caches, kcompactd will kick in.

As far as I know the problem is related to transparent huge pages, but I tried to disable them.
Today I saw the problem again, and kcompactd showed an interesting status in top: it wasn't using any memory (all zeroes), but it used up one core completely.

My machine is a Core i7 with 4 physical cores, hyper-threading, and 24 GB of memory:
5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 +0000 x86_64 GNU/Linux

I hope this helps to pin down the problem.

Tibor Bana

On Wed, 30 Jan 2019 10:40:20 +0000
Mel Gorman <[email protected]> wrote:

> On Tue, Jan 29, 2019 at 11:29:37PM -0500, [email protected] wrote:
> > On Tue, 29 Jan 2019 20:06:39 -0500, [email protected] said:
> > > On Mon, 28 Jan 2019 10:16:27 +0100, Jan Kara said:
> > >
> > > > So my buffer_migrate_page_norefs() is certainly buggy in its current
> > > > incarnation (as a result block device page cache is not migratable at all).
> > > > I've sent Andrew a patch over week ago but so far it got ignored. The patch
> > > > is attached, can you give it a try whether it changes something for you?
> > > > Thanks!
> > >
> > > Been running with the patch for about 24 hours, haven't seen kcompactd
> > > misbehave. I even fired up a Chrome with a lot of tabs open, a Firefox, and a
> > > kernel build, intentionally drove the system into swapping, and kcompactd
> > > didn't make it into the top 10 on 'top'.
> > >
> > > I'm willing to say put a "tested-by:" on that one, it looks fixed from here.
> > > If there's any remaining bugs, they're ones I can't seem to trigger...
> >
> > Spoke too soon. Sitting here not stressing the laptop at all, plenty of free
> > memory, and ka-blam.
> >
> > > Will keep my eyes open and do the data gathering Mel Gorman wanted - I discovered
> > too late that trace-cmd wasn't installed, and things broke free by themselves (probably
> > not coincidence that I launched a terminal window and then it cleared....)
> >
>
> That's unfortunate. I also note that linux-next still has not been
> updated with the latest version of the compaction series. Nevertheless,
> it might be helpful to get the output of
>
> grep -r . /sys/kernel/mm/transparent_hugepage/*
>
> and the trace when the system is in normal use but kcompactd has not
> pegged at 100%. At minimum, I'd like to see what the sources of high-order
> allocations are and the likely causes of wakeups of kcompactd in case
> there are any hints there. Your Kconfig is also potentially useful.
>
> Thanks.
>
> --
> Mel Gorman
> SUSE Labs


--
Tibor Bana <[email protected]>

2021-01-27 06:06:51

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Mon, 25 Jan 2021 19:54:38 +0100, Tibor Bana said:

> I don't know if this is still relevant, but I am struggling with this problem right
> now and searching the internet for solutions. I read the thread and saw that
> you were struggling to reproduce the problem, and I can reproduce it almost every
> day.

I'm pretty sure that you have a real bug on your hands. Even if your box
is very low on memory, kcompactd should eventually figure out it's not
making any progress and wait for the situation to change before trying again.

However, I'm also pretty sure that it's a different one than the one we were
chasing, because that one never showed up again once all the patches landed in
linux-next, some 18 months before 5.9 was released.

2021-01-27 06:10:06

by Mel Gorman

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> Greetings!
>
> I don't know if this is still relevant, but I am struggling with this problem right now and searching the internet for solutions.
> I read the thread and saw that you were struggling to reproduce the problem, and I can reproduce it almost every day.
>
> - Install VMware Player and a Linux guest.
> - Configure the virtual machine with a good amount of memory and CPU.
> - Run resource-intensive tasks on the guest.
> - When the host has used up almost all of its memory and starts to reuse caches, kcompactd will kick in.
>
> As far as I know the problem is related to transparent huge pages, but I tried to disable them.
> Today I saw the problem again, and kcompactd showed an interesting status in top: it wasn't using any memory (all zeroes), but it used up one core completely.
>
> My machine is a Core i7 with 4 physical cores, hyper-threading, and 24 GB of memory:
> 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 +0000 x86_64 GNU/Linux
>
> I hope this helps to pin down the problem.
>

Is 5.10.10 affected? It included two relevant patches related to halting
compaction:

d20bdd571ee5c9966191568527ecdb1bd4b52368 mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
38935861d85a4d9a353d1dd5a156c97700e2765d mm/compaction: count pages and stop correctly during page isolation
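
To see whether a machine is already running a release at least as new as 5.10.10, the version strings can be compared as sketched below. The helper names `kernel_base` and `version_ge` are made up for this example, and the comparison relies on GNU `sort -V`. It only compares version numbers: it does not prove that a given distribution kernel carries the two commits above, and an older distribution kernel may have them backported.

```shell
#!/bin/sh
# kernel_base RELEASE
# Strip any suffix after the first "-", e.g. "5.9.11-arch2-1" -> "5.9.11".
kernel_base() {
    printf '%s\n' "${1%%-*}"
}

# version_ge A B
# Succeed if dotted version A is greater than or equal to B.
# Uses GNU sort -V (version sort) to find the smaller of the two.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}
```

For example, `version_ge "$(kernel_base "$(uname -r)")" 5.10.10 && echo "at least 5.10.10"` checks the running kernel.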

--
Mel Gorman
SUSE Labs

2021-01-28 01:39:32

by Tibor Bana

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

Hi,
Sorry for the delay. I had time to do a full system upgrade yesterday
evening, and fortunately Arch Linux already ships 5.10.10, so today I used
my computer as usual to test it. I haven't experienced the symptoms,
but ever since I disabled transparent huge pages the problem has only
shown up sporadically. If I face it again I will let you know.

On Tue, Jan 26, 2021 at 10:17 AM Mel Gorman <[email protected]> wrote:
>
> On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> > Greetings!
> >
> > I don't know if this is still relevant, but I am struggling with this problem right now and searching the internet for solutions.
> > I read the thread and saw that you were struggling to reproduce the problem, and I can reproduce it almost every day.
> >
> > - Install VMware Player and a Linux guest.
> > - Configure the virtual machine with a good amount of memory and CPU.
> > - Run resource-intensive tasks on the guest.
> > - When the host has used up almost all of its memory and starts to reuse caches, kcompactd will kick in.
> >
> > As far as I know the problem is related to transparent huge pages, but I tried to disable them.
> > Today I saw the problem again, and kcompactd showed an interesting status in top: it wasn't using any memory (all zeroes), but it used up one core completely.
> >
> > My machine is a Core i7 with 4 physical cores, hyper-threading, and 24 GB of memory:
> > 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 +0000 x86_64 GNU/Linux
> >
> > I hope this helps to pin down the problem.
> >
>
> Is 5.10.10 affected because it included two patches related to halting
> compaction that are relevant.
>
> d20bdd571ee5c9966191568527ecdb1bd4b52368 mm/compaction: stop isolation if too many pages are isolated and we have pages to migrate
> 38935861d85a4d9a353d1dd5a156c97700e2765d mm/compaction: count pages and stop correctly during page isolation
>
> --
> Mel Gorman
> SUSE Labs

2021-02-16 12:38:43

by Jason A. Donenfeld

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Mon, Jan 25, 2021 at 07:54:38PM +0100, Tibor Bana wrote:
> Greetings!
>
> I don't know if this is still relevant, but I am struggling with this problem right now and searching the internet for solutions.
> I read the thread and saw that you were struggling to reproduce the problem, and I can reproduce it almost every day.
>
> - Install VMware Player and a Linux guest.
> - Configure the virtual machine with a good amount of memory and CPU.
> - Run resource-intensive tasks on the guest.
> - When the host has used up almost all of its memory and starts to reuse caches, kcompactd will kick in.
>
> As far as I know the problem is related to transparent huge pages, but I tried to disable them.
> Today I saw the problem again, and kcompactd showed an interesting status in top: it wasn't using any memory (all zeroes), but it used up one core completely.
>
> My machine is a Core i7 with 4 physical cores, hyper-threading, and 24 GB of memory:
> 5.9.11-arch2-1 #1 SMP PREEMPT Sat, 28 Nov 2020 02:07:22 +0000 x86_64 GNU/Linux

Another anecdote: 5.11.0, 64 gigs of RAM. If I run QEMU/KVM for a VM
with 16 gigs at the same time as a VMware VM with 16 gigs of RAM,
kcompactd goes wild and both VMs get really slow. The key here is running
KVM at the same time as VMware.

2021-02-16 22:35:00

by Valdis Klētnieks

Subject: Re: [regression -next0117] What is kcompactd and why is he eating 100% of my cpu?

On Tue, 16 Feb 2021 13:36:22 +0100, "Jason A. Donenfeld" said:

> Another anecdote: 5.11.0, 64 gigs of RAM. If I run QEMU/KVM for a VM
> with 16 gigs at the same time as a VMware VM with 16 gigs of RAM,
> kcompactd goes wild and both VMs get really slow. The key here is running
> KVM at the same time as VMware.

Do things operate as expected if there are 2 KVM instances, or 2 VMware
instances?

