2015-12-01 13:34:37

by Minchan Kim

Subject: memcg uncharge page counter mismatch

With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
several times. I couldn't see it after reverting the new THP refcount
redesign.

I will try to make a reproducer when I have time, but I'm not sure I can.
Until then, I hope someone can track it down.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 1340 at mm/page_counter.c:26 page_counter_cancel+0x34/0x40()
Modules linked in:
CPU: 0 PID: 1340 Comm: madvise_test Not tainted 4.4.0-rc2-mm1-kirill+ #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffffffff81782eeb ffff880072b97be8 ffffffff8126f476 0000000000000000
ffff880072b97c20 ffffffff8103e476 ffff88006b35d0b0 00000000000001fe
0000000000000000 00000000000001fe ffff88006b35d000 ffff880072b97c30
Call Trace:
[<ffffffff8126f476>] dump_stack+0x44/0x5e
[<ffffffff8103e476>] warn_slowpath_common+0x86/0xc0
[<ffffffff8103e56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8114c754>] page_counter_cancel+0x34/0x40
[<ffffffff8114c852>] page_counter_uncharge+0x22/0x30
[<ffffffff8114fe17>] uncharge_batch+0x47/0x140
[<ffffffff81150033>] uncharge_list+0x123/0x190
[<ffffffff8115222b>] mem_cgroup_uncharge_list+0x1b/0x20
[<ffffffff810fe9bb>] release_pages+0xdb/0x350
[<ffffffff8113044d>] free_pages_and_swap_cache+0x9d/0x120
[<ffffffff8111a546>] tlb_flush_mmu_free+0x36/0x60
[<ffffffff8111b63c>] tlb_finish_mmu+0x1c/0x50
[<ffffffff81125f38>] exit_mmap+0xd8/0x130
[<ffffffff8103bd56>] mmput+0x56/0xe0
[<ffffffff8103ff4d>] do_exit+0x1fd/0xaa0
[<ffffffff8104086f>] do_group_exit+0x3f/0xb0
[<ffffffff810408f4>] SyS_exit_group+0x14/0x20
[<ffffffff8142b617>] entry_SYSCALL_64_fastpath+0x12/0x6a
---[ end trace 7864cf719fb83e12 ]---


2015-12-02 10:16:49

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Tue 01-12-15 22:34:55, Minchan Kim wrote:
> With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
> several times. I couldn't see it after reverting the new THP refcount
> redesign.

Just a wild guess: what prevents migration/compaction from calling
split_huge_page on the THP zero page? There is a VM_BUG_ON there, but it is not clear
whether you run with CONFIG_DEBUG_VM enabled.

Also, how big is the underflow?

> I will try to make a reproducer when I have time, but I'm not sure I can.
> Until then, I hope someone can track it down.
>
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 1340 at mm/page_counter.c:26 page_counter_cancel+0x34/0x40()
> Modules linked in:
> CPU: 0 PID: 1340 Comm: madvise_test Not tainted 4.4.0-rc2-mm1-kirill+ #12
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> ffffffff81782eeb ffff880072b97be8 ffffffff8126f476 0000000000000000
> ffff880072b97c20 ffffffff8103e476 ffff88006b35d0b0 00000000000001fe
> 0000000000000000 00000000000001fe ffff88006b35d000 ffff880072b97c30
> Call Trace:
> [<ffffffff8126f476>] dump_stack+0x44/0x5e
> [<ffffffff8103e476>] warn_slowpath_common+0x86/0xc0
> [<ffffffff8103e56a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff8114c754>] page_counter_cancel+0x34/0x40
> [<ffffffff8114c852>] page_counter_uncharge+0x22/0x30
> [<ffffffff8114fe17>] uncharge_batch+0x47/0x140
> [<ffffffff81150033>] uncharge_list+0x123/0x190
> [<ffffffff8115222b>] mem_cgroup_uncharge_list+0x1b/0x20
> [<ffffffff810fe9bb>] release_pages+0xdb/0x350
> [<ffffffff8113044d>] free_pages_and_swap_cache+0x9d/0x120
> [<ffffffff8111a546>] tlb_flush_mmu_free+0x36/0x60
> [<ffffffff8111b63c>] tlb_finish_mmu+0x1c/0x50
> [<ffffffff81125f38>] exit_mmap+0xd8/0x130
> [<ffffffff8103bd56>] mmput+0x56/0xe0
> [<ffffffff8103ff4d>] do_exit+0x1fd/0xaa0
> [<ffffffff8104086f>] do_group_exit+0x3f/0xb0
> [<ffffffff810408f4>] SyS_exit_group+0x14/0x20
> [<ffffffff8142b617>] entry_SYSCALL_64_fastpath+0x12/0x6a
> ---[ end trace 7864cf719fb83e12 ]---

--
Michal Hocko
SUSE Labs

2015-12-03 01:39:05

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> On Tue 01-12-15 22:34:55, Minchan Kim wrote:
> > With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
> > several times. I couldn't see it after reverting the new THP refcount
> > redesign.
>
> Just a wild guess: what prevents migration/compaction from calling
> split_huge_page on the THP zero page? There is a VM_BUG_ON there, but it is not clear

I guess migration is supposed to work only with LRU pages now, and the
zero page can't be on the LRU.

> whether you run with CONFIG_DEBUG_VM enabled.

I enabled CONFIG_DEBUG_VM.

>
> Also, how big is the underflow?

diff --git a/mm/page_counter.c b/mm/page_counter.c
index 7c6a63d..adc27c3 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -23,6 +23,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)

 	new = atomic_long_sub_return(nr_pages, &counter->count);
 	/* More uncharges than charges? */
+	if (new < 0)
+		printk("nr_pages %lu new %ld\n", nr_pages, new);
 	WARN_ON_ONCE(new < 0);
 }

nr_pages 512 new -31
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1145 at mm/page_counter.c:28 page_counter_cancel+0x44/0x50()
Modules linked in:
CPU: 3 PID: 1145 Comm: madvise_test Not tainted 4.4.0-rc2-mm1-kirill+ #17
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
ffffffff81782f09 ffff8800598c3b90 ffffffff8126f476 0000000000000000
ffff8800598c3bc8 ffffffff8103e476 ffff88007f14f8b0 0000000000000200
0000000000000000 0000000000000200 ffff88007f14f800 ffff8800598c3bd8
Call Trace:
[<ffffffff8126f476>] dump_stack+0x44/0x5e
[<ffffffff8103e476>] warn_slowpath_common+0x86/0xc0
[<ffffffff8103e56a>] warn_slowpath_null+0x1a/0x20
[<ffffffff8114c744>] page_counter_cancel+0x44/0x50
[<ffffffff8114c842>] page_counter_uncharge+0x22/0x30
[<ffffffff8114fe07>] uncharge_batch+0x47/0x140
[<ffffffff81150023>] uncharge_list+0x123/0x190
[<ffffffff811521f9>] mem_cgroup_uncharge+0x29/0x30
[<ffffffff810fe3be>] __page_cache_release+0x15e/0x200
[<ffffffff810fe47e>] __put_compound_page+0x1e/0x50
[<ffffffff810fe960>] release_pages+0xd0/0x370
[<ffffffff8113042d>] free_pages_and_swap_cache+0x9d/0x120
[<ffffffff8111a516>] tlb_flush_mmu_free+0x36/0x60
[<ffffffff8111b60c>] tlb_finish_mmu+0x1c/0x50
[<ffffffff81125f1c>] exit_mmap+0xec/0x140
[<ffffffff8103bd56>] mmput+0x56/0xe0
[<ffffffff8103ff4d>] do_exit+0x1fd/0xaa0
[<ffffffff8104086f>] do_group_exit+0x3f/0xb0
[<ffffffff810408f4>] SyS_exit_group+0x14/0x20
[<ffffffff8142b617>] entry_SYSCALL_64_fastpath+0x12/0x6a
---[ end trace 872ed93351e964c0 ]---
nr_pages 293 new -324
nr_pages 16 new -340
nr_pages 342 new -91
nr_pages 246 new -337
nr_pages 15 new -352
nr_pages 15 new -367

2015-12-03 02:09:54

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> > On Tue 01-12-15 22:34:55, Minchan Kim wrote:
> > > With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
> > > several times. I couldn't see it after reverting the new THP refcount
> > > redesign.
> >
> > Just a wild guess: what prevents migration/compaction from calling
> > split_huge_page on the THP zero page? There is a VM_BUG_ON there, but it is not clear
>
> I guess migration is supposed to work only with LRU pages now, and the
> zero page can't be on the LRU.
>
> > whether you run with CONFIG_DEBUG_VM enabled.
>
> I enabled CONFIG_DEBUG_VM.
>
> >
> > Also, how big is the underflow?
>
> diff --git a/mm/page_counter.c b/mm/page_counter.c
> index 7c6a63d..adc27c3 100644
> --- a/mm/page_counter.c
> +++ b/mm/page_counter.c
> @@ -23,6 +23,8 @@ void page_counter_cancel(struct page_counter *counter, unsigned long nr_pages)
>
>  	new = atomic_long_sub_return(nr_pages, &counter->count);
>  	/* More uncharges than charges? */
> +	if (new < 0)
> +		printk("nr_pages %lu new %ld\n", nr_pages, new);
>  	WARN_ON_ONCE(new < 0);
>  }
>
> nr_pages 512 new -31
> ------------[ cut here ]------------
> WARNING: CPU: 3 PID: 1145 at mm/page_counter.c:28 page_counter_cancel+0x44/0x50()
> Modules linked in:
> CPU: 3 PID: 1145 Comm: madvise_test Not tainted 4.4.0-rc2-mm1-kirill+ #17
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> ffffffff81782f09 ffff8800598c3b90 ffffffff8126f476 0000000000000000
> ffff8800598c3bc8 ffffffff8103e476 ffff88007f14f8b0 0000000000000200
> 0000000000000000 0000000000000200 ffff88007f14f800 ffff8800598c3bd8
> Call Trace:
> [<ffffffff8126f476>] dump_stack+0x44/0x5e
> [<ffffffff8103e476>] warn_slowpath_common+0x86/0xc0
> [<ffffffff8103e56a>] warn_slowpath_null+0x1a/0x20
> [<ffffffff8114c744>] page_counter_cancel+0x44/0x50
> [<ffffffff8114c842>] page_counter_uncharge+0x22/0x30
> [<ffffffff8114fe07>] uncharge_batch+0x47/0x140
> [<ffffffff81150023>] uncharge_list+0x123/0x190
> [<ffffffff811521f9>] mem_cgroup_uncharge+0x29/0x30
> [<ffffffff810fe3be>] __page_cache_release+0x15e/0x200
> [<ffffffff810fe47e>] __put_compound_page+0x1e/0x50
> [<ffffffff810fe960>] release_pages+0xd0/0x370
> [<ffffffff8113042d>] free_pages_and_swap_cache+0x9d/0x120
> [<ffffffff8111a516>] tlb_flush_mmu_free+0x36/0x60
> [<ffffffff8111b60c>] tlb_finish_mmu+0x1c/0x50
> [<ffffffff81125f1c>] exit_mmap+0xec/0x140
> [<ffffffff8103bd56>] mmput+0x56/0xe0
> [<ffffffff8103ff4d>] do_exit+0x1fd/0xaa0
> [<ffffffff8104086f>] do_group_exit+0x3f/0xb0
> [<ffffffff810408f4>] SyS_exit_group+0x14/0x20
> [<ffffffff8142b617>] entry_SYSCALL_64_fastpath+0x12/0x6a
> ---[ end trace 872ed93351e964c0 ]---
> nr_pages 293 new -324
> nr_pages 16 new -340
> nr_pages 342 new -91
> nr_pages 246 new -337
> nr_pages 15 new -352
> nr_pages 15 new -367

My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
shouldn't the memcg precharge in move_charge handle that case?
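
To make that suspicion concrete, here is a toy userspace model of the
suspected accounting drift (illustrative only; nothing below is kernel
code, and the numbers are made up):

#include <stdio.h>

/*
 * move_charge walks ptes, so for a pte-mapped THP it would move the
 * charge of a single page, while the exit teardown uncharges the full
 * HPAGE_PMD_NR (512 on x86-64) pages from whichever memcg the page
 * points to at that time.
 */
int main(void)
{
	long from = 1024, to = 0;	/* pages charged to each memcg */

	from -= 1;			/* move_charge: one pte, one page */
	to += 1;

	to -= 512;			/* teardown: uncharge the whole THP */

	printf("from = %ld, to = %ld (the destination underflows)\n",
	       from, to);
	return 0;
}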

2015-12-03 08:54:56

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Thu 03-12-15 11:10:06, Minchan Kim wrote:
> On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> > On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> > > On Tue 01-12-15 22:34:55, Minchan Kim wrote:
> > > > With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
> > > > several times. I couldn't see it after reverting the new THP refcount
> > > > redesign.
> > >
> > > Just a wild guess: what prevents migration/compaction from calling
> > > split_huge_page on the THP zero page? There is a VM_BUG_ON there, but it is not clear
> >
> > I guess migration is supposed to work only with LRU pages now, and the
> > zero page can't be on the LRU.

Ahh, you are right. I missed the PageLRU check in the
isolate_migratepages_block pfn walker.
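
For reference, the check in question boils down to something like this
(a simplified sketch, with a hypothetical helper name; the real logic
sits inline in the isolate_migratepages_block() loop):

/* sketch: would compaction consider this pfn's page for migration? */
static bool scanner_may_isolate(struct page *page)
{
	/*
	 * Pages that are not on an LRU list -- including the huge zero
	 * page -- are skipped, so split_huge_page() is never attempted
	 * on them.
	 */
	return PageLRU(page);
}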

> > > whether you run with CONFIG_DEBUG_VM enabled.
> >
> > I enabled CONFIG_DEBUG_VM.
> >
> > >
> > > Also, how big is the underflow?
[...]
> > nr_pages 293 new -324
> > nr_pages 16 new -340
> > nr_pages 342 new -91
> > nr_pages 246 new -337
> > nr_pages 15 new -352
> > nr_pages 15 new -367

They are quite large but that is not that surprising if we consider that
we are batching many uncharges at once.

> My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
> THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
> shouldn't the memcg precharge in move_charge handle that case?

I am not familiar with the current state of THP after the rework,
unfortunately. So if I got you right, you are saying that
pmd_trans_huge_lock fails to notice a THP, so we will not charge it as
a THP and only charge one head page, and then the teardown path will
correctly recognize it as a THP and uncharge the full size, right?

I have to admit I have no idea how PageDoubleMap works or how
pmd_trans_huge can fail.

--
Michal Hocko
SUSE Labs

2015-12-03 12:59:26

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Thu, Dec 03, 2015 at 09:54:52AM +0100, Michal Hocko wrote:
> On Thu 03-12-15 11:10:06, Minchan Kim wrote:
> > On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> > > On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> > > > On Tue 01-12-15 22:34:55, Minchan Kim wrote:
> > > > > With a new test on mmotm-2015-11-25-17-08, I saw the WARNING below
> > > > > several times. I couldn't see it after reverting the new THP refcount
> > > > > redesign.
> > > >
> > > > Just a wild guess: what prevents migration/compaction from calling
> > > > split_huge_page on the THP zero page? There is a VM_BUG_ON there, but it is not clear
> > >
> > > I guess migration is supposed to work only with LRU pages now, and the
> > > zero page can't be on the LRU.
>
> Ahh, you are right. I missed the PageLRU check in the
> isolate_migratepages_block pfn walker.
>
> > > > whether you run with CONFIG_DEBUG_VM enabled.
> > >
> > > I enabled CONFIG_DEBUG_VM.
> > >
> > > >
> > > > Also, how big is the underflow?
> [...]
> > > nr_pages 293 new -324
> > > nr_pages 16 new -340
> > > nr_pages 342 new -91
> > > nr_pages 246 new -337
> > > nr_pages 15 new -352
> > > nr_pages 15 new -367
>
> They are quite large but that is not that surprising if we consider that
> we are batching many uncharges at once.
>
> > My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
> > THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
> > shouldn't the memcg precharge in move_charge handle that case?
>
> I am not familiar with the current state of THP after the rework,
> unfortunately. So if I got you right, you are saying that
> pmd_trans_huge_lock fails to notice a THP, so we will not charge it as
> a THP and only charge one head page, and then the teardown path will
> correctly recognize it as a THP and uncharge the full size, right?

Exactly.

2015-12-03 13:37:25

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Thu 03-12-15 21:59:50, Minchan Kim wrote:
> On Thu, Dec 03, 2015 at 09:54:52AM +0100, Michal Hocko wrote:
> > On Thu 03-12-15 11:10:06, Minchan Kim wrote:
> > > On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> > > > On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
[...]
> > > > > Also, how big is the underflow?
> > [...]
> > > > nr_pages 293 new -324
> > > > nr_pages 16 new -340
> > > > nr_pages 342 new -91
> > > > nr_pages 246 new -337
> > > > nr_pages 15 new -352
> > > > nr_pages 15 new -367
> >
> > They are quite large but that is not that surprising if we consider that
> > we are batching many uncharges at once.
> >
> > > My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
> > > THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
> > > shouldn't the memcg precharge in move_charge handle that case?
> >
> > I am not familiar with the current state of THP after the rework,
> > unfortunately. So if I got you right, you are saying that
> > pmd_trans_huge_lock fails to notice a THP, so we will not charge it as
> > a THP and only charge one head page, and then the teardown path will
> > correctly recognize it as a THP and uncharge the full size, right?
>
> Exactly.

Hmm, but are the pages represented by those ptes on the LRU list?
__split_huge_pmd_locked doesn't seem to do any LRU handling. If they are not
on any LRU, then mem_cgroup_move_charge_pte_range should ignore such a pte,
and the THP (which the pte is part of) should stay in the original
memcg.

How do you trigger this issue btw?
--
Michal Hocko
SUSE Labs

2015-12-03 13:43:34

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Thu 03-12-15 14:37:19, Michal Hocko wrote:
> On Thu 03-12-15 21:59:50, Minchan Kim wrote:
> > On Thu, Dec 03, 2015 at 09:54:52AM +0100, Michal Hocko wrote:
> > > On Thu 03-12-15 11:10:06, Minchan Kim wrote:
> > > > On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> > > > > On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> [...]
> > > > > > Also, how big is the underflow?
> > > [...]
> > > > > nr_pages 293 new -324
> > > > > nr_pages 16 new -340
> > > > > nr_pages 342 new -91
> > > > > nr_pages 246 new -337
> > > > > nr_pages 15 new -352
> > > > > nr_pages 15 new -367
> > >
> > > They are quite large but that is not that surprising if we consider that
> > > we are batching many uncharges at once.
> > >
> > > > My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
> > > > THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
> > > > shouldn't the memcg precharge in move_charge handle that case?
> > >
> > > I am not familiar with the current state of THP after the rework,
> > > unfortunately. So if I got you right, you are saying that
> > > pmd_trans_huge_lock fails to notice a THP, so we will not charge it as
> > > a THP and only charge one head page, and then the teardown path will
> > > correctly recognize it as a THP and uncharge the full size, right?
> >
> > Exactly.
>
> Hmm, but are the pages represented by those ptes on the LRU list?
> __split_huge_pmd_locked doesn't seem to do any LRU handling. If they are not
> on any LRU, then mem_cgroup_move_charge_pte_range should ignore such a pte,
> and the THP (which the pte is part of) should stay in the original
> memcg.

Ohh, PageLRU is
PAGEFLAG(LRU, lru, PF_HEAD)

So we are checking the head page, and it is on the LRU. Now I can see how this
might happen. Let me think about a fix...
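
For reference, the PF_HEAD policy makes the generated accessor operate
on the head page; roughly (a paraphrase of the page-flags macro
expansion, not the literal generated code):

static inline int PageLRU(struct page *page)
{
	/*
	 * PF_HEAD: the test is redirected to compound_head(page), so
	 * PageLRU() on a pte-mapped tail page reports the head's state.
	 */
	return test_bit(PG_lru, &compound_head(page)->flags);
}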
--
Michal Hocko
SUSE Labs

2015-12-03 14:58:56

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Thu 03-12-15 14:43:26, Michal Hocko wrote:
> On Thu 03-12-15 14:37:19, Michal Hocko wrote:
> > On Thu 03-12-15 21:59:50, Minchan Kim wrote:
> > > On Thu, Dec 03, 2015 at 09:54:52AM +0100, Michal Hocko wrote:
> > > > On Thu 03-12-15 11:10:06, Minchan Kim wrote:
> > > > > On Thu, Dec 03, 2015 at 10:34:04AM +0900, Minchan Kim wrote:
> > > > > > On Wed, Dec 02, 2015 at 11:16:43AM +0100, Michal Hocko wrote:
> > [...]
> > > > > > > Also, how big is the underflow?
> > > > [...]
> > > > > > nr_pages 293 new -324
> > > > > > nr_pages 16 new -340
> > > > > > nr_pages 342 new -91
> > > > > > nr_pages 246 new -337
> > > > > > nr_pages 15 new -352
> > > > > > nr_pages 15 new -367
> > > >
> > > > They are quite large but that is not that surprising if we consider that
> > > > we are batching many uncharges at once.
> > > >
> > > > > My guess is that it's related to the new 'PageDoubleMap' feature of Kirill's
> > > > > THP rework: a THP can now be mapped by ptes while !pmd_trans_huge(*pmd), so
> > > > > shouldn't the memcg precharge in move_charge handle that case?
> > > >
> > > > I am not familiar with the current state of THP after the rework,
> > > > unfortunately. So if I got you right, you are saying that
> > > > pmd_trans_huge_lock fails to notice a THP, so we will not charge it as
> > > > a THP and only charge one head page, and then the teardown path will
> > > > correctly recognize it as a THP and uncharge the full size, right?
> > >
> > > Exactly.
> >
> > Hmm, but are the pages represented by those ptes on the LRU list?
> > __split_huge_pmd_locked doesn't seem to do any LRU handling. If they are not
> > on any LRU, then mem_cgroup_move_charge_pte_range should ignore such a pte,
> > and the THP (which the pte is part of) should stay in the original
> > memcg.
>
> Ohh, PageLRU is
> PAGEFLAG(LRU, lru, PF_HEAD)
>
> So we are checking the head page, and it is on the LRU. Now I can see how this
> might happen. Let me think about a fix...

Perhaps something like the following? I wouldn't be surprised if it
blew up magnificently. I am still pretty confused about all the
consequences of the THP rework, so there are some open questions. I
also do not see how isolate_lru_page can work properly: it doesn't seem
to operate on the head page, but who knows, maybe there is some other
trickery, because this sounds like something that would blow up from
other places as well. So I must be missing something here.

Warning, this looks ugly as hell.
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79a29d564bff..f5c3af0b74b0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4501,7 +4501,6 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
- * @nr_pages: number of regular pages (>1 for huge pages)
  * @from: mem_cgroup which the page is moved from.
  * @to: mem_cgroup which the page is moved to. @from != @to.
  *
@@ -4509,30 +4508,43 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
  *
  * This function doesn't do "charge" to new cgroup and doesn't do "uncharge"
  * from old cgroup.
+ *
+ * Returns the number of moved pages.
  */
-static int mem_cgroup_move_account(struct page *page,
-				   bool compound,
+static unsigned mem_cgroup_move_account(struct page *page,
 				   struct mem_cgroup *from,
 				   struct mem_cgroup *to)
 {
 	unsigned long flags;
-	unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
-	int ret;
+	bool compound = PageCompound(page);
+	unsigned nr_pages = 1;
+	unsigned ret = 0;
 	bool anon;
 
 	VM_BUG_ON(from == to);
 	VM_BUG_ON_PAGE(PageLRU(page), page);
-	VM_BUG_ON(compound && !PageTransHuge(page));
+
+	if (compound) {
+		/*
+		 * We might see a split huge pmd and a tail page in a regular
+		 * pte. Make sure to work on the whole THP.
+		 *
+		 * TODO: what about pmd_trans_huge_lock for tail page mapped via
+		 * pte? That means we are already split up so we cannot race?
+		 * TODO: reference on the tail page will keep the head alive,
+		 * right?
+		 */
+		page = compound_head(page);
+		nr_pages = hpage_nr_pages(page);
+	}
 
 	/*
 	 * Prevent mem_cgroup_replace_page() from looking at
 	 * page->mem_cgroup of its source page while we change it.
 	 */
-	ret = -EBUSY;
 	if (!trylock_page(page))
 		goto out;
 
-	ret = -EINVAL;
 	if (page->mem_cgroup != from)
 		goto out_unlock;
 
@@ -4580,7 +4592,7 @@ static int mem_cgroup_move_account(struct page *page,
 	page->mem_cgroup = to;
 	spin_unlock_irqrestore(&from->move_lock, flags);
 
-	ret = 0;
+	ret = nr_pages;
 
 	local_irq_disable();
 	mem_cgroup_charge_statistics(to, page, compound, nr_pages);
@@ -4858,6 +4870,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
+	struct page *huge_page = NULL;
 
 	if (pmd_trans_huge_lock(pmd, vma, &ptl)) {
 		if (mc.precharge < HPAGE_PMD_NR) {
@@ -4868,11 +4881,12 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
 			if (!isolate_lru_page(page)) {
-				if (!mem_cgroup_move_account(page, true,
-							     mc.from, mc.to)) {
-					mc.precharge -= HPAGE_PMD_NR;
-					mc.moved_charge += HPAGE_PMD_NR;
-				}
+				unsigned moved;
+
+				moved = mem_cgroup_move_account(page,
+							mc.from, mc.to);
+				mc.precharge -= moved;
+				mc.moved_charge += moved;
 				putback_lru_page(page);
 			}
 			put_page(page);
@@ -4895,15 +4909,24 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
 		case MC_TARGET_PAGE:
 			page = target.page;
-			if (isolate_lru_page(page))
+			/* We have already handled this THP */
+			if (compound_head(page) == huge_page)
 				goto put;
-			if (!mem_cgroup_move_account(page, false,
-						     mc.from, mc.to)) {
-				mc.precharge--;
+
+			if (!isolate_lru_page(page)) {
+				unsigned move;
+
+				move = mem_cgroup_move_account(page,
+							mc.from, mc.to);
+				if (move > 1)
+					huge_page = compound_head(page);
+
+				mc.precharge -= move;
 				/* we uncharge from mc.from later. */
-				mc.moved_charge++;
+				mc.moved_charge += move;
+				/* LRU always operates on the head page */
+				putback_lru_page(compound_head(page));
 			}
-			putback_lru_page(page);
 put:			/* get_mctgt_type() gets the page */
 			put_page(page);
 			break;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4589cfdbe405..14b65949b7d0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1430,9 +1430,12 @@ int isolate_lru_page(struct page *page)
 	VM_BUG_ON_PAGE(!page_count(page), page);
 
 	if (PageLRU(page)) {
-		struct zone *zone = page_zone(page);
+		struct zone *zone;
 		struct lruvec *lruvec;
 
+		/* TODO: is this correct? */
+		page = compound_head(page);
+		zone = page_zone(page);
 		spin_lock_irq(&zone->lru_lock);
 		lruvec = mem_cgroup_page_lruvec(page, zone);
 		if (PageLRU(page)) {
--
Michal Hocko
SUSE Labs

2015-12-03 15:47:35

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Thu 03-12-15 15:58:50, Michal Hocko wrote:
[....]
> Warning, this looks ugly as hell.

I was thinking about it some more, and it seems that we should rather not
bother with a partial THP at all and keep it in the original memcg
instead. It is much less code, and I do not think this will be too
disruptive. Somebody should be holding the THP head, right?

Minchan, does this fix the issue you are seeing?
---
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 79a29d564bff..143c933f0b81 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4895,6 +4895,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		switch (get_mctgt_type(vma, addr, ptent, &target)) {
 		case MC_TARGET_PAGE:
 			page = target.page;
+			/*
+			 * We can have a part of the split pmd here. Moving it
+			 * can be done but it would be too convoluted so simply
+			 * ignore such a partial THP and keep it in original
+			 * memcg. There should be somebody mapping the head.
+			 */
+			if (PageCompound(page))
+				goto put;
 			if (isolate_lru_page(page))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
--
Michal Hocko
SUSE Labs

2015-12-04 05:43:32

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> On Thu 03-12-15 15:58:50, Michal Hocko wrote:
> [....]
> > Warning, this looks ugly as hell.
>
> I was thinking about it some more, and it seems that we should rather not
> bother with a partial THP at all and keep it in the original memcg
> instead. It is much less code, and I do not think this will be too
> disruptive. Somebody should be holding the THP head, right?
>
> Minchan, does this fix the issue you are seeing?

This patch solves the issue, but I'm not sure it's the right approach.
I think it could cause a regression: before, we could move the charge
for a THP page, but now we can't. Whether that's trivial or not is up
to the memcg guys.

Thanks.


> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 79a29d564bff..143c933f0b81 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4895,6 +4895,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  		switch (get_mctgt_type(vma, addr, ptent, &target)) {
>  		case MC_TARGET_PAGE:
>  			page = target.page;
> +			/*
> +			 * We can have a part of the split pmd here. Moving it
> +			 * can be done but it would be too convoluted so simply
> +			 * ignore such a partial THP and keep it in original
> +			 * memcg. There should be somebody mapping the head.
> +			 */
> +			if (PageCompound(page))
> +				goto put;
>  			if (isolate_lru_page(page))
>  				goto put;
>  			if (!mem_cgroup_move_account(page, false,
> --
> Michal Hocko
> SUSE Labs

--
Kind regards,
Minchan Kim

2015-12-04 08:52:30

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Fri 04-12-15 14:35:15, Minchan Kim wrote:
> On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> > On Thu 03-12-15 15:58:50, Michal Hocko wrote:
> > [....]
> > > Warning, this looks ugly as hell.
> >
> > I was thinking about it some more, and it seems that we should rather not
> > bother with a partial THP at all and keep it in the original memcg
> > instead. It is much less code, and I do not think this will be too
> > disruptive. Somebody should be holding the THP head, right?
> >
> > Minchan, does this fix the issue you are seeing?
>
> This patch solves the issue, but I'm not sure it's the right approach.
> I think it could cause a regression: before, we could move the charge
> for a THP page, but now we can't.

The page would still get charged when allocated. It just wouldn't get
moved when mapped only partially. IIUC there will still be somebody
mapping the THP head via pmd, right? That process will move the page to
the new memcg when it is moved. Or is it possible that we will end up
only with pte-mapped THPs from all processes? Kirill?

If not, then I think it is reasonable to expect that a partially mapped THP
is not moved during task migration. I will post an official patch after
Kirill confirms my understanding.

Anyway, thanks for the testing and for pointing me in the right
direction, Minchan!

--
Michal Hocko
SUSE Labs

2015-12-04 09:16:47

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Fri, Dec 04, 2015 at 09:52:27AM +0100, Michal Hocko wrote:
> On Fri 04-12-15 14:35:15, Minchan Kim wrote:
> > On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> > > On Thu 03-12-15 15:58:50, Michal Hocko wrote:
> > > [....]
> > > > Warning, this looks ugly as hell.
> > >
> > > I was thinking about it some more, and it seems that we should rather not
> > > bother with a partial THP at all and keep it in the original memcg
> > > instead. It is much less code, and I do not think this will be too
> > > disruptive. Somebody should be holding the THP head, right?
> > >
> > > Minchan, does this fix the issue you are seeing?
> >
> > This patch solves the issue, but I'm not sure it's the right approach.
> > I think it could cause a regression: before, we could move the charge
> > for a THP page, but now we can't.
>
> The page would still get charged when allocated. It just wouldn't get
> moved when mapped only partially. IIUC there will still be somebody
> mapping the THP head via pmd, right? That process will move the page to

If I read the code correctly, no. split_huge_pmd() splits just the pmd,
not the page itself. IOW, it is possible to have !pmd_trans_huge(*pmd) &&
PageTransHuge(page) even though only one process owns the page.
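
In code terms, the state being described is roughly this (a sketch
capturing the behaviour as described in this thread, offered as an
assumption rather than a quote from the rework):

/*
 * split_huge_pmd() replaces the huge pmd with a table of ptes but
 * leaves the compound page intact, so a page-table walk can observe:
 */
static bool is_pte_mapped_thp(struct page *page, pmd_t *pmd)
{
	return !pmd_trans_huge(*pmd) && PageTransHuge(compound_head(page));
}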

> the new memcg when it is moved. Or is it possible that we will end up
> only with pte-mapped THPs from all processes? Kirill?

I'm not Kirill, but I think it's possible.
If so, one thing we can use is page_mapcount(page) == 1. With that, we
could guarantee that only one process owns the page, so we could charge
512 pages instead of 1?
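
Sketched as code, that heuristic might look like the following
(untested and purely illustrative; it borrows the variable names from
mem_cgroup_move_charge_pte_range()):

/* hypothetical variant of the MC_TARGET_PAGE case in the pte walk */
struct page *head = compound_head(page);
unsigned int nr_pages = 1;

if (PageTransHuge(head)) {
	if (page_mapcount(head) == 1)
		/* exclusive owner: move the whole THP's charge */
		nr_pages = hpage_nr_pages(head);	/* 512 on x86-64 */
	else
		/* shared partial THP: keep it in the original memcg */
		goto put;
}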

>
> If not, then I think it is reasonable to expect that a partially mapped THP
> is not moved during task migration. I will post an official patch after
> Kirill confirms my understanding.
>
> Anyway, thanks for the testing and for pointing me in the right
> direction, Minchan!

Thanks for the quick patch and feedback, Michal.

>
> --
> Michal Hocko
> SUSE Labs

--
Kind regards,
Minchan Kim

2015-12-04 09:58:22

by Michal Hocko

Subject: Re: memcg uncharge page counter mismatch

On Fri 04-12-15 18:16:34, Minchan Kim wrote:
> On Fri, Dec 04, 2015 at 09:52:27AM +0100, Michal Hocko wrote:
> > On Fri 04-12-15 14:35:15, Minchan Kim wrote:
> > > On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> > > > On Thu 03-12-15 15:58:50, Michal Hocko wrote:
> > > > [....]
> > > > > Warning, this looks ugly as hell.
> > > >
> > > > I was thinking about it some more, and it seems that we should rather not
> > > > bother with a partial THP at all and keep it in the original memcg
> > > > instead. It is much less code, and I do not think this will be too
> > > > disruptive. Somebody should be holding the THP head, right?
> > > >
> > > > Minchan, does this fix the issue you are seeing?
> > >
> > > This patch solves the issue, but I'm not sure it's the right approach.
> > > I think it could cause a regression: before, we could move the charge
> > > for a THP page, but now we can't.
> >
> > The page would still get charged when allocated. It just wouldn't get
> > moved when mapped only partially. IIUC there will still be somebody
> > mapping the THP head via pmd, right? That process will move the page to
>
> If I read the code correctly, no. split_huge_pmd() splits just the pmd,
> not the page itself. IOW, it is possible to have !pmd_trans_huge(*pmd) &&
> PageTransHuge(page) even though only one process owns the page.

I am not sure I follow you. I thought there would still be another pmd
which holds the THP. Why should we keep the page as huge when all the
processes which map it have already split it up?

On the other hand, it is true that the last process which maps the whole
THP might have exited and left the others mapping it partially.

> > the new memcg when it is moved. Or is it possible that we will end up
> > only with pte-mapped THPs from all processes? Kirill?
>
> I'm not Kirill, but I think it's possible.
> If so, one thing we can use is page_mapcount(page) == 1. With that, we
> could guarantee that only one process owns the page, so we could charge
> 512 pages instead of 1?

Alright, the exclusive holder should indeed move it. I will think about how
to simplify the previous patch (btw, has it helped in your testing?).

--
Michal Hocko
SUSE Labs

2015-12-04 13:35:50

by Minchan Kim

Subject: Re: memcg uncharge page counter mismatch

On Fri, Dec 04, 2015 at 10:58:15AM +0100, Michal Hocko wrote:
> On Fri 04-12-15 18:16:34, Minchan Kim wrote:
> > On Fri, Dec 04, 2015 at 09:52:27AM +0100, Michal Hocko wrote:
> > > On Fri 04-12-15 14:35:15, Minchan Kim wrote:
> > > > On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> > > > > On Thu 03-12-15 15:58:50, Michal Hocko wrote:
> > > > > [....]
> > > > > > Warning, this looks ugly as hell.
> > > > >
> > > > > I was thinking about it some more, and it seems that we should rather not
> > > > > bother with a partial THP at all and keep it in the original memcg
> > > > > instead. It is much less code, and I do not think this will be too
> > > > > disruptive. Somebody should be holding the THP head, right?
> > > > >
> > > > > Minchan, does this fix the issue you are seeing?
> > > >
> > > > This patch solves the issue, but I'm not sure it's the right approach.
> > > > I think it could cause a regression: before, we could move the charge
> > > > for a THP page, but now we can't.
> > >
> > > The page would still get charged when allocated. It just wouldn't get
> > > moved when mapped only partially. IIUC there will still be somebody
> > > mapping the THP head via pmd, right? That process will move the page to
> >
> > If I read the code correctly, no. split_huge_pmd() splits just the pmd,
> > not the page itself. IOW, it is possible to have !pmd_trans_huge(*pmd) &&
> > PageTransHuge(page) even though only one process owns the page.
>
> I am not sure I follow you. I thought there would still be another pmd
> which holds the THP. Why should we keep the page as huge when all the
> processes which map it have already split it up?

I didn't follow Kirill's work closely; I just read part of the code to
implement MADV_FREE, so I'm only guessing. High-order allocation,
compaction, split, and collapse are costly operations, so the new work
tries to avoid splitting THP pages as far as possible. For example,
because it can split just the pmd instead of the THP page, it doesn't
split the THP page in the mprotect path. It can even delay the split
via deferred_split_huge_page() when the THP page is freed.

>
> On the other hand, it is true that the last process which maps the whole
> THP might have exited and left the others mapping it partially.
>
> > > the new memcg when it is moved. Or is it possible that we will end up
> > > only with pte-mapped THPs from all processes? Kirill?
> >
> > I'm not Kirill, but I think it's possible.
> > If so, one thing we can use is page_mapcount(page) == 1. With that, we
> > could guarantee that only one process owns the page, so we could charge
> > 512 pages instead of 1?
>
> Alright, the exclusive holder should indeed move it. I will think about how
> to simplify the previous patch (btw, has it helped in your testing?).

At least your patch gets rid of the WARNING, but I didn't check whether
the accounting is right.

Thanks.

>
> --
> Michal Hocko
> SUSE Labs

--
Kind regards,
Minchan Kim

2015-12-04 16:01:39

by Johannes Weiner

Subject: Re: memcg uncharge page counter mismatch

On Thu, Dec 03, 2015 at 04:47:29PM +0100, Michal Hocko wrote:
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 79a29d564bff..143c933f0b81 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4895,6 +4895,14 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
>  		switch (get_mctgt_type(vma, addr, ptent, &target)) {
>  		case MC_TARGET_PAGE:
>  			page = target.page;
> +			/*
> +			 * We can have a part of the split pmd here. Moving it
> +			 * can be done but it would be too convoluted so simply
> +			 * ignore such a partial THP and keep it in original
> +			 * memcg. There should be somebody mapping the head.
> +			 */
> +			if (PageCompound(page))
> +				goto put;
>  			if (isolate_lru_page(page))
>  				goto put;
>  			if (!mem_cgroup_move_account(page, false,

Acked-by: Johannes Weiner <[email protected]>

The charge moving concept is fundamentally flawed and its
implementation here is incomplete and races with reclaim.

Really, nobody should be using this. Absent any actual regression
reports, a minimal fix to stop this code from generating warnings
should be enough.