[permalink] [raw]

Subject: [BUGFIX][PATCH] memory hotplug: fix notiier chain return value (Was Re: 2.6.28-rc4 mem_cgroup_charge_common panic)

Badari, I think you used SLUB. If so, page_cgroup's notifier callback was not
called and newly allocated page's page_cgroup wasn't allocated.
This is a fix. (notifier saw STOP_HERE flag added by slub's notifier.)

I'm now testing modified kernel, which does alloc/free page_cgroup by notifier.
(Usually, all page_cgroups are from bootmem and not freed.
so, modified a bit for test)

And I cannot reproduce panic. I think you do "real" memory hotplug other than
online/offline and saw panic caused by this.

Is this slub's behavior intentional ? page_cgroup's notifier has lower priority
than slub, now.

Thanks,
-Kame
==
notifier callback's notifier_from_errno() just works well in error
route. (It adds mask for "stop here")

Hanlder should return NOTIFY_OK in explict way.

Signed-off-by:KAMEZAWA Hiroyuki <[email protected]>
---
mm/page_cgroup.c | 5 ++++-
mm/slub.c | 6 ++++--
2 files changed, 8 insertions(+), 3 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/slub.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/slub.c
+++ mmotm-2.6.28-Nov10/mm/slub.c
@@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
case MEM_CANCEL_OFFLINE:
break;
}
-
- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;
return ret;
}

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -216,7 +216,10 @@ static int page_cgroup_callback(struct n
break;
}

- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;

return ret;
}

2008-11-13 17:28:36

by Badari Pulavarty

[permalink] [raw]

Subject: Re: [BUGFIX][PATCH] memory hotplug: fix notiier chain return value (Was Re: 2.6.28-rc4 mem_cgroup_charge_common panic)

On Thu, 2008-11-13 at 20:27 +0900, KAMEZAWA Hiroyuki wrote:
> Badari, I think you used SLUB. If so, page_cgroup's notifier callback was not
> called and newly allocated page's page_cgroup wasn't allocated.
> This is a fix. (notifier saw STOP_HERE flag added by slub's notifier.)

No. I wasn't using SLUB.

# egrep "SLUB|SLAB" .config
CONFIG_SLAB=y
# CONFIG_SLUB is not set
CONFIG_SLABINFO=y
# CONFIG_DEBUG_SLAB is not set

I can test the patch and let you know.

Thanks,
Badari

>
> I'm now testing modified kernel, which does alloc/free page_cgroup by notifier.
> (Usually, all page_cgroups are from bootmem and not freed.
> so, modified a bit for test)
>
> And I cannot reproduce panic. I think you do "real" memory hotplug other than
> online/offline and saw panic caused by this.
>
> Is this slub's behavior intentional ? page_cgroup's notifier has lower priority
> than slub, now.
>
> Thanks,
> -Kame
> ==
> notifier callback's notifier_from_errno() just works well in error
> route. (It adds mask for "stop here")
>
> Hanlder should return NOTIFY_OK in explict way.
>
> Signed-off-by:KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/page_cgroup.c | 5 ++++-
> mm/slub.c | 6 ++++--
> 2 files changed, 8 insertions(+), 3 deletions(-)
>
> Index: mmotm-2.6.28-Nov10/mm/slub.c
> ===================================================================
> --- mmotm-2.6.28-Nov10.orig/mm/slub.c
> +++ mmotm-2.6.28-Nov10/mm/slub.c
> @@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
> case MEM_CANCEL_OFFLINE:
> break;
> }
> -
> - ret = notifier_from_errno(ret);
> + if (ret)
> + ret = notifier_from_errno(ret);
> + else
> + ret = NOTIFY_OK;
> return ret;
> }
>
> Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
> +++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
> @@ -216,7 +216,10 @@ static int page_cgroup_callback(struct n
> break;
> }
>
> - ret = notifier_from_errno(ret);
> + if (ret)
> + ret = notifier_from_errno(ret);
> + else
> + ret = NOTIFY_OK;
>
> return ret;
> }
>
>

2008-11-13 18:52:23

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.28-rc4 mem_cgroup_charge_common panic

On Thu, 2008-11-13 at 11:17 +0900, KAMEZAWA Hiroyuki wrote:
> On Wed, 12 Nov 2008 14:02:56 -0800
> Badari Pulavarty <[email protected]> wrote:
>
> > On Tue, 2008-11-11 at 11:09 +0900, KAMEZAWA Hiroyuki wrote:
> > > On Tue, 11 Nov 2008 10:14:40 +0900
> > > KAMEZAWA Hiroyuki <[email protected]> wrote:
> > >
> > > > On Mon, 10 Nov 2008 13:43:28 -0800
> > > > Badari Pulavarty <[email protected]> wrote:
> > > >
> > > > > Hi KAME,
> > > > >
> > > > > Thank you for the fix for online/offline page_cgroup panic.
> > > > >
> > > > > While running memory offline/online tests ran into another
> > > > > mem_cgroup panic.
> > > > >
> > > >
> > > > Hm, should I avoid freeing mem_cgroup at memory Offline ?
> > > > (memmap is also not free AFAIK.)
> > > >
> > > > Anyway, I'll dig this. thanks.
> > > >
> > > it seems not the same kind of bug..
> > >
> > > Could you give me disassemble of mem_cgroup_charge_common() ?
> > > (I'm not sure I can read ppc asm but I want to know what is "0x20"
> > > of fault address....)
> > >
> > > As first impression, it comes from page migration..
> > > rc4's page migration handler of memcg handles *usual* path but not so good.
> > >
> > > new migration code of memcg in mmotm is much better, I think.
> > > Could you try mmotm if you have time ?
> >
> > I tried mmtom. Its even worse :(
> >
> > Ran into following quickly .. Sorry!!
> >
> From
> > Instruction dump:
> > 794b1f24 794026e4 7d6bda14 7d3b0214 7d234b78 39490008 e92b0048 39290001
> > f92b0048 419e001c e9230008 f93c0018 <f9090008> f9030008 f9480008 48000018
>
> the reason doesn't seem to be different from the one you saw in rc4.
>
> We'do add_list() hear, so (maybe) used page_cgroup is zero-cleared, I think.
> We usually do migration test on cpuset and confirmed this works with migration.
>
> Hmm...I susupect following. could you try ?
>
> Sorry.
> -Kame
> ==
>
> Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
> +++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
> @@ -166,7 +166,7 @@ int online_page_cgroup(unsigned long sta
> end = ALIGN(start_pfn + nr_pages, PAGES_PER_SECTION);
>
> for (pfn = start; !fail && pfn < end; pfn += PAGES_PER_SECTION) {
> - if (!pfn_present(pfn))
> + if (!pfn_valid(pfn))
> continue;
> fail = init_section_page_cgroup(pfn);
> }
>
>

I tried mmtom + startpfn fix + this fix + notifier fix. Didn't help.
I am not using SLUB (using SLAB). Yes. I am testing "real" memory
remove (not just offline/online), since it executes more code of
freeing memmap etc.

Code that is panicing is list_add() in mem_cgroup_add_list().
I will debug it further.

Thanks,
Badari

Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c000000000109e50 LR: c000000000109de8 CTR: c0000000000c2414
REGS: c0000000e653f2d0 TRAP: 0300 Not tainted (2.6.28-rc4-mm1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44008484 XER: 20000018
DAR: 0000000000000007, DSISR: 0000000042000000
TASK = c0000000e7db9950[4927] 'drmgr' THREAD: c0000000e653c000 CPU: 0
GPR00: 0000000000000020 c0000000e653f550 c000000000b472d8 c0000000e910d558
GPR04: c00000000010a730 c0000000000b96a8 c0000000e653f660 0000000000000000
GPR08: c000000005432358 ffffffffffffffff c0000000e910d560 c0000000e910d548
GPR12: 0000000024000482 c000000000b68300 00000000200957bc 0000000000000000
GPR16: 0000000000000000 c0000000e653f8f8 0000000000000000 c000000000aeea70
GPR20: 0000000000000056 0000000000000000 c00000000370c300 00000000000e6000
GPR24: 0000000000000000 0000000000000000 0000000000000001 c0000000e910d538
GPR28: c000000005432340 0000000000000001 c000000000abc220 c0000000e653f550
NIP [c000000000109e50] .__mem_cgroup_add_list+0x98/0xec
LR [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec
Call Trace:
[c0000000e653f550] [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec (unreliable)
[c0000000e653f5f0] [c00000000010a730] .__mem_cgroup_commit_charge+0x108/0x154
[c0000000e653f690] [c00000000010adf8] .mem_cgroup_end_migration+0xb4/0x130
[c0000000e653f730] [c000000000108c84] .migrate_pages+0x460/0x62c
[c0000000e653f880] [c000000000106760] .offline_pages+0x398/0x5ac
[c0000000e653f990] [c0000000001069b8] .remove_memory+0x44/0x60
[c0000000e653fa20] [c000000000407590] .memory_block_change_state+0x198/0x230
[c0000000e653fad0] [c000000000407cb0] .store_mem_state+0xcc/0x144
[c0000000e653fb70] [c0000000003fa8b4] .sysdev_store+0x74/0xa4
[c0000000e653fc10] [c00000000017b088] .sysfs_write_file+0x128/0x1a4
[c0000000e653fcd0] [c00000000010fb80] .vfs_write+0xf0/0x1c4
[c0000000e653fd80] [c00000000011051c] .sys_write+0x6c/0xb8
[c0000000e653fe30] [c00000000000852c] syscall_exit+0x0/0x40
Instruction dump:
794b1f24 794026e4 7d6bda14 7d3b0214 7d234b78 39490008 e92b0048 39290001
f92b0048 419e001c e9230008 f93c0018 <f9090008> f9030008 f9480008 48000018
---[ end trace e803fa4abaa22794 ]---
Unable to handle kernel paging request for data at address 0x00000008
Faulting instruction address: 0xc000000000109e50
Oops: Kernel access of bad area, sig: 11 [#2]
SMP NR_CPUS=32 NUMA pSeries
Modules linked in:
NIP: c000000000109e50 LR: c000000000109de8 CTR: c0000000000c2414
REGS: c0000000e63b3110 TRAP: 0300 Tainted: G D (2.6.28-rc4-mm1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 44044844 XER: 20000010
DAR: 0000000000000008, DSISR: 0000000042000000
TASK = c0000000e9966690[2719] 'syslog-ng' THREAD: c0000000e63b0000 CPU: 1
GPR00: 0000000000000020 c0000000e63b3390 c000000000b472d8 c0000000e910c758
GPR04: c00000000010a730 c0000000000b96a8 0000000000000001 0000000000000000
GPR08: c000000005431ea8 0000000000000000 c0000000e910c760 c0000000e910c748
GPR12: c0000000e6e040f8 c000000000b68500 000000000000f8e5 0000000000000004
GPR16: 0000000000000000 00000000000003fa c0000000e63b3b30 c0000000be125000
GPR20: c0000000be124000 0000000000000005 0000000002180404 0000000000001694
GPR24: 0000000000000000 0000000000000000 0000000000000001 c0000000e910c738
GPR28: c000000005431e90 0000000000000001 c000000000abc220 c0000000e63b3390
NIP [c000000000109e50] .__mem_cgroup_add_list+0x98/0xec
LR [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec
Call Trace:
[c0000000e63b3390] [c000000000109de8] .__mem_cgroup_add_list+0x30/0xec (unreliable)
[c0000000e63b3430] [c00000000010a730] .__mem_cgroup_commit_charge+0x108/0x154
[c0000000e63b34d0] [c00000000010af90] .mem_cgroup_charge_common+0x94/0xc4
[c0000000e63b3590] [c00000000010b588] .mem_cgroup_cache_charge+0x130/0x154
[c0000000e63b3630] [c0000000000c5308] .add_to_page_cache_locked+0x64/0x18c
[c0000000e63b36e0] [c0000000000c54b0] .add_to_page_cache_lru+0x80/0xe4
[c0000000e63b3780] [c0000000000c59bc] .find_or_create_page+0x74/0xc8
[c0000000e63b3830] [c00000000013e8a4] .__getblk+0x150/0x2f8
[c0000000e63b38f0] [c0000000001aac0c] .do_journal_end+0x9c4/0xfa0
[c0000000e63b3a20] [c0000000001ab28c] .journal_end_sync+0xa4/0xc4
[c0000000e63b3ac0] [c0000000001af044] .reiserfs_commit_for_inode+0x188/0x22c
[c0000000e63b3bc0] [c000000000190948] .reiserfs_sync_file+0x6c/0xe4
[c0000000e63b3c60] [c00000000013c114] .do_fsync+0xa0/0x120
[c0000000e63b3d00] [c00000000013c1e4] .__do_fsync+0x50/0x84
[c0000000e63b3da0] [c00000000013c290] .sys_fsync+0x30/0x48
[c0000000e63b3e30] [c00000000000852c] syscall_exit+0x0/0x40
Instruction dump:
794b1f24 794026e4 7d6bda14 7d3b0214 7d234b78 39490008 e92b0048 39290001
f92b0048 419e001c e9230008 f93c0018 <f9090008> f9030008 f9480008 48000018
---[ end trace e803fa4abaa22794 ]---
INFO: RCU detected CPU 0 stall (t=4295359473/2500 jiffies)
Call Trace:
[c0000000e6662c90] [c0000000000102bc] .show_stack+0x94/0x198 (unreliable)
[c0000000e6662d40] [c0000000000103e8] .dump_stack+0x28/0x3c
[c0000000e6662dc0] [c0000000000b338c] .__rcu_pending+0xa8/0x2c4
[c0000000e6662e60] [c0000000000b35f4] .rcu_pending+0x4c/0xa0
[c0000000e6662ef0] [c0000000000788a0] .update_process_times+0x50/0xa8
[c0000000e6662f90] [c00000000009875c] .tick_sched_timer+0xb0/0x100
[c0000000e6663040] [c00000000008cbf8] .__run_hrtimer+0xa4/0x13c
[c0000000e66630e0] [c00000000008de64] .hrtimer_interrupt+0x128/0x200
[c0000000e66631c0] [c0000000000284c4] .timer_interrupt+0xc0/0x11c
[c0000000e6663260] [c000000000003710] decrementer_common+0x110/0x180
--- Exception: 901 at ._spin_lock_irqsave+0x80/0xd4
LR = ._spin_lock_irqsave+0x7c/0xd4
[c0000000e6663550] [c0000000005ae7a0] ._spin_lock_irqsave+0x28/0xd4 (unreliable)
[c0000000e66635f0] [c00000000010a718] .__mem_cgroup_commit_charge+0xf0/0x154
[c0000000e6663690] [c00000000010adf8] .mem_cgroup_end_migration+0xb4/0x130
[c0000000e6663730] [c000000000108c84] .migrate_pages+0x460/0x62c
[c0000000e6663880] [c000000000106760] .offline_pages+0x398/0x5ac
[c0000000e6663990] [c0000000001069b8] .remove_memory+0x44/0x60
[c0000000e6663a20] [c000000000407590] .memory_block_change_state+0x198/0x230
[c0000000e6663ad0] [c000000000407cb0] .store_mem_state+0xcc/0x144
[c0000000e6663b70] [c0000000003fa8b4] .sysdev_store+0x74/0xa4
[c0000000e6663c10] [c00000000017b088] .sysfs_write_file+0x128/0x1a4
[c0000000e6663cd0] [c00000000010fb80] .vfs_write+0xf0/0x1c4
[c0000000e6663d80] [c00000000011051c] .sys_write+0x6c/0xb8
[c0000000e6663e30] [c00000000000852c] syscall_exit+0x0/0x40
INFO: RCU detected CPU 0 stall (t=4295366973/10000 jiffies)
Call Trace:
[c0000000e6662c90] [c0000000000102bc] .show_stack+0x94/0x198 (unreliable)
[c0000000e6662d40] [c0000000000103e8] .dump_stack+0x28/0x3c
[c0000000e6662dc0] [c0000000000b338c] .__rcu_pending+0xa8/0x2c4
[c0000000e6662e60] [c0000000000b35f4] .rcu_pending+0x4c/0xa0
[c0000000e6662ef0] [c0000000000788a0] .update_process_times+0x50/0xa8
[c0000000e6662f90] [c00000000009875c] .tick_sched_timer+0xb0/0x100
[c0000000e6663040] [c00000000008cbf8] .__run_hrtimer+0xa4/0x13c
[c0000000e66630e0] [c00000000008de64] .hrtimer_interrupt+0x128/0x200
[c0000000e66631c0] [c0000000000284c4] .timer_interrupt+0xc0/0x11c
[c0000000e6663260] [c000000000003710] decrementer_common+0x110/0x180
--- Exception: 901 at ._spin_lock_irqsave+0x84/0xd4
LR = ._spin_lock_irqsave+0x7c/0xd4
[c0000000e6663550] [c0000000005ae7a0] ._spin_lock_irqsave+0x28/0xd4 (unreliable)
[c0000000e66635f0] [c00000000010a718] .__mem_cgroup_commit_charge+0xf0/0x154
[c0000000e6663690] [c00000000010adf8] .mem_cgroup_end_migration+0xb4/0x130
[c0000000e6663730] [c000000000108c84] .migrate_pages+0x460/0x62c
[c0000000e6663880] [c000000000106760] .offline_pages+0x398/0x5ac
[c0000000e6663990] [c0000000001069b8] .remove_memory+0x44/0x60
[c0000000e6663a20] [c000000000407590] .memory_block_change_state+0x198/0x230
[c0000000e6663ad0] [c000000000407cb0] .store_mem_state+0xcc/0x144
[c0000000e6663b70] [c0000000003fa8b4] .sysdev_store+0x74/0xa4
[c0000000e6663c10] [c00000000017b088] .sysfs_write_file+0x128/0x1a4
[c0000000e6663cd0] [c00000000010fb80] .vfs_write+0xf0/0x1c4
[c0000000e6663d80] [c00000000011051c] .sys_write+0x6c/0xb8

2008-11-14 04:11:55

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: 2.6.28-rc4 mem_cgroup_charge_common panic

On Thu, 13 Nov 2008 10:53:24 -0800
Badari Pulavarty <[email protected]> wrote:
> I tried mmtom + startpfn fix + this fix + notifier fix. Didn't help.
> I am not using SLUB (using SLAB). Yes. I am testing "real" memory
> remove (not just offline/online), since it executes more code of
> freeing memmap etc.
>
> Code that is panicing is list_add() in mem_cgroup_add_list().
> I will debug it further.
>

Considering difference between "real" memory hotplug and logical ones,
I found this. I hope this fixes the bug.
But I myself can't do test this..

Thanks,
-Kame

==
Fixes for memcg/memory hotplug.

While memory hotplug allocate/free memmap, page_cgroup doesn't free
page_cgroup at OFFLINE when page_cgroup is allocated via bootomem.
(Because freeing bootmem requires special care.)

Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
by memory hotplug, page_cgroup->page == page is no longer true and
we have to update that.

But current MEM_ONLINE handler doesn't check it and update page_cgroup->page
if it's not necessary to allocate page_cgroup.

And I noticed that MEM_ONLINE can be called against "part of section".
So, freeing page_cgroup at CANCEL_ONLINE will cause trouble.
(freeing used page_cgroup)
Don't rollback at CANCEL.

One more, current memory hotplug notifier is stopped by slub
because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback
never be called. (low priority than slub now.)

I think this slub's behavior is not intentional(BUG). and fixes it.

Another way to be considered about page_cgroup allocation:
- free page_cgroup at OFFLINE even if it's from bootmem
and remove specieal handler. But it requires more changes.

Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>

---
mm/page_cgroup.c | 39 +++++++++++++++++++++++++++------------
mm/slub.c | 6 ++++--
2 files changed, 31 insertions(+), 14 deletions(-)

Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
+++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
@@ -104,18 +104,30 @@ int __meminit init_section_page_cgroup(u
unsigned long table_size;
int nid, index;

- if (section->page_cgroup)
- return 0;
+ if (!section->page_cgroup) {

- nid = page_to_nid(pfn_to_page(pfn));
- table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
- if (slab_is_available()) {
- base = kmalloc_node(table_size, GFP_KERNEL, nid);
- if (!base)
- base = vmalloc_node(table_size, nid);
- } else {
- base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), table_size,
+ nid = page_to_nid(pfn_to_page(pfn));
+ table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
+ if (slab_is_available()) {
+ base = kmalloc_node(table_size, GFP_KERNEL, nid);
+ if (!base)
+ base = vmalloc_node(table_size, nid);
+ } else {
+ base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
+ table_size,
PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
+ }
+ } else {
+ /*
+ * We don't have to allocate page_cgroup again, but
+ * address of memmap may be changed. So, we have to initialize
+ * again.
+ */
+ base = section->page_cgroup + pfn;
+ table_size = 0;
+ /* check address of memmap is changed or not. */
+ if (base->page == pfn_to_page(pfn))
+ return 0;
}

if (!base) {
@@ -204,19 +216,22 @@ static int page_cgroup_callback(struct n
ret = online_page_cgroup(mn->start_pfn,
mn->nr_pages, mn->status_change_nid);
break;
- case MEM_CANCEL_ONLINE:
case MEM_OFFLINE:
offline_page_cgroup(mn->start_pfn,
mn->nr_pages, mn->status_change_nid);
break;
case MEM_GOING_OFFLINE:
+ case MEM_CANCEL_ONLINE:
break;
case MEM_ONLINE:
case MEM_CANCEL_OFFLINE:
break;
}

- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;

return ret;
}
Index: mmotm-2.6.28-Nov10/mm/slub.c
===================================================================
--- mmotm-2.6.28-Nov10.orig/mm/slub.c
+++ mmotm-2.6.28-Nov10/mm/slub.c
@@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
case MEM_CANCEL_OFFLINE:
break;
}
-
- ret = notifier_from_errno(ret);
+ if (ret)
+ ret = notifier_from_errno(ret);
+ else
+ ret = NOTIFY_OK;
return ret;
}

2008-11-17 21:29:34

by Badari Pulavarty

[permalink] [raw]

Subject: Re: 2.6.28-rc4 mem_cgroup_charge_common panic

On Fri, 2008-11-14 at 13:10 +0900, KAMEZAWA Hiroyuki wrote:
> On Thu, 13 Nov 2008 10:53:24 -0800
> Badari Pulavarty <[email protected]> wrote:
> > I tried mmtom + startpfn fix + this fix + notifier fix. Didn't help.
> > I am not using SLUB (using SLAB). Yes. I am testing "real" memory
> > remove (not just offline/online), since it executes more code of
> > freeing memmap etc.
> >
> > Code that is panicing is list_add() in mem_cgroup_add_list().
> > I will debug it further.
> >
>
> Considering difference between "real" memory hotplug and logical ones,
> I found this. I hope this fixes the bug.
> But I myself can't do test this..
>
> Thanks,
> -Kame
>

Kame,

With this patch I am able to run tests without any issues.

Sorry for delayed response, I wanted to make sure test runs fine over
the weekend.

Tested-by: Badari Pulavarty <[email protected]>

Thanks,
Badari

> ==
> Fixes for memcg/memory hotplug.
>
>
> While memory hotplug allocate/free memmap, page_cgroup doesn't free
> page_cgroup at OFFLINE when page_cgroup is allocated via bootomem.
> (Because freeing bootmem requires special care.)
>
> Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
> by memory hotplug, page_cgroup->page == page is no longer true and
> we have to update that.
>
> But current MEM_ONLINE handler doesn't check it and update page_cgroup->page
> if it's not necessary to allocate page_cgroup.
>
> And I noticed that MEM_ONLINE can be called against "part of section".
> So, freeing page_cgroup at CANCEL_ONLINE will cause trouble.
> (freeing used page_cgroup)
> Don't rollback at CANCEL.
>
> One more, current memory hotplug notifier is stopped by slub
> because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback
> never be called. (low priority than slub now.)
>
> I think this slub's behavior is not intentional(BUG). and fixes it.
>
>
> Another way to be considered about page_cgroup allocation:
> - free page_cgroup at OFFLINE even if it's from bootmem
> and remove specieal handler. But it requires more changes.
>
>
> Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
>
> ---
> mm/page_cgroup.c | 39 +++++++++++++++++++++++++++------------
> mm/slub.c | 6 ++++--
> 2 files changed, 31 insertions(+), 14 deletions(-)
>
> Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
> ===================================================================
> --- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
> +++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
> @@ -104,18 +104,30 @@ int __meminit init_section_page_cgroup(u
> unsigned long table_size;
> int nid, index;
>
> - if (section->page_cgroup)
> - return 0;
> + if (!section->page_cgroup) {
>
> - nid = page_to_nid(pfn_to_page(pfn));
> - table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> - if (slab_is_available()) {
> - base = kmalloc_node(table_size, GFP_KERNEL, nid);
> - if (!base)
> - base = vmalloc_node(table_size, nid);
> - } else {
> - base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), table_size,
> + nid = page_to_nid(pfn_to_page(pfn));
> + table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> + if (slab_is_available()) {
> + base = kmalloc_node(table_size, GFP_KERNEL, nid);
> + if (!base)
> + base = vmalloc_node(table_size, nid);
> + } else {
> + base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
> + table_size,
> PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
> + }
> + } else {
> + /*
> + * We don't have to allocate page_cgroup again, but
> + * address of memmap may be changed. So, we have to initialize
> + * again.
> + */
> + base = section->page_cgroup + pfn;
> + table_size = 0;
> + /* check address of memmap is changed or not. */
> + if (base->page == pfn_to_page(pfn))
> + return 0;
> }
>
> if (!base) {
> @@ -204,19 +216,22 @@ static int page_cgroup_callback(struct n
> ret = online_page_cgroup(mn->start_pfn,
> mn->nr_pages, mn->status_change_nid);
> break;
> - case MEM_CANCEL_ONLINE:
> case MEM_OFFLINE:
> offline_page_cgroup(mn->start_pfn,
> mn->nr_pages, mn->status_change_nid);
> break;
> case MEM_GOING_OFFLINE:
> + case MEM_CANCEL_ONLINE:
> break;
> case MEM_ONLINE:
> case MEM_CANCEL_OFFLINE:
> break;
> }
>
> - ret = notifier_from_errno(ret);
> + if (ret)
> + ret = notifier_from_errno(ret);
> + else
> + ret = NOTIFY_OK;
>
> return ret;
> }
> Index: mmotm-2.6.28-Nov10/mm/slub.c
> ===================================================================
> --- mmotm-2.6.28-Nov10.orig/mm/slub.c
> +++ mmotm-2.6.28-Nov10/mm/slub.c
> @@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
> case MEM_CANCEL_OFFLINE:
> break;
> }
> -
> - ret = notifier_from_errno(ret);
> + if (ret)
> + ret = notifier_from_errno(ret);
> + else
> + ret = NOTIFY_OK;
> return ret;
> }
>
>

2008-11-18 01:09:21

by Kamezawa Hiroyuki

[permalink] [raw]

Subject: Re: 2.6.28-rc4 mem_cgroup_charge_common panic

On Mon, 17 Nov 2008 13:30:08 -0800
Badari Pulavarty <[email protected]> wrote:

> On Fri, 2008-11-14 at 13:10 +0900, KAMEZAWA Hiroyuki wrote:
> > On Thu, 13 Nov 2008 10:53:24 -0800
> > Badari Pulavarty <[email protected]> wrote:
> > > I tried mmtom + startpfn fix + this fix + notifier fix. Didn't help.
> > > I am not using SLUB (using SLAB). Yes. I am testing "real" memory
> > > remove (not just offline/online), since it executes more code of
> > > freeing memmap etc.
> > >
> > > Code that is panicing is list_add() in mem_cgroup_add_list().
> > > I will debug it further.
> > >
> >
> > Considering difference between "real" memory hotplug and logical ones,
> > I found this. I hope this fixes the bug.
> > But I myself can't do test this..
> >
> > Thanks,
> > -Kame
> >
>
> Kame,
>
> With this patch I am able to run tests without any issues.
>
> Sorry for delayed response, I wanted to make sure test runs fine over
> the weekend.
>
> Tested-by: Badari Pulavarty <[email protected]>
>
Wow, Thank you!

-Kame

> Thanks,
> Badari
>
>
> > ==
> > Fixes for memcg/memory hotplug.
> >
> >
> > While memory hotplug allocate/free memmap, page_cgroup doesn't free
> > page_cgroup at OFFLINE when page_cgroup is allocated via bootomem.
> > (Because freeing bootmem requires special care.)
> >
> > Then, if page_cgroup is allocated by bootmem and memmap is freed/allocated
> > by memory hotplug, page_cgroup->page == page is no longer true and
> > we have to update that.
> >
> > But current MEM_ONLINE handler doesn't check it and update page_cgroup->page
> > if it's not necessary to allocate page_cgroup.
> >
> > And I noticed that MEM_ONLINE can be called against "part of section".
> > So, freeing page_cgroup at CANCEL_ONLINE will cause trouble.
> > (freeing used page_cgroup)
> > Don't rollback at CANCEL.
> >
> > One more, current memory hotplug notifier is stopped by slub
> > because it sets NOTIFY_STOP_MASK to return vaule. So, page_cgroup's callback
> > never be called. (low priority than slub now.)
> >
> > I think this slub's behavior is not intentional(BUG). and fixes it.
> >
> >
> > Another way to be considered about page_cgroup allocation:
> > - free page_cgroup at OFFLINE even if it's from bootmem
> > and remove specieal handler. But it requires more changes.
> >
> >
> > Signed-off-by: KAMEZAWA Hiruyoki <[email protected]>
> >
> > ---
> > mm/page_cgroup.c | 39 +++++++++++++++++++++++++++------------
> > mm/slub.c | 6 ++++--
> > 2 files changed, 31 insertions(+), 14 deletions(-)
> >
> > Index: mmotm-2.6.28-Nov10/mm/page_cgroup.c
> > ===================================================================
> > --- mmotm-2.6.28-Nov10.orig/mm/page_cgroup.c
> > +++ mmotm-2.6.28-Nov10/mm/page_cgroup.c
> > @@ -104,18 +104,30 @@ int __meminit init_section_page_cgroup(u
> > unsigned long table_size;
> > int nid, index;
> >
> > - if (section->page_cgroup)
> > - return 0;
> > + if (!section->page_cgroup) {
> >
> > - nid = page_to_nid(pfn_to_page(pfn));
> > - table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> > - if (slab_is_available()) {
> > - base = kmalloc_node(table_size, GFP_KERNEL, nid);
> > - if (!base)
> > - base = vmalloc_node(table_size, nid);
> > - } else {
> > - base = __alloc_bootmem_node_nopanic(NODE_DATA(nid), table_size,
> > + nid = page_to_nid(pfn_to_page(pfn));
> > + table_size = sizeof(struct page_cgroup) * PAGES_PER_SECTION;
> > + if (slab_is_available()) {
> > + base = kmalloc_node(table_size, GFP_KERNEL, nid);
> > + if (!base)
> > + base = vmalloc_node(table_size, nid);
> > + } else {
> > + base = __alloc_bootmem_node_nopanic(NODE_DATA(nid),
> > + table_size,
> > PAGE_SIZE, __pa(MAX_DMA_ADDRESS));
> > + }
> > + } else {
> > + /*
> > + * We don't have to allocate page_cgroup again, but
> > + * address of memmap may be changed. So, we have to initialize
> > + * again.
> > + */
> > + base = section->page_cgroup + pfn;
> > + table_size = 0;
> > + /* check address of memmap is changed or not. */
> > + if (base->page == pfn_to_page(pfn))
> > + return 0;
> > }
> >
> > if (!base) {
> > @@ -204,19 +216,22 @@ static int page_cgroup_callback(struct n
> > ret = online_page_cgroup(mn->start_pfn,
> > mn->nr_pages, mn->status_change_nid);
> > break;
> > - case MEM_CANCEL_ONLINE:
> > case MEM_OFFLINE:
> > offline_page_cgroup(mn->start_pfn,
> > mn->nr_pages, mn->status_change_nid);
> > break;
> > case MEM_GOING_OFFLINE:
> > + case MEM_CANCEL_ONLINE:
> > break;
> > case MEM_ONLINE:
> > case MEM_CANCEL_OFFLINE:
> > break;
> > }
> >
> > - ret = notifier_from_errno(ret);
> > + if (ret)
> > + ret = notifier_from_errno(ret);
> > + else
> > + ret = NOTIFY_OK;
> >
> > return ret;
> > }
> > Index: mmotm-2.6.28-Nov10/mm/slub.c
> > ===================================================================
> > --- mmotm-2.6.28-Nov10.orig/mm/slub.c
> > +++ mmotm-2.6.28-Nov10/mm/slub.c
> > @@ -3220,8 +3220,10 @@ static int slab_memory_callback(struct n
> > case MEM_CANCEL_OFFLINE:
> > break;
> > }
> > -
> > - ret = notifier_from_errno(ret);
> > + if (ret)
> > + ret = notifier_from_errno(ret);
> > + else
> > + ret = NOTIFY_OK;
> > return ret;
> > }
> >
> >
>
>