2012-10-25 13:18:29

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 00/31] numa/core patches

Hi all,

Here's a re-post of the NUMA scheduling and migration improvement
patches that we are working on. These include techniques from
AutoNUMA and the sched/numa tree and form a unified basis - it
has got all the bits that look good and mergeable.

With these patches applied, the mbind system calls expand to
new modes of lazy-migration binding, and if the
CONFIG_SCHED_NUMA=y .config option is enabled the scheduler
will automatically sample the working set of tasks via page
faults. Based on that information the scheduler then tries
to balance smartly, put tasks on a home node and migrate CPU
work and memory on the same node.
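
As a rough illustration only (not part of the series itself), a minimal
userspace sketch of requesting the lazy-migration binding via mbind();
it assumes the MPOL_MF_LAZY flag added by this series and guards the
flag value in case the libc headers do not carry it yet:

  #include <numaif.h>             /* mbind(), MPOL_BIND, MPOL_MF_MOVE */
  #include <stdlib.h>

  #ifndef MPOL_MF_LAZY
  #define MPOL_MF_LAZY (1 << 3)   /* assumed value, see this series */
  #endif

  int main(void)
  {
          size_t len = 64UL << 20;                /* 64MB example buffer */
          void *buf = aligned_alloc(4096, len);
          unsigned long nodemask = 1UL << 1;      /* bind to node 1 */

          if (!buf)
                  return 1;
          /*
           * MPOL_MF_MOVE | MPOL_MF_LAZY: instead of migrating the range
           * synchronously, mark it so that misplaced pages are migrated
           * lazily, at the next fault that touches them.
           */
          mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask),
                MPOL_MF_MOVE | MPOL_MF_LAZY);
          return 0;
  }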

They are functional in their current state and have had testing on
a variety of x86 NUMA hardware.

These patches will continue their life in tip:numa/core and unless
there are major showstoppers they are intended for the v3.8
merge window.

We believe that they provide a solid basis for future work.

Please review .. once again and holler if you see anything funny! :-)


2012-10-26 09:06:38

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/25/2012 08:16 PM, Peter Zijlstra wrote:
> Hi all,
>
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on. These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.
>
> With these patches applied, the mbind system calls expand to
> new modes of lazy-migration binding, and if the
> CONFIG_SCHED_NUMA=y .config option is enabled the scheduler
> will automatically sample the working set of tasks via page
> faults. Based on that information the scheduler then tries
> to balance smartly, put tasks on a home node and migrate CPU
> work and memory on the same node.
>
> They are functional in their current state and have had testing on
> a variety of x86 NUMA hardware.
>
> These patches will continue their life in tip:numa/core and unless
> there are major showstoppers they are intended for the v3.8
> merge window.
>
> We believe that they provide a solid basis for future work.
>
> Please review .. once again and holler if you see anything funny! :-)

Hi,

I tested the patch set, but there's one issue blocking me:

kernel BUG at mm/memcontrol.c:3263!

--------- snip -----------------
[ 179.804754] kernel BUG at mm/memcontrol.c:3263!
[ 179.874356] invalid opcode: 0000 [#1] SMP
[ 179.939377] Modules linked in: fuse ip6table_filter ip6_tables
ebtable_nat ebtables bnep bluetooth rfkill iptable_mangle
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi
iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi
ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp
libiscsi_tcp libiscsi scsi_transport_iscsi vfat fat iTCO_wdt cdc_ether
coretemp iTCO_vendor_support usbnet mii ioatdma lpc_ich crc32c_intel
bnx2 shpchp i7core_edac pcspkr tpm_tis tpm i2c_i801 mfd_core tpm_bios
edac_core dca serio_raw microcode vhost_net tun macvtap macvlan
kvm_intel kvm uinput mgag200 i2c_algo_bit drm_kms_helper ttm drm
megaraid_sas i2c_core
[ 180.737647] CPU 7
[ 180.759586] Pid: 1316, comm: X Not tainted 3.7.0-rc2+ #3 IBM IBM
System x3400 M3 Server -[7379I08]-/69Y4356
[ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>]
mem_cgroup_prepare_migration+0xba/0xd0
[ 181.047572] RSP: 0000:ffff880179113d38 EFLAGS: 00013202
[ 181.127009] RAX: 0040100000084069 RBX: ffffea0005b28000 RCX:
ffffea00099a805c
[ 181.228674] RDX: ffff880179113d90 RSI: ffffea00099a8000 RDI:
ffffea0005b28000
[ 181.331080] RBP: ffff880179113d58 R08: 0000000000280000 R09:
ffff88027fffff80
[ 181.433163] R10: 00000000000000d4 R11: 00000037e9f7bd90 R12:
ffff880179113d90
[ 181.533866] R13: 00007fc5ffa00000 R14: ffff880178001fe8 R15:
000000016ca001e0
[ 181.635264] FS: 00007fc600ddb940(0000) GS:ffff88027fc60000(0000)
knlGS:0000000000000000
[ 181.753726] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 181.842013] CR2: 00007fc5ffa00000 CR3: 00000001779d2000 CR4:
00000000000007e0
[ 181.945346] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 182.049416] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 182.153796] Process X (pid: 1316, threadinfo ffff880179112000, task
ffff880179364620)
[ 182.266464] Stack:
[ 182.309943] ffff880177d2c980 00007fc5ffa00000 ffffea0005b28000
ffff880177d2c980
[ 182.418164] ffff880179113dc8 ffffffff81183b60 ffff880177d2c9dc
0000000178001fe0
[ 182.526366] ffff880177856a50 ffffea00099a8000 ffff880177d2cc38
0000000000000000
[ 182.633709] Call Trace:
[ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
[ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
[ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
[ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
[ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
[ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
[ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
[ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
[ 183.373909] Code: 00 48 8b 78 08 48 8b 57 10 83 e2 01 75 05 f0 83 47
08 01 f6 43 08 01 74 bb f0 80 08 04 eb b5 f3 90 48 8b 10 80 e2 01 75 f6
eb 94 <0f> 0b 0f 1f 40 00 e8 9c b4 49 00 66 66 2e 0f 1f 84 00 00 00 00
[ 183.651946] RIP [<ffffffff8118c39a>]
mem_cgroup_prepare_migration+0xba/0xd0
[ 183.760378] RSP <ffff880179113d38>

===========================================================================

My system has two NUMA nodes.

There are two ways to reproduce the bug on my machine:
1. start the X server:
# startx
this reproduces it 100% of the time and can crash the system.

2. compile the kernel source with multiple threads:
# make -j N
this produces a Call Trace similar to the one above, but it
*didn't* crash the system

The whole dmesg log and config file are attached.

I have also tested the mainline kernel without the sched/numa patch set
applied; there is no such issue there.

please let me know if you need more info.

Thanks,
Zhouping

>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]


Attachments:
config_sched_numa (111.99 kB)
dmesg.log (20.48 kB)

2012-10-26 09:08:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0

> [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30

Johannes, this looks like the thp migration memcg hookery gone bad,
could you have a look at this?

2012-10-26 09:20:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches


* Peter Zijlstra <[email protected]> wrote:

> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>
> > [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Meanwhile, Zhouping Liu, could you please not apply the last
patch:

[PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

and see whether it boots/works without that?

Thanks,

Ingo

2012-10-26 09:39:16

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
>
> [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

OK, I have reverted the 31st patch and will provide the results here after I
finish testing.

Thanks,
Zhouping

2012-10-26 10:18:22

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Meanwhile, Zhouping Liu, could you please not apply the last
> patch:
>
> [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>
> and see whether it boots/works without that?

Hi Ingo,

your assumption is right: after reverting the 31st patch (sched, numa, mm:
Add memcg support to do_huge_pmd_numa_page())
the issue is gone, thank you.


Thanks,
Zhouping

>
> Thanks,
>
> Ingo
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

2012-10-26 10:24:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches


* Zhouping Liu <[email protected]> wrote:

> On 10/26/2012 05:20 PM, Ingo Molnar wrote:
> >* Peter Zijlstra <[email protected]> wrote:
> >
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> >>>[ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> >>>[ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> >>>[ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> >>>[ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> >>>[ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> >>>[ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> >>>[ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> >>>[ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Meanwhile, Zhouping Liu, could you please not apply the last
> >patch:
> >
> > [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> >
> >and see whether it boots/works without that?
>
> Hi Ingo,
>
> your assumption is right: after reverting the 31st patch (sched, numa,
> mm: Add memcg support to do_huge_pmd_numa_page())
> the issue is gone, thank you.

The tested bits you can find in the numa/core tree:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git numa/core

It includes all changes (patches #1-#30) except patch #31 - I
wanted to test and apply that last patch today, but won't do it
now that you've reported this regression.

Thanks,

Ingo

2012-10-28 17:56:32

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>
> > [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>
> Johannes, this looks like the thp migration memcg hookery gone bad,
> could you have a look at this?

Oops. Here is an incremental fix, feel free to fold it into #31.

Signed-off-by: Johannes Weiner <[email protected]>
---

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c30a14..0d7ebd3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!new_page)
goto alloc_fail;

- mem_cgroup_prepare_migration(page, new_page, &memcg);
-
lru = PageLRU(page);

if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
@@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

return;
}
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);

entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, entry);
page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);

put_page(page); /* Drop the rmap reference */
@@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

unlock_page(new_page);

- mem_cgroup_end_migration(memcg, page, new_page, true);
-
unlock_page(page);
put_page(page); /* Drop the local reference */

return;

alloc_fail:
- if (new_page) {
- mem_cgroup_end_migration(memcg, page, new_page, false);
+ if (new_page)
put_page(new_page);
- }

unlock_page(page);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7acf43b..011e510 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg = NULL;
+ unsigned int nr_pages = 1;
struct page_cgroup *pc;
enum charge_type ctype;

*memcgp = NULL;

- VM_BUG_ON(PageTransHuge(page));
if (mem_cgroup_disabled())
return;

+ if (PageTransHuge(page))
+ nr_pages <<= compound_order(page);
+
pc = lookup_page_cgroup(page);
lock_page_cgroup(pc);
if (PageCgroupUsed(pc)) {
@@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
* charged to the res_counter since we plan on replacing the
* old one and only one page is going to be left afterwards.
*/
- __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
+ __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
}

/* remove redundant charge if migration failed*/

2012-10-29 02:42:54

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Oops. Here is an incremental fix, feel free to fold it into #31.

Hi Johannes,

Tested the below patch, and I'm sure it has fixed the above issue, thank
you.

Zhouping

>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5c30a14..0d7ebd3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!new_page)
> goto alloc_fail;
>
> - mem_cgroup_prepare_migration(page, new_page, &memcg);
> -
> lru = PageLRU(page);
>
> if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> @@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> return;
> }
> + /*
> + * Traditional migration needs to prepare the memcg charge
> + * transaction early to prevent the old page from being
> + * uncharged when installing migration entries. Here we can
> + * save the potential rollback and start the charge transfer
> + * only when migration is already known to end successfully.
> + */
> + mem_cgroup_prepare_migration(page, new_page, &memcg);
>
> entry = mk_pmd(new_page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> set_pmd_at(mm, haddr, pmd, entry);
> update_mmu_cache_pmd(vma, address, entry);
> page_remove_rmap(page);
> + /*
> + * Finish the charge transaction under the page table lock to
> + * prevent split_huge_page() from dividing up the charge
> + * before it's fully transferred to the new page.
> + */
> + mem_cgroup_end_migration(memcg, page, new_page, true);
> spin_unlock(&mm->page_table_lock);
>
> put_page(page); /* Drop the rmap reference */
> @@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> unlock_page(new_page);
>
> - mem_cgroup_end_migration(memcg, page, new_page, true);
> -
> unlock_page(page);
> put_page(page); /* Drop the local reference */
>
> return;
>
> alloc_fail:
> - if (new_page) {
> - mem_cgroup_end_migration(memcg, page, new_page, false);
> + if (new_page)
> put_page(new_page);
> - }
>
> unlock_page(page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..011e510 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> struct mem_cgroup **memcgp)
> {
> struct mem_cgroup *memcg = NULL;
> + unsigned int nr_pages = 1;
> struct page_cgroup *pc;
> enum charge_type ctype;
>
> *memcgp = NULL;
>
> - VM_BUG_ON(PageTransHuge(page));
> if (mem_cgroup_disabled())
> return;
>
> + if (PageTransHuge(page))
> + nr_pages <<= compound_order(page);
> +
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> @@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> * charged to the res_counter since we plan on replacing the
> * old one and only one page is going to be left afterwards.
> */
> - __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
> + __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
> }
>
> /* remove redundant charge if migration failed*/

2012-10-29 06:50:51

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()


* Zhouping Liu <[email protected]> wrote:

> Hi Johannes,
>
> Tested the below patch, and I'm sure it has fixed the above
> issue, thank you.

Thanks. Below is the folded up patch.

Ingo

---------------------------->
Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
From: Johannes Weiner <[email protected]>
Date: Thu Oct 25 12:49:51 CEST 2012

Add memory control group support to hugepage migration.

Signed-off-by: Johannes Weiner <[email protected]>
Tested-by: Zhouping Liu <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
mm/huge_memory.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

Index: tip/mm/huge_memory.c
===================================================================
--- tip.orig/mm/huge_memory.c
+++ tip/mm/huge_memory.c
@@ -743,6 +743,7 @@ void do_huge_pmd_numa_page(struct mm_str
unsigned int flags, pmd_t entry)
{
unsigned long haddr = address & HPAGE_PMD_MASK;
+ struct mem_cgroup *memcg = NULL;
struct page *new_page = NULL;
struct page *page = NULL;
int node, lru;
@@ -833,6 +834,14 @@ migrate:

return;
}
+ /*
+ * Traditional migration needs to prepare the memcg charge
+ * transaction early to prevent the old page from being
+ * uncharged when installing migration entries. Here we can
+ * save the potential rollback and start the charge transfer
+ * only when migration is already known to end successfully.
+ */
+ mem_cgroup_prepare_migration(page, new_page, &memcg);

entry = mk_pmd(new_page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
@@ -843,6 +852,12 @@ migrate:
set_pmd_at(mm, haddr, pmd, entry);
update_mmu_cache_pmd(vma, address, entry);
page_remove_rmap(page);
+ /*
+ * Finish the charge transaction under the page table lock to
+ * prevent split_huge_page() from dividing up the charge
+ * before it's fully transferred to the new page.
+ */
+ mem_cgroup_end_migration(memcg, page, new_page, true);
spin_unlock(&mm->page_table_lock);

put_page(page); /* Drop the rmap reference */

2012-10-29 08:24:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

Hello Ingo!

On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
>
> * Zhouping Liu <[email protected]> wrote:
>
> > Hi Johannes,
> >
> > Tested the below patch, and I'm sure it has fixed the above
> > issue, thank you.
>
> Thanks. Below is the folded up patch.
>
> Ingo
>
> ---------------------------->
> Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> From: Johannes Weiner <[email protected]>
> Date: Thu Oct 25 12:49:51 CEST 2012
>
> Add memory control group support to hugepage migration.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Tested-by: Zhouping Liu <[email protected]>
> Link: http://lkml.kernel.org/n/[email protected]
> Signed-off-by: Ingo Molnar <[email protected]>
> ---
> mm/huge_memory.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)

Did the mm/memcontrol.c part go missing?

2012-10-29 08:34:35

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()

On 10/29/2012 04:24 PM, Johannes Weiner wrote:
> Hello Ingo!
>
> On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
>> * Zhouping Liu <[email protected]> wrote:
>>
>>> Hi Johannes,
>>>
>>> Tested the below patch, and I'm sure it has fixed the above
>>> issue, thank you.
>> Thanks. Below is the folded up patch.
>>
>> Ingo
>>
>> ---------------------------->
>> Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
>> From: Johannes Weiner <[email protected]>
>> Date: Thu Oct 25 12:49:51 CEST 2012
>>
>> Add memory control group support to hugepage migration.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> Tested-by: Zhouping Liu <[email protected]>
>> Link: http://lkml.kernel.org/n/[email protected]
>> Signed-off-by: Ingo Molnar <[email protected]>
>> ---
>> mm/huge_memory.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
> Did the mm/memcontrol.c part go missing?

I think so, as the issue still exists with this patch

Thanks,
Zhouping

2012-10-29 11:15:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()


* Johannes Weiner <[email protected]> wrote:

> Hello Ingo!
>
> On Mon, Oct 29, 2012 at 07:50:44AM +0100, Ingo Molnar wrote:
> >
> > * Zhouping Liu <[email protected]> wrote:
> >
> > > Hi Johannes,
> > >
> > > Tested the below patch, and I'm sure it has fixed the above
> > > issue, thank you.
> >
> > Thanks. Below is the folded up patch.
> >
> > Ingo
> >
> > ---------------------------->
> > Subject: sched, numa, mm: Add memcg support to do_huge_pmd_numa_page()
> > From: Johannes Weiner <[email protected]>
> > Date: Thu Oct 25 12:49:51 CEST 2012
> >
> > Add memory control group support to hugepage migration.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > Tested-by: Zhouping Liu <[email protected]>
> > Link: http://lkml.kernel.org/n/[email protected]
> > Signed-off-by: Ingo Molnar <[email protected]>
> > ---
> > mm/huge_memory.c | 15 +++++++++++++++
> > 1 file changed, 15 insertions(+)
>
> Did the mm/memcontrol.c part go missing?

Yes :-/

Fixing it up now.

Thanks,

Ingo

2012-10-30 06:27:36

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
>> On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
>>> [ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
>>> [ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
>>> [ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
>>> [ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
>>> [ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
>>> [ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
>>> [ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
>>> [ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
>>> [ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
>> Johannes, this looks like the thp migration memcg hookery gone bad,
>> could you have a look at this?
> Oops. Here is an incremental fix, feel free to fold it into #31.
Hello Johannes,

I'm not sure the patch below completely fixes this issue, as I
found a new error (maybe similar to this one):

[88099.923724] ------------[ cut here ]------------
[88099.924036] kernel BUG at mm/memcontrol.c:1134!
[88099.924036] invalid opcode: 0000 [#1] SMP
[88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
drm_kms_helper ttm drm i2c_core
[88099.924036] CPU 7
[88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
Dell Inc. PowerEdge 6950/0WN213
[88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
mem_cgroup_update_lru_size+0x27/0x30
[88099.924036] RSP: 0000:ffff88021b247ca8 EFLAGS: 00010082
[88099.924036] RAX: ffff88011d310138 RBX: ffffea0002f18000 RCX:
0000000000000001
[88099.924036] RDX: fffffffffffffe00 RSI: 000000000000000e RDI:
ffff88011d310138
[88099.924036] RBP: ffff88021b247ca8 R08: 0000000000000000 R09:
a8000bc600000000
[88099.924036] R10: 0000000000000000 R11: 0000000000000000 R12:
00000000fffffe00
[88099.924036] R13: ffff88011ffecb40 R14: 0000000000000286 R15:
0000000000000000
[88099.924036] FS: 00007f787d0bf740(0000) GS:ffff88021fc80000(0000)
knlGS:0000000000000000
[88099.924036] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[88099.924036] CR2: 00007f7873a00010 CR3: 000000021bda0000 CR4:
00000000000007e0
[88099.924036] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[88099.924036] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[88099.924036] Process stress (pid: 3441, threadinfo ffff88021b246000,
task ffff88021b399760)
[88099.924036] Stack:
[88099.924036] ffff88021b247cf8 ffffffff8113a9cd ffffea0002f18000
ffff88011d310138
[88099.924036] 0000000000000200 ffffea0002f18000 ffff88019bace580
00007f7873c00000
[88099.924036] ffff88021aca0cf0 ffffea00081e0000 ffff88021b247d18
ffffffff8113aa7d
[88099.924036] Call Trace:
[88099.924036] [<ffffffff8113a9cd>] __page_cache_release.part.11+0xdd/0x140
[88099.924036] [<ffffffff8113aa7d>] __put_compound_page+0x1d/0x30
[88099.924036] [<ffffffff8113ac4d>] put_compound_page+0x5d/0x1e0
[88099.924036] [<ffffffff8113b1a5>] put_page+0x45/0x50
[88099.924036] [<ffffffff8118378c>] do_huge_pmd_numa_page+0x2ec/0x4e0
[88099.924036] [<ffffffff81158089>] handle_mm_fault+0x1e9/0x360
[88099.924036] [<ffffffff8162cd22>] __do_page_fault+0x172/0x4e0
[88099.924036] [<ffffffff810958b9>] ? task_numa_work+0x1c9/0x220
[88099.924036] [<ffffffff8107c56c>] ? task_work_run+0xac/0xe0
[88099.924036] [<ffffffff8162d09e>] do_page_fault+0xe/0x10
[88099.924036] [<ffffffff816296d8>] page_fault+0x28/0x30
[88099.924036] Code: 00 00 00 00 66 66 66 66 90 44 8b 1d 1c 90 b5 00 55
48 89 e5 45 85 db 75 10 89 f6 48 63 d2 48 83 c6 0e 48 01 54 f7 08 78 02
5d c3 <0f> 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 e5 48 83 ec
[88099.924036] RIP [<ffffffff81188e97>]
mem_cgroup_update_lru_size+0x27/0x30
[88099.924036] RSP <ffff88021b247ca8>
[88099.924036] ---[ end trace c8d6b169e0c3f25a ]---
[88108.054610] ------------[ cut here ]------------
[88108.054610] WARNING: at kernel/watchdog.c:245
watchdog_overflow_callback+0x9c/0xd0()
[88108.054610] Hardware name: PowerEdge 6950
[88108.054610] Watchdog detected hard LOCKUP on cpu 3
[88108.054610] Modules linked in: lockd sunrpc kvm_amd kvm
amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
drm_kms_helper ttm drm i2c_core
[88108.054610] Pid: 3429, comm: stress Tainted: G D 3.7.0-rc2Jons+ #3
[88108.054610] Call Trace:
[88108.054610] <NMI> [<ffffffff8105c29f>] warn_slowpath_common+0x7f/0xc0
[88108.054610] [<ffffffff8105c396>] warn_slowpath_fmt+0x46/0x50
[88108.054610] [<ffffffff81093fa8>] ? sched_clock_cpu+0xa8/0x120
[88108.054610] [<ffffffff810e95c0>] ? touch_nmi_watchdog+0x80/0x80
[88108.054610] [<ffffffff810e965c>] watchdog_overflow_callback+0x9c/0xd0
[88108.054610] [<ffffffff81124e6d>] __perf_event_overflow+0x9d/0x230
[88108.054610] [<ffffffff81121f44>] ? perf_event_update_userpage+0x24/0x110
[88108.054610] [<ffffffff81125a74>] perf_event_overflow+0x14/0x20
[88108.054610] [<ffffffff8102440a>] x86_pmu_handle_irq+0x10a/0x160
[88108.054610] [<ffffffff8162ac4d>] perf_event_nmi_handler+0x1d/0x20
[88108.054610] [<ffffffff8162a411>] nmi_handle.isra.0+0x51/0x80
[88108.054610] [<ffffffff8162a5b9>] do_nmi+0x179/0x350
[88108.054610] [<ffffffff81629a30>] end_repeat_nmi+0x1e/0x2e
[88108.054610] [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610] [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610] [<ffffffff816290c2>] ? _raw_spin_lock_irqsave+0x32/0x40
[88108.054610] <<EOE>> [<ffffffff8113b087>] pagevec_lru_move_fn+0x97/0x110
[88108.054610] [<ffffffff8113a5f0>] ? pagevec_move_tail_fn+0x80/0x80
[88108.054610] [<ffffffff8113b11c>] __pagevec_lru_add+0x1c/0x20
[88108.054610] [<ffffffff8113b4e8>] __lru_cache_add+0x68/0x90
[88108.054610] [<ffffffff8113b71b>] lru_cache_add_lru+0x3b/0x60
[88108.054610] [<ffffffff81161151>] page_add_new_anon_rmap+0xc1/0x170
[88108.054610] [<ffffffff811854b2>] do_huge_pmd_anonymous_page+0x242/0x330
[88108.054610] [<ffffffff81158162>] handle_mm_fault+0x2c2/0x360
[88108.054610] [<ffffffff8162cd22>] __do_page_fault+0x172/0x4e0
[88108.054610] [<ffffffff8109520f>] ? __dequeue_entity+0x2f/0x50
[88108.054610] [<ffffffff810125d1>] ? __switch_to+0x181/0x4a0
[88108.054610] [<ffffffff8162d09e>] do_page_fault+0xe/0x10
[88108.054610] [<ffffffff816296d8>] page_fault+0x28/0x30
[88108.054610] ---[ end trace c8d6b169e0c3f25b ]---
......
......

It's easy to reproduce with the stress[1] workload.
The command I used is '# stress -i 20 -m 30 -v'.

I will report it under a new subject if it's a new issue.

Let me know if you need other info.

[1] http://weather.ou.edu/~apw/projects/stress/

Thanks,
Zhouping
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5c30a14..0d7ebd3 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -801,8 +801,6 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> if (!new_page)
> goto alloc_fail;
>
> - mem_cgroup_prepare_migration(page, new_page, &memcg);
> -
> lru = PageLRU(page);
>
> if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> @@ -835,6 +833,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> return;
> }
> + /*
> + * Traditional migration needs to prepare the memcg charge
> + * transaction early to prevent the old page from being
> + * uncharged when installing migration entries. Here we can
> + * save the potential rollback and start the charge transfer
> + * only when migration is already known to end successfully.
> + */
> + mem_cgroup_prepare_migration(page, new_page, &memcg);
>
> entry = mk_pmd(new_page, vma->vm_page_prot);
> entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
> @@ -845,6 +851,12 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> set_pmd_at(mm, haddr, pmd, entry);
> update_mmu_cache_pmd(vma, address, entry);
> page_remove_rmap(page);
> + /*
> + * Finish the charge transaction under the page table lock to
> + * prevent split_huge_page() from dividing up the charge
> + * before it's fully transferred to the new page.
> + */
> + mem_cgroup_end_migration(memcg, page, new_page, true);
> spin_unlock(&mm->page_table_lock);
>
> put_page(page); /* Drop the rmap reference */
> @@ -856,18 +868,14 @@ void do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> unlock_page(new_page);
>
> - mem_cgroup_end_migration(memcg, page, new_page, true);
> -
> unlock_page(page);
> put_page(page); /* Drop the local reference */
>
> return;
>
> alloc_fail:
> - if (new_page) {
> - mem_cgroup_end_migration(memcg, page, new_page, false);
> + if (new_page)
> put_page(new_page);
> - }
>
> unlock_page(page);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7acf43b..011e510 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3255,15 +3255,18 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> struct mem_cgroup **memcgp)
> {
> struct mem_cgroup *memcg = NULL;
> + unsigned int nr_pages = 1;
> struct page_cgroup *pc;
> enum charge_type ctype;
>
> *memcgp = NULL;
>
> - VM_BUG_ON(PageTransHuge(page));
> if (mem_cgroup_disabled())
> return;
>
> + if (PageTransHuge(page))
> + nr_pages <<= compound_order(page);
> +
> pc = lookup_page_cgroup(page);
> lock_page_cgroup(pc);
> if (PageCgroupUsed(pc)) {
> @@ -3325,7 +3328,7 @@ void mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
> * charged to the res_counter since we plan on replacing the
> * old one and only one page is going to be left afterwards.
> */
> - __mem_cgroup_commit_charge(memcg, newpage, 1, ctype, false);
> + __mem_cgroup_commit_charge(memcg, newpage, nr_pages, ctype, false);
> }
>
> /* remove redundant charge if migration failed*/

2012-10-30 12:20:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
> Hi all,
>
> Here's a re-post of the NUMA scheduling and migration improvement
> patches that we are working on. These include techniques from
> AutoNUMA and the sched/numa tree and form a unified basis - it
> has got all the bits that look good and mergeable.
>

Thanks for the repost. I have not even started a review yet as I was
travelling and only got back online today. It will be another day or two
before I can start, but I was at least able to do a comparison test between
autonuma and schednuma today to see which actually performs the best. Even
without the review I was able to apply similar vmstat patches to those used
with autonuma
to give a rough estimate of the relative overhead of both implementations.

Machine was a 4-node box running autonumabench and specjbb.

The three kernels are:

3.7-rc2-stats-v2r1 vmstat patches on top
3.7-rc2-autonuma-v27 latest autonuma with stats on top
3.7-rc2-schednuma-v1r3 schednuma series minus the last patch + stats

AUTONUMA BENCH
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
User NUMA01 67145.71 ( 0.00%) 30110.13 ( 55.16%) 61666.46 ( 8.16%)
User NUMA01_THEADLOCAL 55104.60 ( 0.00%) 17285.49 ( 68.63%) 17135.48 ( 68.90%)
User NUMA02 7074.54 ( 0.00%) 2219.11 ( 68.63%) 2226.09 ( 68.53%)
User NUMA02_SMT 2916.86 ( 0.00%) 999.19 ( 65.74%) 1038.06 ( 64.41%)
System NUMA01 42.28 ( 0.00%) 469.07 (-1009.44%) 2808.08 (-6541.63%)
System NUMA01_THEADLOCAL 41.71 ( 0.00%) 183.24 (-339.32%) 174.92 (-319.37%)
System NUMA02 34.67 ( 0.00%) 27.85 ( 19.67%) 15.03 ( 56.65%)
System NUMA02_SMT 0.89 ( 0.00%) 18.36 (-1962.92%) 5.05 (-467.42%)
Elapsed NUMA01 1512.97 ( 0.00%) 698.18 ( 53.85%) 1422.71 ( 5.97%)
Elapsed NUMA01_THEADLOCAL 1264.23 ( 0.00%) 389.51 ( 69.19%) 377.51 ( 70.14%)
Elapsed NUMA02 181.52 ( 0.00%) 60.65 ( 66.59%) 52.86 ( 70.88%)
Elapsed NUMA02_SMT 163.59 ( 0.00%) 58.57 ( 64.20%) 48.82 ( 70.16%)
CPU NUMA01 4440.00 ( 0.00%) 4379.00 ( 1.37%) 4531.00 ( -2.05%)
CPU NUMA01_THEADLOCAL 4362.00 ( 0.00%) 4484.00 ( -2.80%) 4585.00 ( -5.11%)
CPU NUMA02 3916.00 ( 0.00%) 3704.00 ( 5.41%) 4239.00 ( -8.25%)
CPU NUMA02_SMT 1783.00 ( 0.00%) 1737.00 ( 2.58%) 2136.00 (-19.80%)

Two figures really matter here - System CPU usage and Elapsed time.

autonuma was known to hurt system CPU usage for the NUMA01 test case but
schednuma does *far* worse. I do not have a breakdown of where this time
is being spent but the raw figure is bad. autonuma is 10 times worse
than a vanilla kernel and schednuma is 5 times worse than autonuma.

For the overhead of the other test cases, schednuma is roughly
comparable with autonuma -- i.e. both pretty high overhead.

In terms of elapsed time, autonuma in the NUMA01 test case massively
improves elapsed time while schednuma barely makes a dent on it. Looking
at the memory usage per node (I generated a graph offline), it appears
that schednuma does not migrate pages to other nodes fast enough. The
convergence figures do not reflect this because the convergence seems
high (towards 1) but it may be because the approximation using faults is
insufficient.

In the other cases, schednuma does well and is comparable to autonuma.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
User 132248.88 50620.50 82073.11
System 120.19 699.12 3003.83
Elapsed 3131.10 1215.63 1911.55

This is the overall time to complete the test. autonuma is way better
than schednuma but this is all due to how it handles the NUMA01 test
case.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Page Ins 37256 37508 37360
Page Outs 28888 13372 19488
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 0 0 0
Kswapd pages scanned 0 0 0
Kswapd pages reclaimed 0 0 0
Direct pages reclaimed 0 0 0
Kswapd efficiency 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000
Direct efficiency 100% 100% 100%
Direct velocity 0.000 0.000 0.000
Percentage direct scans 0% 0% 0%
Page writes by reclaim 0 0 0
Page writes file 0 0 0
Page writes anon 0 0 0
Page reclaim immediate 0 0 0
Page rescued immediate 0 0 0
Slabs scanned 0 0 0
Direct inode steals 0 0 0
Kswapd inode steals 0 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 17370 17923 13399
THP collapse alloc 6 12385 3
THP splits 3 12577 0
THP fault fallback 0 0 0
THP collapse fail 0 0 0
Compaction stalls 0 0 0
Compaction success 0 0 0
Compaction failures 0 0 0
Page migrate success 0 7061327 57167
Page migrate failure 0 0 0
Compaction pages isolated 0 0 0
Compaction migrate scanned 0 0 0
Compaction free scanned 0 0 0
Compaction cost 0 7329 59
NUMA PTE updates 0 191503 123214
NUMA hint faults 0 13322261 818762
NUMA hint local faults 0 9813514 756797
NUMA pages migrated 0 7061327 57167
AutoNUMA cost 0 66746 4095

The "THP collapse alloc" figures are interesting but reflect the fact
that schednuma can migrate THP pages natively whereas autonuma does
not.

The "Page migrate success" figure is more interesting. autonuma migrates
much more aggressively even though "NUMA PTE updates" are not that
different.

For reasons that are not immediately obvious, autonuma incurs far more
"NUMA hint faults" even though the PTE updates are not that different. I
expect when I actually review the code this will be due to differences
in how and when the two implementations decide to mark a PTE PROT_NONE.
A stronger possibility is that autonuma is not natively migrating THP
pages. I also expect autonuma is continually scanning whereas schednuma is
reacting to some other external event or at least less frequently scanning.
Obviously, I cannot rule out the possibility that the stats patch was buggy.

Because of the fewer faults, the "cost model" for schednuma is lower.
Obviously there is a disconnect here because System CPU usage is high
but the cost model only takes a few limited variables into account.

In terms of absolute performance (elapsed time), autonuma is currently
better than schednuma. schednuma has high System CPU overhead in one case
for some unknown reason and introduces a lot of overhead, but in general
it did less work than autonuma as it incurred fewer faults.

Finally, I recorded node-load-misses,node-store-misses events. These are
the total number of events recorded

stats-v2r1 94600194
autonuma 945370766
schednuma 2828322708

It was surprising to me that the number of events recorded was higher -
page table accesses maybe? Either way, schednuma missed a *LOT* more
than autonuma but maybe I'm misinterpreting the meaning of the
node-load-misses,node-store-misses events as I haven't had the chance
yet to dig down and see what perf maps those events onto.

SPECJBB BOPS
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Mean 1 25960.00 ( 0.00%) 24884.25 ( -4.14%) 25056.00 ( -3.48%)
Mean 2 53997.50 ( 0.00%) 55744.25 ( 3.23%) 52165.75 ( -3.39%)
Mean 3 78454.25 ( 0.00%) 82321.75 ( 4.93%) 76939.25 ( -1.93%)
Mean 4 101131.25 ( 0.00%) 106996.50 ( 5.80%) 99365.00 ( -1.75%)
Mean 5 120807.00 ( 0.00%) 129999.50 ( 7.61%) 118492.00 ( -1.92%)
Mean 6 135793.50 ( 0.00%) 152013.25 ( 11.94%) 133139.75 ( -1.95%)
Mean 7 137686.75 ( 0.00%) 158556.00 ( 15.16%) 136070.25 ( -1.17%)
Mean 8 135802.25 ( 0.00%) 160725.50 ( 18.35%) 140158.75 ( 3.21%)
Mean 9 129194.00 ( 0.00%) 161531.00 ( 25.03%) 137308.00 ( 6.28%)
Mean 10 125457.00 ( 0.00%) 156800.00 ( 24.98%) 136357.50 ( 8.69%)
Mean 11 121733.75 ( 0.00%) 154211.25 ( 26.68%) 138089.50 ( 13.44%)
Mean 12 110556.25 ( 0.00%) 149009.75 ( 34.78%) 138835.50 ( 25.58%)
Mean 13 107484.75 ( 0.00%) 144792.25 ( 34.71%) 128099.50 ( 19.18%)
Mean 14 105733.00 ( 0.00%) 141304.75 ( 33.64%) 118950.50 ( 12.50%)
Mean 15 104492.00 ( 0.00%) 138179.00 ( 32.24%) 119325.75 ( 14.20%)
Mean 16 103312.75 ( 0.00%) 136635.00 ( 32.25%) 116104.50 ( 12.38%)
Mean 17 101999.25 ( 0.00%) 134625.00 ( 31.99%) 114375.75 ( 12.13%)
Mean 18 100107.75 ( 0.00%) 132831.25 ( 32.69%) 114352.25 ( 14.23%)
TPut 1 103840.00 ( 0.00%) 99537.00 ( -4.14%) 100224.00 ( -3.48%)
TPut 2 215990.00 ( 0.00%) 222977.00 ( 3.23%) 208663.00 ( -3.39%)
TPut 3 313817.00 ( 0.00%) 329287.00 ( 4.93%) 307757.00 ( -1.93%)
TPut 4 404525.00 ( 0.00%) 427986.00 ( 5.80%) 397460.00 ( -1.75%)
TPut 5 483228.00 ( 0.00%) 519998.00 ( 7.61%) 473968.00 ( -1.92%)
TPut 6 543174.00 ( 0.00%) 608053.00 ( 11.94%) 532559.00 ( -1.95%)
TPut 7 550747.00 ( 0.00%) 634224.00 ( 15.16%) 544281.00 ( -1.17%)
TPut 8 543209.00 ( 0.00%) 642902.00 ( 18.35%) 560635.00 ( 3.21%)
TPut 9 516776.00 ( 0.00%) 646124.00 ( 25.03%) 549232.00 ( 6.28%)
TPut 10 501828.00 ( 0.00%) 627200.00 ( 24.98%) 545430.00 ( 8.69%)
TPut 11 486935.00 ( 0.00%) 616845.00 ( 26.68%) 552358.00 ( 13.44%)
TPut 12 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
TPut 13 429939.00 ( 0.00%) 579169.00 ( 34.71%) 512398.00 ( 19.18%)
TPut 14 422932.00 ( 0.00%) 565219.00 ( 33.64%) 475802.00 ( 12.50%)
TPut 15 417968.00 ( 0.00%) 552716.00 ( 32.24%) 477303.00 ( 14.20%)
TPut 16 413251.00 ( 0.00%) 546540.00 ( 32.25%) 464418.00 ( 12.38%)
TPut 17 407997.00 ( 0.00%) 538500.00 ( 31.99%) 457503.00 ( 12.13%)
TPut 18 400431.00 ( 0.00%) 531325.00 ( 32.69%) 457409.00 ( 14.23%)

In reality, this report is larger but I chopped it down a bit for
brevity. autonuma beats schednuma *heavily* on this benchmark both in
terms of average operations per numa node and overall throughput.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 8.00 ( 14.29%)
Actual Peak Bops 550747.00 ( 0.00%) 646124.00 ( 17.32%) 560635.00 ( 1.80%)

autonuma was also able to handle more simultaneous warehouses peaking at
9 warehouses in comparison to schednuma's 8 and the normal kernel's 7. Of
course all fell short of the expected peak of 12 but that's neither here
nor there.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
User 481580.26 478759.35 464261.89
System 179.35 803.59 16577.76
Elapsed 10398.85 10354.08 10383.61

Duration is the same but the benchmark should run for roughly the same
length of time each time so that is not earth shattering.

However, look at the System CPU usage. autonuma was bad but schednuma is
*completely* out of control.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
Page Ins 33220 33896 33664
Page Outs 111332 113116 115972
Swap Ins 0 0 0
Swap Outs 0 0 0
Direct pages scanned 0 0 0
Kswapd pages scanned 0 0 0
Kswapd pages reclaimed 0 0 0
Direct pages reclaimed 0 0 0
Kswapd efficiency 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000
Direct efficiency 100% 100% 100%
Direct velocity 0.000 0.000 0.000
Percentage direct scans 0% 0% 0%
Page writes by reclaim 0 0 0
Page writes file 0 0 0
Page writes anon 0 0 0
Page reclaim immediate 0 0 0
Page rescued immediate 0 0 0
Slabs scanned 0 0 0
Direct inode steals 0 0 0
Kswapd inode steals 0 0 0
Kswapd skipped wait 0 0 0
THP fault alloc 1 2 1
THP collapse alloc 0 21 0
THP splits 0 1 0
THP fault fallback 0 0 0
THP collapse fail 0 0 0
Compaction stalls 0 0 0
Compaction success 0 0 0
Compaction failures 0 0 0
Page migrate success 0 8070314 399095844
Page migrate failure 0 0 0
Compaction pages isolated 0 0 0
Compaction migrate scanned 0 0 0
Compaction free scanned 0 0 0
Compaction cost 0 8376 414261
NUMA PTE updates 0 3841 1110729
NUMA hint faults 0 2033295070 2945111212
NUMA hint local faults 0 1895230022 2545845756
NUMA pages migrated 0 8070314 399095844
AutoNUMA cost 0 10166628 14733146

Interesting to note that native THP migration makes no difference here.

schednuma migrated a lot more aggressively in this test, and incurred
*way* more PTE updates. I have no explanation for this but overall
schednuma was far heavier than autonuma.

So, without reviewing the code at all, it seems to me that schednuma is
not the obvious choice for merging above autonuma as the merge to -tip
implied -- at least based on these figures. By and large, autonuma seems
to perform better and while I know that some of its paths are heavy, it
was also clear during review of the code that the overhead could have been
reduced incrementally. Maybe the same can be said for schednuma, we'll see,
but I expect that the actual performance will be taken into account during
merging, as well as the relative maintenance effort.


> Please review .. once again and holler if you see anything funny! :-)
>

Consider the figures above to be a hollering that I think something
might be screwy in schednuma :)

I'll do a release of mmtests if you want to use the same benchmarks or
see if I messed up how it was benchmarked which is quite possible as
this was a rush job while I was travelling.

--
Mel Gorman
SUSE Labs

2012-10-30 15:28:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches


On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <[email protected]> wrote:

> ...

Useful testing - thanks. Did I miss the description of what
autonumabench actually does? How representative is it of real-world
things?

> I also expect autonuma is continually scanning whereas schednuma is
> reacting to some other external event or at least less frequently scanning.

Might this imply that autonuma is consuming more CPU in kernel threads,
the cost of which didn't get included in these results?

2012-10-30 16:59:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Tue, Oct 30, 2012 at 08:28:10AM -0700, Andrew Morton wrote:
>
> On Tue, 30 Oct 2012 12:20:32 +0000 Mel Gorman <[email protected]> wrote:
>
> > ...
>
> Useful testing - thanks. Did I miss the description of what
> autonumabench actually does? How representative is it of real-world
> things?
>

It's not representative of anything at all. It's a synthetic benchmark
that just measures if automatic NUMA migration (whatever the mechanism)
is working as expected. I'm not aware of a decent description of what
the test does and why. Here is my current interpretation and hopefully
Andrea will correct me if I'm wrong.

NUMA01
Two processes
NUM_CPUS/2 number of threads so all CPUs are in use

On startup, the process forks
Each process mallocs a 3G buffer but there is no communication
between the processes.
Threads are created that zero out the full buffer 1000 times

The objective of the test is that initially the two processes
allocate their memory on the same node. As the threads are
created, the memory will migrate from the initial node to
nodes that are closer to the referencing thread.

It is worth noting that this benchmark is specifically tuned
for two nodes and the expectation is that the two processes
and their threads split so that all of process A's threads run on node 0
and all of process B's threads run on node 1

With 4 and more nodes, this is actually an adverse workload.
As all the buffer is zeroed in both processes, there is an
expectation that it will continually bounce between two nodes.

So, on 2 nodes, this benchmark tests convergence. On 4 or more
nodes, this partially measures how much busy work automatic
NUMA migration does and it'll be very noisy due to cache conflicts.

NUMA01_THREADLOCAL
Two processes
NUM_CPUS/2 number of threads so all CPUs are in use

On startup, the process forks
Each process mallocs a 3G buffer but there is no communication
between the processes
Threads are created that zero out their own subset of the buffer.
Each subset is 3G/NR_THREADS in size

This benchmark is more realistic. In an ideal situation, each
thread will migrate its data to its local node. The test is really
to see whether it converges and how quickly (a small sketch of this
access pattern follows the benchmark descriptions below).

NUMA02
One process, NR_CPU threads

On startup, malloc a 1G buffer
Create threads that zero out a thread-local portion of the buffer.
Each thread zeroes its portion multiple times; the iteration count
is fixed and seems to just be there so the test runs for a period
of time

This is similar in principle to NUMA01_THREADLOCAL except that only
one process is involved. I think it was aimed at being more JVM-like.

NUMA02_SMT
One process, NR_CPU/2 threads

This is a variation of NUMA02 except that half the cores are idle. It
checks whether the system migrates the memory to two or more nodes, or
whether it tries to fit everything in one node even though the memory
should migrate to be close to the CPUs in use
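
As a rough sketch of my own (not the actual autonumabench source), the
NUMA01_THREADLOCAL-style access pattern described above looks roughly
like this, with each thread repeatedly zeroing only its own slice of a
shared buffer:

  #include <pthread.h>
  #include <stdlib.h>
  #include <string.h>

  #define NR_THREADS 16
  #define BUF_SIZE   (3UL << 30)          /* 3G buffer, as described above */
  #define NR_LOOPS   100                  /* loop count is illustrative */

  static char *buf;

  static void *worker(void *arg)
  {
          long id = (long)arg;
          size_t chunk = BUF_SIZE / NR_THREADS;
          char *mine = buf + id * chunk;  /* this thread's private slice */
          int i;

          /* With automatic NUMA balancing, each slice should end up
           * migrated to the node of the thread touching it. */
          for (i = 0; i < NR_LOOPS; i++)
                  memset(mine, 0, chunk);
          return NULL;
  }

  int main(void)
  {
          pthread_t threads[NR_THREADS];
          long i;

          buf = malloc(BUF_SIZE);
          if (!buf)
                  return 1;
          for (i = 0; i < NR_THREADS; i++)
                  pthread_create(&threads[i], NULL, worker, (void *)i);
          for (i = 0; i < NR_THREADS; i++)
                  pthread_join(threads[i], NULL);
          free(buf);
          return 0;
  }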

> > I also expect autonuma is continually scanning whereas schednuma is
> > reacting to some other external event or at least less frequently scanning.
>
> Might this imply that autonuma is consuming more CPU in kernel threads,
> the cost of which didn't get included in these results?

It might, but according to top, knuma_scand only used 7.86 seconds of CPU
time during the whole test and the time used by the migration threads is
also very low. Most migration threads used less than 1 second of CPU
time. Two migration threads used 2 seconds of CPU time each but that
still seems low.

--
Mel Gorman
SUSE Labs

2012-10-31 00:48:55

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> >>>[ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> >>>[ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> >>>[ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> >>>[ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> >>>[ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> >>>[ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> >>>[ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> >>>[ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> >>>[ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
> >>Johannes, this looks like the thp migration memcg hookery gone bad,
> >>could you have a look at this?
> >Oops. Here is an incremental fix, feel free to fold it into #31.
> Hello Johannes,
>
> I'm not sure the patch below completely fixes this issue, as I
> found a new error (maybe similar to this one):
>
> [88099.923724] ------------[ cut here ]------------
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode: 0000 [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> mem_cgroup_update_lru_size+0x27/0x30

Thanks a lot for your testing efforts, I really appreciate it.

I'm looking into it, but I don't expect power to come back for several
days where I live, so it's hard to reproduce it locally.

But that looks like an LRU accounting imbalance that I wasn't able to
tie to this patch yet. Do you see weird numbers for the lru counters
in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as
well.

Thanks,
Johannes

2012-10-31 07:26:40

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Tue, 30 Oct 2012, Johannes Weiner wrote:
> On Tue, Oct 30, 2012 at 02:29:25PM +0800, Zhouping Liu wrote:
> > On 10/29/2012 01:56 AM, Johannes Weiner wrote:
> > >On Fri, Oct 26, 2012 at 11:08:00AM +0200, Peter Zijlstra wrote:
> > >>On Fri, 2012-10-26 at 17:07 +0800, Zhouping Liu wrote:
> > >>>[ 180.918591] RIP: 0010:[<ffffffff8118c39a>] [<ffffffff8118c39a>] mem_cgroup_prepare_migration+0xba/0xd0
> > >>>[ 182.681450] [<ffffffff81183b60>] do_huge_pmd_numa_page+0x180/0x500
> > >>>[ 182.775090] [<ffffffff811585c9>] handle_mm_fault+0x1e9/0x360
> > >>>[ 182.863038] [<ffffffff81632b62>] __do_page_fault+0x172/0x4e0
> > >>>[ 182.950574] [<ffffffff8101c283>] ? __switch_to_xtra+0x163/0x1a0
> > >>>[ 183.041512] [<ffffffff8101281e>] ? __switch_to+0x3ce/0x4a0
> > >>>[ 183.126832] [<ffffffff8162d686>] ? __schedule+0x3c6/0x7a0
> > >>>[ 183.211216] [<ffffffff81632ede>] do_page_fault+0xe/0x10
> > >>>[ 183.293705] [<ffffffff8162f518>] page_fault+0x28/0x30
> > >>Johannes, this looks like the thp migration memcg hookery gone bad,
> > >>could you have a look at this?
> > >Oops. Here is an incremental fix, feel free to fold it into #31.
> > Hello Johannes,
> >
> > maybe I don't think the below patch completely fix this issue, as I
> > found a new error(maybe similar with this):
> >
> > [88099.923724] ------------[ cut here ]------------
> > [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> > [88099.924036] invalid opcode: 0000 [#1] SMP
> > [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> > amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> > joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> > megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> > drm_kms_helper ttm drm i2c_core
> > [88099.924036] CPU 7
> > [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> > Dell Inc. PowerEdge 6950/0WN213
> > [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> > mem_cgroup_update_lru_size+0x27/0x30
>
> Thanks a lot for your testing efforts, I really appreciate it.
>
> I'm looking into it, but I don't expect power to get back for several
> days where I live, so it's hard to reproduce it locally.
>
> But that looks like an LRU accounting imbalance that I wasn't able to
> tie to this patch yet. Do you see weird numbers for the lru counters
> in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as
> well.

Sorry, I didn't get very far with it tonight.

Almost certain to be a page which was added to lru while it looked like
a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
lru here, it's likely to be the page in question, but not necessarily.

There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
would help if we could focus on the one which is giving the trouble,
but I don't know which that is. Zhouping, if you can, please would
you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
is the next function, and post or mail privately just that disassembly.
That should be good to identify which of the put_page()s is involved.

do_huge_pmd_numa_page() does look a bit worrying, but I've not pinned
the misaccounting seen to the aspects which have worried me so far.
Where is a check for page_mapcount(page) being 1? And surely it's
unsafe to be migrating the page when it was found !PageLRU? It's
quite likely to be sitting in a pagevec or on a local list somewhere,
about to be added to lru at any moment.

Hugh

2012-10-31 13:13:36

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> On Tue, 30 Oct 2012, Johannes Weiner wrote:
>
> [88099.923724] ------------[ cut here ]------------
> [88099.924036] kernel BUG at mm/memcontrol.c:1134!
> [88099.924036] invalid opcode: 0000 [#1] SMP
> [88099.924036] Modules linked in: lockd sunrpc kvm_amd kvm
> amd64_edac_mod edac_core ses enclosure serio_raw bnx2 pcspkr shpchp
> joydev i2c_piix4 edac_mce_amd k8temp dcdbas ata_generic pata_acpi
> megaraid_sas pata_serverworks usb_storage radeon i2c_algo_bit
> drm_kms_helper ttm drm i2c_core
> [88099.924036] CPU 7
> [88099.924036] Pid: 3441, comm: stress Not tainted 3.7.0-rc2Jons+ #3
> Dell Inc. PowerEdge 6950/0WN213
> [88099.924036] RIP: 0010:[<ffffffff81188e97>] [<ffffffff81188e97>]
> mem_cgroup_update_lru_size+0x27/0x30
>> Thanks a lot for your testing efforts, I really appreciate it.
>>
>> I'm looking into it, but I don't expect power to get back for several
>> days where I live, so it's hard to reproduce it locally.
>>
>> But that looks like an LRU accounting imbalance that I wasn't able to
>> tie to this patch yet. Do you see weird numbers for the lru counters
>> in /proc/vmstat even without this memory cgroup patch? Ccing Hugh as
>> well.
> Sorry, I didn't get very far with it tonight.
>
> Almost certain to be a page which was added to lru while it looked like
> a 4k page, but taken off lru as a 2M page: we are taking a 2M page off
> lru here, it's likely to be the page in question, but not necessarily.
>
> There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> would help if we could focus on the one which is giving the trouble,
> but I don't know which that is. Zhouping, if you can, please would
> you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> is the next function, and post or mail privately just that disassembly.
> That should be good to identify which of the put_page()s is involved.

Hugh, I didn't find the next function, as I can't find any lines that
match "do_huge_pmd_numa_page". Is there any other method?
Also, I tried to use kdump to dump a vmcore file, but unluckily kdump
didn't work well; if you think a vmcore file would be useful, I can try
it again and provide more info.

Thanks,
Zhouping

2012-10-31 17:31:12

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Wed, 31 Oct 2012, Zhouping Liu wrote:
> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> >
> > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > would help if we could focus on the one which is giving the trouble,
> > but I don't know which that is. Zhouping, if you can, please would
> > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > is the next function, and post or mail privately just that disassembly.
> > That should be good to identify which of the put_page()s is involved.
>
> Hugh, I didn't find the next function, as I can't find any words that matched
> "do_huge_pmd_numa_page".
> is there any other methods?

Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
unless I've made a typo but am blind to it.

Were you applying objdump to the vmlinux which gave you the
BUG at mm/memcontrol.c:1134! ?

Maybe just do "objdump -ld mm/huge_memory.o >notsobigfile"
and mail me an attachment of the notsobigfile.

I did try building your config here last night, but ran out of disk
space on this partition, and it was already clear that my gcc version
differs from yours, so not quite matching.

> also I tried to use kdump to dump vmcore file,
> but unluckily kdump didn't
> work well, if you think it useful to dump vmcore file, I can try it again and
> provide more info.

It would take me a while to get up to speed on using that,
I'd prefer to start with just the objdump of huge_memory.o.

I forgot last night to say that I did try stress (but not on a kernel
of your config), but didn't see the BUG: I expect there are too many
differences in our environments, and I'd have to tweak things one way
or another to get it to happen - probably a waste of time.

Thanks,
Hugh

2012-11-01 13:41:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Wed, 31 Oct 2012, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Zhouping Liu wrote:
> > On 10/31/2012 03:26 PM, Hugh Dickins wrote:
> > >
> > > There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
> > > would help if we could focus on the one which is giving the trouble,
> > > but I don't know which that is. Zhouping, if you can, please would
> > > you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
> > > from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
> > > is the next function, and post or mail privately just that disassembly.
> > > That should be good to identify which of the put_page()s is involved.
> >
> > Hugh, I didn't find the next function, as I can't find any words that matched
> > "do_huge_pmd_numa_page".
> > is there any other methods?
>
> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
> unless I've made a typo but am blind to it.
>
> Were you applying objdump to the vmlinux which gave you the
> BUG at mm/memcontrol.c:1134! ?

Thanks for the further info you then sent privately: I have not made any
more effort to reproduce the issue, but your objdump did tell me that the
put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
"Drop the local reference", just before successful return after migration.

I didn't really get the inspiration I'd hoped for out of knowing that,
but it did make me wonder whether you're suffering from one of the issues
I already mentioned, and I can now see a way in which it might cause
the mm/memcontrol.c:1134 BUG:-

migrate_page_copy() does TestClearPageActive on the source page:
so given the unsafe way in which do_huge_pmd_numa_page() was proceeding
with a !PageLRU page, it's quite possible that the page was sitting in
a pagevec, and added to the active lru (so added to the lru_size of the
active lru), but our final put_page removes it from lru, active flag has
been cleared, so we subtract it from the lru_size of the inactive lru -
that could indeed make it go negative and trigger the BUG.
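
For reference, the check that fires at mm/memcontrol.c:1134 is, roughly
(paraphrased from memory, so treat it as a sketch rather than the exact
3.7-rc2 source), a sanity test that a per-lru page count never goes negative:

void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
                                int nr_pages)
{
        struct mem_cgroup_per_zone *mz;
        unsigned long *lru_size;

        if (mem_cgroup_disabled())
                return;

        mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
        lru_size = mz->lru_size + lru;
        *lru_size += nr_pages;                  /* negative on removal */
        VM_BUG_ON((long)(*lru_size) < 0);       /* the BUG in the trace above */
}

So adding HPAGE_PMD_NR pages to the active count and later subtracting
HPAGE_PMD_NR from the inactive count is enough to trip it.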

Here's a patch fixing and tidying up that and a few other things there.
But I'm not signing it off yet, partly because I've barely tested it
(quite probably I didn't even have any numa pmd migration happening
at all), and partly because just a moment ago I ran across this
instructive comment in __collapse_huge_page_isolate():
/* cannot use mapcount: can't collapse if there's a gup pin */
if (page_count(page) != 1) {

Hmm, yes, below I've added the page_mapcount() check I proposed to
do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
need a page_count() check (for 2?) to guard against get_user_pages()?
I suspect we do, but then do we have enough locking to stabilize such
a check? Probably, but...
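
To spell out the kind of combined check I mean (a sketch only - both the
count threshold and whether we hold enough locks for it to be stable are
exactly what is in doubt above):

        /*
         * Sketch, not tested: at this point we hold one reference from the
         * pmd mapping (page_mapcount == 1) and one from our local
         * get_page(), so anything beyond page_count == 2 would suggest a
         * gup pin.  "out_keep_page" is a hypothetical label meaning
         * "unlock, drop our reference and leave the page where it is".
         */
        if (page_mapcount(page) != 1 || page_count(page) != 2)
                goto out_keep_page;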

This will take more time, and I doubt get_user_pages() is an issue in
your testing, so please would you try the patch below, to see if it
does fix the BUGs you are seeing? Thanks a lot.

Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
---

mm/huge_memory.c | 24 +++++++++---------------
1 file changed, 9 insertions(+), 15 deletions(-)

--- 3.7-rc2+schednuma+johannes/mm/huge_memory.c 2012-11-01 04:10:43.812155671 -0700
+++ linux/mm/huge_memory.c 2012-11-01 05:52:19.512153771 -0700
@@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
        struct mem_cgroup *memcg = NULL;
        struct page *new_page = NULL;
        struct page *page = NULL;
-       int node, lru;
+       int node = -1;

        spin_lock(&mm->page_table_lock);
        if (unlikely(!pmd_same(*pmd, entry)))
@@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
                VM_BUG_ON(!PageCompound(page) || !PageHead(page));

                get_page(page);
-               node = mpol_misplaced(page, vma, haddr);
+               if (page_mapcount(page) == 1)   /* Only do exclusively mapped */
+                       node = mpol_misplaced(page, vma, haddr);
                if (node != -1)
                        goto migrate;
        }
@@ -801,13 +802,11 @@ migrate:
        if (!new_page)
                goto alloc_fail;

-       lru = PageLRU(page);
-
-       if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
+       if (isolate_lru_page(page))     /* Does an implicit get_page() */
                goto alloc_fail;

-       if (!trylock_page(new_page))
-               BUG();
+       __set_page_locked(new_page);
+       SetPageSwapBacked(new_page);

        /* anon mapping, we can simply copy page->mapping to the new page: */
        new_page->mapping = page->mapping;
@@ -820,8 +819,6 @@ migrate:
        spin_lock(&mm->page_table_lock);
        if (unlikely(!pmd_same(*pmd, entry))) {
                spin_unlock(&mm->page_table_lock);
-               if (lru)
-                       putback_lru_page(page);

                unlock_page(new_page);
                ClearPageActive(new_page);      /* Set by migrate_page_copy() */
@@ -829,6 +826,7 @@ migrate:
                put_page(new_page);             /* Free it */

                unlock_page(page);
+               putback_lru_page(page);
                put_page(page);                 /* Drop the local reference */

                return;
@@ -859,16 +857,12 @@ migrate:
        mem_cgroup_end_migration(memcg, page, new_page, true);
        spin_unlock(&mm->page_table_lock);

-       put_page(page);                 /* Drop the rmap reference */
-
        task_numa_fault(node, HPAGE_PMD_NR);

-       if (lru)
-               put_page(page);         /* drop the LRU isolation reference */
-
        unlock_page(new_page);
-
        unlock_page(page);
+       put_page(page);                 /* Drop the rmap reference */
+       put_page(page);                 /* Drop the LRU isolation reference */
        put_page(page);                 /* Drop the local reference */

        return;

2012-11-02 03:21:38

by Zhouping Liu

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> On Wed, 31 Oct 2012, Hugh Dickins wrote:
>> On Wed, 31 Oct 2012, Zhouping Liu wrote:
>>> On 10/31/2012 03:26 PM, Hugh Dickins wrote:
>>>> There's quite a few put_page()s in do_huge_pmd_numa_page(), and it
>>>> would help if we could focus on the one which is giving the trouble,
>>>> but I don't know which that is. Zhouping, if you can, please would
>>>> you do an "objdump -ld vmlinux >bigfile" of your kernel, then extract
>>>> from bigfile just the lines from "<do_huge_pmd_numa_page>:" to whatever
>>>> is the next function, and post or mail privately just that disassembly.
>>>> That should be good to identify which of the put_page()s is involved.
>>> Hugh, I didn't find the next function, as I can't find any words that matched
>>> "do_huge_pmd_numa_page".
>>> is there any other methods?
>> Hmm, do_huge_pmd_numa_page does appear in your stacktrace,
>> unless I've made a typo but am blind to it.
>>
>> Were you applying objdump to the vmlinux which gave you the
>> BUG at mm/memcontrol.c:1134! ?
> Thanks for the further info you then sent privately: I have not made any
> more effort to reproduce the issue, but your objdump did tell me that the
> put_page hitting the problem is the one on line 872 of mm/huge_memory.c,
> "Drop the local reference", just before successful return after migration.
>
> I didn't really get the inspiration I'd hoped for out of knowing that,
> but it did make wonder whether you're suffering from one of the issues
> I already mentioned, and I can now see a way in which it might cause
> the mm/memcontrol.c:1134 BUG:-
>
> migrate_page_copy() does TestClearPageActive on the source page:
> so given the unsafe way in which do_huge_pmd_numa_page() was proceeding
> with a !PageLRU page, it's quite possible that the page was sitting in
> a pagevec, and added to the active lru (so added to the lru_size of the
> active lru), but our final put_page removes it from lru, active flag has
> been cleared, so we subtract it from the lru_size of the inactive lru -
> that could indeed make it go negative and trigger the BUG.
>
> Here's a patch fixing and tidying up that and a few other things there.
> But I'm not signing it off yet, partly because I've barely tested it
> (quite probably I didn't even have any numa pmd migration happening
> at all), and partly because just a moment ago I ran across this
> instructive comment in __collapse_huge_page_isolate():
> /* cannot use mapcount: can't collapse if there's a gup pin */
> if (page_count(page) != 1) {
>
> Hmm, yes, below I've added the page_mapcount() check I proposed to
> do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
> need a page_count() check (for 2?) to guard against get_user_pages()?
> I suspect we do, but then do we have enough locking to stabilize such
> a check? Probably, but...
>
> This will take more time, and I doubt get_user_pages() is an issue in
> your testing, so please would you try the patch below, to see if it
> does fix the BUGs you are seeing? Thanks a lot.

Hugh, I have tested the patch for 5 more hours, and the issue can't be
reproduced anymore, so I think it has fixed the issue. Thank you :)

Zhouping

>
> Not-Yet-Signed-off-by: Hugh Dickins <[email protected]>
> ---
>
> mm/huge_memory.c | 24 +++++++++---------------
> 1 file changed, 9 insertions(+), 15 deletions(-)
>
> --- 3.7-rc2+schednuma+johannes/mm/huge_memory.c 2012-11-01 04:10:43.812155671 -0700
> +++ linux/mm/huge_memory.c 2012-11-01 05:52:19.512153771 -0700
> @@ -745,7 +745,7 @@ void do_huge_pmd_numa_page(struct mm_str
> struct mem_cgroup *memcg = NULL;
> struct page *new_page = NULL;
> struct page *page = NULL;
> - int node, lru;
> + int node = -1;
>
> spin_lock(&mm->page_table_lock);
> if (unlikely(!pmd_same(*pmd, entry)))
> @@ -762,7 +762,8 @@ void do_huge_pmd_numa_page(struct mm_str
> VM_BUG_ON(!PageCompound(page) || !PageHead(page));
>
> get_page(page);
> - node = mpol_misplaced(page, vma, haddr);
> + if (page_mapcount(page) == 1) /* Only do exclusively mapped */
> + node = mpol_misplaced(page, vma, haddr);
> if (node != -1)
> goto migrate;
> }
> @@ -801,13 +802,11 @@ migrate:
> if (!new_page)
> goto alloc_fail;
>
> - lru = PageLRU(page);
> -
> - if (lru && isolate_lru_page(page)) /* does an implicit get_page() */
> + if (isolate_lru_page(page)) /* Does an implicit get_page() */
> goto alloc_fail;
>
> - if (!trylock_page(new_page))
> - BUG();
> + __set_page_locked(new_page);
> + SetPageSwapBacked(new_page);
>
> /* anon mapping, we can simply copy page->mapping to the new page: */
> new_page->mapping = page->mapping;
> @@ -820,8 +819,6 @@ migrate:
> spin_lock(&mm->page_table_lock);
> if (unlikely(!pmd_same(*pmd, entry))) {
> spin_unlock(&mm->page_table_lock);
> - if (lru)
> - putback_lru_page(page);
>
> unlock_page(new_page);
> ClearPageActive(new_page); /* Set by migrate_page_copy() */
> @@ -829,6 +826,7 @@ migrate:
> put_page(new_page); /* Free it */
>
> unlock_page(page);
> + putback_lru_page(page);
> put_page(page); /* Drop the local reference */
>
> return;
> @@ -859,16 +857,12 @@ migrate:
> mem_cgroup_end_migration(memcg, page, new_page, true);
> spin_unlock(&mm->page_table_lock);
>
> - put_page(page); /* Drop the rmap reference */
> -
> task_numa_fault(node, HPAGE_PMD_NR);
>
> - if (lru)
> - put_page(page); /* drop the LRU isolation reference */
> -
> unlock_page(new_page);
> -
> unlock_page(page);
> + put_page(page); /* Drop the rmap reference */
> + put_page(page); /* Drop the LRU isolation reference */
> put_page(page); /* Drop the local reference */
>
> return;
>

2012-11-02 23:06:52

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Fri, 2 Nov 2012, Zhouping Liu wrote:
> On 11/01/2012 09:41 PM, Hugh Dickins wrote:
> >
> > Here's a patch fixing and tidying up that and a few other things there.
> > But I'm not signing it off yet, partly because I've barely tested it
> > (quite probably I didn't even have any numa pmd migration happening
> > at all), and partly because just a moment ago I ran across this
> > instructive comment in __collapse_huge_page_isolate():
> > /* cannot use mapcount: can't collapse if there's a gup pin */
> > if (page_count(page) != 1) {
> >
> > Hmm, yes, below I've added the page_mapcount() check I proposed to
> > do_huge_pmd_numa_page(), but is even that safe enough? Do we actually
> > need a page_count() check (for 2?) to guard against get_user_pages()?
> > I suspect we do, but then do we have enough locking to stabilize such
> > a check? Probably, but...
> >
> > This will take more time, and I doubt get_user_pages() is an issue in
> > your testing, so please would you try the patch below, to see if it
> > does fix the BUGs you are seeing? Thanks a lot.
>
> Hugh, I have tested the patch for 5 more hours,
> the issue can't be reproduced again,
> so I think it has fixed the issue, thank you :)

Thanks a lot for testing and reporting back, that's good news.

However, I've meanwhile become convinced that more fixes are needed here,
to be safe against get_user_pages() (including get_user_pages_fast());
to get the Mlocked count right; and to recover correctly when !pmd_same
with an Unevictable page.

Won't now have time to update the patch today,
but these additional fixes shouldn't hold up your testing.

Hugh

2012-11-03 11:04:09

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

>
> In reality, this report is larger but I chopped it down a bit for
> brevity. autonuma beats schednuma *heavily* on this benchmark both in
> terms of average operations per numa node and overall throughput.
>
> SPECJBB PEAKS
> 3.7.0 3.7.0 3.7.0
> rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
> Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
> Expctd Peak Bops 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 8.00 ( 14.29%)
> Actual Peak Bops 550747.00 ( 0.00%) 646124.00 ( 17.32%) 560635.00 ( 1.80%)

It is an impressive report!

Would you like to share what JVM and options you are using in the
testing, and what kind of platform it is based on?

--
Thanks
Alex

2012-11-03 12:22:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >
> > In reality, this report is larger but I chopped it down a bit for
> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> > terms of average operations per numa node and overall throughput.
> >
> > SPECJBB PEAKS
> > 3.7.0 3.7.0 3.7.0
> > rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
> > Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
> > Expctd Peak Bops 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> > Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 8.00 ( 14.29%)
> > Actual Peak Bops 550747.00 ( 0.00%) 646124.00 ( 17.32%) 560635.00 ( 1.80%)
>
> It is impressive report!
>
> Could you like to share the what JVM and options are you using in the
> testing, and based on which kinds of platform?
>

Oracle JVM version "1.7.0_07"
Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)

4 JVMs were run, one for each node.

JVM switch specified was -Xmx12901m so that the four instances together
would consume roughly 80% of memory overall (4 x 12901M is about 50G of
the 64G).

Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
total with HT enabled.

--
Mel Gorman
SUSE Labs

2012-11-05 17:07:11

by Srikar Dronamraju

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

Hey Peter,


Here are results on a 2-node and an 8-node machine while running the
autonuma benchmark.
----------------------------------------------------------------------------
On 2 node, 12 core 24GB
----------------------------------------------------------------------------
KernelVersion: 3.7.0-rc3
Testcase: Min Max Avg
numa01: 121.23 122.43 121.53
numa01_HARD_BIND: 80.90 81.07 80.96
numa01_INVERSE_BIND: 145.91 146.06 145.97
numa01_THREAD_ALLOC: 395.81 398.30 397.47
numa01_THREAD_ALLOC_HARD_BIND: 264.09 264.27 264.18
numa01_THREAD_ALLOC_INVERSE_BIND: 476.36 476.65 476.53
numa02: 53.11 53.19 53.15
numa02_HARD_BIND: 35.20 35.29 35.25
numa02_INVERSE_BIND: 63.52 63.55 63.54
numa02_SMT: 60.28 62.00 61.33
numa02_SMT_HARD_BIND: 42.63 43.61 43.22
numa02_SMT_INVERSE_BIND: 76.27 78.06 77.31

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches)
Testcase: Min Max Avg %Change
numa01: 121.28 121.71 121.47 0.05%
numa01_HARD_BIND: 80.89 81.01 80.96 0.00%
numa01_INVERSE_BIND: 145.87 146.04 145.96 0.01%
numa01_THREAD_ALLOC: 398.07 400.27 398.90 -0.36%
numa01_THREAD_ALLOC_HARD_BIND: 264.02 264.21 264.14 0.02%
numa01_THREAD_ALLOC_INVERSE_BIND: 476.13 476.62 476.41 0.03%
numa02: 52.97 53.25 53.13 0.04%
numa02_HARD_BIND: 35.21 35.28 35.24 0.03%
numa02_INVERSE_BIND: 63.51 63.54 63.53 0.02%
numa02_SMT: 61.35 62.46 61.97 -1.03%
numa02_SMT_HARD_BIND: 42.89 43.85 43.22 0.00%
numa02_SMT_INVERSE_BIND: 76.53 77.68 77.08 0.30%

----------------------------------------------------------------------------

KernelVersion: 3.7.0-rc3(with HT enabled )
Testcase: Min Max Avg
numa01: 242.58 244.39 243.68
numa01_HARD_BIND: 169.36 169.40 169.38
numa01_INVERSE_BIND: 299.69 299.73 299.71
numa01_THREAD_ALLOC: 399.86 404.10 401.50
numa01_THREAD_ALLOC_HARD_BIND: 278.72 278.77 278.75
numa01_THREAD_ALLOC_INVERSE_BIND: 493.46 493.59 493.54
numa02: 53.00 53.33 53.19
numa02_HARD_BIND: 36.77 36.88 36.82
numa02_INVERSE_BIND: 66.07 66.10 66.09
numa02_SMT: 53.23 53.51 53.35
numa02_SMT_HARD_BIND: 35.19 35.27 35.24
numa02_SMT_INVERSE_BIND: 63.50 63.54 63.52

KernelVersion: numasched (i.e 3.7.0-rc3 + your patches) (with HT enabled)
Testcase: Min Max Avg %Change
numa01: 242.68 244.59 243.53 0.06%
numa01_HARD_BIND: 169.37 169.42 169.40 -0.01%
numa01_INVERSE_BIND: 299.83 299.96 299.91 -0.07%
numa01_THREAD_ALLOC: 399.53 403.13 401.62 -0.03%
numa01_THREAD_ALLOC_HARD_BIND: 278.78 278.80 278.79 -0.01%
numa01_THREAD_ALLOC_INVERSE_BIND: 493.63 493.90 493.78 -0.05%
numa02: 53.06 53.42 53.22 -0.06%
numa02_HARD_BIND: 36.78 36.87 36.82 0.00%
numa02_INVERSE_BIND: 66.09 66.10 66.10 -0.02%
numa02_SMT: 53.34 53.55 53.42 -0.13%
numa02_SMT_HARD_BIND: 35.22 35.29 35.25 -0.03%
numa02_SMT_INVERSE_BIND: 63.50 63.58 63.53 -0.02%
----------------------------------------------------------------------------



On 8 node, 64 core, 320 GB
----------------------------------------------------------------------------

KernelVersion: 3.7.0-rc3()
Testcase: Min Max Avg
numa01: 1550.56 1596.03 1574.24
numa01_HARD_BIND: 915.25 2540.64 1392.42
numa01_INVERSE_BIND: 2964.66 3716.33 3149.10
numa01_THREAD_ALLOC: 922.99 1003.31 972.99
numa01_THREAD_ALLOC_HARD_BIND: 579.54 1266.65 896.75
numa01_THREAD_ALLOC_INVERSE_BIND: 1794.51 2057.16 1922.86
numa02: 126.22 133.01 130.91
numa02_HARD_BIND: 25.85 26.25 26.06
numa02_INVERSE_BIND: 341.38 350.35 345.82
numa02_SMT: 153.06 175.41 163.47
numa02_SMT_HARD_BIND: 27.10 212.39 114.37
numa02_SMT_INVERSE_BIND: 285.70 1542.83 540.62

KernelVersion: numasched()
Testcase: Min Max Avg %Change
numa01: 1542.69 1601.81 1569.68 0.29%
numa01_HARD_BIND: 867.35 1094.00 966.05 44.14%
numa01_INVERSE_BIND: 2835.71 3030.36 2966.99 6.14%
numa01_THREAD_ALLOC: 326.35 379.43 347.01 180.39%
numa01_THREAD_ALLOC_HARD_BIND: 611.55 720.09 657.06 36.48%
numa01_THREAD_ALLOC_INVERSE_BIND: 1839.60 1999.58 1919.36 0.18%
numa02: 35.35 55.09 40.81 220.78%
numa02_HARD_BIND: 26.58 26.81 26.68 -2.32%
numa02_INVERSE_BIND: 341.86 355.36 347.68 -0.53%
numa02_SMT: 37.65 48.65 43.08 279.46%
numa02_SMT_HARD_BIND: 28.29 157.66 84.29 35.69%
numa02_SMT_INVERSE_BIND: 313.07 346.72 333.69 62.01%
----------------------------------------------------------------------------

--
Thanks and Regards
Srikar

2012-11-09 08:49:53

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On 10/30/2012 08:20 AM, Mel Gorman wrote:
> On Thu, Oct 25, 2012 at 02:16:17PM +0200, Peter Zijlstra wrote:
>> Hi all,
>>
>> Here's a re-post of the NUMA scheduling and migration improvement
>> patches that we are working on. These include techniques from
>> AutoNUMA and the sched/numa tree and form a unified basis - it
>> has got all the bits that look good and mergeable.
>>
>
> Thanks for the repost. I have not even started a review yet as I was
> travelling and just online today. It will be another day or two before I can
> start but I was at least able to do a comparison test between autonuma and
> schednuma today to see which actually performs the best. Even without the
> review I was able to stick on similar vmstats as was applied to autonuma
> to give a rough estimate of the relative overhead of both implementations.

Peter, Ingo,

do you have any comments on the performance measurements
by Mel?

Any ideas on how to fix sched/numa or numa/core?

At this point, I suspect the easiest way forward might be
to merge the basic infrastructure from Mel's combined
tree (in -mm? in -tip?), so we can experiment with different
NUMA placement policies on top.

That way we can do apples to apples comparison of the
policies, and figure out what works best, and why.

2012-11-10 02:47:43

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman <[email protected]> wrote:
> On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
>> >
>> > In reality, this report is larger but I chopped it down a bit for
>> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
>> > terms of average operations per numa node and overall throughput.
>> >
>> > SPECJBB PEAKS
>> > 3.7.0 3.7.0 3.7.0
>> > rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
>> > Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
>> > Expctd Peak Bops 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
>> > Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 8.00 ( 14.29%)
>> > Actual Peak Bops 550747.00 ( 0.00%) 646124.00 ( 17.32%) 560635.00 ( 1.80%)
>>
>> It is impressive report!
>>
>> Could you like to share the what JVM and options are you using in the
>> testing, and based on which kinds of platform?
>>
>
> Oracle JVM version "1.7.0_07"
> Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
>
> 4 JVMs were run, one for each node.
>
> JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> memory overall.
>
> Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> total with HT enabled.
>

Thanks for configuration sharing!

I used JRockit and OpenJDK with hugepages, plus pinning the JVM to a
CPU socket. With a previous sched numa version I had found a 20% drop
with JRockit in our configuration, but with this version no clear
regression is found, and no benefit either.

Seems we need to expand the testing configurations. :)
--
Thanks
Alex

2012-11-12 09:50:16

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/31] numa/core patches

On Sat, Nov 10, 2012 at 10:47:41AM +0800, Alex Shi wrote:
> On Sat, Nov 3, 2012 at 8:21 PM, Mel Gorman <[email protected]> wrote:
> > On Sat, Nov 03, 2012 at 07:04:04PM +0800, Alex Shi wrote:
> >> >
> >> > In reality, this report is larger but I chopped it down a bit for
> >> > brevity. autonuma beats schednuma *heavily* on this benchmark both in
> >> > terms of average operations per numa node and overall throughput.
> >> >
> >> > SPECJBB PEAKS
> >> > 3.7.0 3.7.0 3.7.0
> >> > rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r3
> >> > Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
> >> > Expctd Peak Bops 442225.00 ( 0.00%) 596039.00 ( 34.78%) 555342.00 ( 25.58%)
> >> > Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 8.00 ( 14.29%)
> >> > Actual Peak Bops 550747.00 ( 0.00%) 646124.00 ( 17.32%) 560635.00 ( 1.80%)
> >>
> >> It is impressive report!
> >>
> >> Could you like to share the what JVM and options are you using in the
> >> testing, and based on which kinds of platform?
> >>
> >
> > Oracle JVM version "1.7.0_07"
> > Java(TM) SE Runtime Environment (build 1.7.0_07-b10)
> > Java HotSpot(TM) 64-Bit Server VM (build 23.3-b01, mixed mode)
> >
> > 4 JVMs were run, one for each node.
> >
> > JVM switch specified was -Xmx12901m so it would consume roughly 80% of
> > memory overall.
> >
> > Machine is x86-64 4-node, 64G of RAM, CPUs are E7-4807, 48 cores in
> > total with HT enabled.
> >
>
> Thanks for configuration sharing!
>
> I used Jrockit and openjdk with Hugepage plus pin JVM to cpu socket.

If you are using hugepages then automatic numa is not migrating those
pages. If you are pinning the JVMs to the socket then automatic numa
balancing is unnecessary as they are already on the correct node.

> In previous sched numa version, I had found 20% dropping with Jrockit
> with our configuration. but for this version. No clear regression
> found. also has no benefit found.
>

You are only checking for regressions with your configuration, which is
important because it showed that schednuma introduced only overhead in
an already-optimised NUMA configuration.

In your case, you will see little or no benefit with any automatic NUMA
balancing implementation, as the most important pages neither can migrate
nor need to.
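
As an aside, explicit pinning of that sort boils down to something like
the following with libnuma (a sketch for illustration, not how the JVM
was actually launched in either of our setups):

#include <numa.h>

/*
 * Illustrative sketch: bind the calling process's CPUs and memory
 * allocations to a single node.  A workload set up like this leaves
 * nothing for automatic NUMA balancing to migrate.
 */
static int bind_to_node(int node)
{
        struct bitmask *nodes;

        if (numa_available() < 0)
                return -1;

        nodes = numa_allocate_nodemask();
        numa_bitmask_setbit(nodes, node);
        numa_bind(nodes);       /* sets both CPU affinity and memory policy */
        numa_free_nodemask(nodes);
        return 0;
}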

--
Mel Gorman
SUSE Labs