2010-12-15 05:17:28

by Andrea Arcangeli

Subject: Transparent Hugepage Support #33

Some relevant users of the project:

KVM Virtualization
GCC (kernel build included, requires a few-line patch to enable)
JVM
VMware Workstation
HPC

It would be great if it could go in -mm.

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=blob;f=Documentation/vm/transhuge.txt
http://www.linux-kvm.org/wiki/images/9/9e/2010-forum-thp.pdf

http://git.kernel.org/?p=linux/kernel/git/andrea/aa.git;a=shortlog

first: git clone git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
or first: git clone --reference linux-2.6 git://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git
later: git fetch; git checkout -f origin/master

The tree is rebased and git pull won't work.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc5/transparent_hugepage-33/
http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.37-rc5/transparent_hugepage-33.gz

Diff #32 -> #33:

b/THP-disable-on-small-systems | 4

Improved header.

b/clear_copy_huge_page | 60 +--

Update after upstream changes.

b/compaction-add-trace-events | 179 +++++++++
b/compaction-instead-of-lumpy | 415 ++++++++++++++++++++++
b/compaction-lumpy_mode | 169 +++++++++
b/compaction-migrate-async | 388 +++++++++++++++++++++
b/compaction-migrate_pages-api-bool | 133 +++++++
b/compaction-movable-pageblocks | 56 +++
b/compaction-reclaim_mode | 248 +++++++++++++
b/zone_watermark_ok_safe | 372 ++++++++++++++++++++

Mel's lumpy compaction (disables lumpy reclaim and uses compaction instead
when CONFIG_COMPACTION=y) allows proper runtime when there are
frequent hugepage allocations, as with THP on. Picked from the mmotm
broken-out patchset to allow easy -mm integration and to test it out
in combination with THP.

b/compaction-all-orders | 23 +
b/compaction-kswapd | 104 +++--

Split the compaction-all-orders part off compaction-kswapd.

b/compound_get_put | 39 +-

Cleanups.

b/compound_get_put_fix | 28 +

While reading the code I noticed what I think was a super tiny race (never
reproduced) in the put_page of a tail page: split_huge_page could run on
the head page after put_page releases the compound lock but before
put_page_testzero is called. Only after put_page_testzero returns true
are we sure split_huge_page can't run from under us anymore, as it
requires a reference on the head page to run. Rechecking PageHead is
enough to fix it.

b/compound_lock | 13

Change the API to return flags instead of void.

b/compound_trans_order | 120 ++++++

Be safe while reading compound_order on transparent hugepages that may
be under split_huge_page.
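
A minimal userspace sketch of the idea, assuming the safety comes from
taking a per-page lock around the read so that a concurrent split cannot
change the order halfway through. The toy_* names and the struct layout are
hypothetical stand-ins, not kernel API:

```c
#include <assert.h>
#include <stdatomic.h>

/* Toy compound page: a per-page lock guarding the order field. */
struct toy_compound {
	atomic_flag lock;	/* models the compound lock */
	int order;		/* models compound_order(); 0 after a split */
};

static void toy_lock(struct toy_compound *p)
{
	/* Spin until the flag was previously clear, i.e. we own the lock. */
	while (atomic_flag_test_and_set_explicit(&p->lock, memory_order_acquire))
		;
}

static void toy_unlock(struct toy_compound *p)
{
	atomic_flag_clear_explicit(&p->lock, memory_order_release);
}

/* Reader: sample the order while a concurrent split is excluded. */
static int toy_trans_order(struct toy_compound *p)
{
	int order;

	assert(p != 0);
	toy_lock(p);
	order = p->order;
	toy_unlock(p);
	return order;
}

/* Writer: a split resets the order under the same lock. */
static void toy_split(struct toy_compound *p)
{
	toy_lock(p);
	p->order = 0;
	toy_unlock(p);
}
```

A caller sees either the order from before the split or 0 from after it,
never a value read while the split is in progress.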

b/gfp_no_kswapd | 17

Define ___GFP_NO_KSWAPD.

b/khugepaged-mmap_sem | 113 ++++++

Some users reported deadlocks after days of load with pvfs.

Allocate memory inside mmap_sem read mode (no longer inside mmap_sem
write mode) within khugepaged collapse_huge_page to satisfy certain
userland filesystems that may benefit from THP (so they don't need
to use MADV_NOHUGEPAGE). Not sure if this bugfix was really required
from a theoretical standpoint (as far as the deadlock is concerned
this may actually hide bugs), but it makes the code more scalable, so
it's an improvement either way and a no-brainer.

Still investigating the page lock usage in khugepaged vs fuse.

b/ksm-swapcache | 64 ---

Use Hugh's equivalent one liner fix.

b/kvm_transparent_hugepage | 38 +-

Adjust for hva_to_pfn interface change.

b/madv_nohugepage | 157 ++++++++
b/madv_nohugepage_define | 64 +++

Add MADV_NOHUGEPAGE to disable THP on low priority vmas (needed
especially now that KSM won't scan inside THP, later it will be less
important but maybe still useful to leave hugepages available for
higher priority virtual machines).
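
From userland, opting a region out of THP could look like the sketch below.
It assumes the flag is passed through madvise(2) like the other MADV_* hints,
and that headers lacking the definition can fall back to the asm-generic
value 15 (some architectures use a different number). map_low_priority is a
hypothetical helper name:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: asm-generic value; older libc headers may not define it. */
#ifndef MADV_NOHUGEPAGE
#define MADV_NOHUGEPAGE 15
#endif

/*
 * Map len bytes of anonymous memory and ask the kernel not to back the
 * range with transparent hugepages.
 *
 * Returns the mapping, or NULL if mmap() fails. *advice_rc receives the
 * madvise() result: 0 on success, -1 with errno set otherwise (typically
 * EINVAL on a kernel that doesn't know the flag).
 */
static void *map_low_priority(size_t len, int *advice_rc)
{
	void *p;

	assert(len > 0 && advice_rc != NULL);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	*advice_rc = madvise(p, len, MADV_NOHUGEPAGE);
	return p;
}
```

The mapping stays usable even if the hint is rejected, so a caller can treat
a failing madvise() as non-fatal.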

b/memcg_compound | 71 ++-

Don't batch hugepage releasing in __do_uncharge.

b/memcg_huge_memory | 12

Optimize with mem_cgroup_uncharge_start/stop().

b/memory-failure-thp-vs-hugetlbfs | 44 ++

The new hugetlbfs memory-failure code merged upstream collided with
THP (reported by some users running
mce-test.git/hwpoison/run-huge-test.sh on aa.git).

Use PageHuge to differentiate between THP pages and hugetlbfs pages in
common paths that can run into either of the two types. PageTransHuge
will still return 1 for hugetlbfs pages because PageTransHuge must
only be used in the core VM paths where hugetlbfs pages can't be
processed. In any place where hugetlbfs shares the common paths with
the core VM code, PageHuge should be used to differentiate the
two. Usually PageHuge is only needed in THP context in slow paths
(memory-failure is not just a slow path but an error path), so it's
ok, and we don't want to slow down PageTransHuge considering PageHuge
is already there for this.

b/pagetranscompound | 30 -

Cleanups.

b/pmd_mangling_generic | 488 +++++++++++++++++++--------

Cleanups to save icache by moving slow common methods to
mm/pgtable-generic.c.

b/pmd_mangling_x86 | 41 --

Update header and undo a noop change.

b/pmd_paravirt_ops | 12

Fix x86 32bit build with PAE off and paravirt on.

b/pmd_trans | 13

macro -> inline cleanups.

b/pmd_trans_huge_migrate | 31 -

Remove false positive BUG_ON.

b/pte_alloc_trans_splitting | 13

Add BUG_ON matching the issue in pmd_trans_huge_migrate (pmd must be
null to call __pte_alloc; pmd_present is not enough if pmd_trans_huge
can be set). The reason is that, to optimize away one unnecessary IPI
for every split_huge_page, we very temporarily mark the pmd not present
but still huge for the duration of the IPI (this is to prevent
simultaneous 4k and 2M tlb entries that would machine check some CPUs
with errata).

b/set-recommended-min_free_kbytes | 10

Explicitly call setup_per_zone_wmarks even if min_free_kbytes is already
bigger than recommended_min (otherwise the reserved pageblocks won't
be enabled on huge systems). This makes the kernel version of
set_recommended_min_free_kbytes fully equivalent to the hugeadm
--set-recommended-min_free_kbytes command line.

b/transhuge-enable-direct-defrag | 3

Header update.

b/transhuge-selects-compaction | 15

Header update to explain why THP selects compaction.

b/transparent_hugepage | 114 ++++--

Make PageTransHuge inline and move it from huge_mm.h to page-flags.h.

Add BUG_ON if is_vma_temporary_stack is set during split_huge_page (we
can't fail; it should never trigger because the mremap done on the initial
kernel stack during execve, which sets the temporary stack flag for its
duration, shouldn't work on hugepages). The BUG_ON makes sure it won't
break silently if the user stack is ever born huge.

Use assert_spin_locked instead of VM_BUG_ON.

Remove potentially false positive bugcheck for not present pmd, same
as pte_alloc_trans_splitting.

b/transparent_hugepage-doc | 67 ++-

Doc improvement from Mel.

b/transparent_hugepage-numa | 50 +-

Fix memleak if memcg fails charge during khugepaged collapse_huge_page
with CONFIG_NUMA=y.

b/transparent_hugepage_vmstat-anon_vma-chain | 16


memcg_consume_stock | 56 ---
remove-lumpy_reclaim | 131 -------
exec-migrate-race-anon_vma-chain

removed.

FAQ:

Q: When will 1G pages be supported? (by far the most frequently asked question
in the last two days)
A: Not any time soon but it's not entirely impossible... The benefit of going
from 2M to 1G is likely much lower than the benefit of going from 4k to 2M
so it's unlikely to be a worthwhile effort for a while. And some CPUs
won't have a 1G TLB, so it would only speed up the tlb miss handler a bit but
it won't actually decrease the tlb miss rate.

Q: When will this work on file-backed pages? (pagecache/swapcache/tmpfs)
A: Not until it's merged in mainline. It's already feature complete for many
usages and the moment we expand into pagecache the patch would grow
significantly.

Q: When will KSM scan inside Transparent Hugepages?
A: Working on that, this should materialize soon enough.

Q: Where is the next place to remove split_huge_page_pmd()?
A: mremap. JVM uses mremap in the garbage collector so the ~18% boost (no virt)
has further margin for optimizations.

Full diffstat:

Documentation/vm/transhuge.txt | 298 ++++
arch/alpha/include/asm/mman.h | 3
arch/mips/include/asm/mman.h | 3
arch/parisc/include/asm/mman.h | 3
arch/powerpc/mm/gup.c | 12
arch/x86/include/asm/kvm_host.h | 1
arch/x86/include/asm/paravirt.h | 25
arch/x86/include/asm/paravirt_types.h | 6
arch/x86/include/asm/pgtable-2level.h | 9
arch/x86/include/asm/pgtable-3level.h | 23
arch/x86/include/asm/pgtable.h | 143 ++
arch/x86/include/asm/pgtable_64.h | 28
arch/x86/include/asm/pgtable_types.h | 3
arch/x86/kernel/paravirt.c | 3
arch/x86/kernel/tboot.c | 2
arch/x86/kernel/vm86_32.c | 1
arch/x86/kvm/mmu.c | 60
arch/x86/kvm/paging_tmpl.h | 4
arch/x86/mm/gup.c | 28
arch/x86/mm/pgtable.c | 66
arch/xtensa/include/asm/mman.h | 3
drivers/base/node.c | 21
fs/Kconfig | 2
fs/proc/meminfo.c | 14
fs/proc/page.c | 14
include/asm-generic/mman-common.h | 3
include/asm-generic/pgtable.h | 225 ++-
include/linux/compaction.h | 25
include/linux/gfp.h | 15
include/linux/huge_mm.h | 159 ++
include/linux/kernel.h | 7
include/linux/khugepaged.h | 67
include/linux/kvm_host.h | 4
include/linux/memory_hotplug.h | 14
include/linux/migrate.h | 12
include/linux/mm.h | 137 +
include/linux/mm_inline.h | 19
include/linux/mm_types.h | 3
include/linux/mmu_notifier.h | 66
include/linux/mmzone.h | 11
include/linux/page-flags.h | 65
include/linux/rmap.h | 2
include/linux/sched.h | 1
include/linux/swap.h | 2
include/linux/vmstat.h | 5
include/trace/events/compaction.h | 74 +
include/trace/events/vmscan.h | 6
kernel/fork.c | 12
kernel/futex.c | 55
mm/Kconfig | 38
mm/Makefile | 3
mm/compaction.c | 174 +-
mm/huge_memory.c | 2331 ++++++++++++++++++++++++++++++++++
mm/hugetlb.c | 70 -
mm/internal.h | 4
mm/ksm.c | 29
mm/madvise.c | 10
mm/memcontrol.c | 129 +
mm/memory-failure.c | 22
mm/memory.c | 199 ++
mm/memory_hotplug.c | 17
mm/mempolicy.c | 20
mm/migrate.c | 29
mm/mincore.c | 7
mm/mmap.c | 7
mm/mmu_notifier.c | 20
mm/mmzone.c | 21
mm/mprotect.c | 20
mm/mremap.c | 9
mm/page_alloc.c | 98 +
mm/pagewalk.c | 1
mm/pgtable-generic.c | 123 +
mm/rmap.c | 87 -
mm/sparse.c | 4
mm/swap.c | 131 +
mm/swap_state.c | 6
mm/swapfile.c | 2
mm/vmscan.c | 210 ++-
mm/vmstat.c | 69 -
virt/kvm/iommu.c | 2
virt/kvm/kvm_main.c | 56
81 files changed, 5189 insertions(+), 523 deletions(-)


2010-12-15 23:58:11

by Andrew Morton

Subject: Re: Transparent Hugepage Support #33

On Wed, 15 Dec 2010 06:15:40 +0100
Andrea Arcangeli <[email protected]> wrote:

> Some of some relevant user of the project:
>
> KVM Virtualization
> GCC (kernel build included, requires a few liner patch to enable)
> JVM
> VMware Workstation
> HPC
>
> It would be great if it could go in -mm.

That all merged pretty easily on top of the current mm pile. Except
for kvm-mmu-transparent-hugepage-support.patch which needs some thought
and testing to get it merged into the KVM changes in linux-next. I
simply omitted kvm-mmu-transparent-hugepage-support.patch so please
take a look?

2010-12-16 01:00:20

by Kamezawa Hiroyuki

Subject: Re: Transparent Hugepage Support #33

On Wed, 15 Dec 2010 06:15:40 +0100
Andrea Arcangeli <[email protected]> wrote:

> Some of some relevant user of the project:
>
> KVM Virtualization
> GCC (kernel build included, requires a few liner patch to enable)
> JVM
> VMware Workstation
> HPC
>
> It would be great if it could go in -mm.

Things that should be done in memory cgroup are:

- make accounting correct (RSS count will be broken)
- make move_charge() work
(at rmdir(), this is now broken. move-charge-at-task-move seems to work)

Do you know of any other points? I'll look into it when -mm is shipped.


Thanks,
-Kame

2010-12-16 01:18:22

by Daisuke Nishimura

Subject: Re: Transparent Hugepage Support #33

Hi,

On Thu, 16 Dec 2010 09:54:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Wed, 15 Dec 2010 06:15:40 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
> > Some of some relevant user of the project:
> >
> > KVM Virtualization
> > GCC (kernel build included, requires a few liner patch to enable)
> > JVM
> > VMware Workstation
> > HPC
> >
> > It would be great if it could go in -mm.
>
> Things should be done in memory cgroup is
>
> - make accounting correct (RSS count will be broken)
> - make move_charge() to work
> (at rmdir(), this is now broken. It seems move-charge-at-task-move to work)
>
Yes.
I think we should add mem_cgroup_split_hugepage_commit() and add PageTransHuge()
check in mem_cgroup_move_parent() as done in RHEL6 kernel.
As for move-charge-at-task-move, it will work because walk_pmd_range() splits
THP pages (it would be better to change move-charge not to split THP pages, but
it's not so urgent IMHO).

> Do you have known other viewpoints ?
Not yet, but I'll test and check.

> I'll look into when -mm is shipped.
>
me too :)


Thanks,
Daisuke Nishimura.

2010-12-16 01:20:16

by Andrew Morton

Subject: Re: Transparent Hugepage Support #33

On Thu, 16 Dec 2010 09:54:08 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> I'll look into when -mm is shipped.

That might take a while - linux-next is a screwed-up catastrophe and I
suppose some sucker has some bisecting to do.

(The second trace below looks similar to https://bugzilla.kernel.org/show_bug.cgi?id=24942)

[ 241.227687] INFO: task modprobe:904 blocked for more than 120 seconds.
[ 241.227979] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 241.228264] modprobe D 0000000000000007 0 904 1 0x00000000
[ 241.228525] ffff880255cbdc48 0000000000000046 ffff88009edd1dd8 ffff88025736e880
[ 241.228973] ffff880257a508c0 ffff88025736ebd8 0000000000000002 0000000100000000
[ 241.229421] 0000000000000002 0000000000000000 ffff88009edd1dd8 0000000000000000
[ 241.229879] Call Trace:
[ 241.230043] [<ffffffff81391496>] schedule_timeout+0x24/0x1b6
[ 241.230202] [<ffffffff81391293>] ? wait_for_common+0x3a/0x129
[ 241.230364] [<ffffffff8105e1ca>] ? trace_hardirqs_on+0xd/0xf
[ 241.230522] [<ffffffff81391322>] wait_for_common+0xc9/0x129
[ 241.230681] [<ffffffff810317d1>] ? default_wake_function+0x0/0xf
[ 241.230850] [<ffffffff8139141c>] wait_for_completion+0x18/0x1a
[ 241.231010] [<ffffffff8107e7bb>] synchronize_sched+0x51/0x58
[ 241.231169] [<ffffffff8104d3d0>] ? wakeme_after_rcu+0x0/0xf
[ 241.231329] [<ffffffff8106a772>] load_module+0xd4e/0xe81
[ 241.231489] [<ffffffff8106a8e5>] sys_init_module+0x40/0x1d7
[ 241.231658] [<ffffffff810029bb>] system_call_fastpath+0x16/0x1b
[ 241.231831] INFO: lockdep is turned off.

and

[ 271.500616] INFO: rcu_sched_state detected stall on CPU 5 (t=65032 jiffies)
[ 271.500616] sending NMI to all CPUs:
[ 271.500954] NMI backtrace for cpu 2
[ 271.501110] CPU 2
[ 271.501157] Modules linked in: ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq ide_cd_mod serio_raw snd_seq_device snd_pcm_oss shpchp cdrom option usb_wwan snd_mixer_oss snd_pcm usbserial snd_timer snd i2c_i801 soundcore button floppy i2c_core intel_rng(-) snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 271.503961]
[ 271.504122] Pid: 0, comm: kworker/0:1 Tainted: G W 2.6.37-rc5-mm1 #1 /
[ 271.504403] RIP: 0010:[<ffffffff81009c9b>] [<ffffffff81009c9b>] mwait_idle+0x76/0x82
[ 271.504662] RSP: 0018:ffff880257967f08 EFLAGS: 00000246
[ 271.504662] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 271.504662] RDX: 0000000000000000 RSI: ffff880257966010 RDI: ffffffff81009c91
[ 271.504662] RBP: ffff880257967f18 R08: 0000000000000000 R09: 0000000000000001
[ 271.504662] R10: ffffffff8102b7d4 R11: ffffffff81396dcc R12: 0000000000000000
[ 271.504662] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 271.504662] FS: 0000000000000000(0000) GS:ffff88009e200000(0000) knlGS:0000000000000000
[ 271.504662] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 271.504662] CR2: 0000003e5f0948f0 CR3: 000000000179b000 CR4: 00000000000006e0
[ 271.504662] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 271.504662] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 271.504662] Process kworker/0:1 (pid: 0, threadinfo ffff880257966000, task ffff8802579643c0)
[ 271.504662] Stack:
[ 271.504662] 0000000000000000 0000000000000002 ffff880257967f28 ffffffff810014cf
[ 271.504662] ffff880257967f48 ffffffff8138c3e8 ffffffff8138c25d 0000000000000000
[ 271.504662] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 271.504662] Call Trace:
[ 271.504662] [<ffffffff810014cf>] cpu_idle+0x48/0x68
[ 271.504662] [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[ 271.504662] [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[ 271.504662] Code: 31 db 48 89 f0 48 89 d9 48 89 da 0f 01 c8 0f ae f0 48 8b 87 38 e0 ff ff a8 08 75 11 e8 2c 45 05 00 48 89 d8 48 89 d9 fb 0f 01 c9 <eb> 06 e8 1b 45 05 00 fb 58 5b c9 c3 55 ba e8 12 00 00 48 89 e5
[ 271.504662] Call Trace:
[ 271.504662] [<ffffffff810014cf>] cpu_idle+0x48/0x68
[ 271.504662] [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[ 271.504662] [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[ 271.504662] Pid: 0, comm: kworker/0:1 Tainted: G W 2.6.37-rc5-mm1 #1
[ 271.504662] Call Trace:
[ 271.504662] <NMI> [<ffffffff8139529d>] ? arch_trigger_all_cpu_backtrace_handler+0x64/0x80
[ 271.504662] [<ffffffff81396d97>] ? notifier_call_chain+0x81/0xb6
[ 271.504662] [<ffffffff81396e27>] ? __atomic_notifier_call_chain+0x5b/0x84
[ 271.504662] [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[ 271.504662] [<ffffffff81396e5f>] ? atomic_notifier_call_chain+0xf/0x11
[ 271.504662] [<ffffffff81396e8f>] ? notify_die+0x2e/0x30
[ 271.504662] [<ffffffff8139454d>] ? do_nmi+0xa7/0x2a1
[ 271.504662] [<ffffffff8139424a>] ? nmi+0x1a/0x2c
[ 271.504662] [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[ 271.504662] [<ffffffff8102b7d4>] ? finish_task_switch+0x44/0xb8
[ 271.504662] [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[ 271.504662] [<ffffffff81009c9b>] ? mwait_idle+0x76/0x82
[ 271.504662] <<EOE>> [<ffffffff810014cf>] ? cpu_idle+0x48/0x68
[ 271.504662] [<ffffffff8138c3e8>] ? start_secondary+0x18b/0x18f
[ 271.504662] [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[ 271.500616] NMI backtrace for cpu 5
[ 271.500616] CPU 5
[ 271.500616] Modules linked in: ipv6 dm_mirror dm_region_hash dm_log dm_multipath dm_mod video sbs sbshc battery ac lp parport sg snd_hda_intel snd_hda_codec snd_seq_oss snd_seq_midi_event snd_seq ide_cd_mod serio_raw snd_seq_device snd_pcm_oss shpchp cdrom option usb_wwan snd_mixer_oss snd_pcm usbserial snd_timer snd i2c_i801 soundcore button floppy i2c_core intel_rng(-) snd_page_alloc pcspkr ehci_hcd ohci_hcd uhci_hcd
[ 271.500616]
[ 271.500616] Pid: 0, comm: kworker/0:1 Tainted: G W 2.6.37-rc5-mm1 #1 /
[ 271.500616] RIP: 0010:[<ffffffff8119b624>] [<ffffffff8119b624>] __bitmap_empty+0x5a/0x63
[ 271.500616] RSP: 0018:ffff88009e803e90 EFLAGS: 00000046
[ 271.500616] RAX: 0000000000000000 RBX: 0000000000002710 RCX: ffffffff8180e4e8
[ 271.500616] RDX: 0000000000000000 RSI: 00000000000000ff RDI: ffffffff8180e4e0
[ 271.500616] RBP: ffff88009e803e98 R08: 0000000000000003 R09: 0000000000000000
[ 271.500616] R10: 0000000000000000 R11: ffff88025589aec0 R12: ffff88009e9ce760
[ 271.500616] R13: ffffffff817b3080 R14: 0000000000000000 R15: ffffffff817b3180
[ 271.500616] FS: 0000000000000000(0000) GS:ffff88009e800000(0000) knlGS:0000000000000000
[ 271.500616] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 271.500616] CR2: 00000000008cfb80 CR3: 0000000255941000 CR4: 00000000000006e0
[ 271.500616] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 271.500616] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 271.500616] Process kworker/0:1 (pid: 0, threadinfo ffff8802579e8000, task ffff8802579e66c0)
[ 271.500616] Stack:
[ 271.500616] ffff88025589aec0 ffff88009e803eb8 ffffffff8101a53c ffff8802579e66c0
[ 271.500616] 0000000000000005 ffff88009e803ef8 ffffffff8107ec29 ffff8802579e66c0
[ 271.500616] 0000000000000005 0000000000000005 ffff8802579e66c0 0000000000000000
[ 271.500616] Call Trace:
[ 271.500616] <IRQ>
[ 271.500616] [<ffffffff8101a53c>] arch_trigger_all_cpu_backtrace+0x52/0x6a
[ 271.500616] [<ffffffff8107ec29>] __rcu_pending+0x7e/0x2f0
[ 271.500616] [<ffffffff8107ef1d>] rcu_check_callbacks+0x82/0xb3
[ 271.500616] [<ffffffff8104275f>] update_process_times+0x38/0x6e
[ 271.500616] [<ffffffff8105a0f8>] tick_periodic+0x63/0x6f
[ 271.500616] [<ffffffff8105a122>] tick_handle_periodic+0x1e/0x6b
[ 271.500616] [<ffffffff81019a37>] smp_apic_timer_interrupt+0x83/0x96
[ 271.500616] [<ffffffff810033d3>] apic_timer_interrupt+0x13/0x20
[ 271.500616] <EOI>
[ 271.500616] [<ffffffff81396dcc>] ? __atomic_notifier_call_chain+0x0/0x84
[ 271.500616] [<ffffffff8102b7d4>] ? finish_task_switch+0x44/0xb8
[ 271.500616] [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[ 271.500616] [<ffffffff81009c9b>] ? mwait_idle+0x76/0x82
[ 271.500616] [<ffffffff81009c91>] ? mwait_idle+0x6c/0x82
[ 271.500616] [<ffffffff810014cf>] cpu_idle+0x48/0x68
[ 271.500616] [<ffffffff8138c3e8>] start_secondary+0x18b/0x18f
[ 271.500616] [<ffffffff8138c25d>] ? start_secondary+0x0/0x18f
[ 271.500616] Code: 89 f0 83 e0 3f 85 c0 74 24 89 f0 4c 63 c2 b9 40 00 00 00 99 f7 f9 b8 01 00 00 00 89 d1 48 d3 e0 48 ff c8 4a 85 04 c7 74 04 31 c0 <eb> 05 b8 01 00 00 00 c9 c3 55 ba 40 00 00 00 89 f1 48 89 e5 53

2010-12-16 02:15:33

by Andrea Arcangeli

Subject: Re: Transparent Hugepage Support #33

Hi Daisuke and Kame,

On Thu, Dec 16, 2010 at 10:10:53AM +0900, Daisuke Nishimura wrote:
> Hi,
>
> On Thu, 16 Dec 2010 09:54:08 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Wed, 15 Dec 2010 06:15:40 +0100
> > Andrea Arcangeli <[email protected]> wrote:
> >
> > > Some of some relevant user of the project:
> > >
> > > KVM Virtualization
> > > GCC (kernel build included, requires a few liner patch to enable)
> > > JVM
> > > VMware Workstation
> > > HPC
> > >
> > > It would be great if it could go in -mm.
> >
> > Things should be done in memory cgroup is
> >
> > - make accounting correct (RSS count will be broken)
> > - make move_charge() to work
> > (at rmdir(), this is now broken. It seems move-charge-at-task-move to work)
> >
> Yes.
> I think we should add mem_cgroup_split_hugepage_commit() and add PageTransHuge()
> check in mem_cgroup_move_parent() as done in RHEL6 kernel.

Yes, unfortunately porting all the RHEL6 THP cgroups bits wasn't
trivial because of the difference in the cgroup code.

> As for move-charge-at-task-move, it will work because walk_pmd_range() splits
> THP pages(it would be better to change move-charge not to split THP pages, but
> it's not so urgent IMHO).
>
> > Do you have known other viewpoints ?
> Not yet, but I'll test and check.

Same here.

One detail I'd ask you to check is the compound_trans_order I added in
#33 for memory-failure and cgroups. It's not really necessary in memcg
if we stop reading the order and we do page_size = HPAGE_PMD_SIZE
instead. I thought having the cgroup code handling compound pages
without hardwiring the size was better but maybe it's not. Maybe the
compound_lock locking should also be extended there? It's up to you
what you prefer there but I'll try to help as much as I can.

BTW, now that it's in -mm I'll keep any further change incremental at
the end and I'll stop rebasing to avoid confusion.

> > I'll look into when -mm is shipped.
> >
> me too :)

Thanks a lot!

2010-12-16 02:37:00

by Andrea Arcangeli

Subject: kvm mmu transparent hugepage support for linux-next

Hi Andrew,

On Wed, Dec 15, 2010 at 03:55:45PM -0800, Andrew Morton wrote:
> On Wed, 15 Dec 2010 06:15:40 +0100
> Andrea Arcangeli <[email protected]> wrote:
>
> > Some of some relevant user of the project:
> >
> > KVM Virtualization
> > GCC (kernel build included, requires a few liner patch to enable)
> > JVM
> > VMware Workstation
> > HPC
> >
> > It would be great if it could go in -mm.
>
> That all merged pretty easily on top of the current mm pile. Except
> for kvm-mmu-transparent-hugepage-support.patch which needs some thought
> and testing to get it merged into the KVM changes in linux-next. I
> simply omitted kvm-mmu-transparent-hugepage-support.patch so please
> take a look?

Ok, I've an untested patch as full replacement of the
kvm-mmu-transparent-hugepage-support.patch, for linux-next. It's
untested because I didn't even try to boot linux-next after reading
your last mail about it. In the meantime I'd appreciate review from
Marcelo.

For Marcelo: before, we were calling gup and checking if the pfn was
part of a compound page, and we were returning the right "level" from
inside mapping_level(). Now mapping_level is only left to detect
hugetlbfs. So if hugetlbfs isn't detected, _after_ gfn_to_pfn runs, we
check if the pfn is part of a trans compound page. If it is, we adjust
pfn/gfn after the fact before invoking spte establishment. It should
be functionally equivalent to the previous version and it eliminates
one unnecessary gfn_to_pfn/gup invocation compared to the previous
code. I had to rewrite it to adjust after the fact (async page fault)
to avoid invalidating async page faults (or to avoid handling async
page faults inside mapping_level itself, which would litter its
interface and make it a lot more complex). If we're allowed to adjust
after the fact, this is simpler, more efficient and it'll live happily
with the async page faults. Note: I didn't adjust the guest virtual
address as I don't think it needs adjustment. Let me know if you see
something wrong with this, thanks! (The good thing is, if something's
wrong we'll notice it very quickly as soon as we can test it :)
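
The "adjust after the fact" step boils down to rounding both frame numbers
down to the start of the hugepage. A standalone sketch, assuming 2M
hugepages made of 4k base pages (512 frames each); frame_t, hpage_align_down
and adjust_to_hpage are hypothetical names, and the offset check is an extra
sanity test that is not part of the posted patch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2M hugepage / 4k base page = 512 frames per hugepage. */
#define PAGES_PER_HPAGE 512ULL

typedef uint64_t frame_t;	/* stand-in for the kernel's gfn_t / pfn_t */

/* Round a frame number down to the first frame of its hugepage. */
static frame_t hpage_align_down(frame_t frame)
{
	/* The mask trick only works for power-of-two sizes. */
	assert((PAGES_PER_HPAGE & (PAGES_PER_HPAGE - 1)) == 0);
	return frame & ~(PAGES_PER_HPAGE - 1);
}

/* Adjust a (gfn, pfn) pair in place so both point at the hugepage start. */
static void adjust_to_hpage(frame_t *gfn, frame_t *pfn)
{
	assert(gfn != 0 && pfn != 0);
	/*
	 * Extra check (an assumption of this sketch): guest and host frames
	 * must sit at the same offset inside the hugepage, otherwise one
	 * large mapping could not cover both.
	 */
	assert((*gfn & (PAGES_PER_HPAGE - 1)) == (*pfn & (PAGES_PER_HPAGE - 1)));
	*gfn = hpage_align_down(*gfn);
	*pfn = hpage_align_down(*pfn);
}
```

For example, gfn 0x12345 and pfn 0xabd45 share the in-hugepage offset 0x145
and are adjusted to 0x12200 and 0xabc00.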

=========
Subject: kvm mmu transparent hugepage support

From: Andrea Arcangeli <[email protected]>

This should work for both hugetlbfs and transparent hugepages.

Signed-off-by: Andrea Arcangeli <[email protected]>
---

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index bdb9fa9..22062b2 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2286,6 +2286,18 @@ static int kvm_handle_bad_page(struct kvm *kvm, gfn_t gfn, pfn_t pfn)
 	return 1;
 }

+static void transparent_hugepage_adjust(gfn_t *gfn, pfn_t *pfn, int * level)
+{
+	/* check if it's a transparent hugepage */
+	if (!is_error_pfn(*pfn) && !kvm_is_mmio_pfn(*pfn) &&
+	    *level == PT_PAGE_TABLE_LEVEL &&
+	    PageTransCompound(pfn_to_page(*pfn))) {
+		*level = PT_DIRECTORY_LEVEL;
+		*gfn = *gfn & ~(KVM_PAGES_PER_HPAGE(*level) - 1);
+		*pfn = *pfn & ~(KVM_PAGES_PER_HPAGE(*level) - 1);
+	}
+}
+
 static bool try_async_pf(struct kvm_vcpu *vcpu, bool no_apf, gfn_t gfn,
 			 gva_t gva, pfn_t *pfn, bool write, bool *writable);

@@ -2314,6 +2326,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn,

 	if (try_async_pf(vcpu, no_apf, gfn, v, &pfn, write, &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&gfn, &pfn, &level);

 	/* mmio */
 	if (is_error_pfn(pfn))
@@ -2676,6 +2689,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t gpa, u32 error_code,

 	if (try_async_pf(vcpu, no_apf, gfn, gpa, &pfn, write, &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&gfn, &pfn, &level);

 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 590bf12..bc91891 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -575,6 +575,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t addr, u32 error_code,
 	if (try_async_pf(vcpu, no_apf, walker.gfn, addr, &pfn, write_fault,
 			 &map_writable))
 		return 0;
+	transparent_hugepage_adjust(&walker.gfn, &pfn, &level);

 	/* mmio */
 	if (is_error_pfn(pfn))
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index fb93ff9..4fa0121 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -103,8 +103,36 @@ static pfn_t fault_pfn;
 inline int kvm_is_mmio_pfn(pfn_t pfn)
 {
 	if (pfn_valid(pfn)) {
-		struct page *page = compound_head(pfn_to_page(pfn));
-		return PageReserved(page);
+		struct page *head;
+		struct page *tail = pfn_to_page(pfn);
+		head = compound_head(tail);
+		if (head != tail) {
+			smp_rmb();
+			/*
+			 * head may be a dangling pointer.
+			 * __split_huge_page_refcount clears PageTail
+			 * before overwriting first_page, so if
+			 * PageTail is still there it means the head
+			 * pointer isn't dangling.
+			 */
+			if (PageTail(tail)) {
+				/*
+				 * the "head" is not a dangling
+				 * pointer but the hugepage may have
+				 * been splitted from under us (and we
+				 * may not hold a reference count on
+				 * the head page so it can be reused
+				 * before we run PageReferenced), so
+				 * we've to recheck PageTail before
+				 * returning what we just read.
+				 */
+				int reserved = PageReserved(head);
+				smp_rmb();
+				if (PageTail(tail))
+					return reserved;
+			}
+		}
+		return PageReserved(tail);
 	}

 	return true;
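
The check / read / re-check pattern in the kvm_is_mmio_pfn hunk can be
modelled in userspace roughly as follows. This is only a sketch: struct
toy_page and its fields are hypothetical, and sequentially consistent C11
atomic loads stand in for the smp_rmb() ordering used in the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; field names are hypothetical. */
struct toy_page {
	atomic_bool tail;		/* models PageTail() */
	atomic_bool reserved;		/* models PageReserved() */
	struct toy_page *_Atomic head;	/* models first_page; valid while tail is set */
};

static bool toy_is_reserved(struct toy_page *page)
{
	struct toy_page *head = page;

	assert(page != NULL);
	if (atomic_load(&page->tail))
		head = atomic_load(&page->head);

	if (head != page) {
		/* First re-check: still a tail, so head was not dangling. */
		if (atomic_load(&page->tail)) {
			bool reserved = atomic_load(&head->reserved);

			/* Second re-check: no split raced with the read. */
			if (atomic_load(&page->tail))
				return reserved;
		}
	}
	/* Not (or no longer) a tail page: answer for the page itself. */
	return atomic_load(&page->reserved);
}
```

The value read from the head is only trusted if the page was a tail both
before and after the read; otherwise the function falls back to the page's
own flag.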

2010-12-20 11:17:04

by Mel Gorman

Subject: Re: Transparent Hugepage Support #33

On Wed, Dec 15, 2010 at 06:15:40AM +0100, Andrea Arcangeli wrote:
> Some of some relevant user of the project:
>
> KVM Virtualization
> GCC (kernel build included, requires a few liner patch to enable)
> JVM
> VMware Workstation
> HPC
>
> It would be great if it could go in -mm.
>

I ran some basic performance tests comparing base pages, hugetlbfs and
transparent huge pages.

STREAM (triad only)
Triad--17.0 18955.94 ( 0.00%) 18955.94 ( 0.00%) 18955.94 ( 0.00%)
Triad--17.33 19756.78 ( 0.00%) 19756.78 ( 0.00%) 19808.90 ( 0.26%)
Triad--17.66 19918.20 ( 0.00%) 19918.20 ( 0.00%) 19918.20 ( 0.00%)
Triad--18.0 19303.15 ( 0.00%) 19687.37 ( 1.95%) 19199.75 (-0.54%)
Triad--18.33 18397.44 ( 0.00%) 18556.45 ( 0.86%) 18443.83 ( 0.25%)
Triad--18.66 18917.43 ( 0.00%) 19088.28 ( 0.90%) 18865.09 (-0.28%)
Triad--19.0 16338.07 ( 0.00%) 18794.78 (13.07%) 16380.81 ( 0.26%)
Triad--19.33 11402.08 ( 0.00%) 11387.21 (-0.13%) 11226.44 (-1.56%)
Triad--19.66 9654.13 ( 0.00%) 9516.96 (-1.44%) 9666.16 ( 0.12%)
Triad--20.0 9556.79 ( 0.00%) 9572.48 ( 0.16%) 9573.63 ( 0.18%)
Triad--20.33 9553.81 ( 0.00%) 9524.22 (-0.31%) 9552.19 (-0.02%)
Triad--20.66 9504.67 ( 0.00%) 9504.67 ( 0.00%) 9509.61 ( 0.05%)
Triad--21.0 9500.04 ( 0.00%) 9538.13 ( 0.40%) 9501.06 ( 0.01%)
Triad--21.33 9355.53 ( 0.00%) 9511.82 ( 1.64%) 9391.13 ( 0.38%)
Triad--21.66 9310.97 ( 0.00%) 9535.04 ( 2.35%) 9459.83 ( 1.57%)
Triad--22.0 9264.88 ( 0.00%) 9521.61 ( 2.70%) 9512.85 ( 2.61%)
Triad--22.33 9197.81 ( 0.00%) 9505.28 ( 3.23%) 9442.67 ( 2.59%)
Triad--22.66 8535.29 ( 0.00%) 8965.94 ( 4.80%) 8839.97 ( 3.45%)
Triad--23.0 7158.25 ( 0.00%) 7462.07 ( 4.07%) 7373.10 ( 2.91%)
Triad--23.33 5659.50 ( 0.00%) 5708.15 ( 0.85%) 5695.34 ( 0.63%)
Triad--23.66 5191.97 ( 0.00%) 5200.99 ( 0.17%) 5175.16 (-0.32%)
Triad--24.0 4960.82 ( 0.00%) 5038.79 ( 1.55%) 5017.61 ( 1.13%)
Triad--24.33 4734.72 ( 0.00%) 4767.03 ( 0.68%) 4752.25 ( 0.37%)
Triad--24.66 4694.59 ( 0.00%) 4687.10 (-0.16%) 4698.72 ( 0.09%)
Triad--25.0 4701.91 ( 0.00%) 4823.23 ( 2.52%) 4759.94 ( 1.22%)
Triad--25.33 4664.94 ( 0.00%) 4748.64 ( 1.76%) 4690.97 ( 0.55%)
Triad--25.66 4670.35 ( 0.00%) 4751.30 ( 1.70%) 4706.59 ( 0.77%)
Triad--26.0 4704.77 ( 0.00%) 4814.09 ( 2.27%) 4788.46 ( 1.75%)
Triad--26.33 4702.14 ( 0.00%) 4707.05 ( 0.10%) 4677.77 (-0.52%)
Triad--26.66 4668.22 ( 0.00%) 4682.79 ( 0.31%) 4671.49 ( 0.07%)
Triad--27.0 4728.34 ( 0.00%) 4807.55 ( 1.65%) 4794.87 ( 1.39%)
Triad--27.33 4722.43 ( 0.00%) 4765.43 ( 0.90%) 4757.13 ( 0.73%)
Triad--27.66 4721.08 ( 0.00%) 4748.82 ( 0.58%) 4748.01 ( 0.57%)
Triad--28.0 4720.13 ( 0.00%) 4804.78 ( 1.76%) 4792.87 ( 1.52%)
Triad--28.33 4685.32 ( 0.00%) 4674.07 (-0.24%) 4627.00 (-1.26%)
Triad--28.66 4689.31 ( 0.00%) 4690.17 ( 0.02%) 4654.35 (-0.75%)
Triad--29.0 4740.42 ( 0.00%) 4780.69 ( 0.84%) 4779.78 ( 0.82%)
Triad--29.33 4688.10 ( 0.00%) 4655.82 (-0.69%) 4722.80 ( 0.73%)
Triad--29.66 4719.65 ( 0.00%) 4670.27 (-1.06%) 4768.32 ( 1.02%)
Triad--30.0 4731.50 ( 0.00%) 4786.19 ( 1.14%) 4773.81 ( 0.89%)
Triad--30.33 4722.82 ( 0.00%) 4734.01 ( 0.24%) 4748.29 ( 0.54%)
Triad--30.66 4732.06 ( 0.00%) 4721.55 (-0.22%) 4733.16 ( 0.02%)
Triad--31.0 4756.53 ( 0.00%) 4784.76 ( 0.59%) 4767.52 ( 0.23%)

I didn't include the other operations because the results are comparable
each time. Broadly speaking, hugetlbfs does slightly better but
transparent huge pages did improve performance a small amount.

SYSBENCH
threads  base                huge                transhuge
1        18629.91 ( 0.00%)   19017.23 ( 2.04%)   18766.30 ( 0.73%)
2        29691.39 ( 0.00%)   30062.81 ( 1.24%)   29808.59 ( 0.39%)
3        39824.00 ( 0.00%)   40324.75 ( 1.24%)   40002.75 ( 0.45%)
4        67639.65 ( 0.00%)   69231.83 ( 2.30%)   68305.58 ( 0.97%)
5        66833.81 ( 0.00%)   68339.77 ( 2.20%)   67393.01 ( 0.83%)
6        66168.22 ( 0.00%)   67875.52 ( 2.52%)   67255.45 ( 1.62%)
7        65775.08 ( 0.00%)   67386.93 ( 2.39%)   66208.60 ( 0.65%)
8        64899.14 ( 0.00%)   66588.38 ( 2.54%)   65367.80 ( 0.72%)

In some ways this is more interesting. hugetlbfs is backing only the
shared memory segment whereas transhuge is promoting other areas. Hence,
it's not really a like-with-like comparison but still, transparent
hugepages is pushing up performance by a small amount.

NAS-SER C Class (time, lower is better)
      base               huge-heap          transhuge
bt.C  1389.33 ( 0.00%)   1421.64 (-2.27%)   1315.75 ( 5.59%)
cg.C   561.27 ( 0.00%)    509.38 (10.19%)    562.71 (-0.26%)
ep.C   375.78 ( 0.00%)    376.69 (-0.24%)    371.86 ( 1.05%)
ft.C   374.43 ( 0.00%)    371.73 ( 0.73%)    341.87 ( 9.52%)
is.C    17.84 ( 0.00%)     18.80 (-5.11%)     18.49 (-3.52%)
lu.C  1655.91 ( 0.00%)   1668.52 (-0.76%)   1662.25 (-0.38%)
mg.C   134.28 ( 0.00%)    136.96 (-1.96%)    128.04 ( 4.87%)
sp.C  1214.57 ( 0.00%)   1261.40 (-3.71%)   1151.98 ( 5.43%)
ua.C  1070.87 ( 0.00%)   1115.73 (-4.02%)   1048.45 ( 2.14%)

This is more of a like-with-like comparison as hugetlbfs is only backing
the heap. Results were mixed. Sometimes hugetlbfs was better and other times
transhuge was; THP won the majority of the time.

SPECjvm huge page comparison
               base              huge              transhuge
compiler       145.54 ( 0.00%)   156.00 ( 6.71%)   156.23 ( 6.84%)
compress       168.07 ( 0.00%)   175.15 ( 4.04%)   174.83 ( 3.87%)
crypto         164.30 ( 0.00%)   157.16 (-4.54%)   156.39 (-5.06%)
derby           53.64 ( 0.00%)    68.71 (21.93%)    58.57 ( 8.42%)
mpegaudio       81.80 ( 0.00%)    94.29 (13.25%)    92.58 (11.64%)
scimark.large   22.97 ( 0.00%)    21.43 (-7.19%)    21.59 (-6.39%)
scimark.small  119.25 ( 0.00%)   122.10 ( 2.33%)   121.44 ( 1.80%)
serial          46.93 ( 0.00%)    46.83 (-0.21%)    47.65 ( 1.51%)
sunflow         47.49 ( 0.00%)    50.03 ( 5.08%)    48.51 ( 2.10%)
xml            206.17 ( 0.00%)   211.42 ( 2.48%)   212.77 ( 3.10%)

hugetlbfs edged out transparent hugepages the majority of the time but
broadly speaking they were comparable in terms of performance.

The bottom line is that overall transparent hugepages are delivering the
expected performance for this range of workloads at least. It's generally
not as good as hugetlbfs in terms of raw performance but that is hardly a
surprise considering how they both operate and what their objectives are.
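
The mail doesn't say how the percentages in brackets are computed. The
figures are consistent with the difference being divided by the compared
(non-base) value, with the sign flipped for the NAS times where lower is
better. A small sketch of that inferred formula (the function names are
hypothetical):

```c
#include <assert.h>

/* Gain for "higher is better" metrics such as throughput. */
static double gain_higher_is_better(double base, double candidate)
{
	assert(candidate > 0.0);
	return (1.0 - base / candidate) * 100.0;
}

/* Gain for "lower is better" metrics such as elapsed time. */
static double gain_lower_is_better(double base, double candidate)
{
	assert(candidate > 0.0);
	return (base / candidate - 1.0) * 100.0;
}
```

For instance, gain_higher_is_better(18629.91, 19017.23) is roughly 2.04
(SYSBENCH, 1 thread, hugetlbfs) and gain_lower_is_better(561.27, 509.38) is
roughly 10.19 (cg.C, huge-heap).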

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab