2016-04-05 21:10:19

by Hugh Dickins

Subject: [PATCH 00/31] huge tmpfs: THPagecache implemented by teams

Here is my "huge tmpfs" implementation of Transparent Huge Pagecache,
rebased to v4.6-rc2 plus the "mm: easy preliminaries to THPagecache"
series.

The design is just the same as before, when I posted against v3.19:
using a team of pagecache pages placed within a huge-order extent,
instead of using a compound page (see 04/31 for more info on that).

Patches 01-17 are much as before, but with whatever changes were
needed for the rebase, and bugfixes folded back in. Patches 18-22
add memcg and smaps visibility. But the more important ones are
patches 23-29, which add recovery: reassembling a hugepage after
fragmentation or swapping. Patches 30-31 reflect gfpmask doubts:
you might prefer that I fold 31 back in and keep 30 internal.

It was lack of recovery which stopped me from proposing inclusion
of the series a year ago: this series now is fully featured, and
ready for v4.7 - but I expect we shall want to wait a release to
give time to consider the alternatives.

I currently believe that the same functionality (including the
team implementation's support for small files, standard mlocking,
and recovery) can be achieved with compound pages, but not easily:
I think the huge tmpfs functionality should be made available soon,
then converted at leisure to compound pages, if that works out (but
it's not a job I want to do - what we have here is good enough).

Huge tmpfs has been in use within Google for about a year: it's
been a success, and gaining ever wider adoption. Several TODOs
have not yet been toDONE, because they just haven't surfaced as
real-life issues yet: that includes NUMA migration, which is at
the top of my list, but so far we've done well enough without it.

01 huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m
02 huge tmpfs: include shmem freeholes in available memory
03 huge tmpfs: huge=N mount option and /proc/sys/vm/shmem_huge
04 huge tmpfs: try to allocate huge pages, split into a team
05 huge tmpfs: avoid team pages in a few places
06 huge tmpfs: shrinker to migrate and free underused holes
07 huge tmpfs: get_unmapped_area align & fault supply huge page
08 huge tmpfs: try_to_unmap_one use page_check_address_transhuge
09 huge tmpfs: avoid premature exposure of new pagetable
10 huge tmpfs: map shmem by huge page pmd or by page team ptes
11 huge tmpfs: disband split huge pmds on race or memory failure
12 huge tmpfs: extend get_user_pages_fast to shmem pmd
13 huge tmpfs: use Unevictable lru with variable hpage_nr_pages
14 huge tmpfs: fix Mlocked meminfo, track huge & unhuge mlocks
15 huge tmpfs: fix Mapped meminfo, track huge & unhuge mappings
16 kvm: plumb return of hva when resolving page fault.
17 kvm: teach kvm to map page teams as huge pages.
18 huge tmpfs: mem_cgroup move charge on shmem huge pages
19 huge tmpfs: mem_cgroup shmem_pmdmapped accounting
20 huge tmpfs: mem_cgroup shmem_hugepages accounting
21 huge tmpfs: show page team flag in pageflags
22 huge tmpfs: /proc/<pid>/smaps show ShmemHugePages
23 huge tmpfs recovery: framework for reconstituting huge pages
24 huge tmpfs recovery: shmem_recovery_populate to fill huge page
25 huge tmpfs recovery: shmem_recovery_remap & remap_team_by_pmd
26 huge tmpfs recovery: shmem_recovery_swapin to read from swap
27 huge tmpfs recovery: tweak shmem_getpage_gfp to fill team
28 huge tmpfs recovery: debugfs stats to complete this phase
29 huge tmpfs recovery: page migration call back into shmem
30 huge tmpfs: shmem_huge_gfpmask and shmem_recovery_gfpmask
31 huge tmpfs: no kswapd by default on sync allocations

Documentation/cgroup-v1/memory.txt | 2
Documentation/filesystems/proc.txt | 20
Documentation/filesystems/tmpfs.txt | 106 +
Documentation/sysctl/vm.txt | 46
Documentation/vm/pagemap.txt | 2
Documentation/vm/transhuge.txt | 38
Documentation/vm/unevictable-lru.txt | 15
arch/mips/mm/gup.c | 15
arch/s390/mm/gup.c | 19
arch/sparc/mm/gup.c | 19
arch/x86/kvm/mmu.c | 150 +
arch/x86/kvm/paging_tmpl.h | 6
arch/x86/mm/gup.c | 15
drivers/base/node.c | 20
drivers/char/mem.c | 23
fs/proc/meminfo.c | 11
fs/proc/page.c | 6
fs/proc/task_mmu.c | 28
include/linux/huge_mm.h | 14
include/linux/kvm_host.h | 2
include/linux/memcontrol.h | 17
include/linux/migrate.h | 2
include/linux/migrate_mode.h | 2
include/linux/mm.h | 3
include/linux/mm_types.h | 1
include/linux/mmzone.h | 5
include/linux/page-flags.h | 10
include/linux/shmem_fs.h | 29
include/trace/events/migrate.h | 7
include/trace/events/mmflags.h | 7
include/uapi/linux/kernel-page-flags.h | 3
ipc/shm.c | 6
kernel/sysctl.c | 33
mm/compaction.c | 5
mm/filemap.c | 10
mm/gup.c | 19
mm/huge_memory.c | 363 +++-
mm/internal.h | 26
mm/memcontrol.c | 187 +-
mm/memory-failure.c | 7
mm/memory.c | 225 +-
mm/mempolicy.c | 13
mm/migrate.c | 37
mm/mlock.c | 183 +-
mm/mmap.c | 16
mm/page-writeback.c | 2
mm/page_alloc.c | 55
mm/rmap.c | 129 -
mm/shmem.c | 2066 ++++++++++++++++++++++-
mm/swap.c | 5
mm/truncate.c | 2
mm/util.c | 1
mm/vmscan.c | 47
mm/vmstat.c | 3
tools/vm/page-types.c | 2
virt/kvm/kvm_main.c | 14
56 files changed, 3627 insertions(+), 472 deletions(-)


2016-04-05 21:12:33

by Hugh Dickins

Subject: [PATCH 01/31] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m

Abbreviate NR_ANON_TRANSPARENT_HUGEPAGES to NR_ANON_HUGEPAGES,
add NR_SHMEM_HUGEPAGES, NR_SHMEM_PMDMAPPED, NR_SHMEM_FREEHOLES:
to be accounted in later commits, when we shall need some visibility.

Shown in /proc/meminfo and /sys/devices/system/node/nodeN/meminfo
as AnonHugePages (as before), ShmemHugePages, ShmemPmdMapped,
ShmemFreeHoles; /proc/vmstat and /sys/devices/system/node/nodeN/vmstat
as nr_anon_transparent_hugepages (as before), nr_shmem_hugepages,
nr_shmem_pmdmapped, nr_shmem_freeholes.

Be upfront about this being Shmem, neither file nor anon: Shmem
is sometimes counted as file (as in Cached) and sometimes as anon
(as in Active(anon)); which is too confusing. Shmem is already
shown in meminfo, so use that term, rather than tmpfs or shm.

ShmemHugePages will show that portion of Shmem which is allocated
on complete huge pages. ShmemPmdMapped (named not to misalign the
%8lu) will show that portion of ShmemHugePages which is mapped into
userspace with huge pmds. ShmemFreeHoles will show the wastage
from using huge pages for small, or sparsely occupied, or unrounded
files: wastage not included in Shmem or MemFree, but will be freed
under memory pressure. (But no count for the partially occupied
portions of huge pages: seems less important, but could be added.)

Since shmem_freeholes are otherwise hidden, they ought to be shown by
show_free_areas(), in OOM-kill or ALT-SysRq-m or /proc/sysrq-trigger m.
shmem_hugepages is a subset of shmem, and shmem_pmdmapped a subset of
shmem_hugepages: there is not a strong argument for adding them here
(anon_hugepages is not shown), but include them anyway for reassurance.
Note that shmem_hugepages (and _pmdmapped and _freeholes) page counts
are shown in smallpage units, like other fields: not in hugepage units.

The lines get rather long: abbreviate thus
mapped:19778 shmem:38 pagetables:1153 bounce:0
shmem_hugepages:0 _pmdmapped:0 _freeholes:2044
free:3261805 free_pcp:9444 free_cma:0
and
... shmem:92kB _hugepages:0kB _pmdmapped:0kB _freeholes:0kB ...

Tidy up the CONFIG_TRANSPARENT_HUGEPAGE printf blocks in
fs/proc/meminfo.c and drivers/base/node.c: the shorter names help.
Clarify a comment in page_remove_rmap() to refer to "hugetlbfs pages"
rather than hugepages generally. I left arch/tile/mm/pgtable.c's
show_mem() unchanged: tile does not HAVE_ARCH_TRANSPARENT_HUGEPAGE.
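
For reference, a small userspace cross-check of the names and units
above; this is illustrative only, not part of the patch, and assumes
x86_64 with 4kB base pages and 2MB huge pages, so each hugepage counted
in /proc/vmstat corresponds to 2048kB in /proc/meminfo:

/* Illustrative only: cross-check ShmemHugePages (kB, /proc/meminfo)
 * against nr_shmem_hugepages (hugepage count, /proc/vmstat), assuming
 * 4kB base pages and 2MB huge pages, i.e. HPAGE_PMD_NR = 512.
 */
#include <stdio.h>
#include <string.h>

static long scan(const char *path, const char *key)
{
	char line[256];
	long val = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, key, strlen(key))) {
			sscanf(line + strlen(key), "%ld", &val);
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	long huge_kb  = scan("/proc/meminfo", "ShmemHugePages:");
	long pmd_kb   = scan("/proc/meminfo", "ShmemPmdMapped:");
	long holes_kb = scan("/proc/meminfo", "ShmemFreeHoles:");
	long huge_nr  = scan("/proc/vmstat",  "nr_shmem_hugepages");

	printf("ShmemHugePages %ld kB = nr_shmem_hugepages %ld * 2048 kB\n",
	       huge_kb, huge_nr);
	printf("ShmemPmdMapped %ld kB, ShmemFreeHoles %ld kB\n",
	       pmd_kb, holes_kb);
	return 0;
}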

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/proc.txt | 10 ++++++++--
drivers/base/node.c | 20 +++++++++++---------
fs/proc/meminfo.c | 11 ++++++++---
include/linux/mmzone.h | 5 ++++-
mm/huge_memory.c | 2 +-
mm/page_alloc.c | 17 +++++++++++++++++
mm/rmap.c | 14 ++++++--------
mm/vmstat.c | 3 +++
8 files changed, 58 insertions(+), 24 deletions(-)

--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -853,7 +853,7 @@ Dirty: 968 kB
Writeback: 0 kB
AnonPages: 861800 kB
Mapped: 280372 kB
-Shmem: 644 kB
+Shmem: 26396 kB
Slab: 284364 kB
SReclaimable: 159856 kB
SUnreclaim: 124508 kB
@@ -867,6 +867,9 @@ VmallocTotal: 112216 kB
VmallocUsed: 428 kB
VmallocChunk: 111088 kB
AnonHugePages: 49152 kB
+ShmemHugePages: 20480 kB
+ShmemPmdMapped: 12288 kB
+ShmemFreeHoles: 0 kB

MemTotal: Total usable ram (i.e. physical ram minus a few reserved
bits and the kernel binary code)
@@ -908,7 +911,6 @@ MemAvailable: An estimate of how much me
Dirty: Memory which is waiting to get written back to the disk
Writeback: Memory which is actively being written back to the disk
AnonPages: Non-file backed pages mapped into userspace page tables
-AnonHugePages: Non-file backed huge pages mapped into userspace page tables
Mapped: files which have been mmaped, such as libraries
Shmem: Total memory used by shared memory (shmem) and tmpfs
Slab: in-kernel data structures cache
@@ -949,6 +951,10 @@ Committed_AS: The amount of memory prese
VmallocTotal: total size of vmalloc memory area
VmallocUsed: amount of vmalloc area which is used
VmallocChunk: largest contiguous block of vmalloc area which is free
+ AnonHugePages: Non-file backed huge pages mapped into userspace page tables
+ShmemHugePages: tmpfs-file backed huge pages completed (subset of Shmem)
+ShmemPmdMapped: tmpfs-file backed huge pages with huge mappings into userspace
+ShmemFreeHoles: Space reserved for tmpfs team pages but available to shrinker

..............................................................................

--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -111,9 +111,6 @@ static ssize_t node_read_meminfo(struct
"Node %d Slab: %8lu kB\n"
"Node %d SReclaimable: %8lu kB\n"
"Node %d SUnreclaim: %8lu kB\n"
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- "Node %d AnonHugePages: %8lu kB\n"
-#endif
,
nid, K(node_page_state(nid, NR_FILE_DIRTY)),
nid, K(node_page_state(nid, NR_WRITEBACK)),
@@ -130,13 +127,18 @@ static ssize_t node_read_meminfo(struct
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE) +
node_page_state(nid, NR_SLAB_UNRECLAIMABLE)),
nid, K(node_page_state(nid, NR_SLAB_RECLAIMABLE)),
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE))
- , nid,
- K(node_page_state(nid, NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR));
-#else
nid, K(node_page_state(nid, NR_SLAB_UNRECLAIMABLE)));
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ n += sprintf(buf + n,
+ "Node %d AnonHugePages: %8lu kB\n"
+ "Node %d ShmemHugePages: %8lu kB\n"
+ "Node %d ShmemPmdMapped: %8lu kB\n"
+ "Node %d ShmemFreeHoles: %8lu kB\n",
+ nid, K(node_page_state(nid, NR_ANON_HUGEPAGES)*HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_HUGEPAGES)*HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_PMDMAPPED)*HPAGE_PMD_NR),
+ nid, K(node_page_state(nid, NR_SHMEM_FREEHOLES)));
#endif
n += hugetlb_report_node_meminfo(nid, buf + n);
return n;
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -105,6 +105,9 @@ static int meminfo_proc_show(struct seq_
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
"AnonHugePages: %8lu kB\n"
+ "ShmemHugePages: %8lu kB\n"
+ "ShmemPmdMapped: %8lu kB\n"
+ "ShmemFreeHoles: %8lu kB\n"
#endif
#ifdef CONFIG_CMA
"CmaTotal: %8lu kB\n"
@@ -159,11 +162,13 @@ static int meminfo_proc_show(struct seq_
0ul, // used to be vmalloc 'used'
0ul // used to be vmalloc 'largest_chunk'
#ifdef CONFIG_MEMORY_FAILURE
- , atomic_long_read(&num_poisoned_pages) << (PAGE_SHIFT - 10)
+ , K(atomic_long_read(&num_poisoned_pages))
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- , K(global_page_state(NR_ANON_TRANSPARENT_HUGEPAGES) *
- HPAGE_PMD_NR)
+ , K(global_page_state(NR_ANON_HUGEPAGES) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_HUGEPAGES) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_PMDMAPPED) * HPAGE_PMD_NR)
+ , K(global_page_state(NR_SHMEM_FREEHOLES))
#endif
#ifdef CONFIG_CMA
, K(totalcma_pages)
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -158,7 +158,10 @@ enum zone_stat_item {
WORKINGSET_REFAULT,
WORKINGSET_ACTIVATE,
WORKINGSET_NODERECLAIM,
- NR_ANON_TRANSPARENT_HUGEPAGES,
+ NR_ANON_HUGEPAGES, /* transparent anon huge pages */
+ NR_SHMEM_HUGEPAGES, /* transparent shmem huge pages */
+ NR_SHMEM_PMDMAPPED, /* shmem huge pages currently mapped hugely */
+ NR_SHMEM_FREEHOLES, /* unused memory of high-order allocations */
NR_FREE_CMA_PAGES,
NR_VM_ZONE_STAT_ITEMS };

--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2943,7 +2943,7 @@ static void __split_huge_pmd_locked(stru

if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_HUGEPAGES);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3830,6 +3830,11 @@ out:
}

#define K(x) ((x) << (PAGE_SHIFT-10))
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define THPAGE_PMD_NR HPAGE_PMD_NR
+#else
+#define THPAGE_PMD_NR 0 /* Avoid BUILD_BUG() */
+#endif

static void show_migration_types(unsigned char type)
{
@@ -3886,6 +3891,7 @@ void show_free_areas(unsigned int filter
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+ " shmem_hugepages:%lu _pmdmapped:%lu _freeholes:%lu\n"
" free:%lu free_pcp:%lu free_cma:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_INACTIVE_ANON),
@@ -3903,6 +3909,9 @@ void show_free_areas(unsigned int filter
global_page_state(NR_SHMEM),
global_page_state(NR_PAGETABLE),
global_page_state(NR_BOUNCE),
+ global_page_state(NR_SHMEM_HUGEPAGES) * THPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_PMDMAPPED) * THPAGE_PMD_NR,
+ global_page_state(NR_SHMEM_FREEHOLES),
global_page_state(NR_FREE_PAGES),
free_pcp,
global_page_state(NR_FREE_CMA_PAGES));
@@ -3937,6 +3946,9 @@ void show_free_areas(unsigned int filter
" writeback:%lukB"
" mapped:%lukB"
" shmem:%lukB"
+ " _hugepages:%lukB"
+ " _pmdmapped:%lukB"
+ " _freeholes:%lukB"
" slab_reclaimable:%lukB"
" slab_unreclaimable:%lukB"
" kernel_stack:%lukB"
@@ -3969,6 +3981,11 @@ void show_free_areas(unsigned int filter
K(zone_page_state(zone, NR_WRITEBACK)),
K(zone_page_state(zone, NR_FILE_MAPPED)),
K(zone_page_state(zone, NR_SHMEM)),
+ K(zone_page_state(zone, NR_SHMEM_HUGEPAGES) *
+ THPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_SHMEM_PMDMAPPED) *
+ THPAGE_PMD_NR),
+ K(zone_page_state(zone, NR_SHMEM_FREEHOLES)),
K(zone_page_state(zone, NR_SLAB_RECLAIMABLE)),
K(zone_page_state(zone, NR_SLAB_UNRECLAIMABLE)),
zone_page_state(zone, NR_KERNEL_STACK) *
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1213,10 +1213,8 @@ void do_page_add_anon_rmap(struct page *
* pte lock(a spinlock) is held, which implies preemption
* disabled.
*/
- if (compound) {
- __inc_zone_page_state(page,
- NR_ANON_TRANSPARENT_HUGEPAGES);
- }
+ if (compound)
+ __inc_zone_page_state(page, NR_ANON_HUGEPAGES);
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, nr);
}
if (unlikely(PageKsm(page)))
@@ -1254,7 +1252,7 @@ void page_add_new_anon_rmap(struct page
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
atomic_set(compound_mapcount_ptr(page), 0);
- __inc_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __inc_zone_page_state(page, NR_ANON_HUGEPAGES);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1285,7 +1283,7 @@ static void page_remove_file_rmap(struct
{
lock_page_memcg(page);

- /* Hugepages are not counted in NR_FILE_MAPPED for now. */
+ /* hugetlbfs pages are not counted in NR_FILE_MAPPED for now. */
if (unlikely(PageHuge(page))) {
/* hugetlb pages are always mapped with pmds */
atomic_dec(compound_mapcount_ptr(page));
@@ -1317,14 +1315,14 @@ static void page_remove_anon_compound_rm
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
return;

- /* Hugepages are not counted in NR_ANON_PAGES for now. */
+ /* hugetlbfs pages are not counted in NR_ANON_PAGES for now. */
if (unlikely(PageHuge(page)))
return;

if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;

- __dec_zone_page_state(page, NR_ANON_TRANSPARENT_HUGEPAGES);
+ __dec_zone_page_state(page, NR_ANON_HUGEPAGES);

if (TestClearPageDoubleMap(page)) {
/*
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -762,6 +762,9 @@ const char * const vmstat_text[] = {
"workingset_activate",
"workingset_nodereclaim",
"nr_anon_transparent_hugepages",
+ "nr_shmem_hugepages",
+ "nr_shmem_pmdmapped",
+ "nr_shmem_freeholes",
"nr_free_cma",

/* enum writeback_stat_item counters */

2016-04-05 21:13:57

by Hugh Dickins

Subject: [PATCH 02/31] huge tmpfs: include shmem freeholes in available memory

ShmemFreeHoles will be freed under memory pressure, but are not included
in MemFree: they need to be added into MemAvailable, and wherever the
kernel calculates freeable pages, rather than actually free pages. They
must not be counted as free when considering whether to go to reclaim.

There is certainly room for debate about other places, but I think I've
got about the right list - though I'm unfamiliar with and undecided about
drivers/staging/android/lowmemorykiller.c and kernel/power/snapshot.c.

While NR_SHMEM_FREEHOLES should certainly not be counted in NR_FREE_PAGES,
there is a case for including ShmemFreeHoles in the user-visible MemFree
after all: I can see both sides of that argument, leaving it out so far.
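
As a rough worked example of the intended arithmetic (assuming x86_64
with 4kB pages and 2MB huge pages, and the free-hole accounting that
later patches in this series fill in), consider a 100kB file on a
huge=1 mount, backed by one 2MB extent:

  MemFree         -2048 kB   whole extent taken from the page allocator
  Shmem            +100 kB   25 small pages instantiated in pagecache
  ShmemFreeHoles  +1948 kB   487 small pages reserved but unused

With this patch the 1948kB of holes count towards MemAvailable (and the
other freeable-page estimates), because the shrinker can give them back
under pressure; they still do not count towards MemFree.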

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/page-writeback.c | 2 ++
mm/page_alloc.c | 6 ++++++
mm/util.c | 1 +
3 files changed, 9 insertions(+)

--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -285,6 +285,7 @@ static unsigned long zone_dirtyable_memo
*/
nr_pages -= min(nr_pages, zone->totalreserve_pages);

+ nr_pages += zone_page_state(zone, NR_SHMEM_FREEHOLES);
nr_pages += zone_page_state(zone, NR_INACTIVE_FILE);
nr_pages += zone_page_state(zone, NR_ACTIVE_FILE);

@@ -344,6 +345,7 @@ static unsigned long global_dirtyable_me
*/
x -= min(x, totalreserve_pages);

+ x += global_page_state(NR_SHMEM_FREEHOLES);
x += global_page_state(NR_INACTIVE_FILE);
x += global_page_state(NR_ACTIVE_FILE);

--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3760,6 +3760,12 @@ long si_mem_available(void)
available += pagecache;

/*
+ * Shmem freeholes help to keep huge pages intact, but contain
+ * no data, and can be shrunk whenever small pages are needed.
+ */
+ available += global_page_state(NR_SHMEM_FREEHOLES);
+
+ /*
* Part of the reclaimable slab consists of items that are in use,
* and cannot be freed. Cap this estimate at the low watermark.
*/
--- a/mm/util.c
+++ b/mm/util.c
@@ -496,6 +496,7 @@ int __vm_enough_memory(struct mm_struct

if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
free = global_page_state(NR_FREE_PAGES);
+ free += global_page_state(NR_SHMEM_FREEHOLES);
free += global_page_state(NR_FILE_PAGES);

/*

2016-04-05 21:15:13

by Hugh Dickins

Subject: [PATCH 03/31] huge tmpfs: huge=N mount option and /proc/sys/vm/shmem_huge

Plumb in a new "huge=1" or "huge=0" mount option to tmpfs: I don't
want to get into a maze of boot options, madvises and fadvises at
this stage, nor extend the use of the existing THP tuning to tmpfs;
though either might be pursued later on. We just want a way to ask
a tmpfs filesystem to favor huge pages, and a way to turn that off
again when it doesn't work out so well. Default of course is off.

"mount -o remount,huge=N /mountpoint" works fine after mount:
remounting from huge=1 (on) to huge=0 (off) will not attempt to
break up huge pages at all, just stop more from being allocated.
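
For example, from a program rather than the mount(8) command line, the
option is passed as ordinary tmpfs mount data; a minimal sketch, with
the mount point /mnt/hugetmpfs chosen purely for illustration:

/* Illustrative only: mount a huge tmpfs, later switch huge off again.
 * Equivalent to "mount -t tmpfs -o huge=1 tmpfs /mnt/hugetmpfs" then
 * "mount -o remount,huge=0 /mnt/hugetmpfs"; needs CAP_SYS_ADMIN.
 */
#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	if (mount("tmpfs", "/mnt/hugetmpfs", "tmpfs", 0, "huge=1"))
		perror("mount huge=1");

	/* Stop allocating huge pages; existing teams are left as they are */
	if (mount("tmpfs", "/mnt/hugetmpfs", "tmpfs", MS_REMOUNT, "huge=0"))
		perror("remount huge=0");
	return 0;
}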

It's possible that we shall allow more values for the option later,
to select different strategies (e.g. how hard to try when allocating
huge pages, or when to map hugely and when not, or how sparse a huge
page should be before it is split up), either for experiments, or well
baked in: so use an unsigned char in the superblock rather than a bool.

No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
which is the appropriate option to protect those who don't want
the new bloat, and with which we shall share some pmd code. Use a
"name=numeric_value" format like most other tmpfs options. Prohibit
the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid
without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit).
Allow setting >0 only if the machine has_transparent_hugepage().

But what about Shmem with no user-visible mount? SysV SHM, memfds,
shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
DRM objects, ashmem. Though unlikely to suit all usages, provide
sysctl /proc/sys/vm/shmem_huge to experiment with huge on those. We
may add a memfd_create flag and a per-file huge/non-huge fcntl later.

And allow shmem_huge two further values: -1 for use in emergencies,
to force the huge option off from all mounts; and (currently) 2,
to force the huge option on for all - very useful for testing.
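
A minimal sketch of driving that sysctl from a program, for what it's
worth ("echo 2 >/proc/sys/vm/shmem_huge" does the same from a shell);
illustrative only, not part of the patch:

/* Force huge on for all tmpfs mounts (value 2, for testing);
 * write "-1" instead to force huge off everywhere.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/sys/vm/shmem_huge", O_WRONLY);

	if (fd < 0 || write(fd, "2\n", 2) != 2)
		perror("/proc/sys/vm/shmem_huge");
	if (fd >= 0)
		close(fd);
	return 0;
}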

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 45 +++++++++++++++++
Documentation/sysctl/vm.txt | 16 ++++++
include/linux/shmem_fs.h | 16 ++++--
kernel/sysctl.c | 12 ++++
mm/shmem.c | 66 ++++++++++++++++++++++++++
5 files changed, 149 insertions(+), 6 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -140,9 +140,52 @@ will give you tmpfs instance on /mytmpfs
RAM/SWAP in 10240 inodes and it is only accessible by root.


+Huge tmpfs
+==========
+
+If CONFIG_TRANSPARENT_HUGEPAGE is enabled, tmpfs has a mount (or remount)
+option for transparent huge pagecache, giving the efficiency advantage of
+hugepages (from less TLB pressure and fewer pagetable levels), without
+the inflexibility of hugetlbfs. Huge tmpfs pages can be swapped out when
+memory pressure demands, just as ordinary tmpfs pages can be swapped out.
+
+huge=0 default, don't attempt to allocate hugepages.
+huge=1 allocate hugepages when available, and mmap on hugepage boundaries.
+
+So 'mount -t tmpfs -o huge=1 tmpfs /mytmpfs' will give you a huge tmpfs.
+
+Huge tmpfs pages can be slower to allocate than ordinary pages (since they
+may require compaction), and slower to set up initially than hugetlbfs pages
+(since a team of small pages is managed instead of a single compound page);
+but once set up and mapped, huge tmpfs performance should match hugetlbfs.
+
+/proc/sys/vm/shmem_huge (intended for experimentation only):
+
+Default 0; write 1 to set tmpfs mount option huge=1 on the kernel's
+internal shmem mount, to use huge pages transparently for SysV SHM,
+memfds, shared anonymous mmaps, GPU DRM objects, and ashmem.
+
+In addition to 0 and 1, it also accepts 2 to force the huge=1 option
+automatically on for all tmpfs mounts (intended for testing), or -1
+to force huge off for all (intended for safety if bugs appeared).
+
+/proc/meminfo, /sys/devices/system/node/nodeN/meminfo show:
+
+Shmem: 35016 kB total shmem/tmpfs memory (subset of Cached)
+ShmemHugePages: 26624 kB tmpfs hugepages completed (subset of Shmem)
+ShmemPmdMapped: 12288 kB tmpfs hugepages with huge mappings in userspace
+ShmemFreeHoles: 671444 kB reserved for team pages but available to shrinker
+
+/proc/vmstat, /proc/zoneinfo, /sys/devices/system/node/nodeN/vmstat show:
+
+nr_shmem 8754 total shmem/tmpfs pages (subset of nr_file_pages)
+nr_shmem_hugepages 13 tmpfs hugepages completed (each 512 in nr_shmem)
+nr_shmem_pmdmapped 6 tmpfs hugepages with huge mappings in userspace
+nr_shmem_freeholes 167861 pages reserved for team but available to shrinker
+
Author:
Christoph Rohland <[email protected]>, 1.12.01
Updated:
- Hugh Dickins, 4 June 2007
+ Hugh Dickins, 4 June 2007, 3 Oct 2015
Updated:
KOSAKI Motohiro, 16 Mar 2010
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ Currently, these files are in /proc/sys/
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
+- shmem_huge
- stat_interval
- stat_refresh
- swappiness
@@ -748,6 +749,21 @@ sysctl, it will revert to this default b

==============================================================

+shmem_huge
+
+Default 0; write 1 to set tmpfs mount option huge=1 on the kernel's
+internal shmem mount, to use huge pages transparently for SysV SHM,
+memfds, shared anonymous mmaps, GPU DRM objects, and ashmem.
+
+In addition to 0 and 1, it also accepts 2 to force the huge=1 option
+automatically on for all tmpfs mounts (intended for testing), or -1
+to force huge off for all (intended for safety if bugs appeared).
+
+See Documentation/filesystems/tmpfs.txt for info on huge tmpfs.
+/proc/sys/vm/shmem_huge is intended for experimentation only.
+
+==============================================================
+
stat_interval

The time interval between which vm statistics are updated. The default
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -28,9 +28,10 @@ struct shmem_sb_info {
unsigned long max_inodes; /* How many inodes are allowed */
unsigned long free_inodes; /* How many are left for allocation */
spinlock_t stat_lock; /* Serialize shmem_sb_info changes */
+ umode_t mode; /* Mount mode for root directory */
+ unsigned char huge; /* Whether to try for hugepages */
kuid_t uid; /* Mount uid for root directory */
kgid_t gid; /* Mount gid for root directory */
- umode_t mode; /* Mount mode for root directory */
struct mempolicy *mpol; /* default memory policy for mappings */
};

@@ -69,18 +70,23 @@ static inline struct page *shmem_read_ma
}

#ifdef CONFIG_TMPFS
-
extern int shmem_add_seals(struct file *file, unsigned int seals);
extern int shmem_get_seals(struct file *file);
extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
-
#else
-
static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
{
return -EINVAL;
}
+#endif /* CONFIG_TMPFS */

-#endif
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+# ifdef CONFIG_SYSCTL
+struct ctl_table;
+extern int shmem_huge, shmem_huge_min, shmem_huge_max;
+extern int shmem_huge_sysctl(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+# endif /* CONFIG_SYSCTL */
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SHMEM */

#endif
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -43,6 +43,7 @@
#include <linux/ratelimit.h>
#include <linux/compaction.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/initrd.h>
#include <linux/key.h>
#include <linux/times.h>
@@ -1313,6 +1314,17 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+ {
+ .procname = "shmem_huge",
+ .data = &shmem_huge,
+ .maxlen = sizeof(shmem_huge),
+ .mode = 0644,
+ .proc_handler = shmem_huge_sysctl,
+ .extra1 = &shmem_huge_min,
+ .extra2 = &shmem_huge_max,
+ },
+#endif
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
#include <linux/falloc.h>
#include <linux/splice.h>
#include <linux/security.h>
+#include <linux/sysctl.h>
#include <linux/swapops.h>
#include <linux/mempolicy.h>
#include <linux/namei.h>
@@ -289,6 +290,25 @@ static bool shmem_confirm_swap(struct ad
}

/*
+ * Definitions for "huge tmpfs": tmpfs mounted with the huge=1 option
+ */
+
+/* Special values for /proc/sys/vm/shmem_huge */
+#define SHMEM_HUGE_DENY (-1)
+#define SHMEM_HUGE_FORCE (2)
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/* ifdef here to avoid bloating shmem.o when not necessary */
+
+int shmem_huge __read_mostly;
+
+#else /* !CONFIG_TRANSPARENT_HUGEPAGE */
+
+#define shmem_huge SHMEM_HUGE_DENY
+
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+
+/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
static int shmem_add_to_page_cache(struct page *page,
@@ -2857,11 +2877,21 @@ static int shmem_parse_options(char *opt
sbinfo->gid = make_kgid(current_user_ns(), gid);
if (!gid_valid(sbinfo->gid))
goto bad_val;
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ } else if (!strcmp(this_char, "huge")) {
+ if (kstrtou8(value, 10, &sbinfo->huge) < 0 ||
+ sbinfo->huge >= SHMEM_HUGE_FORCE)
+ goto bad_val;
+ if (sbinfo->huge && !has_transparent_hugepage())
+ goto bad_val;
+#endif
+#ifdef CONFIG_NUMA
} else if (!strcmp(this_char,"mpol")) {
mpol_put(mpol);
mpol = NULL;
if (mpol_parse_str(value, &mpol))
goto bad_val;
+#endif
} else {
pr_err("tmpfs: Bad mount option %s\n", this_char);
goto error;
@@ -2907,6 +2937,7 @@ static int shmem_remount_fs(struct super
goto out;

error = 0;
+ sbinfo->huge = config.huge;
sbinfo->max_blocks = config.max_blocks;
sbinfo->max_inodes = config.max_inodes;
sbinfo->free_inodes = config.max_inodes - inodes;
@@ -2940,6 +2971,9 @@ static int shmem_show_options(struct seq
if (!gid_eq(sbinfo->gid, GLOBAL_ROOT_GID))
seq_printf(seq, ",gid=%u",
from_kgid_munged(&init_user_ns, sbinfo->gid));
+ /* Rightly or wrongly, show huge mount option unmasked by shmem_huge */
+ if (sbinfo->huge)
+ seq_printf(seq, ",huge=%u", sbinfo->huge);
shmem_show_mpol(seq, sbinfo->mpol);
return 0;
}
@@ -3278,6 +3312,13 @@ int __init shmem_init(void)
pr_err("Could not kern_mount tmpfs\n");
goto out1;
}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ if (has_transparent_hugepage())
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = (shmem_huge > 0);
+ else
+ shmem_huge = 0; /* just in case it was patched */
+#endif
return 0;

out1:
@@ -3289,6 +3330,31 @@ out3:
return error;
}

+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SYSCTL)
+int shmem_huge_min = SHMEM_HUGE_DENY;
+int shmem_huge_max = SHMEM_HUGE_FORCE;
+/*
+ * /proc/sys/vm/shmem_huge sysctl for internal shm_mnt, and mount override:
+ * -1 disables huge on shm_mnt and all mounts, for emergency use
+ * 0 disables huge on internal shm_mnt (which has no way to be remounted)
+ * 1 enables huge on internal shm_mnt (which has no way to be remounted)
+ * 2 enables huge on shm_mnt and all mounts, w/o needing option, for testing
+ * (but we may add more huge options, and push that 2 for testing upwards)
+ */
+int shmem_huge_sysctl(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int err;
+
+ if (!has_transparent_hugepage())
+ shmem_huge_max = 0;
+ err = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+ if (write && !err && !IS_ERR(shm_mnt))
+ SHMEM_SB(shm_mnt->mnt_sb)->huge = (shmem_huge > 0);
+ return err;
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SYSCTL */
+
#else /* !CONFIG_SHMEM */

/*

2016-04-05 21:16:47

by Hugh Dickins

Subject: [PATCH 04/31] huge tmpfs: try to allocate huge pages, split into a team

Now we get down to work. The idea here is that compound pages were
ideal for hugetlbfs, with its own separate pool to which huge pages
must be freed. Not so suitable for anonymous THP, which has so far
tried three different schemes of tailpage refcounting to manage them.
And not at all suitable for pagecache THP, where one process may want
to map 4kB of a file while another maps 2MB spanning the same offset
(but the most recent compound tailpage scheme is much more promising
in this regard than earlier ones).

And since anonymous THP was confined to private mappings, that blurred
the distinction between the mapping and the object mapped: so splitting
the mapping used to entail splitting the object (the compound page). For
a long time, pagecache THP appeared to be an even greater challenge: but
that's when you try to follow the anonymous lead too closely. Actually
pagecache THP is easier, once you abandon compound pages, and consider
the object and its mapping separately.

This and the next patches are entirely concerned with the object and
not its mapping: but there will be no chance of mapping the object
with huge pmds, unless it is allocated in huge extents. Mounting
a tmpfs with the huge=1 option requests that objects be allocated
in huge extents, when memory fragmentation and pressure permit.

The main change here is, of course, to shmem_alloc_page(), and to
shmem_add_to_page_cache(): with attention to the races which may
well occur in between the two calls - which involves a rather ugly
"hugehint" interface between them, and the shmem_hugeteam_lookup()
helper which checks the surrounding area for a previously allocated
huge page, or a small page implying earlier huge allocation failure.

shmem_getpage_gfp() works in terms of small (meaning typically 4kB)
pages just as before; the radix_tree holds a slot for each small
page just as before; memcg is charged for small pages just as before;
the LRUs hold small pages just as before; get_user_pages() will work
on ordinarily-refcounted small pages just as before. Which keeps it
all reassuringly simple, but is sure to show up in greater overhead
than hugetlbfs, when first establishing an object; and reclaim from
LRU (with 512 items to go through when only 1 will free them) is sure
to demand cleverer handling in later patches.

The huge page itself is allocated (currently with __GFP_NORETRY)
as a high-order page, but not as a compound page; and that high-order
page is immediately split into its separately refcounted subpages (no
overhead to that: establishing a compound page itself has to set up
each tail page). Only the small page that was asked for is put into
the radix_tree ("page cache") at that time, the remainder left unused
(but with page count 1). The whole is loosely "held together" with a
new PageTeam flag on the head page (whether or not it was put in the
cache), and then one by one, on each tail page as it is instantiated.
There is no requirement that the file be written sequentially.

PageSwapBacked proves useful to distinguish a page which has been
instantiated from one which has not: particularly in the case of that
head page marked PageTeam even when not yet instantiated. Although
conceptually very different, PageTeam was originally designed to reuse
the old CONFIG_TRANSPARENT_HUGEPAGE PG_compound_lock bit, but now that
is gone, it is using its own PG_team bit: perhaps could be doubled up
with some other pageflag bit if necessary, but not a high priority.

Truncation (and hole-punch and eviction) needs to disband the
team before any page is freed from it; and although it will only be
important once we get to mapping the page, even now take the lock on
the head page when truncating any team member (though commonly the head
page will be the first truncated anyway). That does need a trylock,
and sometimes even a busyloop waiting for PageTeam to be cleared, but
I don't see an actual problem with it (no worse than waiting to take
a bitspinlock). When disbanding a team, ask free_hot_cold_page() to
free to the cold end of the pcp list, so the subpages are more likely
to be buddied back together.

In reclaim (shmem_writepage), simply redirty any tail page of the team,
and only when the head is to be reclaimed, proceed to disband and swap.
(Unless head remains uninstantiated: then tail may disband and swap.)
This strategy will still be safe once we get to mapping the huge page:
the head (and hence the huge) can never be mapped at this point.

With this patch, the ShmemHugePages line of /proc/meminfo is shown,
but it totals the amount of huge page memory allocated, not the
amount fully used: so it may show ShmemHugePages exceeding Shmem.
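
That effect is easy to see from userspace.  The sketch below is
illustrative only: it assumes a huge=1 tmpfs already mounted at
/mnt/hugetmpfs (path chosen for illustration) and x86_64 with 2MB huge
pages; it writes one small page to a new file and then expects
ShmemHugePages to have grown by 2048kB while Shmem grew by only 4kB
(give or take whatever else the system is doing):

/* Illustrative only: ShmemHugePages grows by a whole huge page while
 * Shmem grows by just one small page, after writing 4kB to a file on
 * a huge=1 tmpfs assumed to be mounted at /mnt/hugetmpfs.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long meminfo(const char *key)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	while (f && fgets(line, sizeof(line), f))
		if (!strncmp(line, key, strlen(key)) &&
		    sscanf(line + strlen(key), "%ld", &kb) == 1)
			break;
	if (f)
		fclose(f);
	return kb;
}

int main(void)
{
	char buf[4096] = "x";
	long shmem0 = meminfo("Shmem:");
	long huge0 = meminfo("ShmemHugePages:");
	int fd = open("/mnt/hugetmpfs/demo", O_RDWR | O_CREAT, 0600);

	if (fd < 0 || write(fd, buf, sizeof(buf)) != sizeof(buf)) {
		perror("demo file");
		return 1;
	}
	printf("Shmem          +%ld kB\n", meminfo("Shmem:") - shmem0);
	printf("ShmemHugePages +%ld kB\n", meminfo("ShmemHugePages:") - huge0);
	close(fd);
	return 0;
}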

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/page-flags.h | 10
include/linux/pageteam.h | 32 ++
include/trace/events/mmflags.h | 7
mm/shmem.c | 355 ++++++++++++++++++++++++++++---
4 files changed, 376 insertions(+), 28 deletions(-)
create mode 100644 include/linux/pageteam.h

--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,9 @@ enum pageflags {
#ifdef CONFIG_MEMORY_FAILURE
PG_hwpoison, /* hardware poisoned page. Don't touch */
#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ PG_team, /* used for huge tmpfs (shmem) */
+#endif
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
@@ -231,6 +234,9 @@ static inline int Page##uname(const stru
#define SETPAGEFLAG_NOOP(uname) \
static inline void SetPage##uname(struct page *page) { }

+#define __SETPAGEFLAG_NOOP(uname) \
+static inline void __SetPage##uname(struct page *page) { }
+
#define CLEARPAGEFLAG_NOOP(uname) \
static inline void ClearPage##uname(struct page *page) { }

@@ -556,6 +562,8 @@ static inline int TestClearPageDoubleMap
return test_and_clear_bit(PG_double_map, &page[1].flags);
}

+PAGEFLAG(Team, team, PF_NO_COMPOUND)
+ __SETPAGEFLAG(Team, team, PF_NO_COMPOUND)
#else
TESTPAGEFLAG_FALSE(TransHuge)
TESTPAGEFLAG_FALSE(TransCompound)
@@ -563,6 +571,8 @@ TESTPAGEFLAG_FALSE(TransTail)
TESTPAGEFLAG_FALSE(DoubleMap)
TESTSETFLAG_FALSE(DoubleMap)
TESTCLEARFLAG_FALSE(DoubleMap)
+PAGEFLAG_FALSE(Team)
+ __SETPAGEFLAG_NOOP(Team)
#endif

/*
--- /dev/null
+++ b/include/linux/pageteam.h
@@ -0,0 +1,32 @@
+#ifndef _LINUX_PAGETEAM_H
+#define _LINUX_PAGETEAM_H
+
+/*
+ * Declarations and definitions for PageTeam pages and page->team_usage:
+ * as implemented for "huge tmpfs" in mm/shmem.c and mm/huge_memory.c, when
+ * CONFIG_TRANSPARENT_HUGEPAGE=y, and tmpfs is mounted with the huge=1 option.
+ */
+
+#include <linux/huge_mm.h>
+#include <linux/mm_types.h>
+#include <linux/mmdebug.h>
+#include <asm/page.h>
+
+static inline struct page *team_head(struct page *page)
+{
+ struct page *head = page - (page->index & (HPAGE_PMD_NR-1));
+ /*
+ * Locating head by page->index is a faster calculation than by
+ * pfn_to_page(page_to_pfn), and we only use this function after
+ * page->index has been set (never on tail holes): but check that.
+ *
+ * Although this is only used on a PageTeam(page), the team might be
+ * disbanded racily, so it's not safe to VM_BUG_ON(!PageTeam(page));
+ * but page->index remains stable across disband and truncation.
+ */
+ VM_BUG_ON_PAGE(head != pfn_to_page(round_down(page_to_pfn(page),
+ HPAGE_PMD_NR)), page);
+ return head;
+}
+
+#endif /* _LINUX_PAGETEAM_H */
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -72,6 +72,12 @@
#define IF_HAVE_PG_HWPOISON(flag,string)
#endif

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define IF_HAVE_PG_TEAM(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_TEAM(flag,string)
+#endif
+
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
#define IF_HAVE_PG_IDLE(flag,string) ,{1UL << flag, string}
#else
@@ -102,6 +108,7 @@
IF_HAVE_PG_MLOCK(PG_mlocked, "mlocked" ) \
IF_HAVE_PG_UNCACHED(PG_uncached, "uncached" ) \
IF_HAVE_PG_HWPOISON(PG_hwpoison, "hwpoison" ) \
+IF_HAVE_PG_TEAM(PG_team, "team" ) \
IF_HAVE_PG_IDLE(PG_young, "young" ) \
IF_HAVE_PG_IDLE(PG_idle, "idle" )

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -60,6 +60,7 @@ static struct vfsmount *shm_mnt;
#include <linux/security.h>
#include <linux/sysctl.h>
#include <linux/swapops.h>
+#include <linux/pageteam.h>
#include <linux/mempolicy.h>
#include <linux/namei.h>
#include <linux/ctype.h>
@@ -297,49 +298,234 @@ static bool shmem_confirm_swap(struct ad
#define SHMEM_HUGE_DENY (-1)
#define SHMEM_HUGE_FORCE (2)

+/* hugehint values: NULL to choose a small page always */
+#define SHMEM_ALLOC_SMALL_PAGE ((struct page *)1)
+#define SHMEM_ALLOC_HUGE_PAGE ((struct page *)2)
+#define SHMEM_RETRY_HUGE_PAGE ((struct page *)3)
+/* otherwise hugehint is the hugeteam page to be used */
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* ifdef here to avoid bloating shmem.o when not necessary */

int shmem_huge __read_mostly;

+static struct page *shmem_hugeteam_lookup(struct address_space *mapping,
+ pgoff_t index, bool speculative)
+{
+ pgoff_t start;
+ pgoff_t indice;
+ void __rcu **pagep;
+ struct page *cachepage;
+ struct page *headpage;
+ struct page *page;
+
+ /*
+ * First called speculatively, under rcu_read_lock(), by the huge
+ * shmem_alloc_page(): to decide whether to allocate a new huge page,
+ * or a new small page, or use a previously allocated huge team page.
+ *
+ * Later called under mapping->tree_lock, by shmem_add_to_page_cache(),
+ * to confirm the decision just before inserting into the radix_tree.
+ */
+
+ start = round_down(index, HPAGE_PMD_NR);
+restart:
+ if (!radix_tree_gang_lookup_slot(&mapping->page_tree,
+ &pagep, &indice, start, 1))
+ return SHMEM_ALLOC_HUGE_PAGE;
+ cachepage = rcu_dereference_check(*pagep,
+ lockdep_is_held(&mapping->tree_lock));
+ if (!cachepage || indice >= start + HPAGE_PMD_NR)
+ return SHMEM_ALLOC_HUGE_PAGE;
+ if (radix_tree_exception(cachepage)) {
+ if (radix_tree_deref_retry(cachepage))
+ goto restart;
+ return SHMEM_ALLOC_SMALL_PAGE;
+ }
+ if (!PageTeam(cachepage))
+ return SHMEM_ALLOC_SMALL_PAGE;
+ /* headpage is very often its first cachepage, but not necessarily */
+ headpage = cachepage - (indice - start);
+ page = headpage + (index - start);
+ if (speculative && !page_cache_get_speculative(page))
+ goto restart;
+ if (!PageTeam(headpage) ||
+ headpage->mapping != mapping || headpage->index != start) {
+ if (speculative)
+ put_page(page);
+ return SHMEM_ALLOC_SMALL_PAGE;
+ }
+ return page;
+}
+
+static int shmem_disband_hugehead(struct page *head)
+{
+ struct address_space *mapping;
+ struct zone *zone;
+ int nr = -EALREADY; /* A racing task may have disbanded the team */
+
+ mapping = head->mapping;
+ zone = page_zone(head);
+
+ spin_lock_irq(&mapping->tree_lock);
+ if (PageTeam(head)) {
+ ClearPageTeam(head);
+ if (!PageSwapBacked(head))
+ head->mapping = NULL;
+ __dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
+ nr = 0;
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+ return nr;
+}
+
+static void shmem_disband_hugetails(struct page *head)
+{
+ struct page *page;
+ struct page *endpage;
+
+ page = head;
+ endpage = head + HPAGE_PMD_NR;
+
+ /* Condition follows in next but one commit */ {
+ /*
+ * The usual case: disbanding team and freeing holes as cold
+ * (cold being more likely to preserve high-order extents).
+ */
+ if (!PageSwapBacked(page)) { /* head was not in cache */
+ if (put_page_testzero(page))
+ free_hot_cold_page(page, 1);
+ }
+ while (++page < endpage) {
+ if (PageTeam(page))
+ ClearPageTeam(page);
+ else if (put_page_testzero(page))
+ free_hot_cold_page(page, 1);
+ }
+ }
+}
+
+static void shmem_disband_hugeteam(struct page *page)
+{
+ struct page *head = team_head(page);
+ int nr_used;
+
+ /*
+ * In most cases, shmem_disband_hugeteam() is called with this page
+ * locked. But shmem_getpage_gfp()'s alloced_huge failure case calls
+ * it after unlocking and releasing: because it has not exposed the
+ * page, and prefers free_hot_cold_page to free it all cold together.
+ *
+ * The truncation case may need a second lock, on the head page,
+ * to guard against races while shmem fault prepares a huge pmd.
+ * Little point in returning error, it has to check PageTeam anyway.
+ */
+ if (head != page) {
+ if (!get_page_unless_zero(head))
+ return;
+ if (!trylock_page(head)) {
+ put_page(head);
+ return;
+ }
+ if (!PageTeam(head)) {
+ unlock_page(head);
+ put_page(head);
+ return;
+ }
+ }
+
+ /*
+ * Disable preemption because truncation may end up spinning until a
+ * tail PageTeam has been cleared: we hold the lock as briefly as we
+ * can (splitting disband in two stages), but better not be preempted.
+ */
+ preempt_disable();
+ nr_used = shmem_disband_hugehead(head);
+ if (head != page)
+ unlock_page(head);
+ if (nr_used >= 0)
+ shmem_disband_hugetails(head);
+ if (head != page)
+ put_page(head);
+ preempt_enable();
+}
+
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */

#define shmem_huge SHMEM_HUGE_DENY

+static inline struct page *shmem_hugeteam_lookup(struct address_space *mapping,
+ pgoff_t index, bool speculative)
+{
+ BUILD_BUG();
+ return SHMEM_ALLOC_SMALL_PAGE;
+}
+
+static inline void shmem_disband_hugeteam(struct page *page)
+{
+ BUILD_BUG();
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
* Like add_to_page_cache_locked, but error if expected item has gone.
*/
-static int shmem_add_to_page_cache(struct page *page,
- struct address_space *mapping,
- pgoff_t index, void *expected)
+static int
+shmem_add_to_page_cache(struct page *page, struct address_space *mapping,
+ pgoff_t index, void *expected, struct page *hugehint)
{
+ struct zone *zone = page_zone(page);
int error;

VM_BUG_ON_PAGE(!PageLocked(page), page);
- VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ VM_BUG_ON(expected && hugehint);
+
+ spin_lock_irq(&mapping->tree_lock);
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugehint) {
+ if (shmem_hugeteam_lookup(mapping, index, false) != hugehint) {
+ error = -EEXIST; /* will retry */
+ goto errout;
+ }
+ if (!PageSwapBacked(page)) { /* huge needs special care */
+ SetPageSwapBacked(page);
+ SetPageTeam(page);
+ }
+ }

- get_page(page);
page->mapping = mapping;
page->index = index;
+ /* smp_wmb()? That's in radix_tree_insert()'s rcu_assign_pointer() */

- spin_lock_irq(&mapping->tree_lock);
if (!expected)
error = radix_tree_insert(&mapping->page_tree, index, page);
else
error = shmem_radix_tree_replace(mapping, index, expected,
page);
- if (!error) {
- mapping->nrpages++;
- __inc_zone_page_state(page, NR_FILE_PAGES);
- __inc_zone_page_state(page, NR_SHMEM);
- spin_unlock_irq(&mapping->tree_lock);
- } else {
+ if (unlikely(error))
+ goto errout;
+
+ if (!PageTeam(page))
+ get_page(page);
+ else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
+ __inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
+
+ mapping->nrpages++;
+ __inc_zone_state(zone, NR_FILE_PAGES);
+ __inc_zone_state(zone, NR_SHMEM);
+ spin_unlock_irq(&mapping->tree_lock);
+ return 0;
+
+errout:
+ if (PageTeam(page)) {
+ /* We use SwapBacked to indicate if already in cache */
+ ClearPageSwapBacked(page);
+ if (index & (HPAGE_PMD_NR-1)) {
+ ClearPageTeam(page);
+ page->mapping = NULL;
+ }
+ } else
page->mapping = NULL;
- spin_unlock_irq(&mapping->tree_lock);
- put_page(page);
- }
+ spin_unlock_irq(&mapping->tree_lock);
return error;
}

@@ -501,15 +687,16 @@ static void shmem_undo_range(struct inod
struct pagevec pvec;
pgoff_t indices[PAGEVEC_SIZE];
long nr_swaps_freed = 0;
+ pgoff_t warm_index = 0;
pgoff_t index;
int i;

if (lend == -1)
end = -1; /* unsigned, so actually very big */

- pagevec_init(&pvec, 0);
index = start;
while (index < end) {
+ pagevec_init(&pvec, index < warm_index);
pvec.nr = find_get_entries(mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
pvec.pages, indices);
@@ -535,7 +722,21 @@ static void shmem_undo_range(struct inod
if (!unfalloc || !PageUptodate(page)) {
if (page->mapping == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
- truncate_inode_page(mapping, page);
+ if (PageTeam(page)) {
+ /*
+ * Try preserve huge pages by
+ * freeing to tail of pcp list.
+ */
+ pvec.cold = 1;
+ warm_index = round_up(
+ index + 1, HPAGE_PMD_NR);
+ shmem_disband_hugeteam(page);
+ /* but that may not succeed */
+ }
+ if (!PageTeam(page)) {
+ truncate_inode_page(mapping,
+ page);
+ }
}
}
unlock_page(page);
@@ -577,7 +778,8 @@ static void shmem_undo_range(struct inod
index = start;
while (index < end) {
cond_resched();
-
+ /* Carrying warm_index from first pass is the best we can do */
+ pagevec_init(&pvec, index < warm_index);
pvec.nr = find_get_entries(mapping, index,
min(end - index, (pgoff_t)PAGEVEC_SIZE),
pvec.pages, indices);
@@ -612,7 +814,26 @@ static void shmem_undo_range(struct inod
if (!unfalloc || !PageUptodate(page)) {
if (page->mapping == mapping) {
VM_BUG_ON_PAGE(PageWriteback(page), page);
- truncate_inode_page(mapping, page);
+ if (PageTeam(page)) {
+ /*
+ * Try preserve huge pages by
+ * freeing to tail of pcp list.
+ */
+ pvec.cold = 1;
+ warm_index = round_up(
+ index + 1, HPAGE_PMD_NR);
+ shmem_disband_hugeteam(page);
+ /* but that may not succeed */
+ }
+ if (!PageTeam(page)) {
+ truncate_inode_page(mapping,
+ page);
+ } else if (end != -1) {
+ /* Punch retry disband now */
+ unlock_page(page);
+ index--;
+ break;
+ }
} else {
/* Page was replaced by swap: retry */
unlock_page(page);
@@ -784,7 +1005,7 @@ static int shmem_unuse_inode(struct shme
*/
if (!error)
error = shmem_add_to_page_cache(*pagep, mapping, index,
- radswap);
+ radswap, NULL);
if (error != -ENOMEM) {
/*
* Truncation and eviction use free_swap_and_cache(), which
@@ -922,10 +1143,25 @@ static int shmem_writepage(struct page *
SetPageUptodate(page);
}

+ if (PageTeam(page)) {
+ struct page *head = team_head(page);
+ /*
+ * Only proceed if this is head, or if head is unpopulated.
+ */
+ if (page != head && PageSwapBacked(head))
+ goto redirty;
+ }
+
swap = get_swap_page();
if (!swap.val)
goto redirty;

+ if (PageTeam(page)) {
+ shmem_disband_hugeteam(page);
+ if (PageTeam(page))
+ goto free_swap;
+ }
+
if (mem_cgroup_try_charge_swap(page, swap))
goto free_swap;

@@ -1025,8 +1261,8 @@ static struct page *shmem_swapin(swp_ent
return page;
}

-static struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
+static struct page *shmem_alloc_page(gfp_t gfp, struct shmem_inode_info *info,
+ pgoff_t index, struct page **hugehint, struct page **alloced_huge)
{
struct vm_area_struct pvma;
struct page *page;
@@ -1038,12 +1274,55 @@ static struct page *shmem_alloc_page(gfp
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && *hugehint) {
+ struct address_space *mapping = info->vfs_inode.i_mapping;
+ struct page *head;
+
+ rcu_read_lock();
+ *hugehint = shmem_hugeteam_lookup(mapping, index, true);
+ rcu_read_unlock();
+
+ if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
+ head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
+ true);
+ if (head) {
+ split_page(head, HPAGE_PMD_ORDER);
+
+ /* Prepare head page for add_to_page_cache */
+ __SetPageTeam(head);
+ head->mapping = mapping;
+ head->index = round_down(index, HPAGE_PMD_NR);
+ *alloced_huge = head;
+
+ /* Prepare wanted page for add_to_page_cache */
+ page = head + (index & (HPAGE_PMD_NR-1));
+ get_page(page);
+ __SetPageLocked(page);
+ goto out;
+ }
+ } else if (*hugehint != SHMEM_ALLOC_SMALL_PAGE) {
+ page = *hugehint;
+ head = page - (index & (HPAGE_PMD_NR-1));
+ /*
+ * This page is already visible: so we cannot use the
+ * __nonatomic ops, must check that it has not already
+ * been added, and cannot set the flags it needs until
+ * add_to_page_cache has the tree_lock.
+ */
+ lock_page(page);
+ if (PageSwapBacked(page) || !PageTeam(head))
+ *hugehint = SHMEM_RETRY_HUGE_PAGE;
+ goto out;
+ }
+ }
+
page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id(), false);
if (page) {
__SetPageLocked(page);
__SetPageSwapBacked(page);
}
-
+out:
/* Drop reference taken by mpol_shared_policy_lookup() */
mpol_cond_put(pvma.vm_policy);

@@ -1074,6 +1353,7 @@ static int shmem_replace_page(struct pag
struct address_space *swap_mapping;
pgoff_t swap_index;
int error;
+ struct page *hugehint = NULL;

oldpage = *pagep;
swap_index = page_private(oldpage);
@@ -1084,7 +1364,7 @@ static int shmem_replace_page(struct pag
* limit chance of success by further cpuset and node constraints.
*/
gfp &= ~GFP_CONSTRAINT_MASK;
- newpage = shmem_alloc_page(gfp, info, index);
+ newpage = shmem_alloc_page(gfp, info, index, &hugehint, &hugehint);
if (!newpage)
return -ENOMEM;

@@ -1155,6 +1435,8 @@ static int shmem_getpage_gfp(struct inod
int error;
int once = 0;
int alloced = 0;
+ struct page *hugehint;
+ struct page *alloced_huge = NULL;

if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
@@ -1237,7 +1519,7 @@ repeat:
false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
- swp_to_radix_entry(swap));
+ swp_to_radix_entry(swap), NULL);
/*
* We already confirmed swap under page lock, and make
* no memory allocation here, so usually no possibility
@@ -1286,11 +1568,23 @@ repeat:
percpu_counter_inc(&sbinfo->used_blocks);
}

- page = shmem_alloc_page(gfp, info, index);
+ /* Take huge hint from super, except for shmem_symlink() */
+ hugehint = NULL;
+ if (mapping->a_ops == &shmem_aops &&
+ (shmem_huge == SHMEM_HUGE_FORCE ||
+ (sbinfo->huge && shmem_huge != SHMEM_HUGE_DENY)))
+ hugehint = SHMEM_ALLOC_HUGE_PAGE;
+
+ page = shmem_alloc_page(gfp, info, index,
+ &hugehint, &alloced_huge);
if (!page) {
error = -ENOMEM;
goto decused;
}
+ if (hugehint == SHMEM_RETRY_HUGE_PAGE) {
+ error = -EEXIST;
+ goto decused;
+ }
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

@@ -1301,7 +1595,7 @@ repeat:
error = radix_tree_maybe_preload(gfp & GFP_RECLAIM_MASK);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
- NULL);
+ NULL, hugehint);
radix_tree_preload_end();
}
if (error) {
@@ -1339,13 +1633,14 @@ clear:
/* Perhaps the file has been truncated since we checked */
if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
- if (alloced) {
+ if (alloced && !PageTeam(page)) {
ClearPageDirty(page);
delete_from_page_cache(page);
spin_lock(&info->lock);
shmem_recalc_inode(inode);
spin_unlock(&info->lock);
}
+ alloced_huge = NULL; /* already exposed: maybe now in use */
error = -EINVAL;
goto unlock;
}
@@ -1368,6 +1663,10 @@ unlock:
unlock_page(page);
put_page(page);
}
+ if (alloced_huge) {
+ shmem_disband_hugeteam(alloced_huge);
+ alloced_huge = NULL;
+ }
if (error == -ENOSPC && !once++) {
info = SHMEM_I(inode);
spin_lock(&info->lock);

2016-04-05 21:17:50

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 05/31] huge tmpfs: avoid team pages in a few places

A few functions outside of mm/shmem.c must take care not to damage a
team accidentally. In particular, although huge tmpfs will make its
own use of page migration, we don't want compaction or other users
of page migration to stomp on teams by mistake: backstop checks in
migrate_page_move_mapping() and unmap_and_move() secure most cases,
and an earlier check in isolate_migratepages_block() saves compaction
from wasting time.

These checks are certainly too strong: we shall want NUMA mempolicy
and balancing, and memory hot-remove, and soft-offline of failing
memory, to work with team pages; but defer those to a later series.

Also send PageTeam the slow route, along with PageTransHuge, in
munlock_vma_pages_range(): because __munlock_pagevec_fill() uses
get_locked_pte(), which expects ptes not a huge pmd; and we don't
want to split up a pmd to munlock it. This avoids a VM_BUG_ON, or
hang on the non-existent ptlock; but there's much more to do later,
to get mlock+munlock working properly.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/compaction.c | 5 +++++
mm/memcontrol.c | 4 ++--
mm/migrate.c | 15 ++++++++++++++-
mm/mlock.c | 2 +-
mm/truncate.c | 2 +-
mm/vmscan.c | 2 ++
6 files changed, 25 insertions(+), 5 deletions(-)

--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -701,6 +701,11 @@ isolate_migratepages_block(struct compac
continue;
}

+ if (PageTeam(page)) {
+ low_pfn = round_up(low_pfn + 1, HPAGE_PMD_NR) - 1;
+ continue;
+ }
+
/*
* Check may be lockless but that's ok as we recheck later.
* It's possible to migrate LRU pages and balloon pages
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4566,8 +4566,8 @@ static enum mc_target_type get_mctgt_typ
enum mc_target_type ret = MC_TARGET_NONE;

page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!page || !PageHead(page), page);
- if (!(mc.flags & MOVE_ANON))
+ /* Don't attempt to move huge tmpfs pages yet: can be enabled later */
+ if (!(mc.flags & MOVE_ANON) || !PageAnon(page))
return ret;
if (page->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -346,7 +346,7 @@ int migrate_page_move_mapping(struct add
page_index(page));

expected_count += 1 + page_has_private(page);
- if (page_count(page) != expected_count ||
+ if (page_count(page) != expected_count || PageTeam(page) ||
radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
spin_unlock_irq(&mapping->tree_lock);
return -EAGAIN;
@@ -944,6 +944,11 @@ static ICE_noinline int unmap_and_move(n
if (!newpage)
return -ENOMEM;

+ if (PageTeam(page)) {
+ rc = -EBUSY;
+ goto out;
+ }
+
if (page_count(page) == 1) {
/* page was freed from under us. So we are done. */
goto out;
@@ -1757,6 +1762,14 @@ int migrate_misplaced_transhuge_page(str
pmd_t orig_entry;

/*
+ * Leave support for NUMA balancing on huge tmpfs pages to the future.
+ * The pmd marking up to this point should work okay, but from here on
+ * there is work to be done: e.g. anon page->mapping assumption below.
+ */
+ if (!PageAnon(page))
+ goto out_dropref;
+
+ /*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -459,7 +459,7 @@ void munlock_vma_pages_range(struct vm_a
if (PageTransTail(page)) {
VM_BUG_ON_PAGE(PageMlocked(page), page);
put_page(page); /* follow_page_mask() */
- } else if (PageTransHuge(page)) {
+ } else if (PageTransHuge(page) || PageTeam(page)) {
lock_page(page);
/*
* Any THP page found by follow_page_mask() may
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -528,7 +528,7 @@ invalidate_complete_page2(struct address
return 0;

spin_lock_irqsave(&mapping->tree_lock, flags);
- if (PageDirty(page))
+ if (PageDirty(page) || PageTeam(page))
goto failed;

BUG_ON(page_has_private(page));
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -638,6 +638,8 @@ static int __remove_mapping(struct addre
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
+ if (unlikely(PageTeam(page)))
+ goto cannot_free;
if (!page_ref_freeze(page, 2))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */

2016-04-05 21:20:11

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 06/31] huge tmpfs: shrinker to migrate and free underused holes

Using 2MB for each small file is wasteful, and on average even a large
file is likely to waste 1MB at the end. We could say that a huge tmpfs
is only suitable for huge files, but I would much prefer not to limit
it in that way, and would not easily be able to test such a filesystem.

In our model, the unused space in the team is not put on any LRU (nor
charged to any memcg), so not yet accessible to page reclaim: we need
a shrinker to disband the team, and free up the unused space, under
memory pressure. (Typically the freeable space is at the end, but
there's no assumption that it's at the end of the huge page or of the file.)

shmem_shrink_hugehole() is usually called from vmscan's shrink_slab();
but I've found that a direct call from shmem_alloc_page(), when it fails
to allocate a huge page (perhaps because too much memory is occupied
by shmem huge holes), is also helpful before a retry.

But each team holds a valuable resource: an extent of contiguous
memory that could be used for another team (or for an anonymous THP).
So try to proceed in such a way as to conserve that resource: rather
than just freeing the unused space and leaving yet another huge page
fragmented, also try to migrate the used space to another partially
occupied huge page.

The algorithm in shmem_choose_hugehole() (find least occupied huge page
in older half of shrinklist, and migrate its cachepages into the most
occupied huge page with enough space to fit, again chosen from older
half of shrinklist) is unlikely to be ideal; but easy to implement as
a demonstration of the pieces which can be used by any algorithm,
and good enough for now. A radix_tree tag helps to locate the
partially occupied huge pages more quickly: the dirty tag is free
for this use, since shmem does not participate in dirty/writeback
accounting.

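For illustration only (not part of this patch, and with locking, NUMA
node matching and the shrinklist aging all ignored), here is a standalone
userspace sketch of that least-occupied-source, most-occupied-destination
heuristic:

/* Simplified model of the shmem_choose_hugehole() selection heuristic. */
#include <stdio.h>

#define HPAGE_PMD_NR 512        /* 2MB / 4kB, as on x86_64 */

struct team {
        int used;               /* pagecache pages present in this huge extent */
};

static int freeholes(const struct team *t)
{
        return HPAGE_PMD_NR - t->used;
}

int main(void)
{
        struct team teams[] = { {500}, {30}, {200}, {480} };
        int n = sizeof(teams) / sizeof(teams[0]);
        int from = -1, to = -1, i;

        /* migrate from the team with the most free holes (least in use) */
        for (i = 0; i < n; i++)
                if (from < 0 || freeholes(&teams[i]) > freeholes(&teams[from]))
                        from = i;

        /* into the fullest team that still has room for those pages */
        for (i = 0; i < n; i++) {
                if (i == from || freeholes(&teams[i]) < teams[from].used)
                        continue;
                if (to < 0 || freeholes(&teams[i]) < freeholes(&teams[to]))
                        to = i;
        }

        printf("migrate %d pages from team %d into team %d\n",
               teams[from].used, from, to);
        return 0;
}

(If no destination with enough room is found, the real code just frees the
holes of the source team and stops there.)
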
The "team_usage" field added to struct page (in union with "private")
is somewhat vaguely named: because while the huge page is sparsely
occupied, it counts the occupancy; but once the huge page is fully
occupied, it will come to be used differently in a later patch, as
the huge mapcount (offset by the HPAGE_PMD_NR occupancy) - it is
never possible to map a sparsely occupied huge page, because that
would expose stale data to the user.

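To spell out that convention (an illustrative userspace model only, not
part of the patch; HPAGE_PMD_NR of 512 assumed, as for x86_64 with 4kB
pages):

/* Model of how head->team_usage is interpreted in its two phases. */
#include <assert.h>
#include <stdio.h>

#define HPAGE_PMD_NR 512

/* While filling: team_usage counts pagecache pages present in the team */
static long freeholes(long team_usage)
{
        return team_usage >= HPAGE_PMD_NR ? 0 : HPAGE_PMD_NR - team_usage;
}

/* Once full: team_usage becomes HPAGE_PMD_NR plus the huge pmd mapcount */
static long huge_mapcount(long team_usage)
{
        assert(team_usage >= HPAGE_PMD_NR);     /* never mapped while sparse */
        return team_usage - HPAGE_PMD_NR;
}

int main(void)
{
        printf("1 page in use: %ld free holes\n", freeholes(1));
        printf("full and pmd-mapped twice: mapcount %ld\n",
               huge_mapcount(HPAGE_PMD_NR + 2));
        return 0;
}
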
With this patch, the ShmemHugePages and ShmemFreeHoles lines of
/proc/meminfo are shown correctly; but ShmemPmdMapped remains 0.

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 9
include/linux/migrate.h | 1
include/linux/mm_types.h | 1
include/linux/shmem_fs.h | 3
include/trace/events/migrate.h | 3
mm/shmem.c | 440 +++++++++++++++++++++++++-
6 files changed, 443 insertions(+), 14 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -159,6 +159,15 @@ may require compaction), and slower to s
(since a team of small pages is managed instead of a single compound page);
but once set up and mapped, huge tmpfs performance should match hugetlbfs.

+When a file is created on a huge tmpfs (or copied there), a hugepage is
+allocated to it if possible. Initially only one small page of the hugepage
+will actually be used for the file: then the neighbouring free holes are
+filled as more data is added, until the hugepage is completed. But if the
+hugepage is left incomplete, and memory needs to be reclaimed, a shrinker can
+disband the team and free those holes; or page reclaim can disband the team
+and swap out the tmpfs pagecache. Free holes are not charged to any
+memcg, and are counted in MemAvailable; but are not counted in MemFree.
+
/proc/sys/vm/shmem_huge (intended for experimentation only):

Default 0; write 1 to set tmpfs mount option huge=1 on the kernel's
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -24,6 +24,7 @@ enum migrate_reason {
MR_MEMPOLICY_MBIND,
MR_NUMA_MISPLACED,
MR_CMA,
+ MR_SHMEM_HUGEHOLE,
MR_TYPES
};

--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -183,6 +183,7 @@ struct page {
#endif
#endif
struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
+ atomic_long_t team_usage; /* In shmem's PageTeam page */
};

#ifdef CONFIG_MEMCG
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -16,8 +16,9 @@ struct shmem_inode_info {
unsigned long flags;
unsigned long alloced; /* data pages alloced to file */
unsigned long swapped; /* subtotal assigned to swap */
- struct shared_policy policy; /* NUMA memory alloc policy */
+ struct list_head shrinklist; /* shrinkable hpage inodes */
struct list_head swaplist; /* chain of maybes on swap */
+ struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
struct inode vfs_inode;
};
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -19,7 +19,8 @@
EM( MR_SYSCALL, "syscall_or_cpuset") \
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
- EMe(MR_CMA, "cma")
+ EM( MR_CMA, "cma") \
+ EMe(MR_SHMEM_HUGEHOLE, "shmem_hugehole")

/*
* First define the enums in the above macros to be exported to userspace
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -58,6 +58,7 @@ static struct vfsmount *shm_mnt;
#include <linux/falloc.h>
#include <linux/splice.h>
#include <linux/security.h>
+#include <linux/shrinker.h>
#include <linux/sysctl.h>
#include <linux/swapops.h>
#include <linux/pageteam.h>
@@ -304,6 +305,14 @@ static bool shmem_confirm_swap(struct ad
#define SHMEM_RETRY_HUGE_PAGE ((struct page *)3)
/* otherwise hugehint is the hugeteam page to be used */

+/* tag for shrinker to locate unfilled hugepages */
+#define SHMEM_TAG_HUGEHOLE PAGECACHE_TAG_DIRTY
+
+/* list of inodes with unfilled hugepages, from which shrinker may free */
+static LIST_HEAD(shmem_shrinklist);
+static unsigned long shmem_shrinklist_depth;
+static DEFINE_SPINLOCK(shmem_shrinklist_lock);
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/* ifdef here to avoid bloating shmem.o when not necessary */

@@ -358,28 +367,106 @@ restart:
return page;
}

+static int shmem_freeholes(struct page *head)
+{
+ unsigned long nr = atomic_long_read(&head->team_usage);
+
+ return (nr >= HPAGE_PMD_NR) ? 0 : HPAGE_PMD_NR - nr;
+}
+
+static void shmem_clear_tag_hugehole(struct address_space *mapping,
+ pgoff_t index)
+{
+ struct page *page = NULL;
+
+ /*
+ * The tag was set on the first subpage to be inserted in cache.
+ * When written sequentially, or instantiated by a huge fault,
+ * it will be on the head page, but that's not always so. And
+ * radix_tree_tag_clear() succeeds when it finds a slot, whether
+ * tag was set on it or not. So first lookup and then clear.
+ */
+ radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
+ index, 1, SHMEM_TAG_HUGEHOLE);
+ VM_BUG_ON(!page || page->index >= index + HPAGE_PMD_NR);
+ radix_tree_tag_clear(&mapping->page_tree, page->index,
+ SHMEM_TAG_HUGEHOLE);
+}
+
+static void shmem_added_to_hugeteam(struct page *page, struct zone *zone,
+ struct page *hugehint)
+{
+ struct address_space *mapping = page->mapping;
+ struct page *head = team_head(page);
+ int nr;
+
+ if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
+ atomic_long_set(&head->team_usage, 1);
+ radix_tree_tag_set(&mapping->page_tree, page->index,
+ SHMEM_TAG_HUGEHOLE);
+ __mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
+ } else {
+ /* We do not need atomic ops until huge page gets mapped */
+ nr = atomic_long_read(&head->team_usage) + 1;
+ atomic_long_set(&head->team_usage, nr);
+ if (nr == HPAGE_PMD_NR) {
+ shmem_clear_tag_hugehole(mapping, head->index);
+ __inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
+ }
+ __dec_zone_state(zone, NR_SHMEM_FREEHOLES);
+ }
+}
+
static int shmem_disband_hugehead(struct page *head)
{
struct address_space *mapping;
struct zone *zone;
int nr = -EALREADY; /* A racing task may have disbanded the team */

- mapping = head->mapping;
- zone = page_zone(head);
+ /*
+ * In most cases the head page is locked, or not yet exposed to others:
+ * only in the shrinker migration case might head have been truncated.
+ * But although head->mapping may then be zeroed at any moment, mapping
+ * stays safe because shmem_evict_inode must take the shrinklist_lock,
+ * and our caller shmem_choose_hugehole is already holding that lock.
+ */
+ mapping = READ_ONCE(head->mapping);
+ if (!mapping)
+ return nr;

+ zone = page_zone(head);
spin_lock_irq(&mapping->tree_lock);
+
if (PageTeam(head)) {
+ nr = atomic_long_read(&head->team_usage);
+ atomic_long_set(&head->team_usage, 0);
+ /*
+ * Disable additions to the team.
+ * Ensure head->private is written before PageTeam is
+ * cleared, so shmem_writepage() cannot write swap into
+ * head->private, then have it overwritten by that 0!
+ */
+ smp_mb__before_atomic();
ClearPageTeam(head);
if (!PageSwapBacked(head))
head->mapping = NULL;
- __dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
- nr = 0;
+
+ if (nr >= HPAGE_PMD_NR) {
+ __dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
+ VM_BUG_ON(nr != HPAGE_PMD_NR);
+ } else if (nr) {
+ shmem_clear_tag_hugehole(mapping, head->index);
+ __mod_zone_page_state(zone, NR_SHMEM_FREEHOLES,
+ nr - HPAGE_PMD_NR);
+ }
}
+
spin_unlock_irq(&mapping->tree_lock);
return nr;
}

-static void shmem_disband_hugetails(struct page *head)
+static void shmem_disband_hugetails(struct page *head,
+ struct list_head *list, int nr)
{
struct page *page;
struct page *endpage;
@@ -387,7 +474,7 @@ static void shmem_disband_hugetails(stru
page = head;
endpage = head + HPAGE_PMD_NR;

- /* Condition follows in next but one commit */ {
+ if (!nr) {
/*
* The usual case: disbanding team and freeing holes as cold
* (cold being more likely to preserve high-order extents).
@@ -402,7 +489,50 @@ static void shmem_disband_hugetails(stru
else if (put_page_testzero(page))
free_hot_cold_page(page, 1);
}
+ } else if (nr < 0) {
+ struct zone *zone = page_zone(page);
+ int orig_nr = nr;
+ /*
+ * Shrinker wants to migrate cache pages from this team.
+ */
+ if (!PageSwapBacked(page)) { /* head was not in cache */
+ if (put_page_testzero(page))
+ free_hot_cold_page(page, 1);
+ } else if (isolate_lru_page(page) == 0) {
+ list_add_tail(&page->lru, list);
+ nr++;
+ }
+ while (++page < endpage) {
+ if (PageTeam(page)) {
+ if (isolate_lru_page(page) == 0) {
+ list_add_tail(&page->lru, list);
+ nr++;
+ }
+ ClearPageTeam(page);
+ } else if (put_page_testzero(page))
+ free_hot_cold_page(page, 1);
+ }
+ /* Yes, shmem counts in NR_ISOLATED_ANON but NR_FILE_PAGES */
+ mod_zone_page_state(zone, NR_ISOLATED_ANON, nr - orig_nr);
+ } else {
+ /*
+ * Shrinker wants free pages from this team to migrate into.
+ */
+ if (!PageSwapBacked(page)) { /* head was not in cache */
+ list_add_tail(&page->lru, list);
+ nr--;
+ }
+ while (++page < endpage) {
+ if (PageTeam(page))
+ ClearPageTeam(page);
+ else if (nr) {
+ list_add_tail(&page->lru, list);
+ nr--;
+ } else if (put_page_testzero(page))
+ free_hot_cold_page(page, 1);
+ }
}
+ VM_BUG_ON(nr > 0); /* maybe a few were not isolated */
}

static void shmem_disband_hugeteam(struct page *page)
@@ -444,12 +574,254 @@ static void shmem_disband_hugeteam(struc
if (head != page)
unlock_page(head);
if (nr_used >= 0)
- shmem_disband_hugetails(head);
+ shmem_disband_hugetails(head, NULL, 0);
if (head != page)
put_page(head);
preempt_enable();
}

+static struct page *shmem_get_hugehole(struct address_space *mapping,
+ unsigned long *index)
+{
+ struct page *page;
+ struct page *head;
+
+ rcu_read_lock();
+ while (radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
+ *index, 1, SHMEM_TAG_HUGEHOLE)) {
+ if (radix_tree_exception(page))
+ continue;
+ if (!page_cache_get_speculative(page))
+ continue;
+ if (!PageTeam(page) || page->mapping != mapping)
+ goto release;
+ head = team_head(page);
+ if (head != page) {
+ if (!page_cache_get_speculative(head))
+ goto release;
+ put_page(page);
+ page = head;
+ if (!PageTeam(page) || page->mapping != mapping)
+ goto release;
+ }
+ if (shmem_freeholes(head) > 0) {
+ rcu_read_unlock();
+ *index = head->index + HPAGE_PMD_NR;
+ return head;
+ }
+release:
+ put_page(page);
+ }
+ rcu_read_unlock();
+ return NULL;
+}
+
+static unsigned long shmem_choose_hugehole(struct list_head *fromlist,
+ struct list_head *tolist)
+{
+ unsigned long freed = 0;
+ unsigned long double_depth;
+ struct list_head *this, *next;
+ struct shmem_inode_info *info;
+ struct address_space *mapping;
+ struct page *frompage = NULL;
+ struct page *topage = NULL;
+ struct page *page;
+ pgoff_t index;
+ int fromused;
+ int toused;
+ int nid;
+
+ double_depth = 0;
+ spin_lock(&shmem_shrinklist_lock);
+ list_for_each_safe(this, next, &shmem_shrinklist) {
+ info = list_entry(this, struct shmem_inode_info, shrinklist);
+ mapping = info->vfs_inode.i_mapping;
+ if (!radix_tree_tagged(&mapping->page_tree,
+ SHMEM_TAG_HUGEHOLE)) {
+ list_del_init(&info->shrinklist);
+ shmem_shrinklist_depth--;
+ continue;
+ }
+ index = 0;
+ while ((page = shmem_get_hugehole(mapping, &index))) {
+ /* Choose to migrate from page with least in use */
+ if (!frompage ||
+ shmem_freeholes(page) > shmem_freeholes(frompage)) {
+ if (frompage)
+ put_page(frompage);
+ frompage = page;
+ if (shmem_freeholes(page) == HPAGE_PMD_NR-1) {
+ /* No point searching further */
+ double_depth = -3;
+ break;
+ }
+ } else
+ put_page(page);
+ }
+
+ /* Only reclaim from the older half of the shrinklist */
+ double_depth += 2;
+ if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
+ break;
+ }
+
+ if (!frompage)
+ goto unlock;
+ preempt_disable();
+ fromused = shmem_disband_hugehead(frompage);
+ spin_unlock(&shmem_shrinklist_lock);
+ if (fromused > 0)
+ shmem_disband_hugetails(frompage, fromlist, -fromused);
+ preempt_enable();
+ nid = page_to_nid(frompage);
+ put_page(frompage);
+
+ if (fromused <= 0)
+ return 0;
+ freed = HPAGE_PMD_NR - fromused;
+ if (fromused > HPAGE_PMD_NR/2)
+ return freed;
+
+ double_depth = 0;
+ spin_lock(&shmem_shrinklist_lock);
+ list_for_each_safe(this, next, &shmem_shrinklist) {
+ info = list_entry(this, struct shmem_inode_info, shrinklist);
+ mapping = info->vfs_inode.i_mapping;
+ if (!radix_tree_tagged(&mapping->page_tree,
+ SHMEM_TAG_HUGEHOLE)) {
+ list_del_init(&info->shrinklist);
+ shmem_shrinklist_depth--;
+ continue;
+ }
+ index = 0;
+ while ((page = shmem_get_hugehole(mapping, &index))) {
+ /* Choose to migrate to page with just enough free */
+ if (shmem_freeholes(page) >= fromused &&
+ page_to_nid(page) == nid) {
+ if (!topage || shmem_freeholes(page) <
+ shmem_freeholes(topage)) {
+ if (topage)
+ put_page(topage);
+ topage = page;
+ if (shmem_freeholes(page) == fromused) {
+ /* No point searching further */
+ double_depth = -3;
+ break;
+ }
+ } else
+ put_page(page);
+ } else
+ put_page(page);
+ }
+
+ /* Only reclaim from the older half of the shrinklist */
+ double_depth += 2;
+ if (double_depth >= min(shmem_shrinklist_depth, 2000UL))
+ break;
+ }
+
+ if (!topage)
+ goto unlock;
+ preempt_disable();
+ toused = shmem_disband_hugehead(topage);
+ spin_unlock(&shmem_shrinklist_lock);
+ if (toused > 0) {
+ if (HPAGE_PMD_NR - toused >= fromused)
+ shmem_disband_hugetails(topage, tolist, fromused);
+ else
+ shmem_disband_hugetails(topage, NULL, 0);
+ freed += HPAGE_PMD_NR - toused;
+ }
+ preempt_enable();
+ put_page(topage);
+ return freed;
+unlock:
+ spin_unlock(&shmem_shrinklist_lock);
+ return freed;
+}
+
+static struct page *shmem_get_migrate_page(struct page *frompage,
+ unsigned long private, int **result)
+{
+ struct list_head *tolist = (struct list_head *)private;
+ struct page *topage;
+
+ VM_BUG_ON(list_empty(tolist));
+ topage = list_first_entry(tolist, struct page, lru);
+ list_del(&topage->lru);
+ return topage;
+}
+
+static void shmem_put_migrate_page(struct page *topage, unsigned long private)
+{
+ struct list_head *tolist = (struct list_head *)private;
+
+ list_add(&topage->lru, tolist);
+}
+
+static void shmem_putback_migrate_pages(struct list_head *tolist)
+{
+ struct page *topage;
+ struct page *next;
+
+ /*
+ * The tolist pages were not counted in NR_ISOLATED, so stats
+ * would go wrong if putback_movable_pages() were used on them.
+ * Indeed, even putback_lru_page() is wrong for these pages.
+ */
+ list_for_each_entry_safe(topage, next, tolist, lru) {
+ list_del(&topage->lru);
+ if (put_page_testzero(topage))
+ free_hot_cold_page(topage, 1);
+ }
+}
+
+static unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ unsigned long freed;
+ LIST_HEAD(fromlist);
+ LIST_HEAD(tolist);
+
+ if (list_empty(&shmem_shrinklist))
+ return SHRINK_STOP;
+ freed = shmem_choose_hugehole(&fromlist, &tolist);
+ if (list_empty(&fromlist))
+ return SHRINK_STOP;
+ if (!list_empty(&tolist)) {
+ migrate_pages(&fromlist, shmem_get_migrate_page,
+ shmem_put_migrate_page, (unsigned long)&tolist,
+ MIGRATE_SYNC, MR_SHMEM_HUGEHOLE);
+ preempt_disable();
+ drain_local_pages(NULL); /* try to preserve huge freed page */
+ preempt_enable();
+ shmem_putback_migrate_pages(&tolist);
+ }
+ putback_movable_pages(&fromlist); /* if any were left behind */
+ return freed;
+}
+
+static unsigned long shmem_count_hugehole(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ /*
+ * Huge hole space is not charged to any memcg:
+ * only shrink it for global reclaim.
+ * But at present we're only called for global reclaim anyway.
+ */
+ if (list_empty(&shmem_shrinklist))
+ return 0;
+ return global_page_state(NR_SHMEM_FREEHOLES);
+}
+
+static struct shrinker shmem_hugehole_shrinker = {
+ .count_objects = shmem_count_hugehole,
+ .scan_objects = shmem_shrink_hugehole,
+ .seeks = DEFAULT_SEEKS, /* would another value work better? */
+ .batch = HPAGE_PMD_NR, /* would another value work better? */
+};
+
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */

#define shmem_huge SHMEM_HUGE_DENY
@@ -465,6 +837,17 @@ static inline void shmem_disband_hugetea
{
BUILD_BUG();
}
+
+static inline void shmem_added_to_hugeteam(struct page *page,
+ struct zone *zone, struct page *hugehint)
+{
+}
+
+static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
+ struct shrink_control *sc)
+{
+ return 0;
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

/*
@@ -504,10 +887,10 @@ shmem_add_to_page_cache(struct page *pag
if (unlikely(error))
goto errout;

- if (!PageTeam(page))
+ if (PageTeam(page))
+ shmem_added_to_hugeteam(page, zone, hugehint);
+ else
get_page(page);
- else if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
- __inc_zone_state(zone, NR_SHMEM_HUGEPAGES);

mapping->nrpages++;
__inc_zone_state(zone, NR_FILE_PAGES);
@@ -932,6 +1315,14 @@ static void shmem_evict_inode(struct ino
shmem_unacct_size(info->flags, inode->i_size);
inode->i_size = 0;
shmem_truncate_range(inode, 0, (loff_t)-1);
+ if (!list_empty(&info->shrinklist)) {
+ spin_lock(&shmem_shrinklist_lock);
+ if (!list_empty(&info->shrinklist)) {
+ list_del_init(&info->shrinklist);
+ shmem_shrinklist_depth--;
+ }
+ spin_unlock(&shmem_shrinklist_lock);
+ }
if (!list_empty(&info->swaplist)) {
mutex_lock(&shmem_swaplist_mutex);
list_del_init(&info->swaplist);
@@ -1286,10 +1677,18 @@ static struct page *shmem_alloc_page(gfp
head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
true);
+ if (!head &&
+ shmem_shrink_hugehole(NULL, NULL) != SHRINK_STOP) {
+ head = alloc_pages_vma(
+ gfp|__GFP_NORETRY|__GFP_NOWARN,
+ HPAGE_PMD_ORDER, &pvma, 0,
+ numa_node_id(), true);
+ }
if (head) {
split_page(head, HPAGE_PMD_ORDER);

/* Prepare head page for add_to_page_cache */
+ atomic_long_set(&head->team_usage, 0);
__SetPageTeam(head);
head->mapping = mapping;
head->index = round_down(index, HPAGE_PMD_NR);
@@ -1613,6 +2012,21 @@ repeat:
alloced = true;

/*
+ * Might we see !list_empty a moment before the shrinker
+ * removes this inode from its list? Unlikely, since we
+ * already set a tag in the tree. Some barrier required?
+ */
+ if (alloced_huge && list_empty(&info->shrinklist)) {
+ spin_lock(&shmem_shrinklist_lock);
+ if (list_empty(&info->shrinklist)) {
+ list_add_tail(&info->shrinklist,
+ &shmem_shrinklist);
+ shmem_shrinklist_depth++;
+ }
+ spin_unlock(&shmem_shrinklist_lock);
+ }
+
+ /*
* Let SGP_FALLOC use the SGP_WRITE optimization on a new page.
*/
if (sgp == SGP_FALLOC)
@@ -1823,6 +2237,7 @@ static struct inode *shmem_get_inode(str
spin_lock_init(&info->lock);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
+ INIT_LIST_HEAD(&info->shrinklist);
INIT_LIST_HEAD(&info->swaplist);
simple_xattrs_init(&info->xattrs);
cache_no_acl(inode);
@@ -3613,9 +4028,10 @@ int __init shmem_init(void)
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- if (has_transparent_hugepage())
+ if (has_transparent_hugepage()) {
SHMEM_SB(shm_mnt->mnt_sb)->huge = (shmem_huge > 0);
- else
+ register_shrinker(&shmem_hugehole_shrinker);
+ } else
shmem_huge = 0; /* just in case it was patched */
#endif
return 0;

2016-04-05 21:21:27

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 07/31] huge tmpfs: get_unmapped_area align & fault supply huge page

Now make the shmem.c changes necessary for mapping its huge pages into
userspace with huge pmds: without actually doing so, since that needs
changes in huge_memory.c and across mm, better left to another patch.

Provide a shmem_get_unmapped_area method in file_operations, called
at mmap time to decide the mapping address. It could be conditional
on CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by
making it unconditional.

shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
(which we treat as a black box, highly dependent on architecture and
config and executable layout). Lots of conditions, and in most cases
it just goes with the address that call chose; but when our huge stars are
rightly aligned, yet that did not provide a suitable address, it goes back
to ask for a larger arena, within which to align the mapping suitably.

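(For illustration only, not part of the patch: a userspace sketch of the
final alignment arithmetic, assuming a 2MB HPAGE_PMD_SIZE - given the start
of the inflated arena, pick the address within it whose low bits match the
file offset.)

/* Model of aligning a mapping within an inflated arena. */
#include <stdio.h>

#define PAGE_SIZE       4096UL
#define HPAGE_PMD_SIZE  (512 * PAGE_SIZE)       /* 2MB */

/*
 * inflated_addr: start of an arena len + HPAGE_PMD_SIZE - PAGE_SIZE long;
 * offset: what the mapping address must be, modulo HPAGE_PMD_SIZE.
 */
static unsigned long align_within(unsigned long inflated_addr,
                                  unsigned long offset)
{
        unsigned long inflated_offset = inflated_addr & (HPAGE_PMD_SIZE - 1);
        unsigned long addr = inflated_addr + offset - inflated_offset;

        if (inflated_offset > offset)   /* don't step back before the arena */
                addr += HPAGE_PMD_SIZE;
        return addr;
}

int main(void)
{
        /* arena address is just an example value; offset 0 for pgoff 0 */
        printf("%#lx\n", align_within(0x7f1234561000UL, 0));
        return 0;
}
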
There have to be some direct calls to shmem_get_unmapped_area(),
not via the file_operations: because of the way shmem_zero_setup()
is called to create a shmem object late in the mmap sequence, when
MAP_SHARED is requested with MAP_ANONYMOUS or /dev/zero. Though
this only matters when /proc/sys/vm/shmem_huge has been set.

Then at fault time, shmem_fault() does its usual shmem_getpage_gfp(),
and if caller __do_fault() passed FAULT_FLAG_MAY_HUGE (in later patch),
checks if the 4kB page returned is PageTeam, and, subject to further
conditions, proceeds to populate the whole of the huge page (if it
was not already fully populated and uptodate: use PG_owner_priv_1
PageChecked to save repeating all this each time the object is mapped);
then returns it to __do_fault() with a VM_FAULT_HUGE flag to request
a huge pmd.

There are two conditions you might expect, which are not enforced. Originally
I intended to support just MAP_SHARED at this stage, which should be
good enough for a first implementation; but support for MAP_PRIVATE
(on read fault) needs so little further change, that it was well worth
supporting too - it opens up the opportunity to copy your x86_64 ELF
executables to huge tmpfs, their text then automatically mapped huge.

The other missing condition: shmem_getpage_gfp() is checking that
the fault falls within (4kB-rounded-up) i_size, but shmem_fault() maps
hugely even when the tail of the 2MB falls outside the (4kB-rounded-up)
i_size. This is intentional, but may need reconsideration - especially
in the MAP_PRIVATE case (is it right for a private mapping to allocate
"hidden" pages to the object beyond its EOF?). The intent is that an
application can indicate its desire for huge pmds throughout, even of
the tail, by using a hugely-rounded-up mmap size; but we might end up
retracting this, asking for fallocate to be used explicitly for that.
(hugetlbfs behaves even less standardly: its mmap extends the i_size
of the object.)

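(A sketch of what that looks like from userspace - illustrative only,
assuming a huge tmpfs mounted at /mnt/hugetmpfs with a file "data" on it:
rounding the mmap length up to the 2MB boundary asks for huge pmds right
through the tail.)

/* Map a tmpfs file with its length rounded up to a hugepage boundary. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define HPAGE_SIZE      (2UL << 20)

int main(void)
{
        int fd = open("/mnt/hugetmpfs/data", O_RDWR);   /* assumed path */
        struct stat st;
        size_t len;
        void *p;

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;
        /* round the mapping length up to a hugepage boundary */
        len = (st.st_size + HPAGE_SIZE - 1) & ~(HPAGE_SIZE - 1);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        printf("mapped %zu bytes at %p\n", len, p);
        munmap(p, len);
        close(fd);
        return 0;
}
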
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 8 +
drivers/char/mem.c | 23 ++
include/linux/mm.h | 3
include/linux/shmem_fs.h | 2
ipc/shm.c | 6
mm/mmap.c | 16 +-
mm/shmem.c | 204 +++++++++++++++++++++++++-
7 files changed, 253 insertions(+), 9 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -168,6 +168,14 @@ disband the team and free those holes; o
and swap out the tmpfs pagecache. Free holes are not charged to any
memcg, and are counted in MemAvailable; but are not counted in MemFree.

+If a hugepage is mapped into a well-aligned huge extent of userspace (and
+huge tmpfs defaults to suitable alignment for any mapping large enough), any
+remaining free holes are first filled with zeroes to complete the hugepage.
+So, if the mmap length extends to a hugepage boundary beyond end of file,
+user accesses between end of file and that hugepage boundary will normally
+not fail with SIGBUS, as they would on a huge=0 filesystem - but will fail
+with SIGBUS if the kernel could only allocate small pages to back it.
+
/proc/sys/vm/shmem_huge (intended for experimentation only):

Default 0; write 1 to set tmpfs mount option huge=1 on the kernel's
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -22,6 +22,7 @@
#include <linux/device.h>
#include <linux/highmem.h>
#include <linux/backing-dev.h>
+#include <linux/shmem_fs.h>
#include <linux/splice.h>
#include <linux/pfn.h>
#include <linux/export.h>
@@ -661,6 +662,27 @@ static int mmap_zero(struct file *file,
return 0;
}

+static unsigned long get_unmapped_area_zero(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+#ifndef CONFIG_MMU
+ return -ENOSYS;
+#endif
+ if (flags & MAP_SHARED) {
+ /*
+ * mmap_zero() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge;
+ * and pass NULL for file as in mmap.c's get_unmapped_area(),
+ * so as not to confuse shmem with our handle on "/dev/zero".
+ */
+ return shmem_get_unmapped_area(NULL, addr, len, pgoff, flags);
+ }
+
+ /* Otherwise flags & MAP_PRIVATE: with no shmem object beneath it */
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
static ssize_t write_full(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
@@ -768,6 +790,7 @@ static const struct file_operations zero
.read_iter = read_iter_zero,
.write_iter = write_iter_zero,
.mmap = mmap_zero,
+ .get_unmapped_area = get_unmapped_area_zero,
#ifndef CONFIG_MMU
.mmap_capabilities = zero_mmap_capabilities,
#endif
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -276,6 +276,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */
#define FAULT_FLAG_REMOTE 0x80 /* faulting for non current tsk/mm */
#define FAULT_FLAG_INSTRUCTION 0x100 /* The fault was during an instruction fetch */
+#define FAULT_FLAG_MAY_HUGE 0x200 /* PT not alloced: could use huge pmd */

/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
@@ -1079,7 +1080,7 @@ static inline void clear_page_pfmemalloc
#define VM_FAULT_HWPOISON 0x0010 /* Hit poisoned small page */
#define VM_FAULT_HWPOISON_LARGE 0x0020 /* Hit poisoned large page. Index encoded in upper bits */
#define VM_FAULT_SIGSEGV 0x0040
-
+#define VM_FAULT_HUGE 0x0080 /* ->fault needs page installed as huge pmd */
#define VM_FAULT_NOPAGE 0x0100 /* ->fault installed the pte, not return page */
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -51,6 +51,8 @@ extern struct file *shmem_file_setup(con
extern struct file *shmem_kernel_file_setup(const char *name, loff_t size,
unsigned long flags);
extern int shmem_zero_setup(struct vm_area_struct *);
+extern unsigned long shmem_get_unmapped_area(struct file *, unsigned long addr,
+ unsigned long len, unsigned long pgoff, unsigned long flags);
extern int shmem_lock(struct file *file, int lock, struct user_struct *user);
extern bool shmem_mapping(struct address_space *mapping);
extern void shmem_unlock_mapping(struct address_space *mapping);
--- a/ipc/shm.c
+++ b/ipc/shm.c
@@ -476,13 +476,15 @@ static const struct file_operations shm_
.mmap = shm_mmap,
.fsync = shm_fsync,
.release = shm_release,
-#ifndef CONFIG_MMU
.get_unmapped_area = shm_get_unmapped_area,
-#endif
.llseek = noop_llseek,
.fallocate = shm_fallocate,
};

+/*
+ * shm_file_operations_huge is now identical to shm_file_operations,
+ * but we keep it distinct for the sake of is_file_shm_hugepages().
+ */
static const struct file_operations shm_file_operations_huge = {
.mmap = shm_mmap,
.fsync = shm_fsync,
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/personality.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/shmem_fs.h>
#include <linux/profile.h>
#include <linux/export.h>
#include <linux/mount.h>
@@ -1900,8 +1901,19 @@ get_unmapped_area(struct file *file, uns
return -ENOMEM;

get_area = current->mm->get_unmapped_area;
- if (file && file->f_op->get_unmapped_area)
- get_area = file->f_op->get_unmapped_area;
+ if (file) {
+ if (file->f_op->get_unmapped_area)
+ get_area = file->f_op->get_unmapped_area;
+ } else if (flags & MAP_SHARED) {
+ /*
+ * mmap_region() will call shmem_zero_setup() to create a file,
+ * so use shmem's get_unmapped_area in case it can be huge.
+ * do_mmap_pgoff() will clear pgoff, so match alignment.
+ */
+ pgoff = 0;
+ get_area = shmem_get_unmapped_area;
+ }
+
addr = get_area(file, addr, len, pgoff, flags);
if (IS_ERR_VALUE(addr))
return addr;
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -104,6 +104,7 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
+ SGP_TEAM, /* may exceed i_size, may make team page Uptodate */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
};
@@ -417,6 +418,44 @@ static void shmem_added_to_hugeteam(stru
}
}

+static int shmem_populate_hugeteam(struct inode *inode, struct page *head,
+ struct vm_area_struct *vma)
+{
+ gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+ struct page *page;
+ pgoff_t index;
+ int error;
+ int i;
+
+ /* We only have to do this once */
+ if (PageChecked(head))
+ return 0;
+
+ index = head->index;
+ for (i = 0; i < HPAGE_PMD_NR; i++, index++) {
+ if (!PageTeam(head))
+ return -EAGAIN;
+ if (PageChecked(head))
+ return 0;
+ /* Mark all pages dirty even when map is readonly, for now */
+ if (PageUptodate(head + i) && PageDirty(head + i))
+ continue;
+ error = shmem_getpage_gfp(inode, index, &page, SGP_TEAM,
+ gfp, vma->vm_mm, NULL);
+ if (error)
+ return error;
+ SetPageDirty(page);
+ unlock_page(page);
+ put_page(page);
+ if (page != head + i)
+ return -EAGAIN;
+ cond_resched();
+ }
+
+ /* Now safe from the shrinker, but not yet from truncate */
+ return 0;
+}
+
static int shmem_disband_hugehead(struct page *head)
{
struct address_space *mapping;
@@ -452,6 +491,7 @@ static int shmem_disband_hugehead(struct
head->mapping = NULL;

if (nr >= HPAGE_PMD_NR) {
+ ClearPageChecked(head);
__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
VM_BUG_ON(nr != HPAGE_PMD_NR);
} else if (nr) {
@@ -843,6 +883,12 @@ static inline void shmem_added_to_hugete
{
}

+static inline int shmem_populate_hugeteam(struct inode *inode,
+ struct page *head, struct vm_area_struct *vma)
+{
+ return -EAGAIN;
+}
+
static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -1817,8 +1863,8 @@ static int shmem_replace_page(struct pag
* vm. If we swap it in we mark it dirty since we also free the swap
* entry since a page cannot live in both the swap and page cache.
*
- * fault_mm and fault_type are only supplied by shmem_fault:
- * otherwise they are NULL.
+ * fault_mm and fault_type are only supplied by shmem_fault
+ * (or hugeteam population): otherwise they are NULL.
*/
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct page **pagep, enum sgp_type sgp, gfp_t gfp,
@@ -2095,10 +2141,13 @@ unlock:

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
+ unsigned long addr = (unsigned long)vmf->virtual_address;
struct inode *inode = file_inode(vma->vm_file);
gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
+ struct page *head;
+ int ret = 0;
+ int once = 0;
int error;
- int ret = VM_FAULT_LOCKED;

/*
* Trinity finds that probing a hole which tmpfs is punching can
@@ -2158,11 +2207,150 @@ static int shmem_fault(struct vm_area_st
spin_unlock(&inode->i_lock);
}

+single:
+ vmf->page = NULL;
error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
gfp, vma->vm_mm, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
- return ret;
+ ret |= VM_FAULT_LOCKED;
+
+ /*
+ * Shall we map a huge page hugely?
+ */
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return ret;
+ if (!(vmf->flags & FAULT_FLAG_MAY_HUGE))
+ return ret;
+ if (!PageTeam(vmf->page))
+ return ret;
+ if (once++)
+ return ret;
+ if (!(vma->vm_flags & VM_SHARED) && (vmf->flags & FAULT_FLAG_WRITE))
+ return ret;
+ if ((vma->vm_start-(vma->vm_pgoff<<PAGE_SHIFT)) & (HPAGE_PMD_SIZE-1))
+ return ret;
+ if (round_down(addr, HPAGE_PMD_SIZE) < vma->vm_start)
+ return ret;
+ if (round_up(addr + 1, HPAGE_PMD_SIZE) > vma->vm_end)
+ return ret;
+ /* But omit i_size check: allow up to huge page boundary */
+
+ head = team_head(vmf->page);
+ if (!get_page_unless_zero(head))
+ return ret;
+ if (!PageTeam(head)) {
+ put_page(head);
+ return ret;
+ }
+
+ ret &= ~VM_FAULT_LOCKED;
+ unlock_page(vmf->page);
+ put_page(vmf->page);
+ if (shmem_populate_hugeteam(inode, head, vma) < 0) {
+ put_page(head);
+ goto single;
+ }
+ lock_page(head);
+ if (!PageTeam(head)) {
+ unlock_page(head);
+ put_page(head);
+ goto single;
+ }
+ if (!PageChecked(head))
+ SetPageChecked(head);
+
+ /* Now safe from truncation */
+ vmf->page = head;
+ return ret | VM_FAULT_LOCKED | VM_FAULT_HUGE;
+}
+
+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long uaddr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ unsigned long (*get_area)(struct file *,
+ unsigned long, unsigned long, unsigned long, unsigned long);
+ unsigned long addr;
+ unsigned long offset;
+ unsigned long inflated_len;
+ unsigned long inflated_addr;
+ unsigned long inflated_offset;
+
+ if (len > TASK_SIZE)
+ return -ENOMEM;
+
+ get_area = current->mm->get_unmapped_area;
+ addr = get_area(file, uaddr, len, pgoff, flags);
+
+ if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ return addr;
+ if (IS_ERR_VALUE(addr))
+ return addr;
+ if (addr & ~PAGE_MASK)
+ return addr;
+ if (addr > TASK_SIZE - len)
+ return addr;
+
+ if (shmem_huge == SHMEM_HUGE_DENY)
+ return addr;
+ if (len < HPAGE_PMD_SIZE)
+ return addr;
+ if (flags & MAP_FIXED)
+ return addr;
+ /*
+ * Our priority is to support MAP_SHARED mapped hugely;
+ * and support MAP_PRIVATE mapped hugely too, until it is COWed.
+ * But if caller specified an address hint, respect that as before.
+ */
+ if (uaddr)
+ return addr;
+
+ if (shmem_huge != SHMEM_HUGE_FORCE) {
+ struct super_block *sb;
+
+ if (file) {
+ VM_BUG_ON(file->f_op != &shmem_file_operations);
+ sb = file_inode(file)->i_sb;
+ } else {
+ /*
+ * Called directly from mm/mmap.c, or drivers/char/mem.c
+ * for "/dev/zero", to create a shared anonymous object.
+ */
+ if (IS_ERR(shm_mnt))
+ return addr;
+ sb = shm_mnt->mnt_sb;
+ }
+ if (!SHMEM_SB(sb)->huge)
+ return addr;
+ }
+
+ offset = (pgoff << PAGE_SHIFT) & (HPAGE_PMD_SIZE-1);
+ if (offset && offset + len < 2 * HPAGE_PMD_SIZE)
+ return addr;
+ if ((addr & (HPAGE_PMD_SIZE-1)) == offset)
+ return addr;
+
+ inflated_len = len + HPAGE_PMD_SIZE - PAGE_SIZE;
+ if (inflated_len > TASK_SIZE)
+ return addr;
+ if (inflated_len < len)
+ return addr;
+
+ inflated_addr = get_area(NULL, 0, inflated_len, 0, flags);
+ if (IS_ERR_VALUE(inflated_addr))
+ return addr;
+ if (inflated_addr & ~PAGE_MASK)
+ return addr;
+
+ inflated_offset = inflated_addr & (HPAGE_PMD_SIZE-1);
+ inflated_addr += offset - inflated_offset;
+ if (inflated_offset > offset)
+ inflated_addr += HPAGE_PMD_SIZE;
+
+ if (inflated_addr > TASK_SIZE - len)
+ return addr;
+ return inflated_addr;
}

#ifdef CONFIG_NUMA
@@ -3905,6 +4093,7 @@ static const struct address_space_operat

static const struct file_operations shmem_file_operations = {
.mmap = shmem_mmap,
+ .get_unmapped_area = shmem_get_unmapped_area,
#ifdef CONFIG_TMPFS
.llseek = shmem_file_llseek,
.read_iter = shmem_file_read_iter,
@@ -4112,6 +4301,13 @@ void shmem_unlock_mapping(struct address
{
}

+unsigned long shmem_get_unmapped_area(struct file *file,
+ unsigned long addr, unsigned long len,
+ unsigned long pgoff, unsigned long flags)
+{
+ return current->mm->get_unmapped_area(file, addr, len, pgoff, flags);
+}
+
void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
{
truncate_inode_pages_range(inode->i_mapping, lstart, lend);

2016-04-05 21:23:10

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 08/31] huge tmpfs: try_to_unmap_one use page_check_address_transhuge

Anon THP's huge pages are split for reclaim in add_to_swap(), before they
reach try_to_unmap(); migrate_misplaced_transhuge_page() does its own pmd
remapping, instead of needing try_to_unmap(); migratable hugetlbfs pages
masquerade as pte-mapped in page_check_address(). So try_to_unmap_one()
did not need to handle transparent pmd mappings as page_referenced_one()
does (beyond the TTU_SPLIT_HUGE_PMD case; though what about TTU_MUNLOCK?).

But tmpfs huge pages are split a little later in the reclaim sequence,
when pageout() calls shmem_writepage(): so try_to_unmap_one() now needs
to handle pmd-mapped pages by using page_check_address_transhuge(), and
a function unmap_team_by_pmd() that we shall place in huge_memory.c in
a later patch, but just use a stub for now.

Refine the lookup in page_check_address_transhuge() slightly, to match
what mm_find_pmd() does, and what we've been using for a year: take a pmdval
snapshot of *pmd first, to avoid pmd_lock before the pmd_page check,
with a retry if it changes in between. Was the code wrong before?
I don't think it was, but I am more comfortable with how it is now.

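(Reduced to a userspace model, purely for illustration - no real pmd here,
and the names are made up: take a snapshot before locking, then re-verify
the snapshot under the lock and retry if it changed.)

/* Model of the snapshot-then-recheck pattern. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static volatile long entry;             /* stands in for *pmd */

static int check_entry(void)
{
        long snapshot;
again:
        snapshot = entry;               /* cheap test before taking the lock */
        if (!snapshot)                  /* "not present": nothing to do */
                return 0;

        pthread_mutex_lock(&lock);
        if (entry != snapshot) {        /* changed while unlocked: retry */
                pthread_mutex_unlock(&lock);
                goto again;
        }
        /* safe to act on snapshot here, still holding the lock */
        pthread_mutex_unlock(&lock);
        return 1;
}

int main(void)
{
        entry = 42;
        printf("found=%d\n", check_entry());
        return 0;
}
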
Change its check on hpage_nr_pages() to use compound_order() instead,
two reasons for that: one being that there's now a case in anon THP
splitting where the new call to page_check_address_transhuge() may be on
a PageTail, which hits VM_BUG_ON in PageTransHuge in hpage_nr_pages();
the other being that hpage_nr_pages() on PageTeam gets more interesting
in a later patch, and would no longer be appropriate here.

Say "pmdval" as usual, instead of the "pmde" I made up for mm_find_pmd()
before. Update the comment in mm_find_pmd() to generalise it away from
just the anon_vma lock.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pageteam.h | 6 +++
mm/rmap.c | 65 +++++++++++++++++++++----------------
2 files changed, 43 insertions(+), 28 deletions(-)

--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -29,4 +29,10 @@ static inline struct page *team_head(str
return head;
}

+/* Temporary stub for mm/rmap.c until implemented in mm/huge_memory.c */
+static inline void unmap_team_by_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd, struct page *page)
+{
+}
+
#endif /* _LINUX_PAGETEAM_H */
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -47,6 +47,7 @@

#include <linux/mm.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/slab.h>
@@ -687,7 +688,7 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd = NULL;
- pmd_t pmde;
+ pmd_t pmdval;

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -700,12 +701,12 @@ pmd_t *mm_find_pmd(struct mm_struct *mm,
pmd = pmd_offset(pud, address);
/*
* Some THP functions use the sequence pmdp_huge_clear_flush(), set_pmd_at()
- * without holding anon_vma lock for write. So when looking for a
- * genuine pmde (in which to find pte), test present and !THP together.
+ * without locking out concurrent rmap lookups. So when looking for a
+ * pmd entry, in which to find a pte, test present and !THP together.
*/
- pmde = *pmd;
+ pmdval = *pmd;
barrier();
- if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+ if (!pmd_present(pmdval) || pmd_trans_huge(pmdval))
pmd = NULL;
out:
return pmd;
@@ -800,6 +801,7 @@ bool page_check_address_transhuge(struct
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
+ pmd_t pmdval;
pte_t *pte;
spinlock_t *ptl;

@@ -821,32 +823,24 @@ bool page_check_address_transhuge(struct
if (!pud_present(*pud))
return false;
pmd = pmd_offset(pud, address);
+again:
+ pmdval = *pmd;
+ barrier();
+ if (!pmd_present(pmdval))
+ return false;

- if (pmd_trans_huge(*pmd)) {
+ if (pmd_trans_huge(pmdval)) {
+ if (pmd_page(pmdval) != page)
+ return false;
ptl = pmd_lock(mm, pmd);
- if (!pmd_present(*pmd))
- goto unlock_pmd;
- if (unlikely(!pmd_trans_huge(*pmd))) {
+ if (unlikely(!pmd_same(*pmd, pmdval))) {
spin_unlock(ptl);
- goto map_pte;
+ goto again;
}
-
- if (pmd_page(*pmd) != page)
- goto unlock_pmd;
-
pte = NULL;
goto found;
-unlock_pmd:
- spin_unlock(ptl);
- return false;
- } else {
- pmd_t pmde = *pmd;
-
- barrier();
- if (!pmd_present(pmde) || pmd_trans_huge(pmde))
- return false;
}
-map_pte:
+
pte = pte_offset_map(pmd, address);
if (!pte_present(*pte)) {
pte_unmap(pte);
@@ -863,7 +857,7 @@ check_pte:
}

/* THP can be referenced by any subpage */
- if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
+ if (pte_pfn(*pte) - page_to_pfn(page) >= (1 << compound_order(page))) {
pte_unmap_unlock(pte, ptl);
return false;
}
@@ -1404,6 +1398,7 @@ static int try_to_unmap_one(struct page
unsigned long address, void *arg)
{
struct mm_struct *mm = vma->vm_mm;
+ pmd_t *pmd;
pte_t *pte;
pte_t pteval;
spinlock_t *ptl;
@@ -1423,8 +1418,7 @@ static int try_to_unmap_one(struct page
goto out;
}

- pte = page_check_address(page, mm, address, &ptl, 0);
- if (!pte)
+ if (!page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl))
goto out;

/*
@@ -1442,6 +1436,19 @@ static int try_to_unmap_one(struct page
if (flags & TTU_MUNLOCK)
goto out_unmap;
}
+
+ if (!pte) {
+ if (!(flags & TTU_IGNORE_ACCESS) &&
+ IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ pmdp_clear_flush_young_notify(vma, address, pmd)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+ spin_unlock(ptl);
+ unmap_team_by_pmd(vma, address, pmd, page);
+ goto out;
+ }
+
if (!(flags & TTU_IGNORE_ACCESS)) {
if (ptep_clear_flush_young_notify(vma, address, pte)) {
ret = SWAP_FAIL;
@@ -1542,7 +1549,9 @@ discard:
put_page(page);

out_unmap:
- pte_unmap_unlock(pte, ptl);
+ spin_unlock(ptl);
+ if (pte)
+ pte_unmap(pte);
if (ret != SWAP_FAIL && ret != SWAP_MLOCK && !(flags & TTU_MUNLOCK))
mmu_notifier_invalidate_page(mm, address);
out:

2016-04-05 21:24:29

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 09/31] huge tmpfs: avoid premature exposure of new pagetable

In early development, a huge tmpfs fault simply replaced the pmd which
pointed to the empty pagetable just allocated in __handle_mm_fault():
but that is unsafe.

Andrea wrote a very interesting comment on THP in mm/memory.c,
just before the end of __handle_mm_fault():

* A regular pmd is established and it can't morph into a huge pmd
* from under us anymore at this point because we hold the mmap_sem
* read mode and khugepaged takes it in write mode. So now it's
* safe to run pte_offset_map().

This comment hints at several difficulties, which anon THP solved
for itself with mmap_sem and anon_vma lock, but which huge tmpfs
may need to solve differently.

The reference to pte_offset_map() above: I believe that's a hint
that on a 32-bit machine, the pagetables might need to come from
kernel-mapped memory, but a huge pmd pointing to user memory beyond
that limit could be racily substituted, causing undefined behavior
in the architecture-dependent pte_offset_map().

That itself is not a problem on x86_64, but there's plenty more:
how about those places which use pte_offset_map_lock()? If that
spinlock is in the struct page of a pagetable, which has been
deposited and might be withdrawn and freed at any moment (being
on a list unattached to the allocating pmd in the case of x86),
then taking the spinlock might corrupt someone else's struct page.

Because THP has departed from the earlier rules (when pagetable
was only freed under exclusive mmap_sem, or at exit_mmap, after
removing all affected vmas from the rmap list): zap_huge_pmd()
does pte_free() even when serving MADV_DONTNEED under down_read
of mmap_sem.

And what of the "entry = *pte" at the start of handle_pte_fault(),
getting the entry used in pte_same(,orig_pte) tests to validate all
fault handling? If that entry can itself be junk picked out of some
freed and reused pagetable, it's hard to estimate the consequences.

We need to consider the safety of concurrent faults, and the
safety of rmap lookups, and the safety of miscellaneous operations
such as smaps_pte_range() for reading /proc/<pid>/smaps.

I set out to make safe the places which descend pgd,pud,pmd,pte,
using more careful access techniques like mm_find_pmd(); but with
pte_offset_map() being architecture-defined, found it too big a job
to tighten up all over.

Instead, approach from the opposite direction: just do not expose
a pagetable in an empty *pmd, until vm_ops->fault has had a chance
to ask for a huge pmd there. This is a much easier change to make,
and we are lucky that all the driver faults appear to be using
interfaces (like vm_insert_page() and remap_pfn_range()) which
automatically do the pte_alloc() if it was not already done.

But we must not get stuck refaulting: need FAULT_FLAG_MAY_HUGE for
__do_fault() to tell shmem_fault() to try for huge only when *pmd is
empty (could instead add pmd to vmf and let shmem work that out for
itself, but probably better to hide the pmd from vm_ops->fault).

Without a pagetable to hold the pte_none() entry found in a newly
allocated pagetable, handle_pte_fault() would like to provide a static
none entry for later orig_pte checks. But architectures have never had
to provide that definition before; and although almost all use zeroes
for an empty pagetable, a few do not - nios2, s390, um, xtensa.

Never mind, forget about pte_same(,orig_pte), the three __do_fault()
callers can follow do_anonymous_page(), and just use a pte_none() check.

do_fault_around() presents one last problem: it wants the pagetable to
have been allocated, but was being called by do_read_fault() before
__do_fault(). I see no disadvantage to moving it after, allowing the huge
pmd to be chosen first; but Kirill reports an additional radix-tree lookup
in the hot pagecache case when he implemented faultaround: needs further
investigation.

Note: after months of use, we recently hit an OOM deadlock: this patch
moves the new pagetable allocation inside where page lock is held on a
pagecache page, and exit's munlock_vma_pages_all() takes page lock on
all mlocked pages. Both parties are behaving badly: we hope to change
munlock to use trylock_page() instead, but should certainly switch here
to preallocating the pagetable outside the page lock. But I've not yet
written and tested that change.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/filemap.c | 10 +-
mm/memory.c | 215 ++++++++++++++++++++++++++-----------------------
2 files changed, 123 insertions(+), 102 deletions(-)

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2147,6 +2147,10 @@ void filemap_map_pages(struct vm_area_st
radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, vmf->pgoff) {
if (iter.index > vmf->max_pgoff)
break;
+
+ pte = vmf->pte + iter.index - vmf->pgoff;
+ if (!pte_none(*pte))
+ goto next;
repeat:
page = radix_tree_deref_slot(slot);
if (unlikely(!page))
@@ -2168,6 +2172,8 @@ repeat:
goto repeat;
}

+ VM_BUG_ON_PAGE(page->index != iter.index, page);
+
if (!PageUptodate(page) ||
PageReadahead(page) ||
PageHWPoison(page))
@@ -2182,10 +2188,6 @@ repeat:
if (page->index >= size >> PAGE_SHIFT)
goto unlock;

- pte = vmf->pte + page->index - vmf->pgoff;
- if (!pte_none(*pte))
- goto unlock;
-
if (file->f_ra.mmap_miss > 0)
file->f_ra.mmap_miss--;
addr = address + (page->index - vmf->pgoff) * PAGE_SIZE;
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2678,20 +2678,17 @@ static inline int check_stack_guard_page

/*
* We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with mmap_sem still held, but pte unmapped and unlocked.
+ * but allow concurrent faults). We return with mmap_sem still held.
*/
static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags)
+ unsigned long address, pmd_t *pmd, unsigned int flags)
{
struct mem_cgroup *memcg;
+ pte_t *page_table;
struct page *page;
spinlock_t *ptl;
pte_t entry;

- pte_unmap(page_table);
-
/* File mapping without ->vm_ops ? */
if (vma->vm_flags & VM_SHARED)
return VM_FAULT_SIGBUS;
@@ -2700,6 +2697,27 @@ static int do_anonymous_page(struct mm_s
if (check_stack_guard_page(vma, address) < 0)
return VM_FAULT_SIGSEGV;

+ /*
+ * Use pte_alloc instead of pte_alloc_map, because we can't
+ * run pte_offset_map on the pmd, if an huge pmd could
+ * materialize from under us from a different thread.
+ */
+ if (unlikely(pte_alloc(mm, pmd, address)))
+ return VM_FAULT_OOM;
+ /*
+ * If a huge pmd materialized under us just retry later. Use
+ * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
+ * didn't become pmd_trans_huge under us and then back to pmd_none, as
+ * a result of MADV_DONTNEED running immediately after a huge pmd fault
+ * in a different thread of this mm, in turn leading to a misleading
+ * pmd_trans_huge() retval. All we have to ensure is that it is a
+ * regular pmd that we can walk with pte_offset_map() and we can do that
+ * through an atomic read in C, which is what pmd_trans_unstable()
+ * provides.
+ */
+ if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
+ return 0;
+
/* Use the zero-page for reads */
if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
@@ -2778,8 +2796,8 @@ oom:
* See filemap_fault() and __lock_page_retry().
*/
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
- pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+ pmd_t *pmd, pgoff_t pgoff, unsigned int flags,
+ struct page *cow_page, struct page **page)
{
struct vm_fault vmf;
int ret;
@@ -2797,21 +2815,40 @@ static int __do_fault(struct vm_area_str
if (!vmf.page)
goto out;

- if (unlikely(PageHWPoison(vmf.page))) {
- if (ret & VM_FAULT_LOCKED)
- unlock_page(vmf.page);
- put_page(vmf.page);
- return VM_FAULT_HWPOISON;
- }
-
if (unlikely(!(ret & VM_FAULT_LOCKED)))
lock_page(vmf.page);
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);

+ if (unlikely(PageHWPoison(vmf.page))) {
+ ret = VM_FAULT_HWPOISON;
+ goto err;
+ }
+
+ /*
+ * Use pte_alloc instead of pte_alloc_map, because we can't
+ * run pte_offset_map on the pmd, if an huge pmd could
+ * materialize from under us from a different thread.
+ */
+ if (unlikely(pte_alloc(vma->vm_mm, pmd, address))) {
+ ret = VM_FAULT_OOM;
+ goto err;
+ }
+ /*
+ * If a huge pmd materialized under us just retry later. Allow for
+ * a racing transition of huge pmd to none to huge pmd or pagetable.
+ */
+ if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd))) {
+ ret = VM_FAULT_NOPAGE;
+ goto err;
+ }
out:
*page = vmf.page;
return ret;
+err:
+ unlock_page(vmf.page);
+ put_page(vmf.page);
+ return ret;
}

/**
@@ -2961,32 +2998,19 @@ static void do_fault_around(struct vm_ar

static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+ pgoff_t pgoff, unsigned int flags)
{
struct page *fault_page;
spinlock_t *ptl;
pte_t *pte;
- int ret = 0;
-
- /*
- * Let's call ->map_pages() first and use ->fault() as fallback
- * if page by the offset is not ready to be mapped (cold cache or
- * something).
- */
- if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1) {
- pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- do_fault_around(vma, address, pte, pgoff, flags);
- if (!pte_same(*pte, orig_pte))
- goto unlock_out;
- pte_unmap_unlock(pte, ptl);
- }
+ int ret;

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(vma, address, pmd, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
+ if (unlikely(!pte_none(*pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
put_page(fault_page);
@@ -2994,14 +3018,20 @@ static int do_read_fault(struct mm_struc
}
do_set_pte(vma, address, fault_page, pte, false, false);
unlock_page(fault_page);
-unlock_out:
+
+ /*
+ * Finally call ->map_pages() to fault around the pte we just set.
+ */
+ if (vma->vm_ops->map_pages && fault_around_bytes >> PAGE_SHIFT > 1)
+ do_fault_around(vma, address, pte, pgoff, flags);
+
pte_unmap_unlock(pte, ptl);
return ret;
}

static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+ pgoff_t pgoff, unsigned int flags)
{
struct page *fault_page, *new_page;
struct mem_cgroup *memcg;
@@ -3021,7 +3051,7 @@ static int do_cow_fault(struct mm_struct
return VM_FAULT_OOM;
}

- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(vma, address, pmd, pgoff, flags, new_page, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;

@@ -3030,7 +3060,7 @@ static int do_cow_fault(struct mm_struct
__SetPageUptodate(new_page);

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
+ if (unlikely(!pte_none(*pte))) {
pte_unmap_unlock(pte, ptl);
if (fault_page) {
unlock_page(fault_page);
@@ -3067,7 +3097,7 @@ uncharge_out:

static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long address, pmd_t *pmd,
- pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
+ pgoff_t pgoff, unsigned int flags)
{
struct page *fault_page;
struct address_space *mapping;
@@ -3076,7 +3106,7 @@ static int do_shared_fault(struct mm_str
int dirtied = 0;
int ret, tmp;

- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(vma, address, pmd, pgoff, flags, NULL, &fault_page);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;

@@ -3095,7 +3125,7 @@ static int do_shared_fault(struct mm_str
}

pte = pte_offset_map_lock(mm, pmd, address, &ptl);
- if (unlikely(!pte_same(*pte, orig_pte))) {
+ if (unlikely(!pte_none(*pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
put_page(fault_page);
@@ -3135,22 +3165,18 @@ static int do_shared_fault(struct mm_str
* return value. See filemap_fault() and __lock_page_or_retry().
*/
static int do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, pte_t *page_table, pmd_t *pmd,
- unsigned int flags, pte_t orig_pte)
+ unsigned long address, pmd_t *pmd, unsigned int flags)
{
pgoff_t pgoff = linear_page_index(vma, address);

- pte_unmap(page_table);
/* The VMA was not fully populated on mmap() or missing VM_DONTEXPAND */
if (!vma->vm_ops->fault)
return VM_FAULT_SIGBUS;
if (!(flags & FAULT_FLAG_WRITE))
- return do_read_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
+ return do_read_fault(mm, vma, address, pmd, pgoff, flags);
if (!(vma->vm_flags & VM_SHARED))
- return do_cow_fault(mm, vma, address, pmd, pgoff, flags,
- orig_pte);
- return do_shared_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
+ return do_cow_fault(mm, vma, address, pmd, pgoff, flags);
+ return do_shared_fault(mm, vma, address, pmd, pgoff, flags);
}

static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
@@ -3290,20 +3316,49 @@ static int wp_huge_pmd(struct mm_struct
* with external mmu caches can use to update those (ie the Sparc or
* PowerPC hashed page tables that act as extended TLBs).
*
- * We enter with non-exclusive mmap_sem (to exclude vma changes,
- * but allow concurrent faults), and pte mapped but not yet locked.
- * We return with pte unmapped and unlocked.
- *
+ * We enter with non-exclusive mmap_sem
+ * (to exclude vma changes, but allow concurrent faults).
* The mmap_sem may have been released depending on flags and our
* return value. See filemap_fault() and __lock_page_or_retry().
*/
-static int handle_pte_fault(struct mm_struct *mm,
- struct vm_area_struct *vma, unsigned long address,
- pte_t *pte, pmd_t *pmd, unsigned int flags)
+static int handle_pte_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmd, unsigned int flags)
{
+ pmd_t pmdval;
+ pte_t *pte;
pte_t entry;
spinlock_t *ptl;

+ /* If a huge pmd materialized under us just retry later */
+ pmdval = *pmd;
+ barrier();
+ if (unlikely(pmd_trans_huge(pmdval) || pmd_devmap(pmdval)))
+ return 0;
+
+ if (unlikely(pmd_none(pmdval))) {
+ /*
+ * Leave pte_alloc() until later: because huge tmpfs may
+ * want to map_team_by_pmd(), and if we expose page table
+ * for an instant, it will be difficult to retract from
+ * concurrent faults and from rmap lookups.
+ */
+ pte = NULL;
+ } else {
+ /*
+ * A regular pmd is established and it can't morph into a huge
+ * pmd from under us anymore at this point because we hold the
+ * mmap_sem read mode and khugepaged takes it in write mode.
+ * So now it's safe to run pte_offset_map().
+ */
+ pte = pte_offset_map(pmd, address);
+ entry = *pte;
+ barrier();
+ if (pte_none(entry)) {
+ pte_unmap(pte);
+ pte = NULL;
+ }
+ }
+
/*
* some architectures can have larger ptes than wordsize,
* e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and CONFIG_32BIT=y,
@@ -3312,21 +3367,14 @@ static int handle_pte_fault(struct mm_st
* we later double check anyway with the ptl lock held. So here
* a barrier will do.
*/
- entry = *pte;
- barrier();
- if (!pte_present(entry)) {
- if (pte_none(entry)) {
- if (vma_is_anonymous(vma))
- return do_anonymous_page(mm, vma, address,
- pte, pmd, flags);
- else
- return do_fault(mm, vma, address, pte, pmd,
- flags, entry);
- }
- return do_swap_page(mm, vma, address,
- pte, pmd, flags, entry);
- }

+ if (!pte) {
+ if (!vma_is_anonymous(vma))
+ return do_fault(mm, vma, address, pmd, flags);
+ return do_anonymous_page(mm, vma, address, pmd, flags);
+ }
+ if (!pte_present(entry))
+ return do_swap_page(mm, vma, address, pte, pmd, flags, entry);
if (pte_protnone(entry))
return do_numa_page(mm, vma, address, entry, pte, pmd);

@@ -3370,7 +3418,6 @@ static int __handle_mm_fault(struct mm_s
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
- pte_t *pte;

if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
flags & FAULT_FLAG_INSTRUCTION,
@@ -3416,35 +3463,7 @@ static int __handle_mm_fault(struct mm_s
}
}

- /*
- * Use pte_alloc() instead of pte_alloc_map, because we can't
- * run pte_offset_map on the pmd, if an huge pmd could
- * materialize from under us from a different thread.
- */
- if (unlikely(pte_alloc(mm, pmd, address)))
- return VM_FAULT_OOM;
- /*
- * If a huge pmd materialized under us just retry later. Use
- * pmd_trans_unstable() instead of pmd_trans_huge() to ensure the pmd
- * didn't become pmd_trans_huge under us and then back to pmd_none, as
- * a result of MADV_DONTNEED running immediately after a huge pmd fault
- * in a different thread of this mm, in turn leading to a misleading
- * pmd_trans_huge() retval. All we have to ensure is that it is a
- * regular pmd that we can walk with pte_offset_map() and we can do that
- * through an atomic read in C, which is what pmd_trans_unstable()
- * provides.
- */
- if (unlikely(pmd_trans_unstable(pmd) || pmd_devmap(*pmd)))
- return 0;
- /*
- * A regular pmd is established and it can't morph into a huge pmd
- * from under us anymore at this point because we hold the mmap_sem
- * read mode and khugepaged takes it in write mode. So now it's
- * safe to run pte_offset_map().
- */
- pte = pte_offset_map(pmd, address);
-
- return handle_pte_fault(mm, vma, address, pte, pmd, flags);
+ return handle_pte_fault(mm, vma, address, pmd, flags);
}

/*

2016-04-05 21:25:40

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 10/31] huge tmpfs: map shmem by huge page pmd or by page team ptes

This is the commit which at last gets huge mappings of tmpfs working,
as can be seen from the ShmemPmdMapped line of /proc/meminfo.

The main thing here is the trio of functions map_team_by_pmd(),
unmap_team_by_pmd() and remap_team_by_ptes() added to huge_memory.c;
and of course the enablement of FAULT_FLAG_MAY_HUGE from memory.c
to shmem.c, with VM_FAULT_HUGE back from shmem.c to memory.c. There
are also one-line and few-line changes scattered throughout huge_memory.c.
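
In sketch form (condensed and reordered, not the literal code), the
handoff between memory.c and shmem.c is:

        if (pmd_none(*pmd))                     /* __do_fault() */
                vmf.flags |= FAULT_FLAG_MAY_HUGE;
        ret = vma->vm_ops->fault(vma, &vmf);    /* shmem_fault() */
        if (ret & VM_FAULT_HUGE)        /* vmf.page is the locked team head */
                ret |= map_team_by_pmd(vma, address, pmd, vmf.page);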

Huge tmpfs is relying on the pmd_trans_huge() page table hooks which
the original Anonymous THP project placed throughout mm; but skips
almost all of its complications, going to its own simpler handling.

Kirill has a much better idea of what copy_huge_pmd() should do for
pagecache: nothing, just as we don't copy shared file ptes. I shall
adopt his idea in a future version, but for now show how to dup team.

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/vm/transhuge.txt | 38 ++++-
include/linux/pageteam.h | 48 ++++++
mm/huge_memory.c | 229 +++++++++++++++++++++++++++++--
mm/memory.c | 12 +
4 files changed, 307 insertions(+), 20 deletions(-)

--- a/Documentation/vm/transhuge.txt
+++ b/Documentation/vm/transhuge.txt
@@ -9,8 +9,8 @@ using huge pages for the backing of virt
that supports the automatic promotion and demotion of page sizes and
without the shortcomings of hugetlbfs.

-Currently it only works for anonymous memory mappings but in the
-future it can expand over the pagecache layer starting with tmpfs.
+Initially it only worked for anonymous memory mappings, but then was
+extended to the pagecache layer, starting with tmpfs.

The reason applications are running faster is because of two
factors. The first factor is almost completely irrelevant and it's not
@@ -57,9 +57,8 @@ miss is going to run faster.
feature that applies to all dynamic high order allocations in the
kernel)

-- this initial support only offers the feature in the anonymous memory
- regions but it'd be ideal to move it to tmpfs and the pagecache
- later
+- initial support only offered the feature in anonymous memory regions,
+ but then it was extended to huge tmpfs pagecache: see section below.

Transparent Hugepage Support maximizes the usefulness of free memory
if compared to the reservation approach of hugetlbfs by allowing all
@@ -458,3 +457,32 @@ exit(2) if an THP crosses VMA boundary.
Function deferred_split_huge_page() is used to queue page for splitting.
The splitting itself will happen when we get memory pressure via shrinker
interface.
+
+== Huge tmpfs ==
+
+Transparent hugepages were implemented much later in tmpfs.
+That implementation shares much of the "pmd" infrastructure
+devised for anonymous hugepages, and their reliance on compaction.
+
+But unlike hugetlbfs, which has always been free to impose its own
+restrictions, a transparent implementation of pagecache in tmpfs must
+be able to support files both large and small, with large extents
+mapped by hugepage pmds at the same time as small extents (of the
+very same pagecache) are mapped by ptes. For this reason, the
+compound pages used for hugetlbfs and anonymous hugepages were found
+unsuitable, and the opposite approach taken: the high-order backing
+page is split from the start, and managed as a team of partially
+independent small cache pages.
+
+Huge tmpfs is enabled simply by a "huge=1" mount option, and does not
+attend to the boot options, sysfs settings and madvise controlling
+anonymous hugepages. Huge tmpfs recovery (putting a hugepage back
+together after it was disbanded for reclaim, or after a period of
+fragmentation) is done by a workitem scheduled from fault, without
+involving khugepaged at all.
+
+For more info on huge tmpfs, see Documentation/filesystems/tmpfs.txt.
+It is an open question whether that implementation forms the basis for
+extending transparent hugepages to other filesystems' pagecache: in its
+present form, it makes use of struct page's private field, available on
+tmpfs, but already in use on most other filesystems.
--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -29,10 +29,56 @@ static inline struct page *team_head(str
return head;
}

-/* Temporary stub for mm/rmap.c until implemented in mm/huge_memory.c */
+/*
+ * Returns true if this team is mapped by pmd somewhere.
+ */
+static inline bool team_pmd_mapped(struct page *head)
+{
+ return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+}
+
+/*
+ * Returns true if this was the first mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool inc_team_pmd_mapped(struct page *head)
+{
+ return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+}
+
+/*
+ * Returns true if this was the last mapping by pmd, whereupon mapped stats
+ * need to be updated.
+ */
+static inline bool dec_team_pmd_mapped(struct page *head)
+{
+ return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+}
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int map_team_by_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd, struct page *page);
+void unmap_team_by_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd, struct page *page);
+void remap_team_by_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd);
+#else
+static inline int map_team_by_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd, struct page *page)
+{
+ VM_BUG_ON_PAGE(1, page);
+ return 0;
+}
static inline void unmap_team_by_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd, struct page *page)
{
+ VM_BUG_ON_PAGE(1, page);
+}
+static inline void remap_team_by_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd)
+{
+ VM_BUG_ON(1);
}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#endif /* _LINUX_PAGETEAM_H */
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -25,6 +25,7 @@
#include <linux/mman.h>
#include <linux/memremap.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/debugfs.h>
#include <linux/migrate.h>
#include <linux/hashtable.h>
@@ -63,6 +64,8 @@ enum scan_result {
#define CREATE_TRACE_POINTS
#include <trace/events/huge_memory.h>

+static void page_remove_team_rmap(struct page *);
+
/*
* By default transparent hugepage support is disabled in order that avoid
* to risk increase the memory footprint of applications without a guaranteed
@@ -1120,17 +1123,23 @@ int copy_huge_pmd(struct mm_struct *dst_
if (!vma_is_dax(vma)) {
/* thp accounting separate from pmd_devmap accounting */
src_page = pmd_page(pmd);
- VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
get_page(src_page);
- page_dup_rmap(src_page, true);
- add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+ if (PageAnon(src_page)) {
+ VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
+ page_dup_rmap(src_page, true);
+ pmdp_set_wrprotect(src_mm, addr, src_pmd);
+ pmd = pmd_wrprotect(pmd);
+ } else {
+ VM_BUG_ON_PAGE(!PageTeam(src_page), src_page);
+ page_dup_rmap(src_page, false);
+ inc_team_pmd_mapped(src_page);
+ }
+ add_mm_counter(dst_mm, mm_counter(src_page), HPAGE_PMD_NR);
atomic_long_inc(&dst_mm->nr_ptes);
pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
}

- pmdp_set_wrprotect(src_mm, addr, src_pmd);
- pmd = pmd_mkold(pmd_wrprotect(pmd));
- set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+ set_pmd_at(dst_mm, addr, dst_pmd, pmd_mkold(pmd));

ret = 0;
out_unlock:
@@ -1429,7 +1438,7 @@ struct page *follow_trans_huge_pmd(struc
goto out;

page = pmd_page(*pmd);
- VM_BUG_ON_PAGE(!PageHead(page), page);
+ VM_BUG_ON_PAGE(!PageHead(page) && !PageTeam(page), page);
if (flags & FOLL_TOUCH)
touch_pmd(vma, addr, pmd);
if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
@@ -1454,7 +1463,7 @@ struct page *follow_trans_huge_pmd(struc
}
}
page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;
- VM_BUG_ON_PAGE(!PageCompound(page), page);
+ VM_BUG_ON_PAGE(!PageCompound(page) && !PageTeam(page), page);
if (flags & FOLL_GET)
get_page(page);

@@ -1692,10 +1701,12 @@ int zap_huge_pmd(struct mmu_gather *tlb,
put_huge_zero_page();
} else {
struct page *page = pmd_page(orig_pmd);
- page_remove_rmap(page, true);
+ if (PageTeam(page))
+ page_remove_team_rmap(page);
+ page_remove_rmap(page, PageHead(page));
VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
- add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
- VM_BUG_ON_PAGE(!PageHead(page), page);
+ VM_BUG_ON_PAGE(!PageHead(page) && !PageTeam(page), page);
+ add_mm_counter(tlb->mm, mm_counter(page), -HPAGE_PMD_NR);
pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
atomic_long_dec(&tlb->mm->nr_ptes);
spin_unlock(ptl);
@@ -1739,7 +1750,7 @@ bool move_huge_pmd(struct vm_area_struct
VM_BUG_ON(!pmd_none(*new_pmd));

if (pmd_move_must_withdraw(new_ptl, old_ptl) &&
- vma_is_anonymous(vma)) {
+ !vma_is_dax(vma)) {
pgtable_t pgtable;
pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
@@ -1789,7 +1800,6 @@ int change_huge_pmd(struct vm_area_struc
entry = pmd_mkwrite(entry);
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(!preserve_write && pmd_write(entry));
}
spin_unlock(ptl);
}
@@ -2991,6 +3001,11 @@ void __split_huge_pmd(struct vm_area_str
struct mm_struct *mm = vma->vm_mm;
unsigned long haddr = address & HPAGE_PMD_MASK;

+ if (!vma_is_anonymous(vma) && !vma->vm_ops->pmd_fault) {
+ remap_team_by_ptes(vma, address, pmd);
+ return;
+ }
+
mmu_notifier_invalidate_range_start(mm, haddr, haddr + HPAGE_PMD_SIZE);
ptl = pmd_lock(mm, pmd);
if (pmd_trans_huge(*pmd)) {
@@ -3469,4 +3484,190 @@ static int __init split_huge_pages_debug
return 0;
}
late_initcall(split_huge_pages_debugfs);
-#endif
+#endif /* CONFIG_DEBUG_FS */
+
+/*
+ * huge pmd support for huge tmpfs
+ */
+
+static void page_add_team_rmap(struct page *page)
+{
+ VM_BUG_ON_PAGE(PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ if (inc_team_pmd_mapped(page))
+ __inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+static void page_remove_team_rmap(struct page *page)
+{
+ VM_BUG_ON_PAGE(PageAnon(page), page);
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ if (dec_team_pmd_mapped(page))
+ __dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+}
+
+int map_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable;
+ spinlock_t *pml;
+ pmd_t pmdval;
+ int ret = VM_FAULT_NOPAGE;
+
+ /*
+ * Another task may have mapped it in just ahead of us; but we
+ * have the huge page locked, so others will wait on us now... or,
+ * is there perhaps some way another might still map in a single pte?
+ */
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ if (!pmd_none(*pmd))
+ goto raced2;
+
+ addr &= HPAGE_PMD_MASK;
+ pgtable = pte_alloc_one(mm, addr);
+ if (!pgtable) {
+ ret = VM_FAULT_OOM;
+ goto raced2;
+ }
+
+ pml = pmd_lock(mm, pmd);
+ if (!pmd_none(*pmd))
+ goto raced1;
+ pmdval = mk_pmd(page, vma->vm_page_prot);
+ pmdval = pmd_mkhuge(pmd_mkdirty(pmdval));
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, addr, pmd, pmdval);
+ page_add_file_rmap(page);
+ page_add_team_rmap(page);
+ update_mmu_cache_pmd(vma, addr, pmd);
+ atomic_long_inc(&mm->nr_ptes);
+ spin_unlock(pml);
+
+ unlock_page(page);
+ add_mm_counter(mm, MM_SHMEMPAGES, HPAGE_PMD_NR);
+ return ret;
+raced1:
+ spin_unlock(pml);
+ pte_free(mm, pgtable);
+raced2:
+ unlock_page(page);
+ put_page(page);
+ return ret;
+}
+
+void unmap_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgtable_t pgtable = NULL;
+ unsigned long end;
+ spinlock_t *pml;
+
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+ /*
+ * But even so there might be a racing zap_huge_pmd() or
+ * remap_team_by_ptes() while the page_table_lock is dropped.
+ */
+
+ addr &= HPAGE_PMD_MASK;
+ end = addr + HPAGE_PMD_SIZE;
+
+ mmu_notifier_invalidate_range_start(mm, addr, end);
+ pml = pmd_lock(mm, pmd);
+ if (pmd_trans_huge(*pmd) && pmd_page(*pmd) == page) {
+ pmdp_huge_clear_flush(vma, addr, pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ page_remove_team_rmap(page);
+ page_remove_rmap(page, false);
+ atomic_long_dec(&mm->nr_ptes);
+ }
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(mm, addr, end);
+
+ if (!pgtable)
+ return;
+
+ pte_free(mm, pgtable);
+ update_hiwater_rss(mm);
+ add_mm_counter(mm, MM_SHMEMPAGES, -HPAGE_PMD_NR);
+ put_page(page);
+}
+
+void remap_team_by_ptes(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *head;
+ struct page *page;
+ pgtable_t pgtable;
+ unsigned long end;
+ spinlock_t *pml;
+ spinlock_t *ptl;
+ pte_t *pte;
+ pmd_t _pmd;
+ pmd_t pmdval;
+ pte_t pteval;
+
+ addr &= HPAGE_PMD_MASK;
+ end = addr + HPAGE_PMD_SIZE;
+
+ mmu_notifier_invalidate_range_start(mm, addr, end);
+ pml = pmd_lock(mm, pmd);
+ if (!pmd_trans_huge(*pmd))
+ goto raced;
+
+ page = head = pmd_page(*pmd);
+ pmdval = pmdp_huge_clear_flush(vma, addr, pmd);
+ pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+ pmd_populate(mm, &_pmd, pgtable);
+ ptl = pte_lockptr(mm, &_pmd);
+ if (ptl != pml)
+ spin_lock(ptl);
+ pmd_populate(mm, pmd, pgtable);
+ update_mmu_cache_pmd(vma, addr, pmd);
+
+ /*
+ * It would be nice to have prepared this page table in advance,
+ * so we could just switch from pmd to ptes under one lock.
+ * But a comment in zap_huge_pmd() warns that ppc64 needs
+ * to look at the deposited page table when clearing the pmd.
+ */
+ pte = pte_offset_map(pmd, addr);
+ do {
+ pteval = pte_mkdirty(mk_pte(page, vma->vm_page_prot));
+ if (!pmd_young(pmdval))
+ pteval = pte_mkold(pteval);
+ set_pte_at(mm, addr, pte, pteval);
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ if (page != head) {
+ page_add_file_rmap(page);
+ get_page(page);
+ }
+ /*
+ * Move page flags from head to page,
+ * as __split_huge_page_tail() does for anon?
+ * Start off by assuming not, but reconsider later.
+ */
+ } while (pte++, page++, addr += PAGE_SIZE, addr != end);
+
+ /*
+ * remap_team_by_ptes() is called from various locking contexts.
+ * Don't dec_team_pmd_mapped() until after that page table has been
+ * completed (with atomic_long_sub_return supplying a barrier):
+ * otherwise shmem_disband_hugeteam() may disband it concurrently,
+ * and pages be freed while mapped.
+ */
+ page_remove_team_rmap(head);
+
+ pte -= HPAGE_PMD_NR;
+ addr -= HPAGE_PMD_SIZE;
+ if (ptl != pml)
+ spin_unlock(ptl);
+ pte_unmap(pte);
+raced:
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(mm, addr, end);
+}
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -45,6 +45,7 @@
#include <linux/swap.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
#include <linux/export.h>
@@ -2809,11 +2810,21 @@ static int __do_fault(struct vm_area_str
vmf.gfp_mask = __get_fault_gfp_mask(vma);
vmf.cow_page = cow_page;

+ /*
+ * Give huge pmd a chance before allocating pte or trying fault around.
+ */
+ if (unlikely(pmd_none(*pmd)))
+ vmf.flags |= FAULT_FLAG_MAY_HUGE;
+
ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
if (!vmf.page)
goto out;
+ if (unlikely(ret & VM_FAULT_HUGE)) {
+ ret |= map_team_by_pmd(vma, address, pmd, vmf.page);
+ return ret;
+ }

if (unlikely(!(ret & VM_FAULT_LOCKED)))
lock_page(vmf.page);
@@ -3304,6 +3315,7 @@ static int wp_huge_pmd(struct mm_struct
return do_huge_pmd_wp_page(mm, vma, address, pmd, orig_pmd);
if (vma->vm_ops->pmd_fault)
return vma->vm_ops->pmd_fault(vma, address, pmd, flags);
+ remap_team_by_ptes(vma, address, pmd);
return VM_FAULT_FALLBACK;
}


2016-04-05 21:29:17

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 11/31] huge tmpfs: disband split huge pmds on race or memory failure

Andres L-C has pointed out that the single-page unmap_mapping_range()
fallback in truncate_inode_page() cannot protect against the case when
a huge page was faulted in after the full-range unmap_mapping_range():
because page_mapped(page) checks tail page's mapcount, not the head's.

So, there's a danger that hole-punching (and maybe even truncation)
can free pages while they are mapped into userspace with a huge pmd.
And I don't believe that the CVE-2014-4171 protection in shmem_fault()
can fully protect from this, although it does make it much harder.

Fix that by adding a duplicate single-page unmap_mapping_range()
into shmem_disband_hugeteam() (called when punching or truncating
a PageTeam), at the point when we also hold the head's page lock
(without which there would still be races): which will then split
all huge pmd mappings covering the page into team pte mappings.

This is also just what's needed to handle memory_failure() correctly:
provide custom shmem_error_remove_page(), call shmem_disband_hugeteam()
from that before proceeding to generic_error_remove_page(), then this
additional unmap_mapping_range() will remap team by ptes as needed.

(There is an unlikely case that we're racing with another disbander,
or disband didn't get trylock on head page at first: memory_failure()
has almost finished with the page, so it's safe to unlock and relock
before retrying.)

But there is one further change needed in hwpoison_user_mappings():
it must recognize a hugely mapped team before concluding that the
page is not mapped. (And still no support for soft_offline(),
which will have to wait for page migration of teams.)
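
In sketch form, that last check amounts to (as in the mm/memory-failure.c
hunk below: a team mapped only by huge pmd leaves page_mapped() false on
its individual pages):

        mapped = page_mapped(hpage);
        if (PageTeam(p) && team_pmd_mapped(team_head(p)))
                mapped = true;
        if (!mapped)
                return SWAP_SUCCESS;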

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memory-failure.c | 7 ++++++-
mm/shmem.c | 30 +++++++++++++++++++++++++++++-
2 files changed, 35 insertions(+), 2 deletions(-)

--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -45,6 +45,7 @@
#include <linux/rmap.h>
#include <linux/export.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/swap.h>
#include <linux/backing-dev.h>
#include <linux/migrate.h>
@@ -902,6 +903,7 @@ static int hwpoison_user_mappings(struct
enum ttu_flags ttu = TTU_UNMAP | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS;
struct address_space *mapping;
LIST_HEAD(tokill);
+ bool mapped;
int ret;
int kill = 1, forcekill;
struct page *hpage = *hpagep;
@@ -919,7 +921,10 @@ static int hwpoison_user_mappings(struct
* This check implies we don't kill processes if their pages
* are in the swap cache early. Those are always late kills.
*/
- if (!page_mapped(hpage))
+ mapped = page_mapped(hpage);
+ if (PageTeam(p) && team_pmd_mapped(team_head(p)))
+ mapped = true;
+ if (!mapped)
return SWAP_SUCCESS;

if (PageKsm(p)) {
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -605,6 +605,19 @@ static void shmem_disband_hugeteam(struc
}

/*
+ * truncate_inode_page() will unmap page if page_mapped(page),
+ * but there's a race by which the team could be hugely mapped,
+ * with page_mapped(page) saying false. So check here if the
+ * head is hugely mapped, and if so unmap page to remap team.
+ * Use a loop because there is no good locking against a
+ * concurrent remap_team_by_ptes().
+ */
+ while (team_pmd_mapped(head)) {
+ unmap_mapping_range(page->mapping,
+ (loff_t)page->index << PAGE_SHIFT, PAGE_SIZE, 0);
+ }
+
+ /*
* Disable preemption because truncation may end up spinning until a
* tail PageTeam has been cleared: we hold the lock as briefly as we
* can (splitting disband in two stages), but better not be preempted.
@@ -1305,6 +1318,21 @@ static int shmem_getattr(struct vfsmount
return 0;
}

+static int shmem_error_remove_page(struct address_space *mapping,
+ struct page *page)
+{
+ if (PageTeam(page)) {
+ shmem_disband_hugeteam(page);
+ while (unlikely(PageTeam(page))) {
+ unlock_page(page);
+ cond_resched();
+ lock_page(page);
+ shmem_disband_hugeteam(page);
+ }
+ }
+ return generic_error_remove_page(mapping, page);
+}
+
static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
{
struct inode *inode = d_inode(dentry);
@@ -4088,7 +4116,7 @@ static const struct address_space_operat
#ifdef CONFIG_MIGRATION
.migratepage = migrate_page,
#endif
- .error_remove_page = generic_error_remove_page,
+ .error_remove_page = shmem_error_remove_page,
};

static const struct file_operations shmem_file_operations = {

2016-04-05 21:33:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 12/31] huge tmpfs: extend get_user_pages_fast to shmem pmd

The arch-specific get_user_pages_fast() has a gup_huge_pmd() designed to
optimize the refcounting on anonymous THP and hugetlbfs pages, with one
atomic addition to compound head's common refcount. That optimization
must be avoided on huge tmpfs team pages, which use normal separate page
refcounting. We could combine the PageTeam and PageCompound cases into
a single simple loop, but would lose the compound optimization that way.
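
As a rough sketch of the difference (simplified; the compound path shown
is only indicative of the existing code, and several of the per-arch
versions also recheck that the pmd has not changed under them):

        if (PageTeam(head)) {
                /* huge tmpfs team: ordinary reference per small page */
                do {
                        get_page(page);
                        pages[(*nr)++] = page;
                        page++;
                } while (addr += PAGE_SIZE, addr != end);
                return 1;
        }
        /* compound page: count subpages first, then one atomic addition
         * of refs to the compound head's common refcount */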

One cannot go through these functions without wondering why some arches
(x86, mips) like to SetPageReferenced, while the rest do not: an x86
optimization that missed being propagated to the other architectures?
No, see commit 8ee53820edfd ("thp: mmu_notifier_test_young"): it's a
KVM GRU EPT thing, maybe not useful beyond x86. I've just followed
the established practice in each architecture.

Signed-off-by: Hugh Dickins <[email protected]>
---
Cc'ed to arch maintainers as an FYI: this patch is not expected to
go into the tree in the next few weeks, and depends upon a PageTeam
definition not yet available outside this huge tmpfs patchset.
Please refer to linux-mm or linux-kernel for more context.

arch/mips/mm/gup.c | 15 ++++++++++++++-
arch/s390/mm/gup.c | 19 ++++++++++++++++++-
arch/sparc/mm/gup.c | 19 ++++++++++++++++++-
arch/x86/mm/gup.c | 15 ++++++++++++++-
mm/gup.c | 19 ++++++++++++++++++-
5 files changed, 82 insertions(+), 5 deletions(-)

--- a/arch/mips/mm/gup.c
+++ b/arch/mips/mm/gup.c
@@ -81,9 +81,22 @@ static int gup_huge_pmd(pmd_t pmd, unsig
VM_BUG_ON(pte_special(pte));
VM_BUG_ON(!pfn_valid(pte_pfn(pte)));

- refs = 0;
head = pte_page(pte);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageTeam(head)) {
+ /* Handle a huge tmpfs team with normal refcounting. */
+ do {
+ get_page(page);
+ SetPageReferenced(page);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ } while (addr += PAGE_SIZE, addr != end);
+ return 1;
+ }
+
+ refs = 0;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
--- a/arch/s390/mm/gup.c
+++ b/arch/s390/mm/gup.c
@@ -66,9 +66,26 @@ static inline int gup_huge_pmd(pmd_t *pm
return 0;
VM_BUG_ON(!pfn_valid(pmd_val(pmd) >> PAGE_SHIFT));

- refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageTeam(head)) {
+ /* Handle a huge tmpfs team with normal refcounting. */
+ do {
+ if (!page_cache_get_speculative(page))
+ return 0;
+ if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
+ put_page(page);
+ return 0;
+ }
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ } while (addr += PAGE_SIZE, addr != end);
+ return 1;
+ }
+
+ refs = 0;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
--- a/arch/sparc/mm/gup.c
+++ b/arch/sparc/mm/gup.c
@@ -77,9 +77,26 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
if (write && !pmd_write(pmd))
return 0;

- refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageTeam(head)) {
+ /* Handle a huge tmpfs team with normal refcounting. */
+ do {
+ if (!page_cache_get_speculative(page))
+ return 0;
+ if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
+ put_page(page);
+ return 0;
+ }
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ } while (addr += PAGE_SIZE, addr != end);
+ return 1;
+ }
+
+ refs = 0;
do {
VM_BUG_ON(compound_head(page) != head);
pages[*nr] = page;
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -196,9 +196,22 @@ static noinline int gup_huge_pmd(pmd_t p
/* hugepages are never "special" */
VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL);

- refs = 0;
head = pmd_page(pmd);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageTeam(head)) {
+ /* Handle a huge tmpfs team with normal refcounting. */
+ do {
+ get_page(page);
+ SetPageReferenced(page);
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ } while (addr += PAGE_SIZE, addr != end);
+ return 1;
+ }
+
+ refs = 0;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1247,9 +1247,26 @@ static int gup_huge_pmd(pmd_t orig, pmd_
if (write && !pmd_write(orig))
return 0;

- refs = 0;
head = pmd_page(orig);
page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
+
+ if (PageTeam(head)) {
+ /* Handle a huge tmpfs team with normal refcounting. */
+ do {
+ if (!page_cache_get_speculative(page))
+ return 0;
+ if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
+ put_page(page);
+ return 0;
+ }
+ pages[*nr] = page;
+ (*nr)++;
+ page++;
+ } while (addr += PAGE_SIZE, addr != end);
+ return 1;
+ }
+
+ refs = 0;
do {
VM_BUG_ON_PAGE(compound_head(page) != head, page);
pages[*nr] = page;

2016-04-05 21:34:39

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 13/31] huge tmpfs: use Unevictable lru with variable hpage_nr_pages

A big advantage of huge tmpfs over hugetlbfs is that its pages can
be swapped out; but too often it OOMs before swapping them out.

At first I tried changing page_evictable(), to treat all tail pages
of a hugely mapped team as unevictable: the anon LRUs were otherwise
swamped by pages that could not be freed before the head.

That worked quite well, some of the time, but has some drawbacks.

Most obviously, /proc/meminfo is liable to show 511/512ths of all
the ShmemPmdMapped as Unevictable; which is rather sad for a feature
intended to improve on hugetlbfs by letting the pages be swappable.

But more seriously, although it is helpful to have those tails out
of the way on the Unevictable list, page reclaim can very easily come
to a point where all the team heads to be freed are on the Active list,
but the Inactive is large enough that !inactive_anon_is_low(), so the
Active is never scanned to unmap those heads to release all the tails.
Eventually we OOM.

Perhaps that could be dealt with by hacking inactive_anon_is_low():
but it wouldn't help the Unevictable numbers, and has never been
necessary for anon THP. How does anon THP avoid this? It doesn't
put tails on the LRU at all, so doesn't then need to shift them to
Unevictable; but there would still be the danger of an Active list
full of heads, holding the unseen tails, but the ratio too high
for Active scanning - except that hpage_nr_pages() weights each THP
head by the number of small pages the huge page holds, instead of the
usual 1, and that is what keeps the Active/Inactive balance working.

So in this patch we try to do the same for huge tmpfs pages. However,
a team is not one huge compound page, but a collection of independent
pages, and the fair and lazy way to accomplish this seems to be to
transfer each tail's weight to head at the time when shmem_writepage()
has been asked to evict the page, but refuses because the head has not
yet been evicted. So although the failed-to-be-evicted tails are moved
to the Unevictable LRU, each counts for 0kB in the Unevictable amount,
its 4kB going to the head in the Active(anon) or Inactive(anon) amount.

With a few exceptions, hpage_nr_pages() is now only called on a
maybe-PageTeam page while under lruvec lock: and we do need to hold
lruvec lock when transferring weight from one page to another.
Exceptions: mlock.c (next patch), subsequently self-correcting calls to
page_evictable(), and the "nr_rotated +=" line in shrink_active_list(),
which has no need to be precise.

(Aside: quite a few of our calls to hpage_nr_pages() are no more than
ways to side-step the THP-off BUILD_BUG_ON() buried in HPAGE_PMD_NR:
we might do better to kill that BUILD_BUG_ON() at last.)

Lru lock is a new overhead, which shmem_disband_hugehead() prefers
to avoid, if the head's weight is just the default 1. And it's not
clear how well this will all play out if different pages of a team
are charged to different memcgs: but the code allows for that, and
it should be fine while that's just an exceptional minority case.

A change I like in principle, but have not made, and do not intend
to make unless we see a workload that demands it: it would be natural
for mark_page_accessed() to retrieve such a 0-weight page from the
Unevictable LRU, assigning it weight again and giving it a new life
on the Active and Inactive LRUs. As it is, I'm hoping PageReferenced
gives a good enough hint as to whether a page should be retained, when
shmem_evictify_hugetails() brings it back from Unevictable to Inactive.
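
For concreteness (an illustrative note, not part of the patch), the
team_usage packing introduced below in include/linux/pageteam.h works
out as follows when HPAGE_PMD_ORDER is 9, e.g. x86_64 with 4kB pages:

        /* bits 0-9:  TEAM_LRU_WEIGHT_MASK = 0x3ff, LRU weight 0..512      */
        /* bit 10 up: TEAM_PAGE_COUNTER = 0x400 added per page in cache    */
        /* complete:  TEAM_COMPLETE = 512 * 0x400 = 0x80000                */
        /* pmd map:   TEAM_MAPPING_COUNTER = 0x400 added per huge mapping  */
        /* so team_usage >= TEAM_PMD_MAPPED (0x80400) means hugely mapped  */

A head freshly counted into the cache thus reads 0x401 (one page plus
LRU weight 1), and a complete team reads 0x80000 plus its head's current
LRU weight. (The next patch shifts the counters up by two bits to make
room for flag bits.)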

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/vm/unevictable-lru.txt | 15 ++
include/linux/huge_mm.h | 14 ++
include/linux/pageteam.h | 48 ++++++
mm/memcontrol.c | 10 +
mm/shmem.c | 173 +++++++++++++++++++++----
mm/swap.c | 5
mm/vmscan.c | 39 +++++
7 files changed, 274 insertions(+), 30 deletions(-)

--- a/Documentation/vm/unevictable-lru.txt
+++ b/Documentation/vm/unevictable-lru.txt
@@ -72,6 +72,8 @@ The unevictable list addresses the follo

(*) Those mapped into VM_LOCKED [mlock()ed] VMAs.

+ (*) Tails owned by huge tmpfs, unevictable until team head page is evicted.
+
The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.

@@ -201,6 +203,15 @@ page_evictable() also checks for mlocked
flag, PG_mlocked (as wrapped by PageMlocked()), which is set when a page is
faulted into a VM_LOCKED vma, or found in a vma being VM_LOCKED.

+page_evictable() also uses hpage_nr_pages(), to check for a huge tmpfs team
+tail page which reached the bottom of the inactive list, but could not be
+evicted at that time because its team head had not yet been evicted. We
+must not evict any member of the team while the whole team is mapped; and
+at present we only disband the team for reclaim when its head is evicted.
+When an inactive tail is held back from eviction, putback_inactive_pages()
+shifts its "weight" of 1 page to the head, to increase pressure on the head,
+but leave the tail as unevictable, without adding to the Unevictable count.
+

VMSCAN'S HANDLING OF UNEVICTABLE PAGES
--------------------------------------
@@ -597,7 +608,9 @@ Some examples of these unevictable pages
unevictable list in mlock_vma_page().

shrink_inactive_list() also diverts any unevictable pages that it finds on the
-inactive lists to the appropriate zone's unevictable list.
+inactive lists to the appropriate zone's unevictable list, adding in those
+huge tmpfs team tails which were rejected by pageout() (shmem_writepage())
+because the team has not yet been disbanded by evicting the head.

shrink_inactive_list() should only see SHM_LOCK'd pages that became SHM_LOCK'd
after shrink_active_list() had moved them to the inactive list, or pages mapped
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -127,10 +127,24 @@ static inline spinlock_t *pmd_trans_huge
else
return NULL;
}
+
+/* Repeat definition from linux/pageteam.h to force error if different */
+#define TEAM_LRU_WEIGHT_MASK ((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
+/*
+ * hpage_nr_pages(page) returns the current LRU weight of the page.
+ * Beware of races when it is used: an Anon THPage might get split,
+ * so may need protection by compound lock or lruvec lock; a huge tmpfs
+ * team page might have weight 1 shifted from tail to head, or back to
+ * tail when disbanded, so may need protection by lruvec lock.
+ */
static inline int hpage_nr_pages(struct page *page)
{
if (unlikely(PageTransHuge(page)))
return HPAGE_PMD_NR;
+ if (PageTeam(page))
+ return atomic_long_read(&page->team_usage) &
+ TEAM_LRU_WEIGHT_MASK;
return 1;
}

--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -30,11 +30,32 @@ static inline struct page *team_head(str
}

/*
+ * Mask for lower bits of team_usage, giving the weight 0..HPAGE_PMD_NR of the
+ * page on its LRU: normal pages have weight 1, tails held unevictable until
+ * head is evicted have weight 0, and the head gathers weight 1..HPAGE_PMD_NR.
+ */
+#define TEAM_LRU_WEIGHT_ONE 1L
+#define TEAM_LRU_WEIGHT_MASK ((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+
+#define TEAM_HIGH_COUNTER (1L << (HPAGE_PMD_ORDER + 1))
+/*
+ * Count how many pages of team are instantiated, as it is built up.
+ */
+#define TEAM_PAGE_COUNTER TEAM_HIGH_COUNTER
+#define TEAM_COMPLETE (TEAM_PAGE_COUNTER << HPAGE_PMD_ORDER)
+/*
+ * And when complete, count how many huge mappings (like page_mapcount): an
+ * incomplete team cannot be hugely mapped (would expose uninitialized holes).
+ */
+#define TEAM_MAPPING_COUNTER TEAM_HIGH_COUNTER
+#define TEAM_PMD_MAPPED (TEAM_COMPLETE + TEAM_MAPPING_COUNTER)
+
+/*
* Returns true if this team is mapped by pmd somewhere.
*/
static inline bool team_pmd_mapped(struct page *head)
{
- return atomic_long_read(&head->team_usage) > HPAGE_PMD_NR;
+ return atomic_long_read(&head->team_usage) >= TEAM_PMD_MAPPED;
}

/*
@@ -43,7 +64,8 @@ static inline bool team_pmd_mapped(struc
*/
static inline bool inc_team_pmd_mapped(struct page *head)
{
- return atomic_long_inc_return(&head->team_usage) == HPAGE_PMD_NR+1;
+ return atomic_long_add_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+ < TEAM_PMD_MAPPED + TEAM_MAPPING_COUNTER;
}

/*
@@ -52,7 +74,27 @@ static inline bool inc_team_pmd_mapped(s
*/
static inline bool dec_team_pmd_mapped(struct page *head)
{
- return atomic_long_dec_return(&head->team_usage) == HPAGE_PMD_NR;
+ return atomic_long_sub_return(TEAM_MAPPING_COUNTER, &head->team_usage)
+ < TEAM_PMD_MAPPED;
+}
+
+static inline void inc_lru_weight(struct page *head)
+{
+ atomic_long_inc(&head->team_usage);
+ VM_BUG_ON_PAGE((atomic_long_read(&head->team_usage) &
+ TEAM_LRU_WEIGHT_MASK) > HPAGE_PMD_NR, head);
+}
+
+static inline void set_lru_weight(struct page *page)
+{
+ VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 0, page);
+ atomic_long_set(&page->team_usage, 1);
+}
+
+static inline void clear_lru_weight(struct page *page)
+{
+ VM_BUG_ON_PAGE(atomic_long_read(&page->team_usage) != 1, page);
+ atomic_long_set(&page->team_usage, 0);
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1047,6 +1047,16 @@ void mem_cgroup_update_lru_size(struct l
*lru_size += nr_pages;

size = *lru_size;
+ if (!size && !empty && lru == LRU_UNEVICTABLE) {
+ struct page *page;
+ /*
+ * The unevictable list might be full of team tail pages of 0
+ * weight: check the first, and skip the warning if that fits.
+ */
+ page = list_first_entry(lruvec->lists + lru, struct page, lru);
+ if (hpage_nr_pages(page) == 0)
+ empty = true;
+ }
if (WARN_ONCE(size < 0 || empty != !size,
"%s(%p, %d, %d): lru_size %ld but %sempty\n",
__func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -63,6 +63,7 @@ static struct vfsmount *shm_mnt;
#include <linux/swapops.h>
#include <linux/pageteam.h>
#include <linux/mempolicy.h>
+#include <linux/mm_inline.h>
#include <linux/namei.h>
#include <linux/ctype.h>
#include <linux/migrate.h>
@@ -372,7 +373,8 @@ static int shmem_freeholes(struct page *
{
unsigned long nr = atomic_long_read(&head->team_usage);

- return (nr >= HPAGE_PMD_NR) ? 0 : HPAGE_PMD_NR - nr;
+ return (nr >= TEAM_COMPLETE) ? 0 :
+ HPAGE_PMD_NR - (nr / TEAM_PAGE_COUNTER);
}

static void shmem_clear_tag_hugehole(struct address_space *mapping,
@@ -399,18 +401,16 @@ static void shmem_added_to_hugeteam(stru
{
struct address_space *mapping = page->mapping;
struct page *head = team_head(page);
- int nr;

if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
- atomic_long_set(&head->team_usage, 1);
+ atomic_long_set(&head->team_usage,
+ TEAM_PAGE_COUNTER + TEAM_LRU_WEIGHT_ONE);
radix_tree_tag_set(&mapping->page_tree, page->index,
SHMEM_TAG_HUGEHOLE);
__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
} else {
- /* We do not need atomic ops until huge page gets mapped */
- nr = atomic_long_read(&head->team_usage) + 1;
- atomic_long_set(&head->team_usage, nr);
- if (nr == HPAGE_PMD_NR) {
+ if (atomic_long_add_return(TEAM_PAGE_COUNTER,
+ &head->team_usage) >= TEAM_COMPLETE) {
shmem_clear_tag_hugehole(mapping, head->index);
__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
}
@@ -456,11 +456,14 @@ static int shmem_populate_hugeteam(struc
return 0;
}

-static int shmem_disband_hugehead(struct page *head)
+static int shmem_disband_hugehead(struct page *head, int *head_lru_weight)
{
struct address_space *mapping;
+ bool lru_locked = false;
+ unsigned long flags;
struct zone *zone;
- int nr = -EALREADY; /* A racing task may have disbanded the team */
+ long team_usage;
+ long nr = -EALREADY; /* A racing task may have disbanded the team */

/*
* In most cases the head page is locked, or not yet exposed to others:
@@ -469,27 +472,54 @@ static int shmem_disband_hugehead(struct
* stays safe because shmem_evict_inode must take the shrinklist_lock,
* and our caller shmem_choose_hugehole is already holding that lock.
*/
+ *head_lru_weight = 0;
mapping = READ_ONCE(head->mapping);
if (!mapping)
return nr;

zone = page_zone(head);
- spin_lock_irq(&mapping->tree_lock);
+ team_usage = atomic_long_read(&head->team_usage);
+again1:
+ if ((team_usage & TEAM_LRU_WEIGHT_MASK) != TEAM_LRU_WEIGHT_ONE) {
+ spin_lock_irq(&zone->lru_lock);
+ lru_locked = true;
+ }
+ spin_lock_irqsave(&mapping->tree_lock, flags);

if (PageTeam(head)) {
- nr = atomic_long_read(&head->team_usage);
- atomic_long_set(&head->team_usage, 0);
+again2:
+ nr = atomic_long_cmpxchg(&head->team_usage, team_usage,
+ TEAM_LRU_WEIGHT_ONE);
+ if (unlikely(nr != team_usage)) {
+ team_usage = nr;
+ if (lru_locked ||
+ (team_usage & TEAM_LRU_WEIGHT_MASK) ==
+ TEAM_LRU_WEIGHT_ONE)
+ goto again2;
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ goto again1;
+ }
+ *head_lru_weight = nr & TEAM_LRU_WEIGHT_MASK;
+ nr /= TEAM_PAGE_COUNTER;
+
/*
- * Disable additions to the team.
- * Ensure head->private is written before PageTeam is
- * cleared, so shmem_writepage() cannot write swap into
- * head->private, then have it overwritten by that 0!
+ * Disable additions to the team. The cmpxchg above
+ * ensures head->team_usage is read before PageTeam is cleared,
+ * when shmem_writepage() might write swap into head->private.
*/
- smp_mb__before_atomic();
ClearPageTeam(head);
+
+ /*
+ * If head has not yet been instantiated into the cache,
+ * reset its page->mapping now, while we have all the locks.
+ */
if (!PageSwapBacked(head))
head->mapping = NULL;

+ if (PageLRU(head) && *head_lru_weight > 1)
+ update_lru_size(mem_cgroup_page_lruvec(head, zone),
+ page_lru(head), 1 - *head_lru_weight);
+
if (nr >= HPAGE_PMD_NR) {
ClearPageChecked(head);
__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
@@ -501,10 +531,88 @@ static int shmem_disband_hugehead(struct
}
}

- spin_unlock_irq(&mapping->tree_lock);
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ if (lru_locked)
+ spin_unlock_irq(&zone->lru_lock);
return nr;
}

+static void shmem_evictify_hugetails(struct page *head, int head_lru_weight)
+{
+ struct page *page;
+ struct lruvec *lruvec = NULL;
+ struct zone *zone = page_zone(head);
+ bool lru_locked = false;
+
+ /*
+ * The head has been sheltering the rest of its team from reclaim:
+ * if any were moved to the unevictable list, now make them evictable.
+ */
+again:
+ for (page = head + HPAGE_PMD_NR - 1; page > head; page--) {
+ if (!PageTeam(page))
+ continue;
+ if (atomic_long_read(&page->team_usage) == TEAM_LRU_WEIGHT_ONE)
+ continue;
+
+ /*
+ * Delay getting lru lock until we reach a page that needs it.
+ */
+ if (!lru_locked) {
+ spin_lock_irq(&zone->lru_lock);
+ lru_locked = true;
+ }
+ lruvec = mem_cgroup_page_lruvec(page, zone);
+
+ if (unlikely(atomic_long_read(&page->team_usage) ==
+ TEAM_LRU_WEIGHT_ONE))
+ continue;
+
+ set_lru_weight(page);
+ head_lru_weight--;
+
+ /*
+ * Usually an Unevictable Team page just stays on its LRU;
+ * but isolation for migration might take it off briefly.
+ */
+ if (unlikely(!PageLRU(page)))
+ continue;
+
+ VM_BUG_ON_PAGE(!PageUnevictable(page), page);
+ VM_BUG_ON_PAGE(PageActive(page), page);
+
+ if (!page_evictable(page)) {
+ /*
+ * This is tiresome, but page_evictable() needs weight 1
+ * to make the right decision, whereas lru size update
+ * needs weight 0 to avoid a bogus "not empty" warning.
+ */
+ clear_lru_weight(page);
+ update_lru_size(lruvec, LRU_UNEVICTABLE, 1);
+ set_lru_weight(page);
+ continue;
+ }
+
+ ClearPageUnevictable(page);
+ update_lru_size(lruvec, LRU_INACTIVE_ANON, 1);
+
+ list_del(&page->lru);
+ list_add_tail(&page->lru, lruvec->lists + LRU_INACTIVE_ANON);
+ }
+
+ if (lru_locked) {
+ spin_unlock_irq(&zone->lru_lock);
+ lru_locked = false;
+ }
+
+ /*
+ * But how can we be sure that a racing putback_inactive_pages()
+ * did its clear_lru_weight() before we checked team_usage above?
+ */
+ if (unlikely(head_lru_weight != TEAM_LRU_WEIGHT_ONE))
+ goto again;
+}
+
static void shmem_disband_hugetails(struct page *head,
struct list_head *list, int nr)
{
@@ -578,6 +686,7 @@ static void shmem_disband_hugetails(stru
static void shmem_disband_hugeteam(struct page *page)
{
struct page *head = team_head(page);
+ int head_lru_weight;
int nr_used;

/*
@@ -623,9 +732,11 @@ static void shmem_disband_hugeteam(struc
* can (splitting disband in two stages), but better not be preempted.
*/
preempt_disable();
- nr_used = shmem_disband_hugehead(head);
+ nr_used = shmem_disband_hugehead(head, &head_lru_weight);
if (head != page)
unlock_page(head);
+ if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+ shmem_evictify_hugetails(head, head_lru_weight);
if (nr_used >= 0)
shmem_disband_hugetails(head, NULL, 0);
if (head != page)
@@ -681,6 +792,7 @@ static unsigned long shmem_choose_hugeho
struct page *topage = NULL;
struct page *page;
pgoff_t index;
+ int head_lru_weight;
int fromused;
int toused;
int nid;
@@ -722,8 +834,10 @@ static unsigned long shmem_choose_hugeho
if (!frompage)
goto unlock;
preempt_disable();
- fromused = shmem_disband_hugehead(frompage);
+ fromused = shmem_disband_hugehead(frompage, &head_lru_weight);
spin_unlock(&shmem_shrinklist_lock);
+ if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+ shmem_evictify_hugetails(frompage, head_lru_weight);
if (fromused > 0)
shmem_disband_hugetails(frompage, fromlist, -fromused);
preempt_enable();
@@ -777,8 +891,10 @@ static unsigned long shmem_choose_hugeho
if (!topage)
goto unlock;
preempt_disable();
- toused = shmem_disband_hugehead(topage);
+ toused = shmem_disband_hugehead(topage, &head_lru_weight);
spin_unlock(&shmem_shrinklist_lock);
+ if (head_lru_weight > TEAM_LRU_WEIGHT_ONE)
+ shmem_evictify_hugetails(topage, head_lru_weight);
if (toused > 0) {
if (HPAGE_PMD_NR - toused >= fromused)
shmem_disband_hugetails(topage, tolist, fromused);
@@ -930,7 +1046,11 @@ shmem_add_to_page_cache(struct page *pag
}
if (!PageSwapBacked(page)) { /* huge needs special care */
SetPageSwapBacked(page);
- SetPageTeam(page);
+ if (!PageTeam(page)) {
+ atomic_long_set(&page->team_usage,
+ TEAM_LRU_WEIGHT_ONE);
+ SetPageTeam(page);
+ }
}
}

@@ -1612,9 +1732,13 @@ static int shmem_writepage(struct page *
struct page *head = team_head(page);
/*
* Only proceed if this is head, or if head is unpopulated.
+ * Redirty any others, without setting PageActive, and then
+ * putback_inactive_pages() will shift them to unevictable.
*/
- if (page != head && PageSwapBacked(head))
+ if (page != head && PageSwapBacked(head)) {
+ wbc->for_reclaim = 0;
goto redirty;
+ }
}

swap = get_swap_page();
@@ -1762,7 +1886,8 @@ static struct page *shmem_alloc_page(gfp
split_page(head, HPAGE_PMD_ORDER);

/* Prepare head page for add_to_page_cache */
- atomic_long_set(&head->team_usage, 0);
+ atomic_long_set(&head->team_usage,
+ TEAM_LRU_WEIGHT_ONE);
__SetPageTeam(head);
head->mapping = mapping;
head->index = round_down(index, HPAGE_PMD_NR);
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -469,6 +469,11 @@ void lru_cache_add_active_or_unevictable
struct vm_area_struct *vma)
{
VM_BUG_ON_PAGE(PageLRU(page), page);
+ /*
+ * Using hpage_nr_pages() on a huge tmpfs team page might not give the
+ * 1 NR_MLOCK needs below; but this seems to be for anon pages only.
+ */
+ VM_BUG_ON_PAGE(!PageAnon(page), page);

if (likely((vma->vm_flags & (VM_LOCKED | VM_SPECIAL)) != VM_LOCKED)) {
SetPageActive(page);
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -19,6 +19,7 @@
#include <linux/kernel_stat.h>
#include <linux/swap.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/vmpressure.h>
@@ -1514,6 +1515,39 @@ putback_inactive_pages(struct lruvec *lr
continue;
}

+ if (PageTeam(page) && !PageActive(page)) {
+ struct page *head = team_head(page);
+ struct address_space *mapping;
+ bool transferring_weight = false;
+ /*
+ * Team tail page was ready for eviction, but has
+ * been sent back from shmem_writepage(): transfer
+ * its weight to head, and move tail to unevictable.
+ */
+ mapping = READ_ONCE(page->mapping);
+ if (page != head && mapping) {
+ lruvec = mem_cgroup_page_lruvec(head, zone);
+ spin_lock(&mapping->tree_lock);
+ if (PageTeam(head)) {
+ VM_BUG_ON(head->mapping != mapping);
+ inc_lru_weight(head);
+ transferring_weight = true;
+ }
+ spin_unlock(&mapping->tree_lock);
+ }
+ if (transferring_weight) {
+ if (PageLRU(head))
+ update_lru_size(lruvec,
+ page_lru(head), 1);
+ /* Get this tail page out of the way for now */
+ SetPageUnevictable(page);
+ clear_lru_weight(page);
+ } else {
+ /* Traditional case of unswapped & redirtied */
+ SetPageActive(page);
+ }
+ }
+
lruvec = mem_cgroup_page_lruvec(page, zone);

SetPageLRU(page);
@@ -3791,11 +3825,12 @@ int zone_reclaim(struct zone *zone, gfp_
* Reasons page might not be evictable:
* (1) page's mapping marked unevictable
* (2) page is part of an mlocked VMA
- *
+ * (3) page is held in memory as part of a team
*/
int page_evictable(struct page *page)
{
- return !mapping_unevictable(page_mapping(page)) && !PageMlocked(page);
+ return !mapping_unevictable(page_mapping(page)) &&
+ !PageMlocked(page) && hpage_nr_pages(page);
}

#ifdef CONFIG_SHMEM

2016-04-05 21:35:46

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 14/31] huge tmpfs: fix Mlocked meminfo, track huge & unhuge mlocks

Up to this point, the huge tmpfs effort has barely looked at or touched
mm/mlock.c at all: just a PageTeam test to stop __munlock_pagevec_fill()
crashing (or hanging on a non-existent spinlock) on hugepage pmds.

/proc/meminfo's Mlocked count has been whatever happens to be shown
if we do nothing extra: a hugely mapped and mlocked team page would
count as 4kB instead of the 2MB you'd expect; or at least until the
previous (Unevictable) patch, which now requires lruvec locking for
hpage_nr_pages() on a team page (locking not given it in mlock.c),
and varies the amount returned by hpage_nr_pages().

It would be easy to correct the 4kB or variable amount to 2MB
by using an alternative to hpage_nr_pages() here. And it would be
fairly easy to maintain an entirely independent PmdMlocked count,
such that Mlocked+PmdMlocked might amount to (almost) twice RAM
size. But is that what observers of Mlocked want? Probably not.

So we need a huge pmd mlock to count as 2MB, but discount 4kB for
each page within it that is already mlocked by pte somewhere, in
this or another process; and a small pte mlock to count usually as
4kB, but 0 if the team head is already mlocked by pmd somewhere.

Can this be done by maintaining extra counts per team? I did
intend so, but (a) space in team_usage is limited, and (b) mlock
and munlock already involve slow LRU switching, so might as well
keep 4kB and 2MB in synch manually; but most significantly (c):
the trylocking around which mlock was designed, makes it hard
to work out just when a count does need to be incremented.

The hard-won solution looks much simpler than I thought possible,
but it has an odd interface in its current implementation. Not so much
needed changing, mainly just clear_page_mlock(), mlock_vma_page()
munlock_vma_page() and try_to_"unmap"_one(). The big difference
from before is that a team head page might be being mlocked as a
4kB page or as a 2MB page, and the called functions cannot tell:
so now need an nr_pages argument. But odd because the PageTeam
case immediately converts that to an iteration count, whereas
the anon THP case keeps it as the weight for a single iteration
(and in the munlock case has to reconfirm it under lruvec lock).
Not very nice, but will do for now: it was so hard to get here,
I'm very reluctant to pull it apart in a hurry.

The TEAM_PMD_MLOCKED flag in team_usage does not play a large part,
just optimizes out the overhead in a couple of cases: we don't want to
make yet another pass down the team, whenever a team is last unmapped,
just to handle the unlikely mlocked-then-truncated case; and we don't
want munlocking one of many parallel huge mlocks to check every page.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pageteam.h | 38 +++++++
mm/huge_memory.c | 15 ++-
mm/internal.h | 26 +++--
mm/mlock.c | 181 +++++++++++++++++++++----------------
mm/rmap.c | 44 +++++---
5 files changed, 196 insertions(+), 108 deletions(-)

--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -36,8 +36,14 @@ static inline struct page *team_head(str
*/
#define TEAM_LRU_WEIGHT_ONE 1L
#define TEAM_LRU_WEIGHT_MASK ((1L << (HPAGE_PMD_ORDER + 1)) - 1)
+/*
+ * Single bit to indicate whether team is hugely mlocked (like PageMlocked).
+ * Then another bit reserved for experiments with other team flags.
+ */
+#define TEAM_PMD_MLOCKED (1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_RESERVED_FLAG (1L << (HPAGE_PMD_ORDER + 2))

-#define TEAM_HIGH_COUNTER (1L << (HPAGE_PMD_ORDER + 1))
+#define TEAM_HIGH_COUNTER (1L << (HPAGE_PMD_ORDER + 3))
/*
* Count how many pages of team are instantiated, as it is built up.
*/
@@ -97,6 +103,36 @@ static inline void clear_lru_weight(stru
atomic_long_set(&page->team_usage, 0);
}

+static inline bool team_pmd_mlocked(struct page *head)
+{
+ VM_BUG_ON_PAGE(head != team_head(head), head);
+ return atomic_long_read(&head->team_usage) & TEAM_PMD_MLOCKED;
+}
+
+static inline void set_team_pmd_mlocked(struct page *head)
+{
+ long team_usage;
+
+ VM_BUG_ON_PAGE(head != team_head(head), head);
+ team_usage = atomic_long_read(&head->team_usage);
+ while (!(team_usage & TEAM_PMD_MLOCKED)) {
+ team_usage = atomic_long_cmpxchg(&head->team_usage,
+ team_usage, team_usage | TEAM_PMD_MLOCKED);
+ }
+}
+
+static inline void clear_team_pmd_mlocked(struct page *head)
+{
+ long team_usage;
+
+ VM_BUG_ON_PAGE(head != team_head(head), head);
+ team_usage = atomic_long_read(&head->team_usage);
+ while (team_usage & TEAM_PMD_MLOCKED) {
+ team_usage = atomic_long_cmpxchg(&head->team_usage,
+ team_usage, team_usage & ~TEAM_PMD_MLOCKED);
+ }
+}
+
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int map_team_by_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd, struct page *page);
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1443,8 +1443,8 @@ struct page *follow_trans_huge_pmd(struc
touch_pmd(vma, addr, pmd);
if ((flags & FOLL_MLOCK) && (vma->vm_flags & VM_LOCKED)) {
/*
- * We don't mlock() pte-mapped THPs. This way we can avoid
- * leaking mlocked pages into non-VM_LOCKED VMAs.
+ * We don't mlock() pte-mapped compound THPs. This way we
+ * can avoid leaking mlocked pages into non-VM_LOCKED VMAs.
*
* In most cases the pmd is the only mapping of the page as we
* break COW for the mlock() -- see gup_flags |= FOLL_WRITE for
@@ -1453,12 +1453,16 @@ struct page *follow_trans_huge_pmd(struc
* The only scenario when we have the page shared here is if we
* mlocking read-only mapping shared over fork(). We skip
* mlocking such pages.
+ *
+ * But the huge tmpfs PageTeam case is handled differently:
+ * there are no arbitrary restrictions on mlocking such pages,
+ * and compound_mapcount() returns 0 even when they are mapped.
*/
- if (compound_mapcount(page) == 1 && !PageDoubleMap(page) &&
+ if (compound_mapcount(page) <= 1 && !PageDoubleMap(page) &&
page->mapping && trylock_page(page)) {
lru_add_drain();
if (page->mapping)
- mlock_vma_page(page);
+ mlock_vma_pages(page, HPAGE_PMD_NR);
unlock_page(page);
}
}
@@ -1710,6 +1714,9 @@ int zap_huge_pmd(struct mmu_gather *tlb,
pte_free(tlb->mm, pgtable_trans_huge_withdraw(tlb->mm, pmd));
atomic_long_dec(&tlb->mm->nr_ptes);
spin_unlock(ptl);
+ if (PageTeam(page) &&
+ !team_pmd_mapped(page) && team_pmd_mlocked(page))
+ clear_pages_mlock(page, HPAGE_PMD_NR);
tlb_remove_page(tlb, page);
}
return 1;
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -275,8 +275,16 @@ static inline void munlock_vma_pages_all
/*
* must be called with vma's mmap_sem held for read or write, and page locked.
*/
-extern void mlock_vma_page(struct page *page);
-extern unsigned int munlock_vma_page(struct page *page);
+extern void mlock_vma_pages(struct page *page, int nr_pages);
+static inline void mlock_vma_page(struct page *page)
+{
+ mlock_vma_pages(page, 1);
+}
+extern int munlock_vma_pages(struct page *page, int nr_pages);
+static inline void munlock_vma_page(struct page *page)
+{
+ munlock_vma_pages(page, 1);
+}

/*
* Clear the page's PageMlocked(). This can be useful in a situation where
@@ -287,7 +295,11 @@ extern unsigned int munlock_vma_page(str
* If called for a page that is still mapped by mlocked vmas, all we do
* is revert to lazy LRU behaviour -- semantics are not broken.
*/
-extern void clear_page_mlock(struct page *page);
+extern void clear_pages_mlock(struct page *page, int nr_pages);
+static inline void clear_page_mlock(struct page *page)
+{
+ clear_pages_mlock(page, 1);
+}

/*
* mlock_migrate_page - called only from migrate_misplaced_transhuge_page()
@@ -328,13 +340,7 @@ vma_address(struct page *page, struct vm

return address;
}
-
-#else /* !CONFIG_MMU */
-static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
-static inline void mlock_migrate_page(struct page *new, struct page *old) { }
-
-#endif /* !CONFIG_MMU */
+#endif /* CONFIG_MMU */

/*
* Return the mem_map entry representing the 'offset' subpage within
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -11,6 +11,7 @@
#include <linux/swap.h>
#include <linux/swapops.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/pagevec.h>
#include <linux/mempolicy.h>
#include <linux/syscalls.h>
@@ -51,43 +52,72 @@ EXPORT_SYMBOL(can_do_mlock);
* (see mm/rmap.c).
*/

-/*
- * LRU accounting for clear_page_mlock()
+/**
+ * clear_pages_mlock - clear mlock from a page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is zapped.
+ *
+ * Clear the page's PageMlocked(). This can be useful in a situation where
+ * we want to unconditionally remove a page from the pagecache -- e.g.,
+ * on truncation or freeing.
+ *
+ * It is legal to call this function for any page, mlocked or not.
+ * If called for a page that is still mapped by mlocked vmas, all we do
+ * is revert to lazy LRU behaviour -- semantics are not broken.
*/
-void clear_page_mlock(struct page *page)
+void clear_pages_mlock(struct page *page, int nr_pages)
{
- if (!TestClearPageMlocked(page))
- return;
+ struct zone *zone = page_zone(page);
+ struct page *endpage = page + 1;

- mod_zone_page_state(page_zone(page), NR_MLOCK,
- -hpage_nr_pages(page));
- count_vm_event(UNEVICTABLE_PGCLEARED);
- if (!isolate_lru_page(page)) {
- putback_lru_page(page);
- } else {
- /*
- * We lost the race. the page already moved to evictable list.
- */
- if (PageUnevictable(page))
+ if (nr_pages > 1 && PageTeam(page)) {
+ clear_team_pmd_mlocked(page); /* page is team head */
+ endpage = page + nr_pages;
+ nr_pages = 1;
+ }
+
+ for (; page < endpage; page++) {
+ if (page_mapped(page))
+ continue;
+ if (!TestClearPageMlocked(page))
+ continue;
+ mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+ count_vm_event(UNEVICTABLE_PGCLEARED);
+ if (!isolate_lru_page(page))
+ putback_lru_page(page);
+ else if (PageUnevictable(page))
count_vm_event(UNEVICTABLE_PGSTRANDED);
}
}

-/*
- * Mark page as mlocked if not already.
+/**
+ * mlock_vma_pages - mlock a vma page or pages
+ * @page - page to be mlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is mlocked.
+ *
+ * Mark pages as mlocked if not already.
* If page on LRU, isolate and putback to move to unevictable list.
*/
-void mlock_vma_page(struct page *page)
+void mlock_vma_pages(struct page *page, int nr_pages)
{
- /* Serialize with page migration */
- BUG_ON(!PageLocked(page));
+ struct zone *zone = page_zone(page);
+ struct page *endpage = page + 1;

+ /* Serialize with page migration */
+ VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(PageCompound(page) && PageDoubleMap(page), page);

- if (!TestSetPageMlocked(page)) {
- mod_zone_page_state(page_zone(page), NR_MLOCK,
- hpage_nr_pages(page));
+ if (nr_pages > 1 && PageTeam(page)) {
+ set_team_pmd_mlocked(page); /* page is team head */
+ endpage = page + nr_pages;
+ nr_pages = 1;
+ }
+
+ for (; page < endpage; page++) {
+ if (TestSetPageMlocked(page))
+ continue;
+ mod_zone_page_state(zone, NR_MLOCK, nr_pages);
count_vm_event(UNEVICTABLE_PGMLOCKED);
if (!isolate_lru_page(page))
putback_lru_page(page);
@@ -111,6 +141,18 @@ static bool __munlock_isolate_lru_page(s
return true;
}

+ /*
+ * Perform accounting when page isolation fails in munlock.
+ * There is nothing else to do because it means some other task has
+ * already removed the page from the LRU. putback_lru_page() will take
+ * care of removing the page from the unevictable list, if necessary.
+ * vmscan [page_referenced()] will move the page back to the
+ * unevictable list if some other vma has it mlocked.
+ */
+ if (PageUnevictable(page))
+ __count_vm_event(UNEVICTABLE_PGSTRANDED);
+ else
+ __count_vm_event(UNEVICTABLE_PGMUNLOCKED);
return false;
}

@@ -128,7 +170,7 @@ static void __munlock_isolated_page(stru
* Optimization: if the page was mapped just once, that's our mapping
* and we don't need to check all the other vmas.
*/
- if (page_mapcount(page) > 1)
+ if (page_mapcount(page) > 1 || PageTeam(page))
ret = try_to_munlock(page);

/* Did try_to_unlock() succeed or punt? */
@@ -138,29 +180,12 @@ static void __munlock_isolated_page(stru
putback_lru_page(page);
}

-/*
- * Accounting for page isolation fail during munlock
- *
- * Performs accounting when page isolation fails in munlock. There is nothing
- * else to do because it means some other task has already removed the page
- * from the LRU. putback_lru_page() will take care of removing the page from
- * the unevictable list, if necessary. vmscan [page_referenced()] will move
- * the page back to the unevictable list if some other vma has it mlocked.
- */
-static void __munlock_isolation_failed(struct page *page)
-{
- if (PageUnevictable(page))
- __count_vm_event(UNEVICTABLE_PGSTRANDED);
- else
- __count_vm_event(UNEVICTABLE_PGMUNLOCKED);
-}
-
/**
- * munlock_vma_page - munlock a vma page
- * @page - page to be unlocked, either a normal page or THP page head
+ * munlock_vma_pages - munlock a vma page or pages
+ * @page - page to be unlocked
+ * @nr_pages - usually 1, but HPAGE_PMD_NR if pmd mapping is munlocked
*
- * returns the size of the page as a page mask (0 for normal page,
- * HPAGE_PMD_NR - 1 for THP head page)
+ * returns the size of the page (usually 1, but HPAGE_PMD_NR for huge page)
*
* called from munlock()/munmap() path with page supposedly on the LRU.
* When we munlock a page, because the vma where we found the page is being
@@ -173,41 +198,56 @@ static void __munlock_isolation_failed(s
* can't isolate the page, we leave it for putback_lru_page() and vmscan
* [page_referenced()/try_to_unmap()] to deal with.
*/
-unsigned int munlock_vma_page(struct page *page)
+int munlock_vma_pages(struct page *page, int nr_pages)
{
- int nr_pages;
struct zone *zone = page_zone(page);
+ struct page *endpage = page + 1;
+ struct page *head = NULL;
+ int ret = nr_pages;
+ bool isolated;

/* For try_to_munlock() and to serialize with page migration */
- BUG_ON(!PageLocked(page));
-
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageTail(page), page);

+ if (nr_pages > 1 && PageTeam(page)) {
+ head = page;
+ clear_team_pmd_mlocked(page); /* page is team head */
+ endpage = page + nr_pages;
+ nr_pages = 1;
+ }
+
/*
- * Serialize with any parallel __split_huge_page_refcount() which
- * might otherwise copy PageMlocked to part of the tail pages before
- * we clear it in the head page. It also stabilizes hpage_nr_pages().
+ * Serialize THP with any parallel __split_huge_page_tail() which
+ * might otherwise copy PageMlocked to some of the tail pages before
+ * we clear it in the head page.
*/
spin_lock_irq(&zone->lru_lock);
+ if (nr_pages > 1 && !PageTransHuge(page))
+ ret = nr_pages = 1;

- nr_pages = hpage_nr_pages(page);
- if (!TestClearPageMlocked(page))
- goto unlock_out;
+ for (; page < endpage; page++) {
+ if (!TestClearPageMlocked(page))
+ continue;

- __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
-
- if (__munlock_isolate_lru_page(page, true)) {
+ __mod_zone_page_state(zone, NR_MLOCK, -nr_pages);
+ isolated = __munlock_isolate_lru_page(page, true);
spin_unlock_irq(&zone->lru_lock);
- __munlock_isolated_page(page);
- goto out;
- }
- __munlock_isolation_failed(page);
+ if (isolated)
+ __munlock_isolated_page(page);

-unlock_out:
+ /*
+ * If try_to_munlock() found the huge page to be still
+ * mlocked, don't waste more time munlocking and rmap
+ * walking and re-mlocking each of the team's pages.
+ */
+ if (!head || team_pmd_mlocked(head))
+ goto out;
+ spin_lock_irq(&zone->lru_lock);
+ }
spin_unlock_irq(&zone->lru_lock);
-
out:
- return nr_pages - 1;
+ return ret;
}

/*
@@ -300,8 +340,6 @@ static void __munlock_pagevec(struct pag
*/
if (__munlock_isolate_lru_page(page, false))
continue;
- else
- __munlock_isolation_failed(page);
}

/*
@@ -461,13 +499,8 @@ void munlock_vma_pages_range(struct vm_a
put_page(page); /* follow_page_mask() */
} else if (PageTransHuge(page) || PageTeam(page)) {
lock_page(page);
- /*
- * Any THP page found by follow_page_mask() may
- * have gotten split before reaching
- * munlock_vma_page(), so we need to recompute
- * the page_mask here.
- */
- page_mask = munlock_vma_page(page);
+ page_mask = munlock_vma_pages(page,
+ page_mask + 1) - 1;
unlock_page(page);
put_page(page); /* follow_page_mask() */
} else {
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -837,10 +837,15 @@ again:
spin_unlock(ptl);
goto again;
}
- pte = NULL;
+ if (ptep)
+ *ptep = NULL;
goto found;
}

+ /* TTU_MUNLOCK on PageTeam makes a second try for huge pmd only */
+ if (unlikely(!ptep))
+ return false;
+
pte = pte_offset_map(pmd, address);
if (!pte_present(*pte)) {
pte_unmap(pte);
@@ -861,8 +866,9 @@ check_pte:
pte_unmap_unlock(pte, ptl);
return false;
}
-found:
+
*ptep = pte;
+found:
*pmdp = pmd;
*ptlp = ptl;
return true;
@@ -1332,7 +1338,7 @@ static void page_remove_anon_compound_rm
}

if (unlikely(PageMlocked(page)))
- clear_page_mlock(page);
+ clear_pages_mlock(page, HPAGE_PMD_NR);

if (nr) {
__mod_zone_page_state(page_zone(page), NR_ANON_PAGES, -nr);
@@ -1418,8 +1424,17 @@ static int try_to_unmap_one(struct page
goto out;
}

- if (!page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl))
- goto out;
+ if (!page_check_address_transhuge(page, mm, address,
+ &pmd, &pte, &ptl)) {
+ if (!(flags & TTU_MUNLOCK) || !PageTeam(page))
+ goto out;
+ /* We need also to check whether head is hugely mapped here */
+ pte = NULL;
+ page = team_head(page);
+ if (!page_check_address_transhuge(page, mm, address,
+ &pmd, NULL, &ptl))
+ goto out;
+ }

/*
* If the page is mlock()d, we cannot swap it out.
@@ -1429,7 +1444,7 @@ static int try_to_unmap_one(struct page
if (!(flags & TTU_IGNORE_MLOCK)) {
if (vma->vm_flags & VM_LOCKED) {
/* Holding pte lock, we do *not* need mmap_sem here */
- mlock_vma_page(page);
+ mlock_vma_pages(page, pte ? 1 : HPAGE_PMD_NR);
ret = SWAP_MLOCK;
goto out_unmap;
}
@@ -1635,11 +1650,6 @@ int try_to_unmap(struct page *page, enum
return ret;
}

-static int page_not_mapped(struct page *page)
-{
- return !page_mapped(page);
-};
-
/**
* try_to_munlock - try to munlock a page
* @page: the page to be munlocked
@@ -1657,24 +1667,20 @@ static int page_not_mapped(struct page *
*/
int try_to_munlock(struct page *page)
{
- int ret;
struct rmap_private rp = {
.flags = TTU_MUNLOCK,
.lazyfreed = 0,
};
-
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
.arg = &rp,
- .done = page_not_mapped,
.anon_lock = page_lock_anon_vma_read,
-
};

- VM_BUG_ON_PAGE(!PageLocked(page) || PageLRU(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);
+ VM_BUG_ON_PAGE(PageLRU(page), page);

- ret = rmap_walk(page, &rwc);
- return ret;
+ return rmap_walk(page, &rwc);
}

void __put_anon_vma(struct anon_vma *anon_vma)
@@ -1789,7 +1795,7 @@ static int rmap_walk_file(struct page *p
* structure at mapping cannot be freed and reused yet,
* so we can safely take mapping->i_mmap_rwsem.
*/
- VM_BUG_ON_PAGE(!PageLocked(page), page);
+ VM_BUG_ON_PAGE(!PageLocked(page) && !PageTeam(page), page);

if (!mapping)
return ret;

2016-04-05 21:37:10

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 15/31] huge tmpfs: fix Mapped meminfo, track huge & unhuge mappings

Maintaining Mlocked was the difficult one, but now that it is correctly
tracked, without duplication between the 4kB and 2MB amounts, I think
we have to make a similar effort with Mapped.

But whereas mlock and munlock were already rare and slow operations,
to which we could fairly add a little more overhead in the huge tmpfs
case, ordinary mmap is not something we want to slow down further,
relative to hugetlbfs.

In the Mapped case, I think we can take small or misaligned mmaps of
huge tmpfs files as the exceptional operation, and add a little more
overhead to those, by maintaining another count for them in the head;
and by keeping both hugely and unhugely mapped counts in the one long,
we can rely on cmpxchg to manage their racing transitions atomically.

That's good on 64-bit, but there are not enough free bits in a 32-bit
atomic_long_t team_usage to support this: I think we should continue
to permit huge tmpfs on 32-bit, but accept that Mapped may be doubly
counted there. (A more serious problem on 32-bit is that it would,
I think, be possible to overflow the huge mapping counter: protection
against that will need to be added.)
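
For illustration only, a minimal sketch of the cmpxchg pattern relied on
above, with invented names (the real layout and helpers are in the
pageteam.h hunk below): re-read the whole long, bump a single field,
and retry if some other transition raced in.

	/* Sketch: bump one field of a shared atomic_long_t atomically */
	#define EX_PTE_ONE	(1L << 12)	/* invented for this example */

	static inline void ex_inc_pte_mapped(atomic_long_t *usage)
	{
		long old, cur = atomic_long_read(usage);

		do {
			old = cur;
			cur = atomic_long_cmpxchg(usage, old, old + EX_PTE_ONE);
		} while (cur != old);
	}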

Now that we are maintaining NR_FILE_MAPPED correctly for huge
tmpfs, adjust vmscan's zone_unmapped_file_pages() to exclude
NR_SHMEM_PMDMAPPED, which it clearly would not want included.
What about minimum_image_size() in kernel/power/snapshot.c? I have
not grasped the basis for that calculation, so I am leaving it untouched.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/memcontrol.h | 5 +
include/linux/pageteam.h | 144 ++++++++++++++++++++++++++++++++---
mm/huge_memory.c | 34 +++++++-
mm/rmap.c | 10 +-
mm/vmscan.c | 6 +
5 files changed, 180 insertions(+), 19 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -700,6 +700,11 @@ static inline bool mem_cgroup_oom_synchr
return false;
}

+static inline void mem_cgroup_update_page_stat(struct page *page,
+ enum mem_cgroup_stat_index idx, int val)
+{
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
{
--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -30,6 +30,30 @@ static inline struct page *team_head(str
}

/*
+ * Layout of team head's page->team_usage field, as on x86_64 and arm64_4K:
+ *
+ * 63 32 31 22 21 12 11 10 9 0
+ * +------------+--------------+----------+----------+---------+------------+
+ * | pmd_mapped & instantiated |pte_mapped| reserved | mlocked | lru_weight |
+ * | 42 bits 10 bits | 10 bits | 1 bit | 1 bit | 10 bits |
+ * +------------+--------------+----------+----------+---------+------------+
+ *
+ * TEAM_LRU_WEIGHT_ONE 1 (1<<0)
+ * TEAM_LRU_WEIGHT_MASK 3ff (1<<10)-1
+ * TEAM_PMD_MLOCKED 400 (1<<10)
+ * TEAM_RESERVED_FLAG 800 (1<<11)
+ * TEAM_PTE_COUNTER 1000 (1<<12)
+ * TEAM_PTE_MASK 3ff000 (1<<22)-(1<<12)
+ * TEAM_PAGE_COUNTER 400000 (1<<22)
+ * TEAM_COMPLETE 80000000 (1<<31)
+ * TEAM_MAPPING_COUNTER 400000 (1<<22)
+ * TEAM_PMD_MAPPED 80400000 (1<<31)
+ *
+ * The upper bits count up to TEAM_COMPLETE as pages are instantiated,
+ * and then, above TEAM_COMPLETE, they count huge mappings of the team.
+ * Team tails have team_usage either 1 (lru_weight 1) or 0 (lru_weight 0).
+ */
+/*
* Mask for lower bits of team_usage, giving the weight 0..HPAGE_PMD_NR of the
* page on its LRU: normal pages have weight 1, tails held unevictable until
* head is evicted have weight 0, and the head gathers weight 1..HPAGE_PMD_NR.
@@ -42,8 +66,22 @@ static inline struct page *team_head(str
*/
#define TEAM_PMD_MLOCKED (1L << (HPAGE_PMD_ORDER + 1))
#define TEAM_RESERVED_FLAG (1L << (HPAGE_PMD_ORDER + 2))
-
+#ifdef CONFIG_64BIT
+/*
+ * Count how many pages of team are individually mapped into userspace.
+ */
+#define TEAM_PTE_COUNTER (1L << (HPAGE_PMD_ORDER + 3))
+#define TEAM_HIGH_COUNTER (1L << (2*HPAGE_PMD_ORDER + 4))
+#define TEAM_PTE_MASK (TEAM_HIGH_COUNTER - TEAM_PTE_COUNTER)
+#define team_pte_count(usage) (((usage) & TEAM_PTE_MASK) / TEAM_PTE_COUNTER)
+#else /* 32-bit */
+/*
+ * Not enough bits in atomic_long_t: we prefer not to bloat struct page just to
+ * avoid duplication in Mapped, when a page is mapped both hugely and unhugely.
+ */
#define TEAM_HIGH_COUNTER (1L << (HPAGE_PMD_ORDER + 3))
+#define team_pte_count(usage) 1 /* allows for the extra page_add_file_rmap */
+#endif /* CONFIG_64BIT */
/*
* Count how many pages of team are instantiated, as it is built up.
*/
@@ -66,22 +104,110 @@ static inline bool team_pmd_mapped(struc

/*
* Returns true if this was the first mapping by pmd, whereupon mapped stats
- * need to be updated.
+ * need to be updated. Together with the number of pages which then need
+ * to be accounted (can be ignored when false returned): because some team
+ * members may have been mapped unhugely by pte, so already counted as Mapped.
*/
-static inline bool inc_team_pmd_mapped(struct page *head)
+static inline bool inc_team_pmd_mapped(struct page *head, int *nr_pages)
{
- return atomic_long_add_return(TEAM_MAPPING_COUNTER, &head->team_usage)
- < TEAM_PMD_MAPPED + TEAM_MAPPING_COUNTER;
+ long team_usage;
+
+ team_usage = atomic_long_add_return(TEAM_MAPPING_COUNTER,
+ &head->team_usage);
+ *nr_pages = HPAGE_PMD_NR - team_pte_count(team_usage);
+ return team_usage < TEAM_PMD_MAPPED + TEAM_MAPPING_COUNTER;
}

/*
* Returns true if this was the last mapping by pmd, whereupon mapped stats
- * need to be updated.
+ * need to be updated. Together with the number of pages which then need
+ * to be accounted (can be ignored when false returned): because some team
+ * members may still be mapped unhugely by pte, so remain counted as Mapped.
+ */
+static inline bool dec_team_pmd_mapped(struct page *head, int *nr_pages)
+{
+ long team_usage;
+
+ team_usage = atomic_long_sub_return(TEAM_MAPPING_COUNTER,
+ &head->team_usage);
+ *nr_pages = HPAGE_PMD_NR - team_pte_count(team_usage);
+ return team_usage < TEAM_PMD_MAPPED;
+}
+
+/*
+ * Returns true if this pte mapping is of a non-team page, or of a team page not
+ * covered by an existing huge pmd mapping: whereupon stats need to be updated.
+ * Only called when mapcount goes up from 0 to 1 i.e. _mapcount from -1 to 0.
+ */
+static inline bool inc_team_pte_mapped(struct page *page)
+{
+#ifdef CONFIG_64BIT
+ struct page *head;
+ long team_usage;
+ long old;
+
+ if (likely(!PageTeam(page)))
+ return true;
+ head = team_head(page);
+ team_usage = atomic_long_read(&head->team_usage);
+ for (;;) {
+ /* Is team now being disbanded? Stop once team_usage is reset */
+ if (unlikely(!PageTeam(head) ||
+ team_usage / TEAM_PAGE_COUNTER == 0))
+ return true;
+ /*
+ * XXX: but despite the impressive-looking cmpxchg, gthelen
+ * points out that head might be freed and reused and assigned
+ * a matching value in ->private now: tiny chance, must revisit.
+ */
+ old = atomic_long_cmpxchg(&head->team_usage,
+ team_usage, team_usage + TEAM_PTE_COUNTER);
+ if (likely(old == team_usage))
+ break;
+ team_usage = old;
+ }
+ return team_usage < TEAM_PMD_MAPPED;
+#else /* 32-bit */
+ return true;
+#endif
+}
+
+/*
+ * Returns true if this pte mapping is of a non-team page, or of a team page not
+ * covered by a remaining huge pmd mapping: whereupon stats need to be updated.
+ * Only called when mapcount goes down from 1 to 0 i.e. _mapcount from 0 to -1.
*/
-static inline bool dec_team_pmd_mapped(struct page *head)
+static inline bool dec_team_pte_mapped(struct page *page)
{
- return atomic_long_sub_return(TEAM_MAPPING_COUNTER, &head->team_usage)
- < TEAM_PMD_MAPPED;
+#ifdef CONFIG_64BIT
+ struct page *head;
+ long team_usage;
+ long old;
+
+ if (likely(!PageTeam(page)))
+ return true;
+ head = team_head(page);
+ team_usage = atomic_long_read(&head->team_usage);
+ for (;;) {
+ /* Is team now being disbanded? Stop once team_usage is reset */
+ if (unlikely(!PageTeam(head) ||
+ team_usage / TEAM_PAGE_COUNTER == 0))
+ return true;
+ /*
+ * XXX: but despite the impressive-looking cmpxchg, gthelen
+ * points out that head might be freed and reused and assigned
+ * a matching value in ->private now: tiny chance, must revisit.
+ */
+ old = atomic_long_cmpxchg(&head->team_usage,
+ team_usage, team_usage - TEAM_PTE_COUNTER);
+ if (likely(old == team_usage))
+ break;
+ team_usage = old;
+ }
+ return team_usage < TEAM_PMD_MAPPED;
+#else /* 32-bit */
+ return true;
+#endif
}

static inline void inc_lru_weight(struct page *head)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1130,9 +1130,11 @@ int copy_huge_pmd(struct mm_struct *dst_
pmdp_set_wrprotect(src_mm, addr, src_pmd);
pmd = pmd_wrprotect(pmd);
} else {
+ int nr_pages; /* not interesting here */
+
VM_BUG_ON_PAGE(!PageTeam(src_page), src_page);
page_dup_rmap(src_page, false);
- inc_team_pmd_mapped(src_page);
+ inc_team_pmd_mapped(src_page, &nr_pages);
}
add_mm_counter(dst_mm, mm_counter(src_page), HPAGE_PMD_NR);
atomic_long_inc(&dst_mm->nr_ptes);
@@ -3499,18 +3501,40 @@ late_initcall(split_huge_pages_debugfs);

static void page_add_team_rmap(struct page *page)
{
+ int nr_pages;
+
VM_BUG_ON_PAGE(PageAnon(page), page);
VM_BUG_ON_PAGE(!PageTeam(page), page);
- if (inc_team_pmd_mapped(page))
- __inc_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+
+ lock_page_memcg(page);
+ if (inc_team_pmd_mapped(page, &nr_pages)) {
+ struct zone *zone = page_zone(page);
+
+ __inc_zone_state(zone, NR_SHMEM_PMDMAPPED);
+ __mod_zone_page_state(zone, NR_FILE_MAPPED, nr_pages);
+ mem_cgroup_update_page_stat(page,
+ MEM_CGROUP_STAT_FILE_MAPPED, nr_pages);
+ }
+ unlock_page_memcg(page);
}

static void page_remove_team_rmap(struct page *page)
{
+ int nr_pages;
+
VM_BUG_ON_PAGE(PageAnon(page), page);
VM_BUG_ON_PAGE(!PageTeam(page), page);
- if (dec_team_pmd_mapped(page))
- __dec_zone_page_state(page, NR_SHMEM_PMDMAPPED);
+
+ lock_page_memcg(page);
+ if (dec_team_pmd_mapped(page, &nr_pages)) {
+ struct zone *zone = page_zone(page);
+
+ __dec_zone_state(zone, NR_SHMEM_PMDMAPPED);
+ __mod_zone_page_state(zone, NR_FILE_MAPPED, -nr_pages);
+ mem_cgroup_update_page_stat(page,
+ MEM_CGROUP_STAT_FILE_MAPPED, -nr_pages);
+ }
+ unlock_page_memcg(page);
}

int map_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1272,7 +1272,8 @@ void page_add_new_anon_rmap(struct page
void page_add_file_rmap(struct page *page)
{
lock_page_memcg(page);
- if (atomic_inc_and_test(&page->_mapcount)) {
+ if (atomic_inc_and_test(&page->_mapcount) &&
+ inc_team_pte_mapped(page)) {
__inc_zone_page_state(page, NR_FILE_MAPPED);
mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
}
@@ -1299,9 +1300,10 @@ static void page_remove_file_rmap(struct
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_zone_page_state(page, NR_FILE_MAPPED);
- mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
-
+ if (dec_team_pte_mapped(page)) {
+ __dec_zone_page_state(page, NR_FILE_MAPPED);
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_MAPPED);
+ }
if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
out:
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3685,8 +3685,12 @@ static inline unsigned long zone_unmappe
/*
* It's possible for there to be more file mapped pages than
* accounted for by the pages on the file LRU lists because
- * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+ * tmpfs pages accounted for as ANON can also be FILE_MAPPED.
+ * We don't know how many, beyond the PMDMAPPED excluded below.
*/
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
+ file_mapped -= zone_page_state(zone, NR_SHMEM_PMDMAPPED) <<
+ HPAGE_PMD_ORDER;
return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
}


2016-04-05 21:39:34

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 16/31] kvm: plumb return of hva when resolving page fault.

From: Andres Lagar-Cavilla <[email protected]>

So we don't have to redo this work later. Note that the hva is not racy:
it is simple arithmetic based on the memslot.

This will be used in the huge tmpfs commits.
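
To illustrate that "simple arithmetic": a minimal sketch of the
calculation, roughly what KVM's __gfn_to_hva_memslot() already does,
simplified and renamed here.

	static unsigned long example_gfn_to_hva(struct kvm_memory_slot *slot,
						gfn_t gfn)
	{
		/* no locking needed: both fields are fixed by the memslot */
		return slot->userspace_addr + (gfn - slot->base_gfn) * PAGE_SIZE;
	}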

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
Cc'ed to [email protected] as an FYI: this patch is not expected to
go into the tree in the next few weeks. The context is a huge tmpfs
patchset which implements huge pagecache transparently on tmpfs,
using a team of small pages rather than one compound page:
please refer to linux-mm or linux-kernel for more context.

arch/x86/kvm/mmu.c | 20 ++++++++++++++------
arch/x86/kvm/paging_tmpl.h | 3 ++-
include/linux/kvm_host.h | 2 +-
virt/kvm/kvm_main.c | 14 ++++++++------
4 files changed, 25 insertions(+), 14 deletions(-)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -2992,7 +2992,8 @@ exit:
}

static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
- gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable);
+ gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable,
+ unsigned long *hva);
static void make_mmu_pages_available(struct kvm_vcpu *vcpu);

static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
@@ -3003,6 +3004,7 @@ static int nonpaging_map(struct kvm_vcpu
bool force_pt_level = false;
kvm_pfn_t pfn;
unsigned long mmu_seq;
+ unsigned long hva;
bool map_writable, write = error_code & PFERR_WRITE_MASK;

level = mapping_level(vcpu, gfn, &force_pt_level);
@@ -3024,7 +3026,8 @@ static int nonpaging_map(struct kvm_vcpu
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write, &map_writable))
+ if (try_async_pf(vcpu, prefault, gfn, v, &pfn, write,
+ &map_writable, &hva))
return 0;

if (handle_abnormal_pfn(vcpu, v, gfn, pfn, ACC_ALL, &r))
@@ -3487,14 +3490,16 @@ static bool can_do_async_pf(struct kvm_v
}

static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
- gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable)
+ gva_t gva, kvm_pfn_t *pfn, bool write, bool *writable,
+ unsigned long *hva)
{
struct kvm_memory_slot *slot;
bool async;

slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
async = false;
- *pfn = __gfn_to_pfn_memslot(slot, gfn, false, &async, write, writable);
+ *pfn = __gfn_to_pfn_memslot(slot, gfn,
+ false, &async, write, writable, hva);
if (!async)
return false; /* *pfn has correct page already */

@@ -3508,7 +3513,8 @@ static bool try_async_pf(struct kvm_vcpu
return true;
}

- *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL, write, writable);
+ *pfn = __gfn_to_pfn_memslot(slot, gfn,
+ false, NULL, write, writable, hva);
return false;
}

@@ -3531,6 +3537,7 @@ static int tdp_page_fault(struct kvm_vcp
bool force_pt_level;
gfn_t gfn = gpa >> PAGE_SHIFT;
unsigned long mmu_seq;
+ unsigned long hva;
int write = error_code & PFERR_WRITE_MASK;
bool map_writable;

@@ -3559,7 +3566,8 @@ static int tdp_page_fault(struct kvm_vcp
mmu_seq = vcpu->kvm->mmu_notifier_seq;
smp_rmb();

- if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write, &map_writable))
+ if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, write,
+ &map_writable, &hva))
return 0;

if (handle_abnormal_pfn(vcpu, 0, gfn, pfn, ACC_ALL, &r))
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -712,6 +712,7 @@ static int FNAME(page_fault)(struct kvm_
int level = PT_PAGE_TABLE_LEVEL;
bool force_pt_level = false;
unsigned long mmu_seq;
+ unsigned long hva;
bool map_writable, is_self_change_mapping;

pgprintk("%s: addr %lx err %x\n", __func__, addr, error_code);
@@ -765,7 +766,7 @@ static int FNAME(page_fault)(struct kvm_
smp_rmb();

if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, write_fault,
- &map_writable))
+ &map_writable, &hva))
return 0;

if (handle_abnormal_pfn(vcpu, mmu_is_nested(vcpu) ? 0 : addr,
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -600,7 +600,7 @@ kvm_pfn_t gfn_to_pfn_memslot(struct kvm_
kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn);
kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
- bool *writable);
+ bool *writable, unsigned long *hva);

void kvm_release_pfn_clean(kvm_pfn_t pfn);
void kvm_set_pfn_dirty(kvm_pfn_t pfn);
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1444,7 +1444,7 @@ exit:

kvm_pfn_t __gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn,
bool atomic, bool *async, bool write_fault,
- bool *writable)
+ bool *writable, unsigned long *hva)
{
unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);

@@ -1466,8 +1466,10 @@ kvm_pfn_t __gfn_to_pfn_memslot(struct kv
writable = NULL;
}

- return hva_to_pfn(addr, atomic, async, write_fault,
- writable);
+ if (hva)
+ *hva = addr;
+
+ return hva_to_pfn(addr, atomic, async, write_fault, writable);
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);

@@ -1475,19 +1477,19 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kv
bool *writable)
{
return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, NULL,
- write_fault, writable);
+ write_fault, writable, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

kvm_pfn_t gfn_to_pfn_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL);
+ return __gfn_to_pfn_memslot(slot, gfn, false, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_memslot_atomic(struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL);
+ return __gfn_to_pfn_memslot(slot, gfn, true, NULL, true, NULL, NULL);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);


2016-04-05 21:41:20

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 17/31] kvm: teach kvm to map page teams as huge pages.

From: Andres Lagar-Cavilla <[email protected]>

Include a small treatise on the locking rules around page teams.

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
Cc'ed to [email protected] as an FYI: this patch is not expected to
go into the tree in the next few weeks, and depends upon a pageteam.h
not yet available outside this patchset. The context is a huge tmpfs
patchset which implements huge pagecache transparently on tmpfs,
using a team of small pages rather than one compound page:
please refer to linux-mm or linux-kernel for more context.

arch/x86/kvm/mmu.c | 130 ++++++++++++++++++++++++++++++-----
arch/x86/kvm/paging_tmpl.h | 3
2 files changed, 117 insertions(+), 16 deletions(-)

--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -32,6 +32,7 @@
#include <linux/module.h>
#include <linux/swap.h>
#include <linux/hugetlb.h>
+#include <linux/pageteam.h>
#include <linux/compiler.h>
#include <linux/srcu.h>
#include <linux/slab.h>
@@ -2799,33 +2800,132 @@ static int kvm_handle_bad_page(struct kv
return -EFAULT;
}

+/*
+ * We are holding kvm->mmu_lock, serializing against mmu notifiers.
+ * We have a ref on page.
+ *
+ * A team of tmpfs 512 pages can be mapped as an integral hugepage as long as
+ * the team is not disbanded. The head page is !PageTeam if disbanded.
+ *
+ * Huge tmpfs pages are disbanded for page freeing, shrinking, or swap out.
+ *
+ * Freeing (punch hole, truncation):
+ * shmem_undo_range
+ * disband
+ * lock head page
+ * unmap_mapping_range
+ * zap_page_range_single
+ * mmu_notifier_invalidate_range_start
+ * __split_huge_pmd or zap_huge_pmd
+ * remap_team_by_ptes
+ * mmu_notifier_invalidate_range_end
+ * unlock head page
+ * pagevec_release
+ * pages are freed
+ * If we race with disband MMUN will fix us up. The head page lock also
+ * serializes any gup() against resolving the page team.
+ *
+ * Shrinker, disbands, but once a page team is fully banded up it no longer is
+ * tagged as shrinkable in the radix tree and hence can't be shrunk.
+ * shmem_shrink_hugehole
+ * shmem_choose_hugehole
+ * disband
+ * migrate_pages
+ * try_to_unmap
+ * mmu_notifier_invalidate_page
+ * Double-indemnity: if we race with disband, MMUN will fix us up.
+ *
+ * Swap out:
+ * shrink_page_list
+ * try_to_unmap
+ * unmap_team_by_pmd
+ * mmu_notifier_invalidate_range
+ * pageout
+ * shmem_writepage
+ * disband
+ * free_hot_cold_page_list
+ * pages are freed
+ * If we race with disband, no one will come to fix us up. So, we check for a
+ * pmd mapping, serializing against the MMUN in unmap_team_by_pmd, which will
+ * break the pmd mapping if it runs before us (or invalidate our mapping if ran
+ * after).
+ */
+static bool is_huge_tmpfs(struct kvm_vcpu *vcpu,
+ unsigned long address, struct page *page)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ struct page *head;
+
+ if (!PageTeam(page))
+ return false;
+ /*
+ * This strictly assumes PMD-level huge-ing.
+ * Which is the only thing KVM can handle here.
+ */
+ if (((address & (HPAGE_PMD_SIZE - 1)) >> PAGE_SHIFT) !=
+ (page->index & (HPAGE_PMD_NR-1)))
+ return false;
+ head = team_head(page);
+ if (!PageTeam(head))
+ return false;
+ /*
+ * Attempt at early discard. If the head races into becoming SwapCache,
+ * and thus having a bogus team_usage, we'll know for sure next.
+ */
+ if (!team_pmd_mapped(head))
+ return false;
+ /*
+ * Copied from page_check_address_transhuge, to avoid making it
+ * a module-visible symbol. Simplify it. No need for page table lock,
+ * as mmu notifier serialization ensures we are on either side of
+ * unmap_team_by_pmd or remap_team_by_ptes.
+ */
+ address &= HPAGE_PMD_MASK;
+ pgd = pgd_offset(vcpu->kvm->mm, address);
+ if (!pgd_present(*pgd))
+ return false;
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ return false;
+ pmd = pmd_offset(pud, address);
+ if (!pmd_trans_huge(*pmd))
+ return false;
+ return pmd_page(*pmd) == head;
+}
+
+static bool is_transparent_hugepage(struct kvm_vcpu *vcpu,
+ unsigned long address, kvm_pfn_t pfn)
+{
+ struct page *page = pfn_to_page(pfn);
+
+ return PageTransCompound(page) || is_huge_tmpfs(vcpu, address, page);
+}
+
static void transparent_hugepage_adjust(struct kvm_vcpu *vcpu,
- gfn_t *gfnp, kvm_pfn_t *pfnp,
- int *levelp)
+ unsigned long address, gfn_t *gfnp,
+ kvm_pfn_t *pfnp, int *levelp)
{
kvm_pfn_t pfn = *pfnp;
gfn_t gfn = *gfnp;
int level = *levelp;

/*
- * Check if it's a transparent hugepage. If this would be an
- * hugetlbfs page, level wouldn't be set to
- * PT_PAGE_TABLE_LEVEL and there would be no adjustment done
- * here.
+ * Check if it's a transparent hugepage, either anon or huge tmpfs.
+ * If this were a hugetlbfs page, level wouldn't be set to
+ * PT_PAGE_TABLE_LEVEL and no adjustment would be done here.
*/
if (!is_error_noslot_pfn(pfn) && !kvm_is_reserved_pfn(pfn) &&
level == PT_PAGE_TABLE_LEVEL &&
- PageTransCompound(pfn_to_page(pfn)) &&
+ is_transparent_hugepage(vcpu, address, pfn) &&
!mmu_gfn_lpage_is_disallowed(vcpu, gfn, PT_DIRECTORY_LEVEL)) {
unsigned long mask;
/*
* mmu_notifier_retry was successful and we hold the
- * mmu_lock here, so the pmd can't become splitting
- * from under us, and in turn
- * __split_huge_page_refcount() can't run from under
- * us and we can safely transfer the refcount from
- * PG_tail to PG_head as we switch the pfn to tail to
- * head.
+ * mmu_lock here, so the pmd can't be split under us,
+ * so we can safely transfer the refcount from PG_tail
+ * to PG_head as we switch the pfn from tail to head.
*/
*levelp = level = PT_DIRECTORY_LEVEL;
mask = KVM_PAGES_PER_HPAGE(level) - 1;
@@ -3038,7 +3138,7 @@ static int nonpaging_map(struct kvm_vcpu
goto out_unlock;
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);

@@ -3578,7 +3678,7 @@ static int tdp_page_fault(struct kvm_vcp
goto out_unlock;
make_mmu_pages_available(vcpu);
if (likely(!force_pt_level))
- transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &gfn, &pfn, &level);
r = __direct_map(vcpu, write, map_writable, level, gfn, pfn, prefault);
spin_unlock(&vcpu->kvm->mmu_lock);

--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -800,7 +800,8 @@ static int FNAME(page_fault)(struct kvm_
kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
make_mmu_pages_available(vcpu);
if (!force_pt_level)
- transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
+ transparent_hugepage_adjust(vcpu, hva, &walker.gfn, &pfn,
+ &level);
r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
level, pfn, map_writable, prefault);
++vcpu->stat.pf_fixed;

2016-04-05 21:44:13

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 18/31] huge tmpfs: mem_cgroup move charge on shmem huge pages

Early on, for simplicity, we disabled moving huge tmpfs pages from
one memcg to another (nowadays only required when moving a task into a
memcg having move_charge_at_immigrate exceptionally set). We're about
to add a couple of memcg stats for huge tmpfs, and will need to confront
how to handle moving those stats, so better enable moving the pages now.

Although they're discovered by the pmd's get_mctgt_type_thp(), they
have to be considered page by page, in what's usually the pte scan:
because although the common case is for each member of the team to be
owned by the same memcg, nowhere is that enforced - perhaps one day
we shall need to enforce such a limitation, but not so far.
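
As an aside, a minimal sketch (separate from the patch itself) of the
page-by-page walk described above: when the pmd maps a team, each member
is handed to the ordinary per-page logic via a synthesized pte value.
The helper name here is hypothetical; the real code is in the
memcontrol.c hunks below.

	static void ex_walk_team_under_pmd(struct vm_area_struct *vma,
					   unsigned long addr, unsigned long end,
					   unsigned long pfn)
	{
		for (; addr != end; addr += PAGE_SIZE, pfn++) {
			/* the pte the small-page scan would have seen */
			pte_t ptent = pfn_pte(pfn, vma->vm_page_prot);

			ex_consider_one_page(vma, addr, ptent); /* hypothetical */
		}
	}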

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memcontrol.c | 103 +++++++++++++++++++++++++---------------------
1 file changed, 58 insertions(+), 45 deletions(-)

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4332,6 +4332,7 @@ static int mem_cgroup_do_precharge(unsig
* 2(MC_TARGET_SWAP): if the swap entry corresponding to this pte is a
* target for charge migration. if @target is not NULL, the entry is stored
* in target->ent.
+ * 3(MC_TARGET_TEAM): if pmd entry is not an anon THP: check it page by page
*
* Called with pte lock held.
*/
@@ -4344,6 +4345,7 @@ enum mc_target_type {
MC_TARGET_NONE = 0,
MC_TARGET_PAGE,
MC_TARGET_SWAP,
+ MC_TARGET_TEAM,
};

static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -4565,19 +4567,22 @@ static enum mc_target_type get_mctgt_typ

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
/*
- * We don't consider swapping or file mapped pages because THP does not
- * support them for now.
* Caller should make sure that pmd_trans_huge(pmd) is true.
*/
-static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, union mc_target *target)
+static enum mc_target_type get_mctgt_type_thp(pmd_t pmd,
+ union mc_target *target, unsigned long *pfn)
{
- struct page *page = NULL;
+ struct page *page;
enum mc_target_type ret = MC_TARGET_NONE;

page = pmd_page(pmd);
- /* Don't attempt to move huge tmpfs pages yet: can be enabled later */
- if (!(mc.flags & MOVE_ANON) || !PageAnon(page))
+ if (!PageAnon(page)) {
+ if (!(mc.flags & MOVE_FILE))
+ return ret;
+ *pfn = page_to_pfn(page);
+ return MC_TARGET_TEAM;
+ }
+ if (!(mc.flags & MOVE_ANON))
return ret;
if (page->mem_cgroup == mc.from) {
ret = MC_TARGET_PAGE;
@@ -4589,8 +4594,8 @@ static enum mc_target_type get_mctgt_typ
return ret;
}
#else
-static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
- unsigned long addr, pmd_t pmd, union mc_target *target)
+static inline enum mc_target_type get_mctgt_type_thp(pmd_t pmd,
+ union mc_target *target, unsigned long *pfn)
{
return MC_TARGET_NONE;
}
@@ -4601,24 +4606,33 @@ static int mem_cgroup_count_precharge_pt
struct mm_walk *walk)
{
struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
+ enum mc_target_type target_type;
+ unsigned long uninitialized_var(pfn);
+ pte_t ptent;
+ pte_t *pte = NULL;
spinlock_t *ptl;

ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
+ target_type = get_mctgt_type_thp(*pmd, NULL, &pfn);
+ if (target_type == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
- spin_unlock(ptl);
- return 0;
+ if (target_type != MC_TARGET_TEAM)
+ goto unlock;
+ } else {
+ if (pmd_trans_unstable(pmd))
+ return 0;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE)
- if (get_mctgt_type(vma, addr, *pte, NULL))
+ for (; addr != end; addr += PAGE_SIZE) {
+ ptent = pte ? *(pte++) : pfn_pte(pfn++, vma->vm_page_prot);
+ if (get_mctgt_type(vma, addr, ptent, NULL))
mc.precharge++; /* increment precharge temporarily */
- pte_unmap_unlock(pte - 1, ptl);
+ }
+ if (pte)
+ pte_unmap(pte - 1);
+unlock:
+ spin_unlock(ptl);
cond_resched();

return 0;
@@ -4787,22 +4801,21 @@ static int mem_cgroup_move_charge_pte_ra
{
int ret = 0;
struct vm_area_struct *vma = walk->vma;
- pte_t *pte;
+ unsigned long uninitialized_var(pfn);
+ pte_t ptent;
+ pte_t *pte = NULL;
spinlock_t *ptl;
enum mc_target_type target_type;
union mc_target target;
struct page *page;
-
+retry:
ptl = pmd_trans_huge_lock(pmd, vma);
if (ptl) {
- if (mc.precharge < HPAGE_PMD_NR) {
- spin_unlock(ptl);
- return 0;
- }
- target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
+ target_type = get_mctgt_type_thp(*pmd, &target, &pfn);
if (target_type == MC_TARGET_PAGE) {
page = target.page;
- if (!isolate_lru_page(page)) {
+ if (mc.precharge >= HPAGE_PMD_NR &&
+ !isolate_lru_page(page)) {
if (!mem_cgroup_move_account(page, true,
mc.from, mc.to)) {
mc.precharge -= HPAGE_PMD_NR;
@@ -4811,22 +4824,19 @@ static int mem_cgroup_move_charge_pte_ra
putback_lru_page(page);
}
put_page(page);
+ addr = end;
}
- spin_unlock(ptl);
- return 0;
+ if (target_type != MC_TARGET_TEAM)
+ goto unlock;
+ /* addr is not aligned when retrying after precharge ran out */
+ pfn += (addr & (HPAGE_PMD_SIZE-1)) >> PAGE_SHIFT;
+ } else {
+ if (pmd_trans_unstable(pmd))
+ return 0;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
-retry:
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; addr += PAGE_SIZE) {
- pte_t ptent = *(pte++);
- swp_entry_t ent;
-
- if (!mc.precharge)
- break;
-
+ for (; addr != end && mc.precharge; addr += PAGE_SIZE) {
+ ptent = pte ? *(pte++) : pfn_pte(pfn++, vma->vm_page_prot);
switch (get_mctgt_type(vma, addr, ptent, &target)) {
case MC_TARGET_PAGE:
page = target.page;
@@ -4851,8 +4861,8 @@ put: /* get_mctgt_type() gets the page
put_page(page);
break;
case MC_TARGET_SWAP:
- ent = target.ent;
- if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
+ if (!mem_cgroup_move_swap_account(target.ent,
+ mc.from, mc.to)) {
mc.precharge--;
/* we fixup refcnts and charges later. */
mc.moved_swap++;
@@ -4862,7 +4872,10 @@ put: /* get_mctgt_type() gets the page
break;
}
}
- pte_unmap_unlock(pte - 1, ptl);
+ if (pte)
+ pte_unmap(pte - 1);
+unlock:
+ spin_unlock(ptl);
cond_resched();

if (addr != end) {

2016-04-05 21:46:09

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 19/31] huge tmpfs: mem_cgroup shmem_pmdmapped accounting

From: Andres Lagar-Cavilla <[email protected]>

Grep now for shmem_pmdmapped in memory.stat (and also for
"total_..." in a hierarchical setting).

This metric allows for easy checking on a per-cgroup basis of the
amount of page team memory hugely mapped (at least once) out there.

The metric is counted towards the cgroup owning the page (unlike in an
event such as THP split) because the team page may be mapped hugely
for the first time via a shared map in some other process.

Moved up mem_cgroup_move_account()'s PageWriteback block:
that movement is irrelevant to this patch, but lets us concentrate
better on the PageTeam locking issues which follow in the next patch.

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/memcontrol.h | 2 ++
include/linux/pageteam.h | 16 ++++++++++++++++
mm/huge_memory.c | 4 ++++
mm/memcontrol.c | 35 ++++++++++++++++++++++++++---------
4 files changed, 48 insertions(+), 9 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -50,6 +50,8 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+ /* # of pages charged as hugely mapped teams */
+ MEM_CGROUP_STAT_SHMEM_PMDMAPPED,
MEM_CGROUP_STAT_NSTATS,
/* default hierarchy stats */
MEMCG_KERNEL_STACK = MEM_CGROUP_STAT_NSTATS,
--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -135,6 +135,22 @@ static inline bool dec_team_pmd_mapped(s
}

/*
+ * Supplies those values which mem_cgroup_move_account()
+ * needs to maintain memcg's huge tmpfs stats correctly.
+ */
+static inline void count_team_pmd_mapped(struct page *head, int *file_mapped,
+ bool *pmd_mapped)
+{
+ long team_usage;
+
+ *file_mapped = 1;
+ team_usage = atomic_long_read(&head->team_usage);
+ *pmd_mapped = team_usage >= TEAM_PMD_MAPPED;
+ if (*pmd_mapped)
+ *file_mapped = HPAGE_PMD_NR - team_pte_count(team_usage);
+}
+
+/*
* Returns true if this pte mapping is of a non-team page, or of a team page not
* covered by an existing huge pmd mapping: whereupon stats need to be updated.
* Only called when mapcount goes up from 0 to 1 i.e. _mapcount from -1 to 0.
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3514,6 +3514,8 @@ static void page_add_team_rmap(struct pa
__mod_zone_page_state(zone, NR_FILE_MAPPED, nr_pages);
mem_cgroup_update_page_stat(page,
MEM_CGROUP_STAT_FILE_MAPPED, nr_pages);
+ mem_cgroup_update_page_stat(page,
+ MEM_CGROUP_STAT_SHMEM_PMDMAPPED, HPAGE_PMD_NR);
}
unlock_page_memcg(page);
}
@@ -3533,6 +3535,8 @@ static void page_remove_team_rmap(struct
__mod_zone_page_state(zone, NR_FILE_MAPPED, -nr_pages);
mem_cgroup_update_page_stat(page,
MEM_CGROUP_STAT_FILE_MAPPED, -nr_pages);
+ mem_cgroup_update_page_stat(page,
+ MEM_CGROUP_STAT_SHMEM_PMDMAPPED, -HPAGE_PMD_NR);
}
unlock_page_memcg(page);
}
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -37,6 +37,7 @@
#include <linux/mm.h>
#include <linux/hugetlb.h>
#include <linux/pagemap.h>
+#include <linux/pageteam.h>
#include <linux/smp.h>
#include <linux/page-flags.h>
#include <linux/backing-dev.h>
@@ -106,6 +107,7 @@ static const char * const mem_cgroup_sta
"dirty",
"writeback",
"swap",
+ "shmem_pmdmapped",
};

static const char * const mem_cgroup_events_names[] = {
@@ -4447,7 +4449,8 @@ static int mem_cgroup_move_account(struc
struct mem_cgroup *to)
{
unsigned long flags;
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ int file_mapped = 1;
int ret;
bool anon;

@@ -4471,10 +4474,10 @@ static int mem_cgroup_move_account(struc

spin_lock_irqsave(&from->move_lock, flags);

- if (!anon && page_mapped(page)) {
- __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
+ if (PageWriteback(page)) {
+ __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
nr_pages);
- __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
+ __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
nr_pages);
}

@@ -4494,11 +4497,25 @@ static int mem_cgroup_move_account(struc
}
}

- if (PageWriteback(page)) {
- __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_WRITEBACK],
- nr_pages);
- __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_WRITEBACK],
- nr_pages);
+ if (!anon && PageTeam(page)) {
+ if (page == team_head(page)) {
+ bool pmd_mapped;
+
+ count_team_pmd_mapped(page, &file_mapped, &pmd_mapped);
+ if (pmd_mapped) {
+ __this_cpu_sub(from->stat->count[
+ MEM_CGROUP_STAT_SHMEM_PMDMAPPED], HPAGE_PMD_NR);
+ __this_cpu_add(to->stat->count[
+ MEM_CGROUP_STAT_SHMEM_PMDMAPPED], HPAGE_PMD_NR);
+ }
+ }
+ }
+
+ if (!anon && page_mapped(page)) {
+ __this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
+ file_mapped);
+ __this_cpu_add(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED],
+ file_mapped);
}

/*

2016-04-05 21:48:05

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 20/31] huge tmpfs: mem_cgroup shmem_hugepages accounting

From: Andres Lagar-Cavilla <[email protected]>

Keep track of all hugepages, not just those mapped.

This has gone through several anguished iterations, memcg stats being
harder to protect against mem_cgroup_move_account() than you might
expect. Abandon the pretence that miscellaneous stats can all be
protected by the same lock_page_memcg()/unlock_page_memcg() scheme:
add mem_cgroup_update_page_stat_treelocked(), using mapping->tree_lock
for safe updates of MEM_CGROUP_STAT_SHMEM_HUGEPAGES (where tree_lock
is already held, but nests inside not outside of memcg->move_lock).

Nowadays, when mem_cgroup_move_account() takes page lock, and is only
called when immigrating pages found in page tables, it almost seems as
if this reliance on tree_lock is unnecessary. But consider the case
when the team head is pte-mapped, and being migrated to a new memcg,
racing with the last page of the team being instantiated: the page
lock is held on the page being instantiated, not on the team head,
so we do still need the tree_lock to serialize them.
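
A minimal sketch of the lock nesting relied on in the move_account path
(a fragment, not a complete function; names abbreviated from the hunks
below): move_lock is taken first with irqs disabled, and tree_lock nests
inside it just for the huge stats.

	spin_lock_irqsave(&from->move_lock, flags);
	spin_lock(&mapping->tree_lock);
	__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_SHMEM_HUGEPAGES],
		       HPAGE_PMD_NR);
	__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_SHMEM_HUGEPAGES],
		       HPAGE_PMD_NR);
	spin_unlock(&mapping->tree_lock);
	spin_unlock_irqrestore(&from->move_lock, flags);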

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/cgroup-v1/memory.txt | 2 +
Documentation/filesystems/tmpfs.txt | 8 ++++
include/linux/memcontrol.h | 10 +++++
include/linux/pageteam.h | 3 +
mm/memcontrol.c | 47 ++++++++++++++++++++++----
mm/shmem.c | 4 ++
6 files changed, 66 insertions(+), 8 deletions(-)

--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -487,6 +487,8 @@ rss - # of bytes of anonymous and swap
transparent hugepages).
rss_huge - # of bytes of anonymous transparent hugepages.
mapped_file - # of bytes of mapped file (includes tmpfs/shmem)
+shmem_hugepages - # of bytes of tmpfs huge pages completed (subset of cache)
+shmem_pmdmapped - # of bytes of tmpfs huge mapped huge (subset of mapped_file)
pgpgin - # of charging events to the memory cgroup. The charging
event happens each time a page is accounted as either mapped
anon page(RSS) or cache page(Page Cache) to the cgroup.
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -200,6 +200,14 @@ nr_shmem_hugepages 13 tmpfs huge
nr_shmem_pmdmapped 6 tmpfs hugepages with huge mappings in userspace
nr_shmem_freeholes 167861 pages reserved for team but available to shrinker

+/sys/fs/cgroup/memory/<cgroup>/memory.stat shows:
+
+shmem_hugepages 27262976 bytes tmpfs hugepage completed (subset of cache)
+shmem_pmdmapped 12582912 bytes tmpfs huge mapped huge (subset of mapped_file)
+
+Note: the individual pages of a huge team might be charged to different
+memcgs, but these counts assume that they are all charged to the same as head.
+
Author:
Christoph Rohland <[email protected]>, 1.12.01
Updated:
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -50,6 +50,8 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+ /* # of pages charged as non-disbanded huge teams */
+ MEM_CGROUP_STAT_SHMEM_HUGEPAGES,
/* # of pages charged as hugely mapped teams */
MEM_CGROUP_STAT_SHMEM_PMDMAPPED,
MEM_CGROUP_STAT_NSTATS,
@@ -491,6 +493,9 @@ static inline void mem_cgroup_update_pag
this_cpu_add(page->mem_cgroup->stat->count[idx], val);
}

+void mem_cgroup_update_page_stat_treelocked(struct page *page,
+ enum mem_cgroup_stat_index idx, int val);
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
{
@@ -706,6 +711,11 @@ static inline void mem_cgroup_update_pag
enum mem_cgroup_stat_index idx, int val)
{
}
+
+static inline void mem_cgroup_update_page_stat_treelocked(struct page *page,
+ enum mem_cgroup_stat_index idx, int val)
+{
+}

static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_stat_index idx)
--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -139,12 +139,13 @@ static inline bool dec_team_pmd_mapped(s
* needs to maintain memcg's huge tmpfs stats correctly.
*/
static inline void count_team_pmd_mapped(struct page *head, int *file_mapped,
- bool *pmd_mapped)
+ bool *pmd_mapped, bool *team_complete)
{
long team_usage;

*file_mapped = 1;
team_usage = atomic_long_read(&head->team_usage);
+ *team_complete = team_usage >= TEAM_COMPLETE;
*pmd_mapped = team_usage >= TEAM_PMD_MAPPED;
if (*pmd_mapped)
*file_mapped = HPAGE_PMD_NR - team_pte_count(team_usage);
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -107,6 +107,7 @@ static const char * const mem_cgroup_sta
"dirty",
"writeback",
"swap",
+ "shmem_hugepages",
"shmem_pmdmapped",
};

@@ -4431,6 +4432,17 @@ static struct page *mc_handle_file_pte(s
return page;
}

+void mem_cgroup_update_page_stat_treelocked(struct page *page,
+ enum mem_cgroup_stat_index idx, int val)
+{
+ /* Update this VM_BUG_ON if other cases are added */
+ VM_BUG_ON(idx != MEM_CGROUP_STAT_SHMEM_HUGEPAGES);
+ lockdep_assert_held(&page->mapping->tree_lock);
+
+ if (page->mem_cgroup)
+ __this_cpu_add(page->mem_cgroup->stat->count[idx], val);
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -4448,6 +4460,7 @@ static int mem_cgroup_move_account(struc
struct mem_cgroup *from,
struct mem_cgroup *to)
{
+ spinlock_t *tree_lock = NULL;
unsigned long flags;
int nr_pages = compound ? hpage_nr_pages(page) : 1;
int file_mapped = 1;
@@ -4487,9 +4500,9 @@ static int mem_cgroup_move_account(struc
* So mapping should be stable for dirty pages.
*/
if (!anon && PageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
+ struct address_space *mapping = page->mapping;

- if (mapping_cap_account_dirty(mapping)) {
+ if (mapping && mapping_cap_account_dirty(mapping)) {
__this_cpu_sub(from->stat->count[MEM_CGROUP_STAT_DIRTY],
nr_pages);
__this_cpu_add(to->stat->count[MEM_CGROUP_STAT_DIRTY],
@@ -4498,10 +4511,28 @@ static int mem_cgroup_move_account(struc
}

if (!anon && PageTeam(page)) {
- if (page == team_head(page)) {
- bool pmd_mapped;
+ struct address_space *mapping = page->mapping;

- count_team_pmd_mapped(page, &file_mapped, &pmd_mapped);
+ if (mapping && page == team_head(page)) {
+ bool pmd_mapped, team_complete;
+ /*
+ * We avoided taking mapping->tree_lock unnecessarily.
+ * Is it safe to take mapping->tree_lock below? Was it
+ * safe to peek at PageTeam above, without tree_lock?
+ * Yes, this is a team head, just now taken from its
+ * lru: PageTeam must already be set. And we took
+ * page lock above, so page->mapping is stable.
+ */
+ tree_lock = &mapping->tree_lock;
+ spin_lock(tree_lock);
+ count_team_pmd_mapped(page, &file_mapped, &pmd_mapped,
+ &team_complete);
+ if (team_complete) {
+ __this_cpu_sub(from->stat->count[
+ MEM_CGROUP_STAT_SHMEM_HUGEPAGES], HPAGE_PMD_NR);
+ __this_cpu_add(to->stat->count[
+ MEM_CGROUP_STAT_SHMEM_HUGEPAGES], HPAGE_PMD_NR);
+ }
if (pmd_mapped) {
__this_cpu_sub(from->stat->count[
MEM_CGROUP_STAT_SHMEM_PMDMAPPED], HPAGE_PMD_NR);
@@ -4522,10 +4553,12 @@ static int mem_cgroup_move_account(struc
* It is safe to change page->mem_cgroup here because the page
* is referenced, charged, and isolated - we can't race with
* uncharging, charging, migration, or LRU putback.
+ * Caller should have done css_get.
*/
-
- /* caller should have done css_get */
page->mem_cgroup = to;
+
+ if (tree_lock)
+ spin_unlock(tree_lock);
spin_unlock_irqrestore(&from->move_lock, flags);

ret = 0;
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -413,6 +413,8 @@ static void shmem_added_to_hugeteam(stru
&head->team_usage) >= TEAM_COMPLETE) {
shmem_clear_tag_hugehole(mapping, head->index);
__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
+ mem_cgroup_update_page_stat_treelocked(head,
+ MEM_CGROUP_STAT_SHMEM_HUGEPAGES, HPAGE_PMD_NR);
}
__dec_zone_state(zone, NR_SHMEM_FREEHOLES);
}
@@ -523,6 +525,8 @@ again2:
if (nr >= HPAGE_PMD_NR) {
ClearPageChecked(head);
__dec_zone_state(zone, NR_SHMEM_HUGEPAGES);
+ mem_cgroup_update_page_stat_treelocked(head,
+ MEM_CGROUP_STAT_SHMEM_HUGEPAGES, -HPAGE_PMD_NR);
VM_BUG_ON(nr != HPAGE_PMD_NR);
} else if (nr) {
shmem_clear_tag_hugehole(mapping, head->index);

2016-04-05 21:49:53

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 21/31] huge tmpfs: show page team flag in pageflags

From: Andres Lagar-Cavilla <[email protected]>

For debugging and testing.

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
This patchset has been based on v4.6-rc2: here we get a clash with
the addition of KPF_MOVABLE in current mmotm, not hard to fix up.

Documentation/vm/pagemap.txt | 2 ++
fs/proc/page.c | 6 ++++++
include/uapi/linux/kernel-page-flags.h | 3 ++-
tools/vm/page-types.c | 2 ++
4 files changed, 12 insertions(+), 1 deletion(-)

--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -71,6 +71,8 @@ There are four components to pagemap:
23. BALLOON
24. ZERO_PAGE
25. IDLE
+ 26. TEAM
+ 27. TEAM_PMD_MMAP (only if the whole team is mapped as a pmd at least once)

* /proc/kpagecgroup. This file contains a 64-bit inode number of the
memory cgroup each page is charged to, indexed by PFN. Only available when
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -12,6 +12,7 @@
#include <linux/memcontrol.h>
#include <linux/mmu_notifier.h>
#include <linux/page_idle.h>
+#include <linux/pageteam.h>
#include <linux/kernel-page-flags.h>
#include <asm/uaccess.h>
#include "internal.h"
@@ -112,6 +113,11 @@ u64 stable_page_flags(struct page *page)
if (PageKsm(page))
u |= 1 << KPF_KSM;

+ if (PageTeam(page)) {
+ u |= 1 << KPF_TEAM;
+ if (page == team_head(page) && team_pmd_mapped(page))
+ u |= 1 << KPF_TEAM_PMD_MMAP;
+ }
/*
* compound pages: export both head/tail info
* they together define a compound page's start/end pos and order
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -34,6 +34,7 @@
#define KPF_BALLOON 23
#define KPF_ZERO_PAGE 24
#define KPF_IDLE 25
-
+#define KPF_TEAM 26
+#define KPF_TEAM_PMD_MMAP 27

#endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -133,6 +133,8 @@ static const char * const page_flag_name
[KPF_BALLOON] = "o:balloon",
[KPF_ZERO_PAGE] = "z:zero_page",
[KPF_IDLE] = "i:idle_page",
+ [KPF_TEAM] = "y:team",
+ [KPF_TEAM_PMD_MMAP] = "Y:team_pmd_mmap",

[KPF_RESERVED] = "r:reserved",
[KPF_MLOCKED] = "m:mlocked",

2016-04-05 21:51:43

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 22/31] huge tmpfs: /proc/<pid>/smaps show ShmemHugePages

We have been relying on the AnonHugePages line of /proc/<pid>/smaps
for informal visibility of huge tmpfs mappings by a process. It's
been good enough, but rather tacky, and best fixed before wider use.

Now reserve AnonHugePages for anonymous THP, and use ShmemHugePages
for huge tmpfs. There is a good argument for calling it ShmemPmdMapped
instead (pte mappings of team pages won't be included in this count),
and I wouldn't mind changing to that; but smaps is all about the mapped,
and I think ShmemHugePages is more what people would expect to see here.
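
As a rough illustration only (not part of the patch, and just a userspace
sketch assuming the "ShmemHugePages: <n> kB" format added below), a
process can total the new field across its own mappings like this:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/smaps", "r");
	char line[256];
	unsigned long kb, total = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* Sum ShmemHugePages over all mappings of this process */
		if (sscanf(line, "ShmemHugePages: %lu kB", &kb) == 1)
			total += kb;
	}
	fclose(f);
	printf("huge tmpfs mapped by pmd: %lu kB\n", total);
	return 0;
}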

Add a team_page_mapcount() function to help get the PSS accounting right,
now that compound pages are accounting correctly for ptes inside pmds;
but nothing else needs that function, so keep it out of page_mapcount().

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/proc.txt | 10 +++++---
Documentation/filesystems/tmpfs.txt | 4 +++
fs/proc/task_mmu.c | 28 ++++++++++++++++--------
include/linux/pageteam.h | 30 ++++++++++++++++++++++++++
4 files changed, 59 insertions(+), 13 deletions(-)

--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -435,6 +435,7 @@ Private_Dirty: 0 kB
Referenced: 892 kB
Anonymous: 0 kB
AnonHugePages: 0 kB
+ShmemHugePages: 0 kB
Shared_Hugetlb: 0 kB
Private_Hugetlb: 0 kB
Swap: 0 kB
@@ -462,10 +463,11 @@ accessed.
"Anonymous" shows the amount of memory that does not belong to any file. Even
a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE
and a page is modified, the file page is replaced by a private anonymous copy.
-"AnonHugePages" shows the ammount of memory backed by transparent hugepage.
-"Shared_Hugetlb" and "Private_Hugetlb" show the ammounts of memory backed by
-hugetlbfs page which is *not* counted in "RSS" or "PSS" field for historical
-reasons. And these are not included in {Shared,Private}_{Clean,Dirty} field.
+"AnonHugePages" shows how much of Anonymous is in Transparent Huge Pages, and
+"ShmemHugePages" shows how much of Rss is from huge tmpfs pages mapped by pmd.
+"Shared_Hugetlb" and "Private_Hugetlb" show the amounts of memory backed by
+hugetlbfs pages: which are not counted in "Rss" or "Pss" fields for historical
+reasons; nor are they included in the {Shared,Private}_{Clean,Dirty} fields.
"Swap" shows how much would-be-anonymous memory is also used, but out on swap.
For shmem mappings, "Swap" includes also the size of the mapped (and not
replaced by copy-on-write) part of the underlying shmem object out on swap.
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -186,6 +186,10 @@ In addition to 0 and 1, it also accepts
automatically on for all tmpfs mounts (intended for testing), or -1
to force huge off for all (intended for safety if bugs appeared).

+/proc/<pid>/smaps shows:
+
+ShmemHugePages: 10240 kB tmpfs hugepages mapped by pmd into this region
+
/proc/meminfo, /sys/devices/system/node/nodeN/meminfo show:

Shmem: 35016 kB total shmem/tmpfs memory (subset of Cached)
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -14,6 +14,7 @@
#include <linux/swapops.h>
#include <linux/mmu_notifier.h>
#include <linux/page_idle.h>
+#include <linux/pageteam.h>
#include <linux/shmem_fs.h>

#include <asm/elf.h>
@@ -448,6 +449,7 @@ struct mem_size_stats {
unsigned long referenced;
unsigned long anonymous;
unsigned long anonymous_thp;
+ unsigned long shmem_huge;
unsigned long swap;
unsigned long shared_hugetlb;
unsigned long private_hugetlb;
@@ -457,13 +459,19 @@ struct mem_size_stats {
};

static void smaps_account(struct mem_size_stats *mss, struct page *page,
- bool compound, bool young, bool dirty)
+ unsigned long size, bool young, bool dirty)
{
- int i, nr = compound ? 1 << compound_order(page) : 1;
- unsigned long size = nr * PAGE_SIZE;
+ int nr = size / PAGE_SIZE;
+ int i;

- if (PageAnon(page))
+ if (PageAnon(page)) {
mss->anonymous += size;
+ if (size > PAGE_SIZE)
+ mss->anonymous_thp += size;
+ } else {
+ if (size > PAGE_SIZE)
+ mss->shmem_huge += size;
+ }

mss->resident += size;
/* Accumulate the size in pages that have been accessed. */
@@ -473,7 +481,7 @@ static void smaps_account(struct mem_siz
/*
* page_count(page) == 1 guarantees the page is mapped exactly once.
* If any subpage of the compound page mapped with PTE it would elevate
- * page_count().
+ * page_count(). (This condition is never true of mapped pagecache.)
*/
if (page_count(page) == 1) {
if (dirty || PageDirty(page))
@@ -485,7 +493,7 @@ static void smaps_account(struct mem_siz
}

for (i = 0; i < nr; i++, page++) {
- int mapcount = page_mapcount(page);
+ int mapcount = team_page_mapcount(page);

if (mapcount >= 2) {
if (dirty || PageDirty(page))
@@ -561,7 +569,7 @@ static void smaps_pte_entry(pte_t *pte,
if (!page)
return;

- smaps_account(mss, page, false, pte_young(*pte), pte_dirty(*pte));
+ smaps_account(mss, page, PAGE_SIZE, pte_young(*pte), pte_dirty(*pte));
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
@@ -576,8 +584,8 @@ static void smaps_pmd_entry(pmd_t *pmd,
page = follow_trans_huge_pmd(vma, addr, pmd, FOLL_DUMP);
if (IS_ERR_OR_NULL(page))
return;
- mss->anonymous_thp += HPAGE_PMD_SIZE;
- smaps_account(mss, page, true, pmd_young(*pmd), pmd_dirty(*pmd));
+ smaps_account(mss, page, HPAGE_PMD_SIZE,
+ pmd_young(*pmd), pmd_dirty(*pmd));
}
#else
static void smaps_pmd_entry(pmd_t *pmd, unsigned long addr,
@@ -770,6 +778,7 @@ static int show_smap(struct seq_file *m,
"Referenced: %8lu kB\n"
"Anonymous: %8lu kB\n"
"AnonHugePages: %8lu kB\n"
+ "ShmemHugePages: %8lu kB\n"
"Shared_Hugetlb: %8lu kB\n"
"Private_Hugetlb: %7lu kB\n"
"Swap: %8lu kB\n"
@@ -787,6 +796,7 @@ static int show_smap(struct seq_file *m,
mss.referenced >> 10,
mss.anonymous >> 10,
mss.anonymous_thp >> 10,
+ mss.shmem_huge >> 10,
mss.shared_hugetlb >> 10,
mss.private_hugetlb >> 10,
mss.swap >> 10,
--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -152,6 +152,36 @@ static inline void count_team_pmd_mapped
}

/*
+ * Slightly misnamed, team_page_mapcount() returns the number of times
+ * any page is mapped into userspace, either by pte or covered by pmd:
+ * it is a generalization of page_mapcount() to include the case of a
+ * team page. We don't complicate page_mapcount() itself in this way,
+ * because almost nothing needs this number: only smaps accounting PSS.
+ * If something else wants it, we might have to worry more about races.
+ */
+static inline int team_page_mapcount(struct page *page)
+{
+ struct page *head;
+ long team_usage;
+ int mapcount;
+
+ mapcount = page_mapcount(page);
+ if (!PageTeam(page))
+ return mapcount;
+ head = team_head(page);
+ /* We always page_add_file_rmap to head when we page_add_team_rmap */
+ if (page == head)
+ return mapcount;
+
+ team_usage = atomic_long_read(&head->team_usage) - TEAM_COMPLETE;
+ /* Beware racing shmem_disband_hugehead() and add_to_swap_cache() */
+ smp_rmb();
+ if (PageTeam(head) && team_usage > 0)
+ mapcount += team_usage / TEAM_MAPPING_COUNTER;
+ return mapcount;
+}
+
+/*
* Returns true if this pte mapping is of a non-team page, or of a team page not
* covered by an existing huge pmd mapping: whereupon stats need to be updated.
* Only called when mapcount goes up from 0 to 1 i.e. _mapcount from -1 to 0.

2016-04-05 21:53:38

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

Huge tmpfs is great when you're allocated a huge page from the start;
but not much use if there was a shortage of huge pages at that time,
or your huge pages were disbanded and swapped out under pressure, and
now paged back in 4k pieces once the pressure has eased. At present
the best you can do is copy your original file, and start afresh on
the unfragmented copy; but we do need a better answer.

The approach taken here is driven from page fault: assembling a huge
page from existing pieces is more expensive than initial allocation
from an empty huge page, and the work done is quite likely to be wasted,
unless there's some evidence that a huge TLB mapping will be useful
to the process. A page fault in a suitable area suggests that it may.

So we adjust the original "Shall we map a huge page hugely?" tests in
shmem_fault(), to distinguish what can be done on this occasion from
what may be possible later: invoking shmem_huge_recovery() when we
cannot map a huge page now, but might be able to use one later.

It's likely that this is over-eager, that it needs some rate-limiting,
and should be tuned by the number of faults which occur in the extent.
Such information will have to be stored somewhere: probably in the
extent's recovery work struct; but no attempt is made to do so in this series.

So as not to add latency to the fault, shmem_huge_recovery() just
enqueues a work item - with no consideration for all the flavors
of workqueue that might be used: would something special be better?

But it skips the enqueue if this range of the file is already on the queue
(which is both more efficient, and avoids awkward races later),
or if too many items are currently enqueued. "Too many" defaults
to more than 8, tunable via /proc/sys/vm/shmem_huge_recoveries -
which seems more appropriate than adding it into the huge=N mount option.
Why 8? Well, anon THP's khugepaged is equivalent to 1, but work
queues let us be less restrictive. Initializing or tuning it to 0
completely disables huge tmpfs recovery.
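
As a hypothetical userspace sketch (assuming the sysctl lands exactly as
described above), a latency-sensitive test might switch recovery off and
then restore the default afterwards:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a value to the proposed sysctl; returns 0 on success */
static int set_recoveries(const char *val)
{
	int fd = open("/proc/sys/vm/shmem_huge_recoveries", O_WRONLY);
	ssize_t n = -1;

	if (fd >= 0) {
		n = write(fd, val, strlen(val));
		close(fd);
	}
	return n < 0 ? -1 : 0;
}

int main(void)
{
	if (set_recoveries("0"))	/* disable recovery work items */
		perror("shmem_huge_recoveries");
	/* ... run the latency-sensitive workload here ... */
	if (set_recoveries("8"))	/* restore the default of 8 */
		perror("shmem_huge_recoveries");
	return 0;
}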

shmem_recovery_work() is where the huge page is allocated - using
__alloc_pages_node() rather than alloc_pages_vma(), like anon THP
does nowadays: ignoring vma mempol complications for now, though
I'm sure our NUMA behavior here will need to be improved very soon.
The population and remap phases are left as stubs in this framework commit.

But a fresh huge page is not necessarily allocated: page migration
is never sure to succeed, so it's wiser to let a work item resume
on a huge page begun by an earlier one, than to re-migrate all the
pages instantiated so far to yet another huge page.  Sometimes
an unfinished huge page can be easily recognized by PageTeam;
but sometimes it has to be located, by the same SHMEM_TAG_HUGEHOLE
mechanism that exposes it to the hugehole shrinker. Clear the tag
to prevent the shrinker from interfering (unexpectedly disbanding)
while in shmem_populate_hugeteam() itself.

If shmem_huge_recoveries is enabled, shmem_alloc_page()'s retry
after shrinking is disabled: in early testing, the shrinker was
too eager to undo the work of recovery. That was probably a
side-effect of bugs at that time, but it still seems right to
reduce the latency of shmem_fault() when it has a second chance.

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 12 +
Documentation/sysctl/vm.txt | 9 +
include/linux/shmem_fs.h | 2
kernel/sysctl.c | 7
mm/shmem.c | 233 +++++++++++++++++++++++++-
5 files changed, 256 insertions(+), 7 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -176,6 +176,12 @@ user accesses between end of file and th
not fail with SIGBUS, as they would on a huge=0 filesystem - but will fail
with SIGBUS if the kernel could only allocate small pages to back it.

+When memory pressure eases, or compaction repairs memory fragmentation,
+huge tmpfs recovery attempts to restore the original performance with
+hugepages: as small pages are faulted back in, a workitem is queued to
+bring the remainder back from swap, and migrate small pages into place,
+before remapping the completed hugepage with a pmd.
+
/proc/sys/vm/shmem_huge (intended for experimentation only):

Default 0; write 1 to set tmpfs mount option huge=1 on the kernel's
@@ -186,6 +192,12 @@ In addition to 0 and 1, it also accepts
automatically on for all tmpfs mounts (intended for testing), or -1
to force huge off for all (intended for safety if bugs appeared).

+/proc/sys/vm/shmem_huge_recoveries:
+
+Default 8, allows up to 8 concurrent workitems, recovering hugepages
+after fragmentation prevented or reclaim disbanded; write 0 to disable
+huge recoveries, or a higher number to allow more concurrent recoveries.
+
/proc/<pid>/smaps shows:

ShmemHugePages: 10240 kB tmpfs hugepages mapped by pmd into this region
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -57,6 +57,7 @@ Currently, these files are in /proc/sys/
- panic_on_oom
- percpu_pagelist_fraction
- shmem_huge
+- shmem_huge_recoveries
- stat_interval
- stat_refresh
- swappiness
@@ -764,6 +765,14 @@ See Documentation/filesystems/tmpfs.txt

==============================================================

+shmem_huge_recoveries
+
+Default 8, allows up to 8 concurrent workitems, recovering hugepages
+after fragmentation prevented or reclaim disbanded; write 0 to disable
+huge recoveries, or a higher number to allow more concurrent recoveries.
+
+==============================================================
+
stat_interval

The time interval between which vm statistics are updated. The default
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -20,6 +20,7 @@ struct shmem_inode_info {
struct list_head swaplist; /* chain of maybes on swap */
struct shared_policy policy; /* NUMA memory alloc policy */
struct simple_xattrs xattrs; /* list of xattrs */
+ atomic_t recoveries; /* huge recovery work queued */
struct inode vfs_inode;
};

@@ -87,6 +88,7 @@ static inline long shmem_fcntl(struct fi
# ifdef CONFIG_SYSCTL
struct ctl_table;
extern int shmem_huge, shmem_huge_min, shmem_huge_max;
+extern int shmem_huge_recoveries;
extern int shmem_huge_sysctl(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
# endif /* CONFIG_SYSCTL */
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1324,6 +1324,13 @@ static struct ctl_table vm_table[] = {
.extra1 = &shmem_huge_min,
.extra2 = &shmem_huge_max,
},
+ {
+ .procname = "shmem_huge_recoveries",
+ .data = &shmem_huge_recoveries,
+ .maxlen = sizeof(shmem_huge_recoveries),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif
#ifdef CONFIG_HUGETLB_PAGE
{
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -59,6 +59,7 @@ static struct vfsmount *shm_mnt;
#include <linux/splice.h>
#include <linux/security.h>
#include <linux/shrinker.h>
+#include <linux/workqueue.h>
#include <linux/sysctl.h>
#include <linux/swapops.h>
#include <linux/pageteam.h>
@@ -319,6 +320,7 @@ static DEFINE_SPINLOCK(shmem_shrinklist_
/* ifdef here to avoid bloating shmem.o when not necessary */

int shmem_huge __read_mostly;
+int shmem_huge_recoveries __read_mostly = 8; /* concurrent recovery limit */

static struct page *shmem_hugeteam_lookup(struct address_space *mapping,
pgoff_t index, bool speculative)
@@ -377,8 +379,8 @@ static int shmem_freeholes(struct page *
HPAGE_PMD_NR - (nr / TEAM_PAGE_COUNTER);
}

-static void shmem_clear_tag_hugehole(struct address_space *mapping,
- pgoff_t index)
+static struct page *shmem_clear_tag_hugehole(struct address_space *mapping,
+ pgoff_t index)
{
struct page *page = NULL;

@@ -391,9 +393,13 @@ static void shmem_clear_tag_hugehole(str
*/
radix_tree_gang_lookup_tag(&mapping->page_tree, (void **)&page,
index, 1, SHMEM_TAG_HUGEHOLE);
- VM_BUG_ON(!page || page->index >= index + HPAGE_PMD_NR);
- radix_tree_tag_clear(&mapping->page_tree, page->index,
+ VM_BUG_ON(radix_tree_exception(page));
+ if (page && page->index < index + HPAGE_PMD_NR) {
+ radix_tree_tag_clear(&mapping->page_tree, page->index,
SHMEM_TAG_HUGEHOLE);
+ return page;
+ }
+ return NULL;
}

static void shmem_added_to_hugeteam(struct page *page, struct zone *zone,
@@ -748,6 +754,190 @@ static void shmem_disband_hugeteam(struc
preempt_enable();
}

+static LIST_HEAD(shmem_recoverylist);
+static unsigned int shmem_recoverylist_depth;
+static DEFINE_SPINLOCK(shmem_recoverylist_lock);
+
+struct recovery {
+ struct list_head list;
+ struct work_struct work;
+ struct mm_struct *mm;
+ struct inode *inode;
+ struct page *page;
+ pgoff_t head_index;
+};
+
+#define shr_stats(x) do {} while (0)
+/* Stats implemented in a later patch */
+
+static bool shmem_work_still_useful(struct recovery *recovery)
+{
+ struct address_space *mapping = READ_ONCE(recovery->page->mapping);
+
+ return mapping && /* page is not yet truncated */
+#ifdef CONFIG_MEMCG
+ recovery->mm->owner && /* mm can still charge memcg */
+#else
+ atomic_read(&recovery->mm->mm_users) && /* mm still has users */
+#endif
+ !RB_EMPTY_ROOT(&mapping->i_mmap); /* file is still mapped */
+}
+
+static int shmem_recovery_populate(struct recovery *recovery, struct page *head)
+{
+ /* Huge page has been split but is not yet PageTeam */
+ shmem_disband_hugetails(head, NULL, 0);
+ return -ENOENT;
+}
+
+static void shmem_recovery_remap(struct recovery *recovery, struct page *head)
+{
+}
+
+static void shmem_recovery_work(struct work_struct *work)
+{
+ struct recovery *recovery;
+ struct shmem_inode_info *info;
+ struct address_space *mapping;
+ struct page *page;
+ struct page *head = NULL;
+ int error = -ENOENT;
+
+ recovery = container_of(work, struct recovery, work);
+ info = SHMEM_I(recovery->inode);
+ if (!shmem_work_still_useful(recovery)) {
+ shr_stats(work_too_late);
+ goto out;
+ }
+
+ /* Are we resuming from an earlier partially successful attempt? */
+ mapping = recovery->inode->i_mapping;
+ spin_lock_irq(&mapping->tree_lock);
+ page = shmem_clear_tag_hugehole(mapping, recovery->head_index);
+ if (page)
+ head = team_head(page);
+ spin_unlock_irq(&mapping->tree_lock);
+ if (head) {
+ /* Serialize with shrinker so it won't mess with our range */
+ spin_lock(&shmem_shrinklist_lock);
+ spin_unlock(&shmem_shrinklist_lock);
+ }
+
+ /* If team is now complete, no tag and head would be found above */
+ page = recovery->page;
+ if (PageTeam(page))
+ head = team_head(page);
+
+ /* Get a reference to the head of the team already being assembled */
+ if (head) {
+ if (!get_page_unless_zero(head))
+ head = NULL;
+ else if (!PageTeam(head) || head->mapping != mapping ||
+ head->index != recovery->head_index) {
+ put_page(head);
+ head = NULL;
+ }
+ }
+
+ if (head) {
+ /* We are resuming work from a previous partial recovery */
+ if (PageTeam(page))
+ shr_stats(resume_teamed);
+ else
+ shr_stats(resume_tagged);
+ } else {
+ gfp_t gfp = mapping_gfp_mask(mapping);
+ /*
+ * XXX: Note that with swapin readahead, page_to_nid(page) will
+ * often choose an unsuitable NUMA node: something to fix soon,
+ * but not an immediate blocker.
+ */
+ head = __alloc_pages_node(page_to_nid(page),
+ gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);
+ if (!head) {
+ shr_stats(huge_failed);
+ error = -ENOMEM;
+ goto out;
+ }
+ if (!shmem_work_still_useful(recovery)) {
+ __free_pages(head, HPAGE_PMD_ORDER);
+ shr_stats(huge_too_late);
+ goto out;
+ }
+ split_page(head, HPAGE_PMD_ORDER);
+ get_page(head);
+ shr_stats(huge_alloced);
+ }
+
+ put_page(page); /* before trying to migrate it */
+ recovery->page = head; /* to put at out */
+
+ error = shmem_recovery_populate(recovery, head);
+ if (!error)
+ shmem_recovery_remap(recovery, head);
+out:
+ put_page(recovery->page);
+ /* Let shmem_evict_inode proceed towards freeing it */
+ if (atomic_dec_and_test(&info->recoveries))
+ wake_up_atomic_t(&info->recoveries);
+ mmdrop(recovery->mm);
+
+ spin_lock(&shmem_recoverylist_lock);
+ shmem_recoverylist_depth--;
+ list_del(&recovery->list);
+ spin_unlock(&shmem_recoverylist_lock);
+ kfree(recovery);
+}
+
+static void shmem_huge_recovery(struct inode *inode, struct page *page,
+ struct vm_area_struct *vma)
+{
+ struct recovery *recovery;
+ struct recovery *r;
+
+ /* Limit the outstanding work somewhat; but okay to overshoot */
+ if (shmem_recoverylist_depth >= shmem_huge_recoveries) {
+ shr_stats(work_too_many);
+ return;
+ }
+ recovery = kmalloc(sizeof(*recovery), GFP_KERNEL);
+ if (!recovery)
+ return;
+
+ recovery->mm = vma->vm_mm;
+ recovery->inode = inode;
+ recovery->page = page;
+ recovery->head_index = round_down(page->index, HPAGE_PMD_NR);
+
+ spin_lock(&shmem_recoverylist_lock);
+ list_for_each_entry(r, &shmem_recoverylist, list) {
+ /* Is someone already working on this extent? */
+ if (r->inode == inode &&
+ r->head_index == recovery->head_index) {
+ spin_unlock(&shmem_recoverylist_lock);
+ kfree(recovery);
+ shr_stats(work_already);
+ return;
+ }
+ }
+ list_add(&recovery->list, &shmem_recoverylist);
+ shmem_recoverylist_depth++;
+ spin_unlock(&shmem_recoverylist_lock);
+
+ /*
+ * It's safe to leave inc'ing these reference counts until after
+ * dropping the list lock above, because the corresponding decs
+ * cannot happen until the work is run, and we queue it below.
+ */
+ atomic_inc(&recovery->mm->mm_count);
+ atomic_inc(&SHMEM_I(inode)->recoveries);
+ get_page(page);
+
+ INIT_WORK(&recovery->work, shmem_recovery_work);
+ schedule_work(&recovery->work);
+ shr_stats(work_queued);
+}
+
static struct page *shmem_get_hugehole(struct address_space *mapping,
unsigned long *index)
{
@@ -998,6 +1188,8 @@ static struct shrinker shmem_hugehole_sh
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */

#define shmem_huge SHMEM_HUGE_DENY
+#define shmem_huge_recoveries 0
+#define shr_stats(x) do {} while (0)

static inline struct page *shmem_hugeteam_lookup(struct address_space *mapping,
pgoff_t index, bool speculative)
@@ -1022,6 +1214,11 @@ static inline int shmem_populate_hugetea
return -EAGAIN;
}

+static inline void shmem_huge_recovery(struct inode *inode,
+ struct page *page, struct vm_area_struct *vma)
+{
+}
+
static inline unsigned long shmem_shrink_hugehole(struct shrinker *shrink,
struct shrink_control *sc)
{
@@ -1505,6 +1702,12 @@ static int shmem_setattr(struct dentry *
return error;
}

+static int shmem_wait_on_atomic_t(atomic_t *atomic)
+{
+ schedule();
+ return 0;
+}
+
static void shmem_evict_inode(struct inode *inode)
{
struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1526,6 +1729,9 @@ static void shmem_evict_inode(struct ino
list_del_init(&info->swaplist);
mutex_unlock(&shmem_swaplist_mutex);
}
+ /* Stop inode from being freed while recovery is in progress */
+ wait_on_atomic_t(&info->recoveries, shmem_wait_on_atomic_t,
+ TASK_UNINTERRUPTIBLE);
}

simple_xattrs_free(&info->xattrs);
@@ -1879,7 +2085,8 @@ static struct page *shmem_alloc_page(gfp
head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
true);
- if (!head &&
+ /* Shrink and retry? Or leave it to recovery worker */
+ if (!head && !shmem_huge_recoveries &&
shmem_shrink_hugehole(NULL, NULL) != SHRINK_STOP) {
head = alloc_pages_vma(
gfp|__GFP_NORETRY|__GFP_NOWARN,
@@ -2377,9 +2584,9 @@ single:
*/
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return ret;
- if (!(vmf->flags & FAULT_FLAG_MAY_HUGE))
+ if (shmem_huge == SHMEM_HUGE_DENY)
return ret;
- if (!PageTeam(vmf->page))
+ if (shmem_huge != SHMEM_HUGE_FORCE && !SHMEM_SB(inode->i_sb)->huge)
return ret;
if (once++)
return ret;
@@ -2393,6 +2600,17 @@ single:
return ret;
/* But omit i_size check: allow up to huge page boundary */

+ if (!PageTeam(vmf->page) || !(vmf->flags & FAULT_FLAG_MAY_HUGE)) {
+ /*
+ * XXX: Need to add check for unobstructed pmd
+ * (no anon or swap), and per-pmd ratelimiting.
+ * Use anon_vma as over-strict hint of COWed pages.
+ */
+ if (shmem_huge_recoveries && !vma->anon_vma)
+ shmem_huge_recovery(inode, vmf->page, vma);
+ return ret;
+ }
+
head = team_head(vmf->page);
if (!get_page_unless_zero(head))
return ret;
@@ -2580,6 +2798,7 @@ static struct inode *shmem_get_inode(str
info = SHMEM_I(inode);
memset(info, 0, (char *)inode - (char *)info);
spin_lock_init(&info->lock);
+ atomic_set(&info->recoveries, 0);
info->seals = F_SEAL_SEAL;
info->flags = flags & VM_NORESERVE;
INIT_LIST_HEAD(&info->shrinklist);

2016-04-05 21:55:04

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 24/31] huge tmpfs recovery: shmem_recovery_populate to fill huge page

The outline of shmem_recovery_populate() is straightforward: a loop
applying shmem_getpage_gfp() to each offset that belongs to the extent,
converting swapcache to pagecache, checking existing pagecache, or
allocating new pagecache; adding each correctly placed page into the
team, or isolating each misplaced page, for passing to migrate_pages()
at the end of the loop.  The loop is repeated (skipping quickly over
the pages resolved in the previous pass) to add in the pages just
migrated, until the team is complete, or cannot be completed.

But the details are difficult: not so much an architected design,
as a series of improvisations arrived at by trial and much error,
which in the end works well. It has to cope with a variety of races:
pages being concurrently created or swapped in or disbanded or deleted
by other actors.

Most awkward is the handling of the head page of the team: which,
as usual, needs PageTeam and mapping and index set even before it
is instantiated (so that shmem_hugeteam_lookup() can validate tails
against it); but must not be confused with an instantiated team page
until PageSwapBacked is set. This awkwardness is compounded by the
(unlocked) interval between when migrate_pages() migrates an old page
into its new team location, and the next repeat of the loop which fixes
the new location as PageTeam. Yet migrate_page_move_mapping() will have
already composed PageSwapBacked (from the old page) with PageTeam (from
the new) in the case of the team head: an "account_head" flag is used to
track this case correctly, but a later patch offers a tighter alternative
which removes the need for it.

That interval between migration and enteamment also involves giving
up on the SHMEM_RETRY_HUGE_PAGE option from shmem_hugeteam_lookup():
SHMEM_ALLOC_SMALL_PAGE is not ideal, but will do for now; and the later
patch can restore the SHMEM_RETRY_HUGE_PAGE optimization.

Note: this series was originally written with a swapin pass before
population, whereas this commit simply lets shmem_getpage_gfp() do the
swapin synchronously. The swapin pass is reintroduced as an optimization
afterwards, but some comments on swap in this commit may anticipate that:
sorry, a precise sequence of developing comments would have been too much trouble.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/migrate.h | 1
include/trace/events/migrate.h | 3
mm/migrate.c | 15 +
mm/shmem.c | 339 +++++++++++++++++++++++++++++--
4 files changed, 338 insertions(+), 20 deletions(-)

--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -25,6 +25,7 @@ enum migrate_reason {
MR_NUMA_MISPLACED,
MR_CMA,
MR_SHMEM_HUGEHOLE,
+ MR_SHMEM_RECOVERY,
MR_TYPES
};

--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -20,7 +20,8 @@
EM( MR_MEMPOLICY_MBIND, "mempolicy_mbind") \
EM( MR_NUMA_MISPLACED, "numa_misplaced") \
EM( MR_CMA, "cma") \
- EMe(MR_SHMEM_HUGEHOLE, "shmem_hugehole")
+ EM( MR_SHMEM_HUGEHOLE, "shmem_hugehole") \
+ EMe(MR_SHMEM_RECOVERY, "shmem_recovery")

/*
* First define the enums in the above macros to be exported to userspace
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -786,7 +786,7 @@ static int move_to_new_page(struct page
}

static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, enum migrate_mode mode)
+ int force, enum migrate_mode mode, enum migrate_reason reason)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
@@ -815,6 +815,17 @@ static int __unmap_and_move(struct page
lock_page(page);
}

+ /*
+ * huge tmpfs recovery: must not proceed if page has been truncated,
+ * because the newpage we are about to migrate into *might* then be
+ * already in use, on lru, with data newly written for that offset.
+ * We can only be sure of this check once we have the page locked.
+ */
+ if (reason == MR_SHMEM_RECOVERY && !page->mapping) {
+ rc = -ENOMEM; /* quit migrate_pages() immediately */
+ goto out_unlock;
+ }
+
if (PageWriteback(page)) {
/*
* Only in the case of a full synchronous migration is it
@@ -962,7 +973,7 @@ static ICE_noinline int unmap_and_move(n
goto out;
}

- rc = __unmap_and_move(page, newpage, force, mode);
+ rc = __unmap_and_move(page, newpage, force, mode, reason);
if (rc == MIGRATEPAGE_SUCCESS) {
put_new_page = NULL;
set_page_owner_migrate_reason(newpage, reason);
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -60,6 +60,7 @@ static struct vfsmount *shm_mnt;
#include <linux/security.h>
#include <linux/shrinker.h>
#include <linux/workqueue.h>
+#include <linux/rmap.h>
#include <linux/sysctl.h>
#include <linux/swapops.h>
#include <linux/pageteam.h>
@@ -305,7 +306,6 @@ static bool shmem_confirm_swap(struct ad
/* hugehint values: NULL to choose a small page always */
#define SHMEM_ALLOC_SMALL_PAGE ((struct page *)1)
#define SHMEM_ALLOC_HUGE_PAGE ((struct page *)2)
-#define SHMEM_RETRY_HUGE_PAGE ((struct page *)3)
/* otherwise hugehint is the hugeteam page to be used */

/* tag for shrinker to locate unfilled hugepages */
@@ -368,6 +368,20 @@ restart:
put_page(page);
return SHMEM_ALLOC_SMALL_PAGE;
}
+ if (PageSwapBacked(page)) {
+ if (speculative)
+ put_page(page);
+ /*
+ * This is very often a case of two tasks racing to instantiate
+ * the same hole in the huge page, and we don't particularly
+ * want to allocate a small page. But holepunch racing with
+ * recovery migration, in between migrating to the page and
+ * marking it team, can leave a PageSwapBacked NULL mapping
+ * page here which we should avoid, and this is the easiest
+ * way to handle all the cases correctly.
+ */
+ return SHMEM_ALLOC_SMALL_PAGE;
+ }
return page;
}

@@ -407,16 +421,18 @@ static void shmem_added_to_hugeteam(stru
{
struct address_space *mapping = page->mapping;
struct page *head = team_head(page);
+ long team_usage;

- if (hugehint == SHMEM_ALLOC_HUGE_PAGE) {
- atomic_long_set(&head->team_usage,
- TEAM_PAGE_COUNTER + TEAM_LRU_WEIGHT_ONE);
- radix_tree_tag_set(&mapping->page_tree, page->index,
+ VM_BUG_ON_PAGE(!PageTeam(page), page);
+ team_usage = atomic_long_add_return(TEAM_PAGE_COUNTER,
+ &head->team_usage);
+ if (team_usage < TEAM_PAGE_COUNTER + TEAM_PAGE_COUNTER) {
+ if (hugehint == SHMEM_ALLOC_HUGE_PAGE)
+ radix_tree_tag_set(&mapping->page_tree, page->index,
SHMEM_TAG_HUGEHOLE);
__mod_zone_page_state(zone, NR_SHMEM_FREEHOLES, HPAGE_PMD_NR-1);
} else {
- if (atomic_long_add_return(TEAM_PAGE_COUNTER,
- &head->team_usage) >= TEAM_COMPLETE) {
+ if (team_usage >= TEAM_COMPLETE) {
shmem_clear_tag_hugehole(mapping, head->index);
__inc_zone_state(zone, NR_SHMEM_HUGEPAGES);
mem_cgroup_update_page_stat_treelocked(head,
@@ -644,6 +660,8 @@ static void shmem_disband_hugetails(stru
while (++page < endpage) {
if (PageTeam(page))
ClearPageTeam(page);
+ else if (PageSwapBacked(page)) /* half recovered */
+ put_page(page);
else if (put_page_testzero(page))
free_hot_cold_page(page, 1);
}
@@ -765,9 +783,12 @@ struct recovery {
struct inode *inode;
struct page *page;
pgoff_t head_index;
+ struct page *migrated_head;
+ bool exposed_team;
};

#define shr_stats(x) do {} while (0)
+#define shr_stats_add(x, n) do {} while (0)
/* Stats implemented in a later patch */

static bool shmem_work_still_useful(struct recovery *recovery)
@@ -783,11 +804,295 @@ static bool shmem_work_still_useful(stru
!RB_EMPTY_ROOT(&mapping->i_mmap); /* file is still mapped */
}

+static struct page *shmem_get_recovery_page(struct page *page,
+ unsigned long private, int **result)
+{
+ struct recovery *recovery = (struct recovery *)private;
+ struct page *head = recovery->page;
+ struct page *newpage = head + (page->index & (HPAGE_PMD_NR-1));
+
+ /* Increment refcount to match other routes through recovery_populate */
+ if (!get_page_unless_zero(newpage))
+ return NULL;
+ if (!PageTeam(head)) {
+ put_page(newpage);
+ return NULL;
+ }
+ /* Note when migrating to head: tricky case because already PageTeam */
+ if (newpage == head)
+ recovery->migrated_head = head;
+ return newpage;
+}
+
+static void shmem_put_recovery_page(struct page *newpage, unsigned long private)
+{
+ struct recovery *recovery = (struct recovery *)private;
+
+ /* Must reset migrated_head if in the end it was not used */
+ if (recovery->migrated_head == newpage)
+ recovery->migrated_head = NULL;
+ /* Decrement refcount again if newpage was not used */
+ put_page(newpage);
+}
+
static int shmem_recovery_populate(struct recovery *recovery, struct page *head)
{
- /* Huge page has been split but is not yet PageTeam */
- shmem_disband_hugetails(head, NULL, 0);
- return -ENOENT;
+ LIST_HEAD(migrate);
+ struct address_space *mapping = recovery->inode->i_mapping;
+ gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_NORETRY;
+ struct zone *zone = page_zone(head);
+ pgoff_t index;
+ bool drained_all = false;
+ bool account_head = false;
+ int migratable;
+ int unmigratable;
+ struct page *team;
+ struct page *endteam = head + HPAGE_PMD_NR;
+ struct page *page;
+ int error = 0;
+ int nr;
+
+ /* Warning: this optimization relies on disband's ClearPageChecked */
+ if (PageTeam(head) && PageChecked(head))
+ return 0;
+again:
+ migratable = 0;
+ unmigratable = 0;
+ index = recovery->head_index;
+ for (team = head; team < endteam && !error; index++, team++) {
+ if (PageTeam(team) && PageUptodate(team) && PageDirty(team) &&
+ !account_head)
+ continue;
+
+ error = shmem_getpage_gfp(recovery->inode, index, &page,
+ SGP_TEAM, gfp, recovery->mm, NULL);
+ if (error)
+ break;
+
+ VM_BUG_ON_PAGE(!PageUptodate(page), page);
+ VM_BUG_ON_PAGE(PageSwapCache(page), page);
+ if (!PageDirty(page))
+ SetPageDirty(page);
+
+ if (PageTeam(page) && PageTeam(team_head(page))) {
+ /*
+ * The page's old team might be being disbanded, but its
+ * PageTeam not yet cleared, hence the head check above.
+ *
+ * We used to have VM_BUG_ON(page != team) here, and
+ * never hit it; but I cannot see what else excludes
+ * the race of two teams being built for the same area
+ * (one through faulting and another through recovery).
+ */
+ if (page != team)
+ error = -ENOENT;
+ if (error || !account_head)
+ goto unlock;
+ }
+
+ if (PageSwapBacked(team) && page != team) {
+ /*
+ * Team page was prepared, yet shmem_getpage_gfp() has
+ * given us a different page: that implies that this
+ * offset was truncated or hole-punched meanwhile, so we
+ * might as well give up now. It might or might not be
+ * still PageSwapCache. We must not go on to use the
+ * team page while it's swap - its swap entry, where
+ * team_usage should be, causes crashes and softlockups
+ * when disbanding. And even if it has been removed
+ * from swapcache, it is on (or temporarily off) LRU,
+ * which crashes putback_lru_page() if we migrate to it.
+ */
+ error = -ENOENT;
+ goto unlock;
+ }
+
+ if (!recovery->exposed_team) {
+ VM_BUG_ON(team != head);
+ recovery->exposed_team = true;
+ atomic_long_set(&head->team_usage, TEAM_LRU_WEIGHT_ONE);
+ SetPageTeam(head);
+ head->mapping = mapping;
+ head->index = index;
+ if (page == head)
+ account_head = true;
+ }
+
+ /* Eviction or truncation or hole-punch already disbanded? */
+ if (!PageTeam(head)) {
+ error = -ENOENT;
+ goto unlock;
+ }
+
+ if (page == team) {
+ VM_BUG_ON_PAGE(!PageSwapBacked(page), page);
+ /*
+ * A task may have already mapped this page, before
+ * we set PageTeam: so now we would need to add it
+ * into head's team_pte_mapped count. But it might
+ * get unmapped while we do this: so artificially
+ * bump the mapcount here, then use page_remove_rmap
+ * below to get all the counts right. Luckily our page
+ * lock forbids it from transitioning from unmapped to
+ * mapped while we do so: that would be more difficult.
+ * Preemption disabled to suit zone_page_state updates.
+ */
+ if (page_mapped(page)) {
+ preempt_disable();
+ page_add_file_rmap(page);
+ }
+ spin_lock_irq(&mapping->tree_lock);
+ if (PageTeam(head)) {
+ if (page != head) {
+ atomic_long_set(&page->team_usage,
+ TEAM_LRU_WEIGHT_ONE);
+ SetPageTeam(page);
+ }
+ if (page != head || account_head) {
+ shmem_added_to_hugeteam(page, zone,
+ NULL);
+ put_page(page);
+ shr_stats(page_teamed);
+ }
+ }
+ spin_unlock_irq(&mapping->tree_lock);
+ if (page_mapped(page)) {
+ inc_team_pte_mapped(page);
+ page_remove_rmap(page, false);
+ preempt_enable();
+ }
+ account_head = false;
+ } else {
+ VM_BUG_ON(account_head);
+ if (!PageLRU(page))
+ lru_add_drain();
+ if (isolate_lru_page(page) == 0) {
+ inc_zone_page_state(page, NR_ISOLATED_ANON);
+ list_add_tail(&page->lru, &migrate);
+ shr_stats(page_migrate);
+ migratable++;
+ } else {
+ shr_stats(page_off_lru);
+ unmigratable++;
+ }
+ }
+unlock:
+ unlock_page(page);
+ put_page(page);
+ cond_resched();
+ }
+
+ if (!list_empty(&migrate)) {
+ lru_add_drain(); /* not necessary but may help debugging */
+ if (!error) {
+ VM_BUG_ON(recovery->page != head);
+ recovery->migrated_head = NULL;
+ nr = migrate_pages(&migrate, shmem_get_recovery_page,
+ shmem_put_recovery_page, (unsigned long)
+ recovery, MIGRATE_SYNC, MR_SHMEM_RECOVERY);
+ account_head = !!recovery->migrated_head;
+ if (nr < 0) {
+ /*
+ * If migrate_pages() returned error (-ENOMEM)
+ * instead of number of pages failed, we don't
+ * know how many failed; but it's irrelevant,
+ * the team should be disbanded now anyway.
+ * Increment page_unmigrated? No, we would not
+ * if the error were found during the main loop.
+ */
+ error = -ENOENT;
+ }
+ if (nr > 0) {
+ shr_stats_add(page_unmigrated, nr);
+ unmigratable += nr;
+ migratable -= nr;
+ }
+ }
+ putback_movable_pages(&migrate);
+ lru_add_drain(); /* not necessary but may help debugging */
+ }
+
+ /*
+ * migrate_pages() is prepared to make ten tries on each page,
+ * but the preparatory isolate_lru_page() can too easily fail;
+ * and we refrained from the IPIs of draining all CPUs before.
+ */
+ if (!error) {
+ if (unmigratable && !drained_all) {
+ drained_all = true;
+ lru_add_drain_all();
+ shr_stats(recov_retried);
+ goto again;
+ }
+ if (migratable) {
+ /* Make another pass to SetPageTeam on them */
+ goto again;
+ }
+ }
+
+ lock_page(head);
+ nr = HPAGE_PMD_NR;
+ if (!recovery->exposed_team) {
+ /* Failed before even setting team head */
+ VM_BUG_ON(!error);
+ shmem_disband_hugetails(head, NULL, 0);
+ } else if (PageTeam(head)) {
+ if (!error) {
+ nr = shmem_freeholes(head);
+ if (nr == HPAGE_PMD_NR) {
+ /* We made no progress so not worth resuming */
+ error = -ENOENT;
+ }
+ }
+ if (error) {
+ /* Unsafe to let shrinker back in on this team */
+ shmem_disband_hugeteam(head);
+ }
+ } else if (!error) {
+ /* A concurrent actor took over our team and disbanded it */
+ error = -ENOENT;
+ }
+
+ if (error) {
+ shr_stats(recov_failed);
+ } else if (!nr) {
+ /* Team is complete and ready for pmd mapping */
+ SetPageChecked(head);
+ shr_stats(recov_completed);
+ } else {
+ struct shmem_inode_info *info = SHMEM_I(recovery->inode);
+ /*
+ * All swapcache has been transferred to pagecache, but not
+ * all migrations succeeded, so holes remain to be filled.
+ * Allow shrinker to take these holes; but also tell later
+ * recovery attempts where the huge page is, so migration to it
+ * is resumed, so long as reclaim and shrinker did not disband.
+ */
+ for (team = head;; team++) {
+ VM_BUG_ON(team >= endteam);
+ if (PageSwapBacked(team)) {
+ VM_BUG_ON(!PageTeam(team));
+ spin_lock_irq(&mapping->tree_lock);
+ radix_tree_tag_set(&mapping->page_tree,
+ team->index, SHMEM_TAG_HUGEHOLE);
+ spin_unlock_irq(&mapping->tree_lock);
+ break;
+ }
+ }
+ if (list_empty(&info->shrinklist)) {
+ spin_lock(&shmem_shrinklist_lock);
+ if (list_empty(&info->shrinklist)) {
+ list_add_tail(&info->shrinklist,
+ &shmem_shrinklist);
+ shmem_shrinklist_depth++;
+ }
+ spin_unlock(&shmem_shrinklist_lock);
+ }
+ shr_stats(recov_partial);
+ error = -EAGAIN;
+ }
+ unlock_page(head);
+ return error;
}

static void shmem_recovery_remap(struct recovery *recovery, struct page *head)
@@ -841,6 +1146,7 @@ static void shmem_recovery_work(struct w

if (head) {
/* We are resuming work from a previous partial recovery */
+ recovery->exposed_team = true;
if (PageTeam(page))
shr_stats(resume_teamed);
else
@@ -867,6 +1173,7 @@ static void shmem_recovery_work(struct w
split_page(head, HPAGE_PMD_ORDER);
get_page(head);
shr_stats(huge_alloced);
+ recovery->exposed_team = false;
}

put_page(page); /* before trying to migrate it */
@@ -2120,9 +2427,11 @@ static struct page *shmem_alloc_page(gfp
* add_to_page_cache has the tree_lock.
*/
lock_page(page);
- if (PageSwapBacked(page) || !PageTeam(head))
- *hugehint = SHMEM_RETRY_HUGE_PAGE;
- goto out;
+ if (!PageSwapBacked(page) && PageTeam(head))
+ goto out;
+ unlock_page(page);
+ put_page(page);
+ *hugehint = SHMEM_ALLOC_SMALL_PAGE;
}
}

@@ -2390,10 +2699,6 @@ repeat:
error = -ENOMEM;
goto decused;
}
- if (hugehint == SHMEM_RETRY_HUGE_PAGE) {
- error = -EEXIST;
- goto decused;
- }
if (sgp == SGP_WRITE)
__SetPageReferenced(page);


2016-04-05 21:56:29

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 25/31] huge tmpfs recovery: shmem_recovery_remap & remap_team_by_pmd

And once we have a fully populated huge page, replace the pte mappings
(by now already pointing into this huge page, as page migration has
arranged) by a huge pmd mapping - not just in the mm which prompted
this work, but in any other mm which might benefit from it.

However, the transition from pte mappings to huge pmd mapping is a
new one, which may surprise code elsewhere - pte_offset_map() and
pte_offset_map_lock() in particular. See the earlier discussion in
"huge tmpfs: avoid premature exposure of new pagetable", but now we
are forced to go beyond its solution.

The answer will be to put *pmd checking inside them, and examine
whether a pagetable page could ever be recycled for another purpose
before the pte lock is taken: the deposit/withdraw protocol, and
mmap_sem conventions, work nicely against that danger; but special
attention will have to be paid to MADV_DONTNEED's zap_huge_pmd()
pte_free under down_read of mmap_sem.
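
To make that intention concrete, here is one hypothetical shape of such a
check (not in this series; the wrapper name is invented, and a real version
would have to recheck or retry under the pte lock):

#include <linux/mm.h>

static pte_t *pte_offset_map_lock_checked(struct mm_struct *mm, pmd_t *pmd,
					  unsigned long addr, spinlock_t **ptlp)
{
	pmd_t pmdval = *pmd;

	/* Back out if a huge pmd (or nothing) now sits where a page table was */
	if (pmd_none(pmdval) || pmd_trans_huge(pmdval))
		return NULL;
	return pte_offset_map_lock(mm, pmd, addr, ptlp);
}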

Avoid those complications for now: just use a rather unwelcome
down_write or down_write_trylock of mmap_sem here in
shmem_recovery_remap(), to exclude m* syscalls or faults or ptrace or
GUP or NUMA work or /proc access. rmap access is already excluded
by our holding i_mmap_rwsem. Fast GUP on x86 is made safe by the
TLB flush in remap_team_by_pmd()'s pmdp_collapse_flush(), its IPIs
as usual blocked by fast GUP's local_irq_disable(). Fast GUP on
powerpc is made safe as usual by its RCU freeing of page tables
(though zap_huge_pmd()'s pte_free appears to violate that, but
if so it's an issue for anon THP too: investigate further later).

Does remap_team_by_pmd() really need its mmu_notifier_invalidate_range
pair? The manner of mapping changes, but nothing is actually unmapped.
Of course, the same question can be asked of remap_team_by_ptes().

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/pageteam.h | 2
mm/huge_memory.c | 87 +++++++++++++++++++++++++++++++++++++
mm/shmem.c | 76 ++++++++++++++++++++++++++++++++
3 files changed, 165 insertions(+)

--- a/include/linux/pageteam.h
+++ b/include/linux/pageteam.h
@@ -313,6 +313,8 @@ void unmap_team_by_pmd(struct vm_area_st
unsigned long addr, pmd_t *pmd, struct page *page);
void remap_team_by_ptes(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd);
+void remap_team_by_pmd(struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmd, struct page *page);
#else
static inline int map_team_by_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd, struct page *page)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3706,3 +3706,90 @@ raced:
spin_unlock(pml);
mmu_notifier_invalidate_range_end(mm, addr, end);
}
+
+void remap_team_by_pmd(struct vm_area_struct *vma, unsigned long addr,
+ pmd_t *pmd, struct page *head)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct page *page = head;
+ pgtable_t pgtable;
+ unsigned long end;
+ spinlock_t *pml;
+ spinlock_t *ptl;
+ pmd_t pmdval;
+ pte_t *pte;
+ int rss = 0;
+
+ VM_BUG_ON_PAGE(!PageTeam(head), head);
+ VM_BUG_ON_PAGE(!PageLocked(head), head);
+ VM_BUG_ON(addr & ~HPAGE_PMD_MASK);
+ end = addr + HPAGE_PMD_SIZE;
+
+ mmu_notifier_invalidate_range_start(mm, addr, end);
+ pml = pmd_lock(mm, pmd);
+ pmdval = *pmd;
+ /* I don't see how this can happen now, but be defensive */
+ if (pmd_trans_huge(pmdval) || pmd_none(pmdval))
+ goto out;
+
+ ptl = pte_lockptr(mm, pmd);
+ if (ptl != pml)
+ spin_lock(ptl);
+
+ pgtable = pmd_pgtable(pmdval);
+ pmdval = mk_pmd(head, vma->vm_page_prot);
+ pmdval = pmd_mkhuge(pmd_mkdirty(pmdval));
+
+ /* Perhaps wise to mark head as mapped before removing pte rmaps */
+ page_add_file_rmap(head);
+
+ /*
+ * Just as remap_team_by_ptes() would prefer to fill the page table
+ * earlier, remap_team_by_pmd() would prefer to empty it later; but
+ * ppc64's variant of the deposit/withdraw protocol prevents that.
+ */
+ pte = pte_offset_map(pmd, addr);
+ do {
+ if (pte_none(*pte))
+ continue;
+
+ VM_BUG_ON(!pte_present(*pte));
+ VM_BUG_ON(pte_page(*pte) != page);
+
+ pte_clear(mm, addr, pte);
+ page_remove_rmap(page, false);
+ put_page(page);
+ rss++;
+ } while (pte++, page++, addr += PAGE_SIZE, addr != end);
+
+ pte -= HPAGE_PMD_NR;
+ addr -= HPAGE_PMD_SIZE;
+
+ if (rss) {
+ pmdp_collapse_flush(vma, addr, pmd);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, addr, pmd, pmdval);
+ update_mmu_cache_pmd(vma, addr, pmd);
+ get_page(head);
+ page_add_team_rmap(head);
+ add_mm_counter(mm, MM_SHMEMPAGES, HPAGE_PMD_NR - rss);
+ } else {
+ /*
+ * Hmm. We might have caught this vma in between unmap_vmas()
+ * and free_pgtables(), which is a surprising time to insert a
+ * huge page. Before our caller checked mm_users, I sometimes
+ * saw a "bad pmd" report, and pgtable_pmd_page_dtor() BUG on
+ * pmd_huge_pte, when killing off tests. But checking mm_users
+ * is not enough to protect against munmap(): so for safety,
+ * back out if we found no ptes to replace.
+ */
+ page_remove_rmap(head, false);
+ }
+
+ if (ptl != pml)
+ spin_unlock(ptl);
+ pte_unmap(pte);
+out:
+ spin_unlock(pml);
+ mmu_notifier_invalidate_range_end(mm, addr, end);
+}
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1097,6 +1097,82 @@ unlock:

static void shmem_recovery_remap(struct recovery *recovery, struct page *head)
{
+ struct mm_struct *mm = recovery->mm;
+ struct address_space *mapping = head->mapping;
+ pgoff_t pgoff = head->index;
+ struct vm_area_struct *vma;
+ unsigned long addr;
+ pmd_t *pmd;
+ bool try_other_mms = false;
+
+ /*
+ * XXX: This use of mmap_sem is regrettable. It is needed for one
+ * reason only: because callers of pte_offset_map(_lock)() are not
+ * prepared for a huge pmd to appear in place of a page table at any
+ * instant. That can be fixed in pte_offset_map(_lock)() and callers,
+ * but that is a more invasive change, so just do it this way for now.
+ */
+ down_write(&mm->mmap_sem);
+ lock_page(head);
+ if (!PageTeam(head)) {
+ unlock_page(head);
+ up_write(&mm->mmap_sem);
+ return;
+ }
+ VM_BUG_ON_PAGE(!PageChecked(head), head);
+ i_mmap_lock_write(mapping);
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ /* XXX: Use anon_vma as over-strict hint of COWed pages */
+ if (vma->anon_vma)
+ continue;
+ addr = vma_address(head, vma);
+ if (addr & (HPAGE_PMD_SIZE-1))
+ continue;
+ if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+ continue;
+ if (!atomic_read(&vma->vm_mm->mm_users))
+ continue;
+ if (vma->vm_mm != mm) {
+ try_other_mms = true;
+ continue;
+ }
+ /* Only replace existing ptes: empty pmd can fault for itself */
+ pmd = mm_find_pmd(vma->vm_mm, addr);
+ if (!pmd)
+ continue;
+ remap_team_by_pmd(vma, addr, pmd, head);
+ shr_stats(remap_faulter);
+ }
+ up_write(&mm->mmap_sem);
+ if (!try_other_mms)
+ goto out;
+ vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
+ if (vma->vm_mm == mm)
+ continue;
+ /* XXX: Use anon_vma as over-strict hint of COWed pages */
+ if (vma->anon_vma)
+ continue;
+ addr = vma_address(head, vma);
+ if (addr & (HPAGE_PMD_SIZE-1))
+ continue;
+ if (vma->vm_end < addr + HPAGE_PMD_SIZE)
+ continue;
+ if (!atomic_read(&vma->vm_mm->mm_users))
+ continue;
+ /* Only replace existing ptes: empty pmd can fault for itself */
+ pmd = mm_find_pmd(vma->vm_mm, addr);
+ if (!pmd)
+ continue;
+ if (down_write_trylock(&vma->vm_mm->mmap_sem)) {
+ remap_team_by_pmd(vma, addr, pmd, head);
+ shr_stats(remap_another);
+ up_write(&vma->vm_mm->mmap_sem);
+ } else
+ shr_stats(remap_untried);
+ }
+out:
+ i_mmap_unlock_write(mapping);
+ unlock_page(head);
}

static void shmem_recovery_work(struct work_struct *work)

2016-04-05 21:58:57

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 26/31] huge tmpfs recovery: shmem_recovery_swapin to read from swap

If pages of the extent are out on swap, we would much prefer to read
them into their final locations on the assigned huge page, rather than
have swapin_readahead() adding unrelated pages, and __read_swap_cache_async()
allocating intermediate pages, from which we would then have to migrate
(though some may well be already in swapcache, and then need migration).

And we'd like to get all the swap I/O underway at the start, then wait
for it (probably at a single page lock) in the main population loop:
which can then forget about swap, leaving shmem_getpage_gfp() to handle
the transitions from swapcache to pagecache.

shmem_recovery_swapin() is very much based on __read_swap_cache_async(),
but the things it needs to worry about are not always the same: it does
not matter if __read_swap_cache_async() occasionally reads an unrelated
page which has inherited a freed swap block; but shmem_recovery_swapin()
had better not place such a page inside the huge page it is helping to build.

Ifdef CONFIG_SWAP around it and its shmem_next_swap() helper because a
couple of functions it calls are undeclared without CONFIG_SWAP.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 101 insertions(+)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -804,6 +804,105 @@ static bool shmem_work_still_useful(stru
!RB_EMPTY_ROOT(&mapping->i_mmap); /* file is still mapped */
}

+#ifdef CONFIG_SWAP
+static void *shmem_next_swap(struct address_space *mapping,
+ pgoff_t *index, pgoff_t end)
+{
+ pgoff_t start = *index + 1;
+ struct radix_tree_iter iter;
+ void **slot;
+ void *radswap;
+
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
+ if (iter.index >= end)
+ break;
+ radswap = radix_tree_deref_slot(slot);
+ if (radix_tree_exception(radswap)) {
+ if (radix_tree_deref_retry(radswap))
+ goto restart;
+ goto out;
+ }
+ }
+ radswap = NULL;
+out:
+ rcu_read_unlock();
+ *index = iter.index;
+ return radswap;
+}
+
+static void shmem_recovery_swapin(struct recovery *recovery, struct page *head)
+{
+ struct shmem_inode_info *info = SHMEM_I(recovery->inode);
+ struct address_space *mapping = recovery->inode->i_mapping;
+ pgoff_t index = recovery->head_index - 1;
+ pgoff_t end = recovery->head_index + HPAGE_PMD_NR;
+ struct blk_plug plug;
+ void *radswap;
+ int error;
+
+ /*
+ * If the file has nothing swapped out, don't waste time here.
+ * If the team has already been exposed by an earlier attempt,
+ * it is not safe to pursue this optimization again - truncation
+ * *might* let swapin I/O overlap with fresh use of the page.
+ */
+ if (!info->swapped || recovery->exposed_team)
+ return;
+
+ blk_start_plug(&plug);
+ while ((radswap = shmem_next_swap(mapping, &index, end))) {
+ swp_entry_t swap = radix_to_swp_entry(radswap);
+ struct page *page = head + (index & (HPAGE_PMD_NR-1));
+
+ /*
+ * Code below is adapted from __read_swap_cache_async():
+ * we want to set up async swapin to the right pages.
+ * We don't have to worry about a more limiting gfp_mask
+ * leading to -ENOMEM from __add_to_swap_cache(), but we
+ * do have to worry about swapcache_prepare() succeeding
+ * when swap has been freed and reused for an unrelated page.
+ */
+ shr_stats(swap_entry);
+ error = radix_tree_preload(GFP_KERNEL);
+ if (error)
+ break;
+
+ error = swapcache_prepare(swap);
+ if (error) {
+ radix_tree_preload_end();
+ shr_stats(swap_cached);
+ continue;
+ }
+
+ if (!shmem_confirm_swap(mapping, index, swap)) {
+ radix_tree_preload_end();
+ swapcache_free(swap);
+ shr_stats(swap_gone);
+ continue;
+ }
+
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+ error = __add_to_swap_cache(page, swap);
+ radix_tree_preload_end();
+ VM_BUG_ON(error);
+
+ shr_stats(swap_read);
+ lru_cache_add_anon(page);
+ swap_readpage(page);
+ cond_resched();
+ }
+ blk_finish_plug(&plug);
+ lru_add_drain(); /* not necessary but may help debugging */
+}
+#else
+static void shmem_recovery_swapin(struct recovery *recovery, struct page *head)
+{
+}
+#endif /* CONFIG_SWAP */
+
static struct page *shmem_get_recovery_page(struct page *page,
unsigned long private, int **result)
{
@@ -855,6 +954,8 @@ static int shmem_recovery_populate(struc
/* Warning: this optimization relies on disband's ClearPageChecked */
if (PageTeam(head) && PageChecked(head))
return 0;
+
+ shmem_recovery_swapin(recovery, head);
again:
migratable = 0;
unmigratable = 0;

2016-04-05 22:00:32

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 27/31] huge tmpfs recovery: tweak shmem_getpage_gfp to fill team

shmem_recovery_swapin() took the trouble to arrange for pages to be
swapped in to their final destinations without needing page migration.
It's daft not to do the same for pages being newly instantiated (when
a huge page has been allocated after transient fragmentation, too late
to satisfy the initial fault).

Let SGP_TEAM convey the intended destination down to shmem_getpage_gfp().
And make sure that SGP_TEAM cannot instantiate pages beyond the last
huge page: although shmem_recovery_populate() has a PageTeam check
against truncation, that's insufficient, and only shmem_getpage_gfp()
knows what adjustments to make when we have allocated too far.
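
For example (assuming 4kB pages and 2MB huge pages), with i_size at 3MB:
an SGP_TEAM request for index 770 computes offset 770 << PAGE_SHIFT =
0x302000, which lies beyond EOF, but rounds it down to the 2MB boundary
0x200000, below i_size, so the page may still be instantiated to complete
that team; a request for index 1024 rounds to 0x400000, beyond i_size,
and fails with -EINVAL, so no team is built beyond the last huge page.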

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 56 ++++++++++++++++++++++++++++++++++++++++++++-------
1 file changed, 49 insertions(+), 7 deletions(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -464,6 +464,7 @@ static int shmem_populate_hugeteam(struc
/* Mark all pages dirty even when map is readonly, for now */
if (PageUptodate(head + i) && PageDirty(head + i))
continue;
+ page = NULL;
error = shmem_getpage_gfp(inode, index, &page, SGP_TEAM,
gfp, vma->vm_mm, NULL);
if (error)
@@ -965,6 +966,7 @@ again:
!account_head)
continue;

+ page = team; /* used as hint if not yet instantiated */
error = shmem_getpage_gfp(recovery->inode, index, &page,
SGP_TEAM, gfp, recovery->mm, NULL);
if (error)
@@ -2708,6 +2710,7 @@ static int shmem_replace_page(struct pag

/*
* shmem_getpage_gfp - find page in cache, or get from swap, or allocate
+ * (or use page indicated by shmem_recovery_populate)
*
* If we allocate a new one we do not mark it dirty. That's up to the
* vm. If we swap it in we mark it dirty since we also free the swap
@@ -2727,14 +2730,20 @@ static int shmem_getpage_gfp(struct inod
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
+ loff_t offset;
int error;
int once = 0;
- int alloced = 0;
+ bool alloced = false;
+ bool exposed_swapbacked = false;
struct page *hugehint;
struct page *alloced_huge = NULL;

if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
+
+ offset = (loff_t)index << PAGE_SHIFT;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && sgp == SGP_TEAM)
+ offset &= ~((loff_t)HPAGE_PMD_SIZE-1);
repeat:
swap.val = 0;
page = find_lock_entry(mapping, index);
@@ -2743,8 +2752,7 @@ repeat:
page = NULL;
}

- if (sgp <= SGP_CACHE &&
- ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
+ if (sgp <= SGP_TEAM && offset >= i_size_read(inode)) {
error = -EINVAL;
goto unlock;
}
@@ -2863,8 +2871,34 @@ repeat:
percpu_counter_inc(&sbinfo->used_blocks);
}

- /* Take huge hint from super, except for shmem_symlink() */
hugehint = NULL;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ sgp == SGP_TEAM && *pagep) {
+ struct page *head;
+
+ if (!get_page_unless_zero(*pagep)) {
+ error = -ENOENT;
+ goto decused;
+ }
+ page = *pagep;
+ lock_page(page);
+ head = page - (index & (HPAGE_PMD_NR-1));
+ if (!PageTeam(head)) {
+ error = -ENOENT;
+ goto decused;
+ }
+ if (PageSwapBacked(page)) {
+ shr_stats(page_raced);
+ /* maybe already created; or swapin truncated */
+ error = page->mapping ? -EEXIST : -ENOENT;
+ goto decused;
+ }
+ SetPageSwapBacked(page);
+ exposed_swapbacked = true;
+ goto memcg;
+ }
+
+ /* Take huge hint from super, except for shmem_symlink() */
if (mapping->a_ops == &shmem_aops &&
(shmem_huge == SHMEM_HUGE_FORCE ||
(sbinfo->huge && shmem_huge != SHMEM_HUGE_DENY)))
@@ -2878,7 +2912,7 @@ repeat:
}
if (sgp == SGP_WRITE)
__SetPageReferenced(page);
-
+memcg:
error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
false);
if (error)
@@ -2894,6 +2928,11 @@ repeat:
goto decused;
}
mem_cgroup_commit_charge(page, memcg, false, false);
+ if (exposed_swapbacked) {
+ shr_stats(page_created);
+ /* cannot clear swapbacked once sent to lru */
+ exposed_swapbacked = false;
+ }
lru_cache_add_anon(page);

spin_lock(&info->lock);
@@ -2937,8 +2976,7 @@ clear:
}

/* Perhaps the file has been truncated since we checked */
- if (sgp <= SGP_CACHE &&
- ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
+ if (sgp <= SGP_TEAM && offset >= i_size_read(inode)) {
if (alloced && !PageTeam(page)) {
ClearPageDirty(page);
delete_from_page_cache(page);
@@ -2966,6 +3004,10 @@ failed:
error = -EEXIST;
unlock:
if (page) {
+ if (exposed_swapbacked) {
+ ClearPageSwapBacked(page);
+ exposed_swapbacked = false;
+ }
unlock_page(page);
put_page(page);
}

2016-04-05 22:02:16

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 28/31] huge tmpfs recovery: debugfs stats to complete this phase

Implement the shr_stats(name) macro that has been inserted all over, to
make the success of recovery visible in debugfs. After a little testing,
"cd /sys/kernel/debug/shmem_huge_recovery; grep . *" showed:

huge_alloced:15872
huge_failed:0
huge_too_late:0
page_created:0
page_migrate:1298014
page_off_lru:300
page_raced:0
page_teamed:6831300
page_unmigrated:3243
recov_completed:15484
recov_failed:0
recov_partial:696
recov_retried:2463
remap_another:0
remap_faulter:15484
remap_untried:0
resume_tagged:279
resume_teamed:68
swap_cached:699229
swap_entry:7530549
swap_gone:20
swap_read:6831300
work_already:43218374
work_queued:16221
work_too_late:2
work_too_many:0

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 2
mm/shmem.c | 91 ++++++++++++++++++++++++--
2 files changed, 88 insertions(+), 5 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -224,6 +224,8 @@ shmem_pmdmapped 12582912 bytes tmpfs h
Note: the individual pages of a huge team might be charged to different
memcgs, but these counts assume that they are all charged to the same as head.

+/sys/kernel/debug/shmem_huge_recovery: recovery stats to assist development.
+
Author:
Christoph Rohland <[email protected]>, 1.12.01
Updated:
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -6,8 +6,8 @@
* 2000-2001 Christoph Rohland
* 2000-2001 SAP AG
* 2002 Red Hat Inc.
- * Copyright (C) 2002-2011 Hugh Dickins.
- * Copyright (C) 2011 Google Inc.
+ * Copyright (C) 2002-2016 Hugh Dickins.
+ * Copyright (C) 2011-2016 Google Inc.
* Copyright (C) 2002-2005 VERITAS Software Corporation.
* Copyright (C) 2004 Andi Kleen, SuSE Labs
*
@@ -788,9 +788,90 @@ struct recovery {
bool exposed_team;
};

-#define shr_stats(x) do {} while (0)
-#define shr_stats_add(x, n) do {} while (0)
-/* Stats implemented in a later patch */
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+
+static struct dentry *shr_debugfs_root;
+static struct {
+ /*
+ * Just stats: no need to use atomics; and although many of these
+ * u32s can soon overflow, debugging doesn't need them to be u64s.
+ */
+ u32 huge_alloced;
+ u32 huge_failed;
+ u32 huge_too_late;
+ u32 page_created;
+ u32 page_migrate;
+ u32 page_off_lru;
+ u32 page_raced;
+ u32 page_teamed;
+ u32 page_unmigrated;
+ u32 recov_completed;
+ u32 recov_failed;
+ u32 recov_partial;
+ u32 recov_retried;
+ u32 remap_another;
+ u32 remap_faulter;
+ u32 remap_untried;
+ u32 resume_tagged;
+ u32 resume_teamed;
+ u32 swap_cached;
+ u32 swap_entry;
+ u32 swap_gone;
+ u32 swap_read;
+ u32 work_already;
+ u32 work_queued;
+ u32 work_too_late;
+ u32 work_too_many;
+} shmem_huge_recovery_stats;
+
+#define shr_create(x) debugfs_create_u32(#x, S_IRUGO, shr_debugfs_root, \
+ &shmem_huge_recovery_stats.x)
+static int __init shmem_debugfs_init(void)
+{
+ if (!debugfs_initialized())
+ return -ENODEV;
+ shr_debugfs_root = debugfs_create_dir("shmem_huge_recovery", NULL);
+ if (!shr_debugfs_root)
+ return -ENOMEM;
+
+ shr_create(huge_alloced);
+ shr_create(huge_failed);
+ shr_create(huge_too_late);
+ shr_create(page_created);
+ shr_create(page_migrate);
+ shr_create(page_off_lru);
+ shr_create(page_raced);
+ shr_create(page_teamed);
+ shr_create(page_unmigrated);
+ shr_create(recov_completed);
+ shr_create(recov_failed);
+ shr_create(recov_partial);
+ shr_create(recov_retried);
+ shr_create(remap_another);
+ shr_create(remap_faulter);
+ shr_create(remap_untried);
+ shr_create(resume_tagged);
+ shr_create(resume_teamed);
+ shr_create(swap_cached);
+ shr_create(swap_entry);
+ shr_create(swap_gone);
+ shr_create(swap_read);
+ shr_create(work_already);
+ shr_create(work_queued);
+ shr_create(work_too_late);
+ shr_create(work_too_many);
+ return 0;
+}
+fs_initcall(shmem_debugfs_init);
+
+#undef shr_create
+#define shr_stats(x) (shmem_huge_recovery_stats.x++)
+#define shr_stats_add(x, n) (shmem_huge_recovery_stats.x += n)
+#else
+#define shr_stats(x) do {} while (0)
+#define shr_stats_add(x, n) do {} while (0)
+#endif /* CONFIG_DEBUG_FS */

static bool shmem_work_still_useful(struct recovery *recovery)
{

2016-04-05 22:03:59

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 29/31] huge tmpfs recovery: page migration call back into shmem

What we have works, but involves tricky "account_head" handling, and more
trips around the shmem_recovery_populate() loop than I'm comfortable with.

Tighten it all up with a MIGRATE_SHMEM_RECOVERY mode, and a
shmem_recovery_migrate_page() callout from migrate_page_move_mapping(),
so that the migrated page can be made PageTeam immediately.

Which allows the SHMEM_RETRY_HUGE_PAGE hugehint to be reintroduced,
for what little that's worth.
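
For orientation (an annotation, not text from the patch): shmem's
->migratepage is the generic migrate_page(), so the chain from recovery
down to the new callout looks like this:

shmem_recovery_populate()
  migrate_pages(..., MIGRATE_SHMEM_RECOVERY, MR_SHMEM_RECOVERY)
    unmap_and_move() -> __unmap_and_move() -> move_to_new_page()
      migrate_page()                        (shmem_aops.migratepage)
        migrate_page_move_mapping()
          shmem_recovery_migrate_page()     (SetPageTeam on newpage, or
                                             refuse if the team is gone)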

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/migrate_mode.h | 2
include/linux/shmem_fs.h | 6 +
include/trace/events/migrate.h | 3
mm/migrate.c | 17 ++++-
mm/shmem.c | 99 ++++++++++++-------------------
5 files changed, 62 insertions(+), 65 deletions(-)

--- a/include/linux/migrate_mode.h
+++ b/include/linux/migrate_mode.h
@@ -6,11 +6,13 @@
* on most operations but not ->writepage as the potential stall time
* is too significant
* MIGRATE_SYNC will block when migrating pages
+ * MIGRATE_SHMEM_RECOVERY is a MIGRATE_SYNC specific to huge tmpfs recovery.
*/
enum migrate_mode {
MIGRATE_ASYNC,
MIGRATE_SYNC_LIGHT,
MIGRATE_SYNC,
+ MIGRATE_SHMEM_RECOVERY,
};

#endif /* MIGRATE_MODE_H_INCLUDED */
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -85,6 +85,7 @@ static inline long shmem_fcntl(struct fi
#endif /* CONFIG_TMPFS */

#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_SHMEM)
+extern bool shmem_recovery_migrate_page(struct page *new, struct page *page);
# ifdef CONFIG_SYSCTL
struct ctl_table;
extern int shmem_huge, shmem_huge_min, shmem_huge_max;
@@ -92,6 +93,11 @@ extern int shmem_huge_recoveries;
extern int shmem_huge_sysctl(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
# endif /* CONFIG_SYSCTL */
+#else
+static inline bool shmem_recovery_migrate_page(struct page *new, struct page *p)
+{
+ return true; /* Never called: true will optimize out the fallback */
+}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE && CONFIG_SHMEM */

#endif
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -9,7 +9,8 @@
#define MIGRATE_MODE \
EM( MIGRATE_ASYNC, "MIGRATE_ASYNC") \
EM( MIGRATE_SYNC_LIGHT, "MIGRATE_SYNC_LIGHT") \
- EMe(MIGRATE_SYNC, "MIGRATE_SYNC")
+ EM( MIGRATE_SYNC, "MIGRATE_SYNC") \
+ EMe(MIGRATE_SHMEM_RECOVERY, "MIGRATE_SHMEM_RECOVERY")


#define MIGRATE_REASON \
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -23,6 +23,7 @@
#include <linux/pagevec.h>
#include <linux/ksm.h>
#include <linux/rmap.h>
+#include <linux/shmem_fs.h>
#include <linux/topology.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
@@ -371,6 +372,15 @@ int migrate_page_move_mapping(struct add
return -EAGAIN;
}

+ if (mode == MIGRATE_SHMEM_RECOVERY) {
+ if (!shmem_recovery_migrate_page(newpage, page)) {
+ page_ref_unfreeze(page, expected_count);
+ spin_unlock_irq(&mapping->tree_lock);
+ return -ENOMEM; /* quit migrate_pages() immediately */
+ }
+ } else
+ get_page(newpage); /* add cache reference */
+
/*
* Now we know that no one else is looking at the page:
* no turning back from here.
@@ -380,7 +390,6 @@ int migrate_page_move_mapping(struct add
if (PageSwapBacked(page))
__SetPageSwapBacked(newpage);

- get_page(newpage); /* add cache reference */
if (PageSwapCache(page)) {
SetPageSwapCache(newpage);
set_page_private(newpage, page_private(page));
@@ -786,7 +795,7 @@ static int move_to_new_page(struct page
}

static int __unmap_and_move(struct page *page, struct page *newpage,
- int force, enum migrate_mode mode, enum migrate_reason reason)
+ int force, enum migrate_mode mode)
{
int rc = -EAGAIN;
int page_was_mapped = 0;
@@ -821,7 +830,7 @@ static int __unmap_and_move(struct page
* already in use, on lru, with data newly written for that offset.
* We can only be sure of this check once we have the page locked.
*/
- if (reason == MR_SHMEM_RECOVERY && !page->mapping) {
+ if (mode == MIGRATE_SHMEM_RECOVERY && !page->mapping) {
rc = -ENOMEM; /* quit migrate_pages() immediately */
goto out_unlock;
}
@@ -973,7 +982,7 @@ static ICE_noinline int unmap_and_move(n
goto out;
}

- rc = __unmap_and_move(page, newpage, force, mode, reason);
+ rc = __unmap_and_move(page, newpage, force, mode);
if (rc == MIGRATEPAGE_SUCCESS) {
put_new_page = NULL;
set_page_owner_migrate_reason(newpage, reason);
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -306,6 +306,7 @@ static bool shmem_confirm_swap(struct ad
/* hugehint values: NULL to choose a small page always */
#define SHMEM_ALLOC_SMALL_PAGE ((struct page *)1)
#define SHMEM_ALLOC_HUGE_PAGE ((struct page *)2)
+#define SHMEM_RETRY_HUGE_PAGE ((struct page *)3)
/* otherwise hugehint is the hugeteam page to be used */

/* tag for shrinker to locate unfilled hugepages */
@@ -368,20 +369,6 @@ restart:
put_page(page);
return SHMEM_ALLOC_SMALL_PAGE;
}
- if (PageSwapBacked(page)) {
- if (speculative)
- put_page(page);
- /*
- * This is very often a case of two tasks racing to instantiate
- * the same hole in the huge page, and we don't particularly
- * want to allocate a small page. But holepunch racing with
- * recovery migration, in between migrating to the page and
- * marking it team, can leave a PageSwapBacked NULL mapping
- * page here which we should avoid, and this is the easiest
- * way to handle all the cases correctly.
- */
- return SHMEM_ALLOC_SMALL_PAGE;
- }
return page;
}

@@ -784,7 +771,6 @@ struct recovery {
struct inode *inode;
struct page *page;
pgoff_t head_index;
- struct page *migrated_head;
bool exposed_team;
};

@@ -988,8 +974,7 @@ static void shmem_recovery_swapin(struct
static struct page *shmem_get_recovery_page(struct page *page,
unsigned long private, int **result)
{
- struct recovery *recovery = (struct recovery *)private;
- struct page *head = recovery->page;
+ struct page *head = (struct page *)private;
struct page *newpage = head + (page->index & (HPAGE_PMD_NR-1));

/* Increment refcount to match other routes through recovery_populate */
@@ -999,19 +984,33 @@ static struct page *shmem_get_recovery_p
put_page(newpage);
return NULL;
}
- /* Note when migrating to head: tricky case because already PageTeam */
- if (newpage == head)
- recovery->migrated_head = head;
return newpage;
}

-static void shmem_put_recovery_page(struct page *newpage, unsigned long private)
+/*
+ * shmem_recovery_migrate_page() is called from the heart of page migration's
+ * migrate_page_move_mapping(): with interrupts disabled, mapping->tree_lock
+ * held, page's reference count frozen to 0, and no other reason to turn back.
+ */
+bool shmem_recovery_migrate_page(struct page *newpage, struct page *page)
{
- struct recovery *recovery = (struct recovery *)private;
+ struct page *head = newpage - (page->index & (HPAGE_PMD_NR-1));
+
+ if (!PageTeam(head))
+ return false;
+ if (newpage != head) {
+ /* Needs to be initialized before shmem_added_to_hugeteam() */
+ atomic_long_set(&newpage->team_usage, TEAM_LRU_WEIGHT_ONE);
+ SetPageTeam(newpage);
+ newpage->mapping = page->mapping;
+ newpage->index = page->index;
+ }
+ shmem_added_to_hugeteam(newpage, page_zone(newpage), NULL);
+ return true;
+}

- /* Must reset migrated_head if in the end it was not used */
- if (recovery->migrated_head == newpage)
- recovery->migrated_head = NULL;
+static void shmem_put_recovery_page(struct page *newpage, unsigned long private)
+{
/* Decrement refcount again if newpage was not used */
put_page(newpage);
}
@@ -1024,9 +1023,7 @@ static int shmem_recovery_populate(struc
struct zone *zone = page_zone(head);
pgoff_t index;
bool drained_all = false;
- bool account_head = false;
- int migratable;
- int unmigratable;
+ int unmigratable = 0;
struct page *team;
struct page *endteam = head + HPAGE_PMD_NR;
struct page *page;
@@ -1039,12 +1036,9 @@ static int shmem_recovery_populate(struc

shmem_recovery_swapin(recovery, head);
again:
- migratable = 0;
- unmigratable = 0;
index = recovery->head_index;
for (team = head; team < endteam && !error; index++, team++) {
- if (PageTeam(team) && PageUptodate(team) && PageDirty(team) &&
- !account_head)
+ if (PageTeam(team) && PageUptodate(team) && PageDirty(team))
continue;

page = team; /* used as hint if not yet instantiated */
@@ -1070,8 +1064,7 @@ again:
*/
if (page != team)
error = -ENOENT;
- if (error || !account_head)
- goto unlock;
+ goto unlock;
}

if (PageSwapBacked(team) && page != team) {
@@ -1098,8 +1091,6 @@ again:
SetPageTeam(head);
head->mapping = mapping;
head->index = index;
- if (page == head)
- account_head = true;
}

/* Eviction or truncation or hole-punch already disbanded? */
@@ -1132,12 +1123,9 @@ again:
TEAM_LRU_WEIGHT_ONE);
SetPageTeam(page);
}
- if (page != head || account_head) {
- shmem_added_to_hugeteam(page, zone,
- NULL);
- put_page(page);
- shr_stats(page_teamed);
- }
+ shmem_added_to_hugeteam(page, zone, NULL);
+ put_page(page);
+ shr_stats(page_teamed);
}
spin_unlock_irq(&mapping->tree_lock);
if (page_mapped(page)) {
@@ -1145,16 +1133,13 @@ again:
page_remove_rmap(page, false);
preempt_enable();
}
- account_head = false;
} else {
- VM_BUG_ON(account_head);
if (!PageLRU(page))
lru_add_drain();
if (isolate_lru_page(page) == 0) {
inc_zone_page_state(page, NR_ISOLATED_ANON);
list_add_tail(&page->lru, &migrate);
shr_stats(page_migrate);
- migratable++;
} else {
shr_stats(page_off_lru);
unmigratable++;
@@ -1169,12 +1154,9 @@ unlock:
if (!list_empty(&migrate)) {
lru_add_drain(); /* not necessary but may help debugging */
if (!error) {
- VM_BUG_ON(recovery->page != head);
- recovery->migrated_head = NULL;
nr = migrate_pages(&migrate, shmem_get_recovery_page,
- shmem_put_recovery_page, (unsigned long)
- recovery, MIGRATE_SYNC, MR_SHMEM_RECOVERY);
- account_head = !!recovery->migrated_head;
+ shmem_put_recovery_page, (unsigned long)head,
+ MIGRATE_SHMEM_RECOVERY, MR_SHMEM_RECOVERY);
if (nr < 0) {
/*
* If migrate_pages() returned error (-ENOMEM)
@@ -1189,7 +1171,6 @@ unlock:
if (nr > 0) {
shr_stats_add(page_unmigrated, nr);
unmigratable += nr;
- migratable -= nr;
}
}
putback_movable_pages(&migrate);
@@ -1208,10 +1189,6 @@ unlock:
shr_stats(recov_retried);
goto again;
}
- if (migratable) {
- /* Make another pass to SetPageTeam on them */
- goto again;
- }
}

lock_page(head);
@@ -2687,11 +2664,9 @@ static struct page *shmem_alloc_page(gfp
* add_to_page_cache has the tree_lock.
*/
lock_page(page);
- if (!PageSwapBacked(page) && PageTeam(head))
- goto out;
- unlock_page(page);
- put_page(page);
- *hugehint = SHMEM_ALLOC_SMALL_PAGE;
+ if (PageSwapBacked(page) || !PageTeam(head))
+ *hugehint = SHMEM_RETRY_HUGE_PAGE;
+ goto out;
}
}

@@ -2991,6 +2966,10 @@ repeat:
error = -ENOMEM;
goto decused;
}
+ if (hugehint == SHMEM_RETRY_HUGE_PAGE) {
+ error = -EEXIST;
+ goto decused;
+ }
if (sgp == SGP_WRITE)
__SetPageReferenced(page);
memcg:

2016-04-05 22:05:59

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 30/31] huge tmpfs: shmem_huge_gfpmask and shmem_recovery_gfpmask

We know that compaction latencies can be a problem for transparent
hugepage allocation; and there has been a history of tweaking the
gfpmask used for anon THP allocation at fault time and by khugepaged.
Plus we keep on changing our minds as to whether local smallpages
are generally preferable to remote hugepages, or not.

Anon THP has at least /sys/kernel/mm/transparent_hugepage/defrag and
/sys/kernel/mm/transparent_hugepage/khugepaged/defrag to play with
__GFP_RECLAIM bits in its gfpmasks; but so far there's been nothing
at all in huge tmpfs to experiment with these issues, and doubts are
looming over whether we've made the right choices.

Add /proc/sys/vm/{shmem_huge_gfpmask,shmem_recovery_gfpmask} to
override the defaults we've been using for its synchronous and
asynchronous hugepage allocations so far: make these tunable now,
but with no thought yet given to what values are worth
experimenting with.
Only numeric validation of the input: root must just take care.

Three things make this a more awkward patch than you might expect:

1. We shall want to play with __GFP_THISNODE, but that got added
down inside alloc_pages_vma(): change huge_memory.c to supply it
for anon THP, then have alloc_pages_vma() remove it when unsuitable.

2. It took some time to work out how a global gfpmask template should
modulate a non-standard incoming gfpmask, different bits having
different effects: a shmem_huge_gfp() helper is added for that (see
the sketch after this list).

3. __alloc_pages_slowpath() compared gfpmask with GFP_TRANSHUGE
in a couple of places: which is appropriate for anonymous THP,
but needed a little rework to extend it to huge tmpfs usage,
when we're hoping to be able to tune behavior with these sysctls.
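
A minimal userspace sketch of the modulation rule in point 2 above:
the real shmem_huge_gfp() is in the diff below; the flag names and
values here are placeholders, not the kernel's __GFP bits, and the
zone-bit and __GFP_COMP handling is left out to keep it short.

#include <stdio.h>

#define FAKE_IO		0x1	/* stand-ins for __GFP_IO, __GFP_FS and */
#define FAKE_FS		0x2	/* the reclaim bits which together make */
#define FAKE_RECLAIM	0x4	/* up GFP_KERNEL */
#define FAKE_RELAXANTS	(FAKE_IO | FAKE_FS | FAKE_RECLAIM)
#define FAKE_NORETRY	0x10

/* Relaxants survive only when both template and request allow them;
 * the template's other bits are simply OR'd into the result. */
static unsigned compose(unsigned template, unsigned requested)
{
	unsigned relaxants = template & requested & FAKE_RELAXANTS;

	template  &= ~FAKE_RELAXANTS;
	requested &= ~FAKE_RELAXANTS;
	return template | requested | relaxants;
}

int main(void)
{
	unsigned template = FAKE_IO | FAKE_FS | FAKE_RECLAIM | FAKE_NORETRY;
	unsigned nofs_req = FAKE_IO | FAKE_RECLAIM;	/* caller forbids FS */

	/* Prints 0x15 (IO|RECLAIM|NORETRY): the caller's "no FS" is not
	 * relaxed away by the template, but NORETRY is still added. */
	printf("%#x\n", compose(template, nofs_req));
	return 0;
}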

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 19 ++++++++
Documentation/sysctl/vm.txt | 23 +++++++++-
include/linux/shmem_fs.h | 2
kernel/sysctl.c | 14 ++++++
mm/huge_memory.c | 2
mm/mempolicy.c | 13 +++--
mm/page_alloc.c | 34 ++++++++-------
mm/shmem.c | 58 +++++++++++++++++++++++---
8 files changed, 134 insertions(+), 31 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -192,11 +192,28 @@ In addition to 0 and 1, it also accepts
automatically on for all tmpfs mounts (intended for testing), or -1
to force huge off for all (intended for safety if bugs appeared).

+/proc/sys/vm/shmem_huge_gfpmask (intended for experimentation only):
+
+Default 38146762, that is 0x24612ca:
+GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY.
+Write a gfpmask built from __GFP flags in include/linux/gfp.h, to experiment
+with better alternatives for the synchronous huge tmpfs allocation used
+when faulting or writing.
+
/proc/sys/vm/shmem_huge_recoveries:

Default 8, allows up to 8 concurrent workitems, recovering hugepages
after fragmentation prevented or reclaim disbanded; write 0 to disable
-huge recoveries, or a higher number to allow more concurrent recoveries.
+huge recoveries, or a higher number to allow more concurrent recoveries
+(or a negative number to disable both retry after shrinking, and recovery).
+
+/proc/sys/vm/shmem_recovery_gfpmask (intended for experimentation only):
+
+Default 38142666, that is 0x24602ca:
+GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE.
+Write a gfpmask built from __GFP flags in include/linux/gfp.h, to experiment
+with alternatives for the asynchronous huge tmpfs allocation used in recovery
+from fragmentation or swapping.

/proc/<pid>/smaps shows:

--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -57,7 +57,9 @@ Currently, these files are in /proc/sys/
- panic_on_oom
- percpu_pagelist_fraction
- shmem_huge
+- shmem_huge_gfpmask
- shmem_huge_recoveries
+- shmem_recovery_gfpmask
- stat_interval
- stat_refresh
- swappiness
@@ -765,11 +767,30 @@ See Documentation/filesystems/tmpfs.txt

==============================================================

+shmem_huge_gfpmask
+
+Write a gfpmask built from __GFP flags in include/linux/gfp.h, to experiment
+with better alternatives for the synchronous huge tmpfs allocation used
+when faulting or writing. See Documentation/filesystems/tmpfs.txt.
+/proc/sys/vm/shmem_huge_gfpmask is intended for experimentation only.
+
+==============================================================
+
shmem_huge_recoveries

Default 8, allows up to 8 concurrent workitems, recovering hugepages
after fragmentation prevented or reclaim disbanded; write 0 to disable
-huge recoveries, or a higher number to allow more concurrent recoveries.
+huge recoveries, or a higher number to allow more concurrent recoveries
+(or a negative number to disable both retry after shrinking, and recovery).
+
+==============================================================
+
+shmem_recovery_gfpmask
+
+Write a gfpmask built from __GFP flags in include/linux/gfp.h, to experiment
+with alternatives for the asynchronous huge tmpfs allocation used in recovery
+from fragmentation or swapping. See Documentation/filesystems/tmpfs.txt.
+/proc/sys/vm/shmem_recovery_gfpmask is intended for experimentation only.

==============================================================

--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -89,7 +89,7 @@ extern bool shmem_recovery_migrate_page(
# ifdef CONFIG_SYSCTL
struct ctl_table;
extern int shmem_huge, shmem_huge_min, shmem_huge_max;
-extern int shmem_huge_recoveries;
+extern int shmem_huge_recoveries, shmem_huge_gfpmask, shmem_recovery_gfpmask;
extern int shmem_huge_sysctl(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);
# endif /* CONFIG_SYSCTL */
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1325,12 +1325,26 @@ static struct ctl_table vm_table[] = {
.extra2 = &shmem_huge_max,
},
{
+ .procname = "shmem_huge_gfpmask",
+ .data = &shmem_huge_gfpmask,
+ .maxlen = sizeof(shmem_huge_gfpmask),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "shmem_huge_recoveries",
.data = &shmem_huge_recoveries,
.maxlen = sizeof(shmem_huge_recoveries),
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "shmem_recovery_gfpmask",
+ .data = &shmem_recovery_gfpmask,
+ .maxlen = sizeof(shmem_recovery_gfpmask),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif
#ifdef CONFIG_HUGETLB_PAGE
{
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -883,7 +883,7 @@ static inline gfp_t alloc_hugepage_direc
else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
reclaim_flags = __GFP_DIRECT_RECLAIM;

- return GFP_TRANSHUGE | reclaim_flags;
+ return GFP_TRANSHUGE | __GFP_THISNODE | reclaim_flags;
}

/* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2007,11 +2007,14 @@ retry_cpuset:

nid = interleave_nid(pol, vma, addr, PAGE_SHIFT + order);
mpol_cond_put(pol);
+ if (hugepage)
+ gfp &= ~__GFP_THISNODE;
page = alloc_page_interleave(gfp, order, nid);
goto out;
}

- if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage)) {
+ if (unlikely(IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) && hugepage &&
+ (gfp & __GFP_THISNODE))) {
int hpage_node = node;

/*
@@ -2024,17 +2027,17 @@ retry_cpuset:
* If the policy is interleave, or does not allow the current
* node in its nodemask, we allocate the standard way.
*/
- if (pol->mode == MPOL_PREFERRED &&
- !(pol->flags & MPOL_F_LOCAL))
+ if (pol->mode == MPOL_PREFERRED && !(pol->flags & MPOL_F_LOCAL))
hpage_node = pol->v.preferred_node;

nmask = policy_nodemask(gfp, pol);
if (!nmask || node_isset(hpage_node, *nmask)) {
mpol_cond_put(pol);
- page = __alloc_pages_node(hpage_node,
- gfp | __GFP_THISNODE, order);
+ page = __alloc_pages_node(hpage_node, gfp, order);
goto out;
}
+
+ gfp &= ~__GFP_THISNODE;
}

nmask = policy_nodemask(gfp, pol);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3105,9 +3105,15 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_ma
return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
}

-static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
+static inline bool is_thp_allocation(gfp_t gfp_mask, unsigned int order)
{
- return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
+ /*
+ * !__GFP_KSWAPD_RECLAIM is an unusual choice, and no harm is done if a
+ * similar high order allocation is occasionally misinterpreted as THP.
+ */
+ return IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ !(gfp_mask & __GFP_KSWAPD_RECLAIM) &&
+ (order == HPAGE_PMD_ORDER);
}

static inline struct page *
@@ -3225,7 +3231,7 @@ retry:
goto got_pg;

/* Checks for THP-specific high-order allocations */
- if (is_thp_gfp_mask(gfp_mask)) {
+ if (is_thp_allocation(gfp_mask, order)) {
/*
* If compaction is deferred for high-order allocations, it is
* because sync compaction recently failed. If this is the case
@@ -3247,20 +3253,16 @@ retry:

/*
* If compaction was aborted due to need_resched(), we do not
- * want to further increase allocation latency, unless it is
- * khugepaged trying to collapse.
- */
- if (contended_compaction == COMPACT_CONTENDED_SCHED
- && !(current->flags & PF_KTHREAD))
+ * want to further increase allocation latency at fault.
+ * If continuing, still use asynchronous memory compaction
+ * for THP, unless it is khugepaged trying to collapse,
+ * or an asynchronous huge tmpfs recovery work item.
+ */
+ if (current->flags & PF_KTHREAD)
+ migration_mode = MIGRATE_SYNC_LIGHT;
+ else if (contended_compaction == COMPACT_CONTENDED_SCHED)
goto nopage;
- }
-
- /*
- * It can become very expensive to allocate transparent hugepages at
- * fault, so use asynchronous memory compaction for THP unless it is
- * khugepaged trying to collapse.
- */
- if (!is_thp_gfp_mask(gfp_mask) || (current->flags & PF_KTHREAD))
+ } else
migration_mode = MIGRATE_SYNC_LIGHT;

/* Try direct reclaim and then allocating */
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -323,6 +323,43 @@ static DEFINE_SPINLOCK(shmem_shrinklist_
int shmem_huge __read_mostly;
int shmem_huge_recoveries __read_mostly = 8; /* concurrent recovery limit */

+int shmem_huge_gfpmask __read_mostly =
+ (int)(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY);
+int shmem_recovery_gfpmask __read_mostly =
+ (int)(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE);
+
+/*
+ * Compose the requested_gfp for a small page together with the tunable
+ * template above, to construct a suitable gfpmask for a huge allocation.
+ */
+static gfp_t shmem_huge_gfp(int tunable_template, gfp_t requested_gfp)
+{
+ gfp_t huge_gfpmask = (gfp_t)tunable_template;
+ gfp_t relaxants;
+
+ /*
+ * Relaxants must only be applied when they are permitted in both.
+ * GFP_KERNEL happens to be the name for
+ * __GFP_IO|__GFP_FS|__GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM.
+ */
+ relaxants = huge_gfpmask & requested_gfp & GFP_KERNEL;
+
+ /*
+ * Zone bits must be taken exclusively from the requested gfp mask.
+ * __GFP_COMP would be a disaster: make sure the sysctl cannot add it.
+ */
+ huge_gfpmask &= __GFP_BITS_MASK & ~(GFP_ZONEMASK|__GFP_COMP|GFP_KERNEL);
+
+ /*
+ * These might be right for a small page, but unsuitable for the huge.
+ * REPEAT and NOFAIL very likely wrong in huge_gfpmask, but permitted.
+ */
+ requested_gfp &= ~(__GFP_REPEAT|__GFP_NOFAIL|__GFP_COMP|GFP_KERNEL);
+
+ /* Beyond that, we can simply use the union, sensible or not */
+ return huge_gfpmask | requested_gfp | relaxants;
+}
+
static struct page *shmem_hugeteam_lookup(struct address_space *mapping,
pgoff_t index, bool speculative)
{
@@ -1395,8 +1432,9 @@ static void shmem_recovery_work(struct w
* often choose an unsuitable NUMA node: something to fix soon,
* but not an immediate blocker.
*/
+ gfp = shmem_huge_gfp(shmem_recovery_gfpmask, gfp);
head = __alloc_pages_node(page_to_nid(page),
- gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);
+ gfp, HPAGE_PMD_ORDER);
if (!head) {
shr_stats(huge_failed);
error = -ENOMEM;
@@ -1732,9 +1770,15 @@ static struct shrinker shmem_hugehole_sh
#else /* !CONFIG_TRANSPARENT_HUGEPAGE */

#define shmem_huge SHMEM_HUGE_DENY
+#define shmem_huge_gfpmask GFP_HIGHUSER_MOVABLE
#define shmem_huge_recoveries 0
#define shr_stats(x) do {} while (0)

+static inline gfp_t shmem_huge_gfp(int tunable_template, gfp_t requested_gfp)
+{
+ return requested_gfp;
+}
+
static inline struct page *shmem_hugeteam_lookup(struct address_space *mapping,
pgoff_t index, bool speculative)
{
@@ -2626,14 +2670,16 @@ static struct page *shmem_alloc_page(gfp
rcu_read_unlock();

if (*hugehint == SHMEM_ALLOC_HUGE_PAGE) {
- head = alloc_pages_vma(gfp|__GFP_NORETRY|__GFP_NOWARN,
- HPAGE_PMD_ORDER, &pvma, 0, numa_node_id(),
- true);
+ gfp_t huge_gfp;
+
+ huge_gfp = shmem_huge_gfp(shmem_huge_gfpmask, gfp);
+ head = alloc_pages_vma(huge_gfp,
+ HPAGE_PMD_ORDER, &pvma, 0,
+ numa_node_id(), true);
/* Shrink and retry? Or leave it to recovery worker */
if (!head && !shmem_huge_recoveries &&
shmem_shrink_hugehole(NULL, NULL) != SHRINK_STOP) {
- head = alloc_pages_vma(
- gfp|__GFP_NORETRY|__GFP_NOWARN,
+ head = alloc_pages_vma(huge_gfp,
HPAGE_PMD_ORDER, &pvma, 0,
numa_node_id(), true);
}

2016-04-05 22:07:34

by Hugh Dickins

[permalink] [raw]
Subject: [PATCH 31/31] huge tmpfs: no kswapd by default on sync allocations

From: Andres Lagar-Cavilla <[email protected]>

This triggers early compaction abort while in process context, to
ameliorate mmap semaphore stalls.

Suggested-by: David Rientjes <[email protected]>
Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/filesystems/tmpfs.txt | 5 +++--
mm/shmem.c | 3 ++-
2 files changed, 5 insertions(+), 3 deletions(-)

--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -194,8 +194,9 @@ to force huge off for all (intended for

/proc/sys/vm/shmem_huge_gfpmask (intended for experimentation only):

-Default 38146762, that is 0x24612ca:
-GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY.
+Default 4592330, that is 0x4612ca:
+GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY
+minus __GFP_KSWAPD_RECLAIM.
Write a gfpmask built from __GFP flags in include/linux/gfp.h, to experiment
with better alternatives for the synchronous huge tmpfs allocation used
when faulting or writing.
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -324,7 +324,8 @@ int shmem_huge __read_mostly;
int shmem_huge_recoveries __read_mostly = 8; /* concurrent recovery limit */

int shmem_huge_gfpmask __read_mostly =
- (int)(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY);
+ (int)(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE|__GFP_NORETRY) &
+ ~__GFP_KSWAPD_RECLAIM;
int shmem_recovery_gfpmask __read_mostly =
(int)(GFP_HIGHUSER_MOVABLE|__GFP_NOWARN|__GFP_THISNODE);


2016-04-05 23:37:38

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 17/31] kvm: teach kvm to map page teams as huge pages.



On 05/04/2016 23:41, Hugh Dickins wrote:
> +/*
> + * We are holding kvm->mmu_lock, serializing against mmu notifiers.
> + * We have a ref on page.
> ...
> +static bool is_huge_tmpfs(struct kvm_vcpu *vcpu,
> + unsigned long address, struct page *page)

vcpu is only used to access vcpu->kvm->mm. If it's still possible to
give a sensible rule for locking, I wouldn't mind if is_huge_tmpfs took
the mm directly and was moved out of KVM. Otherwise, it would be quite
easy for people touching mm code to miss it.

Apart from this, both patches look good.

Paolo

2016-04-06 01:12:09

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 17/31] kvm: teach kvm to map page teams as huge pages.

On Wed, 6 Apr 2016, Paolo Bonzini wrote:
> On 05/04/2016 23:41, Hugh Dickins wrote:
> > +/*
> > + * We are holding kvm->mmu_lock, serializing against mmu notifiers.
> > + * We have a ref on page.
> > ...
> > +static bool is_huge_tmpfs(struct kvm_vcpu *vcpu,
> > + unsigned long address, struct page *page)
>
> vcpu is only used to access vcpu->kvm->mm. If it's still possible to

Hah, you've lighted on precisely a line of code where I changed around
what Andres had - I thought it nicer to pass down vcpu, because that
matched the function above, and in many cases vcpu is not dereferenced
here at all. So, definitely blame me not Andres for that interface.

> give a sensible rule for locking, I wouldn't mind if is_huge_tmpfs took
> the mm directly and was moved out of KVM. Otherwise, it would be quite
> easy for people touching mm code to miss it.

Good point. On the other hand, as you acknowledge in your "If...",
it might turn out to be too special-purpose in its assumptions to be
a safe export from core mm: Andres and I need to give it more thought.

>
> Apart from this, both patches look good.

Thanks so much for such a quick response; and contrary to what I'd
expected in my "FYI" comment, Andrew has taken them into his tree,
to give them some early exposure via mmotm and linux-next - but
of course that doesn't stop us from changing it as you suggest -
we'll think it over again.

Hugh

2016-04-06 06:47:13

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH 17/31] kvm: teach kvm to map page teams as huge pages.



On 06/04/2016 03:12, Hugh Dickins wrote:
> Hah, you've lighted on precisely a line of code where I changed around
> what Andres had - I thought it nicer to pass down vcpu, because that
> matched the function above, and in many cases vcpu is not dereferenced
> here at all. So, definitely blame me not Andres for that interface.
>

Oh, actually I'm fine with the interface if it's in arch/x86/kvm. I'm
just pointing out that---putting aside the locking question---it's a
pretty generic thing that doesn't really need access to KVM data structures.

Paolo

2016-04-06 07:00:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 12/31] huge tmpfs: extend get_user_pages_fast to shmem pmd


* Hugh Dickins <[email protected]> wrote:

> The arch-specific get_user_pages_fast() has a gup_huge_pmd() designed to
> optimize the refcounting on anonymous THP and hugetlbfs pages, with one
> atomic addition to compound head's common refcount. That optimization
> must be avoided on huge tmpfs team pages, which use normal separate page
> refcounting. We could combine the PageTeam and PageCompound cases into
> a single simple loop, but would lose the compound optimization that way.
>
> One cannot go through these functions without wondering why some arches
> (x86, mips) like to SetPageReferenced, while the rest do not: an x86
> optimization that missed being propagated to the other architectures?
> No, see commit 8ee53820edfd ("thp: mmu_notifier_test_young"): it's a
> KVM GRU EPT thing, maybe not useful beyond x86. I've just followed
> the established practice in each architecture.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Cc'ed to arch maintainers as an FYI: this patch is not expected to
> go into the tree in the next few weeks, and depends upon a PageTeam
> definition not yet available outside this huge tmpfs patchset.
> Please refer to linux-mm or linux-kernel for more context.
>
> arch/mips/mm/gup.c | 15 ++++++++++++++-
> arch/s390/mm/gup.c | 19 ++++++++++++++++++-
> arch/sparc/mm/gup.c | 19 ++++++++++++++++++-
> arch/x86/mm/gup.c | 15 ++++++++++++++-
> mm/gup.c | 19 ++++++++++++++++++-
> 5 files changed, 82 insertions(+), 5 deletions(-)
>
> --- a/arch/mips/mm/gup.c
> +++ b/arch/mips/mm/gup.c
> @@ -81,9 +81,22 @@ static int gup_huge_pmd(pmd_t pmd, unsig
> VM_BUG_ON(pte_special(pte));
> VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
>
> - refs = 0;
> head = pte_page(pte);
> page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +
> + if (PageTeam(head)) {
> + /* Handle a huge tmpfs team with normal refcounting. */
> + do {
> + get_page(page);
> + SetPageReferenced(page);
> + pages[*nr] = page;
> + (*nr)++;
> + page++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return 1;
> + }
> +
> + refs = 0;
> do {
> VM_BUG_ON(compound_head(page) != head);
> pages[*nr] = page;
> --- a/arch/s390/mm/gup.c
> +++ b/arch/s390/mm/gup.c
> @@ -66,9 +66,26 @@ static inline int gup_huge_pmd(pmd_t *pm
> return 0;
> VM_BUG_ON(!pfn_valid(pmd_val(pmd) >> PAGE_SHIFT));
>
> - refs = 0;
> head = pmd_page(pmd);
> page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +
> + if (PageTeam(head)) {
> + /* Handle a huge tmpfs team with normal refcounting. */
> + do {
> + if (!page_cache_get_speculative(page))
> + return 0;
> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
> + put_page(page);
> + return 0;
> + }
> + pages[*nr] = page;
> + (*nr)++;
> + page++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return 1;
> + }
> +
> + refs = 0;
> do {
> VM_BUG_ON(compound_head(page) != head);
> pages[*nr] = page;
> --- a/arch/sparc/mm/gup.c
> +++ b/arch/sparc/mm/gup.c
> @@ -77,9 +77,26 @@ static int gup_huge_pmd(pmd_t *pmdp, pmd
> if (write && !pmd_write(pmd))
> return 0;
>
> - refs = 0;
> head = pmd_page(pmd);
> page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +
> + if (PageTeam(head)) {
> + /* Handle a huge tmpfs team with normal refcounting. */
> + do {
> + if (!page_cache_get_speculative(page))
> + return 0;
> + if (unlikely(pmd_val(pmd) != pmd_val(*pmdp))) {
> + put_page(page);
> + return 0;
> + }
> + pages[*nr] = page;
> + (*nr)++;
> + page++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return 1;
> + }
> +
> + refs = 0;
> do {
> VM_BUG_ON(compound_head(page) != head);
> pages[*nr] = page;
> --- a/arch/x86/mm/gup.c
> +++ b/arch/x86/mm/gup.c
> @@ -196,9 +196,22 @@ static noinline int gup_huge_pmd(pmd_t p
> /* hugepages are never "special" */
> VM_BUG_ON(pmd_flags(pmd) & _PAGE_SPECIAL);
>
> - refs = 0;
> head = pmd_page(pmd);
> page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +
> + if (PageTeam(head)) {
> + /* Handle a huge tmpfs team with normal refcounting. */
> + do {
> + get_page(page);
> + SetPageReferenced(page);
> + pages[*nr] = page;
> + (*nr)++;
> + page++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return 1;
> + }
> +
> + refs = 0;
> do {
> VM_BUG_ON_PAGE(compound_head(page) != head, page);
> pages[*nr] = page;
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1247,9 +1247,26 @@ static int gup_huge_pmd(pmd_t orig, pmd_
> if (write && !pmd_write(orig))
> return 0;
>
> - refs = 0;
> head = pmd_page(orig);
> page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
> +
> + if (PageTeam(head)) {
> + /* Handle a huge tmpfs team with normal refcounting. */
> + do {
> + if (!page_cache_get_speculative(page))
> + return 0;
> + if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
> + put_page(page);
> + return 0;
> + }
> + pages[*nr] = page;
> + (*nr)++;
> + page++;
> + } while (addr += PAGE_SIZE, addr != end);
> + return 1;
> + }
> +
> + refs = 0;
> do {
> VM_BUG_ON_PAGE(compound_head(page) != head, page);
> pages[*nr] = page;

Ouch!

Looks like there are two main variants - so these kinds of repetitive patterns
very much call for some sort of factoring out of common code, right?

Then the fix could be applied to the common portion(s) only, which will cut down
this gigantic diffstat:

> 5 files changed, 82 insertions(+), 5 deletions(-)

Thanks,

Ingo
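
For illustration only, one possible shape of such a shared helper: the
name gup_team_refs() and the get_ref() indirection are invented, not
taken from the series, and the speculative variants (s390, sparc,
generic mm/gup.c) would still need their pmd recheck passed through
somehow, so this only sketches the direction.

/*
 * The PageTeam loop common to the gup_huge_pmd() variants: get_ref()
 * stands for plain get_page() on x86/mips, or a speculative reference
 * elsewhere; returning 0 makes the caller fall back to the slow path.
 */
static int gup_team_refs(struct page *page, unsigned long addr,
			 unsigned long end, struct page **pages, int *nr,
			 int (*get_ref)(struct page *page))
{
	do {
		if (!get_ref(page))
			return 0;
		pages[*nr] = page;
		(*nr)++;
		page++;
	} while (addr += PAGE_SIZE, addr != end);
	return 1;
}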

2016-04-06 10:28:45

by Mika Penttilä

[permalink] [raw]
Subject: Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

On 04/06/2016 12:53 AM, Hugh Dickins wrote:



> +static void shmem_recovery_work(struct work_struct *work)
> +{
> + struct recovery *recovery;
> + struct shmem_inode_info *info;
> + struct address_space *mapping;
> + struct page *page;
> + struct page *head = NULL;
> + int error = -ENOENT;
> +
> + recovery = container_of(work, struct recovery, work);
> + info = SHMEM_I(recovery->inode);
> + if (!shmem_work_still_useful(recovery)) {
> + shr_stats(work_too_late);
> + goto out;
> + }
> +
> + /* Are we resuming from an earlier partially successful attempt? */
> + mapping = recovery->inode->i_mapping;
> + spin_lock_irq(&mapping->tree_lock);
> + page = shmem_clear_tag_hugehole(mapping, recovery->head_index);
> + if (page)
> + head = team_head(page);
> + spin_unlock_irq(&mapping->tree_lock);
> + if (head) {
> + /* Serialize with shrinker so it won't mess with our range */
> + spin_lock(&shmem_shrinklist_lock);
> + spin_unlock(&shmem_shrinklist_lock);
> + }
> +
> + /* If team is now complete, no tag and head would be found above */
> + page = recovery->page;
> + if (PageTeam(page))
> + head = team_head(page);
> +
> + /* Get a reference to the head of the team already being assembled */
> + if (head) {
> + if (!get_page_unless_zero(head))
> + head = NULL;
> + else if (!PageTeam(head) || head->mapping != mapping ||
> + head->index != recovery->head_index) {
> + put_page(head);
> + head = NULL;
> + }
> + }
> +
> + if (head) {
> + /* We are resuming work from a previous partial recovery */
> + if (PageTeam(page))
> + shr_stats(resume_teamed);
> + else
> + shr_stats(resume_tagged);
> + } else {
> + gfp_t gfp = mapping_gfp_mask(mapping);
> + /*
> + * XXX: Note that with swapin readahead, page_to_nid(page) will
> + * often choose an unsuitable NUMA node: something to fix soon,
> + * but not an immediate blocker.
> + */
> + head = __alloc_pages_node(page_to_nid(page),
> + gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);
> + if (!head) {
> + shr_stats(huge_failed);
> + error = -ENOMEM;
> + goto out;
> + }

Should this head be marked PageTeam? Because in patch 27/31, when given as a hint to shmem_getpage_gfp():

hugehint = NULL;
+ if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
+ sgp == SGP_TEAM && *pagep) {
+ struct page *head;
+
+ if (!get_page_unless_zero(*pagep)) {
+ error = -ENOENT;
+ goto decused;
+ }
+ page = *pagep;
+ lock_page(page);
+ head = page - (index & (HPAGE_PMD_NR-1));

we always fail because:
+ if (!PageTeam(head)) {
+ error = -ENOENT;
+ goto decused;
+ }


> + if (!shmem_work_still_useful(recovery)) {
> + __free_pages(head, HPAGE_PMD_ORDER);
> + shr_stats(huge_too_late);
> + goto out;
> + }
> + split_page(head, HPAGE_PMD_ORDER);
> + get_page(head);
> + shr_stats(huge_alloced);
> + }


Thanks,
Mika

2016-04-07 02:05:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 23/31] huge tmpfs recovery: framework for reconstituting huge pages

On Wed, 6 Apr 2016, Mika Penttila wrote:
> On 04/06/2016 12:53 AM, Hugh Dickins wrote:
> > +static void shmem_recovery_work(struct work_struct *work)
...
> > +
> > + if (head) {
> > + /* We are resuming work from a previous partial recovery */
> > + if (PageTeam(page))
> > + shr_stats(resume_teamed);
> > + else
> > + shr_stats(resume_tagged);
> > + } else {
> > + gfp_t gfp = mapping_gfp_mask(mapping);
> > + /*
> > + * XXX: Note that with swapin readahead, page_to_nid(page) will
> > + * often choose an unsuitable NUMA node: something to fix soon,
> > + * but not an immediate blocker.
> > + */
> > + head = __alloc_pages_node(page_to_nid(page),
> > + gfp | __GFP_NOWARN | __GFP_THISNODE, HPAGE_PMD_ORDER);
> > + if (!head) {
> > + shr_stats(huge_failed);
> > + error = -ENOMEM;
> > + goto out;
> > + }
>
> Should this head be marked PageTeam? Because in patch 27/31, when given as a hint to shmem_getpage_gfp():
>
> hugehint = NULL;
> + if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE) &&
> + sgp == SGP_TEAM && *pagep) {
> + struct page *head;
> +
> + if (!get_page_unless_zero(*pagep)) {
> + error = -ENOENT;
> + goto decused;
> + }
> + page = *pagep;
> + lock_page(page);
> + head = page - (index & (HPAGE_PMD_NR-1));
>
> we always fail because:
> + if (!PageTeam(head)) {
> + error = -ENOENT;
> + goto decused;
> + }

Great observation, thank you Mika.

We don't always fail, because in most cases the page wanted for the head
will either be already in memory, or read in from swap, and the SGP_TEAM
block in shmem_getpage_gfp() (with the -ENOENT you show) does not come into
play on it: then shmem_recovery_populate() does its !recovery->exposed_team
SetPageTeam(head) and all is well from then on.

But I think what you point out means that the current recovery code is
incapable of assembling a hugepage if its first page was not already
instantiated earlier: not something I'd realized until you showed me.
Not a common failing, and would never be the case for an extent which had
been mapped huge in the past, but it's certainly not what I'd intended.

As to whether the head should be marked PageTeam immediately after the
hugepage allocation: I think not, especially because of the swapin case
(26/31). Swapin may need to read data from disk into that head page,
and I've never had to think about the consequences of having a swap
page marked PageTeam. Perhaps it would work out okay, but I'd prefer
not to go there.

At this moment I'm too tired to think what the right answer will be,
and certainly won't be able to commit to any without some testing.

So, not as incapacitating as perhaps you thought, and not any danger
to people trying out huge tmpfs, but definitely something to be fixed:
I'll mull it over in the background and let you know when I'm sure.

Thank you again,
Hugh

2016-04-07 02:54:09

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 12/31] huge tmpfs: extend get_user_pages_fast to shmem pmd

On Wed, 6 Apr 2016, Ingo Molnar wrote:
> * Hugh Dickins <[email protected]> wrote:
>
> > ---
> > Cc'ed to arch maintainers as an FYI: this patch is not expected to
> > go into the tree in the next few weeks, and depends upon a PageTeam
> > definition not yet available outside this huge tmpfs patchset.
> > Please refer to linux-mm or linux-kernel for more context.

Actually, Andrew took it and the rest into mmotm yesterday, to give them
better exposure through linux-next, so they should appear there soon.

> >
> > arch/mips/mm/gup.c | 15 ++++++++++++++-
> > arch/s390/mm/gup.c | 19 ++++++++++++++++++-
> > arch/sparc/mm/gup.c | 19 ++++++++++++++++++-
> > arch/x86/mm/gup.c | 15 ++++++++++++++-
> > mm/gup.c | 19 ++++++++++++++++++-
> > 5 files changed, 82 insertions(+), 5 deletions(-)
...
>
> Ouch!

Oh sorry, I didn't mean to hurt you ;)

>
> Looks like there are two main variants - so these kinds of repetitive patterns
> very much call for some sort of factoring out of common code, right?

Hmm. I'm still struggling between the two extremes, of

(a) agreeing completely with you, and saying, yeah, I'll take on the job
of refactoring every architecture's get_user_pages_as_fast_as_you_can(),
without much likelihood of testing more than one,

and

(b) running a mile, and pointing out that we have a tradition of using
arch/x86/mm/gup.c as a template for the others, and here I've just
added a few more lines to that template (which never gets built more
than once into any kernel).

Both are appealing in their different ways, but I think you can tell
which I'm leaning towards...

Honestly, I am still struggling between those two; but I think the patch
as it stands is one thing, and cleanup for commonality should be another,
however weaselly that sounds ("I'll come back to it" - yeah, right).

Hugh

>
> Then the fix could be applied to the common portion(s) only, which will cut down
> this gigantic diffstat:
>
> > 5 files changed, 82 insertions(+), 5 deletions(-)
>
> Thanks,
>
> Ingo

2016-04-11 11:05:54

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 01/31] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m

On Tue, Apr 05, 2016 at 02:12:26PM -0700, Hugh Dickins wrote:
> ShmemFreeHoles will show the wastage from using huge pages for small, or
> sparsely occupied, or unrounded files: wastage not included in Shmem or
> MemFree, but will be freed under memory pressure. (But no count for the
> partially occupied portions of huge pages: seems less important, but
> could be added.)

And here the first difference in interfaces comes up: I don't have an
equivalent in my implementation, as I don't track such information.
It looks like an implementation detail of the team-pages-based huge tmpfs.

We don't track anything similar for anon-THP.

> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3830,6 +3830,11 @@ out:
> }
>
> #define K(x) ((x) << (PAGE_SHIFT-10))
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define THPAGE_PMD_NR HPAGE_PMD_NR
> +#else
> +#define THPAGE_PMD_NR 0 /* Avoid BUILD_BUG() */
> +#endif

I've just put the THP-related counters on a separate line and wrapped
them in an #ifdef.


--
Kirill A. Shutemov

2016-04-11 11:17:11

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 03/31] huge tmpfs: huge=N mount option and /proc/sys/vm/shmem_huge

On Tue, Apr 05, 2016 at 02:15:05PM -0700, Hugh Dickins wrote:
> Plumb in a new "huge=1" or "huge=0" mount option to tmpfs: I don't
> want to get into a maze of boot options, madvises and fadvises at
> this stage, nor extend the use of the existing THP tuning to tmpfs;
> though either might be pursued later on. We just want a way to ask
> a tmpfs filesystem to favor huge pages, and a way to turn that off
> again when it doesn't work out so well. Default of course is off.
>
> "mount -o remount,huge=N /mountpoint" works fine after mount:
> remounting from huge=1 (on) to huge=0 (off) will not attempt to
> break up huge pages at all, just stop more from being allocated.
>
> It's possible that we shall allow more values for the option later,
> to select different strategies (e.g. how hard to try when allocating
> huge pages, or when to map hugely and when not, or how sparse a huge
> page should be before it is split up), either for experiments, or well
> baked in: so use an unsigned char in the superblock rather than a bool.

Making the value a string from the beginning would be a better choice in
my opinion. As more allocation policies are implemented, a number would
not make much sense.

For the record, my implementation has four allocation policies: never,
always, within_size and advise.

>
> No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
> which is the appropriate option to protect those who don't want
> the new bloat, and with which we shall share some pmd code. Use a
> "name=numeric_value" format like most other tmpfs options. Prohibit
> the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid
> without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit).
> Allow setting >0 only if the machine has_transparent_hugepage().
>
> But what about Shmem with no user-visible mount? SysV SHM, memfds,
> shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
> DRM objects, ashmem. Though unlikely to suit all usages, provide
> sysctl /proc/sys/vm/shmem_huge to experiment with huge on those. We
> may add a memfd_create flag and a per-file huge/non-huge fcntl later.

I use a sysfs knob instead:

/sys/kernel/mm/transparent_hugepage/shmem_enabled

And string values there as well. It's a better match for the current THP interface.

> And allow shmem_huge two further values: -1 for use in emergencies,
> to force the huge option off from all mounts; and (currently) 2,
> to force the huge option on for all - very useful for testing.

In my case, it's "deny" and "force".

--
Kirill A. Shutemov

2016-04-11 11:54:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 09/31] huge tmpfs: avoid premature exposure of new pagetable

On Tue, Apr 05, 2016 at 02:24:23PM -0700, Hugh Dickins wrote:
> In early development, a huge tmpfs fault simply replaced the pmd which
> pointed to the empty pagetable just allocated in __handle_mm_fault():
> but that is unsafe.
>
> Andrea wrote a very interesting comment on THP in mm/memory.c,
> just before the end of __handle_mm_fault():
>
> * A regular pmd is established and it can't morph into a huge pmd
> * from under us anymore at this point because we hold the mmap_sem
> * read mode and khugepaged takes it in write mode. So now it's
> * safe to run pte_offset_map().
>
> This comment hints at several difficulties, which anon THP solved
> for itself with mmap_sem and anon_vma lock, but which huge tmpfs
> may need to solve differently.
>
> The reference to pte_offset_map() above: I believe that's a hint
> that on a 32-bit machine, the pagetables might need to come from
> kernel-mapped memory, but a huge pmd pointing to user memory beyond
> that limit could be racily substituted, causing undefined behavior
> in the architecture-dependent pte_offset_map().
>
> That itself is not a problem on x86_64, but there's plenty more:
> how about those places which use pte_offset_map_lock() - if that
> spinlock is in the struct page of a pagetable, which has been
> deposited and might be withdrawn and freed at any moment (being
> on a list unattached to the allocating pmd in the case of x86),
> taking the spinlock might corrupt someone else's struct page.
>
> Because THP has departed from the earlier rules (when pagetable
> was only freed under exclusive mmap_sem, or at exit_mmap, after
> removing all affected vmas from the rmap list): zap_huge_pmd()
> does pte_free() even when serving MADV_DONTNEED under down_read
> of mmap_sem.

Emm.. The pte table freed from zap_huge_pmd() is from the deposit. It wasn't
linked into the process' page table tree. So I don't see how THP has departed
from the rules.

I don't think it changes anything in the implementation, but this part of
the commit message, I believe, is inaccurate.

> And what of the "entry = *pte" at the start of handle_pte_fault(),
> getting the entry used in pte_same(,orig_pte) tests to validate all
> fault handling? If that entry can itself be junk picked out of some
> freed and reused pagetable, it's hard to estimate the consequences.
>
> We need to consider the safety of concurrent faults, and the
> safety of rmap lookups, and the safety of miscellaneous operations
> such as smaps_pte_range() for reading /proc/<pid>/smaps.
>
> I set out to make safe the places which descend pgd,pud,pmd,pte,
> using more careful access techniques like mm_find_pmd(); but with
> pte_offset_map() being architecture-defined, found it too big a job
> to tighten up all over.
>
> Instead, approach from the opposite direction: just do not expose
> a pagetable in an empty *pmd, until vm_ops->fault has had a chance
> to ask for a huge pmd there. This is a much easier change to make,
> and we are lucky that all the driver faults appear to be using
> interfaces (like vm_insert_page() and remap_pfn_range()) which
> automatically do the pte_alloc() if it was not already done.
>
> But we must not get stuck refaulting: need FAULT_FLAG_MAY_HUGE for
> __do_fault() to tell shmem_fault() to try for huge only when *pmd is
> empty (could instead add pmd to vmf and let shmem work that out for
> itself, but probably better to hide pmd from vm_ops->faults).
>
> Without a pagetable to hold the pte_none() entry found in a newly
> allocated pagetable, handle_pte_fault() would like to provide a static
> none entry for later orig_pte checks. But architectures have never had
> to provide that definition before; and although almost all use zeroes
> for an empty pagetable, a few do not - nios2, s390, um, xtensa.
>
> Never mind, forget about pte_same(,orig_pte), the three __do_fault()
> callers can follow do_anonymous_page(), and just use a pte_none() check.
>
> do_fault_around() presents one last problem: it wants pagetable to
> have been allocated, but was being called by do_read_fault() before
> __do_fault(). I see no disadvantage to moving it after, allowing huge
> pmd to be chosen first; but Kirill reports additional radix-tree lookup
> in hot pagecache case when he implemented faultaround: needs further
> investigation.

In my implementation, faultaround can establish PMD mappings, so there's
no disadvantage to calling faultaround first.

And if faultaround happens to solve the page fault, we don't need to do
the usual ->fault lookup.

> Note: after months of use, we recently hit an OOM deadlock: this patch
> moves the new pagetable allocation inside where page lock is held on a
> pagecache page, and exit's munlock_vma_pages_all() takes page lock on
> all mlocked pages. Both parties are behaving badly: we hope to change
> munlock to use trylock_page() instead, but should certainly switch here
> to preallocating the pagetable outside the page lock. But I've not yet
> written and tested that change.

Hm. Okay, I need to fix this in my implementation too. It shouldn't be
too hard, as I already have fe->pte_prealloc around.
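
For reference, the ordering fix amounts to the usual pattern of
preallocating before taking the lock. Below is a toy sketch in plain C
(hypothetical names; a mutex stands in for the page lock), only to pin
down the intended order: allocate with no locks held, consume under the
lock, free the spare if a racing fault got there first.

	#include <stdlib.h>
	#include <pthread.h>

	/*
	 * Toy model, not kernel code: "table" stands in for the pagetable
	 * to be installed, and page_lock for the pagecache page lock which
	 * exit's munlock path may also take.  The only point is the
	 * ordering: the allocation happens with no locks held.
	 */
	struct toy_page {
		pthread_mutex_t page_lock;
		void *table;
	};

	static int toy_fault(struct toy_page *page)
	{
		void *prealloc = malloc(4096);	/* allocate before locking */

		if (!prealloc)
			return -1;		/* fail, but cannot deadlock */

		pthread_mutex_lock(&page->page_lock);
		if (!page->table) {
			page->table = prealloc;	/* consume the preallocation */
			prealloc = NULL;
		}
		pthread_mutex_unlock(&page->page_lock);

		free(prealloc);			/* raced: discard the spare */
		return 0;
	}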

--
Kirill A. Shutemov

2016-04-13 08:58:56

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 12/31] huge tmpfs: extend get_user_pages_fast to shmem pmd


* Hugh Dickins <[email protected]> wrote:

> > > arch/mips/mm/gup.c | 15 ++++++++++++++-
> > > arch/s390/mm/gup.c | 19 ++++++++++++++++++-
> > > arch/sparc/mm/gup.c | 19 ++++++++++++++++++-
> > > arch/x86/mm/gup.c | 15 ++++++++++++++-
> > > mm/gup.c | 19 ++++++++++++++++++-
> > > 5 files changed, 82 insertions(+), 5 deletions(-)
> ...

> > Looks like there are two main variants - so these kinds of repetitive patterns
> > very much call for some sort of factoring out of common code, right?
>
> Hmm. I'm still struggling between the two extremes, of
>
> (a) agreeing completely with you, and saying, yeah, I'll take on the job
> of refactoring every architecture's get_user_pages_as_fast_as_you_can(),
> without much likelihood of testing more than one,
>
> and
>
> (b) running a mile, and pointing out that we have a tradition of using
> arch/x86/mm/gup.c as a template for the others, and here I've just
> added a few more lines to that template (which never gets built more
> than once into any kernel).
>
> Both are appealing in their different ways, but I think you can tell
> which I'm leaning towards...
>
> Honestly, I am still struggling between those two; but I think the patch
> as it stands is one thing, and cleanup for commonality should be another,
> however weaselly that sounds ("I'll come back to it" - yeah, right).

Yeah, so my worry is this: your patch, for example, roughly doubles the
algorithmic complexity of mm/gup.c and arch/*/mm/gup.c's ::gup_huge_pmd().

And you want this to add a new feature!

So it really looks to me like this is the last sane chance to unify cheaply,
then add the feature you want. Everyone else in the future will be able to
refer to your example to chicken out! ;-)
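
For anyone counting, the per-arch copies all share roughly this shape
(a simplified sketch, not the exact code of any tree): take a speculative
reference on the head page, fill the pages[] array, then recheck the pmd.
The new feature has to be threaded through each copy.

	static int gup_huge_pmd(pmd_t orig, pmd_t *pmdp, unsigned long addr,
				unsigned long end, int write,
				struct page **pages, int *nr)
	{
		struct page *head, *page;
		int refs = 0;

		if (write && !pmd_write(orig))
			return 0;

		head = pmd_page(orig);
		page = head + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
		do {
			pages[*nr] = page;
			(*nr)++;
			page++;
			refs++;
		} while (addr += PAGE_SIZE, addr != end);

		/* Speculative reference: may fail if the page is being freed. */
		if (!page_cache_add_speculative(head, refs)) {
			*nr -= refs;
			return 0;
		}

		/* Recheck: the pmd may have changed while we walked unlocked. */
		if (unlikely(pmd_val(orig) != pmd_val(*pmdp))) {
			*nr -= refs;
			while (refs--)
				put_page(head);
			return 0;
		}
		return 1;
	}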

Thanks,

Ingo

2016-04-17 01:49:39

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 09/31] huge tmpfs: avoid premature exposure of new pagetable

On Mon, 11 Apr 2016, Kirill A. Shutemov wrote:
> On Tue, Apr 05, 2016 at 02:24:23PM -0700, Hugh Dickins wrote:
> >
> > That itself is not a problem on x86_64, but there's plenty more:
> > how about those places which use pte_offset_map_lock() - if that
> > spinlock is in the struct page of a pagetable, which has been
> > deposited and might be withdrawn and freed at any moment (being
> > on a list unattached to the allocating pmd in the case of x86),
> > taking the spinlock might corrupt someone else's struct page.
> >
> > Because THP has departed from the earlier rules (when pagetable
> > was only freed under exclusive mmap_sem, or at exit_mmap, after
> > removing all affected vmas from the rmap list): zap_huge_pmd()
> > does pte_free() even when serving MADV_DONTNEED under down_read
> > of mmap_sem.
>
> Emm.. The pte table freed from zap_huge_pmd() comes from the deposit.
> It was never linked into the process's page table tree. So I don't see
> how THP has departed from the rules.

That's true at the time that it is freed: but my point was that we
don't know the past history of that pagetable, which might have been
linked in and contained visible ptes very recently, without sufficient
barriers in between. And in the x86 case (perhaps any non-powerpc
case), there's no logical association between the pagetable freed
and the place that it's freed from.
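
To make that last point concrete, here is a toy model in plain C
(hypothetical names, nothing like the real deposit code): the spare
pagetables sit on a single per-mm list, so the table withdrawn when one
huge pmd is zapped need not be the one deposited when that pmd went in.

	#include <stdio.h>

	struct toy_pagetable {
		struct toy_pagetable *next;
		int id;			/* stand-in for a real page of ptes */
	};

	struct toy_mm {
		struct toy_pagetable *deposited;  /* one list for the whole mm */
	};

	static void deposit(struct toy_mm *mm, struct toy_pagetable *pt)
	{
		pt->next = mm->deposited;
		mm->deposited = pt;
	}

	static struct toy_pagetable *withdraw(struct toy_mm *mm)
	{
		struct toy_pagetable *pt = mm->deposited;

		if (pt)
			mm->deposited = pt->next;
		return pt;
	}

	int main(void)
	{
		struct toy_mm mm = { NULL };
		struct toy_pagetable a = { NULL, 1 }, b = { NULL, 2 };

		deposit(&mm, &a);	/* deposited when huge pmd A went in */
		deposit(&mm, &b);	/* deposited when huge pmd B went in */

		/* Zapping pmd A hands back whichever table is at the head. */
		printf("withdrew table %d\n", withdraw(&mm)->id);
		return 0;
	}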

Now, I've certainly not paged back in all the anxieties I had, and
avenues I'd gone down, at the time that I first wrote that comment:
it's quite possible that they were self-inflicted issues, and
perhaps remnants of earlier ways in which I'd tried ordering it.
I'm sure that I never reached any "hey, anon THP has got this wrong"
conclusion, merely doubts, and surprise that it could free a
pagetable there.

But it looks as if all these ruminations here will vanish with the
revert of the patch. Though one day I'll probably be worrying
about it again, when I try to remove the need for mmap_sem
protection around recovery's remap_team_by_pmd().

> > do_fault_around() presents one last problem: it wants pagetable to
> > have been allocated, but was being called by do_read_fault() before
> > __do_fault(). I see no disadvantage to moving it after, allowing huge
> > pmd to be chosen first; but Kirill reports additional radix-tree lookup
> > in hot pagecache case when he implemented faultaround: needs further
> > investigation.
>
> In my implementation, faultaround can establish PMD mappings, so there's
> no disadvantage to calling faultaround first.
>
> And if faultaround happens to solve the page fault, we don't need to do
> the usual ->fault lookup.

Sounds good, though not something I'll be looking to add in myself:
feel free to add it, but maybe it fits easier with compound pages.

Hugh

2016-04-17 02:00:36

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 03/31] huge tmpfs: huge=N mount option and /proc/sys/vm/shmem_huge

On Mon, 11 Apr 2016, Kirill A. Shutemov wrote:
> On Tue, Apr 05, 2016 at 02:15:05PM -0700, Hugh Dickins wrote:
> > Plumb in a new "huge=1" or "huge=0" mount option to tmpfs: I don't
> > want to get into a maze of boot options, madvises and fadvises at
> > this stage, nor extend the use of the existing THP tuning to tmpfs;
> > though either might be pursued later on. We just want a way to ask
> > a tmpfs filesystem to favor huge pages, and a way to turn that off
> > again when it doesn't work out so well. Default of course is off.
> >
> > "mount -o remount,huge=N /mountpoint" works fine after mount:
> > remounting from huge=1 (on) to huge=0 (off) will not attempt to
> > break up huge pages at all, just stop more from being allocated.
> >
> > It's possible that we shall allow more values for the option later,
> > to select different strategies (e.g. how hard to try when allocating
> > huge pages, or when to map hugely and when not, or how sparse a huge
> > page should be before it is split up), either for experiments, or well
> > baked in: so use an unsigned char in the superblock rather than a bool.
>
> Making the value a string from the beginning would be a better choice, in
> my opinion. As more allocation policies are implemented, a number would
> not make much sense.

I'll probably agree about the strings. Though we have not in fact
devised any more allocation policies so far, and perhaps never will
at this mount level.
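
For what it's worth, the numeric interface as it stands is easy enough
to drive; a minimal sketch (the /mnt/hugetmp path is just an example):

	#include <stdio.h>
	#include <sys/mount.h>

	int main(void)
	{
		/*
		 * Mount a tmpfs that favours huge pages, then turn the option
		 * off again by remounting: huge=0 only stops further huge
		 * allocations, it does not break up pages already allocated.
		 */
		if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", 0, "huge=1")) {
			perror("mount huge=1");
			return 1;
		}
		if (mount("tmpfs", "/mnt/hugetmp", "tmpfs", MS_REMOUNT, "huge=0")) {
			perror("remount huge=0");
			return 1;
		}
		return 0;
	}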

>
> For the record, my implementation has four allocation policies: never,
> always, within_size and advise.

I'm sceptical that anyone will get into choosing "within_size".

>
> >
> > No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE,
> > which is the appropriate option to protect those who don't want
> > the new bloat, and with which we shall share some pmd code. Use a
> > "name=numeric_value" format like most other tmpfs options. Prohibit
> > the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is invalid
> > without CONFIG_NUMA (was hidden in mpol_parse_str(): make it explicit).
> > Allow setting >0 only if the machine has_transparent_hugepage().
> >
> > But what about Shmem with no user-visible mount? SysV SHM, memfds,
> > shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers'
> > DRM objects, ashmem. Though unlikely to suit all usages, provide
> > sysctl /proc/sys/vm/shmem_huge to experiment with huge on those. We
> > may add a memfd_create flag and a per-file huge/non-huge fcntl later.
>
> I use a sysfs knob instead:
>
> /sys/kernel/mm/transparent_hugepage/shmem_enabled
>
> And string values there as well. It's a better match for the current
> THP interface.

It's certainly been easier for me to get it up and running without
having to respect all the anon THP knobs. But I do expect some
pressure to conform a bit more now.
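
To be concrete about what the sysctl looks like from userspace, a
minimal sketch (needs root; the -1 and 2 values quoted below force the
option off or on for all mounts):

	#include <stdio.h>

	int main(void)
	{
		/* Ask the internal shmem mounts (SysV SHM, memfds, shared
		 * anonymous mmaps, ...) to favour huge pages. */
		FILE *f = fopen("/proc/sys/vm/shmem_huge", "w");

		if (!f) {
			perror("/proc/sys/vm/shmem_huge");
			return 1;
		}
		fprintf(f, "1\n");
		return fclose(f) ? 1 : 0;
	}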

Hugh

>
> > And allow shmem_huge two further values: -1 for use in emergencies,
> > to force the huge option off from all mounts; and (currently) 2,
> > to force the huge option on for all - very useful for testing.
>
> In my case, it's "deny" and "force".
>
> --
> Kirill A. Shutemov

2016-04-17 02:28:27

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 01/31] huge tmpfs: prepare counts in meminfo, vmstat and SysRq-m

On Mon, 11 Apr 2016, Kirill A. Shutemov wrote:
> On Tue, Apr 05, 2016 at 02:12:26PM -0700, Hugh Dickins wrote:
> > ShmemFreeHoles will show the wastage from using huge pages for small, or
> > sparsely occupied, or unrounded files: wastage not included in Shmem or
> > MemFree, but will be freed under memory pressure. (But no count for the
> > partially occupied portions of huge pages: seems less important, but
> > could be added.)
>
> And here comes the first difference in interfaces: I don't have an
> equivalent in my implementation, as I don't track such information.
> It looks like an implementation detail of team-pages-based huge tmpfs.

It's an implementation detail insofar as you've not yet implemented
the equivalent with compound pages - and I think you're hoping never to
do so.

Of course, nobody wants ShmemFreeHoles as such, but they do want
the filesize flexibility that comes with them. And they may be an
important detail if free memory is vanishing into a black hole.

They are definitely a peculiar category, which is itself a strong
reason for making them visible in some way. But I don't think I'd
mind if we decided they're not quite up to /proc/meminfo standards,
and should be shown somewhere else instead. [Quiet sob.]

But if we do move them out of /proc/meminfo, I'll argue that they
should then be added into the user-visible MemFree (though not to
the internal NR_FREE_PAGES): at different times I've felt differently
about that, and when MemAvailable came in, it was so clear that they
belong in that category that I didn't want them in MemFree; but if
they're not themselves visible at a high level in /proc/meminfo, then
I think they probably should go into MemFree.

But really, here, we want distro advice rather than my musings.
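
(Anyone who does want to watch that black hole can do so trivially, of
course; a sketch, assuming the series is applied so the field exists:)

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[128];
		FILE *f = fopen("/proc/meminfo", "r");

		if (!f) {
			perror("/proc/meminfo");
			return 1;
		}
		/* Print the ShmemFreeHoles line, present only with this series. */
		while (fgets(line, sizeof(line), f)) {
			if (!strncmp(line, "ShmemFreeHoles:", 15))
				fputs(line, stdout);
		}
		fclose(f);
		return 0;
	}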

>
> We don't track anything similar for anon-THP.

The case just doesn't arise with anon THP (or didn't arise before
your recent changes anyway): the object could only be a pmd-mapped
entity, and always showing AnonFreeHoles 0kB is just boring.

>
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3830,6 +3830,11 @@ out:
> > }
> >
> > #define K(x) ((x) << (PAGE_SHIFT-10))
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#define THPAGE_PMD_NR HPAGE_PMD_NR
> > +#else
> > +#define THPAGE_PMD_NR 0 /* Avoid BUILD_BUG() */
> > +#endif
>
> I've just put the THP-related counters on a separate line and wrapped
> them in an #ifdef.

Time and time again I get very annoyed by that BUILD_BUG() buried
inside HPAGE_PMD_NR. I expect you do too. But it's true that it
does sometimes alert us to some large chunk of code that ought to
be slightly reorganized to get it optimized away. So I never
quite summon up the courage to un-BUILD_BUG it.
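
(For anyone reading along: the !CONFIG_TRANSPARENT_HUGEPAGE stubs are,
from memory and simplified, roughly of this shape, which is why any use
of HPAGE_PMD_NR that the compiler cannot prove dead breaks the build and
wants a wrapper like the THPAGE_PMD_NR in the hunk above.)

	#ifndef CONFIG_TRANSPARENT_HUGEPAGE
	/* Approximate shape of the upstream stubs when THP is configured out:
	 * evaluating any of these trips a compile-time error. */
	#define HPAGE_PMD_SHIFT	({ BUILD_BUG(); 0; })
	#define HPAGE_PMD_MASK	({ BUILD_BUG(); 0; })
	#define HPAGE_PMD_SIZE	({ BUILD_BUG(); 0; })
	#endif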

I think we need a secret definition that only you and I know,
THPAGE_PMD_NR or whatever, to get around it on occasion.

Hugh