2016-04-05 20:37:33

by Hugh Dickins

Subject: [PATCH 00/10] mm: easy preliminaries to THPagecache

I've rebased my huge tmpfs series against v4.6-rc2, and split it into
two sets. This is a set of miscellaneous preliminaries that I think
we can agree to put into mmotm right away, to be included in v4.7: or
if not, then I'll later rework the subsequent huge tmpfs series to avoid
or include them; but for now it expects these to go in ahead.

These don't assume or commit us to any particular implementation of
huge tmpfs, though most of them are tidyups that came from that work.
01-04 are similar to what I posted in February 2015; I think 05 is the
only interesting patch here; if 06 is rejected then we can just keep
it for our own testing; 07-10 clear away some small obstructions.

But this is a weird 00/10, because it itself includes a patch at the
bottom: v4.6-rc2 missed out Kirill's page_cache_* removal, but that
removal is assumed in the following patches; so 00/10 should be applied
if you're basing on top of v4.6-rc2, but not applied to a later tree.

00 mm: get rid of a few rc2 page_cache_*
01 mm: update_lru_size warn and reset bad lru_size
02 mm: update_lru_size do the __mod_zone_page_state
03 mm: use __SetPageSwapBacked and dont ClearPageSwapBacked
04 tmpfs: preliminary minor tidyups
05 tmpfs: mem_cgroup charge fault to vm_mm not current mm
06 mm: /proc/sys/vm/stat_refresh to force vmstat update
07 huge mm: move_huge_pmd does not need new_vma
08 huge pagecache: extend mremap pmd rmap lockout to files
09 huge pagecache: mmap_sem is unlocked when truncation splits pmd
10 arch: fix has_transparent_hugepage()

Documentation/sysctl/vm.txt | 14 +
arch/arc/include/asm/hugepage.h | 2
arch/arm/include/asm/pgtable-3level.h | 5
arch/arm64/include/asm/pgtable.h | 5
arch/mips/include/asm/pgtable.h | 1
arch/mips/mm/tlb-r4k.c | 21 +-
arch/powerpc/include/asm/book3s/64/pgtable.h | 1
arch/powerpc/include/asm/pgtable.h | 1
arch/s390/include/asm/pgtable.h | 1
arch/sparc/include/asm/pgtable_64.h | 2
arch/tile/include/asm/pgtable.h | 1
arch/x86/include/asm/pgtable.h | 1
include/asm-generic/pgtable.h | 8
include/linux/huge_mm.h | 4
include/linux/memcontrol.h | 6
include/linux/mempolicy.h | 6
include/linux/mm_inline.h | 24 ++
include/linux/vmstat.h | 4
kernel/sysctl.c | 7
mm/filemap.c | 4
mm/huge_memory.c | 7
mm/memcontrol.c | 26 ++
mm/memory.c | 17 -
mm/migrate.c | 6
mm/mremap.c | 47 ++---
mm/rmap.c | 4
mm/shmem.c | 148 +++++++----------
mm/swap_state.c | 3
mm/vmscan.c | 23 +-
mm/vmstat.c | 58 ++++++
30 files changed, 271 insertions(+), 186 deletions(-)

[PATCH 00/10] mm: get rid of a few rc2 page_cache_*

Linus did not get Kirill's PAGE_CACHE_* to page_cache_* riddance into
rc2, but applied it the next day: this and the huge tmpfs series assume
that those changes have been made, so if applying these series to
vanilla v4.6-rc2 as intended, this patch resolves the few clashes first.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/filemap.c | 4 ++--
mm/memory.c | 6 +++---
mm/rmap.c | 2 +-
mm/shmem.c | 18 +++++++++---------
4 files changed, 15 insertions(+), 15 deletions(-)

--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2178,8 +2178,8 @@ repeat:
if (page->mapping != mapping || !PageUptodate(page))
goto unlock;

- size = round_up(i_size_read(mapping->host), PAGE_CACHE_SIZE);
- if (page->index >= size >> PAGE_CACHE_SHIFT)
+ size = round_up(i_size_read(mapping->host), PAGE_SIZE);
+ if (page->index >= size >> PAGE_SHIFT)
goto unlock;

pte = vmf->pte + page->index - vmf->pgoff;
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2807,7 +2807,7 @@ static int __do_fault(struct vm_area_str
if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
unlock_page(vmf.page);
- page_cache_release(vmf.page);
+ put_page(vmf.page);
return VM_FAULT_HWPOISON;
}

@@ -2996,7 +2996,7 @@ static int do_read_fault(struct mm_struc
if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
- page_cache_release(fault_page);
+ put_page(fault_page);
return ret;
}
do_set_pte(vma, address, fault_page, pte, false, false);
@@ -3105,7 +3105,7 @@ static int do_shared_fault(struct mm_str
if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
unlock_page(fault_page);
- page_cache_release(fault_page);
+ put_page(fault_page);
return ret;
}
do_set_pte(vma, address, fault_page, pte, true, false);
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1541,7 +1541,7 @@ static int try_to_unmap_one(struct page

discard:
page_remove_rmap(page, PageHuge(page));
- page_cache_release(page);
+ put_page(page);

out_unmap:
pte_unmap_unlock(pte, ptl);
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -300,7 +300,7 @@ static int shmem_add_to_page_cache(struc
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(!PageSwapBacked(page), page);

- page_cache_get(page);
+ get_page(page);
page->mapping = mapping;
page->index = index;

@@ -318,7 +318,7 @@ static int shmem_add_to_page_cache(struc
} else {
page->mapping = NULL;
spin_unlock_irq(&mapping->tree_lock);
- page_cache_release(page);
+ put_page(page);
}
return error;
}
@@ -530,7 +530,7 @@ static void shmem_undo_range(struct inod
struct page *page = NULL;
shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
if (page) {
- unsigned int top = PAGE_CACHE_SIZE;
+ unsigned int top = PAGE_SIZE;
if (start > end) {
top = partial_end;
partial_end = 0;
@@ -1145,7 +1145,7 @@ static int shmem_getpage_gfp(struct inod
int once = 0;
int alloced = 0;

- if (index > (MAX_LFS_FILESIZE >> PAGE_CACHE_SHIFT))
+ if (index > (MAX_LFS_FILESIZE >> PAGE_SHIFT))
return -EFBIG;
repeat:
swap.val = 0;
@@ -1156,7 +1156,7 @@ repeat:
}

if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
- ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+ ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
error = -EINVAL;
goto unlock;
}
@@ -1327,7 +1327,7 @@ clear:

/* Perhaps the file has been truncated since we checked */
if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
- ((loff_t)index << PAGE_CACHE_SHIFT) >= i_size_read(inode)) {
+ ((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
if (alloced) {
ClearPageDirty(page);
delete_from_page_cache(page);
@@ -1355,7 +1355,7 @@ failed:
unlock:
if (page) {
unlock_page(page);
- page_cache_release(page);
+ put_page(page);
}
if (error == -ENOSPC && !once++) {
info = SHMEM_I(inode);
@@ -1635,8 +1635,8 @@ static ssize_t shmem_file_read_iter(stru
if (!iter_is_iovec(to))
sgp = SGP_DIRTY;

- index = *ppos >> PAGE_CACHE_SHIFT;
- offset = *ppos & ~PAGE_CACHE_MASK;
+ index = *ppos >> PAGE_SHIFT;
+ offset = *ppos & ~PAGE_MASK;

for (;;) {
struct page *page = NULL;


2016-04-05 20:40:09

by Hugh Dickins

Subject: [PATCH 01/10] mm: update_lru_size warn and reset bad lru_size

Though debug kernels have a VM_BUG_ON to help protect from misaccounting
lru_size, non-debug kernels are liable to wrap it around: and then the
vast unsigned long size draws page reclaim into a loop of repeatedly
doing nothing on an empty list, without even a cond_resched().
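
To illustrate the wraparound (a trivial stand-alone sketch, not kernel code):

#include <stdio.h>

int main(void)
{
	unsigned long lru_size = 0;

	/* one bogus decrement below zero wraps to a vast count */
	lru_size += -1;

	/* prints 18446744073709551615 on 64-bit: reclaim then believes an
	 * empty list still holds pages to scan */
	printf("lru_size = %lu\n", lru_size);
	return 0;
}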

That soft lockup looks confusingly like an over-busy reclaim scenario,
with lots of contention on the lru_lock in shrink_inactive_list():
yet has a totally different origin.

Help differentiate with a custom warning in mem_cgroup_update_lru_size(),
even in non-debug kernels; and reset the size to avoid the lockup. But
the particular bug which suggested this change was mine alone, and since
fixed.

Make it a WARN_ONCE: the first occurrence is the most informative, a
flurry may follow, yet even when rate-limited little more is learnt.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mm_inline.h | 2 +-
mm/memcontrol.c | 24 ++++++++++++++++++++----
2 files changed, 21 insertions(+), 5 deletions(-)

--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -35,8 +35,8 @@ static __always_inline void del_page_fro
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
- mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
list_del(&page->lru);
+ mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
}

--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1022,22 +1022,38 @@ out:
* @lru: index of lru list the page is sitting on
* @nr_pages: positive when adding or negative when removing
*
- * This function must be called when a page is added to or removed from an
- * lru list.
+ * This function must be called under lru_lock, just before a page is added
+ * to or just after a page is removed from an lru list (that ordering being
+ * so as to allow it to check that lru_size 0 is consistent with list_empty).
*/
void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
int nr_pages)
{
struct mem_cgroup_per_zone *mz;
unsigned long *lru_size;
+ long size;
+ bool empty;

if (mem_cgroup_disabled())
return;

mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
lru_size = mz->lru_size + lru;
- *lru_size += nr_pages;
- VM_BUG_ON((long)(*lru_size) < 0);
+ empty = list_empty(lruvec->lists + lru);
+
+ if (nr_pages < 0)
+ *lru_size += nr_pages;
+
+ size = *lru_size;
+ if (WARN_ONCE(size < 0 || empty != !size,
+ "%s(%p, %d, %d): lru_size %ld but %sempty\n",
+ __func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
+ VM_BUG_ON(1);
+ *lru_size = 0;
+ }
+
+ if (nr_pages > 0)
+ *lru_size += nr_pages;
}

bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)

2016-04-05 20:42:06

by Hugh Dickins

Subject: [PATCH 02/10] mm: update_lru_size do the __mod_zone_page_state

Konstantin Khlebnikov pointed out (nearly four years ago, when lumpy
reclaim was removed) that lru_size can be updated by -nr_taken once
per call to isolate_lru_pages(), instead of page by page.

Update it inside isolate_lru_pages(), or at its two callsites? I
chose to update it at the callsites, rearranging and grouping the
updates by nr_taken and nr_scanned together in both.

With one exception, mem_cgroup_update_lru_size(,lru,) is then used
wherever __mod_zone_page_state(,NR_LRU_BASE+lru,) is used; and we shall
be adding some more calls in a future commit. Make the code a little
smaller and simpler by incorporating the stat update in the lru_size update.

The exception was move_active_pages_to_lru(), which aggregated the
pgmoved stat update separately from the individual lru_size updates;
but I still think this a simplification worth making.

However, the __mod_zone_page_state is not peculiar to mem_cgroups: so
better to use the name update_lru_size, which calls
mem_cgroup_update_lru_size when CONFIG_MEMCG is enabled.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/memcontrol.h | 6 ------
include/linux/mm_inline.h | 24 ++++++++++++++++++------
mm/memcontrol.c | 2 ++
mm/vmscan.c | 23 ++++++++++-------------
4 files changed, 30 insertions(+), 25 deletions(-)

--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -658,12 +658,6 @@ mem_cgroup_get_lru_size(struct lruvec *l
return 0;
}

-static inline void
-mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
- int increment)
-{
-}
-
static inline unsigned long
mem_cgroup_node_nr_lru_pages(struct mem_cgroup *memcg,
int nid, unsigned int lru_mask)
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -22,22 +22,34 @@ static inline int page_is_file_cache(str
return !PageSwapBacked(page);
}

+static __always_inline void __update_lru_size(struct lruvec *lruvec,
+ enum lru_list lru, int nr_pages)
+{
+ __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
+}
+
+static __always_inline void update_lru_size(struct lruvec *lruvec,
+ enum lru_list lru, int nr_pages)
+{
+#ifdef CONFIG_MEMCG
+ mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+#else
+ __update_lru_size(lruvec, lru, nr_pages);
+#endif
+}
+
static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
- int nr_pages = hpage_nr_pages(page);
- mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+ update_lru_size(lruvec, lru, hpage_nr_pages(page));
list_add(&page->lru, &lruvec->lists[lru]);
- __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
}

static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
- int nr_pages = hpage_nr_pages(page);
list_del(&page->lru);
- mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
- __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
+ update_lru_size(lruvec, lru, -hpage_nr_pages(page));
}

/**
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1034,6 +1034,8 @@ void mem_cgroup_update_lru_size(struct l
long size;
bool empty;

+ __update_lru_size(lruvec, lru, nr_pages);
+
if (mem_cgroup_disabled())
return;

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1374,7 +1374,6 @@ static unsigned long isolate_lru_pages(u
for (scan = 0; scan < nr_to_scan && nr_taken < nr_to_scan &&
!list_empty(src); scan++) {
struct page *page;
- int nr_pages;

page = lru_to_page(src);
prefetchw_prev_lru_page(page, src, flags);
@@ -1383,10 +1382,8 @@ static unsigned long isolate_lru_pages(u

switch (__isolate_lru_page(page, mode)) {
case 0:
- nr_pages = hpage_nr_pages(page);
- mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
+ nr_taken += hpage_nr_pages(page);
list_move(&page->lru, dst);
- nr_taken += nr_pages;
break;

case -EBUSY:
@@ -1602,8 +1599,9 @@ shrink_inactive_list(unsigned long nr_to
nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
&nr_scanned, sc, isolate_mode, lru);

- __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
+ update_lru_size(lruvec, lru, -nr_taken);
__mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ reclaim_stat->recent_scanned[file] += nr_taken;

if (global_reclaim(sc)) {
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1624,8 +1622,6 @@ shrink_inactive_list(unsigned long nr_to

spin_lock_irq(&zone->lru_lock);

- reclaim_stat->recent_scanned[file] += nr_taken;
-
if (global_reclaim(sc)) {
if (current_is_kswapd())
__count_zone_vm_events(PGSTEAL_KSWAPD, zone,
@@ -1742,7 +1738,7 @@ static void move_active_pages_to_lru(str
SetPageLRU(page);

nr_pages = hpage_nr_pages(page);
- mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
+ update_lru_size(lruvec, lru, nr_pages);
list_move(&page->lru, &lruvec->lists[lru]);
pgmoved += nr_pages;

@@ -1760,7 +1756,7 @@ static void move_active_pages_to_lru(str
list_add(&page->lru, pages_to_free);
}
}
- __mod_zone_page_state(zone, NR_LRU_BASE + lru, pgmoved);
+
if (!is_active_lru(lru))
__count_vm_events(PGDEACTIVATE, pgmoved);
}
@@ -1794,14 +1790,15 @@ static void shrink_active_list(unsigned

nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &l_hold,
&nr_scanned, sc, isolate_mode, lru);
- if (global_reclaim(sc))
- __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);

+ update_lru_size(lruvec, lru, -nr_taken);
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
reclaim_stat->recent_scanned[file] += nr_taken;

+ if (global_reclaim(sc))
+ __mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
__count_zone_vm_events(PGREFILL, zone, nr_scanned);
- __mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {

2016-04-05 20:44:22

by Hugh Dickins

Subject: [PATCH 03/10] mm: use __SetPageSwapBacked and dont ClearPageSwapBacked

v3.16 commit 07a427884348 ("mm: shmem: avoid atomic operation during
shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
by __SetPageSwapBacked, pointing out that the newly allocated page is
not yet visible to other users (except speculative get_page_unless_zero-
ers, who may not update page flags before their further checks).

That was part of a series in which Mel was focused on tmpfs profiles:
but almost all SetPageSwapBacked uses can be so optimized, with the same
justification. Remove ClearPageSwapBacked from __read_swap_cache_async()
error path: it's not an error to free a page with PG_swapbacked set.

Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
doing it differently in different places; but that's for tidiness - if
the ordering actually mattered, we should not be using the __variants.
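
For reference, the difference amounts to atomic versus non-atomic bitops -
a simplified sketch of what the page-flags macros boil down to, not the
actual include/linux/page-flags.h definitions:

/* atomic read-modify-write: safe against concurrent flag updates */
static inline void SetPageSwapBacked(struct page *page)
{
	set_bit(PG_swapbacked, &page->flags);
}

/* plain bitop: cheaper, but only safe while nobody else can be touching
 * page->flags - e.g. a freshly allocated page not yet visible to others */
static inline void __SetPageSwapBacked(struct page *page)
{
	__set_bit(PG_swapbacked, &page->flags);
}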

There's probably scope for further __SetPageFlags in other places,
but SwapBacked is the one I'm interested in at the moment.

Signed-off-by: Hugh Dickins <[email protected]>
---
Sorry, Mel did give
Reviewed-by: Mel Gorman <[email protected]>
a year ago, but the kernel has moved on since then,
so it feels slightly ruder to carry that forward without asking,
than to ask again after all this time not submitting what he approved.

mm/migrate.c | 6 +++---
mm/rmap.c | 2 +-
mm/shmem.c | 4 ++--
mm/swap_state.c | 3 +--
4 files changed, 7 insertions(+), 8 deletions(-)

--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -332,7 +332,7 @@ int migrate_page_move_mapping(struct add
newpage->index = page->index;
newpage->mapping = page->mapping;
if (PageSwapBacked(page))
- SetPageSwapBacked(newpage);
+ __SetPageSwapBacked(newpage);

return MIGRATEPAGE_SUCCESS;
}
@@ -378,7 +378,7 @@ int migrate_page_move_mapping(struct add
newpage->index = page->index;
newpage->mapping = page->mapping;
if (PageSwapBacked(page))
- SetPageSwapBacked(newpage);
+ __SetPageSwapBacked(newpage);

get_page(newpage); /* add cache reference */
if (PageSwapCache(page)) {
@@ -1785,7 +1785,7 @@ int migrate_misplaced_transhuge_page(str

/* Prepare a page as a migration target */
__SetPageLocked(new_page);
- SetPageSwapBacked(new_page);
+ __SetPageSwapBacked(new_page);

/* anon mapping, we can simply copy page->mapping to the new page: */
new_page->mapping = page->mapping;
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1249,7 +1249,7 @@ void page_add_new_anon_rmap(struct page
int nr = compound ? hpage_nr_pages(page) : 1;

VM_BUG_ON_VMA(address < vma->vm_start || address >= vma->vm_end, vma);
- SetPageSwapBacked(page);
+ __SetPageSwapBacked(page);
if (compound) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
/* increment count (starts at -1) */
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1085,8 +1085,8 @@ static int shmem_replace_page(struct pag
flush_dcache_page(newpage);

__SetPageLocked(newpage);
+ __SetPageSwapBacked(newpage);
SetPageUptodate(newpage);
- SetPageSwapBacked(newpage);
set_page_private(newpage, swap_index);
SetPageSwapCache(newpage);

@@ -1276,8 +1276,8 @@ repeat:
goto decused;
}

- __SetPageSwapBacked(page);
__SetPageLocked(page);
+ __SetPageSwapBacked(page);
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -358,7 +358,7 @@ struct page *__read_swap_cache_async(swp

/* May fail (-ENOMEM) if radix-tree node allocation failed. */
__SetPageLocked(new_page);
- SetPageSwapBacked(new_page);
+ __SetPageSwapBacked(new_page);
err = __add_to_swap_cache(new_page, entry);
if (likely(!err)) {
radix_tree_preload_end();
@@ -370,7 +370,6 @@ struct page *__read_swap_cache_async(swp
return new_page;
}
radix_tree_preload_end();
- ClearPageSwapBacked(new_page);
__ClearPageLocked(new_page);
/*
* add_to_swap_cache() doesn't return -EEXIST, so we can safely

2016-04-05 20:45:49

by Hugh Dickins

Subject: [PATCH 04/10] tmpfs: preliminary minor tidyups

Make a few cleanups in mm/shmem.c, before going on to complicate it.

shmem_alloc_page() will become more complicated: we can't afford
to have that complication duplicated between a CONFIG_NUMA version
and a !CONFIG_NUMA version, so rearrange the #ifdef'ery there to
yield a single shmem_swapin() and a single shmem_alloc_page().

Yes, it's a shame to inflict the horrid pseudo-vma on non-NUMA
configurations, but eliminating it is a larger cleanup: I have an
alloc_pages_mpol() patchset not yet ready - mpol handling is subtle
and bug-prone, and changed yet again since my last version.

Move __SetPageLocked, __SetPageSwapBacked from shmem_getpage_gfp()
to shmem_alloc_page(): that SwapBacked flag will be useful in future,
to help to distinguish different cases appropriately.

And the SGP_DIRTY variant of SGP_CACHE is hard to understand and of
little use (IIRC it dates back to when shmem_getpage() returned the
page unlocked): kill it and do the necessary in shmem_file_read_iter().

But an arm64 build then complained that info may be uninitialized
(where shmem_getpage_gfp() deletes a freshly alloced page beyond eof),
and advancing to an "sgp <= SGP_CACHE" test jogged it back to reality.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/mempolicy.h | 6 +++
mm/shmem.c | 69 +++++++++++++-----------------------
2 files changed, 32 insertions(+), 43 deletions(-)

--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -228,6 +228,12 @@ static inline void mpol_free_shared_poli
{
}

+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+ return NULL;
+}
+
#define vma_policy(vma) NULL

static inline int
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -101,7 +101,6 @@ struct shmem_falloc {
enum sgp_type {
SGP_READ, /* don't exceed i_size, don't allocate page */
SGP_CACHE, /* don't exceed i_size, may allocate page */
- SGP_DIRTY, /* like SGP_CACHE, but set new page dirty */
SGP_WRITE, /* may exceed i_size, may allocate !Uptodate page */
SGP_FALLOC, /* like SGP_WRITE, but make existing page Uptodate */
};
@@ -169,7 +168,7 @@ static inline int shmem_reacct_size(unsi

/*
* ... whereas tmpfs objects are accounted incrementally as
- * pages are allocated, in order to allow huge sparse files.
+ * pages are allocated, in order to allow large sparse files.
* shmem_getpage reports shmem_acct_block failure as -ENOSPC not -ENOMEM,
* so that a failure on a sparse tmpfs mapping will give SIGBUS not OOM.
*/
@@ -947,8 +946,7 @@ redirty:
return 0;
}

-#ifdef CONFIG_NUMA
-#ifdef CONFIG_TMPFS
+#if defined(CONFIG_NUMA) && defined(CONFIG_TMPFS)
static void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
{
char buffer[64];
@@ -972,7 +970,18 @@ static struct mempolicy *shmem_get_sbmpo
}
return mpol;
}
-#endif /* CONFIG_TMPFS */
+#else /* !CONFIG_NUMA || !CONFIG_TMPFS */
+static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
+{
+}
+static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
+{
+ return NULL;
+}
+#endif /* CONFIG_NUMA && CONFIG_TMPFS */
+#ifndef CONFIG_NUMA
+#define vm_policy vm_private_data
+#endif

static struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index)
@@ -1008,39 +1017,17 @@ static struct page *shmem_alloc_page(gfp
pvma.vm_ops = NULL;
pvma.vm_policy = mpol_shared_policy_lookup(&info->policy, index);

- page = alloc_page_vma(gfp, &pvma, 0);
+ page = alloc_pages_vma(gfp, 0, &pvma, 0, numa_node_id(), false);
+ if (page) {
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+ }

/* Drop reference taken by mpol_shared_policy_lookup() */
mpol_cond_put(pvma.vm_policy);

return page;
}
-#else /* !CONFIG_NUMA */
-#ifdef CONFIG_TMPFS
-static inline void shmem_show_mpol(struct seq_file *seq, struct mempolicy *mpol)
-{
-}
-#endif /* CONFIG_TMPFS */
-
-static inline struct page *shmem_swapin(swp_entry_t swap, gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
-{
- return swapin_readahead(swap, gfp, NULL, 0);
-}
-
-static inline struct page *shmem_alloc_page(gfp_t gfp,
- struct shmem_inode_info *info, pgoff_t index)
-{
- return alloc_page(gfp);
-}
-#endif /* CONFIG_NUMA */
-
-#if !defined(CONFIG_NUMA) || !defined(CONFIG_TMPFS)
-static inline struct mempolicy *shmem_get_sbmpol(struct shmem_sb_info *sbinfo)
-{
- return NULL;
-}
-#endif

/*
* When a page is moved from swapcache to shmem filecache (either by the
@@ -1084,8 +1071,6 @@ static int shmem_replace_page(struct pag
copy_highpage(newpage, oldpage);
flush_dcache_page(newpage);

- __SetPageLocked(newpage);
- __SetPageSwapBacked(newpage);
SetPageUptodate(newpage);
set_page_private(newpage, swap_index);
SetPageSwapCache(newpage);
@@ -1155,7 +1140,7 @@ repeat:
page = NULL;
}

- if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
+ if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
error = -EINVAL;
goto unlock;
@@ -1275,9 +1260,6 @@ repeat:
error = -ENOMEM;
goto decused;
}
-
- __SetPageLocked(page);
- __SetPageSwapBacked(page);
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

@@ -1321,12 +1303,10 @@ clear:
flush_dcache_page(page);
SetPageUptodate(page);
}
- if (sgp == SGP_DIRTY)
- set_page_dirty(page);
}

/* Perhaps the file has been truncated since we checked */
- if (sgp != SGP_WRITE && sgp != SGP_FALLOC &&
+ if (sgp <= SGP_CACHE &&
((loff_t)index << PAGE_SHIFT) >= i_size_read(inode)) {
if (alloced) {
ClearPageDirty(page);
@@ -1633,7 +1613,7 @@ static ssize_t shmem_file_read_iter(stru
* and even mark them dirty, so it cannot exceed the max_blocks limit.
*/
if (!iter_is_iovec(to))
- sgp = SGP_DIRTY;
+ sgp = SGP_CACHE;

index = *ppos >> PAGE_SHIFT;
offset = *ppos & ~PAGE_MASK;
@@ -1659,8 +1639,11 @@ static ssize_t shmem_file_read_iter(stru
error = 0;
break;
}
- if (page)
+ if (page) {
+ if (sgp == SGP_CACHE)
+ set_page_dirty(page);
unlock_page(page);
+ }

/*
* We must evaluate after, since reads (unlike writes)

2016-04-05 20:48:00

by Hugh Dickins

Subject: [PATCH 05/10] tmpfs: mem_cgroup charge fault to vm_mm not current mm

From: Andres Lagar-Cavilla <[email protected]>

Although shmem_fault() has been careful to count a major fault to vm_mm,
shmem_getpage_gfp() has been careless in charging a remote access fault
to current->mm owner's memcg instead of to vma->vm_mm owner's memcg:
that is inconsistent with all the mem_cgroup charging on remote access
faults in mm/memory.c.

Fix it by passing fault_mm along with fault_type to shmem_getpage_gfp();
but in that case, now knowing the right mm, it's better for it to handle
the PGMAJFAULT updates itself.

And let's keep this clutter out of most callers' way: change the common
shmem_getpage() wrapper to hide fault_mm and fault_type as well as gfp.

Signed-off-by: Andres Lagar-Cavilla <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
---
mm/shmem.c | 61 ++++++++++++++++++++++++++++-----------------------
1 file changed, 34 insertions(+), 27 deletions(-)

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -121,13 +121,14 @@ static bool shmem_should_replace_page(st
static int shmem_replace_page(struct page **pagep, gfp_t gfp,
struct shmem_inode_info *info, pgoff_t index);
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
- struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type);
+ struct page **pagep, enum sgp_type sgp,
+ gfp_t gfp, struct mm_struct *fault_mm, int *fault_type);

static inline int shmem_getpage(struct inode *inode, pgoff_t index,
- struct page **pagep, enum sgp_type sgp, int *fault_type)
+ struct page **pagep, enum sgp_type sgp)
{
return shmem_getpage_gfp(inode, index, pagep, sgp,
- mapping_gfp_mask(inode->i_mapping), fault_type);
+ mapping_gfp_mask(inode->i_mapping), NULL, NULL);
}

static inline struct shmem_sb_info *SHMEM_SB(struct super_block *sb)
@@ -527,7 +528,7 @@ static void shmem_undo_range(struct inod

if (partial_start) {
struct page *page = NULL;
- shmem_getpage(inode, start - 1, &page, SGP_READ, NULL);
+ shmem_getpage(inode, start - 1, &page, SGP_READ);
if (page) {
unsigned int top = PAGE_SIZE;
if (start > end) {
@@ -542,7 +543,7 @@ static void shmem_undo_range(struct inod
}
if (partial_end) {
struct page *page = NULL;
- shmem_getpage(inode, end, &page, SGP_READ, NULL);
+ shmem_getpage(inode, end, &page, SGP_READ);
if (page) {
zero_user_segment(page, 0, partial_end);
set_page_dirty(page);
@@ -1115,14 +1116,19 @@ static int shmem_replace_page(struct pag
*
* If we allocate a new one we do not mark it dirty. That's up to the
* vm. If we swap it in we mark it dirty since we also free the swap
- * entry since a page cannot live in both the swap and page cache
+ * entry since a page cannot live in both the swap and page cache.
+ *
+ * fault_mm and fault_type are only supplied by shmem_fault:
+ * otherwise they are NULL.
*/
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
- struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
+ struct page **pagep, enum sgp_type sgp, gfp_t gfp,
+ struct mm_struct *fault_mm, int *fault_type)
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info;
struct shmem_sb_info *sbinfo;
+ struct mm_struct *charge_mm;
struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
@@ -1168,14 +1174,19 @@ repeat:
*/
info = SHMEM_I(inode);
sbinfo = SHMEM_SB(inode->i_sb);
+ charge_mm = fault_mm ? : current->mm;

if (swap.val) {
/* Look it up and read it in.. */
page = lookup_swap_cache(swap);
if (!page) {
- /* here we actually do the io */
- if (fault_type)
+ /* Or update major stats only when swapin succeeds?? */
+ if (fault_type) {
*fault_type |= VM_FAULT_MAJOR;
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(fault_mm, PGMAJFAULT);
+ }
+ /* Here we actually start the io */
page = shmem_swapin(swap, gfp, info, index);
if (!page) {
error = -ENOMEM;
@@ -1202,7 +1213,7 @@ repeat:
goto failed;
}

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
false);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
@@ -1263,7 +1274,7 @@ repeat:
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge(page, current->mm, gfp, &memcg,
+ error = mem_cgroup_try_charge(page, charge_mm, gfp, &memcg,
false);
if (error)
goto decused;
@@ -1352,6 +1363,7 @@ unlock:
static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct inode *inode = file_inode(vma->vm_file);
+ gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
int error;
int ret = VM_FAULT_LOCKED;

@@ -1413,14 +1425,10 @@ static int shmem_fault(struct vm_area_st
spin_unlock(&inode->i_lock);
}

- error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
+ error = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE,
+ gfp, vma->vm_mm, &ret);
if (error)
return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
-
- if (ret & VM_FAULT_MAJOR) {
- count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
- }
return ret;
}

@@ -1567,7 +1575,7 @@ shmem_write_begin(struct file *file, str
return -EPERM;
}

- return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
+ return shmem_getpage(inode, index, pagep, SGP_WRITE);
}

static int
@@ -1633,7 +1641,7 @@ static ssize_t shmem_file_read_iter(stru
break;
}

- error = shmem_getpage(inode, index, &page, sgp, NULL);
+ error = shmem_getpage(inode, index, &page, sgp);
if (error) {
if (error == -EINVAL)
error = 0;
@@ -1749,7 +1757,7 @@ static ssize_t shmem_file_splice_read(st
error = 0;

while (spd.nr_pages < nr_pages) {
- error = shmem_getpage(inode, index, &page, SGP_CACHE, NULL);
+ error = shmem_getpage(inode, index, &page, SGP_CACHE);
if (error)
break;
unlock_page(page);
@@ -1771,8 +1779,7 @@ static ssize_t shmem_file_splice_read(st
page = spd.pages[page_nr];

if (!PageUptodate(page) || page->mapping != mapping) {
- error = shmem_getpage(inode, index, &page,
- SGP_CACHE, NULL);
+ error = shmem_getpage(inode, index, &page, SGP_CACHE);
if (error)
break;
unlock_page(page);
@@ -2215,8 +2222,7 @@ static long shmem_fallocate(struct file
else if (shmem_falloc.nr_unswapped > shmem_falloc.nr_falloced)
error = -ENOMEM;
else
- error = shmem_getpage(inode, index, &page, SGP_FALLOC,
- NULL);
+ error = shmem_getpage(inode, index, &page, SGP_FALLOC);
if (error) {
/* Remove the !PageUptodate pages we added */
shmem_undo_range(inode,
@@ -2534,7 +2540,7 @@ static int shmem_symlink(struct inode *d
inode->i_op = &shmem_short_symlink_operations;
} else {
inode_nohighmem(inode);
- error = shmem_getpage(inode, 0, &page, SGP_WRITE, NULL);
+ error = shmem_getpage(inode, 0, &page, SGP_WRITE);
if (error) {
iput(inode);
return error;
@@ -2575,7 +2581,7 @@ static const char *shmem_get_link(struct
return ERR_PTR(-ECHILD);
}
} else {
- error = shmem_getpage(inode, 0, &page, SGP_READ, NULL);
+ error = shmem_getpage(inode, 0, &page, SGP_READ);
if (error)
return ERR_PTR(error);
unlock_page(page);
@@ -3478,7 +3484,8 @@ struct page *shmem_read_mapping_page_gfp
int error;

BUG_ON(mapping->a_ops != &shmem_aops);
- error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE, gfp, NULL);
+ error = shmem_getpage_gfp(inode, index, &page, SGP_CACHE,
+ gfp, NULL, NULL);
if (error)
page = ERR_PTR(error);
else

2016-04-05 20:49:41

by Hugh Dickins

Subject: [PATCH 06/10] mm: /proc/sys/vm/stat_refresh to force vmstat update

Provide /proc/sys/vm/stat_refresh to force an immediate update of
per-cpu into global vmstats: useful to avoid a sleep(2) or whatever
before checking counts when testing. Originally added to work around
a bug which left counts stranded indefinitely on a cpu going idle
(an inaccuracy magnified when small below-batch numbers represent
"huge" amounts of memory), but I believe that bug is now fixed:
nonetheless, this is still a useful knob.

Its schedule_on_each_cpu() is probably too expensive just to fold
into reading /proc/meminfo itself: give this mode 0600 to prevent
abuse. Allow a write or a read to do the same: nothing to read,
but "grep -h Shmem /proc/sys/vm/stat_refresh /proc/meminfo" is
convenient. Oh, and since global_page_state() itself is careful
to disguise any underflow as 0, hack in an "Invalid argument" and
pr_warn() if a counter is negative after the refresh - this helped
to fix a misaccounting of NR_ISOLATED_FILE in my migration code.

But on recent kernels, I find that NR_ALLOC_BATCH and NR_PAGES_SCANNED
often go negative some of the time. I have not yet worked out why, but
have no evidence that it's actually harmful. Punt for the moment by
just ignoring the anomaly on those.
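
In a test harness, usage amounts to something like this (a user-space
sketch, not part of the patch; remember the file is root-only, mode 0600):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	ssize_t n;
	int fd;

	/* any read (or write) folds per-cpu deltas into the global vmstats */
	fd = open("/proc/sys/vm/stat_refresh", O_RDONLY);
	if (fd < 0) {
		perror("open stat_refresh");	/* needs root */
		return 1;
	}
	if (read(fd, buf, sizeof(buf)) < 0)
		perror("stat_refresh");		/* EINVAL: a stat went negative */
	close(fd);

	/* now /proc/meminfo and /proc/vmstat show up-to-date totals */
	fd = open("/proc/meminfo", O_RDONLY);
	while (fd >= 0 && (n = read(fd, buf, sizeof(buf))) > 0) {
		if (write(1, buf, n) < 0)
			break;
	}
	if (fd >= 0)
		close(fd);
	return 0;
}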

Signed-off-by: Hugh Dickins <[email protected]>
---
Documentation/sysctl/vm.txt | 14 ++++++++
include/linux/vmstat.h | 4 ++
kernel/sysctl.c | 7 ++++
mm/vmstat.c | 58 ++++++++++++++++++++++++++++++++++
4 files changed, 83 insertions(+)

--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -57,6 +57,7 @@ Currently, these files are in /proc/sys/
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
+- stat_refresh
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
@@ -754,6 +755,19 @@ is 1 second.

==============================================================

+stat_refresh
+
+Any read or write (by root only) flushes all the per-cpu vm statistics
+into their global totals, for more accurate reports when testing
+e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo
+
+As a side-effect, it also checks for negative totals (elsewhere reported
+as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
+(At time of writing, a few stats are known sometimes to be found negative,
+with no ill effects: errors and warnings on these stats are suppressed.)
+
+==============================================================
+
swappiness

This control is used to define how aggressive the kernel will swap
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -193,6 +193,10 @@ void quiet_vmstat(void);
void cpu_vm_stats_fold(int cpu);
void refresh_zone_stat_thresholds(void);

+struct ctl_table;
+int vmstat_refresh(struct ctl_table *, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos);
+
void drain_zonestat(struct zone *zone, struct per_cpu_pageset *);

int calculate_pressure_threshold(struct zone *zone);
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1509,6 +1509,13 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec_jiffies,
},
+ {
+ .procname = "stat_refresh",
+ .data = NULL,
+ .maxlen = 0,
+ .mode = 0600,
+ .proc_handler = vmstat_refresh,
+ },
#endif
#ifdef CONFIG_MMU
{
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1378,6 +1378,64 @@ static DEFINE_PER_CPU(struct delayed_wor
int sysctl_stat_interval __read_mostly = HZ;
static cpumask_var_t cpu_stat_off;

+static void refresh_vm_stats(struct work_struct *work)
+{
+ refresh_cpu_vm_stats(true);
+}
+
+int vmstat_refresh(struct ctl_table *table, int write,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ long val;
+ int err;
+ int i;
+
+ /*
+ * The regular update, every sysctl_stat_interval, may come later
+ * than expected: leaving a significant amount in per_cpu buckets.
+ * This is particularly misleading when checking a quantity of HUGE
+ * pages, immediately after running a test. /proc/sys/vm/stat_refresh,
+ * which can equally be echo'ed to or cat'ted from (by root),
+ * can be used to update the stats just before reading them.
+ *
+ * Oh, and since global_page_state() etc. are so careful to hide
+ * transiently negative values, report an error here if any of
+ * the stats is negative, so we know to go looking for imbalance.
+ */
+ err = schedule_on_each_cpu(refresh_vm_stats);
+ if (err)
+ return err;
+ for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
+ val = atomic_long_read(&vm_stat[i]);
+ if (val < 0) {
+ switch (i) {
+ case NR_ALLOC_BATCH:
+ case NR_PAGES_SCANNED:
+ /*
+ * These are often seen to go negative in
+ * recent kernels, but not to go permanently
+ * negative. Whilst it would be nicer not to
+ * have exceptions, rooting them out would be
+ * another task, of rather low priority.
+ */
+ break;
+ default:
+ pr_warn("%s: %s %ld\n",
+ __func__, vmstat_text[i], val);
+ err = -EINVAL;
+ break;
+ }
+ }
+ }
+ if (err)
+ return err;
+ if (write)
+ *ppos += *lenp;
+ else
+ *lenp = 0;
+ return 0;
+}
+
static void vmstat_update(struct work_struct *w)
{
if (refresh_cpu_vm_stats(true)) {

2016-04-05 20:51:21

by Hugh Dickins

Subject: [PATCH 07/10] huge mm: move_huge_pmd does not need new_vma

Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
the old vma, so a trans_huge_pmd in the new_vma will be as acceptable
as it was in the old vma, alignment and size permitting.

Signed-off-by: Hugh Dickins <[email protected]>
---
include/linux/huge_mm.h | 4 +---
mm/huge_memory.c | 7 ++-----
mm/mremap.c | 5 ++---
3 files changed, 5 insertions(+), 11 deletions(-)

--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -28,9 +28,7 @@ extern int zap_huge_pmd(struct mmu_gathe
extern int mincore_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
unsigned char *vec);
-extern bool move_huge_pmd(struct vm_area_struct *vma,
- struct vm_area_struct *new_vma,
- unsigned long old_addr,
+extern bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd);
extern int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1704,20 +1704,17 @@ int zap_huge_pmd(struct mmu_gather *tlb,
return 1;
}

-bool move_huge_pmd(struct vm_area_struct *vma, struct vm_area_struct *new_vma,
- unsigned long old_addr,
+bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, unsigned long old_end,
pmd_t *old_pmd, pmd_t *new_pmd)
{
spinlock_t *old_ptl, *new_ptl;
pmd_t pmd;
-
struct mm_struct *mm = vma->vm_mm;

if ((old_addr & ~HPAGE_PMD_MASK) ||
(new_addr & ~HPAGE_PMD_MASK) ||
- old_end - old_addr < HPAGE_PMD_SIZE ||
- (new_vma->vm_flags & VM_NOHUGEPAGE))
+ old_end - old_addr < HPAGE_PMD_SIZE)
return false;

/*
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -198,9 +198,8 @@ unsigned long move_page_tables(struct vm
/* See comment in move_ptes() */
if (need_rmap_locks)
anon_vma_lock_write(vma->anon_vma);
- moved = move_huge_pmd(vma, new_vma, old_addr,
- new_addr, old_end,
- old_pmd, new_pmd);
+ moved = move_huge_pmd(vma, old_addr, new_addr,
+ old_end, old_pmd, new_pmd);
if (need_rmap_locks)
anon_vma_unlock_write(vma->anon_vma);
if (moved) {

2016-04-05 20:52:53

by Hugh Dickins

Subject: [PATCH 08/10] huge pagecache: extend mremap pmd rmap lockout to files

Whatever huge pagecache implementation we go with, file rmap locking
must be added to anon rmap locking, when mremap's move_page_tables()
finds a pmd_trans_huge pmd entry: a simple change, let's do it now.

Factor out take_rmap_locks() and drop_rmap_locks() to handle the
locking for both move_ptes() and move_page_tables(), and delete
the VM_BUG_ON_VMA which rejected vm_file and required anon_vma.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/mremap.c | 42 ++++++++++++++++++++++--------------------
1 file changed, 22 insertions(+), 20 deletions(-)

--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -70,6 +70,22 @@ static pmd_t *alloc_new_pmd(struct mm_st
return pmd;
}

+static void take_rmap_locks(struct vm_area_struct *vma)
+{
+ if (vma->vm_file)
+ i_mmap_lock_write(vma->vm_file->f_mapping);
+ if (vma->anon_vma)
+ anon_vma_lock_write(vma->anon_vma);
+}
+
+static void drop_rmap_locks(struct vm_area_struct *vma)
+{
+ if (vma->anon_vma)
+ anon_vma_unlock_write(vma->anon_vma);
+ if (vma->vm_file)
+ i_mmap_unlock_write(vma->vm_file->f_mapping);
+}
+
static pte_t move_soft_dirty_pte(pte_t pte)
{
/*
@@ -90,8 +106,6 @@ static void move_ptes(struct vm_area_str
struct vm_area_struct *new_vma, pmd_t *new_pmd,
unsigned long new_addr, bool need_rmap_locks)
{
- struct address_space *mapping = NULL;
- struct anon_vma *anon_vma = NULL;
struct mm_struct *mm = vma->vm_mm;
pte_t *old_pte, *new_pte, pte;
spinlock_t *old_ptl, *new_ptl;
@@ -114,16 +128,8 @@ static void move_ptes(struct vm_area_str
* serialize access to individual ptes, but only rmap traversal
* order guarantees that we won't miss both the old and new ptes).
*/
- if (need_rmap_locks) {
- if (vma->vm_file) {
- mapping = vma->vm_file->f_mapping;
- i_mmap_lock_write(mapping);
- }
- if (vma->anon_vma) {
- anon_vma = vma->anon_vma;
- anon_vma_lock_write(anon_vma);
- }
- }
+ if (need_rmap_locks)
+ take_rmap_locks(vma);

/*
* We don't have to worry about the ordering of src and dst
@@ -151,10 +157,8 @@ static void move_ptes(struct vm_area_str
spin_unlock(new_ptl);
pte_unmap(new_pte - 1);
pte_unmap_unlock(old_pte - 1, old_ptl);
- if (anon_vma)
- anon_vma_unlock_write(anon_vma);
- if (mapping)
- i_mmap_unlock_write(mapping);
+ if (need_rmap_locks)
+ drop_rmap_locks(vma);
}

#define LATENCY_LIMIT (64 * PAGE_SIZE)
@@ -193,15 +197,13 @@ unsigned long move_page_tables(struct vm
if (pmd_trans_huge(*old_pmd)) {
if (extent == HPAGE_PMD_SIZE) {
bool moved;
- VM_BUG_ON_VMA(vma->vm_file || !vma->anon_vma,
- vma);
/* See comment in move_ptes() */
if (need_rmap_locks)
- anon_vma_lock_write(vma->anon_vma);
+ take_rmap_locks(vma);
moved = move_huge_pmd(vma, old_addr, new_addr,
old_end, old_pmd, new_pmd);
if (need_rmap_locks)
- anon_vma_unlock_write(vma->anon_vma);
+ drop_rmap_locks(vma);
if (moved) {
need_flush = true;
continue;

2016-04-05 20:55:28

by Hugh Dickins

Subject: [PATCH 09/10] huge pagecache: mmap_sem is unlocked when truncation splits pmd

zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
will be invalid with huge pagecache, in whatever way it is implemented:
truncation of a hugely-mapped file to an unhugely-aligned size would
easily hit it.

(Although anon THP could in principle apply khugepaged to private file
mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
practice there's a vm_ops check which excludes them, so it never hits
this BUG() - there's no interface to "truncate" an anonymous mapping.)

We could complicate the test, to check i_mmap_rwsem also when there's a
vm_file; but my inclination was to make zap_pmd_range() more readable by
simply deleting this check. A search has shown no report of the issue in
the years since commit e0897d75f0b2 ("mm, thp: print useful information
when mmap_sem is unlocked in zap_pmd_range") expanded it from VM_BUG_ON()
- though I cannot point to what commit I would say then fixed the issue.

But there are a couple of other patches now floating around, neither
yet in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(),
as Matthew Wilcox has done; but subject to a vma_is_anonymous() check,
as Kirill Shutemov has done. And let's get this in, without waiting
for any particular huge pagecache implementation to reach the tree.

Signed-off-by: Hugh Dickins <[email protected]>
---
mm/memory.c | 11 ++---------
1 file changed, 2 insertions(+), 9 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1182,15 +1182,8 @@ static inline unsigned long zap_pmd_rang
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE) {
-#ifdef CONFIG_DEBUG_VM
- if (!rwsem_is_locked(&tlb->mm->mmap_sem)) {
- pr_err("%s: mmap_sem is unlocked! addr=0x%lx end=0x%lx vma->vm_start=0x%lx vma->vm_end=0x%lx\n",
- __func__, addr, end,
- vma->vm_start,
- vma->vm_end);
- BUG();
- }
-#endif
+ VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
+ !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
split_huge_pmd(vma, pmd, addr);
} else if (zap_huge_pmd(tlb, vma, pmd, addr))
goto next;

2016-04-05 21:02:56

by Hugh Dickins

Subject: [PATCH 10/10] arch: fix has_transparent_hugepage()

I've just discovered that the useful-sounding has_transparent_hugepage()
is actually an architecture-dependent minefield: on some arches it only
builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
not, but on some of those (arm and arm64) it then gives the wrong answer;
and on mips alone it's marked __init, which would crash if called later
(but so far it has not been called later).

Straighten this out: make it available to all configs, with a sensible
default in asm-generic/pgtable.h, removing its definitions from those
arches (arc, arm, arm64, sparc, tile) which are served by the default,
adding #define has_transparent_hugepage has_transparent_hugepage to those
(mips, powerpc, s390, x86) which need to override the default at runtime,
and removing the __init from mips (but maybe that kind of code should be
avoided after init: set a static variable the first time it's called).

Signed-off-by: Hugh Dickins <[email protected]>
---
arch/arc/include/asm/hugepage.h | 2 -
arch/arm/include/asm/pgtable-3level.h | 5 ----
arch/arm64/include/asm/pgtable.h | 5 ----
arch/mips/include/asm/pgtable.h | 1
arch/mips/mm/tlb-r4k.c | 21 ++++++++---------
arch/powerpc/include/asm/book3s/64/pgtable.h | 1
arch/powerpc/include/asm/pgtable.h | 1
arch/s390/include/asm/pgtable.h | 1
arch/sparc/include/asm/pgtable_64.h | 2 -
arch/tile/include/asm/pgtable.h | 1
arch/x86/include/asm/pgtable.h | 1
include/asm-generic/pgtable.h | 8 ++++++
12 files changed, 23 insertions(+), 26 deletions(-)

--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -61,8 +61,6 @@ static inline void set_pmd_at(struct mm_
extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);

-#define has_transparent_hugepage() 1
-
/* Generic variants assume pgtable_t is struct page *, hence need for these */
#define __HAVE_ARCH_PGTABLE_DEPOSIT
extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -281,11 +281,6 @@ static inline void set_pmd_at(struct mm_
flush_pmd_entry(pmdp);
}

-static inline int has_transparent_hugepage(void)
-{
- return 1;
-}
-
#endif /* __ASSEMBLY__ */

#endif /* _ASM_PGTABLE_3LEVEL_H */
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -304,11 +304,6 @@ static inline pgprot_t mk_sect_prot(pgpr

#define set_pmd_at(mm, addr, pmdp, pmd) set_pte_at(mm, addr, (pte_t *)pmdp, pmd_pte(pmd))

-static inline int has_transparent_hugepage(void)
-{
- return 1;
-}
-
#define __pgprot_modify(prot,mask,bits) \
__pgprot((pgprot_val(prot) & ~(mask)) | (bits))

--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -468,6 +468,7 @@ static inline int io_remap_pfn_range(str

#ifdef CONFIG_TRANSPARENT_HUGEPAGE

+#define has_transparent_hugepage has_transparent_hugepage
extern int has_transparent_hugepage(void);

static inline int pmd_trans_huge(pmd_t pmd)
--- a/arch/mips/mm/tlb-r4k.c
+++ b/arch/mips/mm/tlb-r4k.c
@@ -399,19 +399,20 @@ void add_wired_entry(unsigned long entry

#ifdef CONFIG_TRANSPARENT_HUGEPAGE

-int __init has_transparent_hugepage(void)
+int has_transparent_hugepage(void)
{
- unsigned int mask;
- unsigned long flags;
+ static unsigned int mask = -1;

- local_irq_save(flags);
- write_c0_pagemask(PM_HUGE_MASK);
- back_to_back_c0_hazard();
- mask = read_c0_pagemask();
- write_c0_pagemask(PM_DEFAULT_MASK);
-
- local_irq_restore(flags);
+ if (mask == -1) { /* first call comes during __init */
+ unsigned long flags;

+ local_irq_save(flags);
+ write_c0_pagemask(PM_HUGE_MASK);
+ back_to_back_c0_hazard();
+ mask = read_c0_pagemask();
+ write_c0_pagemask(PM_DEFAULT_MASK);
+ local_irq_restore(flags);
+ }
return mask == PM_HUGE_MASK;
}

--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -219,6 +219,7 @@ extern void set_pmd_at(struct mm_struct
pmd_t *pmdp, pmd_t pmd);
extern void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);
+#define has_transparent_hugepage has_transparent_hugepage
extern int has_transparent_hugepage(void);
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -65,7 +65,6 @@ extern int gup_hugepte(pte_t *ptep, unsi
struct page **pages, int *nr);
#ifndef CONFIG_TRANSPARENT_HUGEPAGE
#define pmd_large(pmd) 0
-#define has_transparent_hugepage() 0
#endif
pte_t *__find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
bool *is_thp, unsigned *shift);
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1223,6 +1223,7 @@ static inline int pmd_trans_huge(pmd_t p
return pmd_val(pmd) & _SEGMENT_ENTRY_LARGE;
}

+#define has_transparent_hugepage has_transparent_hugepage
static inline int has_transparent_hugepage(void)
{
return MACHINE_HAS_HPAGE ? 1 : 0;
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -681,8 +681,6 @@ static inline unsigned long pmd_trans_hu
return pte_val(pte) & _PAGE_PMD_HUGE;
}

-#define has_transparent_hugepage() 1
-
static inline pmd_t pmd_mkold(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--- a/arch/tile/include/asm/pgtable.h
+++ b/arch/tile/include/asm/pgtable.h
@@ -487,7 +487,6 @@ static inline pmd_t pmd_modify(pmd_t pmd
}

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-#define has_transparent_hugepage() 1
#define pmd_trans_huge pmd_huge_page
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -181,6 +181,7 @@ static inline int pmd_trans_huge(pmd_t p
return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
}

+#define has_transparent_hugepage has_transparent_hugepage
static inline int has_transparent_hugepage(void)
{
return cpu_has_pse;
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -806,4 +806,12 @@ static inline int pmd_clear_huge(pmd_t *
#define io_remap_pfn_range remap_pfn_range
#endif

+#ifndef has_transparent_hugepage
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define has_transparent_hugepage() 1
+#else
+#define has_transparent_hugepage() 0
+#endif
+#endif
+
#endif /* _ASM_GENERIC_PGTABLE_H */

2016-04-05 23:25:22

by David Miller

Subject: Re: [PATCH 10/10] arch: fix has_transparent_hugepage()

From: Hugh Dickins <[email protected]>
Date: Tue, 5 Apr 2016 14:02:49 -0700 (PDT)

> I've just discovered that the useful-sounding has_transparent_hugepage()
> is actually an architecture-dependent minefield: on some arches it only
> builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
> not, but on some of those (arm and arm64) it then gives the wrong answer;
> and on mips alone it's marked __init, which would crash if called later
> (but so far it has not been called later).
>
> Straighten this out: make it available to all configs, with a sensible
> default in asm-generic/pgtable.h, removing its definitions from those
> arches (arc, arm, arm64, sparc, tile) which are served by the default,
> adding #define has_transparent_hugepage has_transparent_hugepage to those
> (mips, powerpc, s390, x86) which need to override the default at runtime,
> and removing the __init from mips (but maybe that kind of code should be
> avoided after init: set a static variable the first time it's called).
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: David S. Miller <[email protected]>

2016-04-06 06:58:33

by Ingo Molnar

Subject: Re: [PATCH 10/10] arch: fix has_transparent_hugepage()


* Hugh Dickins <[email protected]> wrote:

> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -181,6 +181,7 @@ static inline int pmd_trans_huge(pmd_t p
> return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> }
>
> +#define has_transparent_hugepage has_transparent_hugepage
> static inline int has_transparent_hugepage(void)
> {
> return cpu_has_pse;

Small nit, just writing:

#define has_transparent_hugepage

ought to be enough, right?

In any case:

Acked-by: Ingo Molnar <[email protected]>

Another nit, this:

> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -806,4 +806,12 @@ static inline int pmd_clear_huge(pmd_t *
> #define io_remap_pfn_range remap_pfn_range
> #endif
>
> +#ifndef has_transparent_hugepage
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#define has_transparent_hugepage() 1
> +#else
> +#define has_transparent_hugepage() 0
> +#endif
> +#endif

Looks a bit more structured as:

#ifndef has_transparent_hugepage
# ifdef CONFIG_TRANSPARENT_HUGEPAGE
# define has_transparent_hugepage() 1
# else
# define has_transparent_hugepage() 0
# endif
#endif

BYMMV.

Thanks,

Ingo

2016-04-06 09:52:22

by Mel Gorman

Subject: Re: [PATCH 03/10] mm: use __SetPageSwapBacked and dont ClearPageSwapBacked

On Tue, Apr 05, 2016 at 01:44:16PM -0700, Hugh Dickins wrote:
> v3.16 commit 07a427884348 ("mm: shmem: avoid atomic operation during
> shmem_getpage_gfp") rightly replaced one instance of SetPageSwapBacked
> by __SetPageSwapBacked, pointing out that the newly allocated page is
> not yet visible to other users (except speculative get_page_unless_zero-
> ers, who may not update page flags before their further checks).
>
> That was part of a series in which Mel was focused on tmpfs profiles:
> but almost all SetPageSwapBacked uses can be so optimized, with the same
> justification. Remove ClearPageSwapBacked from __read_swap_cache_async()
> error path: it's not an error to free a page with PG_swapbacked set.
>
> Follow a convention of __SetPageLocked, __SetPageSwapBacked instead of
> doing it differently in different places; but that's for tidiness - if
> the ordering actually mattered, we should not be using the __variants.
>
> There's probably scope for further __SetPageFlags in other places,
> but SwapBacked is the one I'm interested in at the moment.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> Sorry, Mel did give
> a year ago, but the kernel has moved on since then,

Still looks good to me so

Reviewed-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2016-04-06 11:57:30

by Gerald Schaefer

Subject: Re: [PATCH 10/10] arch: fix has_transparent_hugepage()

On Tue, 5 Apr 2016 14:02:49 -0700 (PDT)
Hugh Dickins <[email protected]> wrote:

> I've just discovered that the useful-sounding has_transparent_hugepage()
> is actually an architecture-dependent minefield: on some arches it only
> builds if CONFIG_TRANSPARENT_HUGEPAGE=y, on others it's also there when
> not, but on some of those (arm and arm64) it then gives the wrong answer;
> and on mips alone it's marked __init, which would crash if called later
> (but so far it has not been called later).
>
> Straighten this out: make it available to all configs, with a sensible
> default in asm-generic/pgtable.h, removing its definitions from those
> arches (arc, arm, arm64, sparc, tile) which are served by the default,
> adding #define has_transparent_hugepage has_transparent_hugepage to those
> (mips, powerpc, s390, x86) which need to override the default at runtime,
> and removing the __init from mips (but maybe that kind of code should be
> avoided after init: set a static variable the first time it's called).
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---

Acked-by: Gerald Schaefer <[email protected]> # for arch/s390 bits

2016-04-06 21:58:26

by Ingo Molnar

Subject: Re: [PATCH 10/10] arch: fix has_transparent_hugepage()


* Chris Metcalf <[email protected]> wrote:

> On 4/6/2016 2:58 AM, Ingo Molnar wrote:
> >* Hugh Dickins <[email protected]> wrote:
> >
> >>--- a/arch/x86/include/asm/pgtable.h
> >>+++ b/arch/x86/include/asm/pgtable.h
> >>@@ -181,6 +181,7 @@ static inline int pmd_trans_huge(pmd_t p
> >> return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
> >> }
> >>+#define has_transparent_hugepage has_transparent_hugepage
> >> static inline int has_transparent_hugepage(void)
> >> {
> >> return cpu_has_pse;
> >Small nit, just writing:
> >
> > #define has_transparent_hugepage
> >
> >ought to be enough, right?
>
> No, because then the has_transparent_hugepage() call site in
> hugepage_init() would be left with just a stray pair of parentheses
> instead of a call.

Yes, indeed ...

Thanks,

Ingo
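
[Aside: a compile-level sketch of the point Chris makes, loosely modeled on
the hugepage_init() caller he mentions; illustrative only, not patch text.]

/*
 * Variant 1, the bare define, breaks existing callers, because
 *
 *	#define has_transparent_hugepage
 *	...
 *	if (!has_transparent_hugepage())
 *
 * preprocesses to "if (!())", a syntax error.
 *
 * Variant 2, the self-referential define, leaves call sites intact while
 * still letting asm-generic/pgtable.h test "#ifndef has_transparent_hugepage"
 * before supplying its compile-time default:
 */
#define has_transparent_hugepage has_transparent_hugepage
static inline int has_transparent_hugepage(void)
{
	return cpu_has_pse;	/* x86's runtime check, as in the quoted diff */
}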

2016-04-11 10:28:31

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 07/10] huge mm: move_huge_pmd does not need new_vma

On Tue, Apr 05, 2016 at 01:51:15PM -0700, Hugh Dickins wrote:
> Remove move_huge_pmd()'s redundant new_vma arg: all it was used for was
> a VM_NOHUGEPAGE check on new_vma flags, but the new_vma is cloned from
> the old vma, so a trans_huge_pmd in the new_vma will be as acceptable
> as it was in the old vma, alignment and size permitting.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov
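
[Aside: the shape of the interface change, reconstructed as a sketch from
the description above; return type and argument order are approximated,
not copied from the diff.]

/*
 * Before (sketch): new_vma was passed in only so that its VM_NOHUGEPAGE
 * flag could be checked:
 *
 *	bool move_huge_pmd(struct vm_area_struct *vma,
 *			   struct vm_area_struct *new_vma,
 *			   unsigned long old_addr, unsigned long new_addr,
 *			   unsigned long old_end,
 *			   pmd_t *old_pmd, pmd_t *new_pmd);
 *
 * After: new_vma dropped - being cloned from the old vma, its THP-related
 * flags cannot disagree, so the check it enabled was redundant.
 */
bool move_huge_pmd(struct vm_area_struct *vma,
		   unsigned long old_addr, unsigned long new_addr,
		   unsigned long old_end,
		   pmd_t *old_pmd, pmd_t *new_pmd);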

2016-04-11 10:32:27

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 08/10] huge pagecache: extend mremap pmd rmap lockout to files

On Tue, Apr 05, 2016 at 01:52:48PM -0700, Hugh Dickins wrote:
> Whatever huge pagecache implementation we go with, file rmap locking
> must be added to anon rmap locking, when mremap's move_page_tables()
> finds a pmd_trans_huge pmd entry: a simple change, let's do it now.
>
> Factor out take_rmap_locks() and drop_rmap_locks() to handle the
> locking for move_ptes() and move_page_tables(), and delete
> the VM_BUG_ON_VMA which rejected vm_file and required anon_vma.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Yeah, it's cleaner than my variant.

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov
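
[Aside: a sketch of the shape such helpers take, to make the "extend anon
rmap locking to file rmap locking" point concrete; treat it as illustrative
rather than as the patch text.]

static void take_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->vm_file)		/* new: file rmap lockout */
		i_mmap_lock_write(vma->vm_file->f_mapping);
	if (vma->anon_vma)		/* as before: anon rmap lockout */
		anon_vma_lock_write(vma->anon_vma);
}

static void drop_rmap_locks(struct vm_area_struct *vma)
{
	if (vma->anon_vma)
		anon_vma_unlock_write(vma->anon_vma);
	if (vma->vm_file)
		i_mmap_unlock_write(vma->vm_file->f_mapping);
}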

2016-04-11 10:35:14

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH 09/10] huge pagecache: mmap_sem is unlocked when truncation splits pmd

On Tue, Apr 05, 2016 at 01:55:23PM -0700, Hugh Dickins wrote:
> zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
> will be invalid with huge pagecache, in whatever way it is implemented:
> truncation of a hugely-mapped file to an unhugely-aligned size would
> easily hit it.
>
> (Although anon THP could in principle apply khugepaged to private file
> mappings, which are not excluded by the MADV_HUGEPAGE restrictions, in
> practice there's a vm_ops check which excludes them, so it never hits
> this BUG() - there's no interface to "truncate" an anonymous mapping.)
>
> We could complicate the test, to check i_mmap_rwsem also when there's a
> vm_file; but my inclination was to make zap_pmd_range() more readable by
> simply deleting this check. A search has shown no report of the issue in
> the years since commit e0897d75f0b2 ("mm, thp: print useful information
> when mmap_sem is unlocked in zap_pmd_range") expanded it from VM_BUG_ON()
> - though I cannot point to what commit I would say then fixed the issue.
>
> But there are a couple of other patches now floating around, neither
> yet in the tree: let's agree to retain the check as a VM_BUG_ON_VMA(),
> as Matthew Wilcox has done; but subject to a vma_is_anonymous() check,
> as Kirill Shutemov has done. And let's get this in, without waiting
> for any particular huge pagecache implementation to reach the tree.
>
> Signed-off-by: Hugh Dickins <[email protected]>

Acked-by: Kirill A. Shutemov <[email protected]>

--
Kirill A. Shutemov
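
[Aside: the agreed-upon form, sketched as a fragment inside zap_pmd_range()'s
pmd loop; identifiers such as tlb, vma, pmd, addr and next belong to the
surrounding function, and details like the pmd_devmap() test are taken from
the v4.6-era code, so the final patch may differ slightly.]

if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
	if (next - addr != HPAGE_PMD_SIZE) {
		/*
		 * Only anonymous mappings must hold mmap_sem here:
		 * file truncation can split a pmd without it.
		 */
		VM_BUG_ON_VMA(vma_is_anonymous(vma) &&
			      !rwsem_is_locked(&tlb->mm->mmap_sem), vma);
		split_huge_pmd(vma, pmd, addr);
	} else if (zap_huge_pmd(tlb, vma, pmd, addr))
		goto next;
	/* fall through */
}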

2016-04-14 11:56:53

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm: update_lru_size warn and reset bad lru_size

On 04/05/2016 10:40 PM, Hugh Dickins wrote:
> Though debug kernels have a VM_BUG_ON to help protect from misaccounting
> lru_size, non-debug kernels are liable to wrap it around: and then the
> vast unsigned long size draws page reclaim into a loop of repeatedly
> doing nothing on an empty list, without even a cond_resched().
>
> That soft lockup looks confusingly like an over-busy reclaim scenario,
> with lots of contention on the lru_lock in shrink_inactive_list():
> yet has a totally different origin.
>
> Help differentiate with a custom warning in mem_cgroup_update_lru_size(),
> even in non-debug kernels; and reset the size to avoid the lockup. But
> the particular bug which suggested this change was mine alone, and since
> fixed.

In my opinion, the code now looks quite complicated, and I'm not sure it's a
good tradeoff for a rare (?) development bug. But I guess it's up to the memcg
maintainers, who I note are not explicitly CC'd, so I'm adding them now.

Maybe more generally, we can discuss in LSF/MM's mm debugging session what it
means when a DEBUG_VM check has to become unconditional. Does it mean
insufficient testing with DEBUG_VM during the development/integration phase?
Or are some bugs so rare that we can't depend on that phase to catch them?
IIRC Fedora kernels are built with DEBUG_VM, unless that has changed...

> Make it a WARN_ONCE: the first occurrence is the most informative, a
> flurry may follow, yet even when rate-limited little more is learnt.
>
> Signed-off-by: Hugh Dickins <[email protected]>
> ---
> include/linux/mm_inline.h | 2 +-
> mm/memcontrol.c | 24 ++++++++++++++++++++----
> 2 files changed, 21 insertions(+), 5 deletions(-)
>
> --- a/include/linux/mm_inline.h
> +++ b/include/linux/mm_inline.h
> @@ -35,8 +35,8 @@ static __always_inline void del_page_fro
> struct lruvec *lruvec, enum lru_list lru)
> {
> int nr_pages = hpage_nr_pages(page);
> - mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> list_del(&page->lru);
> + mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
> __mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
> }
>
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1022,22 +1022,38 @@ out:
> * @lru: index of lru list the page is sitting on
> * @nr_pages: positive when adding or negative when removing
> *
> - * This function must be called when a page is added to or removed from an
> - * lru list.
> + * This function must be called under lru_lock, just before a page is added
> + * to or just after a page is removed from an lru list (that ordering being
> + * so as to allow it to check that lru_size 0 is consistent with list_empty).
> */
> void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru,
> int nr_pages)
> {
> struct mem_cgroup_per_zone *mz;
> unsigned long *lru_size;
> + long size;
> + bool empty;

Could there be more descriptive names? lru_size vs size looks confusing.

>
> if (mem_cgroup_disabled())
> return;
>
> mz = container_of(lruvec, struct mem_cgroup_per_zone, lruvec);
> lru_size = mz->lru_size + lru;
> - *lru_size += nr_pages;
> - VM_BUG_ON((long)(*lru_size) < 0);
> + empty = list_empty(lruvec->lists + lru);
> +
> + if (nr_pages < 0)
> + *lru_size += nr_pages;
> +
> + size = *lru_size;
> + if (WARN_ONCE(size < 0 || empty != !size,

Maybe I'm just not used to constructs like "empty != !size", but it really
takes me longer than I'd like to work out the meaning :(

> + "%s(%p, %d, %d): lru_size %ld but %sempty\n",
> + __func__, lruvec, lru, nr_pages, size, empty ? "" : "not ")) {
> + VM_BUG_ON(1);
> + *lru_size = 0;
> + }
> +
> + if (nr_pages > 0)
> + *lru_size += nr_pages;
> }
>
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg)
>
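
[Aside: for anyone else slowed down by "empty != !size", a hypothetical
helper spelling out the same consistency check; the name
lru_size_inconsistent() is invented here and is not part of the patch.]

#include <linux/types.h>

static inline bool lru_size_inconsistent(long size, bool empty)
{
	if (size < 0)
		return true;		/* the cached count went negative */
	return empty != (size == 0);	/* list emptiness must match a zero
					 * count, and only a zero count */
}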

2016-04-14 17:39:27

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH 09/10] huge pagecache: mmap_sem is unlocked when truncation splits pmd

On Tue, Apr 05, 2016 at 01:55:23PM -0700, Hugh Dickins wrote:
> zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
> will be invalid with huge pagecache, in whatever way it is implemented:
> truncation of a hugely-mapped file to an unhugely-aligned size would
> easily hit it.

We can reproduce this BUG() in the current Linus tree with DAX PMDs.
Andrew, can you send this patch to Linus for inclusion in 4.7?

Tested-by: Matthew Wilcox <[email protected]>

2016-04-16 04:55:21

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/10] huge pagecache: mmap_sem is unlocked when truncation splits pmd

On Thu, 14 Apr 2016 13:39:22 -0400 Matthew Wilcox <[email protected]> wrote:

> On Tue, Apr 05, 2016 at 01:55:23PM -0700, Hugh Dickins wrote:
> > zap_pmd_range()'s CONFIG_DEBUG_VM !rwsem_is_locked(&mmap_sem) BUG()
> > will be invalid with huge pagecache, in whatever way it is implemented:
> > truncation of a hugely-mapped file to an unhugely-aligned size would
> > easily hit it.
>
> We can reproduce this BUG() in the current Linus tree with DAX PMDs.
> Andrew, can you send this patch to Linus for inclusion in 4.7?

Wilco, thanks.