2020-04-20 22:12:59

by Johannes Weiner

Subject: [PATCH 00/18] mm: memcontrol: charge swapin pages on instantiation

This patch series reworks memcg to charge swapin pages directly at
swapin time, rather than at fault time, which may be much later, or
not happen at all.

The delayed charging scheme we have right now causes problems:

- Alex's per-cgroup lru_lock patches rely on pages that have been
isolated from the LRU to have a stable page->mem_cgroup; otherwise
the lock may change underneath him. Swapcache pages are charged only
after they are added to the LRU, and charging doesn't follow the LRU
isolation protocol.

- Joonsoo's anon workingset patches need a suitable LRU at the time
the page enters the swap cache and displaces the non-resident
info. But the correct LRU is only available after charging.

- It's a containment hole / DoS vector. Users can trigger arbitrarily
large swap readahead using MADV_WILLNEED. The memory is never
charged unless somebody actually touches it.

- It complicates the page->mem_cgroup stabilization rules

In order to charge pages directly at swapin time, the memcg code base
needs to be prepared, and several overdue cleanups become a necessity:

To charge pages at swapin time, we need to always have cgroup
ownership tracking of swap records. We also cannot rely on
page->mapping to tell apart page types at charge time, because that's
only set up during a page fault.

To eliminate the page->mapping dependency, memcg needs to ditch its
private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
of the generic vmstat counters and accounting sites, such as
NR_FILE_PAGES, NR_ANON_MAPPED etc.

To switch to generic vmstat counters, the charge sequence must be
adjusted such that page->mem_cgroup is set up by the time these
counters are modified.
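
As a conceptual sketch (not code from any single patch), the required
ordering is simply:

        page->mem_cgroup = memcg;       /* commit the charge first */
        __mod_lruvec_page_state(page, NR_FILE_PAGES, nr);

__mod_lruvec_page_state() routes the update through page->mem_cgroup's
lruvec; if the counter were bumped before the commit, the delta would
only reach the node counters and the cgroup would never see it.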

The series is structured as follows:

1. Bug fixes
2. Decoupling charging from rmap
3. Swap controller integration into memcg
4. Direct swapin charging

The patches survive a simple swapout->swapin test inside a virtual
machine. Because this is blocking two major patch sets, I'm sending
these out early and will continue testing in parallel to the review.

include/linux/memcontrol.h | 53 +----
include/linux/mm.h | 4 +-
include/linux/swap.h | 6 +-
init/Kconfig | 17 +-
kernel/events/uprobes.c | 10 +-
mm/filemap.c | 43 ++---
mm/huge_memory.c | 45 ++---
mm/khugepaged.c | 25 +--
mm/memcontrol.c | 448 ++++++++++++++-----------------------------
mm/memory.c | 51 ++---
mm/migrate.c | 20 +-
mm/rmap.c | 53 +++--
mm/shmem.c | 117 +++++------
mm/swap_cgroup.c | 6 -
mm/swap_state.c | 89 +++++----
mm/swapfile.c | 25 +--
mm/userfaultfd.c | 5 +-
17 files changed, 367 insertions(+), 650 deletions(-)



2020-04-20 22:13:10

by Johannes Weiner

Subject: [PATCH 01/18] mm: fix NUMA node file count error in replace_page_cache()

When replacing one page with another one in the cache, we have to
decrease the file count of the old page's NUMA node and increase that
of the new page's NUMA node; otherwise the old node leaks the count
and the new node eventually underflows its counter.

Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/filemap.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 23a051a7ef0f..49e3b5da0216 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -808,11 +808,11 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
old->mapping = NULL;
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(old))
- __dec_node_page_state(new, NR_FILE_PAGES);
+ __dec_node_page_state(old, NR_FILE_PAGES);
if (!PageHuge(new))
__inc_node_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(old))
- __dec_node_page_state(new, NR_SHMEM);
+ __dec_node_page_state(old, NR_SHMEM);
if (PageSwapBacked(new))
__inc_node_page_state(new, NR_SHMEM);
xas_unlock_irqrestore(&xas, flags);
--
2.26.0

2020-04-20 22:13:16

by Johannes Weiner

Subject: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

The move_lock is a per-memcg lock, but the VM accounting code that
needs to acquire it comes from the page and follows page->mem_cgroup
under RCU protection. That means that the page becomes unlocked not
when we drop the move_lock, but when we update page->mem_cgroup. And
that assignment doesn't imply any memory ordering. If that pointer
write gets reordered against the reads of the page state -
page_mapped, PageDirty etc. - the state may change while we rely on
it being stable, and we can end up corrupting the counters.

Place an SMP memory barrier to make sure we're done with all page
state by the time the new page->mem_cgroup becomes visible.

Also replace the open-coded move_lock with a lock_page_memcg() to make
it more obvious what we're serializing against.
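
Condensed from the resulting code, the writer side now looks roughly
like this (a sketch, with unrelated details elided):

        lock_page_memcg(page);          /* lock out concurrent stat updaters */

        if (!anon && page_mapped(page)) {       /* read page state ... */
                __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
                __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
        }
        ...
        smp_mb();       /* ... and be done with those reads before the
                           pointer switch below becomes visible */
        page->mem_cgroup = to;          /* page now "unlocked" for updaters */
        __unlock_page_memcg(from);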

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 26 ++++++++++++++------------
1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5beea03dd58a..41f5ed79272e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5372,7 +5372,6 @@ static int mem_cgroup_move_account(struct page *page,
{
struct lruvec *from_vec, *to_vec;
struct pglist_data *pgdat;
- unsigned long flags;
unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret;
bool anon;
@@ -5399,18 +5398,13 @@ static int mem_cgroup_move_account(struct page *page,
from_vec = mem_cgroup_lruvec(from, pgdat);
to_vec = mem_cgroup_lruvec(to, pgdat);

- spin_lock_irqsave(&from->move_lock, flags);
+ lock_page_memcg(page);

if (!anon && page_mapped(page)) {
__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
}

- /*
- * move_lock grabbed above and caller set from->moving_account, so
- * mod_memcg_page_state will serialize updates to PageDirty.
- * So mapping should be stable for dirty pages.
- */
if (!anon && PageDirty(page)) {
struct address_space *mapping = page_mapping(page);

@@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
}

/*
+ * All state has been migrated, let's switch to the new memcg.
+ *
* It is safe to change page->mem_cgroup here because the page
- * is referenced, charged, and isolated - we can't race with
- * uncharging, charging, migration, or LRU putback.
+ * is referenced, charged, isolated, and locked: we can't race
+ * with (un)charging, migration, LRU putback, or anything else
+ * that would rely on a stable page->mem_cgroup.
+ *
+ * Note that lock_page_memcg is a memcg lock, not a page lock,
+ * to save space. As soon as we switch page->mem_cgroup to a
+ * new memcg that isn't locked, the above state can change
+ * concurrently again. Make sure we're truly done with it.
*/
+ smp_mb();

- /* caller should have done css_get */
- page->mem_cgroup = to;
+ page->mem_cgroup = to; /* caller should have done css_get */

- spin_unlock_irqrestore(&from->move_lock, flags);
+ __unlock_page_memcg(from);

ret = 0;

--
2.26.0

2020-04-20 22:13:27

by Johannes Weiner

Subject: [PATCH 04/18] mm: memcontrol: move out cgroup swaprate throttling

The cgroup swaprate throttling is about matching new anon allocations
to the rate of available IO when that is being throttled. It's the io
controller hooking into the VM, rather than a memory controller thing.

Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(),
and drop the @memcg argument which is only used to check whether the
preceding page charge has succeeded and the fault is proceeding.

We could decouple the call from mem_cgroup_try_charge() here as well,
but that would cause unnecessary churn: the following patches convert
all callsites to a new charge API and we'll decouple as we go along.
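
Once the series is complete, the call pattern at the allocation sites
ends up looking roughly like this (sketch):

        if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
                goto oom;
        cgroup_throttle_swaprate(page, GFP_KERNEL);

i.e. the IO throttling hook simply follows a successful charge instead
of being entangled with the memcg charge API.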

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/swap.h | 6 ++----
mm/memcontrol.c | 5 ++---
mm/swapfile.c | 14 +++++++-------
3 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index b835d8dbea0e..e0380554f4c4 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -645,11 +645,9 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
#endif

#if defined(CONFIG_SWAP) && defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-extern void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
- gfp_t gfp_mask);
+extern void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask);
#else
-static inline void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg,
- int node, gfp_t gfp_mask)
+static inline void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask)
{
}
#endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5ed8f6651383..711d6dd5cbb1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6493,12 +6493,11 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
- struct mem_cgroup *memcg;
int ret;

ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp);
- memcg = *memcgp;
- mem_cgroup_throttle_swaprate(memcg, page_to_nid(page), gfp_mask);
+ if (*memcgp)
+ cgroup_throttle_swaprate(page, gfp_mask);
return ret;
}

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 9c9ab44780ba..74543137371b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3744,11 +3744,12 @@ static void free_swap_count_continuations(struct swap_info_struct *si)
}

#if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP)
-void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
- gfp_t gfp_mask)
+void cgroup_throttle_swaprate(struct page *page, gfp_t gfp_mask)
{
struct swap_info_struct *si, *next;
- if (!(gfp_mask & __GFP_IO) || !memcg)
+ int nid = page_to_nid(page);
+
+ if (!(gfp_mask & __GFP_IO))
return;

if (!blk_cgroup_congested())
@@ -3762,11 +3763,10 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
return;

spin_lock(&swap_avail_lock);
- plist_for_each_entry_safe(si, next, &swap_avail_heads[node],
- avail_lists[node]) {
+ plist_for_each_entry_safe(si, next, &swap_avail_heads[nid],
+ avail_lists[nid]) {
if (si->bdev) {
- blkcg_schedule_throttle(bdev_get_queue(si->bdev),
- true);
+ blkcg_schedule_throttle(bdev_get_queue(si->bdev), true);
break;
}
}
--
2.26.0

2020-04-20 22:13:34

by Johannes Weiner

Subject: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

The try/commit/cancel protocol that memcg uses dates back to when
pages used to be uncharged upon removal from the page cache, and thus
couldn't be committed before the insertion had succeeded. Nowadays,
pages are uncharged when they are physically freed; it doesn't matter
whether the insertion was successful or not. For the page cache, the
transaction dance has become unnecessary.

Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything
but free/put the page.

Then switch the page cache over to this new API.

Subsequent patches will also convert anon pages, but it needs a bit
more prep work. Right now, memcg depends on page->mapping being
already set up at the time of charging, so that it can maintain its
own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
under the same pte lock under which the page is published, so a single
charge point that can block doesn't work there just yet.

The following prep patches will replace the private memcg counters
with the generic vmstat counters, thus removing the page->mapping
dependency, then complete the transition to the new single-point
charge API and delete the old transactional scheme.
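
Condensed, the conversion in the page cache insertion path amounts to
the following (a sketch, error paths and locking elided):

        /* before: transactional */
        error = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
        /* ... insert page into the xarray ... */
        mem_cgroup_commit_charge(page, memcg, false);   /* on success */
        mem_cgroup_cancel_charge(page, memcg);          /* on failure */

        /* after: one step, nothing to unwind if the insertion fails */
        error = mem_cgroup_charge(page, mm, gfp_mask, false);
        if (error)
                goto error;     /* just put_page() and return */
        /* ... insert page into the xarray ... */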

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 10 ++++
mm/filemap.c | 24 ++++------
mm/memcontrol.c | 27 +++++++++++
mm/shmem.c | 97 +++++++++++++++++---------------------
4 files changed, 89 insertions(+), 69 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7875a48c8c1..5e8b0e38f145 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -367,6 +367,10 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare);
void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
+ bool lrucare);
+
void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);

@@ -872,6 +876,12 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
{
}

+static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, bool lrucare)
+{
+ return 0;
+}
+
static inline void mem_cgroup_uncharge(struct page *page)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index 5b31af9d5b1b..5bdbda965177 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -832,7 +832,6 @@ static int __add_to_page_cache_locked(struct page *page,
{
XA_STATE(xas, &mapping->i_pages, offset);
int huge = PageHuge(page);
- struct mem_cgroup *memcg;
int error;
void *old;

@@ -840,17 +839,16 @@ static int __add_to_page_cache_locked(struct page *page,
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
mapping_set_update(&xas, mapping);

- if (!huge) {
- error = mem_cgroup_try_charge(page, current->mm,
- gfp_mask, &memcg);
- if (error)
- return error;
- }
-
get_page(page);
page->mapping = mapping;
page->index = offset;

+ if (!huge) {
+ error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
+ if (error)
+ goto error;
+ }
+
do {
xas_lock_irq(&xas);
old = xas_load(&xas);
@@ -874,20 +872,18 @@ static int __add_to_page_cache_locked(struct page *page,
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));

- if (xas_error(&xas))
+ if (xas_error(&xas)) {
+ error = xas_error(&xas);
goto error;
+ }

- if (!huge)
- mem_cgroup_commit_charge(page, memcg, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
error:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
- return xas_error(&xas);
+ return error;
}
ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 711d6dd5cbb1..b38c0a672d26 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6577,6 +6577,33 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
cancel_charge(memcg, nr_pages);
}

+/**
+ * mem_cgroup_charge - charge a newly allocated page to a cgroup
+ * @page: page to charge
+ * @mm: mm context of the victim
+ * @gfp_mask: reclaim mode
+ * @lrucare: page might be on the LRU already
+ *
+ * Try to charge @page to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp_mask if necessary.
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
+ bool lrucare)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ VM_BUG_ON_PAGE(!page->mapping, page);
+
+ ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
+ if (ret)
+ return ret;
+ mem_cgroup_commit_charge(page, memcg, lrucare);
+ return 0;
+}
+
struct uncharge_gather {
struct mem_cgroup *memcg;
unsigned long pgpgout;
diff --git a/mm/shmem.c b/mm/shmem.c
index 52c66801321e..2384f6c7ef71 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -605,11 +605,13 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
*/
static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
- pgoff_t index, void *expected, gfp_t gfp)
+ pgoff_t index, void *expected, gfp_t gfp,
+ struct mm_struct *charge_mm)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
unsigned long i = 0;
unsigned long nr = compound_nr(page);
+ int error;

VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(index != round_down(index, nr), page);
@@ -621,6 +623,16 @@ static int shmem_add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = index;

+ error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
+ if (error) {
+ if (!PageSwapCache(page) && PageTransHuge(page)) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+ goto error;
+ }
+ cgroup_throttle_swaprate(page, gfp);
+
do {
void *entry;
xas_lock_irq(&xas);
@@ -648,12 +660,15 @@ static int shmem_add_to_page_cache(struct page *page,
} while (xas_nomem(&xas, gfp));

if (xas_error(&xas)) {
- page->mapping = NULL;
- page_ref_sub(page, nr);
- return xas_error(&xas);
+ error = xas_error(&xas);
+ goto error;
}

return 0;
+error:
+ page->mapping = NULL;
+ page_ref_sub(page, nr);
+ return error;
}

/*
@@ -1619,7 +1634,6 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
- struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
int error;
@@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
goto failed;
}

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
- if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
- swp_to_radix_entry(swap), gfp);
- /*
- * We already confirmed swap under page lock, and make
- * no memory allocation here, so usually no possibility
- * of error; but free_swap_and_cache() only trylocks a
- * page, so it is just possible that the entry has been
- * truncated or holepunched since swap was confirmed.
- * shmem_undo_range() will have done some of the
- * unaccounting, now delete_from_swap_cache() will do
- * the rest.
- */
- if (error) {
- mem_cgroup_cancel_charge(page, memcg);
- delete_from_swap_cache(page);
- }
- }
- if (error)
+ error = shmem_add_to_page_cache(page, mapping, index,
+ swp_to_radix_entry(swap), gfp,
+ charge_mm);
+ /*
+ * We already confirmed swap under page lock, and make no
+ * memory allocation here, so usually no possibility of error;
+ * but free_swap_and_cache() only trylocks a page, so it is
+ * just possible that the entry has been truncated or
+ * holepunched since swap was confirmed. shmem_undo_range()
+ * will have done some of the unaccounting, now
+ * delete_from_swap_cache() will do the rest.
+ */
+ if (error) {
+ delete_from_swap_cache(page);
goto failed;
-
- mem_cgroup_commit_charge(page, memcg, true);
+ }

spin_lock_irq(&info->lock);
info->swapped--;
@@ -1733,7 +1740,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo;
struct mm_struct *charge_mm;
- struct mem_cgroup *memcg;
struct page *page;
enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
@@ -1858,21 +1864,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
- if (error) {
- if (PageTransHuge(page)) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
- }
- goto unacct;
- }
error = shmem_add_to_page_cache(page, mapping, hindex,
- NULL, gfp & GFP_RECLAIM_MASK);
- if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ NULL, gfp & GFP_RECLAIM_MASK,
+ charge_mm);
+ if (error)
goto unacct;
- }
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_anon(page);

spin_lock_irq(&info->lock);
@@ -2307,7 +2303,6 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
struct address_space *mapping = inode->i_mapping;
gfp_t gfp = mapping_gfp_mask(mapping);
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- struct mem_cgroup *memcg;
spinlock_t *ptl;
void *page_kaddr;
struct page *page;
@@ -2357,16 +2352,10 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
if (unlikely(offset >= max_off))
goto out_release;

- ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg);
- if (ret)
- goto out_release;
-
ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK, dst_mm);
if (ret)
- goto out_release_uncharge;
-
- mem_cgroup_commit_charge(page, memcg, false);
+ goto out_release;

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
if (dst_vma->vm_flags & VM_WRITE)
@@ -2387,11 +2376,11 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ret = -EFAULT;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off))
- goto out_release_uncharge_unlock;
+ goto out_release_unlock;

ret = -EEXIST;
if (!pte_none(*dst_pte))
- goto out_release_uncharge_unlock;
+ goto out_release_unlock;

lru_cache_add_anon(page);

@@ -2412,12 +2401,10 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ret = 0;
out:
return ret;
-out_release_uncharge_unlock:
+out_release_unlock:
pte_unmap_unlock(dst_pte, ptl);
ClearPageDirty(page);
delete_from_page_cache(page);
-out_release_uncharge:
- mem_cgroup_cancel_charge(page, memcg);
out_release:
unlock_page(page);
put_page(page);
--
2.26.0

2020-04-20 22:13:38

by Johannes Weiner

Subject: [PATCH 06/18] mm: memcontrol: prepare uncharging for removal of private page type counters

The uncharge batching code adds up the anon, file, kmem counts to
determine the total number of pages to uncharge and references to
drop. But the next patches will remove the anon and file counters.

Maintain an aggregate nr_pages in the uncharge_gather struct.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 23 ++++++++++++-----------
1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b38c0a672d26..e3e8913a5b28 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6606,6 +6606,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,

struct uncharge_gather {
struct mem_cgroup *memcg;
+ unsigned long nr_pages;
unsigned long pgpgout;
unsigned long nr_anon;
unsigned long nr_file;
@@ -6622,13 +6623,12 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)

static void uncharge_batch(const struct uncharge_gather *ug)
{
- unsigned long nr_pages = ug->nr_anon + ug->nr_file + ug->nr_kmem;
unsigned long flags;

if (!mem_cgroup_is_root(ug->memcg)) {
- page_counter_uncharge(&ug->memcg->memory, nr_pages);
+ page_counter_uncharge(&ug->memcg->memory, ug->nr_pages);
if (do_memsw_account())
- page_counter_uncharge(&ug->memcg->memsw, nr_pages);
+ page_counter_uncharge(&ug->memcg->memsw, ug->nr_pages);
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
memcg_oom_recover(ug->memcg);
@@ -6640,16 +6640,18 @@ static void uncharge_batch(const struct uncharge_gather *ug)
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, nr_pages);
+ __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
local_irq_restore(flags);

if (!mem_cgroup_is_root(ug->memcg))
- css_put_many(&ug->memcg->css, nr_pages);
+ css_put_many(&ug->memcg->css, ug->nr_pages);
}

static void uncharge_page(struct page *page, struct uncharge_gather *ug)
{
+ unsigned long nr_pages;
+
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page) && !is_zone_device_page(page) &&
!PageHWPoison(page) , page);
@@ -6671,13 +6673,12 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
ug->memcg = page->mem_cgroup;
}

- if (!PageKmemcg(page)) {
- unsigned int nr_pages = 1;
+ nr_pages = compound_nr(page);
+ ug->nr_pages += nr_pages;

- if (PageTransHuge(page)) {
- nr_pages = compound_nr(page);
+ if (!PageKmemcg(page)) {
+ if (PageTransHuge(page))
ug->nr_huge += nr_pages;
- }
if (PageAnon(page))
ug->nr_anon += nr_pages;
else {
@@ -6687,7 +6688,7 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
}
ug->pgpgout++;
} else {
- ug->nr_kmem += compound_nr(page);
+ ug->nr_kmem += nr_pages;
__ClearPageKmemcg(page);
}

--
2.26.0

2020-04-20 22:14:01

by Johannes Weiner

Subject: [PATCH 14/18] mm: memcontrol: prepare swap controller setup for integration

A few cleanups to streamline the swap controller setup:

- Replace the do_swap_account flag with cgroup_memory_noswap. This
brings it in line with other functionality that is usually available
unless explicitly opted out of - nosocket, nokmem.

- Remove the really_do_swap_account flag that stores the boot option
and is later used to switch the do_swap_account. It's not clear why
this indirection is/was necessary. Use do_swap_account directly.

- Minor coding style polishing
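
For reference, the boot parameter now maps onto the new flag like this
(see setup_swap_account() below; the default is set by
CONFIG_MEMCG_SWAP_ENABLED):

        swapaccount=1   ->  cgroup_memory_noswap = 0   (accounting on)
        swapaccount=0   ->  cgroup_memory_noswap = 1   (accounting off)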

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 59 ++++++++++++++++++--------------------
mm/swap_cgroup.c | 4 +--
3 files changed, 31 insertions(+), 34 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 52eb6411cfee..d458f1d90aa4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -560,7 +560,7 @@ struct mem_cgroup *mem_cgroup_get_oom_group(struct task_struct *victim,
void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);

#ifdef CONFIG_MEMCG_SWAP
-extern int do_swap_account;
+extern bool cgroup_memory_noswap;
#endif

struct mem_cgroup *lock_page_memcg(struct page *page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5aee5577ff3..5558777023e7 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,10 +83,14 @@ static bool cgroup_memory_nokmem;

/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
-int do_swap_account __read_mostly;
+#ifdef CONFIG_MEMCG_SWAP_ENABLED
+bool cgroup_memory_noswap __read_mostly;
#else
-#define do_swap_account 0
-#endif
+bool cgroup_memory_noswap __read_mostly = 1;
+#endif /* CONFIG_MEMCG_SWAP_ENABLED */
+#else
+#define cgroup_memory_noswap 1
+#endif /* CONFIG_MEMCG_SWAP */

#ifdef CONFIG_CGROUP_WRITEBACK
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
@@ -95,7 +99,7 @@ static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
- return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && do_swap_account;
+ return !cgroup_subsys_on_dfl(memory_cgrp_subsys) && !cgroup_memory_noswap;
}

#define THRESHOLDS_EVENTS_TARGET 128
@@ -6458,18 +6462,19 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
/*
* Every swap fault against a single page tries to charge the
* page, bail as early as possible. shmem_unuse() encounters
- * already charged pages, too. The USED bit is protected by
- * the page lock, which serializes swap cache removal, which
+ * already charged pages, too. page->mem_cgroup is protected
+ * by the page lock, which serializes swap cache removal, which
* in turn serializes uncharging.
*/
VM_BUG_ON_PAGE(!PageLocked(page), page);
if (compound_head(page)->mem_cgroup)
goto out;

- if (do_swap_account) {
+ if (!cgroup_memory_noswap) {
swp_entry_t ent = { .val = page_private(page), };
- unsigned short id = lookup_swap_cgroup_id(ent);
+ unsigned short id;

+ id = lookup_swap_cgroup_id(ent);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg && !css_tryget_online(&memcg->css))
@@ -6944,7 +6949,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
struct mem_cgroup *memcg;
unsigned short oldid;

- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || !do_swap_account)
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || cgroup_memory_noswap)
return 0;

memcg = page->mem_cgroup;
@@ -6988,7 +6993,7 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;

- if (!do_swap_account)
+ if (cgroup_memory_noswap)
return;

id = swap_cgroup_record(entry, 0, nr_pages);
@@ -7011,7 +7016,7 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg)
{
long nr_swap_pages = get_nr_swap_pages();

- if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return nr_swap_pages;
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg))
nr_swap_pages = min_t(long, nr_swap_pages,
@@ -7028,7 +7033,7 @@ bool mem_cgroup_swap_full(struct page *page)

if (vm_swap_full())
return true;
- if (!do_swap_account || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;

memcg = page->mem_cgroup;
@@ -7043,22 +7048,15 @@ bool mem_cgroup_swap_full(struct page *page)
return false;
}

-/* for remember boot option*/
-#ifdef CONFIG_MEMCG_SWAP_ENABLED
-static int really_do_swap_account __initdata = 1;
-#else
-static int really_do_swap_account __initdata;
-#endif
-
-static int __init enable_swap_account(char *s)
+static int __init setup_swap_account(char *s)
{
if (!strcmp(s, "1"))
- really_do_swap_account = 1;
+ cgroup_memory_noswap = 0;
else if (!strcmp(s, "0"))
- really_do_swap_account = 0;
+ cgroup_memory_noswap = 1;
return 1;
}
-__setup("swapaccount=", enable_swap_account);
+__setup("swapaccount=", setup_swap_account);

static u64 swap_current_read(struct cgroup_subsys_state *css,
struct cftype *cft)
@@ -7124,7 +7122,7 @@ static struct cftype swap_files[] = {
{ } /* terminate */
};

-static struct cftype memsw_cgroup_files[] = {
+static struct cftype memsw_files[] = {
{
.name = "memsw.usage_in_bytes",
.private = MEMFILE_PRIVATE(_MEMSWAP, RES_USAGE),
@@ -7153,13 +7151,12 @@ static struct cftype memsw_cgroup_files[] = {

static int __init mem_cgroup_swap_init(void)
{
- if (!mem_cgroup_disabled() && really_do_swap_account) {
- do_swap_account = 1;
- WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys,
- swap_files));
- WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys,
- memsw_cgroup_files));
- }
+ if (mem_cgroup_disabled() || cgroup_memory_noswap)
+ return 0;
+
+ WARN_ON(cgroup_add_dfl_cftypes(&memory_cgrp_subsys, swap_files));
+ WARN_ON(cgroup_add_legacy_cftypes(&memory_cgrp_subsys, memsw_files));
+
return 0;
}
subsys_initcall(mem_cgroup_swap_init);
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 45affaef3bc6..7aa764f09079 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -171,7 +171,7 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
unsigned long length;
struct swap_cgroup_ctrl *ctrl;

- if (!do_swap_account)
+ if (cgroup_memory_noswap)
return 0;

length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
@@ -209,7 +209,7 @@ void swap_cgroup_swapoff(int type)
unsigned long i, length;
struct swap_cgroup_ctrl *ctrl;

- if (!do_swap_account)
+ if (cgroup_memory_noswap)
return;

mutex_lock(&swap_cgroup_mutex);
--
2.26.0

2020-04-20 22:14:02

by Johannes Weiner

Subject: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation

Right now, users that are otherwise memory controlled can easily
escape their containment and allocate significant amounts of memory
that they're not being charged for. That's because swap readahead
pages are not being charged until somebody actually faults them into
their page table. This can be exploited with MADV_WILLNEED, which
triggers arbitrary readahead allocations without charging the pages.

There are additional problems with the delayed charging of swap pages:

1. To implement refault/workingset detection for anonymous pages, we
need to have a target LRU available at swapin time, but the LRU is
not determinable until the page has been charged.

2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
stable when the page is isolated from the LRU; otherwise, the locks
change under us. But swapcache gets charged after it's already on
the LRU, and it gets charged even when we cannot isolate it ourselves
(since charging is not exactly optional).

The previous patch ensured we always maintain cgroup ownership records
for swap pages. This patch moves the swapcache charging point from the
fault handler to swapin time to fix all of the above problems.
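
The subtlest spot is the synchronous, skip-swapcache path in
do_swap_page(): the page never enters the swap cache there, so
PageSwapCache is set temporarily around the charge purely to make
memcg consult the swap ownership records (from the hunk below):

        /* Tell memcg to use swap ownership records */
        SetPageSwapCache(page);
        err = mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false);
        ClearPageSwapCache(page);
        if (err)
                goto out_page;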

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memory.c | 15 ++++++---
mm/shmem.c | 14 ++++----
mm/swap_state.c | 89 ++++++++++++++++++++++++++-----------------------
mm/swapfile.c | 6 ----
4 files changed, 67 insertions(+), 57 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3fa379d9b17d..5d266532fc40 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3127,9 +3127,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
vmf->address);
if (page) {
+ int err;
+
__SetPageLocked(page);
__SetPageSwapBacked(page);
set_page_private(page, entry.val);
+
+ /* Tell memcg to use swap ownership records */
+ SetPageSwapCache(page);
+ err = mem_cgroup_charge(page, vma->vm_mm,
+ GFP_KERNEL, false);
+ ClearPageSwapCache(page);
+ if (err)
+ goto out_page;
+
lru_cache_add_anon(page);
swap_readpage(page, true);
}
@@ -3191,10 +3202,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_page;
}

- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) {
- ret = VM_FAULT_OOM;
- goto out_page;
- }
cgroup_throttle_swaprate(page, GFP_KERNEL);

/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 363bd11eba85..966f150a4823 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -623,13 +623,15 @@ static int shmem_add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = index;

- error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
- if (error) {
- if (!PageSwapCache(page) && PageTransHuge(page)) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ if (!PageSwapCache(page)) {
+ error = mem_cgroup_charge(page, charge_mm, gfp, false);
+ if (error) {
+ if (PageTransHuge(page)) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+ goto error;
}
- goto error;
}
cgroup_throttle_swaprate(page, gfp);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index ebed37bbf7a3..f3b9073bfff3 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -360,12 +360,13 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
bool *new_page_allocated)
{
- struct page *found_page = NULL, *new_page = NULL;
struct swap_info_struct *si;
- int err;
+ struct page *page;
+
*new_page_allocated = false;

- do {
+ for (;;) {
+ int err;
/*
* First check the swap cache. Since this is normally
* called after lookup_swap_cache() failed, re-calling
@@ -373,12 +374,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
si = get_swap_device(entry);
if (!si)
- break;
- found_page = find_get_page(swap_address_space(entry),
- swp_offset(entry));
+ return NULL;
+ page = find_get_page(swap_address_space(entry),
+ swp_offset(entry));
put_swap_device(si);
- if (found_page)
- break;
+ if (page)
+ return page;

/*
* Just skip read ahead for unused swap slot.
@@ -389,21 +390,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
* else swap_off will be aborted if we return NULL.
*/
if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
- break;
-
- /*
- * Get a new page to read into from swap.
- */
- if (!new_page) {
- new_page = alloc_page_vma(gfp_mask, vma, addr);
- if (!new_page)
- break; /* Out of memory */
- }
+ return NULL;

/*
* Swap entry may have been freed since our caller observed it.
*/
err = swapcache_prepare(entry);
+ if (!err)
+ break;
+
if (err == -EEXIST) {
/*
* We might race against get_swap_page() and stumble
@@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
*/
cond_resched();
continue;
- } else if (err) /* swp entry is obsolete ? */
- break;
-
- /* May fail (-ENOMEM) if XArray node allocation failed. */
- __SetPageLocked(new_page);
- __SetPageSwapBacked(new_page);
- err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
- if (likely(!err)) {
- /* Initiate read into locked page */
- SetPageWorkingset(new_page);
- lru_cache_add_anon(new_page);
- *new_page_allocated = true;
- return new_page;
}
- __ClearPageLocked(new_page);
- /*
- * add_to_swap_cache() doesn't return -EEXIST, so we can safely
- * clear SWAP_HAS_CACHE flag.
- */
- put_swap_page(new_page, entry);
- } while (err != -ENOMEM);
+ if (err) /* swp entry is obsolete ? */
+ return NULL;
+ }
+
+ /*
+ * The swap entry is ours to swap in. Prepare a new page.
+ */
+
+ page = alloc_page_vma(gfp_mask, vma, addr);
+ if (!page)
+ goto fail_free;
+
+ __SetPageLocked(page);
+ __SetPageSwapBacked(page);
+
+ /* May fail (-ENOMEM) if XArray node allocation failed. */
+ if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
+ goto fail_unlock;
+
+ if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
+ goto fail_delete;
+
+ /* Initiate read into locked page */
+ SetPageWorkingset(page);
+ lru_cache_add_anon(page);
+ *new_page_allocated = true;
+ return page;

- if (new_page)
- put_page(new_page);
- return found_page;
+fail_delete:
+ delete_from_swap_cache(page);
+fail_unlock:
+ unlock_page(page);
+ put_page(page);
+fail_free:
+ swap_free(entry);
+ return NULL;
}

/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 08140aed9258..e41074848f25 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1863,11 +1863,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(!page))
return -ENOMEM;

- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) {
- ret = -ENOMEM;
- goto out_nolock;
- }
-
pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
ret = 0;
@@ -1893,7 +1888,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
activate_page(page);
out:
pte_unmap_unlock(pte, ptl);
-out_nolock:
if (page != swapcache) {
unlock_page(page);
put_page(page);
--
2.26.0

2020-04-20 22:14:21

by Johannes Weiner

Subject: [PATCH 17/18] mm: memcontrol: delete unused lrucare handling

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 5 ++--
kernel/events/uprobes.c | 3 +-
mm/filemap.c | 2 +-
mm/huge_memory.c | 7 ++---
mm/khugepaged.c | 4 +--
mm/memcontrol.c | 57 +++-----------------------------------
mm/memory.c | 8 +++---
mm/migrate.c | 2 +-
mm/shmem.c | 2 +-
mm/swap_state.c | 2 +-
mm/userfaultfd.c | 2 +-
11 files changed, 21 insertions(+), 73 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d458f1d90aa4..4b868e5a687f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -357,8 +357,7 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
struct mem_cgroup *memcg);

-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
- bool lrucare);
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask);

void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);
@@ -839,7 +838,7 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
}

static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, bool lrucare)
+ gfp_t gfp_mask)
{
return 0;
}
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4253c153e985..eddc8db96027 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -167,8 +167,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
addr + PAGE_SIZE);

if (new_page) {
- err = mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL,
- false);
+ err = mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL);
if (err)
return err;
}
diff --git a/mm/filemap.c b/mm/filemap.c
index a10bd6696049..f73b221314df 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -845,7 +845,7 @@ static int __add_to_page_cache_locked(struct page *page,
page->index = offset;

if (!huge) {
- error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
+ error = mem_cgroup_charge(page, current->mm, gfp_mask);
if (error)
goto error;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0b33eaf0740a..35a716720e26 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -593,7 +593,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_charge(page, vma->vm_mm, gfp, false)) {
+ if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
count_vm_event(THP_FAULT_FALLBACK_CHARGE);
@@ -1276,7 +1276,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
vmf->address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_charge(pages[i], vma->vm_mm,
- GFP_KERNEL, false))) {
+ GFP_KERNEL))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0)
@@ -1430,8 +1430,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
goto out;
}

- if (unlikely(mem_cgroup_charge(new_page, vma->vm_mm, huge_gfp,
- false))) {
+ if (unlikely(mem_cgroup_charge(new_page, vma->vm_mm, huge_gfp))) {
put_page(new_page);
split_huge_pmd(vma, vmf->pmd, vmf->address);
if (page)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 5cf8082fb038..28c6d84db4ee 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -973,7 +973,7 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out_nolock;
}

- if (unlikely(mem_cgroup_charge(new_page, mm, gfp, false))) {
+ if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out_nolock;
}
@@ -1527,7 +1527,7 @@ static void collapse_file(struct mm_struct *mm,
goto out;
}

- if (unlikely(mem_cgroup_charge(new_page, mm, gfp, false))) {
+ if (unlikely(mem_cgroup_charge(new_page, mm, gfp))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1d7408a8744a..a8cce52b6b4d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2601,51 +2601,9 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
css_put_many(&memcg->css, nr_pages);
}

-static void lock_page_lru(struct page *page, int *isolated)
+static void commit_charge(struct page *page, struct mem_cgroup *memcg)
{
- pg_data_t *pgdat = page_pgdat(page);
-
- spin_lock_irq(&pgdat->lru_lock);
- if (PageLRU(page)) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
- ClearPageLRU(page);
- del_page_from_lru_list(page, lruvec, page_lru(page));
- *isolated = 1;
- } else
- *isolated = 0;
-}
-
-static void unlock_page_lru(struct page *page, int isolated)
-{
- pg_data_t *pgdat = page_pgdat(page);
-
- if (isolated) {
- struct lruvec *lruvec;
-
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
- VM_BUG_ON_PAGE(PageLRU(page), page);
- SetPageLRU(page);
- add_page_to_lru_list(page, lruvec, page_lru(page));
- }
- spin_unlock_irq(&pgdat->lru_lock);
-}
-
-static void commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare)
-{
- int isolated;
-
VM_BUG_ON_PAGE(page->mem_cgroup, page);
-
- /*
- * In some cases, SwapCache and FUSE(splice_buf->radixtree), the page
- * may already be on some other mem_cgroup's LRU. Take care of it.
- */
- if (lrucare)
- lock_page_lru(page, &isolated);
-
/*
* Nobody should be changing or seriously looking at
* page->mem_cgroup at this point:
@@ -2661,9 +2619,6 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg,
* have the page locked
*/
page->mem_cgroup = memcg;
-
- if (lrucare)
- unlock_page_lru(page, isolated);
}

#ifdef CONFIG_MEMCG_KMEM
@@ -6433,22 +6388,18 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
* @page: page to charge
* @mm: mm context of the victim
* @gfp_mask: reclaim mode
- * @lrucare: page might be on the LRU already
*
* Try to charge @page to the memcg that @mm belongs to, reclaiming
* pages according to @gfp_mask if necessary.
*
* Returns 0 on success. Otherwise, an error code is returned.
*/
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
- bool lrucare)
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
unsigned int nr_pages = hpage_nr_pages(page);
struct mem_cgroup *memcg = NULL;
int ret = 0;

- VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
-
if (mem_cgroup_disabled())
goto out;

@@ -6482,7 +6433,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
if (ret)
goto out_put;

- commit_charge(page, memcg, lrucare);
+ commit_charge(page, memcg);

local_irq_disable();
mem_cgroup_charge_statistics(memcg, page, nr_pages);
@@ -6685,7 +6636,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
page_counter_charge(&memcg->memsw, nr_pages);
css_get_many(&memcg->css, nr_pages);

- commit_charge(newpage, memcg, false);
+ commit_charge(newpage, memcg);

local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
diff --git a/mm/memory.c b/mm/memory.c
index 5d266532fc40..0ad4db56bea2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2677,7 +2677,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
}

- if (mem_cgroup_charge(new_page, mm, GFP_KERNEL, false))
+ if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
goto oom_free_new;
cgroup_throttle_swaprate(new_page, GFP_KERNEL);

@@ -3136,7 +3136,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* Tell memcg to use swap ownership records */
SetPageSwapCache(page);
err = mem_cgroup_charge(page, vma->vm_mm,
- GFP_KERNEL, false);
+ GFP_KERNEL);
ClearPageSwapCache(page);
if (err)
goto out_page;
@@ -3360,7 +3360,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!page)
goto oom;

- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false))
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
goto oom_free_page;
cgroup_throttle_swaprate(page, GFP_KERNEL);

@@ -3856,7 +3856,7 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
if (!vmf->cow_page)
return VM_FAULT_OOM;

- if (mem_cgroup_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL, false)) {
+ if (mem_cgroup_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index a3361c744069..ced652d069ee 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2792,7 +2792,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

if (unlikely(anon_vma_prepare(vma)))
goto abort;
- if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false))
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL))
goto abort;

/*
diff --git a/mm/shmem.c b/mm/shmem.c
index 966f150a4823..add10d448bc6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -624,7 +624,7 @@ static int shmem_add_to_page_cache(struct page *page,
page->index = index;

if (!PageSwapCache(page)) {
- error = mem_cgroup_charge(page, charge_mm, gfp, false);
+ error = mem_cgroup_charge(page, charge_mm, gfp);
if (error) {
if (PageTransHuge(page)) {
count_vm_event(THP_FILE_FALLBACK);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f3b9073bfff3..26fded65c30d 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -427,7 +427,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
goto fail_unlock;

- if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
+ if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL))
goto fail_delete;

/* Initiate read into locked page */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2745489415cc..7f5194046b01 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -96,7 +96,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
__SetPageUptodate(page);

ret = -ENOMEM;
- if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL, false))
+ if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL))
goto out_release;

_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
--
2.26.0

2020-04-20 22:14:21

by Johannes Weiner

Subject: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control

Without swap page tracking, users that are otherwise memory controlled
can easily escape their containment and allocate significant amounts
of memory that they're not being charged for. That's because swap does
readahead, but without the cgroup records of who owned the page at
swapout, readahead pages don't get charged until somebody actually
faults them into their page table and we can identify an owner task.
This can be maliciously exploited with MADV_WILLNEED, which triggers
arbitrary readahead allocations without charging the pages.

Make swap page tracking an integral part of memcg and remove the
Kconfig options. In the first place, it was only made configurable to
allow users to save some memory. But the overhead of tracking cgroup
ownership per swap page is minimal - 2 bytes per page, or 512k per 1G
of swap, or 0.04%. Saving that at the expense of broken containment
semantics is not something we should present as a coequal option.

The swapaccount=0 boot option will continue to exist, and it will
eliminate the page_counter overhead and hide the swap control files,
but it won't disable swap slot ownership tracking.

This patch makes sure we always have the cgroup records at swapin
time; the next patch will fix the actual bug by charging readahead
swap pages at swapin time rather than at fault time.

Signed-off-by: Johannes Weiner <[email protected]>
---
init/Kconfig | 17 +----------------
mm/memcontrol.c | 48 +++++++++++++++++-------------------------------
mm/swap_cgroup.c | 6 ------
3 files changed, 18 insertions(+), 53 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 9e22ee8fbd75..39cdb13168cf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -835,24 +835,9 @@ config MEMCG
Provides control over the memory footprint of tasks in a cgroup.

config MEMCG_SWAP
- bool "Swap controller"
+ bool
depends on MEMCG && SWAP
- help
- Provides control over the swap space consumed by tasks in a cgroup.
-
-config MEMCG_SWAP_ENABLED
- bool "Swap controller enabled by default"
- depends on MEMCG_SWAP
default y
- help
- Memory Resource Controller Swap Extension comes with its price in
- a bigger memory consumption. General purpose distribution kernels
- which want to enable the feature but keep it disabled by default
- and let the user enable it by swapaccount=1 boot command line
- parameter should have this option unselected.
- For those who want to have the feature enabled by default should
- select this option (if, for some reason, they need to disable it
- then swapaccount=0 does the trick).

config MEMCG_KMEM
bool
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5558777023e7..1d7408a8744a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,14 +83,10 @@ static bool cgroup_memory_nokmem;

/* Whether the swap controller is active */
#ifdef CONFIG_MEMCG_SWAP
-#ifdef CONFIG_MEMCG_SWAP_ENABLED
bool cgroup_memory_noswap __read_mostly;
#else
-bool cgroup_memory_noswap __read_mostly = 1;
-#endif /* CONFIG_MEMCG_SWAP_ENABLED */
-#else
#define cgroup_memory_noswap 1
-#endif /* CONFIG_MEMCG_SWAP */
+#endif

#ifdef CONFIG_CGROUP_WRITEBACK
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
@@ -5290,8 +5286,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
* we call find_get_page() with swapper_space directly.
*/
page = find_get_page(swap_address_space(ent), swp_offset(ent));
- if (do_memsw_account())
- entry->val = ent.val;
+ entry->val = ent.val;

return page;
}
@@ -5325,8 +5320,7 @@ static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
page = find_get_entry(mapping, pgoff);
if (xa_is_value(page)) {
swp_entry_t swp = radix_to_swp_entry(page);
- if (do_memsw_account())
- *entry = swp;
+ *entry = swp;
page = find_get_page(swap_address_space(swp),
swp_offset(swp));
}
@@ -6459,6 +6453,9 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
goto out;

if (PageSwapCache(page)) {
+ swp_entry_t ent = { .val = page_private(page), };
+ unsigned short id;
+
/*
* Every swap fault against a single page tries to charge the
* page, bail as early as possible. shmem_unuse() encounters
@@ -6470,17 +6467,12 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
if (compound_head(page)->mem_cgroup)
goto out;

- if (!cgroup_memory_noswap) {
- swp_entry_t ent = { .val = page_private(page), };
- unsigned short id;
-
- id = lookup_swap_cgroup_id(ent);
- rcu_read_lock();
- memcg = mem_cgroup_from_id(id);
- if (memcg && !css_tryget_online(&memcg->css))
- memcg = NULL;
- rcu_read_unlock();
- }
+ id = lookup_swap_cgroup_id(ent);
+ rcu_read_lock();
+ memcg = mem_cgroup_from_id(id);
+ if (memcg && !css_tryget_online(&memcg->css))
+ memcg = NULL;
+ rcu_read_unlock();
}

if (!memcg)
@@ -6497,7 +6489,7 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
memcg_check_events(memcg, page);
local_irq_enable();

- if (do_memsw_account() && PageSwapCache(page)) {
+ if (PageSwapCache(page)) {
swp_entry_t entry = { .val = page_private(page) };
/*
* The swap entry might not get freed for a long time,
@@ -6884,9 +6876,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);

- if (!do_memsw_account())
- return;
-
memcg = page->mem_cgroup;

/* Readahead page, never charged */
@@ -6913,7 +6902,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
if (!mem_cgroup_is_root(memcg))
page_counter_uncharge(&memcg->memory, nr_entries);

- if (memcg != swap_memcg) {
+ if (do_memsw_account() && memcg != swap_memcg) {
if (!mem_cgroup_is_root(swap_memcg))
page_counter_charge(&swap_memcg->memsw, nr_entries);
page_counter_uncharge(&memcg->memsw, nr_entries);
@@ -6949,7 +6938,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
struct mem_cgroup *memcg;
unsigned short oldid;

- if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) || cgroup_memory_noswap)
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return 0;

memcg = page->mem_cgroup;
@@ -6965,7 +6954,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)

memcg = mem_cgroup_id_get_online(memcg);

- if (!mem_cgroup_is_root(memcg) &&
+ if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg) &&
!page_counter_try_charge(&memcg->swap, nr_pages, &counter)) {
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
@@ -6993,14 +6982,11 @@ void mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages)
struct mem_cgroup *memcg;
unsigned short id;

- if (cgroup_memory_noswap)
- return;
-
id = swap_cgroup_record(entry, 0, nr_pages);
rcu_read_lock();
memcg = mem_cgroup_from_id(id);
if (memcg) {
- if (!mem_cgroup_is_root(memcg)) {
+ if (!cgroup_memory_noswap && !mem_cgroup_is_root(memcg)) {
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->swap, nr_pages);
else
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 7aa764f09079..7f34343c075a 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -171,9 +171,6 @@ int swap_cgroup_swapon(int type, unsigned long max_pages)
unsigned long length;
struct swap_cgroup_ctrl *ctrl;

- if (cgroup_memory_noswap)
- return 0;
-
length = DIV_ROUND_UP(max_pages, SC_PER_PAGE);
array_size = length * sizeof(void *);

@@ -209,9 +206,6 @@ void swap_cgroup_swapoff(int type)
unsigned long i, length;
struct swap_cgroup_ctrl *ctrl;

- if (cgroup_memory_noswap)
- return;
-
mutex_lock(&swap_cgroup_mutex);
ctrl = &swap_cgroup_ctrl[type];
map = ctrl->map;
--
2.26.0

2020-04-20 22:14:33

by Johannes Weiner

Subject: [PATCH 08/18] mm: memcontrol: prepare cgroup vmstat infrastructure for native anon counters

Anonymous compound pages can be mapped by ptes, which means that if we
want to track NR_ANON_MAPPED, NR_ANON_THPS on a per-cgroup basis, we
have to be prepared to see tail pages in our accounting functions.

Make mod_lruvec_page_state() and lock_page_memcg() deal with tail
pages correctly, namely by redirecting to the head page which has the
page->mem_cgroup set up.
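
For example (illustrative only - the per-cgroup NR_ANON_MAPPED
accounting itself is added by a later patch in the series), rmap on a
pte-mapped THP may hand the accounting helpers a tail page:

        /* @page may be a tail page of a pte-mapped THP */
        __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr_pages);

Only the head page has page->mem_cgroup set up, which is why the
redirect has to happen inside the helpers themselves.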

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 5 +++--
mm/memcontrol.c | 9 ++++++---
2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5e8b0e38f145..5a1b5a7b7728 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -711,16 +711,17 @@ static inline void mod_lruvec_state(struct lruvec *lruvec,
static inline void __mod_lruvec_page_state(struct page *page,
enum node_stat_item idx, int val)
{
+ struct page *head = compound_head(page); /* rmap on tail pages */
pg_data_t *pgdat = page_pgdat(page);
struct lruvec *lruvec;

/* Untracked pages have no memcg, no lruvec. Update only the node */
- if (!page->mem_cgroup) {
+ if (!head->mem_cgroup) {
__mod_node_page_state(pgdat, idx, val);
return;
}

- lruvec = mem_cgroup_lruvec(page->mem_cgroup, pgdat);
+ lruvec = mem_cgroup_lruvec(head->mem_cgroup, pgdat);
__mod_lruvec_state(lruvec, idx, val);
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ac6f2b073a5a..e9e22c86a118 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1979,6 +1979,7 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
*/
struct mem_cgroup *lock_page_memcg(struct page *page)
{
+ struct page *head = compound_head(page); /* rmap on tail pages */
struct mem_cgroup *memcg;
unsigned long flags;

@@ -1998,7 +1999,7 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
if (mem_cgroup_disabled())
return NULL;
again:
- memcg = page->mem_cgroup;
+ memcg = head->mem_cgroup;
if (unlikely(!memcg))
return NULL;

@@ -2006,7 +2007,7 @@ struct mem_cgroup *lock_page_memcg(struct page *page)
return memcg;

spin_lock_irqsave(&memcg->move_lock, flags);
- if (memcg != page->mem_cgroup) {
+ if (memcg != head->mem_cgroup) {
spin_unlock_irqrestore(&memcg->move_lock, flags);
goto again;
}
@@ -2049,7 +2050,9 @@ void __unlock_page_memcg(struct mem_cgroup *memcg)
*/
void unlock_page_memcg(struct page *page)
{
- __unlock_page_memcg(page->mem_cgroup);
+ struct page *head = compound_head(page);
+
+ __unlock_page_memcg(head->mem_cgroup);
}
EXPORT_SYMBOL(unlock_page_memcg);

--
2.26.0

2020-04-20 22:14:37

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 18/18] mm: memcontrol: update page->mem_cgroup stability rules

The previous patches have simplified the access rules around
page->mem_cgroup somewhat:

1. We never change page->mem_cgroup while the page is isolated by
somebody else. This was by far the biggest exception to our rules
and it didn't stop at lock_page() or lock_page_memcg().

2. We charge pages before they get put into page tables now, so the
somewhat fishy rule about "can be in page table as long as it's
still locked" is now gone and boiled down to having an exclusive
reference to the page.

Document the new rules. Any of the following will stabilize the
page->mem_cgroup association (a short usage sketch follows the list):

- the page lock
- LRU isolation
- lock_page_memcg()
- exclusive access to the page
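
For illustration only, a usage sketch (the helper name is made up): a
context that holds none of the page lock, LRU isolation or an
exclusive reference can still pin the binding with lock_page_memcg()
while updating a cgroup-aware counter.

#include <linux/memcontrol.h>
#include <linux/mm.h>

/* Hypothetical helper: bump a cgroup-aware stat under a stable binding. */
static void account_page_dirtied_sketch(struct page *page)
{
        lock_page_memcg(page);
        /* page->mem_cgroup cannot change until unlock_page_memcg() */
        mod_lruvec_page_state(page, NR_FILE_DIRTY, 1);
        unlock_page_memcg(page);
}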

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 21 +++++++--------------
1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a8cce52b6b4d..7b63260c9b57 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1201,9 +1201,8 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
* @page: the page
* @pgdat: pgdat of the page
*
- * This function is only safe when following the LRU page isolation
- * and putback protocol: the LRU lock must be held, and the page must
- * either be PageLRU() or the caller must have isolated/allocated it.
+ * This function relies on page->mem_cgroup being stable - see the
+ * access rules in commit_charge().
*/
struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
{
@@ -2605,18 +2604,12 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
{
VM_BUG_ON_PAGE(page->mem_cgroup, page);
/*
- * Nobody should be changing or seriously looking at
- * page->mem_cgroup at this point:
- *
- * - the page is uncharged
- *
- * - the page is off-LRU
- *
- * - an anonymous fault has exclusive page access, except for
- * a locked page table
+ * Any of the following ensures page->mem_cgroup stability:
*
- * - a page cache insertion, a swapin fault, or a migration
- * have the page locked
+ * - the page lock
+ * - LRU isolation
+ * - lock_page_memcg()
+ * - exclusive reference
*/
page->mem_cgroup = memcg;
}
--
2.26.0

2020-04-20 22:14:45

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 09/18] mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters

Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
divergence from the generic VM accounting means unnecessary code
overhead, and creates a dependency for memcg that page->mapping is set
up at the time of charging, so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and
friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES
and NR_SHMEM. The page is already locked in these places, so
page->mem_cgroup is stable; we only need minimal tweaks of two
mem_cgroup_migrate() calls to ensure it's set up in time.

Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
NR_SHMEM accounting sites.
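
For illustration only (hypothetical helpers, not in the patch): with
the conversion done, per-cgroup "file" and "shmem" numbers come
straight from the generic vmstat counters, e.g.:

#include <linux/memcontrol.h>
#include <linux/mm.h>

static u64 memcg_file_bytes(struct mem_cgroup *memcg)
{
        return (u64)memcg_page_state(memcg, NR_FILE_PAGES) * PAGE_SIZE;
}

static u64 memcg_shmem_bytes(struct mem_cgroup *memcg)
{
        return (u64)memcg_page_state(memcg, NR_SHMEM) * PAGE_SIZE;
}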

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 3 +--
mm/filemap.c | 17 +++++++++--------
mm/khugepaged.c | 16 +++++++++++-----
mm/memcontrol.c | 28 +++++++++++-----------------
mm/migrate.c | 15 +++++++++++----
mm/shmem.c | 14 +++++++-------
6 files changed, 50 insertions(+), 43 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 5a1b5a7b7728..c44aa1ccf553 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,8 +29,7 @@ struct kmem_cache;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
- MEMCG_CACHE = NR_VM_NODE_STAT_ITEMS,
- MEMCG_RSS,
+ MEMCG_RSS = NR_VM_NODE_STAT_ITEMS,
MEMCG_RSS_HUGE,
MEMCG_SWAP,
MEMCG_SOCK,
diff --git a/mm/filemap.c b/mm/filemap.c
index 5bdbda965177..f4592ff3ca8b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -199,9 +199,9 @@ static void unaccount_page_cache_page(struct address_space *mapping,

nr = hpage_nr_pages(page);

- __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
+ __mod_lruvec_page_state(page, NR_FILE_PAGES, -nr);
if (PageSwapBacked(page)) {
- __mod_node_page_state(page_pgdat(page), NR_SHMEM, -nr);
+ __mod_lruvec_page_state(page, NR_SHMEM, -nr);
if (PageTransHuge(page))
__dec_node_page_state(page, NR_SHMEM_THPS);
} else if (PageTransHuge(page)) {
@@ -802,21 +802,22 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
new->mapping = mapping;
new->index = offset;

+ mem_cgroup_migrate(old, new);
+
xas_lock_irqsave(&xas, flags);
xas_store(&xas, new);

old->mapping = NULL;
/* hugetlb pages do not participate in page cache accounting. */
if (!PageHuge(old))
- __dec_node_page_state(old, NR_FILE_PAGES);
+ __dec_lruvec_page_state(old, NR_FILE_PAGES);
if (!PageHuge(new))
- __inc_node_page_state(new, NR_FILE_PAGES);
+ __inc_lruvec_page_state(new, NR_FILE_PAGES);
if (PageSwapBacked(old))
- __dec_node_page_state(old, NR_SHMEM);
+ __dec_lruvec_page_state(old, NR_SHMEM);
if (PageSwapBacked(new))
- __inc_node_page_state(new, NR_SHMEM);
+ __inc_lruvec_page_state(new, NR_SHMEM);
xas_unlock_irqrestore(&xas, flags);
- mem_cgroup_migrate(old, new);
if (freepage)
freepage(old);
put_page(old);
@@ -867,7 +868,7 @@ static int __add_to_page_cache_locked(struct page *page,

/* hugetlb pages do not participate in page cache accounting */
if (!huge)
- __inc_node_page_state(page, NR_FILE_PAGES);
+ __inc_lruvec_page_state(page, NR_FILE_PAGES);
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 46f9b565e8d5..ee2ef4b8e828 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1740,12 +1740,18 @@ static void collapse_file(struct mm_struct *mm,
}

if (nr_none) {
- struct zone *zone = page_zone(new_page);
-
- __mod_node_page_state(zone->zone_pgdat, NR_FILE_PAGES, nr_none);
+ struct lruvec *lruvec;
+ /*
+ * XXX: We have started try_charge and pinned the
+ * memcg, but the page isn't committed yet so we
+ * cannot use mod_lruvec_page_state(). This hackery
+ * will be cleaned up when we remove the page->mapping
+ * dependency from memcg and fully charge above.
+ */
+ lruvec = mem_cgroup_lruvec(memcg, page_pgdat(new_page));
+ __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_none);
if (is_shmem)
- __mod_node_page_state(zone->zone_pgdat,
- NR_SHMEM, nr_none);
+ __mod_lruvec_state(lruvec, NR_SHMEM, nr_none);
}

xa_locked:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e9e22c86a118..7e77166cf10b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -842,11 +842,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
*/
if (PageAnon(page))
__mod_memcg_state(memcg, MEMCG_RSS, nr_pages);
- else {
- __mod_memcg_state(memcg, MEMCG_CACHE, nr_pages);
- if (PageSwapBacked(page))
- __mod_memcg_state(memcg, NR_SHMEM, nr_pages);
- }

if (abs(nr_pages) > 1) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
@@ -1392,7 +1387,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
(u64)memcg_page_state(memcg, MEMCG_RSS) *
PAGE_SIZE);
seq_buf_printf(&s, "file %llu\n",
- (u64)memcg_page_state(memcg, MEMCG_CACHE) *
+ (u64)memcg_page_state(memcg, NR_FILE_PAGES) *
PAGE_SIZE);
seq_buf_printf(&s, "kernel_stack %llu\n",
(u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
@@ -3302,7 +3297,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
unsigned long val;

if (mem_cgroup_is_root(memcg)) {
- val = memcg_page_state(memcg, MEMCG_CACHE) +
+ val = memcg_page_state(memcg, NR_FILE_PAGES) +
memcg_page_state(memcg, MEMCG_RSS);
if (swap)
val += memcg_page_state(memcg, MEMCG_SWAP);
@@ -3772,7 +3767,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
#endif /* CONFIG_NUMA */

static const unsigned int memcg1_stats[] = {
- MEMCG_CACHE,
+ NR_FILE_PAGES,
MEMCG_RSS,
MEMCG_RSS_HUGE,
NR_SHMEM,
@@ -5401,6 +5396,14 @@ static int mem_cgroup_move_account(struct page *page,
lock_page_memcg(page);

if (!PageAnon(page)) {
+ __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
+
+ if (PageSwapBacked(page)) {
+ __mod_lruvec_state(from_vec, NR_SHMEM, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_SHMEM, nr_pages);
+ }
+
if (page_mapped(page)) {
__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
@@ -6613,10 +6616,8 @@ struct uncharge_gather {
unsigned long nr_pages;
unsigned long pgpgout;
unsigned long nr_anon;
- unsigned long nr_file;
unsigned long nr_kmem;
unsigned long nr_huge;
- unsigned long nr_shmem;
struct page *dummy_page;
};

@@ -6640,9 +6641,7 @@ static void uncharge_batch(const struct uncharge_gather *ug)

local_irq_save(flags);
__mod_memcg_state(ug->memcg, MEMCG_RSS, -ug->nr_anon);
- __mod_memcg_state(ug->memcg, MEMCG_CACHE, -ug->nr_file);
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
- __mod_memcg_state(ug->memcg, NR_SHMEM, -ug->nr_shmem);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
@@ -6685,11 +6684,6 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
ug->nr_huge += nr_pages;
if (PageAnon(page))
ug->nr_anon += nr_pages;
- else {
- ug->nr_file += nr_pages;
- if (PageSwapBacked(page))
- ug->nr_shmem += nr_pages;
- }
ug->pgpgout++;
} else {
ug->nr_kmem += nr_pages;
diff --git a/mm/migrate.c b/mm/migrate.c
index 5dd50128568c..14a584c52782 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -490,11 +490,18 @@ int migrate_page_move_mapping(struct address_space *mapping,
* are mapped to swap space.
*/
if (newzone != oldzone) {
- __dec_node_state(oldzone->zone_pgdat, NR_FILE_PAGES);
- __inc_node_state(newzone->zone_pgdat, NR_FILE_PAGES);
+ struct lruvec *old_lruvec, *new_lruvec;
+ struct mem_cgroup *memcg;
+
+ memcg = page_memcg(page);
+ old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
+ new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
+
+ __dec_lruvec_state(old_lruvec, NR_FILE_PAGES);
+ __inc_lruvec_state(new_lruvec, NR_FILE_PAGES);
if (PageSwapBacked(page) && !PageSwapCache(page)) {
- __dec_node_state(oldzone->zone_pgdat, NR_SHMEM);
- __inc_node_state(newzone->zone_pgdat, NR_SHMEM);
+ __dec_lruvec_state(old_lruvec, NR_SHMEM);
+ __inc_lruvec_state(new_lruvec, NR_SHMEM);
}
if (dirty && mapping_cap_account_dirty(mapping)) {
__dec_node_state(oldzone->zone_pgdat, NR_FILE_DIRTY);
diff --git a/mm/shmem.c b/mm/shmem.c
index 2384f6c7ef71..363bd11eba85 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -653,8 +653,8 @@ static int shmem_add_to_page_cache(struct page *page,
__inc_node_page_state(page, NR_SHMEM_THPS);
}
mapping->nrpages += nr;
- __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, nr);
- __mod_node_page_state(page_pgdat(page), NR_SHMEM, nr);
+ __mod_lruvec_page_state(page, NR_FILE_PAGES, nr);
+ __mod_lruvec_page_state(page, NR_SHMEM, nr);
unlock:
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp));
@@ -685,8 +685,8 @@ static void shmem_delete_from_page_cache(struct page *page, void *radswap)
error = shmem_replace_entry(mapping, page->index, page, radswap);
page->mapping = NULL;
mapping->nrpages--;
- __dec_node_page_state(page, NR_FILE_PAGES);
- __dec_node_page_state(page, NR_SHMEM);
+ __dec_lruvec_page_state(page, NR_FILE_PAGES);
+ __dec_lruvec_page_state(page, NR_SHMEM);
xa_unlock_irq(&mapping->i_pages);
put_page(page);
BUG_ON(error);
@@ -1593,8 +1593,9 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
xa_lock_irq(&swap_mapping->i_pages);
error = shmem_replace_entry(swap_mapping, swap_index, oldpage, newpage);
if (!error) {
- __inc_node_page_state(newpage, NR_FILE_PAGES);
- __dec_node_page_state(oldpage, NR_FILE_PAGES);
+ mem_cgroup_migrate(oldpage, newpage);
+ __inc_lruvec_page_state(newpage, NR_FILE_PAGES);
+ __dec_lruvec_page_state(oldpage, NR_FILE_PAGES);
}
xa_unlock_irq(&swap_mapping->i_pages);

@@ -1606,7 +1607,6 @@ static int shmem_replace_page(struct page **pagep, gfp_t gfp,
*/
oldpage = newpage;
} else {
- mem_cgroup_migrate(oldpage, newpage);
lru_cache_add_anon(newpage);
*pagep = newpage;
}
--
2.26.0

2020-04-20 22:15:26

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 11/18] mm: memcontrol: switch to native NR_ANON_THPS counter

With rmap memcg locking already in place for NR_ANON_MAPPED, it's just
a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
native NR_ANON_THPS accounting sites.
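
For illustration only (hypothetical helper, assuming
CONFIG_TRANSPARENT_HUGEPAGE): NR_ANON_THPS counts huge pages, not base
pages, so a byte value needs the HPAGE_PMD_NR scaling used in the stat
output below.

#include <linux/huge_mm.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

static u64 memcg_anon_thp_bytes(struct mem_cgroup *memcg)
{
        return (u64)memcg_page_state(memcg, NR_ANON_THPS) *
                HPAGE_PMD_NR * PAGE_SIZE;
}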

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 3 +--
mm/huge_memory.c | 4 +++-
mm/memcontrol.c | 39 ++++++++++++++++----------------------
mm/rmap.c | 6 +++---
4 files changed, 23 insertions(+), 29 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfb1d961e346..9ac8122ec1cd 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,8 +29,7 @@ struct kmem_cache;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
- MEMCG_RSS_HUGE = NR_VM_NODE_STAT_ITEMS,
- MEMCG_SWAP,
+ MEMCG_SWAP = NR_VM_NODE_STAT_ITEMS,
MEMCG_SOCK,
/* XXX: why are these zone and not node counters? */
MEMCG_KERNEL_STACK_KB,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9355a463e74..da6c413a75a5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2359,15 +2359,17 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
atomic_inc(&page[i]._mapcount);
}

+ lock_page_memcg(page);
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
- __dec_node_page_state(page, NR_ANON_THPS);
+ __dec_lruvec_page_state(page, NR_ANON_THPS);
if (TestClearPageDoubleMap(page)) {
/* No need in mapcount reference anymore */
for (i = 0; i < HPAGE_PMD_NR; i++)
atomic_dec(&page[i]._mapcount);
}
}
+ unlock_page_memcg(page);

smp_wmb(); /* make pte visible before pmd */
pmd_populate(mm, pmd, pgtable);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c87178d6219f..7845a87b94d5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -836,11 +836,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
int nr_pages)
{
- if (abs(nr_pages) > 1) {
- VM_BUG_ON_PAGE(!PageTransHuge(page), page);
- __mod_memcg_state(memcg, MEMCG_RSS_HUGE, nr_pages);
- }
-
/* pagein of a big page is an event. So, ignore page size */
if (nr_pages > 0)
__count_memcg_events(memcg, PGPGIN, 1);
@@ -1406,15 +1401,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
(u64)memcg_page_state(memcg, NR_WRITEBACK) *
PAGE_SIZE);

- /*
- * TODO: We should eventually replace our own MEMCG_RSS_HUGE counter
- * with the NR_ANON_THP vm counter, but right now it's a pain in the
- * arse because it requires migrating the work out of rmap to a place
- * where the page->mem_cgroup is set up and stable.
- */
seq_buf_printf(&s, "anon_thp %llu\n",
- (u64)memcg_page_state(memcg, MEMCG_RSS_HUGE) *
- PAGE_SIZE);
+ (u64)memcg_page_state(memcg, NR_ANON_THPS) *
+ HPAGE_PMD_NR * PAGE_SIZE);

for (i = 0; i < NR_LRU_LISTS; i++)
seq_buf_printf(&s, "%s %llu\n", lru_list_name(i),
@@ -3006,8 +2995,6 @@ void mem_cgroup_split_huge_fixup(struct page *head)

for (i = 1; i < HPAGE_PMD_NR; i++)
head[i].mem_cgroup = head->mem_cgroup;
-
- __mod_memcg_state(head->mem_cgroup, MEMCG_RSS_HUGE, -HPAGE_PMD_NR);
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

@@ -3762,7 +3749,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
static const unsigned int memcg1_stats[] = {
NR_FILE_PAGES,
NR_ANON_MAPPED,
- MEMCG_RSS_HUGE,
+ NR_ANON_THPS,
NR_SHMEM,
NR_FILE_MAPPED,
NR_FILE_DIRTY,
@@ -3799,11 +3786,14 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));

for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
+ unsigned long nr;
+
if (memcg1_stats[i] == MEMCG_SWAP && !do_memsw_account())
continue;
- seq_printf(m, "%s %lu\n", memcg1_stat_names[i],
- memcg_page_state_local(memcg, memcg1_stats[i]) *
- PAGE_SIZE);
+ nr = memcg_page_state_local(memcg, memcg1_stats[i]);
+ if (memcg1_stats[i] == NR_ANON_THPS)
+ nr *= HPAGE_PMD_NR;
+ seq_printf(m, "%s %lu\n", memcg1_stat_names[i], nr * PAGE_SIZE);
}

for (i = 0; i < ARRAY_SIZE(memcg1_events); i++)
@@ -5392,6 +5382,13 @@ static int mem_cgroup_move_account(struct page *page,
if (page_mapped(page)) {
__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+ if (PageTransHuge(page)) {
+ __mod_lruvec_state(from_vec, NR_ANON_THPS,
+ -nr_pages);
+ __mod_lruvec_state(to_vec, NR_ANON_THPS,
+ nr_pages);
+ }
+
}
} else {
__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
@@ -6611,7 +6608,6 @@ struct uncharge_gather {
unsigned long nr_pages;
unsigned long pgpgout;
unsigned long nr_kmem;
- unsigned long nr_huge;
struct page *dummy_page;
};

@@ -6634,7 +6630,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
}

local_irq_save(flags);
- __mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
memcg_check_events(ug->memcg, ug->dummy_page);
@@ -6673,8 +6668,6 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
ug->nr_pages += nr_pages;

if (!PageKmemcg(page)) {
- if (PageTransHuge(page))
- ug->nr_huge += nr_pages;
ug->pgpgout++;
} else {
ug->nr_kmem += nr_pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index 150513d31efa..ad4a0fdcc94c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1138,7 +1138,7 @@ void do_page_add_anon_rmap(struct page *page,
* disabled.
*/
if (compound)
- __inc_node_page_state(page, NR_ANON_THPS);
+ __inc_lruvec_page_state(page, NR_ANON_THPS);
__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}

@@ -1180,7 +1180,7 @@ void page_add_new_anon_rmap(struct page *page,
if (hpage_pincount_available(page))
atomic_set(compound_pincount_ptr(page), 0);

- __inc_node_page_state(page, NR_ANON_THPS);
+ __inc_lruvec_page_state(page, NR_ANON_THPS);
} else {
/* Anon THP always mapped first with PMD */
VM_BUG_ON_PAGE(PageTransCompound(page), page);
@@ -1286,7 +1286,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
if (!IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE))
return;

- __dec_node_page_state(page, NR_ANON_THPS);
+ __dec_lruvec_page_state(page, NR_ANON_THPS);

if (TestClearPageDoubleMap(page)) {
/*
--
2.26.0

2020-04-20 22:15:40

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 03/18] mm: memcontrol: drop @compound parameter from memcg charging API

The memcg charging API carries a boolean @compound parameter that
tells whether the page we're dealing with is a hugepage.
mem_cgroup_commit_charge() has another boolean @lrucare that indicates
whether the page needs LRU locking or not while charging. The majority
of callsites know those parameters at compile time, which results in a
lot of naked "false, false" argument lists. This makes for cryptic
code and is a breeding ground for subtle mistakes.

Thankfully, the huge page state can be inferred from the page itself
and doesn't need to be passed along. This is safe because charging
completes before the page is published, so nobody can split it in the
meantime.

Simplify the callsites by removing @compound, and let memcg infer the
state by using hpage_nr_pages() unconditionally. That function does
PageTransHuge() to identify huge pages, which also helpfully asserts
that nobody passes in tail pages by accident.
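
For illustration only, a hypothetical caller after this patch (the
helper name is made up): the same two calls cover base pages and THPs,
with the size inferred from the page itself.

#include <linux/gfp.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

static int charge_new_page(struct page *page, struct mm_struct *mm)
{
        struct mem_cgroup *memcg;
        int err;

        err = mem_cgroup_try_charge(page, mm, GFP_KERNEL, &memcg);
        if (err)
                return err;

        /* ... set up page->mapping / rmap ... */

        mem_cgroup_commit_charge(page, memcg, false);
        return 0;
}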

The following patches will introduce a new charging API, best not to
carry over unnecessary weight.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 22 ++++++++--------------
kernel/events/uprobes.c | 6 +++---
mm/filemap.c | 6 +++---
mm/huge_memory.c | 23 +++++++++++------------
mm/khugepaged.c | 20 ++++++++++----------
mm/memcontrol.c | 38 +++++++++++++++-----------------------
mm/memory.c | 32 +++++++++++++++-----------------
mm/migrate.c | 6 +++---
mm/shmem.c | 22 +++++++++-------------
mm/swapfile.c | 9 ++++-----
mm/userfaultfd.c | 6 +++---
11 files changed, 84 insertions(+), 106 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 1b4150ff64be..c7875a48c8c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -361,15 +361,12 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
struct mem_cgroup *memcg);

int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool compound);
+ gfp_t gfp_mask, struct mem_cgroup **memcgp);
int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool compound);
+ gfp_t gfp_mask, struct mem_cgroup **memcgp);
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare, bool compound);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
- bool compound);
+ bool lrucare);
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);

@@ -849,8 +846,7 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(

static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask,
- struct mem_cgroup **memcgp,
- bool compound)
+ struct mem_cgroup **memcgp)
{
*memcgp = NULL;
return 0;
@@ -859,8 +855,7 @@ static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
static inline int mem_cgroup_try_charge_delay(struct page *page,
struct mm_struct *mm,
gfp_t gfp_mask,
- struct mem_cgroup **memcgp,
- bool compound)
+ struct mem_cgroup **memcgp)
{
*memcgp = NULL;
return 0;
@@ -868,13 +863,12 @@ static inline int mem_cgroup_try_charge_delay(struct page *page,

static inline void mem_cgroup_commit_charge(struct page *page,
struct mem_cgroup *memcg,
- bool lrucare, bool compound)
+ bool lrucare)
{
}

static inline void mem_cgroup_cancel_charge(struct page *page,
- struct mem_cgroup *memcg,
- bool compound)
+ struct mem_cgroup *memcg)
{
}

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index ece7e13f6e4a..40e7488ce467 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -169,7 +169,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

if (new_page) {
err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
- &memcg, false);
+ &memcg);
if (err)
return err;
}
@@ -181,7 +181,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
err = -EAGAIN;
if (!page_vma_mapped_walk(&pvmw)) {
if (new_page)
- mem_cgroup_cancel_charge(new_page, memcg, false);
+ mem_cgroup_cancel_charge(new_page, memcg);
goto unlock;
}
VM_BUG_ON_PAGE(addr != pvmw.address, old_page);
@@ -189,7 +189,7 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
if (new_page) {
get_page(new_page);
page_add_new_anon_rmap(new_page, vma, addr, false);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
+ mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/filemap.c b/mm/filemap.c
index 49e3b5da0216..5b31af9d5b1b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -842,7 +842,7 @@ static int __add_to_page_cache_locked(struct page *page,

if (!huge) {
error = mem_cgroup_try_charge(page, current->mm,
- gfp_mask, &memcg, false);
+ gfp_mask, &memcg);
if (error)
return error;
}
@@ -878,14 +878,14 @@ static int __add_to_page_cache_locked(struct page *page,
goto error;

if (!huge)
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
error:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
if (!huge)
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
put_page(page);
return xas_error(&xas);
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6ecd1045113b..e9355a463e74 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -594,7 +594,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg, true)) {
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
count_vm_event(THP_FAULT_FALLBACK_CHARGE);
@@ -630,7 +630,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
vm_fault_t ret2;

spin_unlock(vmf->ptl);
- mem_cgroup_cancel_charge(page, memcg, true);
+ mem_cgroup_cancel_charge(page, memcg);
put_page(page);
pte_free(vma->vm_mm, pgtable);
ret2 = handle_userfault(vmf, VM_UFFD_MISSING);
@@ -641,7 +641,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
- mem_cgroup_commit_charge(page, memcg, false, true);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -658,7 +658,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
release:
if (pgtable)
pte_free(vma->vm_mm, pgtable);
- mem_cgroup_cancel_charge(page, memcg, true);
+ mem_cgroup_cancel_charge(page, memcg);
put_page(page);
return ret;

@@ -1280,14 +1280,13 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
vmf->address, page_to_nid(page));
if (unlikely(!pages[i] ||
mem_cgroup_try_charge_delay(pages[i], vma->vm_mm,
- GFP_KERNEL, &memcg, false))) {
+ GFP_KERNEL, &memcg))) {
if (pages[i])
put_page(pages[i]);
while (--i >= 0) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg,
- false);
+ mem_cgroup_cancel_charge(pages[i], memcg);
put_page(pages[i]);
}
kfree(pages);
@@ -1333,7 +1332,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
- mem_cgroup_commit_charge(pages[i], memcg, false, false);
+ mem_cgroup_commit_charge(pages[i], memcg, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
vmf->pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*vmf->pte));
@@ -1365,7 +1364,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
for (i = 0; i < HPAGE_PMD_NR; i++) {
memcg = (void *)page_private(pages[i]);
set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg, false);
+ mem_cgroup_cancel_charge(pages[i], memcg);
put_page(pages[i]);
}
kfree(pages);
@@ -1448,7 +1447,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
}

if (unlikely(mem_cgroup_try_charge_delay(new_page, vma->vm_mm,
- huge_gfp, &memcg, true))) {
+ huge_gfp, &memcg))) {
put_page(new_page);
split_huge_pmd(vma, vmf->pmd, vmf->address);
if (page)
@@ -1478,7 +1477,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
put_page(page);
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
spin_unlock(vmf->ptl);
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
put_page(new_page);
goto out_mn;
} else {
@@ -1487,7 +1486,7 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
- mem_cgroup_commit_charge(new_page, memcg, false, true);
+ mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 99d77ffb79c2..46f9b565e8d5 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -974,7 +974,7 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out_nolock;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out_nolock;
}
@@ -982,7 +982,7 @@ static void collapse_huge_page(struct mm_struct *mm,
down_read(&mm->mmap_sem);
result = hugepage_vma_revalidate(mm, address, &vma);
if (result) {
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -990,7 +990,7 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd = mm_find_pmd(mm, address);
if (!pmd) {
result = SCAN_PMD_NULL;
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -1001,7 +1001,7 @@ static void collapse_huge_page(struct mm_struct *mm,
* Continuing to collapse causes inconsistency.
*/
if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -1087,7 +1087,7 @@ static void collapse_huge_page(struct mm_struct *mm,
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false, true);
+ mem_cgroup_commit_charge(new_page, memcg, false);
count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
@@ -1105,7 +1105,7 @@ static void collapse_huge_page(struct mm_struct *mm,
trace_mm_collapse_huge_page(mm, isolated, result);
return;
out:
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
goto out_up_write;
}

@@ -1534,7 +1534,7 @@ static void collapse_file(struct mm_struct *mm,
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg, true))) {
+ if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out;
}
@@ -1547,7 +1547,7 @@ static void collapse_file(struct mm_struct *mm,
break;
xas_unlock_irq(&xas);
if (!xas_nomem(&xas, GFP_KERNEL)) {
- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
result = SCAN_FAIL;
goto out;
}
@@ -1783,7 +1783,7 @@ static void collapse_file(struct mm_struct *mm,

SetPageUptodate(new_page);
page_ref_add(new_page, HPAGE_PMD_NR - 1);
- mem_cgroup_commit_charge(new_page, memcg, false, true);
+ mem_cgroup_commit_charge(new_page, memcg, false);

if (is_shmem) {
set_page_dirty(new_page);
@@ -1838,7 +1838,7 @@ static void collapse_file(struct mm_struct *mm,
VM_BUG_ON(nr_none);
xas_unlock_irq(&xas);

- mem_cgroup_cancel_charge(new_page, memcg, true);
+ mem_cgroup_cancel_charge(new_page, memcg);
new_page->mapping = NULL;
}

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 41f5ed79272e..5ed8f6651383 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -834,7 +834,7 @@ static unsigned long memcg_events_local(struct mem_cgroup *memcg, int event)

static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
- bool compound, int nr_pages)
+ int nr_pages)
{
/*
* Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
@@ -848,7 +848,7 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
__mod_memcg_state(memcg, NR_SHMEM, nr_pages);
}

- if (compound) {
+ if (abs(nr_pages) > 1) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__mod_memcg_state(memcg, MEMCG_RSS_HUGE, nr_pages);
}
@@ -5441,9 +5441,9 @@ static int mem_cgroup_move_account(struct page *page,
ret = 0;

local_irq_disable();
- mem_cgroup_charge_statistics(to, page, compound, nr_pages);
+ mem_cgroup_charge_statistics(to, page, nr_pages);
memcg_check_events(to, page);
- mem_cgroup_charge_statistics(from, page, compound, -nr_pages);
+ mem_cgroup_charge_statistics(from, page, -nr_pages);
memcg_check_events(from, page);
local_irq_enable();
out_unlock:
@@ -6434,7 +6434,6 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
* @mm: mm context of the victim
* @gfp_mask: reclaim mode
* @memcgp: charged memcg return
- * @compound: charge the page as compound or small page
*
* Try to charge @page to the memcg that @mm belongs to, reclaiming
* pages according to @gfp_mask if necessary.
@@ -6447,11 +6446,10 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
* with mem_cgroup_cancel_charge() in case page instantiation fails.
*/
int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool compound)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
+ unsigned int nr_pages = hpage_nr_pages(page);
struct mem_cgroup *memcg = NULL;
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret = 0;

if (mem_cgroup_disabled())
@@ -6493,13 +6491,12 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
}

int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp,
- bool compound)
+ gfp_t gfp_mask, struct mem_cgroup **memcgp)
{
struct mem_cgroup *memcg;
int ret;

- ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp, compound);
+ ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp);
memcg = *memcgp;
mem_cgroup_throttle_swaprate(memcg, page_to_nid(page), gfp_mask);
return ret;
@@ -6510,7 +6507,6 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
* @page: page to charge
* @memcg: memcg to charge the page to
* @lrucare: page might be on LRU already
- * @compound: charge the page as compound or small page
*
* Finalize a charge transaction started by mem_cgroup_try_charge(),
* after page->mapping has been set up. This must happen atomically
@@ -6523,9 +6519,9 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
* Use mem_cgroup_cancel_charge() to cancel the transaction instead.
*/
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare, bool compound)
+ bool lrucare)
{
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = hpage_nr_pages(page);

VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
@@ -6543,7 +6539,7 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
commit_charge(page, memcg, lrucare);

local_irq_disable();
- mem_cgroup_charge_statistics(memcg, page, compound, nr_pages);
+ mem_cgroup_charge_statistics(memcg, page, nr_pages);
memcg_check_events(memcg, page);
local_irq_enable();

@@ -6562,14 +6558,12 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
* mem_cgroup_cancel_charge - cancel a page charge
* @page: page to charge
* @memcg: memcg to charge the page to
- * @compound: charge the page as compound or small page
*
* Cancel a charge transaction started by mem_cgroup_try_charge().
*/
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg,
- bool compound)
+void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
{
- unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
+ unsigned int nr_pages = hpage_nr_pages(page);

if (mem_cgroup_disabled())
return;
@@ -6784,8 +6778,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
commit_charge(newpage, memcg, false);

local_irq_save(flags);
- mem_cgroup_charge_statistics(memcg, newpage, PageTransHuge(newpage),
- nr_pages);
+ mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
memcg_check_events(memcg, newpage);
local_irq_restore(flags);
}
@@ -7015,8 +7008,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
* only synchronisation we have for updating the per-CPU variables.
*/
VM_BUG_ON(!irqs_disabled());
- mem_cgroup_charge_statistics(memcg, page, PageTransHuge(page),
- -nr_entries);
+ mem_cgroup_charge_statistics(memcg, page, -nr_entries);
memcg_check_events(memcg, page);

if (!mem_cgroup_is_root(memcg))
diff --git a/mm/memory.c b/mm/memory.c
index f703fe8c8346..43a3345ecdf3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2678,7 +2678,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
}

- if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg))
goto oom_free_new;

__SetPageUptodate(new_page);
@@ -2713,7 +2713,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
- mem_cgroup_commit_charge(new_page, memcg, false, false);
+ mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
@@ -2752,7 +2752,7 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
new_page = old_page;
page_copied = 1;
} else {
- mem_cgroup_cancel_charge(new_page, memcg, false);
+ mem_cgroup_cancel_charge(new_page, memcg);
}

if (new_page)
@@ -3195,8 +3195,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_page;
}

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
- &memcg, false)) {
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
ret = VM_FAULT_OOM;
goto out_page;
}
@@ -3247,11 +3246,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
- mem_cgroup_commit_charge(page, memcg, true, false);
+ mem_cgroup_commit_charge(page, memcg, true);
activate_page(page);
}

@@ -3287,7 +3286,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
unlock_page(page);
@@ -3361,8 +3360,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!page)
goto oom;

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg,
- false))
+ if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg))
goto oom_free_page;

/*
@@ -3388,14 +3386,14 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3406,7 +3404,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
put_page(page);
goto unlock;
oom_free_page:
@@ -3657,7 +3655,7 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
@@ -3866,8 +3864,8 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
if (!vmf->cow_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge_delay(vmf->cow_page, vma->vm_mm, GFP_KERNEL,
- &vmf->memcg, false)) {
+ if (mem_cgroup_try_charge_delay(vmf->cow_page, vma->vm_mm,
+ GFP_KERNEL, &vmf->memcg)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
}
@@ -3888,7 +3886,7 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
goto uncharge_out;
return ret;
uncharge_out:
- mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg, false);
+ mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg);
put_page(vmf->cow_page);
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 7160c1556f79..5dd50128568c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2786,7 +2786,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

if (unlikely(anon_vma_prepare(vma)))
goto abort;
- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg))
goto abort;

/*
@@ -2832,7 +2832,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
if (!is_zone_device_page(page))
lru_cache_add_active_or_unevictable(page, vma);
get_page(page);
@@ -2854,7 +2854,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

unlock_abort:
pte_unmap_unlock(ptep, ptl);
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
abort:
*src &= ~MIGRATE_PFN_MIGRATE;
}
diff --git a/mm/shmem.c b/mm/shmem.c
index d722eb830317..52c66801321e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1664,8 +1664,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
goto failed;
}

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
- false);
+ error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
if (!error) {
error = shmem_add_to_page_cache(page, mapping, index,
swp_to_radix_entry(swap), gfp);
@@ -1680,14 +1679,14 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
* the rest.
*/
if (error) {
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
delete_from_swap_cache(page);
}
}
if (error)
goto failed;

- mem_cgroup_commit_charge(page, memcg, true, false);
+ mem_cgroup_commit_charge(page, memcg, true);

spin_lock_irq(&info->lock);
info->swapped--;
@@ -1859,8 +1858,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg,
- PageTransHuge(page));
+ error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
if (error) {
if (PageTransHuge(page)) {
count_vm_event(THP_FILE_FALLBACK);
@@ -1871,12 +1869,10 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
error = shmem_add_to_page_cache(page, mapping, hindex,
NULL, gfp & GFP_RECLAIM_MASK);
if (error) {
- mem_cgroup_cancel_charge(page, memcg,
- PageTransHuge(page));
+ mem_cgroup_cancel_charge(page, memcg);
goto unacct;
}
- mem_cgroup_commit_charge(page, memcg, false,
- PageTransHuge(page));
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_anon(page);

spin_lock_irq(&info->lock);
@@ -2361,7 +2357,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
if (unlikely(offset >= max_off))
goto out_release;

- ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg, false);
+ ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg);
if (ret)
goto out_release;

@@ -2370,7 +2366,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
if (ret)
goto out_release_uncharge;

- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
if (dst_vma->vm_flags & VM_WRITE)
@@ -2421,7 +2417,7 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ClearPageDirty(page);
delete_from_page_cache(page);
out_release_uncharge:
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
out_release:
unlock_page(page);
put_page(page);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5871a2aa86a5..9c9ab44780ba 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1864,15 +1864,14 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(!page))
return -ENOMEM;

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
- &memcg, false)) {
+ if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
ret = -ENOMEM;
goto out_nolock;
}

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
ret = 0;
goto out;
}
@@ -1884,10 +1883,10 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, true, false);
+ mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 512576e171ce..bb57d0a3fca7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -97,7 +97,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
__SetPageUptodate(page);

ret = -ENOMEM;
- if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg, false))
+ if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
goto out_release;

_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
@@ -124,7 +124,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,

inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- mem_cgroup_commit_charge(page, memcg, false, false);
+ mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, dst_vma);

set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
@@ -138,7 +138,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
return ret;
out_release_uncharge_unlock:
pte_unmap_unlock(dst_pte, ptl);
- mem_cgroup_cancel_charge(page, memcg, false);
+ mem_cgroup_cancel_charge(page, memcg);
out_release:
put_page(page);
goto out;
--
2.26.0

2020-04-20 22:15:47

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 12/18] mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API

With the page->mapping requirement gone from memcg, we can charge anon
and file-thp pages in one single step, right after they're allocated.

This removes two out of three API calls - especially the tricky commit
step that needed to happen at just the right time between when the
page is "set up" and when it's "published" - somewhat vague and fluid
concepts that varied by page type. All we need is a freshly allocated
page and a memcg context to charge.
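
For illustration only, a hypothetical anon-fault-style caller sketch,
assuming the mem_cgroup_charge(page, mm, gfp, lrucare) signature used
throughout this patch; the helper name is made up. A single call right
after allocation replaces the old try/commit/cancel triplet.

#include <linux/gfp.h>
#include <linux/memcontrol.h>
#include <linux/mm.h>

static struct page *alloc_charged_anon_page(struct vm_area_struct *vma,
                                            unsigned long addr)
{
        struct page *page;

        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, addr);
        if (!page)
                return NULL;
        if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false)) {
                put_page(page);
                return NULL;
        }
        /* callers that used the _delay variant also throttle swaprate here */
        return page;
}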

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/mm.h | 4 +---
kernel/events/uprobes.c | 11 +++--------
mm/filemap.c | 2 +-
mm/huge_memory.c | 41 ++++++++++++-----------------------------
mm/khugepaged.c | 31 ++++++-------------------------
mm/memory.c | 36 ++++++++++--------------------------
mm/migrate.c | 5 +----
mm/swapfile.c | 6 +-----
mm/userfaultfd.c | 5 +----
9 files changed, 36 insertions(+), 105 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5a323422d783..892096bb7292 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -505,7 +505,6 @@ struct vm_fault {
pte_t orig_pte; /* Value of PTE at the time of fault */

struct page *cow_page; /* Page handler may use for COW fault */
- struct mem_cgroup *memcg; /* Cgroup cow_page belongs to */
struct page *page; /* ->fault handlers should return a
* page here, unless VM_FAULT_NOPAGE
* is set (which is also implied by
@@ -939,8 +938,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
return pte;
}

-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
- struct page *page);
+vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page);
vm_fault_t finish_fault(struct vm_fault *vmf);
vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
#endif
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 40e7488ce467..4253c153e985 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -162,14 +162,13 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,
};
int err;
struct mmu_notifier_range range;
- struct mem_cgroup *memcg;

mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, addr,
addr + PAGE_SIZE);

if (new_page) {
- err = mem_cgroup_try_charge(new_page, vma->vm_mm, GFP_KERNEL,
- &memcg);
+ err = mem_cgroup_charge(new_page, vma->vm_mm, GFP_KERNEL,
+ false);
if (err)
return err;
}
@@ -179,17 +178,13 @@ static int __replace_page(struct vm_area_struct *vma, unsigned long addr,

mmu_notifier_invalidate_range_start(&range);
err = -EAGAIN;
- if (!page_vma_mapped_walk(&pvmw)) {
- if (new_page)
- mem_cgroup_cancel_charge(new_page, memcg);
+ if (!page_vma_mapped_walk(&pvmw))
goto unlock;
- }
VM_BUG_ON_PAGE(addr != pvmw.address, old_page);

if (new_page) {
get_page(new_page);
page_add_new_anon_rmap(new_page, vma, addr, false);
- mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
} else
/* no new page, just dec_mm_counter for old_page */
diff --git a/mm/filemap.c b/mm/filemap.c
index f4592ff3ca8b..a10bd6696049 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2633,7 +2633,7 @@ void filemap_map_pages(struct vm_fault *vmf,
if (vmf->pte)
vmf->pte += xas.xa_index - last_pgoff;
last_pgoff = xas.xa_index;
- if (alloc_set_pte(vmf, NULL, page))
+ if (alloc_set_pte(vmf, page))
goto unlock;
unlock_page(page);
goto next;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index da6c413a75a5..0b33eaf0740a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -587,19 +587,19 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
struct page *page, gfp_t gfp)
{
struct vm_area_struct *vma = vmf->vma;
- struct mem_cgroup *memcg;
pgtable_t pgtable;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
vm_fault_t ret = 0;

VM_BUG_ON_PAGE(!PageCompound(page), page);

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg)) {
+ if (mem_cgroup_charge(page, vma->vm_mm, gfp, false)) {
put_page(page);
count_vm_event(THP_FAULT_FALLBACK);
count_vm_event(THP_FAULT_FALLBACK_CHARGE);
return VM_FAULT_FALLBACK;
}
+ cgroup_throttle_swaprate(page, gfp);

pgtable = pte_alloc_one(vma->vm_mm);
if (unlikely(!pgtable)) {
@@ -630,7 +630,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
vm_fault_t ret2;

spin_unlock(vmf->ptl);
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
pte_free(vma->vm_mm, pgtable);
ret2 = handle_userfault(vmf, VM_UFFD_MISSING);
@@ -641,7 +640,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
entry = mk_huge_pmd(page, vma->vm_page_prot);
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
page_add_new_anon_rmap(page, vma, haddr, true);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
@@ -649,7 +647,7 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
mm_inc_nr_ptes(vma->vm_mm);
spin_unlock(vmf->ptl);
count_vm_event(THP_FAULT_ALLOC);
- count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
+ count_memcg_event_mm(vma->vm_mm, THP_FAULT_ALLOC);
}

return 0;
@@ -658,7 +656,6 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
release:
if (pgtable)
pte_free(vma->vm_mm, pgtable);
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
return ret;

@@ -1260,7 +1257,6 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
{
struct vm_area_struct *vma = vmf->vma;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
- struct mem_cgroup *memcg;
pgtable_t pgtable;
pmd_t _pmd;
int i;
@@ -1279,21 +1275,17 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
pages[i] = alloc_page_vma_node(GFP_HIGHUSER_MOVABLE, vma,
vmf->address, page_to_nid(page));
if (unlikely(!pages[i] ||
- mem_cgroup_try_charge_delay(pages[i], vma->vm_mm,
- GFP_KERNEL, &memcg))) {
+ mem_cgroup_charge(pages[i], vma->vm_mm,
+ GFP_KERNEL, false))) {
if (pages[i])
put_page(pages[i]);
- while (--i >= 0) {
- memcg = (void *)page_private(pages[i]);
- set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ while (--i >= 0)
put_page(pages[i]);
- }
kfree(pages);
ret |= VM_FAULT_OOM;
goto out;
}
- set_page_private(pages[i], (unsigned long)memcg);
+ cgroup_throttle_swaprate(pages[i], GFP_KERNEL);
}

for (i = 0; i < HPAGE_PMD_NR; i++) {
@@ -1329,10 +1321,7 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
pte_t entry;
entry = mk_pte(pages[i], vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- memcg = (void *)page_private(pages[i]);
- set_page_private(pages[i], 0);
page_add_new_anon_rmap(pages[i], vmf->vma, haddr, false);
- mem_cgroup_commit_charge(pages[i], memcg, false);
lru_cache_add_active_or_unevictable(pages[i], vma);
vmf->pte = pte_offset_map(&_pmd, haddr);
VM_BUG_ON(!pte_none(*vmf->pte));
@@ -1361,12 +1350,8 @@ static vm_fault_t do_huge_pmd_wp_page_fallback(struct vm_fault *vmf,
out_free_pages:
spin_unlock(vmf->ptl);
mmu_notifier_invalidate_range_end(&range);
- for (i = 0; i < HPAGE_PMD_NR; i++) {
- memcg = (void *)page_private(pages[i]);
- set_page_private(pages[i], 0);
- mem_cgroup_cancel_charge(pages[i], memcg);
+ for (i = 0; i < HPAGE_PMD_NR; i++)
put_page(pages[i]);
- }
kfree(pages);
goto out;
}
@@ -1375,7 +1360,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *new_page;
- struct mem_cgroup *memcg;
unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
struct mmu_notifier_range range;
gfp_t huge_gfp; /* for allocation and charge */
@@ -1446,8 +1430,8 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
goto out;
}

- if (unlikely(mem_cgroup_try_charge_delay(new_page, vma->vm_mm,
- huge_gfp, &memcg))) {
+ if (unlikely(mem_cgroup_charge(new_page, vma->vm_mm, huge_gfp,
+ false))) {
put_page(new_page);
split_huge_pmd(vma, vmf->pmd, vmf->address);
if (page)
@@ -1457,9 +1441,10 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
count_vm_event(THP_FAULT_FALLBACK_CHARGE);
goto out;
}
+ cgroup_throttle_swaprate(new_page, huge_gfp);

count_vm_event(THP_FAULT_ALLOC);
- count_memcg_events(memcg, THP_FAULT_ALLOC, 1);
+ count_memcg_page_event(new_page, THP_FAULT_ALLOC);

if (!page)
clear_huge_page(new_page, vmf->address, HPAGE_PMD_NR);
@@ -1477,7 +1462,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
put_page(page);
if (unlikely(!pmd_same(*vmf->pmd, orig_pmd))) {
spin_unlock(vmf->ptl);
- mem_cgroup_cancel_charge(new_page, memcg);
put_page(new_page);
goto out_mn;
} else {
@@ -1486,7 +1470,6 @@ vm_fault_t do_huge_pmd_wp_page(struct vm_fault *vmf, pmd_t orig_pmd)
entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
pmdp_huge_clear_flush_notify(vma, haddr, vmf->pmd);
page_add_new_anon_rmap(new_page, vma, haddr, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index ee2ef4b8e828..5cf8082fb038 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -951,7 +951,6 @@ static void collapse_huge_page(struct mm_struct *mm,
struct page *new_page;
spinlock_t *pmd_ptl, *pte_ptl;
int isolated = 0, result = 0;
- struct mem_cgroup *memcg;
struct vm_area_struct *vma;
struct mmu_notifier_range range;
gfp_t gfp;
@@ -974,15 +973,15 @@ static void collapse_huge_page(struct mm_struct *mm,
goto out_nolock;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg))) {
+ if (unlikely(mem_cgroup_charge(new_page, mm, gfp, false))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out_nolock;
}
+ count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

down_read(&mm->mmap_sem);
result = hugepage_vma_revalidate(mm, address, &vma);
if (result) {
- mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -990,7 +989,6 @@ static void collapse_huge_page(struct mm_struct *mm,
pmd = mm_find_pmd(mm, address);
if (!pmd) {
result = SCAN_PMD_NULL;
- mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -1001,7 +999,6 @@ static void collapse_huge_page(struct mm_struct *mm,
* Continuing to collapse causes inconsistency.
*/
if (!__collapse_huge_page_swapin(mm, vma, address, pmd, referenced)) {
- mem_cgroup_cancel_charge(new_page, memcg);
up_read(&mm->mmap_sem);
goto out_nolock;
}
@@ -1087,8 +1084,6 @@ static void collapse_huge_page(struct mm_struct *mm,
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
page_add_new_anon_rmap(new_page, vma, address, true);
- mem_cgroup_commit_charge(new_page, memcg, false);
- count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);
lru_cache_add_active_or_unevictable(new_page, vma);
pgtable_trans_huge_deposit(mm, pmd, pgtable);
set_pmd_at(mm, address, pmd, _pmd);
@@ -1105,7 +1100,6 @@ static void collapse_huge_page(struct mm_struct *mm,
trace_mm_collapse_huge_page(mm, isolated, result);
return;
out:
- mem_cgroup_cancel_charge(new_page, memcg);
goto out_up_write;
}

@@ -1515,7 +1509,6 @@ static void collapse_file(struct mm_struct *mm,
struct address_space *mapping = file->f_mapping;
gfp_t gfp;
struct page *new_page;
- struct mem_cgroup *memcg;
pgoff_t index, end = start + HPAGE_PMD_NR;
LIST_HEAD(pagelist);
XA_STATE_ORDER(xas, &mapping->i_pages, start, HPAGE_PMD_ORDER);
@@ -1534,10 +1527,11 @@ static void collapse_file(struct mm_struct *mm,
goto out;
}

- if (unlikely(mem_cgroup_try_charge(new_page, mm, gfp, &memcg))) {
+ if (unlikely(mem_cgroup_charge(new_page, mm, gfp, false))) {
result = SCAN_CGROUP_CHARGE_FAIL;
goto out;
}
+ count_memcg_page_event(new_page, THP_COLLAPSE_ALLOC);

/* This will be less messy when we use multi-index entries */
do {
@@ -1547,7 +1541,6 @@ static void collapse_file(struct mm_struct *mm,
break;
xas_unlock_irq(&xas);
if (!xas_nomem(&xas, GFP_KERNEL)) {
- mem_cgroup_cancel_charge(new_page, memcg);
result = SCAN_FAIL;
goto out;
}
@@ -1740,18 +1733,9 @@ static void collapse_file(struct mm_struct *mm,
}

if (nr_none) {
- struct lruvec *lruvec;
- /*
- * XXX: We have started try_charge and pinned the
- * memcg, but the page isn't committed yet so we
- * cannot use mod_lruvec_page_state(). This hackery
- * will be cleaned up when remove the page->mapping
- * dependency from memcg and fully charge above.
- */
- lruvec = mem_cgroup_lruvec(memcg, page_pgdat(new_page));
- __mod_lruvec_state(lruvec, NR_FILE_PAGES, nr_none);
+ __mod_lruvec_page_state(new_page, NR_FILE_PAGES, nr_none);
if (is_shmem)
- __mod_lruvec_state(lruvec, NR_SHMEM, nr_none);
+ __mod_lruvec_page_state(new_page, NR_SHMEM, nr_none);
}

xa_locked:
@@ -1789,7 +1773,6 @@ static void collapse_file(struct mm_struct *mm,

SetPageUptodate(new_page);
page_ref_add(new_page, HPAGE_PMD_NR - 1);
- mem_cgroup_commit_charge(new_page, memcg, false);

if (is_shmem) {
set_page_dirty(new_page);
@@ -1797,7 +1780,6 @@ static void collapse_file(struct mm_struct *mm,
} else {
lru_cache_add_file(new_page);
}
- count_memcg_events(memcg, THP_COLLAPSE_ALLOC, 1);

/*
* Remove pte page tables, so we can re-fault the page as huge.
@@ -1844,7 +1826,6 @@ static void collapse_file(struct mm_struct *mm,
VM_BUG_ON(nr_none);
xas_unlock_irq(&xas);

- mem_cgroup_cancel_charge(new_page, memcg);
new_page->mapping = NULL;
}

diff --git a/mm/memory.c b/mm/memory.c
index 43a3345ecdf3..3fa379d9b17d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2647,7 +2647,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
struct page *new_page = NULL;
pte_t entry;
int page_copied = 0;
- struct mem_cgroup *memcg;
struct mmu_notifier_range range;

if (unlikely(anon_vma_prepare(vma)))
@@ -2678,8 +2677,9 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
}
}

- if (mem_cgroup_try_charge_delay(new_page, mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_charge(new_page, mm, GFP_KERNEL, false))
goto oom_free_new;
+ cgroup_throttle_swaprate(new_page, GFP_KERNEL);

__SetPageUptodate(new_page);

@@ -2713,7 +2713,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
*/
ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
page_add_new_anon_rmap(new_page, vma, vmf->address, false);
- mem_cgroup_commit_charge(new_page, memcg, false);
lru_cache_add_active_or_unevictable(new_page, vma);
/*
* We call the notify macro here because, when using secondary
@@ -2751,8 +2750,6 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
/* Free the old page.. */
new_page = old_page;
page_copied = 1;
- } else {
- mem_cgroup_cancel_charge(new_page, memcg);
}

if (new_page)
@@ -3090,7 +3087,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
struct page *page = NULL, *swapcache;
- struct mem_cgroup *memcg;
swp_entry_t entry;
pte_t pte;
int locked;
@@ -3195,10 +3191,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
goto out_page;
}

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) {
ret = VM_FAULT_OOM;
goto out_page;
}
+ cgroup_throttle_swaprate(page, GFP_KERNEL);

/*
* Back out if somebody else already faulted in this pte.
@@ -3246,11 +3243,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
/* ksm created a completely new copy */
if (unlikely(page != swapcache && swapcache)) {
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
do_page_add_anon_rmap(page, vma, vmf->address, exclusive);
- mem_cgroup_commit_charge(page, memcg, true);
activate_page(page);
}

@@ -3286,7 +3281,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
out:
return ret;
out_nomap:
- mem_cgroup_cancel_charge(page, memcg);
pte_unmap_unlock(vmf->pte, vmf->ptl);
out_page:
unlock_page(page);
@@ -3307,7 +3301,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
- struct mem_cgroup *memcg;
struct page *page;
vm_fault_t ret = 0;
pte_t entry;
@@ -3360,8 +3353,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
if (!page)
goto oom;

- if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false))
goto oom_free_page;
+ cgroup_throttle_swaprate(page, GFP_KERNEL);

/*
* The memory barrier inside __SetPageUptodate makes sure that
@@ -3386,14 +3380,12 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
/* Deliver the page fault to userland, check inside PT lock */
if (userfaultfd_missing(vma)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
return handle_userfault(vmf, VM_UFFD_MISSING);
}

inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
setpte:
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
@@ -3404,7 +3396,6 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
release:
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
goto unlock;
oom_free_page:
@@ -3609,7 +3600,6 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
* mapping. If needed, the fucntion allocates page table or use pre-allocated.
*
* @vmf: fault environment
- * @memcg: memcg to charge page (only for private mappings)
* @page: page to map
*
* Caller must take care of unlocking vmf->ptl, if vmf->pte is non-NULL on
@@ -3620,8 +3610,7 @@ static vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
*
* Return: %0 on success, %VM_FAULT_ code in case of error.
*/
-vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
- struct page *page)
+vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct page *page)
{
struct vm_area_struct *vma = vmf->vma;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -3629,9 +3618,6 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
vm_fault_t ret;

if (pmd_none(*vmf->pmd) && PageTransCompound(page)) {
- /* THP on COW? */
- VM_BUG_ON_PAGE(memcg, page);
-
ret = do_set_pmd(vmf, page);
if (ret != VM_FAULT_FALLBACK)
return ret;
@@ -3655,7 +3641,6 @@ vm_fault_t alloc_set_pte(struct vm_fault *vmf, struct mem_cgroup *memcg,
if (write && !(vma->vm_flags & VM_SHARED)) {
inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, vmf->address, false);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
} else {
inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
@@ -3704,7 +3689,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
if (!(vmf->vma->vm_flags & VM_SHARED))
ret = check_stable_address_space(vmf->vma->vm_mm);
if (!ret)
- ret = alloc_set_pte(vmf, vmf->memcg, page);
+ ret = alloc_set_pte(vmf, page);
if (vmf->pte)
pte_unmap_unlock(vmf->pte, vmf->ptl);
return ret;
@@ -3864,11 +3849,11 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
if (!vmf->cow_page)
return VM_FAULT_OOM;

- if (mem_cgroup_try_charge_delay(vmf->cow_page, vma->vm_mm,
- GFP_KERNEL, &vmf->memcg)) {
+ if (mem_cgroup_charge(vmf->cow_page, vma->vm_mm, GFP_KERNEL, false)) {
put_page(vmf->cow_page);
return VM_FAULT_OOM;
}
+ cgroup_throttle_swaprate(vmf->cow_page, GFP_KERNEL);

ret = __do_fault(vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
@@ -3886,7 +3871,6 @@ static vm_fault_t do_cow_fault(struct vm_fault *vmf)
goto uncharge_out;
return ret;
uncharge_out:
- mem_cgroup_cancel_charge(vmf->cow_page, vmf->memcg);
put_page(vmf->cow_page);
return ret;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 14a584c52782..a3361c744069 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2746,7 +2746,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,
{
struct vm_area_struct *vma = migrate->vma;
struct mm_struct *mm = vma->vm_mm;
- struct mem_cgroup *memcg;
bool flush = false;
spinlock_t *ptl;
pte_t entry;
@@ -2793,7 +2792,7 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

if (unlikely(anon_vma_prepare(vma)))
goto abort;
- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, false))
goto abort;

/*
@@ -2839,7 +2838,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
if (!is_zone_device_page(page))
lru_cache_add_active_or_unevictable(page, vma);
get_page(page);
@@ -2861,7 +2859,6 @@ static void migrate_vma_insert_page(struct migrate_vma *migrate,

unlock_abort:
pte_unmap_unlock(ptep, ptl);
- mem_cgroup_cancel_charge(page, memcg);
abort:
*src &= ~MIGRATE_PFN_MIGRATE;
}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 74543137371b..08140aed9258 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1854,7 +1854,6 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, swp_entry_t entry, struct page *page)
{
struct page *swapcache;
- struct mem_cgroup *memcg;
spinlock_t *ptl;
pte_t *pte;
int ret = 1;
@@ -1864,14 +1863,13 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
if (unlikely(!page))
return -ENOMEM;

- if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL, &memcg)) {
+ if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) {
ret = -ENOMEM;
goto out_nolock;
}

pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
if (unlikely(!pte_same_as_swp(*pte, swp_entry_to_pte(entry)))) {
- mem_cgroup_cancel_charge(page, memcg);
ret = 0;
goto out;
}
@@ -1883,10 +1881,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
if (page == swapcache) {
page_add_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, true);
} else { /* ksm created a completely new copy */
page_add_new_anon_rmap(page, vma, addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, vma);
}
swap_free(entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index bb57d0a3fca7..2745489415cc 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -56,7 +56,6 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
struct page **pagep,
bool wp_copy)
{
- struct mem_cgroup *memcg;
pte_t _dst_pte, *dst_pte;
spinlock_t *ptl;
void *page_kaddr;
@@ -97,7 +96,7 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
__SetPageUptodate(page);

ret = -ENOMEM;
- if (mem_cgroup_try_charge(page, dst_mm, GFP_KERNEL, &memcg))
+ if (mem_cgroup_charge(page, dst_mm, GFP_KERNEL, false))
goto out_release;

_dst_pte = pte_mkdirty(mk_pte(page, dst_vma->vm_page_prot));
@@ -124,7 +123,6 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,

inc_mm_counter(dst_mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, dst_vma, dst_addr, false);
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_active_or_unevictable(page, dst_vma);

set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
@@ -138,7 +136,6 @@ static int mcopy_atomic_pte(struct mm_struct *dst_mm,
return ret;
out_release_uncharge_unlock:
pte_unmap_unlock(dst_pte, ptl);
- mem_cgroup_cancel_charge(page, memcg);
out_release:
put_page(page);
goto out;
--
2.26.0
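
To summarize the conversion pattern in the hunks above (a hedged sketch, not
any specific hunk): the transactional try/commit/cancel sequence at each
anonymous-page instantiation site collapses into a single charge call plus
the swap-rate throttle.

	/* old, transactional scheme (sketch) */
	if (mem_cgroup_try_charge_delay(page, mm, GFP_KERNEL, &memcg))
		goto oom;
	/* ... set up the pte ... */
	page_add_new_anon_rmap(page, vma, addr, false);
	mem_cgroup_commit_charge(page, memcg, false);
	/* every error path needed mem_cgroup_cancel_charge(page, memcg) */

	/* new, single-step scheme (sketch) */
	if (mem_cgroup_charge(page, mm, GFP_KERNEL, false))
		goto oom;
	cgroup_throttle_swaprate(page, GFP_KERNEL);
	/* ... set up the pte ... */
	page_add_new_anon_rmap(page, vma, addr, false);
	/* error paths only need put_page(page) now */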

2020-04-20 22:16:26

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 10/18] mm: memcontrol: switch to native NR_ANON_MAPPED counter

Memcg maintains a private MEMCG_RSS counter. This divergence from the
generic VM accounting means unnecessary code overhead, and creates a
dependency for memcg that page->mapping is set up at the time of
charging, so that page types can be told apart.

Convert the generic accounting sites to mod_lruvec_page_state and
friends to maintain the per-cgroup vmstat counter of
NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup
during rmap changes, the same way we do for NR_FILE_MAPPED.
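
As a minimal sketch of that pattern (illustrative only; the helper name is
made up, the real call sites are in the rmap hunks below):

	static void sketch_account_anon_rmap(struct page *page, int nr)
	{
		lock_page_memcg(page);		/* pin page->mem_cgroup */
		/* NR_ANON_MAPPED now also feeds the per-cgroup lruvec counter */
		__mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
		unlock_page_memcg(page);
	}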

With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
counter, this patch finally eliminates the need to have page->mapping
set up at charge time.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 3 +--
mm/memcontrol.c | 27 ++++++++--------------
mm/rmap.c | 47 +++++++++++++++++++++++---------------
3 files changed, 39 insertions(+), 38 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c44aa1ccf553..bfb1d961e346 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -29,8 +29,7 @@ struct kmem_cache;

/* Cgroup-specific page state, on top of universal node page state */
enum memcg_stat_item {
- MEMCG_RSS = NR_VM_NODE_STAT_ITEMS,
- MEMCG_RSS_HUGE,
+ MEMCG_RSS_HUGE = NR_VM_NODE_STAT_ITEMS,
MEMCG_SWAP,
MEMCG_SOCK,
/* XXX: why are these zone and not node counters? */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7e77166cf10b..c87178d6219f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -836,13 +836,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
struct page *page,
int nr_pages)
{
- /*
- * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
- * counted as CACHE even if it's on ANON LRU.
- */
- if (PageAnon(page))
- __mod_memcg_state(memcg, MEMCG_RSS, nr_pages);
-
if (abs(nr_pages) > 1) {
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
__mod_memcg_state(memcg, MEMCG_RSS_HUGE, nr_pages);
@@ -1384,7 +1377,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
*/

seq_buf_printf(&s, "anon %llu\n",
- (u64)memcg_page_state(memcg, MEMCG_RSS) *
+ (u64)memcg_page_state(memcg, NR_ANON_MAPPED) *
PAGE_SIZE);
seq_buf_printf(&s, "file %llu\n",
(u64)memcg_page_state(memcg, NR_FILE_PAGES) *
@@ -3298,7 +3291,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)

if (mem_cgroup_is_root(memcg)) {
val = memcg_page_state(memcg, NR_FILE_PAGES) +
- memcg_page_state(memcg, MEMCG_RSS);
+ memcg_page_state(memcg, NR_ANON_MAPPED);
if (swap)
val += memcg_page_state(memcg, MEMCG_SWAP);
} else {
@@ -3768,7 +3761,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)

static const unsigned int memcg1_stats[] = {
NR_FILE_PAGES,
- MEMCG_RSS,
+ NR_ANON_MAPPED,
MEMCG_RSS_HUGE,
NR_SHMEM,
NR_FILE_MAPPED,
@@ -5395,7 +5388,12 @@ static int mem_cgroup_move_account(struct page *page,

lock_page_memcg(page);

- if (!PageAnon(page)) {
+ if (PageAnon(page)) {
+ if (page_mapped(page)) {
+ __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
+ }
+ } else {
__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);

@@ -6529,7 +6527,6 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
{
unsigned int nr_pages = hpage_nr_pages(page);

- VM_BUG_ON_PAGE(!page->mapping, page);
VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);

if (mem_cgroup_disabled())
@@ -6602,8 +6599,6 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
struct mem_cgroup *memcg;
int ret;

- VM_BUG_ON_PAGE(!page->mapping, page);
-
ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
if (ret)
return ret;
@@ -6615,7 +6610,6 @@ struct uncharge_gather {
struct mem_cgroup *memcg;
unsigned long nr_pages;
unsigned long pgpgout;
- unsigned long nr_anon;
unsigned long nr_kmem;
unsigned long nr_huge;
struct page *dummy_page;
@@ -6640,7 +6634,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
}

local_irq_save(flags);
- __mod_memcg_state(ug->memcg, MEMCG_RSS, -ug->nr_anon);
__mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
__count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
__this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
@@ -6682,8 +6675,6 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
if (!PageKmemcg(page)) {
if (PageTransHuge(page))
ug->nr_huge += nr_pages;
- if (PageAnon(page))
- ug->nr_anon += nr_pages;
ug->pgpgout++;
} else {
ug->nr_kmem += nr_pages;
diff --git a/mm/rmap.c b/mm/rmap.c
index f79a206b271a..150513d31efa 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1114,6 +1114,11 @@ void do_page_add_anon_rmap(struct page *page,
bool compound = flags & RMAP_COMPOUND;
bool first;

+ if (unlikely(PageKsm(page)))
+ lock_page_memcg(page);
+ else
+ VM_BUG_ON_PAGE(!PageLocked(page), page);
+
if (compound) {
atomic_t *mapcount;
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1134,12 +1139,13 @@ void do_page_add_anon_rmap(struct page *page,
*/
if (compound)
__inc_node_page_state(page, NR_ANON_THPS);
- __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
+ __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
}
- if (unlikely(PageKsm(page)))
- return;

- VM_BUG_ON_PAGE(!PageLocked(page), page);
+ if (unlikely(PageKsm(page))) {
+ unlock_page_memcg(page);
+ return;
+ }

/* address might be in next vma when migration races vma_adjust */
if (first)
@@ -1181,7 +1187,7 @@ void page_add_new_anon_rmap(struct page *page,
/* increment count (starts at -1) */
atomic_set(&page->_mapcount, 0);
}
- __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
+ __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
__page_set_anon_rmap(page, vma, address, 1);
}

@@ -1230,13 +1236,12 @@ static void page_remove_file_rmap(struct page *page, bool compound)
int i, nr = 1;

VM_BUG_ON_PAGE(compound && !PageHead(page), page);
- lock_page_memcg(page);

/* Hugepages are not counted in NR_FILE_MAPPED for now. */
if (unlikely(PageHuge(page))) {
/* hugetlb pages are always mapped with pmds */
atomic_dec(compound_mapcount_ptr(page));
- goto out;
+ return;
}

/* page still mapped by someone else? */
@@ -1246,14 +1251,14 @@ static void page_remove_file_rmap(struct page *page, bool compound)
nr++;
}
if (!atomic_add_negative(-1, compound_mapcount_ptr(page)))
- goto out;
+ return;
if (PageSwapBacked(page))
__dec_node_page_state(page, NR_SHMEM_PMDMAPPED);
else
__dec_node_page_state(page, NR_FILE_PMDMAPPED);
} else {
if (!atomic_add_negative(-1, &page->_mapcount))
- goto out;
+ return;
}

/*
@@ -1265,8 +1270,6 @@ static void page_remove_file_rmap(struct page *page, bool compound)

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
-out:
- unlock_page_memcg(page);
}

static void page_remove_anon_compound_rmap(struct page *page)
@@ -1310,7 +1313,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
clear_page_mlock(page);

if (nr)
- __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, -nr);
+ __mod_lruvec_page_state(page, NR_ANON_MAPPED, -nr);
}

/**
@@ -1322,22 +1325,28 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- if (!PageAnon(page))
- return page_remove_file_rmap(page, compound);
+ lock_page_memcg(page);

- if (compound)
- return page_remove_anon_compound_rmap(page);
+ if (!PageAnon(page)) {
+ page_remove_file_rmap(page, compound);
+ goto out;
+ }
+
+ if (compound) {
+ page_remove_anon_compound_rmap(page);
+ goto out;
+ }

/* page still mapped by someone else? */
if (!atomic_add_negative(-1, &page->_mapcount))
- return;
+ goto out;

/*
* We use the irq-unsafe __{inc|mod}_zone_page_stat because
* these counters are not modified in interrupt context, and
* pte lock(a spinlock) is held, which implies preemption disabled.
*/
- __dec_node_page_state(page, NR_ANON_MAPPED);
+ __dec_lruvec_page_state(page, NR_ANON_MAPPED);

if (unlikely(PageMlocked(page)))
clear_page_mlock(page);
@@ -1354,6 +1363,8 @@ void page_remove_rmap(struct page *page, bool compound)
* Leaving it set also helps swapoff to reinstate ptes
* faster for those pages still in swapcache.
*/
+out:
+ unlock_page_memcg(page);
}

/*
--
2.26.0

2020-04-20 22:16:42

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 13/18] mm: memcontrol: drop unused try/commit/cancel charge API

There are no more users. RIP in peace.

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 36 -----------
mm/memcontrol.c | 126 +++++--------------------------------
2 files changed, 15 insertions(+), 147 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9ac8122ec1cd..52eb6411cfee 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -357,14 +357,6 @@ static inline unsigned long mem_cgroup_protection(struct mem_cgroup *memcg,
enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
struct mem_cgroup *memcg);

-int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp);
-int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp);
-void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare);
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
-
int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
bool lrucare);

@@ -846,34 +838,6 @@ static inline enum mem_cgroup_protection mem_cgroup_protected(
return MEMCG_PROT_NONE;
}

-static inline int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask,
- struct mem_cgroup **memcgp)
-{
- *memcgp = NULL;
- return 0;
-}
-
-static inline int mem_cgroup_try_charge_delay(struct page *page,
- struct mm_struct *mm,
- gfp_t gfp_mask,
- struct mem_cgroup **memcgp)
-{
- *memcgp = NULL;
- return 0;
-}
-
-static inline void mem_cgroup_commit_charge(struct page *page,
- struct mem_cgroup *memcg,
- bool lrucare)
-{
-}
-
-static inline void mem_cgroup_cancel_charge(struct page *page,
- struct mem_cgroup *memcg)
-{
-}
-
static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask, bool lrucare)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7845a87b94d5..d5aee5577ff3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6431,29 +6431,26 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root,
}

/**
- * mem_cgroup_try_charge - try charging a page
+ * mem_cgroup_charge - charge a newly allocated page to a cgroup
* @page: page to charge
* @mm: mm context of the victim
* @gfp_mask: reclaim mode
- * @memcgp: charged memcg return
+ * @lrucare: page might be on the LRU already
*
* Try to charge @page to the memcg that @mm belongs to, reclaiming
* pages according to @gfp_mask if necessary.
*
- * Returns 0 on success, with *@memcgp pointing to the charged memcg.
- * Otherwise, an error code is returned.
- *
- * After page->mapping has been set up, the caller must finalize the
- * charge with mem_cgroup_commit_charge(). Or abort the transaction
- * with mem_cgroup_cancel_charge() in case page instantiation fails.
+ * Returns 0 on success. Otherwise, an error code is returned.
*/
-int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp)
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
+ bool lrucare)
{
unsigned int nr_pages = hpage_nr_pages(page);
struct mem_cgroup *memcg = NULL;
int ret = 0;

+ VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
+
if (mem_cgroup_disabled())
goto out;

@@ -6485,56 +6482,8 @@ int mem_cgroup_try_charge(struct page *page, struct mm_struct *mm,
memcg = get_mem_cgroup_from_mm(mm);

ret = try_charge(memcg, gfp_mask, nr_pages);
-
- css_put(&memcg->css);
-out:
- *memcgp = memcg;
- return ret;
-}
-
-int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
- gfp_t gfp_mask, struct mem_cgroup **memcgp)
-{
- int ret;
-
- ret = mem_cgroup_try_charge(page, mm, gfp_mask, memcgp);
- if (*memcgp)
- cgroup_throttle_swaprate(page, gfp_mask);
- return ret;
-}
-
-/**
- * mem_cgroup_commit_charge - commit a page charge
- * @page: page to charge
- * @memcg: memcg to charge the page to
- * @lrucare: page might be on LRU already
- *
- * Finalize a charge transaction started by mem_cgroup_try_charge(),
- * after page->mapping has been set up. This must happen atomically
- * as part of the page instantiation, i.e. under the page table lock
- * for anonymous pages, under the page lock for page and swap cache.
- *
- * In addition, the page must not be on the LRU during the commit, to
- * prevent racing with task migration. If it might be, use @lrucare.
- *
- * Use mem_cgroup_cancel_charge() to cancel the transaction instead.
- */
-void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
- bool lrucare)
-{
- unsigned int nr_pages = hpage_nr_pages(page);
-
- VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
-
- if (mem_cgroup_disabled())
- return;
- /*
- * Swap faults will attempt to charge the same page multiple
- * times. But reuse_swap_page() might have removed the page
- * from swapcache already, so we can't check PageSwapCache().
- */
- if (!memcg)
- return;
+ if (ret)
+ goto out_put;

commit_charge(page, memcg, lrucare);

@@ -6552,55 +6501,11 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
*/
mem_cgroup_uncharge_swap(entry, nr_pages);
}
-}

-/**
- * mem_cgroup_cancel_charge - cancel a page charge
- * @page: page to charge
- * @memcg: memcg to charge the page to
- *
- * Cancel a charge transaction started by mem_cgroup_try_charge().
- */
-void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
-{
- unsigned int nr_pages = hpage_nr_pages(page);
-
- if (mem_cgroup_disabled())
- return;
- /*
- * Swap faults will attempt to charge the same page multiple
- * times. But reuse_swap_page() might have removed the page
- * from swapcache already, so we can't check PageSwapCache().
- */
- if (!memcg)
- return;
-
- cancel_charge(memcg, nr_pages);
-}
-
-/**
- * mem_cgroup_charge - charge a newly allocated page to a cgroup
- * @page: page to charge
- * @mm: mm context of the victim
- * @gfp_mask: reclaim mode
- * @lrucare: page might be on the LRU already
- *
- * Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
- *
- * Returns 0 on success. Otherwise, an error code is returned.
- */
-int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
- bool lrucare)
-{
- struct mem_cgroup *memcg;
- int ret;
-
- ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
- if (ret)
- return ret;
- mem_cgroup_commit_charge(page, memcg, lrucare);
- return 0;
+out_put:
+ css_put(&memcg->css);
+out:
+ return ret;
}

struct uncharge_gather {
@@ -6707,8 +6612,7 @@ static void uncharge_list(struct list_head *page_list)
* mem_cgroup_uncharge - uncharge a page
* @page: page to uncharge
*
- * Uncharge a page previously charged with mem_cgroup_try_charge() and
- * mem_cgroup_commit_charge().
+ * Uncharge a page previously charged with mem_cgroup_charge().
*/
void mem_cgroup_uncharge(struct page *page)
{
@@ -6731,7 +6635,7 @@ void mem_cgroup_uncharge(struct page *page)
* @page_list: list of pages to uncharge
*
* Uncharge a list of pages previously charged with
- * mem_cgroup_try_charge() and mem_cgroup_commit_charge().
+ * mem_cgroup_charge().
*/
void mem_cgroup_uncharge_list(struct list_head *page_list)
{
--
2.26.0

2020-04-20 22:17:33

by Johannes Weiner

[permalink] [raw]
Subject: [PATCH 07/18] mm: memcontrol: prepare move_account for removal of private page type counters

When memcg uses the generic vmstat counters, it doesn't need to do
anything at charging and uncharging time. It does, however, need to
migrate counts when pages move to a different cgroup in move_account.

Prepare the move_account function for the arrival of NR_FILE_PAGES,
NR_ANON_MAPPED, NR_ANON_THPS etc. by having a branch for files and a
branch for anon, which can then be divided into sub-branches.
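
For orientation, the shape the function is headed toward once those counters
arrive looks roughly like this (a sketch assuming the later conversions in
this series, not the hunk below):

	if (PageAnon(page)) {
		if (page_mapped(page)) {
			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
		}
	} else {
		__mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
		__mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);

		if (page_mapped(page)) {
			__mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
			__mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
		}
	}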

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e3e8913a5b28..ac6f2b073a5a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5374,7 +5374,6 @@ static int mem_cgroup_move_account(struct page *page,
struct pglist_data *pgdat;
unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
int ret;
- bool anon;

VM_BUG_ON(from == to);
VM_BUG_ON_PAGE(PageLRU(page), page);
@@ -5392,25 +5391,27 @@ static int mem_cgroup_move_account(struct page *page,
if (page->mem_cgroup != from)
goto out_unlock;

- anon = PageAnon(page);
-
pgdat = page_pgdat(page);
from_vec = mem_cgroup_lruvec(from, pgdat);
to_vec = mem_cgroup_lruvec(to, pgdat);

lock_page_memcg(page);

- if (!anon && page_mapped(page)) {
- __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
- __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
- }
+ if (!PageAnon(page)) {
+ if (page_mapped(page)) {
+ __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
+ }

- if (!anon && PageDirty(page)) {
- struct address_space *mapping = page_mapping(page);
+ if (PageDirty(page)) {
+ struct address_space *mapping = page_mapping(page);

- if (mapping_cap_account_dirty(mapping)) {
- __mod_lruvec_state(from_vec, NR_FILE_DIRTY, -nr_pages);
- __mod_lruvec_state(to_vec, NR_FILE_DIRTY, nr_pages);
+ if (mapping_cap_account_dirty(mapping)) {
+ __mod_lruvec_state(from_vec, NR_FILE_DIRTY,
+ -nr_pages);
+ __mod_lruvec_state(to_vec, NR_FILE_DIRTY,
+ nr_pages);
+ }
}
}

--
2.26.0

2020-04-21 08:30:54

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm: fix NUMA node file count error in replace_page_cache()



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> When replacing one page with another one in the cache, we have to
> decrease the file count of the old page's NUMA node and increase the
> one of the new NUMA node, otherwise the old node leaks the count and
> the new node eventually underflows its counter.
>
> Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

> ---
> mm/filemap.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 23a051a7ef0f..49e3b5da0216 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -808,11 +808,11 @@ int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> old->mapping = NULL;
> /* hugetlb pages do not participate in page cache accounting. */
> if (!PageHuge(old))
> - __dec_node_page_state(new, NR_FILE_PAGES);
> + __dec_node_page_state(old, NR_FILE_PAGES);
> if (!PageHuge(new))
> __inc_node_page_state(new, NR_FILE_PAGES);
> if (PageSwapBacked(old))
> - __dec_node_page_state(new, NR_SHMEM);
> + __dec_node_page_state(old, NR_SHMEM);
> if (PageSwapBacked(new))
> __inc_node_page_state(new, NR_SHMEM);
> xas_unlock_irqrestore(&xas, flags);
>

2020-04-21 09:14:20

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 03/18] mm: memcontrol: drop @compound parameter from memcg charging API



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The memcg charging API carries a boolean @compound parameter that
> tells whether the page we're dealing with is a hugepage.
> mem_cgroup_commit_charge() has another boolean @lrucare that indicates
> whether the page needs LRU locking or not while charging. The majority
> of callsites know those parameters at compile time, which results in a
> lot of naked "false, false" argument lists. This makes for cryptic
> code and is a breeding ground for subtle mistakes.
>
> Thankfully, the huge page state can be inferred from the page itself
> and doesn't need to be passed along. This is safe because charging
> completes before the page is published and somebody may split it.
>
> Simplify the callsites by removing @compound, and let memcg infer the
> state by using hpage_nr_pages() unconditionally. That function does
> PageTransHuge() to identify huge pages, which also helpfully asserts
> that nobody passes in tail pages by accident.
>
> The following patches will introduce a new charging API, best not to
> carry over unnecessary weight.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

2020-04-21 09:14:46

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 04/18] mm: memcontrol: move out cgroup swaprate throttling



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The cgroup swaprate throttling is about matching new anon allocations
> to the rate of available IO when that is being throttled. It's the io
> controller hooking into the VM, rather than a memory controller thing.
>
> Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(),
> and drop the @memcg argument which is only used to check whether the
> preceding page charge has succeeded and the fault is proceeding.
>
> We could decouple the call from mem_cgroup_try_charge() here as well,
> but that would cause unnecessary churn: the following patches convert
> all callsites to a new charge API and we'll decouple as we go along.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

2020-04-21 09:15:03

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The try/commit/cancel protocol that memcg uses dates back to when
> pages used to be uncharged upon removal from the page cache, and thus
> couldn't be committed before the insertion had succeeded. Nowadays,
> pages are uncharged when they are physically freed; it doesn't matter
> whether the insertion was successful or not. For the page cache, the
> transaction dance has become unnecessary.
>
> Introduce a mem_cgroup_charge() function that simply charges a newly
> allocated page to a cgroup and sets up page->mem_cgroup in one single
> step. If the insertion fails, the caller doesn't have to do anything
> but free/put the page.
>
> Then switch the page cache over to this new API.
>
> Subsequent patches will also convert anon pages, but it needs a bit
> more prep work. Right now, memcg depends on page->mapping being
> already set up at the time of charging, so that it can maintain its
> own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
> under the same pte lock under which the page is published, so a single
> charge point that can block doesn't work there just yet.
>
> The following prep patches will replace the private memcg counters
> with the generic vmstat counters, thus removing the page->mapping
> dependency, then complete the transition to the new single-point
> charge API and delete the old transactional scheme.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---

Reviewed-by: Alex Shi <[email protected]>
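
The calling convention described in the quoted changelog boils down to
something like this (a hedged sketch; add_to_cache() is a stand-in for the
real page-cache insertion helper, not an actual kernel function):

	error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
	if (error)
		goto out_put;

	error = add_to_cache(mapping, page, index);	/* hypothetical */
	if (error)
		goto out_put;	/* uncharge happens when the page is freed */
	return 0;
out_put:
	put_page(page);
	return error;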

2020-04-21 09:16:19

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm: memcontrol: prepare uncharging for removal of private page type counters



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The uncharge batching code adds up the anon, file, kmem counts to
> determine the total number of pages to uncharge and references to
> drop. But the next patches will remove the anon and file counters.
>
> Maintain an aggregate nr_pages in the uncharge_gather struct.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

2020-04-21 09:18:48

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 07/18] mm: memcontrol: prepare move_account for removal of private page type counters



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> When memcg uses the generic vmstat counters, it doesn't need to do
> anything at charging and uncharging time. It does, however, need to
> migrate counts when pages move to a different cgroup in move_account.
>
> Prepare the move_account function for the arrival of NR_FILE_PAGES,
> NR_ANON_MAPPED, NR_ANON_THPS etc. by having a branch for files and a
> branch for anon, which can then be divided into sub-branches.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

2020-04-21 09:23:37

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 18/18] mm: memcontrol: update page->mem_cgroup stability rules



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The previous patches have simplified the access rules around
> page->mem_cgroup somewhat:
>
> 1. We never change page->mem_cgroup while the page is isolated by
> somebody else. This was by far the biggest exception to our rules
> and it didn't stop at lock_page() or lock_page_memcg().
>
> 2. We charge pages before they get put into page tables now, so the
> somewhat fishy rule about "can be in page table as long as it's
> still locked" is now gone and boiled down to having an exclusive
> reference to the page.
>
> Document the new rules. Any of the following will stabilize the
> page->mem_cgroup association:
>
> - the page lock
> - LRU isolation
> - lock_page_memcg()
> - exclusive access to the page
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

> ---
> mm/memcontrol.c | 21 +++++++--------------
> 1 file changed, 7 insertions(+), 14 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index a8cce52b6b4d..7b63260c9b57 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1201,9 +1201,8 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
> * @page: the page
> * @pgdat: pgdat of the page
> *
> - * This function is only safe when following the LRU page isolation
> - * and putback protocol: the LRU lock must be held, and the page must
> - * either be PageLRU() or the caller must have isolated/allocated it.
> + * This function relies on page->mem_cgroup being stable - see the
> + * access rules in commit_charge().
> */
> struct lruvec *mem_cgroup_page_lruvec(struct page *page, struct pglist_data *pgdat)
> {
> @@ -2605,18 +2604,12 @@ static void commit_charge(struct page *page, struct mem_cgroup *memcg)
> {
> VM_BUG_ON_PAGE(page->mem_cgroup, page);
> /*
> - * Nobody should be changing or seriously looking at
> - * page->mem_cgroup at this point:
> - *
> - * - the page is uncharged
> - *
> - * - the page is off-LRU
> - *
> - * - an anonymous fault has exclusive page access, except for
> - * a locked page table
> + * Any of the following ensures page->mem_cgroup stability:
> *
> - * - a page cache insertion, a swapin fault, or a migration
> - * have the page locked
> + * - the page lock
> + * - LRU isolation
> + * - lock_page_memcg()
> + * - exclusive reference
> */
> page->mem_cgroup = memcg;
> }
>

2020-04-21 09:26:20

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> Right now, users that are otherwise memory controlled can easily
> escape their containment and allocate significant amounts of memory
> that they're not being charged for. That's because swap readahead
> pages are not being charged until somebody actually faults them into
> their page table. This can be exploited with MADV_WILLNEED, which
> triggers arbitrary readahead allocations without charging the pages.
>
> There are additional problems with the delayed charging of swap pages:
>
> 1. To implement refault/workingset detection for anonymous pages, we
> need to have a target LRU available at swapin time, but the LRU is
> not determinable until the page has been charged.
>
> 2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
> stable when the page is isolated from the LRU; otherwise, the locks
> change under us. But swapcache gets charged after it's already on
> the LRU, and even if we cannot isolate it ourselves (since charging
> is not exactly optional).
>
> The previous patch ensured we always maintain cgroup ownership records
> for swap pages. This patch moves the swapcache charging point from the
> fault handler to swapin time to fix all of the above problems.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Alex Shi <[email protected]>

2020-04-21 09:32:44

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> The swapaccount=0 boot option will continue to exist, and it will
> eliminate the page_counter overhead and hide the swap control files,
> but it won't disable swap slot ownership tracking.

May we add an extra explanation of this change for users? And of the
default memsw limitations?

>
> This patch makes sure we always have the cgroup records at swapin
> time; the next patch will fix the actual bug by charging readahead
> swap pages at swapin time rather than at fault time.
>
> Signed-off-by: Johannes Weiner <[email protected]>

2020-04-21 09:35:20

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 00/18] mm: memcontrol: charge swapin pages on instantiation



On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> This patch series reworks memcg to charge swapin pages directly at
> swapin time, rather than at fault time, which may be much later, or
> not happen at all.
>
> The delayed charging scheme we have right now causes problems:
>
> - Alex's per-cgroup lru_lock patches rely on pages that have been
> isolated from the LRU to have a stable page->mem_cgroup; otherwise
> the lock may change underneath him. Swapcache pages are charged only
> after they are added to the LRU, and charging doesn't follow the LRU
> isolation protocol.

Hi Johannes,

Thanks a lot!
It all looks fine to me. I will rebase the per-cgroup lru_lock series on this.
Thanks!

Alex

>
> - Joonsoo's anon workingset patches need a suitable LRU at the time
> the page enters the swap cache and displaces the non-resident
> info. But the correct LRU is only available after charging.
>
> - It's a containment hole / DoS vector. Users can trigger arbitrarily
> large swap readahead using MADV_WILLNEED. The memory is never
> charged unless somebody actually touches it.
>
> - It complicates the page->mem_cgroup stabilization rules
>
> In order to charge pages directly at swapin time, the memcg code base
> needs to be prepared, and several overdue cleanups become a necessity:
>
> To charge pages at swapin time, we need to always have cgroup
> ownership tracking of swap records. We also cannot rely on
> page->mapping to tell apart page types at charge time, because that's
> only set up during a page fault.
>
> To eliminate the page->mapping dependency, memcg needs to ditch its
> private page type counters (MEMCG_CACHE, MEMCG_RSS, NR_SHMEM) in favor
> of the generic vmstat counters and accounting sites, such as
> NR_FILE_PAGES, NR_ANON_MAPPED etc.
>
> To switch to generic vmstat counters, the charge sequence must be
> adjusted such that page->mem_cgroup is set up by the time these
> counters are modified.
>
> The series is structured as follows:
>
> 1. Bug fixes
> 2. Decoupling charging from rmap
> 3. Swap controller integration into memcg
> 4. Direct swapin charging
>
> The patches survive a simple swapout->swapin test inside a virtual
> machine. Because this is blocking two major patch sets, I'm sending
> these out early and will continue testing in parallel to the review.
>
> include/linux/memcontrol.h | 53 +----
> include/linux/mm.h | 4 +-
> include/linux/swap.h | 6 +-
> init/Kconfig | 17 +-
> kernel/events/uprobes.c | 10 +-
> mm/filemap.c | 43 ++---
> mm/huge_memory.c | 45 ++---
> mm/khugepaged.c | 25 +--
> mm/memcontrol.c | 448 ++++++++++++++-----------------------------
> mm/memory.c | 51 ++---
> mm/migrate.c | 20 +-
> mm/rmap.c | 53 +++--
> mm/shmem.c | 117 +++++------
> mm/swap_cgroup.c | 6 -
> mm/swap_state.c | 89 +++++----
> mm/swapfile.c | 25 +--
> mm/userfaultfd.c | 5 +-
> 17 files changed, 367 insertions(+), 650 deletions(-)
>

2020-04-21 14:36:12

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 18/18] mm: memcontrol: update page->mem_cgroup stability rules

On Tue, Apr 21, 2020 at 05:10:14PM +0800, Hillf Danton wrote:
>
> On Mon, 20 Apr 2020 18:11:26 -0400 Johannes Weiner wrote:
> >
> > The previous patches have simplified the access rules around
> > page->mem_cgroup somewhat:
> >
> > 1. We never change page->mem_cgroup while the page is isolated by
> > somebody else. This was by far the biggest exception to our rules
> > and it didn't stop at lock_page() or lock_page_memcg().
> >
> > 2. We charge pages before they get put into page tables now, so the
> > somewhat fishy rule about "can be in page table as long as it's
> > still locked" is now gone and boiled down to having an exclusive
> > reference to the page.
> >
> > Document the new rules. Any of the following will stabilize the
> > page->mem_cgroup association:
> >
> > - the page lock
> > - LRU isolation
> > - lock_page_memcg()
> > - exclusive access to the page
>
> Then rule-1 makes rule-3 no longer needed in mem_cgroup_move_account()?

Well, mem_cgroup_move_account() is the write side. It's the function
that changes page->mem_cgroup. So it needs to take all these locks in
order for the readside / fastpath to be okay with any one of them.
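
Very roughly, and glossing over the trylocks and where the LRU isolation is
actually done, the protocol looks like this (a sketch, not the real code):

	/* write side (mem_cgroup_move_account): hold every stabilizer */
	/* the caller has already isolated the page from the LRU */
	lock_page(page);
	lock_page_memcg(page);
	page->mem_cgroup = to;			/* the only place this changes */
	unlock_page_memcg(page);
	unlock_page(page);

	/* read side: any one of page lock, LRU isolation, lock_page_memcg()
	 * or an exclusive page reference is enough
	 */
	lock_page_memcg(page);
	memcg = page->mem_cgroup;		/* stable until unlock_page_memcg() */
	unlock_page_memcg(page);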

2020-04-21 14:41:14

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control

Hi Alex,

thanks for your quick review so far, I'll add the tags to the patches.

On Tue, Apr 21, 2020 at 05:27:30PM +0800, Alex Shi wrote:
>
>
> On 2020/4/21 6:11 AM, Johannes Weiner wrote:
> > The swapaccount=0 boot option will continue to exist, and it will
> > eliminate the page_counter overhead and hide the swap control files,
> > but it won't disable swap slot ownership tracking.
>
> May we add an extra explanation of this change for users? And of the
> default memsw limitations?

Can you elaborate what you think is missing and where you would like
to see it documented?

From a semantics POV, nothing changes with this patch. The memsw limit
defaults to "max", so it doesn't exert any control by default. The
only difference is whether we maintain swap records or not.

2020-04-21 19:15:25

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm: fix NUMA node file count error in replace_page_cache()

On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
>
> When replacing one page with another one in the cache, we have to
> decrease the file count of the old page's NUMA node and increase the
> one of the new NUMA node, otherwise the old node leaks the count and
> the new node eventually underflows its counter.
>
> Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2020-04-22 03:17:23

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control



On 2020/4/21 10:39 PM, Johannes Weiner wrote:
> Hi Alex,
>
> thanks for your quick review so far, I'll add the tags to the patches.
>
> On Tue, Apr 21, 2020 at 05:27:30PM +0800, Alex Shi wrote:
>>
>>
>> On 2020/4/21 6:11 AM, Johannes Weiner wrote:
>>> The swapaccount=0 boot option will continue to exist, and it will
>>> eliminate the page_counter overhead and hide the swap control files,
>>> but it won't disable swap slot ownership tracking.
>>
>> May we add an extra explanation of this change for users? And of the
>> default memsw limitations?
>
> Can you elaborate what you think is missing and where you would like
> to see it documented?
>
Maybe the following doc change would be better placed after the whole patchset?
I guess users would be happy to know the details of this change.

Also, as to the RSS accounting name change, I don't know if it's worth
polishing that in the docs.

Thanks
Alex

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..1fd0878089fe 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -199,11 +199,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
-A swapped-in page is not accounted until it's mapped.
+A swapped-in page is accounted when it is added to the swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
-This means swapped-in pages may contain pages for other tasks than a task
-causing page fault. So, we avoid accounting at swap-in I/O.
+Since the page's memcg is recorded in the swap entry whether or not memsw is
+enabled, the page will be accounted at swapin.

At page migration, accounting information is kept.

@@ -230,10 +230,10 @@ caller of swapoff rather than the users of shmem.
2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

-Swap Extension allows you to record charge for swap. A swapped-in page is
-charged back to original page allocator if possible.
+Swap usage is always recorded for each cgroup. The Swap Extension allows you
+to read and limit it.

-When swap is accounted, following files are added.
+When swap is limited, following files are added.

- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.

> From a semantics POV, nothing changes with this patch. The memsw limit
> defaults to "max", so it doesn't exert any control by default. The
> only difference is whether we maintain swap records or not.
>

2020-04-22 06:37:01

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/18] mm: fix NUMA node file count error in replace_page_cache()

On Mon, Apr 20, 2020 at 06:11:09PM -0400, Johannes Weiner wrote:
> When replacing one page with another one in the cache, we have to
> decrease the file count of the old page's NUMA node and increase the
> one of the new NUMA node, otherwise the old node leaks the count and
> the new node eventually underflows its counter.
>
> Fixes: 74d609585d8b ("page cache: Add and replace pages using the XArray")
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

Thanks.

2020-04-22 06:38:41

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

On Mon, Apr 20, 2020 at 06:11:10PM -0400, Johannes Weiner wrote:
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

Thanks.

2020-04-22 06:39:24

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 04/18] mm: memcontrol: move out cgroup swaprate throttling

On Mon, Apr 20, 2020 at 06:11:12PM -0400, Johannes Weiner wrote:
> The cgroup swaprate throttling is about matching new anon allocations
> to the rate of available IO when that is being throttled. It's the io
> controller hooking into the VM, rather than a memory controller thing.
>
> Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(),
> and drop the @memcg argument which is only used to check whether the
> preceding page charge has succeeded and the fault is proceeding.
>
> We could decouple the call from mem_cgroup_try_charge() here as well,
> but that would cause unnecessary churn: the following patches convert
> all callsites to a new charge API and we'll decouple as we go along.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:41:07

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 03/18] mm: memcontrol: drop @compound parameter from memcg charging API

On Mon, Apr 20, 2020 at 06:11:11PM -0400, Johannes Weiner wrote:
> The memcg charging API carries a boolean @compound parameter that
> tells whether the page we're dealing with is a hugepage.
> mem_cgroup_commit_charge() has another boolean @lrucare that indicates
> whether the page needs LRU locking or not while charging. The majority
> of callsites know those parameters at compile time, which results in a
> lot of naked "false, false" argument lists. This makes for cryptic
> code and is a breeding ground for subtle mistakes.
>
> Thankfully, the huge page state can be inferred from the page itself
> and doesn't need to be passed along. This is safe because charging
> completes before the page is published and somebody may split it.
>
> Simplify the callsites by removing @compound, and let memcg infer the
> state by using hpage_nr_pages() unconditionally. That function does
> PageTransHuge() to identify huge pages, which also helpfully asserts
> that nobody passes in tail pages by accident.
>
> The following patches will introduce a new charging API, best not to
> carry over unnecessary weight.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:42:42

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> The try/commit/cancel protocol that memcg uses dates back to when
> pages used to be uncharged upon removal from the page cache, and thus
> couldn't be committed before the insertion had succeeded. Nowadays,
> pages are uncharged when they are physically freed; it doesn't matter
> whether the insertion was successful or not. For the page cache, the
> transaction dance has become unnecessary.
>
> Introduce a mem_cgroup_charge() function that simply charges a newly
> allocated page to a cgroup and sets up page->mem_cgroup in one single
> step. If the insertion fails, the caller doesn't have to do anything
> but free/put the page.
>
> Then switch the page cache over to this new API.
>
> Subsequent patches will also convert anon pages, but it needs a bit
> more prep work. Right now, memcg depends on page->mapping being
> already set up at the time of charging, so that it can maintain its
> own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
> under the same pte lock under which the page is published, so a single
> charge point that can block doesn't work there just yet.
>
> The following prep patches will replace the private memcg counters
> with the generic vmstat counters, thus removing the page->mapping
> dependency, then complete the transition to the new single-point
> charge API and delete the old transactional scheme.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 10 ++++
> mm/filemap.c | 24 ++++------
> mm/memcontrol.c | 27 +++++++++++
> mm/shmem.c | 97 +++++++++++++++++---------------------
> 4 files changed, 89 insertions(+), 69 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c7875a48c8c1..5e8b0e38f145 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -367,6 +367,10 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
> void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> bool lrucare);
> void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
> +
> +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> + bool lrucare);
> +
> void mem_cgroup_uncharge(struct page *page);
> void mem_cgroup_uncharge_list(struct list_head *page_list);
>
> @@ -872,6 +876,12 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
> {
> }
>
> +static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> + gfp_t gfp_mask, bool lrucare)
> +{
> + return 0;
> +}
> +
> static inline void mem_cgroup_uncharge(struct page *page)
> {
> }
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 5b31af9d5b1b..5bdbda965177 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -832,7 +832,6 @@ static int __add_to_page_cache_locked(struct page *page,
> {
> XA_STATE(xas, &mapping->i_pages, offset);
> int huge = PageHuge(page);
> - struct mem_cgroup *memcg;
> int error;
> void *old;
>
> @@ -840,17 +839,16 @@ static int __add_to_page_cache_locked(struct page *page,
> VM_BUG_ON_PAGE(PageSwapBacked(page), page);
> mapping_set_update(&xas, mapping);
>
> - if (!huge) {
> - error = mem_cgroup_try_charge(page, current->mm,
> - gfp_mask, &memcg);
> - if (error)
> - return error;
> - }
> -
> get_page(page);
> page->mapping = mapping;
> page->index = offset;
>
> + if (!huge) {
> + error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
> + if (error)
> + goto error;
> + }
> +
> do {
> xas_lock_irq(&xas);
> old = xas_load(&xas);
> @@ -874,20 +872,18 @@ static int __add_to_page_cache_locked(struct page *page,
> xas_unlock_irq(&xas);
> } while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
>
> - if (xas_error(&xas))
> + if (xas_error(&xas)) {
> + error = xas_error(&xas);
> goto error;
> + }
>
> - if (!huge)
> - mem_cgroup_commit_charge(page, memcg, false);
> trace_mm_filemap_add_to_page_cache(page);
> return 0;
> error:
> page->mapping = NULL;
> /* Leave page->index set: truncation relies upon it */
> - if (!huge)
> - mem_cgroup_cancel_charge(page, memcg);
> put_page(page);
> - return xas_error(&xas);
> + return error;
> }
> ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 711d6dd5cbb1..b38c0a672d26 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6577,6 +6577,33 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
> cancel_charge(memcg, nr_pages);
> }
>
> +/**
> + * mem_cgroup_charge - charge a newly allocated page to a cgroup
> + * @page: page to charge
> + * @mm: mm context of the victim
> + * @gfp_mask: reclaim mode
> + * @lrucare: page might be on the LRU already
> + *
> + * Try to charge @page to the memcg that @mm belongs to, reclaiming
> + * pages according to @gfp_mask if necessary.
> + *
> + * Returns 0 on success. Otherwise, an error code is returned.
> + */
> +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> + bool lrucare)
> +{
> + struct mem_cgroup *memcg;
> + int ret;
> +
> + VM_BUG_ON_PAGE(!page->mapping, page);
> +
> + ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
> + if (ret)
> + return ret;
> + mem_cgroup_commit_charge(page, memcg, lrucare);
> + return 0;
> +}
> +
> struct uncharge_gather {
> struct mem_cgroup *memcg;
> unsigned long pgpgout;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 52c66801321e..2384f6c7ef71 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -605,11 +605,13 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
> */
> static int shmem_add_to_page_cache(struct page *page,
> struct address_space *mapping,
> - pgoff_t index, void *expected, gfp_t gfp)
> + pgoff_t index, void *expected, gfp_t gfp,
> + struct mm_struct *charge_mm)
> {
> XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
> unsigned long i = 0;
> unsigned long nr = compound_nr(page);
> + int error;
>
> VM_BUG_ON_PAGE(PageTail(page), page);
> VM_BUG_ON_PAGE(index != round_down(index, nr), page);
> @@ -621,6 +623,16 @@ static int shmem_add_to_page_cache(struct page *page,
> page->mapping = mapping;
> page->index = index;
>
> + error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
> + if (error) {
> + if (!PageSwapCache(page) && PageTransHuge(page)) {
> + count_vm_event(THP_FILE_FALLBACK);
> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + }
> + goto error;
> + }
> + cgroup_throttle_swaprate(page, gfp);
> +
> do {
> void *entry;
> xas_lock_irq(&xas);
> @@ -648,12 +660,15 @@ static int shmem_add_to_page_cache(struct page *page,
> } while (xas_nomem(&xas, gfp));
>
> if (xas_error(&xas)) {
> - page->mapping = NULL;
> - page_ref_sub(page, nr);
> - return xas_error(&xas);
> + error = xas_error(&xas);
> + goto error;
> }
>
> return 0;
> +error:
> + page->mapping = NULL;
> + page_ref_sub(page, nr);
> + return error;
> }
>
> /*
> @@ -1619,7 +1634,6 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> struct address_space *mapping = inode->i_mapping;
> struct shmem_inode_info *info = SHMEM_I(inode);
> struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
> - struct mem_cgroup *memcg;
> struct page *page;
> swp_entry_t swap;
> int error;
> @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> goto failed;
> }
>
> - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> - if (!error) {
> - error = shmem_add_to_page_cache(page, mapping, index,
> - swp_to_radix_entry(swap), gfp);
> - /*
> - * We already confirmed swap under page lock, and make
> - * no memory allocation here, so usually no possibility
> - * of error; but free_swap_and_cache() only trylocks a
> - * page, so it is just possible that the entry has been
> - * truncated or holepunched since swap was confirmed.
> - * shmem_undo_range() will have done some of the
> - * unaccounting, now delete_from_swap_cache() will do
> - * the rest.
> - */
> - if (error) {
> - mem_cgroup_cancel_charge(page, memcg);
> - delete_from_swap_cache(page);
> - }
> - }
> - if (error)
> + error = shmem_add_to_page_cache(page, mapping, index,
> + swp_to_radix_entry(swap), gfp,
> + charge_mm);
> + /*
> + * We already confirmed swap under page lock, and make no
> + * memory allocation here, so usually no possibility of error;
> + * but free_swap_and_cache() only trylocks a page, so it is
> + * just possible that the entry has been truncated or
> + * holepunched since swap was confirmed. shmem_undo_range()
> + * will have done some of the unaccounting, now
> + * delete_from_swap_cache() will do the rest.
> + */
> + if (error) {
> + delete_from_swap_cache(page);
> goto failed;

-EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
differently. delete_from_swap_cache() is for -EEXIST case.
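
Roughly, what I mean is something like the following in
shmem_swapin_page() -- just a sketch to illustrate the idea, assuming
the charge failure comes back as -ENOMEM and the lost race against
truncation/hole-punch as -EEXIST from shmem_add_to_page_cache():

	error = shmem_add_to_page_cache(page, mapping, index,
					swp_to_radix_entry(swap), gfp,
					charge_mm);
	if (error) {
		/*
		 * -EEXIST: the entry in i_pages changed under us, so
		 * finish the truncation's cleanup by dropping the page
		 * from the swap cache.  Other errors (like -ENOMEM from
		 * the memcg charge) should leave the swap cache alone.
		 */
		if (error == -EEXIST)
			delete_from_swap_cache(page);
		goto failed;
	}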

Thanks.

2020-04-22 06:43:13

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 07/18] mm: memcontrol: prepare move_account for removal of private page type counters

On Mon, Apr 20, 2020 at 06:11:15PM -0400, Johannes Weiner wrote:
> When memcg uses the generic vmstat counters, it doesn't need to do
> anything at charging and uncharging time. It does, however, need to
> migrate counts when pages move to a different cgroup in move_account.
>
> Prepare the move_account function for the arrival of NR_FILE_PAGES,
> NR_ANON_MAPPED, NR_ANON_THPS etc. by having a branch for files and a
> branch for anon, which can then be divided into sub-branches.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:43:32

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 06/18] mm: memcontrol: prepare uncharging for removal of private page type counters

On Mon, Apr 20, 2020 at 06:11:14PM -0400, Johannes Weiner wrote:
> The uncharge batching code adds up the anon, file, kmem counts to
> determine the total number of pages to uncharge and references to
> drop. But the next patches will remove the anon and file counters.
>
> Maintain an aggregate nr_pages in the uncharge_gather struct.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:44:38

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 08/18] mm: memcontrol: prepare cgroup vmstat infrastructure for native anon counters

On Mon, Apr 20, 2020 at 06:11:16PM -0400, Johannes Weiner wrote:
> Anonymous compound pages can be mapped by ptes, which means that if we
> want to track NR_ANON_MAPPED, NR_ANON_THPS on a per-cgroup basis, we
> have to be prepared to see tail pages in our accounting functions.
>
> Make mod_lruvec_page_state() and lock_page_memcg() deal with tail
> pages correctly, namely by redirecting to the head page which has the
> page->mem_cgroup set up.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:45:03

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 09/18] mm: memcontrol: switch to native NR_FILE_PAGES and NR_SHMEM counters

On Mon, Apr 20, 2020 at 06:11:17PM -0400, Johannes Weiner wrote:
> Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
> divergence from the generic VM accounting means unnecessary code
> overhead, and creates a dependency for memcg that page->mapping is set
> up at the time of charging, so that page types can be told apart.
>
> Convert the generic accounting sites to mod_lruvec_page_state and
> friends to maintain the per-cgroup vmstat counters of NR_FILE_PAGES
> and NR_SHMEM. The page is already locked in these places, so
> page->mem_cgroup is stable; we only need minimal tweaks of two
> mem_cgroup_migrate() calls to ensure it's set up in time.
>
> Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
> NR_SHMEM accounting sites.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-22 06:54:13

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 10/18] mm: memcontrol: switch to native NR_ANON_MAPPED counter

On Mon, Apr 20, 2020 at 06:11:18PM -0400, Johannes Weiner wrote:
> Memcg maintains a private MEMCG_RSS counter. This divergence from the
> generic VM accounting means unnecessary code overhead, and creates a
> dependency for memcg that page->mapping is set up at the time of
> charging, so that page types can be told apart.
>
> Convert the generic accounting sites to mod_lruvec_page_state and
> friends to maintain the per-cgroup vmstat counter of
> NR_ANON_MAPPED. We use lock_page_memcg() to stabilize page->mem_cgroup
> during rmap changes, the same way we do for NR_FILE_MAPPED.
>
> With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
> counter, this patch finally eliminates the need to have page->mapping
> set up at charge time.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 3 +--
> mm/memcontrol.c | 27 ++++++++--------------
> mm/rmap.c | 47 +++++++++++++++++++++++---------------
> 3 files changed, 39 insertions(+), 38 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index c44aa1ccf553..bfb1d961e346 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -29,8 +29,7 @@ struct kmem_cache;
>
> /* Cgroup-specific page state, on top of universal node page state */
> enum memcg_stat_item {
> - MEMCG_RSS = NR_VM_NODE_STAT_ITEMS,
> - MEMCG_RSS_HUGE,
> + MEMCG_RSS_HUGE = NR_VM_NODE_STAT_ITEMS,
> MEMCG_SWAP,
> MEMCG_SOCK,
> /* XXX: why are these zone and not node counters? */
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7e77166cf10b..c87178d6219f 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -836,13 +836,6 @@ static void mem_cgroup_charge_statistics(struct mem_cgroup *memcg,
> struct page *page,
> int nr_pages)
> {
> - /*
> - * Here, RSS means 'mapped anon' and anon's SwapCache. Shmem/tmpfs is
> - * counted as CACHE even if it's on ANON LRU.
> - */
> - if (PageAnon(page))
> - __mod_memcg_state(memcg, MEMCG_RSS, nr_pages);
> -
> if (abs(nr_pages) > 1) {
> VM_BUG_ON_PAGE(!PageTransHuge(page), page);
> __mod_memcg_state(memcg, MEMCG_RSS_HUGE, nr_pages);
> @@ -1384,7 +1377,7 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> */
>
> seq_buf_printf(&s, "anon %llu\n",
> - (u64)memcg_page_state(memcg, MEMCG_RSS) *
> + (u64)memcg_page_state(memcg, NR_ANON_MAPPED) *
> PAGE_SIZE);
> seq_buf_printf(&s, "file %llu\n",
> (u64)memcg_page_state(memcg, NR_FILE_PAGES) *
> @@ -3298,7 +3291,7 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
>
> if (mem_cgroup_is_root(memcg)) {
> val = memcg_page_state(memcg, NR_FILE_PAGES) +
> - memcg_page_state(memcg, MEMCG_RSS);
> + memcg_page_state(memcg, NR_ANON_MAPPED);
> if (swap)
> val += memcg_page_state(memcg, MEMCG_SWAP);
> } else {
> @@ -3768,7 +3761,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
>
> static const unsigned int memcg1_stats[] = {
> NR_FILE_PAGES,
> - MEMCG_RSS,
> + NR_ANON_MAPPED,
> MEMCG_RSS_HUGE,
> NR_SHMEM,
> NR_FILE_MAPPED,
> @@ -5395,7 +5388,12 @@ static int mem_cgroup_move_account(struct page *page,
>
> lock_page_memcg(page);
>
> - if (!PageAnon(page)) {
> + if (PageAnon(page)) {
> + if (page_mapped(page)) {

This page_mapped() check is newly inserted. Could you elaborate more
on why mem_cgroup_charge_statistics() doesn't need this check?

> + __mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
> + __mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
> + }
> + } else {
> __mod_lruvec_state(from_vec, NR_FILE_PAGES, -nr_pages);
> __mod_lruvec_state(to_vec, NR_FILE_PAGES, nr_pages);
>
> @@ -6529,7 +6527,6 @@ void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> {
> unsigned int nr_pages = hpage_nr_pages(page);
>
> - VM_BUG_ON_PAGE(!page->mapping, page);
> VM_BUG_ON_PAGE(PageLRU(page) && !lrucare, page);
>
> if (mem_cgroup_disabled())
> @@ -6602,8 +6599,6 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> struct mem_cgroup *memcg;
> int ret;
>
> - VM_BUG_ON_PAGE(!page->mapping, page);
> -
> ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
> if (ret)
> return ret;
> @@ -6615,7 +6610,6 @@ struct uncharge_gather {
> struct mem_cgroup *memcg;
> unsigned long nr_pages;
> unsigned long pgpgout;
> - unsigned long nr_anon;
> unsigned long nr_kmem;
> unsigned long nr_huge;
> struct page *dummy_page;
> @@ -6640,7 +6634,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
> }
>
> local_irq_save(flags);
> - __mod_memcg_state(ug->memcg, MEMCG_RSS, -ug->nr_anon);
> __mod_memcg_state(ug->memcg, MEMCG_RSS_HUGE, -ug->nr_huge);
> __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
> __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_pages);
> @@ -6682,8 +6675,6 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
> if (!PageKmemcg(page)) {
> if (PageTransHuge(page))
> ug->nr_huge += nr_pages;
> - if (PageAnon(page))
> - ug->nr_anon += nr_pages;
> ug->pgpgout++;
> } else {
> ug->nr_kmem += nr_pages;
> diff --git a/mm/rmap.c b/mm/rmap.c
> index f79a206b271a..150513d31efa 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1114,6 +1114,11 @@ void do_page_add_anon_rmap(struct page *page,
> bool compound = flags & RMAP_COMPOUND;
> bool first;
>
> + if (unlikely(PageKsm(page)))
> + lock_page_memcg(page);
> + else
> + VM_BUG_ON_PAGE(!PageLocked(page), page);
> +
> if (compound) {
> atomic_t *mapcount;
> VM_BUG_ON_PAGE(!PageLocked(page), page);
> @@ -1134,12 +1139,13 @@ void do_page_add_anon_rmap(struct page *page,
> */
> if (compound)
> __inc_node_page_state(page, NR_ANON_THPS);
> - __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
> + __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
> }
> - if (unlikely(PageKsm(page)))
> - return;
>
> - VM_BUG_ON_PAGE(!PageLocked(page), page);
> + if (unlikely(PageKsm(page))) {
> + unlock_page_memcg(page);
> + return;
> + }
>
> /* address might be in next vma when migration races vma_adjust */
> if (first)
> @@ -1181,7 +1187,7 @@ void page_add_new_anon_rmap(struct page *page,
> /* increment count (starts at -1) */
> atomic_set(&page->_mapcount, 0);
> }
> - __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
> + __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
> __page_set_anon_rmap(page, vma, address, 1);
> }

The memcg isn't set up yet here, so the accounting isn't applied to the
proper memcg; it would presumably land in the root memcg instead. With
this change we no longer need the mapping to commit the charge, so
switching the order of page_add_new_anon_rmap() and
mem_cgroup_commit_charge() would solve the issue.
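
Something like this in the fault path (e.g. do_anonymous_page()), just
as a sketch of the ordering I have in mind, not tested:

	/* set up page->mem_cgroup before the rmap/statistics update */
	mem_cgroup_commit_charge(page, memcg, false);
	page_add_new_anon_rmap(page, vma, vmf->address, false);
	lru_cache_add_active_or_unevictable(page, vma);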

Thanks.

2020-04-22 12:12:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Wed, Apr 22, 2020 at 03:40:41PM +0900, Joonsoo Kim wrote:
> On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> > The try/commit/cancel protocol that memcg uses dates back to when
> > pages used to be uncharged upon removal from the page cache, and thus
> > couldn't be committed before the insertion had succeeded. Nowadays,
> > pages are uncharged when they are physically freed; it doesn't matter
> > whether the insertion was successful or not. For the page cache, the
> > transaction dance has become unnecessary.
> >
> > Introduce a mem_cgroup_charge() function that simply charges a newly
> > allocated page to a cgroup and sets up page->mem_cgroup in one single
> > step. If the insertion fails, the caller doesn't have to do anything
> > but free/put the page.
> >
> > Then switch the page cache over to this new API.
> >
> > Subsequent patches will also convert anon pages, but it needs a bit
> > more prep work. Right now, memcg depends on page->mapping being
> > already set up at the time of charging, so that it can maintain its
> > own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
> > under the same pte lock under which the page is publishd, so a single
> > charge point that can block doesn't work there just yet.
> >
> > The following prep patches will replace the private memcg counters
> > with the generic vmstat counters, thus removing the page->mapping
> > dependency, then complete the transition to the new single-point
> > charge API and delete the old transactional scheme.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/memcontrol.h | 10 ++++
> > mm/filemap.c | 24 ++++------
> > mm/memcontrol.c | 27 +++++++++++
> > mm/shmem.c | 97 +++++++++++++++++---------------------
> > 4 files changed, 89 insertions(+), 69 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index c7875a48c8c1..5e8b0e38f145 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -367,6 +367,10 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
> > void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> > bool lrucare);
> > void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
> > +
> > +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> > + bool lrucare);
> > +
> > void mem_cgroup_uncharge(struct page *page);
> > void mem_cgroup_uncharge_list(struct list_head *page_list);
> >
> > @@ -872,6 +876,12 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
> > {
> > }
> >
> > +static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> > + gfp_t gfp_mask, bool lrucare)
> > +{
> > + return 0;
> > +}
> > +
> > static inline void mem_cgroup_uncharge(struct page *page)
> > {
> > }
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 5b31af9d5b1b..5bdbda965177 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -832,7 +832,6 @@ static int __add_to_page_cache_locked(struct page *page,
> > {
> > XA_STATE(xas, &mapping->i_pages, offset);
> > int huge = PageHuge(page);
> > - struct mem_cgroup *memcg;
> > int error;
> > void *old;
> >
> > @@ -840,17 +839,16 @@ static int __add_to_page_cache_locked(struct page *page,
> > VM_BUG_ON_PAGE(PageSwapBacked(page), page);
> > mapping_set_update(&xas, mapping);
> >
> > - if (!huge) {
> > - error = mem_cgroup_try_charge(page, current->mm,
> > - gfp_mask, &memcg);
> > - if (error)
> > - return error;
> > - }
> > -
> > get_page(page);
> > page->mapping = mapping;
> > page->index = offset;
> >
> > + if (!huge) {
> > + error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
> > + if (error)
> > + goto error;
> > + }
> > +
> > do {
> > xas_lock_irq(&xas);
> > old = xas_load(&xas);
> > @@ -874,20 +872,18 @@ static int __add_to_page_cache_locked(struct page *page,
> > xas_unlock_irq(&xas);
> > } while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
> >
> > - if (xas_error(&xas))
> > + if (xas_error(&xas)) {
> > + error = xas_error(&xas);
> > goto error;
> > + }
> >
> > - if (!huge)
> > - mem_cgroup_commit_charge(page, memcg, false);
> > trace_mm_filemap_add_to_page_cache(page);
> > return 0;
> > error:
> > page->mapping = NULL;
> > /* Leave page->index set: truncation relies upon it */
> > - if (!huge)
> > - mem_cgroup_cancel_charge(page, memcg);
> > put_page(page);
> > - return xas_error(&xas);
> > + return error;
> > }
> > ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 711d6dd5cbb1..b38c0a672d26 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -6577,6 +6577,33 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
> > cancel_charge(memcg, nr_pages);
> > }
> >
> > +/**
> > + * mem_cgroup_charge - charge a newly allocated page to a cgroup
> > + * @page: page to charge
> > + * @mm: mm context of the victim
> > + * @gfp_mask: reclaim mode
> > + * @lrucare: page might be on the LRU already
> > + *
> > + * Try to charge @page to the memcg that @mm belongs to, reclaiming
> > + * pages according to @gfp_mask if necessary.
> > + *
> > + * Returns 0 on success. Otherwise, an error code is returned.
> > + */
> > +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> > + bool lrucare)
> > +{
> > + struct mem_cgroup *memcg;
> > + int ret;
> > +
> > + VM_BUG_ON_PAGE(!page->mapping, page);
> > +
> > + ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
> > + if (ret)
> > + return ret;
> > + mem_cgroup_commit_charge(page, memcg, lrucare);
> > + return 0;
> > +}
> > +
> > struct uncharge_gather {
> > struct mem_cgroup *memcg;
> > unsigned long pgpgout;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 52c66801321e..2384f6c7ef71 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -605,11 +605,13 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
> > */
> > static int shmem_add_to_page_cache(struct page *page,
> > struct address_space *mapping,
> > - pgoff_t index, void *expected, gfp_t gfp)
> > + pgoff_t index, void *expected, gfp_t gfp,
> > + struct mm_struct *charge_mm)
> > {
> > XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
> > unsigned long i = 0;
> > unsigned long nr = compound_nr(page);
> > + int error;
> >
> > VM_BUG_ON_PAGE(PageTail(page), page);
> > VM_BUG_ON_PAGE(index != round_down(index, nr), page);
> > @@ -621,6 +623,16 @@ static int shmem_add_to_page_cache(struct page *page,
> > page->mapping = mapping;
> > page->index = index;
> >
> > + error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
> > + if (error) {
> > + if (!PageSwapCache(page) && PageTransHuge(page)) {
> > + count_vm_event(THP_FILE_FALLBACK);
> > + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> > + }
> > + goto error;
> > + }
> > + cgroup_throttle_swaprate(page, gfp);
> > +
> > do {
> > void *entry;
> > xas_lock_irq(&xas);
> > @@ -648,12 +660,15 @@ static int shmem_add_to_page_cache(struct page *page,
> > } while (xas_nomem(&xas, gfp));
> >
> > if (xas_error(&xas)) {
> > - page->mapping = NULL;
> > - page_ref_sub(page, nr);
> > - return xas_error(&xas);
> > + error = xas_error(&xas);
> > + goto error;
> > }
> >
> > return 0;
> > +error:
> > + page->mapping = NULL;
> > + page_ref_sub(page, nr);
> > + return error;
> > }
> >
> > /*
> > @@ -1619,7 +1634,6 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > struct address_space *mapping = inode->i_mapping;
> > struct shmem_inode_info *info = SHMEM_I(inode);
> > struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
> > - struct mem_cgroup *memcg;
> > struct page *page;
> > swp_entry_t swap;
> > int error;
> > @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > goto failed;
> > }
> >
> > - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> > - if (!error) {
> > - error = shmem_add_to_page_cache(page, mapping, index,
> > - swp_to_radix_entry(swap), gfp);
> > - /*
> > - * We already confirmed swap under page lock, and make
> > - * no memory allocation here, so usually no possibility
> > - * of error; but free_swap_and_cache() only trylocks a
> > - * page, so it is just possible that the entry has been
> > - * truncated or holepunched since swap was confirmed.
> > - * shmem_undo_range() will have done some of the
> > - * unaccounting, now delete_from_swap_cache() will do
> > - * the rest.
> > - */
> > - if (error) {
> > - mem_cgroup_cancel_charge(page, memcg);
> > - delete_from_swap_cache(page);
> > - }
> > - }
> > - if (error)
> > + error = shmem_add_to_page_cache(page, mapping, index,
> > + swp_to_radix_entry(swap), gfp,
> > + charge_mm);
> > + /*
> > + * We already confirmed swap under page lock, and make no
> > + * memory allocation here, so usually no possibility of error;
> > + * but free_swap_and_cache() only trylocks a page, so it is
> > + * just possible that the entry has been truncated or
> > + * holepunched since swap was confirmed. shmem_undo_range()
> > + * will have done some of the unaccounting, now
> > + * delete_from_swap_cache() will do the rest.
> > + */
> > + if (error) {
> > + delete_from_swap_cache(page);
> > goto failed;
>
> -EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
> differently. delete_from_swap_cache() is for -EEXIST case.

Good catch, I accidentally changed things here.

I was just going to change it back, but now I'm trying to understand
how it actually works.

Who is removing the page from swap cache if shmem_undo_range() races
but we fail to charge the page?

Here is how this race is supposed to be handled: The page is in the
swapcache, we have it locked and confirmed that the entry in i_pages
is indeed a swap entry. We charge the page, then we try to replace the
swap entry in i_pages with the actual page. If we determine, under
tree lock now, that shmem_undo_range has raced with us, unaccounted
the swap space, but must have failed to get the page lock, we remove
the page from swap cache on our side, to free up swap slot and page.

But what if shmem_undo_range() raced with us, deleted the swap entry
from i_pages while we had the page locked, but then we simply failed
to charge? We unlock the page and return -EEXIST (shmem_confirm_swap
at the exit). The page with its userdata is now in swapcache, but no
corresponding swap entry in i_pages. shmem_getpage_gfp() sees the
-EEXIST, retries, finds nothing in i_pages and allocates a new, empty
page.

Aren't we leaking the swap slot and the page?

2020-04-22 12:29:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 10/18] mm: memcontrol: switch to native NR_ANON_MAPPED counter

Hello Joonsoo,

On Wed, Apr 22, 2020 at 03:51:52PM +0900, Joonsoo Kim wrote:
> On Mon, Apr 20, 2020 at 06:11:18PM -0400, Johannes Weiner wrote:
> > @@ -3768,7 +3761,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
> >
> > static const unsigned int memcg1_stats[] = {
> > NR_FILE_PAGES,
> > - MEMCG_RSS,
> > + NR_ANON_MAPPED,
> > MEMCG_RSS_HUGE,
> > NR_SHMEM,
> > NR_FILE_MAPPED,
> > @@ -5395,7 +5388,12 @@ static int mem_cgroup_move_account(struct page *page,
> >
> > lock_page_memcg(page);
> >
> > - if (!PageAnon(page)) {
> > + if (PageAnon(page)) {
> > + if (page_mapped(page)) {
>
> This page_mapped() check is newly inserted. Could you elaborate more
> on why mem_cgroup_charge_statistics() doesn't need this check?

MEMCG_RSS covered the page from the moment it was charged until it was
uncharged, but NR_ANON_MAPPED is only counted while the page is really
mapped into page tables. That window starts shortly after we charge and
ends shortly before we uncharge, so pages can move between cgroups
before or after they are mapped, while they aren't counted in NR_ANON_MAPPED.

So to know that the page is counted, charge_statistics() only needed
to know that the page is charged and Anon; move_account() also needs
to know that the page is mapped.
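
To put it another way, here is a toy userspace model of that window
(illustration only, with made-up structs and names - obviously not how
the kernel code is structured):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct memcg { long nr_anon_mapped; };

struct page {
	struct memcg *memcg;	/* set when the page is charged */
	bool mapped;		/* NR_ANON_MAPPED only held while true */
};

static void map_page(struct page *p)
{
	p->mapped = true;
	p->memcg->nr_anon_mapped++;
}

static void unmap_page(struct page *p)
{
	p->mapped = false;
	p->memcg->nr_anon_mapped--;
}

/* only transfer the count if the page currently holds it */
static void move_account(struct page *p, struct memcg *to)
{
	if (p->mapped) {
		p->memcg->nr_anon_mapped--;
		to->nr_anon_mapped++;
	}
	p->memcg = to;
}

int main(void)
{
	struct memcg from = { 0 }, to = { 0 };
	struct page page = { .memcg = &from, .mapped = false };

	/* charged but not yet mapped: nobody holds the count yet */
	move_account(&page, &to);  /* would underflow 'from' without the check */
	map_page(&page);
	unmap_page(&page);

	assert(from.nr_anon_mapped == 0 && to.nr_anon_mapped == 0);
	printf("from=%ld to=%ld\n", from.nr_anon_mapped, to.nr_anon_mapped);
	return 0;
}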

> > @@ -1181,7 +1187,7 @@ void page_add_new_anon_rmap(struct page *page,
> > /* increment count (starts at -1) */
> > atomic_set(&page->_mapcount, 0);
> > }
> > - __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
> > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
> > __page_set_anon_rmap(page, vma, address, 1);
> > }
>
> The memcg isn't set up yet here, so the accounting isn't applied to the
> proper memcg; it would presumably land in the root memcg instead. With
> this change we no longer need the mapping to commit the charge, so
> switching the order of page_add_new_anon_rmap() and
> mem_cgroup_commit_charge() would solve the issue.

Good catch, it's that dreaded circular dependency. It's fixed two
patches down when I charge anon pages earlier as well. But I'll change
the rmap<->commit order in this patch to avoid the temporary bug.

Thanks for your thorough review!

2020-04-22 13:32:59

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control

On Wed, Apr 22, 2020 at 11:14:40AM +0800, Alex Shi wrote:
>
>
> On 2020/4/21 at 10:39 PM, Johannes Weiner wrote:
> > Hi Alex,
> >
> > thanks for your quick review so far, I'll add the tags to the patches.
> >
> > On Tue, Apr 21, 2020 at 05:27:30PM +0800, Alex Shi wrote:
> >>
> >>
> >> On 2020/4/21 at 6:11 AM, Johannes Weiner wrote:
> >>> The swapaccount=0 boot option will continue to exist, and it will
> >>> eliminate the page_counter overhead and hide the swap control files,
> >>> but it won't disable swap slot ownership tracking.
> >>
> >> Could we add an extra explanation of this change for users? And what about
> >> the default memsw limitations?
> >
> > Can you elaborate what you think is missing and where you would like
> > to see it documented?
> >
> Maybe the following doc change is better after the whole patchset?
> I guess users would be happy to know the details of this change.

Thanks, I stole your patch and extended/tweaked it a little. Would you
mind providing your Signed-off-by:?

From 589d3c1b505e6671b4a9b424436c9eda88a0b08c Mon Sep 17 00:00:00 2001
From: Alex Shi <[email protected]>
Date: Wed, 22 Apr 2020 11:14:40 +0800
Subject: [PATCH] mm: memcontrol: document the new swap control behavior

Signed-off-by: Johannes Weiner <[email protected]>
---
.../admin-guide/cgroup-v1/memory.rst | 19 +++++++------------
1 file changed, 7 insertions(+), 12 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 0ae4f564c2d6..12757e63b26c 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -199,11 +199,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from radix-tree. Even if RSS pages are fully
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. Such SwapCaches are also accounted.
-A swapped-in page is not accounted until it's mapped.
+A swapped-in page is accounted once it is added to the swap cache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.
-This means swapped-in pages may contain pages for other tasks than a task
-causing page fault. So, we avoid accounting at swap-in I/O.
+Since the page's memcg is recorded into swap regardless of memsw, the page
+will be accounted after swapin.

At page migration, accounting information is kept.

@@ -222,18 +222,13 @@ the cgroup that brought it in -- this will happen on memory pressure).
But see section 8.2: when moving a task to another cgroup, its pages may
be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.

-Exception: If CONFIG_MEMCG_SWAP is not used.
-When you do swapoff and make swapped-out pages of shmem(tmpfs) to
-be backed into memory in force, charges for pages are accounted against the
-caller of swapoff rather than the users of shmem.
-
-2.4 Swap Extension (CONFIG_MEMCG_SWAP)
+2.4 Swap Extension
--------------------------------------

-Swap Extension allows you to record charge for swap. A swapped-in page is
-charged back to original page allocator if possible.
+Swap usage is always recorded for each cgroup. Swap Extension allows you to
+read and limit it.

-When swap is accounted, following files are added.
+When CONFIG_SWAP is enabled, following files are added.

- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
--



> Also, as to the RSS accounting name change, I don't know if it's good to
> polish them in the docs.

I didn't actually change anything user-visible, just the internal name
of the counters:

static const unsigned int memcg1_stats[] = {
NR_FILE_PAGES, /* was MEMCG_CACHE */
NR_ANON_MAPPED, /* was MEMCG_RSS */
NR_ANON_THPS, /* was MEMCG_RSS_HUGE */
NR_SHMEM,
NR_FILE_MAPPED,
NR_FILE_DIRTY,
NR_WRITEBACK,
MEMCG_SWAP,
};

static const char *const memcg1_stat_names[] = {
"cache",
"rss",
"rss_huge",
"shmem",
"mapped_file",
"dirty",
"writeback",
"swap",
};

Or did you refer to something else?

2020-04-22 13:43:39

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control



On 2020/4/22 at 9:30 PM, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 11:14:40AM +0800, Alex Shi wrote:
>>
>>
>>> On 2020/4/21 at 10:39 PM, Johannes Weiner wrote:
>>> Hi Alex,
>>>
>>> thanks for your quick review so far, I'll add the tags to the patches.
>>>
>>> On Tue, Apr 21, 2020 at 05:27:30PM +0800, Alex Shi wrote:
>>>>
>>>>
>>>>> On 2020/4/21 at 6:11 AM, Johannes Weiner wrote:
>>>>> The swapaccount=0 boot option will continue to exist, and it will
>>>>> eliminate the page_counter overhead and hide the swap control files,
>>>>> but it won't disable swap slot ownership tracking.
>>>>
> >>>> Could we add an extra explanation of this change for users? And what about
> >>>> the default memsw limitations?
>>>
>>> Can you elaborate what you think is missing and where you would like
>>> to see it documented?
>>>
>> Maybe the following doc change is better after the whole patchset?
>> I guess users would be happy to know the details of this change.
>
> Thanks, I stole your patch and extended/tweaked it a little. Would you
> mind providing your Signed-off-by:?

My pleasure. :)

Signed-off-by: Alex Shi <[email protected]>

>
> From 589d3c1b505e6671b4a9b424436c9eda88a0b08c Mon Sep 17 00:00:00 2001
> From: Alex Shi <[email protected]>
> Date: Wed, 22 Apr 2020 11:14:40 +0800
> Subject: [PATCH] mm: memcontrol: document the new swap control behavior
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> .../admin-guide/cgroup-v1/memory.rst | 19 +++++++------------
> 1 file changed, 7 insertions(+), 12 deletions(-)
>
> diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
> index 0ae4f564c2d6..12757e63b26c 100644
> --- a/Documentation/admin-guide/cgroup-v1/memory.rst
> +++ b/Documentation/admin-guide/cgroup-v1/memory.rst
> @@ -199,11 +199,11 @@ An RSS page is unaccounted when it's fully unmapped. A PageCache page is
> unaccounted when it's removed from radix-tree. Even if RSS pages are fully
> unmapped (by kswapd), they may exist as SwapCache in the system until they
> are really freed. Such SwapCaches are also accounted.
> -A swapped-in page is not accounted until it's mapped.
> +A swapped-in page is accounted once it is added to the swap cache.
>
> Note: The kernel does swapin-readahead and reads multiple swaps at once.
> -This means swapped-in pages may contain pages for other tasks than a task
> -causing page fault. So, we avoid accounting at swap-in I/O.
> +Since the page's memcg is recorded into swap regardless of memsw, the page
> +will be accounted after swapin.
>
> At page migration, accounting information is kept.
>
> @@ -222,18 +222,13 @@ the cgroup that brought it in -- this will happen on memory pressure).
> But see section 8.2: when moving a task to another cgroup, its pages may
> be recharged to the new cgroup, if move_charge_at_immigrate has been chosen.
>
> -Exception: If CONFIG_MEMCG_SWAP is not used.
> -When you do swapoff and make swapped-out pages of shmem(tmpfs) to
> -be backed into memory in force, charges for pages are accounted against the
> -caller of swapoff rather than the users of shmem.
> -
> -2.4 Swap Extension (CONFIG_MEMCG_SWAP)
> +2.4 Swap Extension
> --------------------------------------
>
> -Swap Extension allows you to record charge for swap. A swapped-in page is
> -charged back to original page allocator if possible.
> +Swap usage is always recorded for each cgroup. Swap Extension allows you to
> +read and limit it.
>
> -When swap is accounted, following files are added.
> +When CONFIG_SWAP is enabled, following files are added.
>
> - memory.memsw.usage_in_bytes.
> - memory.memsw.limit_in_bytes.
>

2020-04-22 14:40:28

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control



On 2020/4/22 at 9:30 PM, Johannes Weiner wrote:
>> Also, as to the RSS accounting name change, I don't know if it's good to
>> polish them in the docs.
> I didn't actually change anything user-visible, just the internal name
> of the counters:
>
> static const unsigned int memcg1_stats[] = {
> NR_FILE_PAGES, /* was MEMCG_CACHE */
> NR_ANON_MAPPED, /* was MEMCG_RSS */
> NR_ANON_THPS, /* was MEMCG_RSS_HUGE */
> NR_SHMEM,
> NR_FILE_MAPPED,
> NR_FILE_DIRTY,
> NR_WRITEBACK,
> MEMCG_SWAP,
> };
>
> static const char *const memcg1_stat_names[] = {
> "cache",
> "rss",
> "rss_huge",
> "shmem",
> "mapped_file",
> "dirty",
> "writeback",
> "swap",
> };
>
> Or did you refer to something else?

With the 'was MEMCG_RSS' etc. comments, I believe a curious user would know
where the concept comes from. :)

Thanks for these comments!
Alex

2020-04-22 16:55:10

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
>
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 26 ++++++++++++++------------
> 1 file changed, 14 insertions(+), 12 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5beea03dd58a..41f5ed79272e 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -5372,7 +5372,6 @@ static int mem_cgroup_move_account(struct page *page,
> {
> struct lruvec *from_vec, *to_vec;
> struct pglist_data *pgdat;
> - unsigned long flags;
> unsigned int nr_pages = compound ? hpage_nr_pages(page) : 1;
> int ret;
> bool anon;
> @@ -5399,18 +5398,13 @@ static int mem_cgroup_move_account(struct page *page,
> from_vec = mem_cgroup_lruvec(from, pgdat);
> to_vec = mem_cgroup_lruvec(to, pgdat);
>
> - spin_lock_irqsave(&from->move_lock, flags);
> + lock_page_memcg(page);
>
> if (!anon && page_mapped(page)) {
> __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
> __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);
> }
>
> - /*
> - * move_lock grabbed above and caller set from->moving_account, so
> - * mod_memcg_page_state will serialize updates to PageDirty.
> - * So mapping should be stable for dirty pages.
> - */
> if (!anon && PageDirty(page)) {
> struct address_space *mapping = page_mapping(page);
>
> @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
> }
>
> /*
> + * All state has been migrated, let's switch to the new memcg.
> + *
> * It is safe to change page->mem_cgroup here because the page
> - * is referenced, charged, and isolated - we can't race with
> - * uncharging, charging, migration, or LRU putback.
> + * is referenced, charged, isolated, and locked: we can't race
> + * with (un)charging, migration, LRU putback, or anything else
> + * that would rely on a stable page->mem_cgroup.
> + *
> + * Note that lock_page_memcg is a memcg lock, not a page lock,
> + * to save space. As soon as we switch page->mem_cgroup to a
> + * new memcg that isn't locked, the above state can change
> + * concurrently again. Make sure we're truly done with it.
> */
> + smp_mb();

You said theoretical race in the subject but the above comment
convinced me that smp_mb() is required. So, why is the race still
theoretical?

>
> - /* caller should have done css_get */
> - page->mem_cgroup = to;
> + page->mem_cgroup = to; /* caller should have done css_get */
>
> - spin_unlock_irqrestore(&from->move_lock, flags);
> + __unlock_page_memcg(from);
>
> ret = 0;
>
> --
> 2.26.0
>

2020-04-22 17:32:51

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 03/18] mm: memcontrol: drop @compound parameter from memcg charging API

On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
>
> The memcg charging API carries a boolean @compound parameter that
> tells whether the page we're dealing with is a hugepage.
> mem_cgroup_commit_charge() has another boolean @lrucare that indicates
> whether the page needs LRU locking or not while charging. The majority
> of callsites know those parameters at compile time, which results in a
> lot of naked "false, false" argument lists. This makes for cryptic
> code and is a breeding ground for subtle mistakes.
>
> Thankfully, the huge page state can be inferred from the page itself
> and doesn't need to be passed along. This is safe because charging
> completes before the page is published and somebody may split it.
>
> Simplify the callsites by removing @compound, and let memcg infer the
> state by using hpage_nr_pages() unconditionally. That function does
> PageTransHuge() to identify huge pages, which also helpfully asserts
> that nobody passes in tail pages by accident.
>
> The following patches will introduce a new charging API, best not to
> carry over unnecessary weight.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2020-04-22 17:44:47

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

On Wed, Apr 22, 2020 at 09:51:20AM -0700, Shakeel Butt wrote:
> On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
> > @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
> > }
> >
> > /*
> > + * All state has been migrated, let's switch to the new memcg.
> > + *
> > * It is safe to change page->mem_cgroup here because the page
> > - * is referenced, charged, and isolated - we can't race with
> > - * uncharging, charging, migration, or LRU putback.
> > + * is referenced, charged, isolated, and locked: we can't race
> > + * with (un)charging, migration, LRU putback, or anything else
> > + * that would rely on a stable page->mem_cgroup.
> > + *
> > + * Note that lock_page_memcg is a memcg lock, not a page lock,
> > + * to save space. As soon as we switch page->mem_cgroup to a
> > + * new memcg that isn't locked, the above state can change
> > + * concurrently again. Make sure we're truly done with it.
> > */
> > + smp_mb();
>
> You said theoretical race in the subject but the above comment
> convinced me that smp_mb() is required. So, why is the race still
> theoretical?

Sorry about the confusion.

I said theoretical because I spotted it while thinking about the
code. I'm not aware of any real users that suffered the consequences
of this race condition. But they could exist in theory :-)

I think it's a real bug that needs fixing.

2020-04-22 18:03:47

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

On Wed, Apr 22, 2020 at 10:42 AM Johannes Weiner <[email protected]> wrote:
>
> On Wed, Apr 22, 2020 at 09:51:20AM -0700, Shakeel Butt wrote:
> > On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
> > > @@ -5426,15 +5420,23 @@ static int mem_cgroup_move_account(struct page *page,
> > > }
> > >
> > > /*
> > > + * All state has been migrated, let's switch to the new memcg.
> > > + *
> > > * It is safe to change page->mem_cgroup here because the page
> > > - * is referenced, charged, and isolated - we can't race with
> > > - * uncharging, charging, migration, or LRU putback.
> > > + * is referenced, charged, isolated, and locked: we can't race
> > > + * with (un)charging, migration, LRU putback, or anything else
> > > + * that would rely on a stable page->mem_cgroup.
> > > + *
> > > + * Note that lock_page_memcg is a memcg lock, not a page lock,
> > > + * to save space. As soon as we switch page->mem_cgroup to a
> > > + * new memcg that isn't locked, the above state can change
> > > + * concurrently again. Make sure we're truly done with it.
> > > */
> > > + smp_mb();
> >
> > You said theoretical race in the subject but the above comment
> > convinced me that smp_mb() is required. So, why is the race still
> > theoretical?
>
> Sorry about the confusion.
>
> I said theoretical because I spotted it while thinking about the
> code. I'm not aware of any real users that suffered the consequences
> of this race condition. But they could exist in theory :-)
>
> I think it's a real bug that needs fixing.

Thanks for the clarification. I would suggest removing "theoretical"
from the subject as it undermines that a real bug is fixed by the
patch.

2020-04-22 18:05:08

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 02/18] mm: memcontrol: fix theoretical race in charge moving

On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
>
> The move_lock is a per-memcg lock, but the VM accounting code that
> needs to acquire it comes from the page and follows page->mem_cgroup
> under RCU protection. That means that the page becomes unlocked not
> when we drop the move_lock, but when we update page->mem_cgroup. And
> that assignment doesn't imply any memory ordering. If that pointer
> write gets reordered against the reads of the page state -
> page_mapped, PageDirty etc. the state may change while we rely on it
> being stable and we can end up corrupting the counters.
>
> Place an SMP memory barrier to make sure we're done with all page
> state by the time the new page->mem_cgroup becomes visible.
>
> Also replace the open-coded move_lock with a lock_page_memcg() to make
> it more obvious what we're serializing against.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2020-04-22 22:22:04

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH 04/18] mm: memcontrol: move out cgroup swaprate throttling

On Mon, Apr 20, 2020 at 3:11 PM Johannes Weiner <[email protected]> wrote:
>
> The cgroup swaprate throttling is about matching new anon allocations
> to the rate of available IO when that is being throttled. It's the io
> controller hooking into the VM, rather than a memory controller thing.
>
> Rename mem_cgroup_throttle_swaprate() to cgroup_throttle_swaprate(),
> and drop the @memcg argument which is only used to check whether the
> preceding page charge has succeeded and the fault is proceeding.
>
> We could decouple the call from mem_cgroup_try_charge() here as well,
> but that would cause unnecessary churn: the following patches convert
> all callsites to a new charge API and we'll decouple as we go along.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2020-04-23 05:27:56

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Wed, Apr 22, 2020 at 08:09:46AM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 03:40:41PM +0900, Joonsoo Kim wrote:
> > On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> > > The try/commit/cancel protocol that memcg uses dates back to when
> > > pages used to be uncharged upon removal from the page cache, and thus
> > > couldn't be committed before the insertion had succeeded. Nowadays,
> > > pages are uncharged when they are physically freed; it doesn't matter
> > > whether the insertion was successful or not. For the page cache, the
> > > transaction dance has become unnecessary.
> > >
> > > Introduce a mem_cgroup_charge() function that simply charges a newly
> > > allocated page to a cgroup and sets up page->mem_cgroup in one single
> > > step. If the insertion fails, the caller doesn't have to do anything
> > > but free/put the page.
> > >
> > > Then switch the page cache over to this new API.
> > >
> > > Subsequent patches will also convert anon pages, but it needs a bit
> > > more prep work. Right now, memcg depends on page->mapping being
> > > already set up at the time of charging, so that it can maintain its
> > > own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
> > > under the same pte lock under which the page is publishd, so a single
> > > charge point that can block doesn't work there just yet.
> > >
> > > The following prep patches will replace the private memcg counters
> > > with the generic vmstat counters, thus removing the page->mapping
> > > dependency, then complete the transition to the new single-point
> > > charge API and delete the old transactional scheme.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > include/linux/memcontrol.h | 10 ++++
> > > mm/filemap.c | 24 ++++------
> > > mm/memcontrol.c | 27 +++++++++++
> > > mm/shmem.c | 97 +++++++++++++++++---------------------
> > > 4 files changed, 89 insertions(+), 69 deletions(-)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index c7875a48c8c1..5e8b0e38f145 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -367,6 +367,10 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
> > > void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
> > > bool lrucare);
> > > void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
> > > +
> > > +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> > > + bool lrucare);
> > > +
> > > void mem_cgroup_uncharge(struct page *page);
> > > void mem_cgroup_uncharge_list(struct list_head *page_list);
> > >
> > > @@ -872,6 +876,12 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
> > > {
> > > }
> > >
> > > +static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
> > > + gfp_t gfp_mask, bool lrucare)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > static inline void mem_cgroup_uncharge(struct page *page)
> > > {
> > > }
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 5b31af9d5b1b..5bdbda965177 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -832,7 +832,6 @@ static int __add_to_page_cache_locked(struct page *page,
> > > {
> > > XA_STATE(xas, &mapping->i_pages, offset);
> > > int huge = PageHuge(page);
> > > - struct mem_cgroup *memcg;
> > > int error;
> > > void *old;
> > >
> > > @@ -840,17 +839,16 @@ static int __add_to_page_cache_locked(struct page *page,
> > > VM_BUG_ON_PAGE(PageSwapBacked(page), page);
> > > mapping_set_update(&xas, mapping);
> > >
> > > - if (!huge) {
> > > - error = mem_cgroup_try_charge(page, current->mm,
> > > - gfp_mask, &memcg);
> > > - if (error)
> > > - return error;
> > > - }
> > > -
> > > get_page(page);
> > > page->mapping = mapping;
> > > page->index = offset;
> > >
> > > + if (!huge) {
> > > + error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
> > > + if (error)
> > > + goto error;
> > > + }
> > > +
> > > do {
> > > xas_lock_irq(&xas);
> > > old = xas_load(&xas);
> > > @@ -874,20 +872,18 @@ static int __add_to_page_cache_locked(struct page *page,
> > > xas_unlock_irq(&xas);
> > > } while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));
> > >
> > > - if (xas_error(&xas))
> > > + if (xas_error(&xas)) {
> > > + error = xas_error(&xas);
> > > goto error;
> > > + }
> > >
> > > - if (!huge)
> > > - mem_cgroup_commit_charge(page, memcg, false);
> > > trace_mm_filemap_add_to_page_cache(page);
> > > return 0;
> > > error:
> > > page->mapping = NULL;
> > > /* Leave page->index set: truncation relies upon it */
> > > - if (!huge)
> > > - mem_cgroup_cancel_charge(page, memcg);
> > > put_page(page);
> > > - return xas_error(&xas);
> > > + return error;
> > > }
> > > ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);
> > >
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 711d6dd5cbb1..b38c0a672d26 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -6577,6 +6577,33 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
> > > cancel_charge(memcg, nr_pages);
> > > }
> > >
> > > +/**
> > > + * mem_cgroup_charge - charge a newly allocated page to a cgroup
> > > + * @page: page to charge
> > > + * @mm: mm context of the victim
> > > + * @gfp_mask: reclaim mode
> > > + * @lrucare: page might be on the LRU already
> > > + *
> > > + * Try to charge @page to the memcg that @mm belongs to, reclaiming
> > > + * pages according to @gfp_mask if necessary.
> > > + *
> > > + * Returns 0 on success. Otherwise, an error code is returned.
> > > + */
> > > +int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
> > > + bool lrucare)
> > > +{
> > > + struct mem_cgroup *memcg;
> > > + int ret;
> > > +
> > > + VM_BUG_ON_PAGE(!page->mapping, page);
> > > +
> > > + ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
> > > + if (ret)
> > > + return ret;
> > > + mem_cgroup_commit_charge(page, memcg, lrucare);
> > > + return 0;
> > > +}
> > > +
> > > struct uncharge_gather {
> > > struct mem_cgroup *memcg;
> > > unsigned long pgpgout;
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index 52c66801321e..2384f6c7ef71 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -605,11 +605,13 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
> > > */
> > > static int shmem_add_to_page_cache(struct page *page,
> > > struct address_space *mapping,
> > > - pgoff_t index, void *expected, gfp_t gfp)
> > > + pgoff_t index, void *expected, gfp_t gfp,
> > > + struct mm_struct *charge_mm)
> > > {
> > > XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
> > > unsigned long i = 0;
> > > unsigned long nr = compound_nr(page);
> > > + int error;
> > >
> > > VM_BUG_ON_PAGE(PageTail(page), page);
> > > VM_BUG_ON_PAGE(index != round_down(index, nr), page);
> > > @@ -621,6 +623,16 @@ static int shmem_add_to_page_cache(struct page *page,
> > > page->mapping = mapping;
> > > page->index = index;
> > >
> > > + error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
> > > + if (error) {
> > > + if (!PageSwapCache(page) && PageTransHuge(page)) {
> > > + count_vm_event(THP_FILE_FALLBACK);
> > > + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> > > + }
> > > + goto error;
> > > + }
> > > + cgroup_throttle_swaprate(page, gfp);
> > > +
> > > do {
> > > void *entry;
> > > xas_lock_irq(&xas);
> > > @@ -648,12 +660,15 @@ static int shmem_add_to_page_cache(struct page *page,
> > > } while (xas_nomem(&xas, gfp));
> > >
> > > if (xas_error(&xas)) {
> > > - page->mapping = NULL;
> > > - page_ref_sub(page, nr);
> > > - return xas_error(&xas);
> > > + error = xas_error(&xas);
> > > + goto error;
> > > }
> > >
> > > return 0;
> > > +error:
> > > + page->mapping = NULL;
> > > + page_ref_sub(page, nr);
> > > + return error;
> > > }
> > >
> > > /*
> > > @@ -1619,7 +1634,6 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > > struct address_space *mapping = inode->i_mapping;
> > > struct shmem_inode_info *info = SHMEM_I(inode);
> > > struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
> > > - struct mem_cgroup *memcg;
> > > struct page *page;
> > > swp_entry_t swap;
> > > int error;
> > > @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > > goto failed;
> > > }
> > >
> > > - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> > > - if (!error) {
> > > - error = shmem_add_to_page_cache(page, mapping, index,
> > > - swp_to_radix_entry(swap), gfp);
> > > - /*
> > > - * We already confirmed swap under page lock, and make
> > > - * no memory allocation here, so usually no possibility
> > > - * of error; but free_swap_and_cache() only trylocks a
> > > - * page, so it is just possible that the entry has been
> > > - * truncated or holepunched since swap was confirmed.
> > > - * shmem_undo_range() will have done some of the
> > > - * unaccounting, now delete_from_swap_cache() will do
> > > - * the rest.
> > > - */
> > > - if (error) {
> > > - mem_cgroup_cancel_charge(page, memcg);
> > > - delete_from_swap_cache(page);
> > > - }
> > > - }
> > > - if (error)
> > > + error = shmem_add_to_page_cache(page, mapping, index,
> > > + swp_to_radix_entry(swap), gfp,
> > > + charge_mm);
> > > + /*
> > > + * We already confirmed swap under page lock, and make no
> > > + * memory allocation here, so usually no possibility of error;
> > > + * but free_swap_and_cache() only trylocks a page, so it is
> > > + * just possible that the entry has been truncated or
> > > + * holepunched since swap was confirmed. shmem_undo_range()
> > > + * will have done some of the unaccounting, now
> > > + * delete_from_swap_cache() will do the rest.
> > > + */
> > > + if (error) {
> > > + delete_from_swap_cache(page);
> > > goto failed;
> >
> > -EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
> > differently. delete_from_swap_cache() is for -EEXIST case.
>
> Good catch, I accidentally changed things here.
>
> I was just going to change it back, but now I'm trying to understand
> how it actually works.
>
> Who is removing the page from swap cache if shmem_undo_range() races
> but we fail to charge the page?
>
> Here is how this race is supposed to be handled: The page is in the
> swapcache, we have it locked and confirmed that the entry in i_pages
> is indeed a swap entry. We charge the page, then we try to replace the
> swap entry in i_pages with the actual page. If we determine, under
> tree lock now, that shmem_undo_range has raced with us, unaccounted
> the swap space, but must have failed to get the page lock, we remove
> the page from swap cache on our side, to free up swap slot and page.
>
> But what if shmem_undo_range() raced with us, deleted the swap entry
> from i_pages while we had the page locked, but then we simply failed
> to charge? We unlock the page and return -EEXIST (shmem_confirm_swap
> at the exit). The page with its userdata is now in swapcache, but no
> corresponding swap entry in i_pages. shmem_getpage_gfp() sees the
> -EEXIST, retries, finds nothing in i_pages and allocates a new, empty
> page.
>
> Aren't we leaking the swap slot and the page?

Yes, you're right! It seems that it's possible to leak the swap slot
and the page. The race could happen at any of the places after lock_page()
and shmem_confirm_swap() are done. And I think that it's not possible
to fix the problem on the shmem_swapin_page() side, since we can't know
when trylock_page() is called. Maybe the solution would be, instead of
using free_swap_and_cache() in shmem_undo_range(), which only calls
trylock_page(), to use another function that calls lock_page().

Thanks.

2020-04-23 05:31:19

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 10/18] mm: memcontrol: switch to native NR_ANON_MAPPED counter

On Wed, Apr 22, 2020 at 08:28:18AM -0400, Johannes Weiner wrote:
> Hello Joonsoo,
>
> On Wed, Apr 22, 2020 at 03:51:52PM +0900, Joonsoo Kim wrote:
> > On Mon, Apr 20, 2020 at 06:11:18PM -0400, Johannes Weiner wrote:
> > > @@ -3768,7 +3761,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
> > >
> > > static const unsigned int memcg1_stats[] = {
> > > NR_FILE_PAGES,
> > > - MEMCG_RSS,
> > > + NR_ANON_MAPPED,
> > > MEMCG_RSS_HUGE,
> > > NR_SHMEM,
> > > NR_FILE_MAPPED,
> > > @@ -5395,7 +5388,12 @@ static int mem_cgroup_move_account(struct page *page,
> > >
> > > lock_page_memcg(page);
> > >
> > > - if (!PageAnon(page)) {
> > > + if (PageAnon(page)) {
> > > + if (page_mapped(page)) {
> >
> > This page_mapped() check is newly inserted. Could you elaborate more
> > on why mem_cgroup_charge_statistics() doesn't need this check?
>
> MEMCG_RSS extended from when the page was charged until it was
> uncharged, but NR_ANON_MAPPED is only counted while the page is really
> mapped into page tables. That starts shortly after we charge and ends
> shortly before we uncharge, so pages could move between cgroups before
> or after they are mapped, while they aren't counted in NR_ANON_MAPPED.
>
> So to know that the page is counted, charge_statistics() only needed
> to know that the page is charged and Anon; move_account() also needs
> to know that the page is mapped.

Got it!
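
For reference, the move_account() side described above boils down to
something like the following (a condensed sketch, assuming from_vec and
to_vec are the source and destination lruvecs; not the literal hunk from
the series):

	if (PageAnon(page)) {
		if (page_mapped(page)) {
			/* the page is currently counted in NR_ANON_MAPPED,
			 * so move that stat along with the page */
			__mod_lruvec_state(from_vec, NR_ANON_MAPPED, -nr_pages);
			__mod_lruvec_state(to_vec, NR_ANON_MAPPED, nr_pages);
		}
	}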

>
> > > @@ -1181,7 +1187,7 @@ void page_add_new_anon_rmap(struct page *page,
> > > /* increment count (starts at -1) */
> > > atomic_set(&page->_mapcount, 0);
> > > }
> > > - __mod_node_page_state(page_pgdat(page), NR_ANON_MAPPED, nr);
> > > + __mod_lruvec_page_state(page, NR_ANON_MAPPED, nr);
> > > __page_set_anon_rmap(page, vma, address, 1);
> > > }
> >
> > memcg isn't setup yet and accounting isn't applied to proper memcg.
> > Maybe, it would be applied to root memcg. With this change, we don't
> > need the mapping to commit the charge so switching the order of
> > page_add_new_anon_rmap() and mem_cgroup_commit_charge() will solve the
> > issue.
>
> Good catch, it's that dreaded circular dependency. It's fixed two
> patches down when I charge anon pages earlier as well. But I'll change
> the rmap<->commit order in this patch to avoid the temporary bug.

Okay.
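
For illustration, the reordering means a fault path ends up looking
roughly like this (a sketch modeled on do_anonymous_page(); the exact
call sites in the series vary):

	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
	/* set page->mem_cgroup before the rmap call, so the
	 * NR_ANON_MAPPED update is accounted to the right lruvec */
	mem_cgroup_commit_charge(page, memcg, false);
	page_add_new_anon_rmap(page, vma, vmf->address, false);
	lru_cache_add_active_or_unevictable(page, vma);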

> Thanks for your thorough review!

Thanks.

2020-04-24 00:32:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 11/18] mm: memcontrol: switch to native NR_ANON_THPS counter

On Mon, Apr 20, 2020 at 06:11:19PM -0400, Johannes Weiner wrote:
> With rmap memcg locking already in place for NR_ANON_MAPPED, it's just
> a small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
> native NR_ANON_THPS accounting sites.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 00:32:57

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 13/18] mm: memcontrol: drop unused try/commit/cancel charge API

On Mon, Apr 20, 2020 at 06:11:21PM -0400, Johannes Weiner wrote:
> There are no more users. RIP in peace.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 00:33:55

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 12/18] mm: memcontrol: convert anon and file-thp to new mem_cgroup_charge() API

On Mon, Apr 20, 2020 at 06:11:20PM -0400, Johannes Weiner wrote:
> With the page->mapping requirement gone from memcg, we can charge anon
> and file-thp pages in one single step, right after they're allocated.
>
> This removes two out of three API calls - especially the tricky commit
> step that needed to happen at just the right time between when the
> page is "set up" and when it's "published" - somewhat vague and fluid
> concepts that varied by page type. All we need is a freshly allocated
> page and a memcg context to charge.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 00:34:23

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control

On Mon, Apr 20, 2020 at 06:11:23PM -0400, Johannes Weiner wrote:
> Without swap page tracking, users that are otherwise memory controlled
> can easily escape their containment and allocate significant amounts
> of memory that they're not being charged for. That's because swap does
> readahead, but without the cgroup records of who owned the page at
> swapout, readahead pages don't get charged until somebody actually
> faults them into their page table and we can identify an owner task.
> This can be maliciously exploited with MADV_WILLNEED, which triggers
> arbitrary readahead allocations without charging the pages.
>
> Make swap page tracking an integral part of memcg and remove the
> Kconfig options. In the first place, it was only made configurable to
> allow users to save some memory. But the overhead of tracking cgroup
> ownership per swap page is minimal - 2 bytes per page, or 512k per 1G
> of swap, or 0.04%. Saving that at the expense of broken containment
> semantics is not something we should present as a coequal option.
>
> The swapaccount=0 boot option will continue to exist, and it will
> eliminate the page_counter overhead and hide the swap control files,
> but it won't disable swap slot ownership tracking.
>
> This patch makes sure we always have the cgroup records at swapin
> time; the next patch will fix the actual bug by charging readahead
> swap pages at swapin time rather than at fault time.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 00:35:32

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 14/18] mm: memcontrol: prepare swap controller setup for integration

On Mon, Apr 20, 2020 at 06:11:22PM -0400, Johannes Weiner wrote:
> A few cleanups to streamline the swap controller setup:
>
> - Replace the do_swap_account flag with cgroup_memory_noswap. This
> brings it in line with other functionality that is usually available
> unless explicitly opted out of - nosocket, nokmem.
>
> - Remove the really_do_swap_account flag that stores the boot option
> and is later used to switch the do_swap_account. It's not clear why
> this indirection is/was necessary. Use do_swap_account directly.
>
> - Minor coding style polishing
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 00:47:31

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation

On Mon, Apr 20, 2020 at 06:11:24PM -0400, Johannes Weiner wrote:
> Right now, users that are otherwise memory controlled can easily
> escape their containment and allocate significant amounts of memory
> that they're not being charged for. That's because swap readahead
> pages are not being charged until somebody actually faults them into
> their page table. This can be exploited with MADV_WILLNEED, which
> triggers arbitrary readahead allocations without charging the pages.
>
> There are additional problems with the delayed charging of swap pages:
>
> 1. To implement refault/workingset detection for anonymous pages, we
> need to have a target LRU available at swapin time, but the LRU is
> not determinable until the page has been charged.
>
> 2. To implement per-cgroup LRU locking, we need page->mem_cgroup to be
> stable when the page is isolated from the LRU; otherwise, the locks
> change under us. But swapcache gets charged after it's already on
> the LRU, and even if we cannot isolate it ourselves (since charging
> is not exactly optional).
>
> The previous patch ensured we always maintain cgroup ownership records
> for swap pages. This patch moves the swapcache charging point from the
> fault handler to swapin time to fix all of the above problems.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memory.c | 15 ++++++---
> mm/shmem.c | 14 ++++----
> mm/swap_state.c | 89 ++++++++++++++++++++++++++-----------------------
> mm/swapfile.c | 6 ----
> 4 files changed, 67 insertions(+), 57 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 3fa379d9b17d..5d266532fc40 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3127,9 +3127,20 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
> vmf->address);
> if (page) {
> + int err;
> +
> __SetPageLocked(page);
> __SetPageSwapBacked(page);
> set_page_private(page, entry.val);
> +
> + /* Tell memcg to use swap ownership records */
> + SetPageSwapCache(page);
> + err = mem_cgroup_charge(page, vma->vm_mm,
> + GFP_KERNEL, false);
> + ClearPageSwapCache(page);
> + if (err)
> + goto out_page;
> +
> lru_cache_add_anon(page);
> swap_readpage(page, true);
> }
> @@ -3191,10 +3202,6 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> goto out_page;
> }
>
> - if (mem_cgroup_charge(page, vma->vm_mm, GFP_KERNEL, true)) {
> - ret = VM_FAULT_OOM;
> - goto out_page;
> - }
> cgroup_throttle_swaprate(page, GFP_KERNEL);
>
> /*
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 363bd11eba85..966f150a4823 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -623,13 +623,15 @@ static int shmem_add_to_page_cache(struct page *page,
> page->mapping = mapping;
> page->index = index;
>
> - error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
> - if (error) {
> - if (!PageSwapCache(page) && PageTransHuge(page)) {
> - count_vm_event(THP_FILE_FALLBACK);
> - count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + if (!PageSwapCache(page)) {
> + error = mem_cgroup_charge(page, charge_mm, gfp, false);
> + if (error) {
> + if (PageTransHuge(page)) {
> + count_vm_event(THP_FILE_FALLBACK);
> + count_vm_event(THP_FILE_FALLBACK_CHARGE);
> + }
> + goto error;
> }
> - goto error;
> }
> cgroup_throttle_swaprate(page, gfp);
>
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index ebed37bbf7a3..f3b9073bfff3 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -360,12 +360,13 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> struct vm_area_struct *vma, unsigned long addr,
> bool *new_page_allocated)
> {
> - struct page *found_page = NULL, *new_page = NULL;
> struct swap_info_struct *si;
> - int err;
> + struct page *page;
> +
> *new_page_allocated = false;
>
> - do {
> + for (;;) {
> + int err;
> /*
> * First check the swap cache. Since this is normally
> * called after lookup_swap_cache() failed, re-calling
> @@ -373,12 +374,12 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> si = get_swap_device(entry);
> if (!si)
> - break;
> - found_page = find_get_page(swap_address_space(entry),
> - swp_offset(entry));
> + return NULL;
> + page = find_get_page(swap_address_space(entry),
> + swp_offset(entry));
> put_swap_device(si);
> - if (found_page)
> - break;
> + if (page)
> + return page;
>
> /*
> * Just skip read ahead for unused swap slot.
> @@ -389,21 +390,15 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> * else swap_off will be aborted if we return NULL.
> */
> if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
> - break;
> -
> - /*
> - * Get a new page to read into from swap.
> - */
> - if (!new_page) {
> - new_page = alloc_page_vma(gfp_mask, vma, addr);
> - if (!new_page)
> - break; /* Out of memory */
> - }
> + return NULL;
>
> /*
> * Swap entry may have been freed since our caller observed it.
> */
> err = swapcache_prepare(entry);
> + if (!err)
> + break;
> +
> if (err == -EEXIST) {
> /*
> * We might race against get_swap_page() and stumble
> @@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> */
> cond_resched();
> continue;
> - } else if (err) /* swp entry is obsolete ? */
> - break;
> -
> - /* May fail (-ENOMEM) if XArray node allocation failed. */
> - __SetPageLocked(new_page);
> - __SetPageSwapBacked(new_page);
> - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> - if (likely(!err)) {
> - /* Initiate read into locked page */
> - SetPageWorkingset(new_page);
> - lru_cache_add_anon(new_page);
> - *new_page_allocated = true;
> - return new_page;
> }
> - __ClearPageLocked(new_page);
> - /*
> - * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> - * clear SWAP_HAS_CACHE flag.
> - */
> - put_swap_page(new_page, entry);
> - } while (err != -ENOMEM);
> + if (err) /* swp entry is obsolete ? */
> + return NULL;

"if (err)" is not needed since "!err" is already exiting the loop.

> + }
> +
> + /*
> + * The swap entry is ours to swap in. Prepare a new page.
> + */
> +
> + page = alloc_page_vma(gfp_mask, vma, addr);
> + if (!page)
> + goto fail_free;
> +
> + __SetPageLocked(page);
> + __SetPageSwapBacked(page);
> +
> + /* May fail (-ENOMEM) if XArray node allocation failed. */
> + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> + goto fail_unlock;
> +
> + if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
> + goto fail_delete;
> +

I think that the following order of operations is better than yours.

1. page alloc
2. memcg charge
3. swapcache_prepare
4. add_to_swap_cache

The reason is that page allocation and memcg charging could take a
long time due to reclaim, and other tasks waiting for this swapcache
page could be blocked in between swapcache_prepare() and add_to_swap_cache().

Thanks.

2020-04-24 00:50:50

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 17/18] mm: memcontrol: delete unused lrucare handling

On Mon, Apr 20, 2020 at 06:11:25PM -0400, Johannes Weiner wrote:
> Signed-off-by: Johannes Weiner <[email protected]>

Code looks fine to me. With proper commit message,

Reviewed-by: Joonsoo Kim <[email protected]>

Thanks.

2020-04-24 00:52:22

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 18/18] mm: memcontrol: update page->mem_cgroup stability rules

On Mon, Apr 20, 2020 at 06:11:26PM -0400, Johannes Weiner wrote:
> The previous patches have simplified the access rules around
> page->mem_cgroup somewhat:
>
> 1. We never change page->mem_cgroup while the page is isolated by
> somebody else. This was by far the biggest exception to our rules
> and it didn't stop at lock_page() or lock_page_memcg().
>
> 2. We charge pages before they get put into page tables now, so the
> somewhat fishy rule about "can be in page table as long as it's
> still locked" is now gone and boiled down to having an exclusive
> reference to the page.
>
> Document the new rules. Any of the following will stabilize the
> page->mem_cgroup association:
>
> - the page lock
> - LRU isolation
> - lock_page_memcg()
> - exclusive access to the page
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Joonsoo Kim <[email protected]>

2020-04-24 02:56:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation

On Fri, Apr 24, 2020 at 09:44:42AM +0900, Joonsoo Kim wrote:
> On Mon, Apr 20, 2020 at 06:11:24PM -0400, Johannes Weiner wrote:
> > @@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > */
> > cond_resched();
> > continue;
> > - } else if (err) /* swp entry is obsolete ? */
> > - break;
> > -
> > - /* May fail (-ENOMEM) if XArray node allocation failed. */
> > - __SetPageLocked(new_page);
> > - __SetPageSwapBacked(new_page);
> > - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > - if (likely(!err)) {
> > - /* Initiate read into locked page */
> > - SetPageWorkingset(new_page);
> > - lru_cache_add_anon(new_page);
> > - *new_page_allocated = true;
> > - return new_page;
> > }
> > - __ClearPageLocked(new_page);
> > - /*
> > - * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> > - * clear SWAP_HAS_CACHE flag.
> > - */
> > - put_swap_page(new_page, entry);
> > - } while (err != -ENOMEM);
> > + if (err) /* swp entry is obsolete ? */
> > + return NULL;
>
> "if (err)" is not needed since "!err" is already exiting the loop.

But we don't want to leave the loop, we want to leave the
function. For example, if swapcache_prepare() says the entry is gone
(-ENOENT), we don't want to exit the loop and allocate a page for it.

> > +
> > + /*
> > + * The swap entry is ours to swap in. Prepare a new page.
> > + */
> > +
> > + page = alloc_page_vma(gfp_mask, vma, addr);
> > + if (!page)
> > + goto fail_free;
> > +
> > + __SetPageLocked(page);
> > + __SetPageSwapBacked(page);
> > +
> > + /* May fail (-ENOMEM) if XArray node allocation failed. */
> > + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> > + goto fail_unlock;
> > +
> > + if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
> > + goto fail_delete;
> > +
>
> I think that following order of operations is better than yours.
>
> 1. page alloc
> 2. memcg charge
> 3. swapcache_prepare
> 4. add_to_swap_cache
>
> Reason is that page allocation and memcg charging could take for a
> long time due to reclaim and other tasks waiting this swapcache page
> could be blocked inbetween swapcache_prepare() and add_to_swap_cache().

I see how that would be preferable, but memcg charging actually needs
the swap(cache) information to figure out the cgroup that owned it at
swapout, then uncharge the swapcache and drop the swap cgroup record.

Maybe it could be done, but I'm not sure that level of surgery would
be worth the benefits? Whoever else would be trying to swap the page
in at the same time is likely in the same memory situation, and would
not necessarily be able to allocate pages any faster.
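
For context, a rough sketch of the dependency described here (an
assumption pieced together from this description, not the exact code in
the series; lookup_swap_cgroup_id(), mem_cgroup_from_id() and
get_mem_cgroup_from_mm() are existing helpers):

	struct mem_cgroup *memcg = NULL;

	if (PageSwapCache(page)) {
		/* the swap slot's cgroup record tells us who owned
		 * the page at swapout time */
		swp_entry_t ent = { .val = page_private(page) };
		unsigned short id = lookup_swap_cgroup_id(ent);

		rcu_read_lock();
		memcg = mem_cgroup_from_id(id);
		if (memcg && !css_tryget_online(&memcg->css))
			memcg = NULL;
		rcu_read_unlock();
	}
	if (!memcg)
		memcg = get_mem_cgroup_from_mm(mm);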

2020-04-24 03:03:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 15/18] mm: memcontrol: make swap tracking an integral part of memory control

On Mon, Apr 20, 2020 at 06:11:23PM -0400, Johannes Weiner wrote:
> @@ -6884,9 +6876,6 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> VM_BUG_ON_PAGE(PageLRU(page), page);
> VM_BUG_ON_PAGE(page_count(page), page);
>
> - if (!do_memsw_account())
> - return;
> -
> memcg = page->mem_cgroup;
>
> /* Readahead page, never charged */

I messed up here.

mem_cgroup_swapout() must not run on cgroup2, because cgroup2 uses
mem_cgroup_try_charge_swap() instead. Both record a swap entry and
running them both will trigger a VM_BUG_ON() on an existing record.

I'm actually somewhat baffled why this didn't trigger in my
MADV_PAGEOUT -> MADV_WILLNEED swap test. memory.max driven swapout
triggered it right away.

!do_memsw_account() needs to be !cgroup_subsys_on_dfl(memory_cgrp_subsys)

> @@ -6913,7 +6902,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
> if (!mem_cgroup_is_root(memcg))
> page_counter_uncharge(&memcg->memory, nr_entries);
>
> - if (memcg != swap_memcg) {
> + if (do_memsw_account() && memcg != swap_memcg) {
> if (!mem_cgroup_is_root(swap_memcg))
> page_counter_charge(&swap_memcg->memsw, nr_entries);
> page_counter_uncharge(&memcg->memsw, nr_entries);

And this can be !cgroup_memory_noswap instead. It'll do the same
thing, but will be clearer.
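
i.e. the hunk quoted above would become (sketch):

	if (!cgroup_memory_noswap && memcg != swap_memcg) {
		if (!mem_cgroup_is_root(swap_memcg))
			page_counter_charge(&swap_memcg->memsw, nr_entries);
		page_counter_uncharge(&memcg->memsw, nr_entries);
	}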

I'll have it fixed in version 2.

2020-04-28 06:51:37

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 16/18] mm: memcontrol: charge swapin pages on instantiation

On Fri, Apr 24, 2020 at 11:51 AM, Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 24, 2020 at 09:44:42AM +0900, Joonsoo Kim wrote:
> > On Mon, Apr 20, 2020 at 06:11:24PM -0400, Johannes Weiner wrote:
> > > @@ -412,31 +407,43 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
> > > */
> > > cond_resched();
> > > continue;
> > > - } else if (err) /* swp entry is obsolete ? */
> > > - break;
> > > -
> > > - /* May fail (-ENOMEM) if XArray node allocation failed. */
> > > - __SetPageLocked(new_page);
> > > - __SetPageSwapBacked(new_page);
> > > - err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
> > > - if (likely(!err)) {
> > > - /* Initiate read into locked page */
> > > - SetPageWorkingset(new_page);
> > > - lru_cache_add_anon(new_page);
> > > - *new_page_allocated = true;
> > > - return new_page;
> > > }
> > > - __ClearPageLocked(new_page);
> > > - /*
> > > - * add_to_swap_cache() doesn't return -EEXIST, so we can safely
> > > - * clear SWAP_HAS_CACHE flag.
> > > - */
> > > - put_swap_page(new_page, entry);
> > > - } while (err != -ENOMEM);
> > > + if (err) /* swp entry is obsolete ? */
> > > + return NULL;
> >
> > "if (err)" is not needed since "!err" is already exiting the loop.
>
> But we don't want to leave the loop, we want to leave the
> function. For example, if swapcache_prepare() says the entry is gone
> (-ENOENT), we don't want to exit the loop and allocate a page for it.

Yes, so I said "if (err)" is not needed.
Just "return NULL;" would be enough.

> > > +
> > > + /*
> > > + * The swap entry is ours to swap in. Prepare a new page.
> > > + */
> > > +
> > > + page = alloc_page_vma(gfp_mask, vma, addr);
> > > + if (!page)
> > > + goto fail_free;
> > > +
> > > + __SetPageLocked(page);
> > > + __SetPageSwapBacked(page);
> > > +
> > > + /* May fail (-ENOMEM) if XArray node allocation failed. */
> > > + if (add_to_swap_cache(page, entry, gfp_mask & GFP_KERNEL))
> > > + goto fail_unlock;
> > > +
> > > + if (mem_cgroup_charge(page, NULL, gfp_mask & GFP_KERNEL, false))
> > > + goto fail_delete;
> > > +
> >
> > I think that following order of operations is better than yours.
> >
> > 1. page alloc
> > 2. memcg charge
> > 3. swapcache_prepare
> > 4. add_to_swap_cache
> >
> > Reason is that page allocation and memcg charging could take for a
> > long time due to reclaim and other tasks waiting this swapcache page
> > could be blocked inbetween swapcache_prepare() and add_to_swap_cache().
>
> I see how that would be preferable, but memcg charging actually needs
> the swap(cache) information to figure out the cgroup that owned it at
> swapout, then uncharge the swapcache and drop the swap cgroup record.
>
> Maybe it could be done, but I'm not sure that level of surgery would
> be worth the benefits? Whoever else would be trying to swap the page
> in at the same time is likely in the same memory situation, and would
> not necessarily be able to allocate pages any faster.

Hmm, at least some modification is needed, since the waiting task would
busy-wait in the loop and waste system CPU time.

I still think that changing the order of operations is better, since it's
possible that a later task allocates the page faster, though that's not
the usual case. However, I also agree with your reasoning, so I won't
insist further.

Thanks.

2020-05-08 16:03:59

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Thu, Apr 23, 2020 at 02:25:06PM +0900, Joonsoo Kim wrote:
> On Wed, Apr 22, 2020 at 08:09:46AM -0400, Johannes Weiner wrote:
> > On Wed, Apr 22, 2020 at 03:40:41PM +0900, Joonsoo Kim wrote:
> > > On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> > > > @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > > > goto failed;
> > > > }
> > > >
> > > > - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> > > > - if (!error) {
> > > > - error = shmem_add_to_page_cache(page, mapping, index,
> > > > - swp_to_radix_entry(swap), gfp);
> > > > - /*
> > > > - * We already confirmed swap under page lock, and make
> > > > - * no memory allocation here, so usually no possibility
> > > > - * of error; but free_swap_and_cache() only trylocks a
> > > > - * page, so it is just possible that the entry has been
> > > > - * truncated or holepunched since swap was confirmed.
> > > > - * shmem_undo_range() will have done some of the
> > > > - * unaccounting, now delete_from_swap_cache() will do
> > > > - * the rest.
> > > > - */
> > > > - if (error) {
> > > > - mem_cgroup_cancel_charge(page, memcg);
> > > > - delete_from_swap_cache(page);
> > > > - }
> > > > - }
> > > > - if (error)
> > > > + error = shmem_add_to_page_cache(page, mapping, index,
> > > > + swp_to_radix_entry(swap), gfp,
> > > > + charge_mm);
> > > > + /*
> > > > + * We already confirmed swap under page lock, and make no
> > > > + * memory allocation here, so usually no possibility of error;
> > > > + * but free_swap_and_cache() only trylocks a page, so it is
> > > > + * just possible that the entry has been truncated or
> > > > + * holepunched since swap was confirmed. shmem_undo_range()
> > > > + * will have done some of the unaccounting, now
> > > > + * delete_from_swap_cache() will do the rest.
> > > > + */
> > > > + if (error) {
> > > > + delete_from_swap_cache(page);
> > > > goto failed;
> > >
> > > -EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
> > > differently. delete_from_swap_cache() is for -EEXIST case.
> >
> > Good catch, I accidentally changed things here.
> >
> > I was just going to change it back, but now I'm trying to understand
> > how it actually works.
> >
> > Who is removing the page from swap cache if shmem_undo_range() races
> > but we fail to charge the page?
> >
> > Here is how this race is supposed to be handled: The page is in the
> > swapcache, we have it locked and confirmed that the entry in i_pages
> > is indeed a swap entry. We charge the page, then we try to replace the
> > swap entry in i_pages with the actual page. If we determine, under
> > tree lock now, that shmem_undo_range has raced with us, unaccounted
> > the swap space, but must have failed to get the page lock, we remove
> > the page from swap cache on our side, to free up swap slot and page.
> >
> > But what if shmem_undo_range() raced with us, deleted the swap entry
> > from i_pages while we had the page locked, but then we simply failed
> > to charge? We unlock the page and return -EEXIST (shmem_confirm_swap
> > at the exit). The page with its userdata is now in swapcache, but no
> > corresponding swap entry in i_pages. shmem_getpage_gfp() sees the
> > -EEXIST, retries, finds nothing in i_pages and allocates a new, empty
> > page.
> >
> > Aren't we leaking the swap slot and the page?
>
> Yes, you're right! It seems that it's possible to leak the swap slot
> and the page. Race could happen for all the places after lock_page()
> and shmem_confirm_swap() are done. And, I think that it's not possible
> to fix the problem in shmem_swapin_page() side since we can't know the
> timing that trylock_page() is called. Maybe, solution would be,
> instead of using free_swap_and_cache() in shmem_undo_range() that
> calls trylock_page(), to use another function that calls lock_page().

I looked at this some more, as well as compared it to non-shmem
swapping. My conclusion is - and Hugh may correct me on this - that
the deletion looks mandatory but is actually an optimization. Page
reclaim will ultimately pick these pages up.

When non-shmem pages are swapped in by readahead (locked until IO
completes) and their page tables are simultaneously unmapped, the
zap_pte_range() code calls free_swap_and_cache() and the locked pages
are stranded in the swap cache with no page table references. We rely
on page reclaim to pick them up later on.

The same appears to be true for shmem. If the references to the swap
page are zapped while we're trying to swap in, we can strand the page
in the swap cache. But it's not up to swapin to detect this reliably,
it just frees the page more quickly than having to wait for reclaim.

That being said, my patch introduces potentially undesirable behavior
(although AFAICS no correctness problem): We should only delete the
page from swapcache when we actually raced with undo_range - which we
see from the swap entry having been purged from the page cache
tree. If we delete the page from swapcache just because we failed to
charge it, the next fault has to read the still-valid page again from
the swap device.

I'm going to include this:

diff --git a/mm/shmem.c b/mm/shmem.c
index e80167927dce..236642775f89 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
xas_lock_irq(&xas);
entry = xas_find_conflict(&xas);
if (entry != expected)
- xas_set_err(&xas, -EEXIST);
+ xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
xas_create_range(&xas);
if (xas_error(&xas))
goto unlock;
@@ -1683,17 +1683,18 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
error = shmem_add_to_page_cache(page, mapping, index,
swp_to_radix_entry(swap), gfp,
charge_mm);
- /*
- * We already confirmed swap under page lock, and make no
- * memory allocation here, so usually no possibility of error;
- * but free_swap_and_cache() only trylocks a page, so it is
- * just possible that the entry has been truncated or
- * holepunched since swap was confirmed. shmem_undo_range()
- * will have done some of the unaccounting, now
- * delete_from_swap_cache() will do the rest.
- */
if (error) {
- delete_from_swap_cache(page);
+ /*
+ * We already confirmed swap under page lock, but
+ * free_swap_and_cache() only trylocks a page, so it
+ * is just possible that the entry has been truncated
+ * or holepunched since swap was confirmed.
+ * shmem_undo_range() will have done some of the
+ * unaccounting, now delete_from_swap_cache() will do
+ * the rest.
+ */
+ if (error == -ENOENT)
+ delete_from_swap_cache(page);
goto failed;
}

2020-05-11 02:00:52

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Fri, May 08, 2020 at 12:01:22PM -0400, Johannes Weiner wrote:
> On Thu, Apr 23, 2020 at 02:25:06PM +0900, Joonsoo Kim wrote:
> > On Wed, Apr 22, 2020 at 08:09:46AM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 03:40:41PM +0900, Joonsoo Kim wrote:
> > > > On Mon, Apr 20, 2020 at 06:11:13PM -0400, Johannes Weiner wrote:
> > > > > @@ -1664,29 +1678,22 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> > > > > goto failed;
> > > > > }
> > > > >
> > > > > - error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> > > > > - if (!error) {
> > > > > - error = shmem_add_to_page_cache(page, mapping, index,
> > > > > - swp_to_radix_entry(swap), gfp);
> > > > > - /*
> > > > > - * We already confirmed swap under page lock, and make
> > > > > - * no memory allocation here, so usually no possibility
> > > > > - * of error; but free_swap_and_cache() only trylocks a
> > > > > - * page, so it is just possible that the entry has been
> > > > > - * truncated or holepunched since swap was confirmed.
> > > > > - * shmem_undo_range() will have done some of the
> > > > > - * unaccounting, now delete_from_swap_cache() will do
> > > > > - * the rest.
> > > > > - */
> > > > > - if (error) {
> > > > > - mem_cgroup_cancel_charge(page, memcg);
> > > > > - delete_from_swap_cache(page);
> > > > > - }
> > > > > - }
> > > > > - if (error)
> > > > > + error = shmem_add_to_page_cache(page, mapping, index,
> > > > > + swp_to_radix_entry(swap), gfp,
> > > > > + charge_mm);
> > > > > + /*
> > > > > + * We already confirmed swap under page lock, and make no
> > > > > + * memory allocation here, so usually no possibility of error;
> > > > > + * but free_swap_and_cache() only trylocks a page, so it is
> > > > > + * just possible that the entry has been truncated or
> > > > > + * holepunched since swap was confirmed. shmem_undo_range()
> > > > > + * will have done some of the unaccounting, now
> > > > > + * delete_from_swap_cache() will do the rest.
> > > > > + */
> > > > > + if (error) {
> > > > > + delete_from_swap_cache(page);
> > > > > goto failed;
> > > >
> > > > -EEXIST (from swap cache) and -ENOMEM (from memcg) should be handled
> > > > differently. delete_from_swap_cache() is for -EEXIST case.
> > >
> > > Good catch, I accidentally changed things here.
> > >
> > > I was just going to change it back, but now I'm trying to understand
> > > how it actually works.
> > >
> > > Who is removing the page from swap cache if shmem_undo_range() races
> > > but we fail to charge the page?
> > >
> > > Here is how this race is supposed to be handled: The page is in the
> > > swapcache, we have it locked and confirmed that the entry in i_pages
> > > is indeed a swap entry. We charge the page, then we try to replace the
> > > swap entry in i_pages with the actual page. If we determine, under
> > > tree lock now, that shmem_undo_range has raced with us, unaccounted
> > > the swap space, but must have failed to get the page lock, we remove
> > > the page from swap cache on our side, to free up swap slot and page.
> > >
> > > But what if shmem_undo_range() raced with us, deleted the swap entry
> > > from i_pages while we had the page locked, but then we simply failed
> > > to charge? We unlock the page and return -EEXIST (shmem_confirm_swap
> > > at the exit). The page with its userdata is now in swapcache, but no
> > > corresponding swap entry in i_pages. shmem_getpage_gfp() sees the
> > > -EEXIST, retries, finds nothing in i_pages and allocates a new, empty
> > > page.
> > >
> > > Aren't we leaking the swap slot and the page?
> >
> > Yes, you're right! It seems that it's possible to leak the swap slot
> > and the page. Race could happen for all the places after lock_page()
> > and shmem_confirm_swap() are done. And, I think that it's not possible
> > to fix the problem in shmem_swapin_page() side since we can't know the
> > timing that trylock_page() is called. Maybe, solution would be,
> > instead of using free_swap_and_cache() in shmem_undo_range() that
> > calls trylock_page(), to use another function that calls lock_page().
>
> I looked at this some more, as well as compared it to non-shmem
> swapping. My conclusion is - and Hugh may correct me on this - that
> the deletion looks mandatory but is actually an optimization. Page
> reclaim will ultimately pick these pages up.
>
> When non-shmem pages are swapped in by readahead (locked until IO
> completes) and their page tables are simultaneously unmapped, the
> zap_pte_range() code calls free_swap_and_cache() and the locked pages
> are stranded in the swap cache with no page table references. We rely
> on page reclaim to pick them up later on.
>
> The same appears to be true for shmem. If the references to the swap
> page are zapped while we're trying to swap in, we can strand the page
> in the swap cache. But it's not up to swapin to detect this reliably,
> it just frees the page more quickly than having to wait for reclaim.
>
> That being said, my patch introduces potentially undesirable behavior
> (although AFAICS no correctness problem): We should only delete the
> page from swapcache when we actually raced with undo_range - which we
> see from the swap entry having been purged from the page cache
> tree. If we delete the page from swapcache just because we failed to
> charge it, the next fault has to read the still-valid page again from
> the swap device.

I got it! Thanks for the explanation.

Thanks.

2020-05-11 07:42:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Fri, 8 May 2020, Johannes Weiner wrote:
>
> I looked at this some more, as well as compared it to non-shmem
> swapping. My conclusion is - and Hugh may correct me on this - that
> the deletion looks mandatory but is actually an optimization. Page
> reclaim will ultimately pick these pages up.
>
> When non-shmem pages are swapped in by readahead (locked until IO
> completes) and their page tables are simultaneously unmapped, the
> zap_pte_range() code calls free_swap_and_cache() and the locked pages
> are stranded in the swap cache with no page table references. We rely
> on page reclaim to pick them up later on.
>
> The same appears to be true for shmem. If the references to the swap
> page are zapped while we're trying to swap in, we can strand the page
> in the swap cache. But it's not up to swapin to detect this reliably,
> it just frees the page more quickly than having to wait for reclaim.

I think you've got all that exactly right, thanks for working it out.
It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
VM_BUG_ON") - in which I also had to thank you.

I think I chose to do the delete_from_swap_cache() right there, partly
because of following shmem_unuse_inode() code which already did that,
partly on the basis that while we have to observe the case then it's
better to clean it up, and partly out of guilt that our page lock here
is what had prevented shmem_undo_range() from completing its job; but
I believe you're right that unused swapcache reclaim would sort it out
eventually.

>
> That being said, my patch introduces potentially undesirable behavior
> (although AFAICS no correctness problem): We should only delete the
> page from swapcache when we actually raced with undo_range - which we
> see from the swap entry having been purged from the page cache
> tree. If we delete the page from swapcache just because we failed to
> charge it, the next fault has to read the still-valid page again from
> the swap device.

Yes.

>
> I'm going to include this:

I haven't pulled down your V2 series into a tree yet (expecting perhaps
a respin from Alex on top, when I hope to switch over to trying them
both), so haven't looked into the context and may be wrong...

>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index e80167927dce..236642775f89 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> xas_lock_irq(&xas);
> entry = xas_find_conflict(&xas);
> if (entry != expected)
> - xas_set_err(&xas, -EEXIST);
> + xas_set_err(&xas, expected ? -ENOENT : -EEXIST);

Two things on this.

Minor matter of taste, I'd prefer that as
xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
which would be more general and more understandable -
but what you have written should be fine for the actual callers.

Except... I think returning -ENOENT there will not work correctly,
in the case of a punched hole. Because (unless you've reworked it
and I just haven't looked) shmem_getpage_gfp() knows to retry in
the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
and result in a SIGBUS, or a read/write error, when the hole should
just get refilled instead.

Not something that needs fixing in a hurry (it took trinity to
generate this racy case in the first place), I'll take another look
once I've pulled it into a tree (or collected next mmotm) - unless
you've already changed it around by then.

Hugh

> xas_create_range(&xas);
> if (xas_error(&xas))
> goto unlock;
> @@ -1683,17 +1683,18 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> error = shmem_add_to_page_cache(page, mapping, index,
> swp_to_radix_entry(swap), gfp,
> charge_mm);
> - /*
> - * We already confirmed swap under page lock, and make no
> - * memory allocation here, so usually no possibility of error;
> - * but free_swap_and_cache() only trylocks a page, so it is
> - * just possible that the entry has been truncated or
> - * holepunched since swap was confirmed. shmem_undo_range()
> - * will have done some of the unaccounting, now
> - * delete_from_swap_cache() will do the rest.
> - */
> if (error) {
> - delete_from_swap_cache(page);
> + /*
> + * We already confirmed swap under page lock, but
> + * free_swap_and_cache() only trylocks a page, so it
> + * is just possible that the entry has been truncated
> + * or holepunched since swap was confirmed.
> + * shmem_undo_range() will have done some of the
> + * unaccounting, now delete_from_swap_cache() will do
> + * the rest.
> + */
> + if (error == -ENOENT)
> + delete_from_swap_cache(page);
> goto failed;
> }
>
>

2020-05-11 15:09:41

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote:
> On Fri, 8 May 2020, Johannes Weiner wrote:
> >
> > I looked at this some more, as well as compared it to non-shmem
> > swapping. My conclusion is - and Hugh may correct me on this - that
> > the deletion looks mandatory but is actually an optimization. Page
> > reclaim will ultimately pick these pages up.
> >
> > When non-shmem pages are swapped in by readahead (locked until IO
> > completes) and their page tables are simultaneously unmapped, the
> > zap_pte_range() code calls free_swap_and_cache() and the locked pages
> > are stranded in the swap cache with no page table references. We rely
> > on page reclaim to pick them up later on.
> >
> > The same appears to be true for shmem. If the references to the swap
> > page are zapped while we're trying to swap in, we can strand the page
> > in the swap cache. But it's not up to swapin to detect this reliably,
> > it just frees the page more quickly than having to wait for reclaim.
>
> I think you've got all that exactly right, thanks for working it out.
> It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
> VM_BUG_ON") - in which I also had to thank you.

I should have looked where it actually came from - I had forgotten
about that patch!

> I think I chose to do the delete_from_swap_cache() right there, partly
> because of following shmem_unuse_inode() code which already did that,
> partly on the basis that while we have to observe the case then it's
> better to clean it up, and partly out of guilt that our page lock here
> is what had prevented shmem_undo_range() from completing its job; but
> I believe you're right that unused swapcache reclaim would sort it out
> eventually.

That makes sense to me.

> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index e80167927dce..236642775f89 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> > xas_lock_irq(&xas);
> > entry = xas_find_conflict(&xas);
> > if (entry != expected)
> > - xas_set_err(&xas, -EEXIST);
> > + xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
>
> Two things on this.
>
> Minor matter of taste, I'd prefer that as
> xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> which would be more general and more understandable -
> but what you have written should be fine for the actual callers.

Yes, checking `expected' was to differentiate the behavior depending
on the callsite. But testing `entry' is more obvious in that location.

> Except... I think returning -ENOENT there will not work correctly,
> in the case of a punched hole. Because (unless you've reworked it
> and I just haven't looked) shmem_getpage_gfp() knows to retry in
> the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
> and result in a SIGBUS, or a read/write error, when the hole should
> just get refilled instead.

Good catch, I had indeed missed that. I'm going to make it retry on
-ENOENT as well.

We could have it go directly to allocating a new page, but it seems
unnecessarily complicated: we've already been retrying in this
situation until now, so I would stick to "there was a race, retry."

> Not something that needs fixing in a hurry (it took trinity to
> generate this racy case in the first place), I'll take another look
> once I've pulled it into a tree (or collected next mmotm) - unless
> you've already have changed it around by then.

Attaching a delta fix based on your observations.

Andrew, barring any objections to this, could you please fold it into
the version you have in your tree already?

---

From 33d03ceebce0a6261d472ddc9c5a07940f44714c Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Mon, 11 May 2020 10:45:14 -0400
Subject: [PATCH] mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API fix

Incorporate Hugh's feedback:

- shmem_getpage_gfp() needs to handle the new -ENOENT that was
previously implied in the -EEXIST when a swap entry changed under us
in any way. Otherwise hole punching could cause a racing fault to
SIGBUS instead of allocating a new page.

- It is indeed page reclaim that picks up any swapcache we leave
stranded when free_swap_and_cache() runs on a page locked by
somebody else. Document that our delete_from_swap_cache() is an
optimization, not something we rely on for correctness.

- Style cleanup: testing `expected' to decide on -EEXIST vs -ENOENT
differentiates the callsites, but is a bit awkward to read. Test
`entry' instead.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/shmem.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index afd5a057ebb7..00fb001e8f3e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -638,7 +638,7 @@ static int shmem_add_to_page_cache(struct page *page,
xas_lock_irq(&xas);
entry = xas_find_conflict(&xas);
if (entry != expected)
- xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
+ xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
xas_create_range(&xas);
if (xas_error(&xas))
goto unlock;
@@ -1686,10 +1686,13 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
* We already confirmed swap under page lock, but
* free_swap_and_cache() only trylocks a page, so it
* is just possible that the entry has been truncated
- * or holepunched since swap was confirmed.
- * shmem_undo_range() will have done some of the
- * unaccounting, now delete_from_swap_cache() will do
- * the rest.
+ * or holepunched since swap was confirmed. This could
+ * occur at any time while the page is locked, and
+ * usually page reclaim will take care of the stranded
+ * swapcache page. But when we catch it, we may as
+ * well clean up after ourselves: shmem_undo_range()
+ * will have done some of the unaccounting, now
+ * delete_from_swap_cache() will do the rest.
*/
if (error == -ENOENT)
delete_from_swap_cache(page);
@@ -1765,7 +1768,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (xa_is_value(page)) {
error = shmem_swapin_page(inode, index, &page,
sgp, gfp, vma, fault_type);
- if (error == -EEXIST)
+ if (error == -EEXIST || error == -ENOENT)
goto repeat;

*pagep = page;
--
2.26.2

2020-05-11 16:34:33

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, 11 May 2020, Johannes Weiner wrote:
> On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote:
> > On Fri, 8 May 2020, Johannes Weiner wrote:
> > >
> > > I looked at this some more, as well as compared it to non-shmem
> > > swapping. My conclusion is - and Hugh may correct me on this - that
> > > the deletion looks mandatory but is actually an optimization. Page
> > > reclaim will ultimately pick these pages up.
> > >
> > > When non-shmem pages are swapped in by readahead (locked until IO
> > > completes) and their page tables are simultaneously unmapped, the
> > > zap_pte_range() code calls free_swap_and_cache() and the locked pages
> > > are stranded in the swap cache with no page table references. We rely
> > > on page reclaim to pick them up later on.
> > >
> > > The same appears to be true for shmem. If the references to the swap
> > > page are zapped while we're trying to swap in, we can strand the page
> > > in the swap cache. But it's not up to swapin to detect this reliably,
> > > it just frees the page more quickly than having to wait for reclaim.
> >
> > I think you've got all that exactly right, thanks for working it out.
> > It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
> > VM_BUG_ON") - in which I also had to thank you.
>
> I should have looked where it actually came from - I had forgotten
> about that patch!
>
> > I think I chose to do the delete_from_swap_cache() right there, partly
> > because of following shmem_unuse_inode() code which already did that,
> > partly on the basis that while we have to observe the case then it's
> > better to clean it up, and partly out of guilt that our page lock here
> > is what had prevented shmem_undo_range() from completing its job; but
> > I believe you're right that unused swapcache reclaim would sort it out
> > eventually.
>
> That makes sense to me.
>
> > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > index e80167927dce..236642775f89 100644
> > > --- a/mm/shmem.c
> > > +++ b/mm/shmem.c
> > > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> > > xas_lock_irq(&xas);
> > > entry = xas_find_conflict(&xas);
> > > if (entry != expected)
> > > - xas_set_err(&xas, -EEXIST);
> > > + xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
> >
> > Two things on this.
> >
> > Minor matter of taste, I'd prefer that as
> > xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> > which would be more general and more understandable -
> > but what you have written should be fine for the actual callers.
>
> Yes, checking `expected' was to differentiate the behavior depending
> on the callsite. But testing `entry' is more obvious in that location.
>
> > Except... I think returning -ENOENT there will not work correctly,
> > in the case of a punched hole. Because (unless you've reworked it
> > and I just haven't looked) shmem_getpage_gfp() knows to retry in
> > the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
> > and result in a SIGBUS, or a read/write error, when the hole should
> > just get refilled instead.
>
> Good catch, I had indeed missed that. I'm going to make it retry on
> -ENOENT as well.
>
> We could have it go directly to allocating a new page, but it seems
> unnecessarily complicated: we've already been retrying in this
> situation until now, so I would stick to "there was a race, retry."
>
> > Not something that needs fixing in a hurry (it took trinity to
> > generate this racy case in the first place), I'll take another look
> > once I've pulled it into a tree (or collected next mmotm) - unless
> > you've already have changed it around by then.
>
> Attaching a delta fix based on your observations.
>
> Andrew, barring any objections to this, could you please fold it into
> the version you have in your tree already?

Not so strong as an objection, and I won't get to see whether your
retry on -ENOENT is good (can -ENOENT arrive at that point from any
other case, that might endlessly retry?) until I've got the full
context; but I had arrived at the opposite conclusion overnight.

Given that this case only appeared with a fuzzer, and stale swapcache
reclaim is anyway relied upon to clean up after plenty of other such
races, I think we should agree that I over-complicated the VM_BUG_ON
removal originally, and it's best to kill that delete_from_swap_cache(),
and the comment having to explain it, and your EEXIST/ENOENT distinction.

(I haven't checked, but I suspect that the shmem_unuse_inode() case
that I copied from, actually really needed to delete_from_swap_cache(),
in order to swapoff the page without full retry of the big swapoff loop.)

Hugh

>
> ---
>
> From 33d03ceebce0a6261d472ddc9c5a07940f44714c Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Mon, 11 May 2020 10:45:14 -0400
> Subject: [PATCH] mm: memcontrol: convert page cache to a new
> mem_cgroup_charge() API fix
>
> Incorporate Hugh's feedback:
>
> - shmem_getpage_gfp() needs to handle the new -ENOENT that was
> previously implied in the -EEXIST when a swap entry changed under us
> in any way. Otherwise hole punching could cause a racing fault to
> SIGBUS instead of allocating a new page.
>
> - It is indeed page reclaim that picks up any swapcache we leave
> stranded when free_swap_and_cache() runs on a page locked by
> somebody else. Document that our delete_from_swap_cache() is an
> optimization, not something we rely on for correctness.
>
> - Style cleanup: testing `expected' to decide on -EEXIST vs -ENOENT
> differentiates the callsites, but is a bit awkward to read. Test
> `entry' instead.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/shmem.c | 15 +++++++++------
> 1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index afd5a057ebb7..00fb001e8f3e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -638,7 +638,7 @@ static int shmem_add_to_page_cache(struct page *page,
> xas_lock_irq(&xas);
> entry = xas_find_conflict(&xas);
> if (entry != expected)
> - xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
> + xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> xas_create_range(&xas);
> if (xas_error(&xas))
> goto unlock;
> @@ -1686,10 +1686,13 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> * We already confirmed swap under page lock, but
> * free_swap_and_cache() only trylocks a page, so it
> * is just possible that the entry has been truncated
> - * or holepunched since swap was confirmed.
> - * shmem_undo_range() will have done some of the
> - * unaccounting, now delete_from_swap_cache() will do
> - * the rest.
> + * or holepunched since swap was confirmed. This could
> + * occur at any time while the page is locked, and
> + * usually page reclaim will take care of the stranded
> + * swapcache page. But when we catch it, we may as
> + * well clean up after ourselves: shmem_undo_range()
> + * will have done some of the unaccounting, now
> + * delete_from_swap_cache() will do the rest.
> */
> if (error == -ENOENT)
> delete_from_swap_cache(page);
> @@ -1765,7 +1768,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
> if (xa_is_value(page)) {
> error = shmem_swapin_page(inode, index, &page,
> sgp, gfp, vma, fault_type);
> - if (error == -EEXIST)
> + if (error == -EEXIST || error == -ENOENT)
> goto repeat;
>
> *pagep = page;
> --
> 2.26.2
>

2020-05-11 18:14:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, May 11, 2020 at 09:32:16AM -0700, Hugh Dickins wrote:
> On Mon, 11 May 2020, Johannes Weiner wrote:
> > On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote:
> > > On Fri, 8 May 2020, Johannes Weiner wrote:
> > > >
> > > > I looked at this some more, as well as compared it to non-shmem
> > > > swapping. My conclusion is - and Hugh may correct me on this - that
> > > > the deletion looks mandatory but is actually an optimization. Page
> > > > reclaim will ultimately pick these pages up.
> > > >
> > > > When non-shmem pages are swapped in by readahead (locked until IO
> > > > completes) and their page tables are simultaneously unmapped, the
> > > > zap_pte_range() code calls free_swap_and_cache() and the locked pages
> > > > are stranded in the swap cache with no page table references. We rely
> > > > on page reclaim to pick them up later on.
> > > >
> > > > The same appears to be true for shmem. If the references to the swap
> > > > page are zapped while we're trying to swap in, we can strand the page
> > > > in the swap cache. But it's not up to swapin to detect this reliably,
> > > > it just frees the page more quickly than having to wait for reclaim.
> > >
> > > I think you've got all that exactly right, thanks for working it out.
> > > It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
> > > VM_BUG_ON") - in which I also had to thank you.
> >
> > I should have looked where it actually came from - I had forgotten
> > about that patch!
> >
> > > I think I chose to do the delete_from_swap_cache() right there, partly
> > > because of following shmem_unuse_inode() code which already did that,
> > > partly on the basis that since we have to observe the case anyway, it's
> > > better to clean it up, and partly out of guilt that our page lock here
> > > is what had prevented shmem_undo_range() from completing its job; but
> > > I believe you're right that unused swapcache reclaim would sort it out
> > > eventually.
> >
> > That makes sense to me.
> >
> > > > diff --git a/mm/shmem.c b/mm/shmem.c
> > > > index e80167927dce..236642775f89 100644
> > > > --- a/mm/shmem.c
> > > > +++ b/mm/shmem.c
> > > > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> > > > xas_lock_irq(&xas);
> > > > entry = xas_find_conflict(&xas);
> > > > if (entry != expected)
> > > > - xas_set_err(&xas, -EEXIST);
> > > > + xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
> > >
> > > Two things on this.
> > >
> > > Minor matter of taste, I'd prefer that as
> > > xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> > > which would be more general and more understandable -
> > > but what you have written should be fine for the actual callers.
> >
> > Yes, checking `expected' was to differentiate the behavior depending
> > on the callsite. But testing `entry' is more obvious in that location.
> >
> > > Except... I think returning -ENOENT there will not work correctly,
> > > in the case of a punched hole. Because (unless you've reworked it
> > > and I just haven't looked) shmem_getpage_gfp() knows to retry in
> > > the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
> > > and result in a SIGBUS, or a read/write error, when the hole should
> > > just get refilled instead.
> >
> > Good catch, I had indeed missed that. I'm going to make it retry on
> > -ENOENT as well.
> >
> > We could have it go directly to allocating a new page, but it seems
> > unnecessarily complicated: we've already been retrying in this
> > situation until now, so I would stick to "there was a race, retry."
> >
> > > Not something that needs fixing in a hurry (it took trinity to
> > > generate this racy case in the first place), I'll take another look
> > > once I've pulled it into a tree (or collected next mmotm) - unless
> > > you've already changed it around by then.
> >
> > Attaching a delta fix based on your observations.
> >
> > Andrew, barring any objections to this, could you please fold it into
> > the version you have in your tree already?
>
> Not so strong as an objection, and I won't get to see whether your
> retry on -ENOENT is good (can -ENOENT arrive at that point from any
> other case, that might endlessly retry?) until I've got the full
> context; but I had arrived at the opposite conclusion overnight.
>
> Given that this case only appeared with a fuzzer, and stale swapcache
> reclaim is anyway relied upon to clean up after plenty of other such
> races, I think we should agree that I over-complicated the VM_BUG_ON
> removal originally, and it's best to kill that delete_from_swap_cache(),
> and the comment having to explain it, and your EEXIST/ENOENT distinction.
>
> (I haven't checked, but I suspect that the shmem_unuse_inode() case
> that I copied from, actually really needed to delete_from_swap_cache(),
> in order to swapoff the page without full retry of the big swapoff loop.)

Since commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"),
shmem_unuse_inode() doesn't have its own copy anymore - it uses
shmem_swapin_page().

However, that commit appears to have made shmem's private call to
delete_from_swap_cache() obsolete as well. Whereas before this change
we fully relied on shmem_unuse() to find and clear a shmem swap entry
and its swapcache page, we now only need it to clean out shmem's
private state in the inode, as it's followed by a loop over all
remaining swap slots, calling try_to_free_swap() on stragglers.
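
That sweep is the relevant safety net here. Condensed, and with details
elided, the tail end of try_to_unuse() since b56a2d8af914 looks roughly
like this (a sketch, not the literal code):

	i = 0;
	while (si->inuse_pages &&
	       (i = find_next_to_unuse(si, i, frontswap)) != 0) {
		swp_entry_t entry = swp_entry(type, i);
		struct page *page;

		page = find_get_page(swap_address_space(entry), i);
		if (!page)
			continue;
		lock_page(page);
		wait_on_page_writeback(page);
		try_to_free_swap(page);	/* frees stale swapcache stragglers */
		unlock_page(page);
		put_page(page);
	}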

Unless I missed something, it's still merely an optimization, and we
can delete it for simplicity:

---

From fc9dcaf68c8b54baf365cd670fb5780c7f0d243f Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Mon, 11 May 2020 12:59:08 -0400
Subject: [PATCH] mm: shmem: remove rare optimization when swapin races with
hole punching

Commit 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
recognized that hole punching can race with swapin and removed the
BUG_ON() for a truncated entry from the swapin path.

The patch also added a swapcache deletion to optimize this rare case:
Since swapin has the page locked, and free_swap_and_cache() merely
trylocks, this situation can leave the page stranded in
swapcache. Usually, page reclaim picks up stale swapcache pages, and
the race can happen at any other time when the page is locked. (The
same happens for non-shmem swapin racing with page table zapping.) The
thinking here was: we already observed the race and we have the page
locked, we may as well do the cleanup instead of waiting for reclaim.
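
The "merely trylocks" point is the crux: the swapcache reclaim that
free_swap_and_cache() attempts goes through a helper that backs off if
somebody else holds the page lock; roughly (a sketch of the
__try_to_reclaim_swap() path, details omitted):

	page = find_get_page(swap_address_space(entry), offset);
	if (!page)
		return 0;
	if (trylock_page(page)) {	/* swapin holds the lock: back off */
		ret = try_to_free_swap(page);
		unlock_page(page);
	}
	put_page(page);
	return ret;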

However, this optimization complicates the next patch which moves the
cgroup charging code around. As this is just a minor speedup for a
race condition that is so rare that it required a fuzzer to trigger
the original BUG_ON(), it's no longer worth the complications.

Suggested-by: Hugh Dickins <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
mm/shmem.c | 25 +++++++------------------
1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d505b6cce4ab..729bbb3513cd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1665,27 +1665,16 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
}

error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
- if (!error) {
- error = shmem_add_to_page_cache(page, mapping, index,
- swp_to_radix_entry(swap), gfp);
- /*
- * We already confirmed swap under page lock, and make
- * no memory allocation here, so usually no possibility
- * of error; but free_swap_and_cache() only trylocks a
- * page, so it is just possible that the entry has been
- * truncated or holepunched since swap was confirmed.
- * shmem_undo_range() will have done some of the
- * unaccounting, now delete_from_swap_cache() will do
- * the rest.
- */
- if (error) {
- mem_cgroup_cancel_charge(page, memcg);
- delete_from_swap_cache(page);
- }
- }
if (error)
goto failed;

+ error = shmem_add_to_page_cache(page, mapping, index,
+ swp_to_radix_entry(swap), gfp);
+ if (error) {
+ mem_cgroup_cancel_charge(page, memcg);
+ goto failed;
+ }
+
mem_cgroup_commit_charge(page, memcg, true);

spin_lock_irq(&info->lock);
--
2.26.2

2020-05-11 18:15:04

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, May 11, 2020 at 02:10:58PM -0400, Johannes Weiner wrote:
> From fc9dcaf68c8b54baf365cd670fb5780c7f0d243f Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Mon, 11 May 2020 12:59:08 -0400
> Subject: [PATCH] mm: shmem: remove rare optimization when swapin races with
> hole punching

And a new, conflict-resolved version of the patch this thread is
attached to:

---
From 7f630d9bc5d6f692298fd906edd5f48070b257c7 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Thu, 16 Apr 2020 15:08:07 -0400
Subject: [PATCH] mm: memcontrol: convert page cache to a new
mem_cgroup_charge() API

The try/commit/cancel protocol that memcg uses dates back to when
pages used to be uncharged upon removal from the page cache, and thus
couldn't be committed before the insertion had succeeded. Nowadays,
pages are uncharged when they are physically freed; it doesn't matter
whether the insertion was successful or not. For the page cache, the
transaction dance has become unnecessary.

Introduce a mem_cgroup_charge() function that simply charges a newly
allocated page to a cgroup and sets up page->mem_cgroup in one single
step. If the insertion fails, the caller doesn't have to do anything
but free/put the page.
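
Schematically, the caller-side change looks like this (a sketch; the
add_to_cache() step is a hypothetical stand-in for each caller's actual
xarray insertion):

	/* before: try/commit/cancel must bracket the insertion */
	error = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
	if (error)
		return error;
	error = add_to_cache(page);	/* hypothetical stand-in */
	if (error) {
		mem_cgroup_cancel_charge(page, memcg);
		return error;
	}
	mem_cgroup_commit_charge(page, memcg, false);

	/* after: one call; insertion failure just frees/puts the page */
	error = mem_cgroup_charge(page, mm, gfp_mask, false);
	if (error)
		return error;
	error = add_to_cache(page);
	if (error)
		put_page(page);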

Then switch the page cache over to this new API.

Subsequent patches will also convert anon pages, but it needs a bit
more prep work. Right now, memcg depends on page->mapping being
already set up at the time of charging, so that it can maintain its
own MEMCG_CACHE and MEMCG_RSS counters. For anon, page->mapping is set
under the same pte lock under which the page is published, so a single
charge point that can block doesn't work there just yet.
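
To illustrate why: the anon publishing sequence, simplified from the
fault path (not exact code), keeps the spot where page->mapping becomes
valid inside a spinlock section, where a charge that may enter reclaim
cannot go:

	spin_lock(ptl);					/* pte lock, no sleeping */
	page_add_new_anon_rmap(page, vma, addr, false);	/* sets page->mapping */
	set_pte_at(mm, addr, pte, mk_pte(page, vma->vm_page_prot));
	spin_unlock(ptl);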

The following prep patches will replace the private memcg counters
with the generic vmstat counters, thus removing the page->mapping
dependency, then complete the transition to the new single-point
charge API and delete the old transactional scheme.

v2: leave shmem swapcache when charging fails to avoid double IO (Joonsoo)
v3: rebase on preceding shmem simplification patch

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Alex Shi <[email protected]>
---
include/linux/memcontrol.h | 10 ++++++
mm/filemap.c | 24 ++++++-------
mm/memcontrol.c | 29 +++++++++++++--
mm/shmem.c | 73 ++++++++++++++++----------------------
4 files changed, 77 insertions(+), 59 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 30292d57c8af..57339514d960 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -379,6 +379,10 @@ int mem_cgroup_try_charge_delay(struct page *page, struct mm_struct *mm,
void mem_cgroup_commit_charge(struct page *page, struct mem_cgroup *memcg,
bool lrucare);
void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg);
+
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
+ bool lrucare);
+
void mem_cgroup_uncharge(struct page *page);
void mem_cgroup_uncharge_list(struct list_head *page_list);

@@ -893,6 +897,12 @@ static inline void mem_cgroup_cancel_charge(struct page *page,
{
}

+static inline int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
+ gfp_t gfp_mask, bool lrucare)
+{
+ return 0;
+}
+
static inline void mem_cgroup_uncharge(struct page *page)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index ce200386736c..ee9882509566 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -832,7 +832,6 @@ static int __add_to_page_cache_locked(struct page *page,
{
XA_STATE(xas, &mapping->i_pages, offset);
int huge = PageHuge(page);
- struct mem_cgroup *memcg;
int error;
void *old;

@@ -840,17 +839,16 @@ static int __add_to_page_cache_locked(struct page *page,
VM_BUG_ON_PAGE(PageSwapBacked(page), page);
mapping_set_update(&xas, mapping);

- if (!huge) {
- error = mem_cgroup_try_charge(page, current->mm,
- gfp_mask, &memcg);
- if (error)
- return error;
- }
-
get_page(page);
page->mapping = mapping;
page->index = offset;

+ if (!huge) {
+ error = mem_cgroup_charge(page, current->mm, gfp_mask, false);
+ if (error)
+ goto error;
+ }
+
do {
xas_lock_irq(&xas);
old = xas_load(&xas);
@@ -874,20 +872,18 @@ static int __add_to_page_cache_locked(struct page *page,
xas_unlock_irq(&xas);
} while (xas_nomem(&xas, gfp_mask & GFP_RECLAIM_MASK));

- if (xas_error(&xas))
+ if (xas_error(&xas)) {
+ error = xas_error(&xas);
goto error;
+ }

- if (!huge)
- mem_cgroup_commit_charge(page, memcg, false);
trace_mm_filemap_add_to_page_cache(page);
return 0;
error:
page->mapping = NULL;
/* Leave page->index set: truncation relies upon it */
- if (!huge)
- mem_cgroup_cancel_charge(page, memcg);
put_page(page);
- return xas_error(&xas);
+ return error;
}
ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8188d462d7ce..1d45a09b334f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6578,6 +6578,33 @@ void mem_cgroup_cancel_charge(struct page *page, struct mem_cgroup *memcg)
cancel_charge(memcg, nr_pages);
}

+/**
+ * mem_cgroup_charge - charge a newly allocated page to a cgroup
+ * @page: page to charge
+ * @mm: mm context of the victim
+ * @gfp_mask: reclaim mode
+ * @lrucare: page might be on the LRU already
+ *
+ * Try to charge @page to the memcg that @mm belongs to, reclaiming
+ * pages according to @gfp_mask if necessary.
+ *
+ * Returns 0 on success. Otherwise, an error code is returned.
+ */
+int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask,
+ bool lrucare)
+{
+ struct mem_cgroup *memcg;
+ int ret;
+
+ VM_BUG_ON_PAGE(!page->mapping, page);
+
+ ret = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
+ if (ret)
+ return ret;
+ mem_cgroup_commit_charge(page, memcg, lrucare);
+ return 0;
+}
+
struct uncharge_gather {
struct mem_cgroup *memcg;
unsigned long pgpgout;
@@ -6625,8 +6652,6 @@ static void uncharge_batch(const struct uncharge_gather *ug)
static void uncharge_page(struct page *page, struct uncharge_gather *ug)
{
VM_BUG_ON_PAGE(PageLRU(page), page);
- VM_BUG_ON_PAGE(page_count(page) && !is_zone_device_page(page) &&
- !PageHWPoison(page) , page);

if (!page->mem_cgroup)
return;
diff --git a/mm/shmem.c b/mm/shmem.c
index 729bbb3513cd..0d9615723152 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -605,11 +605,13 @@ static inline bool is_huge_enabled(struct shmem_sb_info *sbinfo)
*/
static int shmem_add_to_page_cache(struct page *page,
struct address_space *mapping,
- pgoff_t index, void *expected, gfp_t gfp)
+ pgoff_t index, void *expected, gfp_t gfp,
+ struct mm_struct *charge_mm)
{
XA_STATE_ORDER(xas, &mapping->i_pages, index, compound_order(page));
unsigned long i = 0;
unsigned long nr = compound_nr(page);
+ int error;

VM_BUG_ON_PAGE(PageTail(page), page);
VM_BUG_ON_PAGE(index != round_down(index, nr), page);
@@ -621,6 +623,16 @@ static int shmem_add_to_page_cache(struct page *page,
page->mapping = mapping;
page->index = index;

+ error = mem_cgroup_charge(page, charge_mm, gfp, PageSwapCache(page));
+ if (error) {
+ if (!PageSwapCache(page) && PageTransHuge(page)) {
+ count_vm_event(THP_FILE_FALLBACK);
+ count_vm_event(THP_FILE_FALLBACK_CHARGE);
+ }
+ goto error;
+ }
+ cgroup_throttle_swaprate(page, gfp);
+
do {
void *entry;
xas_lock_irq(&xas);
@@ -648,12 +660,15 @@ static int shmem_add_to_page_cache(struct page *page,
} while (xas_nomem(&xas, gfp));

if (xas_error(&xas)) {
- page->mapping = NULL;
- page_ref_sub(page, nr);
- return xas_error(&xas);
+ error = xas_error(&xas);
+ goto error;
}

return 0;
+error:
+ page->mapping = NULL;
+ page_ref_sub(page, nr);
+ return error;
}

/*
@@ -1619,7 +1634,6 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
- struct mem_cgroup *memcg;
struct page *page;
swp_entry_t swap;
int error;
@@ -1664,18 +1678,11 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
goto failed;
}

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
- if (error)
- goto failed;
-
error = shmem_add_to_page_cache(page, mapping, index,
- swp_to_radix_entry(swap), gfp);
- if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ swp_to_radix_entry(swap), gfp,
+ charge_mm);
+ if (error)
goto failed;
- }
-
- mem_cgroup_commit_charge(page, memcg, true);

spin_lock_irq(&info->lock);
info->swapped--;
@@ -1722,7 +1729,6 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
struct shmem_inode_info *info = SHMEM_I(inode);
struct shmem_sb_info *sbinfo;
struct mm_struct *charge_mm;
- struct mem_cgroup *memcg;
struct page *page;
enum sgp_type sgp_huge = sgp;
pgoff_t hindex = index;
@@ -1847,21 +1853,11 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
if (sgp == SGP_WRITE)
__SetPageReferenced(page);

- error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
- if (error) {
- if (PageTransHuge(page)) {
- count_vm_event(THP_FILE_FALLBACK);
- count_vm_event(THP_FILE_FALLBACK_CHARGE);
- }
- goto unacct;
- }
error = shmem_add_to_page_cache(page, mapping, hindex,
- NULL, gfp & GFP_RECLAIM_MASK);
- if (error) {
- mem_cgroup_cancel_charge(page, memcg);
+ NULL, gfp & GFP_RECLAIM_MASK,
+ charge_mm);
+ if (error)
goto unacct;
- }
- mem_cgroup_commit_charge(page, memcg, false);
lru_cache_add_anon(page);

spin_lock_irq(&info->lock);
@@ -2299,7 +2295,6 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
struct address_space *mapping = inode->i_mapping;
gfp_t gfp = mapping_gfp_mask(mapping);
pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
- struct mem_cgroup *memcg;
spinlock_t *ptl;
void *page_kaddr;
struct page *page;
@@ -2349,16 +2344,10 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
if (unlikely(offset >= max_off))
goto out_release;

- ret = mem_cgroup_try_charge_delay(page, dst_mm, gfp, &memcg);
- if (ret)
- goto out_release;
-
ret = shmem_add_to_page_cache(page, mapping, pgoff, NULL,
- gfp & GFP_RECLAIM_MASK);
+ gfp & GFP_RECLAIM_MASK, dst_mm);
if (ret)
- goto out_release_uncharge;
-
- mem_cgroup_commit_charge(page, memcg, false);
+ goto out_release;

_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
if (dst_vma->vm_flags & VM_WRITE)
@@ -2379,11 +2368,11 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ret = -EFAULT;
max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
if (unlikely(offset >= max_off))
- goto out_release_uncharge_unlock;
+ goto out_release_unlock;

ret = -EEXIST;
if (!pte_none(*dst_pte))
- goto out_release_uncharge_unlock;
+ goto out_release_unlock;

lru_cache_add_anon(page);

@@ -2404,12 +2393,10 @@ static int shmem_mfill_atomic_pte(struct mm_struct *dst_mm,
ret = 0;
out:
return ret;
-out_release_uncharge_unlock:
+out_release_unlock:
pte_unmap_unlock(dst_pte, ptl);
ClearPageDirty(page);
delete_from_page_cache(page);
-out_release_uncharge:
- mem_cgroup_cancel_charge(page, memcg);
out_release:
unlock_page(page);
put_page(page);
--
2.26.2

2020-05-11 18:46:40

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new mem_cgroup_charge() API

On Mon, 11 May 2020, Johannes Weiner wrote:
>
> Since commit b56a2d8af914 ("mm: rid swapoff of quadratic complexity"),
> shmem_unuse_inode() doesn't have its own copy anymore - it uses
> shmem_swapin_page().
>
> However, that commit appears to have made shmem's private call to
> delete_from_swap_cache() obsolete as well. Whereas before this change
> we fully relied on shmem_unuse() to find and clear a shmem swap entry
> and its swapcache page, we now only need it to clean out shmem's
> private state in the inode, as it's followed by a loop over all
> remaining swap slots, calling try_to_free_swap() on stragglers.

Great, you've looked deeper into the current situation than I had.

>
> Unless I missed something, it's still merely an optimization, and we
> can delete it for simplicity:

Yes, nice ---s, simpler code, and a good idea to separate it out
as a precursor: thanks, Hannes.

>
> ---
>
> From fc9dcaf68c8b54baf365cd670fb5780c7f0d243f Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <[email protected]>
> Date: Mon, 11 May 2020 12:59:08 -0400
> Subject: [PATCH] mm: shmem: remove rare optimization when swapin races with
> hole punching
>
> Commit 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp() VM_BUG_ON")
> recognized that hole punching can race with swapin and removed the
> BUG_ON() for a truncated entry from the swapin path.
>
> The patch also added a swapcache deletion to optimize this rare case:
> Since swapin has the page locked, and free_swap_and_cache() merely
> trylocks, this situation can leave the page stranded in
> swapcache. Usually, page reclaim picks up stale swapcache pages, and
> the race can happen at any other time when the page is locked. (The
> same happens for non-shmem swapin racing with page table zapping.) The
> thinking here was: we already observed the race and we have the page
> locked, we may as well do the cleanup instead of waiting for reclaim.
>
> However, this optimization complicates the next patch which moves the
> cgroup charging code around. As this is just a minor speedup for a
> race condition that is so rare that it required a fuzzer to trigger
> the original BUG_ON(), it's no longer worth the complications.
>
> Suggested-by: Hugh Dickins <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: Hugh Dickins <[email protected]>
(if one is allowed to suggest and to ack)

> ---
> mm/shmem.c | 25 +++++++------------------
> 1 file changed, 7 insertions(+), 18 deletions(-)
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d505b6cce4ab..729bbb3513cd 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1665,27 +1665,16 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
> }
>
> error = mem_cgroup_try_charge_delay(page, charge_mm, gfp, &memcg);
> - if (!error) {
> - error = shmem_add_to_page_cache(page, mapping, index,
> - swp_to_radix_entry(swap), gfp);
> - /*
> - * We already confirmed swap under page lock, and make
> - * no memory allocation here, so usually no possibility
> - * of error; but free_swap_and_cache() only trylocks a
> - * page, so it is just possible that the entry has been
> - * truncated or holepunched since swap was confirmed.
> - * shmem_undo_range() will have done some of the
> - * unaccounting, now delete_from_swap_cache() will do
> - * the rest.
> - */
> - if (error) {
> - mem_cgroup_cancel_charge(page, memcg);
> - delete_from_swap_cache(page);
> - }
> - }
> if (error)
> goto failed;
>
> + error = shmem_add_to_page_cache(page, mapping, index,
> + swp_to_radix_entry(swap), gfp);
> + if (error) {
> + mem_cgroup_cancel_charge(page, memcg);
> + goto failed;
> + }
> +
> mem_cgroup_commit_charge(page, memcg, true);
>
> spin_lock_irq(&info->lock);
> --
> 2.26.2
>
>