2022-02-16 12:31:20

by Muchun Song

Subject: [PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

This version is rebased over linux 5.17-rc4.

Since the following patchsets were applied, all kernel memory is now
charged with the new obj_cgroup APIs:

[v17,00/19] The new cgroup slab memory controller [1]
[v5,0/7] Use obj_cgroup APIs to charge kmem pages [2]

But user memory allocations (LRU pages) can still pin memcgs for a long
time. This happens at a larger scale and causes recurring problems in the
real world: page cache doesn't get reclaimed for a long time, or is used by
the second, third, fourth, ... instance of the same job that was restarted
into a new cgroup every time. Unreclaimable dying cgroups pile up, waste
memory, and make page reclaim very inefficient.

We can fix this problem by converting LRU pages and most other raw memcg
pins to the objcg direction, after which LRU pages no longer pin their
memcg.
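
To make the direction concrete, here is a minimal conceptual sketch (not
code from this series; sketch_folio_memcg() is a hypothetical name, and
RCU/locking details are omitted) of the pointer chain once LRU pages carry
an objcg:

```c
/*
 * Conceptual sketch only (hypothetical helper name; RCU and the flag
 * bits stored in memcg_data are ignored for clarity).
 */
static inline struct mem_cgroup *sketch_folio_memcg(struct folio *folio)
{
	/* After the series, an LRU folio's memcg_data holds an objcg pointer. */
	struct obj_cgroup *objcg = (struct obj_cgroup *)folio->memcg_data;

	/*
	 * objcg->memcg is redirected to the parent memcg when the original
	 * memcg is offlined, so the folio no longer pins a dying memcg.
	 */
	return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
```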

This patchset makes LRU pages drop their reference to the memory cgroup by
using the obj_cgroup APIs. With the series applied, the number of dying
cgroups no longer increases when running the following test script.

```bash
#!/bin/bash

dd if=/dev/zero of=temp bs=4096 count=1
cat /proc/cgroups | grep memory

for i in {0..2000}
do
    mkdir /sys/fs/cgroup/memory/test$i
    echo $$ > /sys/fs/cgroup/memory/test$i/cgroup.procs
    cat temp >> log
    echo $$ > /sys/fs/cgroup/memory/cgroup.procs
    rmdir /sys/fs/cgroup/memory/test$i
done

cat /proc/cgroups | grep memory

rm -f temp log
```

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

v2: https://lore.kernel.org/all/[email protected]/
v1: https://lore.kernel.org/all/[email protected]/
RFC v4: https://lore.kernel.org/all/[email protected]/
RFC v3: https://lore.kernel.org/all/[email protected]/
RFC v2: https://lore.kernel.org/all/[email protected]/
RFC v1: https://lore.kernel.org/all/[email protected]/

v3:
- Removed the Acked-by tags from Roman since this version is rebased on
  top of the folio changes.

v2:
- Rename obj_cgroup_release_kmem() to obj_cgroup_release_bytes() and remove
  the dependency on CONFIG_MEMCG_KMEM (suggested by Roman, thanks).
- Rebase to linux 5.15-rc1.
- Add a new patch to clean up mem_cgroup_kmem_disabled().

v1:
- Drop RFC tag.
- Rebase to linux next-20210811.

RFC v4:
- Collect Acked-by from Roman.
- Rebase to linux next-20210525.
- Rename obj_cgroup_release_uncharge() to obj_cgroup_release_kmem().
- Change the patch 1 title to "prepare objcg API for non-kmem usage".
- Convert reparent_ops_head to an array in patch 8.

Thanks to Roman for the review and suggestions.

RFC v3:
- Drop the code cleanup and simplification patches. Gather those patches
into a separate series[1].
- Rework patch #1 as suggested by Johannes.

RFC v2:
- Collect Acked-by tags from Johannes. Thanks.
- Rework lruvec_holds_page_lru_lock() as suggested by Johannes. Thanks.
- Fix move_pages_to_lru().

Muchun Song (12):
mm: memcontrol: prepare objcg API for non-kmem usage
mm: memcontrol: introduce compact_folio_lruvec_lock_irqsave
mm: memcontrol: make lruvec lock safe when LRU pages are reparented
mm: vmscan: rework move_pages_to_lru()
mm: thp: introduce folio_split_queue_lock{_irqsave}()
mm: thp: make split queue lock safe when LRU pages are reparented
mm: memcontrol: make all the callers of {folio,page}_memcg() safe
mm: memcontrol: introduce memcg_reparent_ops
mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
mm: lru: add VM_BUG_ON_FOLIO to lru maintenance function
mm: lru: use lruvec lock to serialize memcg changes

Documentation/admin-guide/cgroup-v1/memory.rst | 2 +-
fs/buffer.c | 12 +-
fs/fs-writeback.c | 23 +-
include/linux/memcontrol.h | 192 +++++----
include/linux/mm.h | 1 +
include/linux/mm_inline.h | 15 +-
include/trace/events/writeback.h | 5 +
mm/compaction.c | 39 +-
mm/filemap.c | 2 +-
mm/huge_memory.c | 166 ++++++--
mm/memcontrol.c | 559 ++++++++++++++++++-------
mm/migrate.c | 4 +
mm/page-writeback.c | 6 +-
mm/page_io.c | 5 +-
mm/rmap.c | 14 +-
mm/swap.c | 49 +--
mm/vmscan.c | 57 ++-
17 files changed, 795 insertions(+), 356 deletions(-)

--
2.11.0


2022-02-16 13:01:21

by Muchun Song

Subject: [PATCH v3 05/12] mm: thp: introduce folio_split_queue_lock{_irqsave}()

The THP deferred split queue lock needs to be made safe when LRU pages
are reparented. Similar to folio_lruvec_lock{_irqsave,_irq}(), introduce
folio_split_queue_lock{_irqsave}() so that the deferred split queue is
looked up from the folio, which makes the lock easier to handle when it
is reparented.

In the next patch, a similar approach to the one used for the lruvec
lock will make the THP deferred split queue lock safe when LRU pages
are reparented.
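
For reference, the usage pattern established by the new helpers (taken
from the hunks below):

```c
struct deferred_split *queue;
unsigned long flags;

/* Resolve the split queue (memcg or node) from the folio and lock it. */
queue = folio_split_queue_lock_irqsave(folio, &flags);
/* ... manipulate queue->split_queue / queue->split_queue_len ... */
split_queue_unlock_irqrestore(queue, flags);
```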

Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 10 +++++
mm/huge_memory.c | 97 +++++++++++++++++++++++++++++++++-------------
2 files changed, 80 insertions(+), 27 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 961e9f9b6567..df607c9de500 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1633,6 +1633,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg);
void free_shrinker_info(struct mem_cgroup *memcg);
void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
void reparent_shrinker_deferred(struct mem_cgroup *memcg);
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+ return shrinker->id;
+}
#else
#define mem_cgroup_sockets_enabled 0
static inline void mem_cgroup_sk_alloc(struct sock *sk) { };
@@ -1646,6 +1651,11 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
int nid, int shrinker_id)
{
}
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+ return -1;
+}
#endif

#ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..a227731988b3 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -499,25 +499,70 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
}

#ifdef CONFIG_MEMCG
-static inline struct deferred_split *get_deferred_split_queue(struct page *page)
+static inline struct mem_cgroup *split_queue_memcg(struct deferred_split *queue)
{
- struct mem_cgroup *memcg = page_memcg(compound_head(page));
- struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
+ if (mem_cgroup_disabled())
+ return NULL;
+ return container_of(queue, struct mem_cgroup, deferred_split_queue);
+}

- if (memcg)
- return &memcg->deferred_split_queue;
- else
- return &pgdat->deferred_split_queue;
+static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio)
+{
+ struct mem_cgroup *memcg = folio_memcg(folio);
+
+ return memcg ? &memcg->deferred_split_queue : NULL;
}
#else
-static inline struct deferred_split *get_deferred_split_queue(struct page *page)
+static inline struct mem_cgroup *split_queue_memcg(struct deferred_split *queue)
{
- struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
+ return NULL;
+}

- return &pgdat->deferred_split_queue;
+static inline struct deferred_split *folio_memcg_split_queue(struct folio *folio)
+{
+ return NULL;
}
#endif

+static struct deferred_split *folio_split_queue(struct folio *folio)
+{
+ struct deferred_split *queue = folio_memcg_split_queue(folio);
+
+ return queue ? : &NODE_DATA(folio_nid(folio))->deferred_split_queue;
+}
+
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
+{
+ struct deferred_split *queue;
+
+ queue = folio_split_queue(folio);
+ spin_lock(&queue->split_queue_lock);
+
+ return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+ struct deferred_split *queue;
+
+ queue = folio_split_queue(folio);
+ spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+ return queue;
+}
+
+static inline void split_queue_unlock(struct deferred_split *queue)
+{
+ spin_unlock(&queue->split_queue_lock);
+}
+
+static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
+ unsigned long flags)
+{
+ spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+}
+
void prep_transhuge_page(struct page *page)
{
/*
@@ -2602,8 +2647,9 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
*/
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
- struct page *head = compound_head(page);
- struct deferred_split *ds_queue = get_deferred_split_queue(head);
+ struct folio *folio = page_folio(page);
+ struct page *head = &folio->page;
+ struct deferred_split *ds_queue;
XA_STATE(xas, &head->mapping->i_pages, head->index);
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
@@ -2690,13 +2736,13 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}

/* Prevent deferred_split_scan() touching ->_refcount */
- spin_lock(&ds_queue->split_queue_lock);
+ ds_queue = folio_split_queue_lock(folio);
if (page_ref_freeze(head, 1 + extra_pins)) {
if (!list_empty(page_deferred_list(head))) {
ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock(&ds_queue->split_queue_lock);
+ split_queue_unlock(ds_queue);
if (mapping) {
int nr = thp_nr_pages(head);

@@ -2714,7 +2760,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
__split_huge_page(page, list, end);
ret = 0;
} else {
- spin_unlock(&ds_queue->split_queue_lock);
+ split_queue_unlock(ds_queue);
fail:
if (mapping)
xas_unlock(&xas);
@@ -2739,24 +2785,21 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)

void free_transhuge_page(struct page *page)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(page);
+ struct deferred_split *ds_queue;
unsigned long flags;

- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = folio_split_queue_lock_irqsave(page_folio(page), &flags);
if (!list_empty(page_deferred_list(page))) {
ds_queue->split_queue_len--;
list_del(page_deferred_list(page));
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
free_compound_page(page);
}

void deferred_split_huge_page(struct page *page)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(page);
-#ifdef CONFIG_MEMCG
- struct mem_cgroup *memcg = page_memcg(compound_head(page));
-#endif
+ struct deferred_split *ds_queue;
unsigned long flags;

VM_BUG_ON_PAGE(!PageTransHuge(page), page);
@@ -2774,18 +2817,18 @@ void deferred_split_huge_page(struct page *page)
if (PageSwapCache(page))
return;

- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = folio_split_queue_lock_irqsave(page_folio(page), &flags);
if (list_empty(page_deferred_list(page))) {
+ struct mem_cgroup *memcg = split_queue_memcg(ds_queue);
+
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
ds_queue->split_queue_len++;
-#ifdef CONFIG_MEMCG
if (memcg)
set_shrinker_bit(memcg, page_to_nid(page),
- deferred_split_shrinker.id);
-#endif
+ shrinker_id(&deferred_split_shrinker));
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ split_queue_unlock_irqrestore(ds_queue, flags);
}

static unsigned long deferred_split_count(struct shrinker *shrink,
--
2.11.0

2022-02-16 13:12:13

by Muchun Song

Subject: [PATCH v3 12/12] mm: lru: use lruvec lock to serialize memcg changes

As described by commit fc574c23558c ("mm/swap.c: serialize memcg
changes in pagevec_lru_move_fn"), TestClearPageLRU() aims to
serialize mem_cgroup_move_account() during pagevec_lru_move_fn().
Now folio_lruvec_lock*() can detect whether the page's memcg has
been changed, so the lruvec lock can be used to serialize
mem_cgroup_move_account() during pagevec_lru_move_fn(). This is a
partial revert of commit fc574c23558c ("mm/swap.c: serialize memcg
changes in pagevec_lru_move_fn").

Since pagevec_lru_move_fn() is hotter than mem_cgroup_move_account(),
removing an atomic operation from the hot path is an optimization.
This change also avoids dirtying the cacheline of a page that isn't
on the LRU.
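
Condensed sketch of the resulting pattern (based on the hunks below,
details elided): the LRU check moves under the lruvec lock inside each
move_fn instead of being done up front with TestClearPageLRU():

```c
/* Inside pagevec_lru_move_fn(), condensed from the diff below. */
lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
/*
 * folio_lruvec_lock*() rechecks folio_memcg() under lru_lock, so the
 * memcg cannot change while the lock is held; each move_fn now only
 * needs to verify that the page is still on the LRU.
 */
if (folio_test_lru(folio) && !folio_test_unevictable(folio)) {
	/* ... move the folio within this lruvec ... */
}
```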

Signed-off-by: Muchun Song <[email protected]>
---
mm/memcontrol.c | 32 +++++++++++++++++++++++++++++++-
mm/swap.c | 45 ++++++++++++++-------------------------------
mm/vmscan.c | 9 ++++-----
3 files changed, 49 insertions(+), 37 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9531bdb6ede3..0a28f87b68c0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1316,13 +1316,38 @@ struct lruvec *folio_lruvec_lock(struct folio *folio)
lruvec = folio_lruvec(folio);

spin_lock(&lruvec->lru_lock);
-
+ /*
+ * The memcg of the page can be changed by any of the following routines:
+ *
+ * 1) mem_cgroup_move_account() or
+ * 2) memcg_reparent_objcgs()
+ *
+ * A possible bad scenario would be:
+ *
+ * CPU0: CPU1: CPU2:
+ * lruvec = folio_lruvec()
+ *
+ * if (!isolate_lru_page())
+ * mem_cgroup_move_account()
+ *
+ * memcg_reparent_objcgs()
+ *
+ * spin_lock(&lruvec->lru_lock)
+ * ^^^^^^
+ * wrong lock
+ *
+ * Either CPU1 or CPU2 can change page memcg, so we need to check
+ * whether page memcg is changed, if so, we should reacquire the
+ * new lruvec lock.
+ */
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
spin_unlock(&lruvec->lru_lock);
goto retry;
}

/*
+ * When we reach here, it means that the folio_memcg(folio) is stable.
+ *
* Preemption is disabled in the internal of spin_lock, which can serve
* as RCU read-side critical sections.
*/
@@ -1353,6 +1378,7 @@ struct lruvec *folio_lruvec_lock_irq(struct folio *folio)
lruvec = folio_lruvec(folio);
spin_lock_irq(&lruvec->lru_lock);

+ /* See the comments in folio_lruvec_lock(). */
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
spin_unlock_irq(&lruvec->lru_lock);
goto retry;
@@ -1388,6 +1414,7 @@ struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
lruvec = folio_lruvec(folio);
spin_lock_irqsave(&lruvec->lru_lock, *flags);

+ /* See the comments in folio_lruvec_lock(). */
if (unlikely(lruvec_memcg(lruvec) != folio_memcg(folio))) {
spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
goto retry;
@@ -5834,7 +5861,10 @@ static int mem_cgroup_move_account(struct page *page,
obj_cgroup_put(rcu_dereference(from->objcg));
rcu_read_unlock();

+ /* See the comments in folio_lruvec_lock(). */
+ spin_lock(&from_vec->lru_lock);
folio->memcg_data = (unsigned long)rcu_access_pointer(to->objcg);
+ spin_unlock(&from_vec->lru_lock);

__folio_memcg_unlock(from);

diff --git a/mm/swap.c b/mm/swap.c
index 9c2bcc2651c6..b9022fbbb70f 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -201,14 +201,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
struct page *page = pvec->pages[i];
struct folio *folio = page_folio(page);

- /* block memcg migration during page moving between lru */
- if (!TestClearPageLRU(page))
- continue;
-
lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
(*move_fn)(page, lruvec);
-
- SetPageLRU(page);
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -220,7 +214,7 @@ static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
struct folio *folio = page_folio(page);

- if (!folio_test_unevictable(folio)) {
+ if (folio_test_lru(folio) && !folio_test_unevictable(folio)) {
lruvec_del_folio(lruvec, folio);
folio_clear_active(folio);
lruvec_add_folio_tail(lruvec, folio);
@@ -315,7 +309,8 @@ void lru_note_cost_folio(struct folio *folio)

static void __folio_activate(struct folio *folio, struct lruvec *lruvec)
{
- if (!folio_test_active(folio) && !folio_test_unevictable(folio)) {
+ if (folio_test_lru(folio) && !folio_test_active(folio) &&
+ !folio_test_unevictable(folio)) {
long nr_pages = folio_nr_pages(folio);

lruvec_del_folio(lruvec, folio);
@@ -372,12 +367,9 @@ static void folio_activate(struct folio *folio)
{
struct lruvec *lruvec;

- if (folio_test_clear_lru(folio)) {
- lruvec = folio_lruvec_lock_irq(folio);
- __folio_activate(folio, lruvec);
- unlock_page_lruvec_irq(lruvec);
- folio_set_lru(folio);
- }
+ lruvec = folio_lruvec_lock_irq(folio);
+ __folio_activate(folio, lruvec);
+ unlock_page_lruvec_irq(lruvec);
}
#endif

@@ -530,6 +522,9 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
bool active = PageActive(page);
int nr_pages = thp_nr_pages(page);

+ if (!PageLRU(page))
+ return;
+
if (PageUnevictable(page))
return;

@@ -567,7 +562,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)

static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
{
- if (PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
int nr_pages = thp_nr_pages(page);

del_page_from_lru_list(page, lruvec);
@@ -583,7 +578,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)

static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
{
- if (PageAnon(page) && PageSwapBacked(page) &&
+ if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
!PageSwapCache(page) && !PageUnevictable(page)) {
int nr_pages = thp_nr_pages(page);

@@ -1006,8 +1001,9 @@ void __pagevec_release(struct pagevec *pvec)
}
EXPORT_SYMBOL(__pagevec_release);

-static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
+static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
{
+ struct folio *folio = page_folio(page);
int was_unevictable = folio_test_clear_unevictable(folio);
long nr_pages = folio_nr_pages(folio);

@@ -1064,20 +1060,7 @@ static void __pagevec_lru_add_fn(struct folio *folio, struct lruvec *lruvec)
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- int i;
- struct lruvec *lruvec = NULL;
- unsigned long flags = 0;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct folio *folio = page_folio(pvec->pages[i]);
-
- lruvec = folio_lruvec_relock_irqsave(folio, lruvec, &flags);
- __pagevec_lru_add_fn(folio, lruvec);
- }
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
- release_pages(pvec->pages, pvec->nr);
- pagevec_reinit(pvec);
+ pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
}

/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 00207553c419..23d6f91b483a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4868,18 +4868,17 @@ void check_move_unevictable_pages(struct pagevec *pvec)
nr_pages = thp_nr_pages(page);
pgscanned += nr_pages;

- /* block memcg migration during page moving between lru */
- if (!TestClearPageLRU(page))
+ lruvec = folio_lruvec_relock_irq(folio, lruvec);
+
+ if (!PageLRU(page) || !PageUnevictable(page))
continue;

- lruvec = folio_lruvec_relock_irq(folio, lruvec);
- if (page_evictable(page) && PageUnevictable(page)) {
+ if (page_evictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageUnevictable(page);
add_page_to_lru_list(page, lruvec);
pgrescued += nr_pages;
}
- SetPageLRU(page);
}

if (lruvec) {
--
2.11.0

2022-02-16 13:14:45

by Muchun Song

Subject: [PATCH v3 08/12] mm: memcontrol: introduce memcg_reparent_ops

The previous patch showed how to make the lruvec lock safe when LRU pages
are reparented. We need to do something like the following.

memcg_reparent_objcgs(memcg)
1) lock
// lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);

2) do reparent
// Move all the pages from the lruvec list to the parent lruvec list.

3) unlock
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);

Apart from the lruvec lock, the deferred split queue lock (THP only) also
needs to do something similar. So abstract the three necessary steps in
memcg_reparent_objcgs() into a set of callbacks.

memcg_reparent_objcgs(memcg)
1) lock
memcg_reparent_ops->lock(memcg, parent);

2) reparent
memcg_reparent_ops->reparent(memcg, parent);

3) unlock
memcg_reparent_ops->unlock(memcg, parent);

There are now two different locks (the lruvec lock and the deferred split
queue lock) that need this infrastructure. In the next patch, these APIs
are used to make both locks safe when LRU pages are reparented.
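
As an illustration of how a lock would plug into this infrastructure
(hypothetical names; the real lruvec and split-queue implementations
arrive in the next patch), an ops instance provides the three callbacks
and is listed in the memcg_reparent_ops array:

```c
/* Hypothetical example; the real implementations come in the next patch. */
static void lruvec_reparent_lock(struct mem_cgroup *src, struct mem_cgroup *dst)
{
	/* Take src's and dst's lru_lock on each node, in a fixed order. */
}

static void lruvec_reparent_relocate(struct mem_cgroup *src, struct mem_cgroup *dst)
{
	/* Splice src's LRU lists onto dst's while both locks are held. */
}

static void lruvec_reparent_unlock(struct mem_cgroup *src, struct mem_cgroup *dst)
{
	/* Release the locks taken in ->lock(). */
}

static const struct memcg_reparent_ops lruvec_reparent_ops = {
	.lock		= lruvec_reparent_lock,
	.reparent	= lruvec_reparent_relocate,
	.unlock		= lruvec_reparent_unlock,
};

static const struct memcg_reparent_ops *memcg_reparent_ops[] = {
	&lruvec_reparent_ops,
};
```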

Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 7 +++++++
mm/memcontrol.c | 39 +++++++++++++++++++++++++++++++++++++--
2 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e0f7104f2fa..3c841c155f0d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -346,6 +346,13 @@ struct mem_cgroup {
struct mem_cgroup_per_node *nodeinfo[];
};

+struct memcg_reparent_ops {
+ /* IRQs are disabled before calling these callbacks. */
+ void (*lock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+ void (*unlock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+ void (*reparent)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+};
+
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dd2602149ef3..6a393fe8e589 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -336,6 +336,35 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}

+static const struct memcg_reparent_ops *memcg_reparent_ops[] = {};
+
+static void memcg_reparent_lock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++)
+ memcg_reparent_ops[i]->lock(memcg, parent);
+}
+
+static void memcg_reparent_unlock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++)
+ memcg_reparent_ops[i]->unlock(memcg, parent);
+}
+
+static void memcg_do_reparent(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(memcg_reparent_ops); i++)
+ memcg_reparent_ops[i]->reparent(memcg, parent);
+}
+
static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
@@ -345,9 +374,11 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;

+ local_irq_disable();
+ memcg_reparent_lock(memcg, parent);
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

- spin_lock_irq(&objcg_lock);
+ spin_lock(&objcg_lock);

/* 1) Ready to reparent active objcg. */
list_add(&objcg->list, &memcg->objcg_list);
@@ -357,7 +388,11 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
/* 3) Move already reparented objcgs to the parent's list */
list_splice(&memcg->objcg_list, &parent->objcg_list);

- spin_unlock_irq(&objcg_lock);
+ spin_unlock(&objcg_lock);
+
+ memcg_do_reparent(memcg, parent);
+ memcg_reparent_unlock(memcg, parent);
+ local_irq_enable();

percpu_ref_kill(&objcg->refcnt);
}
--
2.11.0

2022-02-16 13:16:52

by Muchun Song

Subject: [PATCH v3 01/12] mm: memcontrol: prepare objcg API for non-kmem usage

Page cache pages are charged at allocation time and hold a reference
to the original memory cgroup until they are reclaimed. Depending on
memory pressure, the patterns of page sharing between different
cgroups, and the cgroup creation and destruction rates, a large number
of dying memory cgroups can be pinned by page cache pages. This makes
page reclaim less efficient and wastes memory.

We can fix this problem by converting LRU pages and most other raw
memcg pins to the objcg direction, after which page->memcg_data will
always hold an object cgroup pointer.

Therefore, the objcg infrastructure no longer serves only
CONFIG_MEMCG_KMEM. This patch moves the objcg infrastructure out of
the scope of CONFIG_MEMCG_KMEM so that LRU pages can reuse it to
charge pages.

LRU pages are not accounted at the root level, but their
page->memcg_data still points to root_mem_cgroup, so page->memcg_data
of an LRU page always holds a valid pointer. However, root_mem_cgroup
does not have an object cgroup. If the obj_cgroup APIs are used to
charge LRU pages, page->memcg_data must be set to a root object
cgroup, so we also allocate an object cgroup for root_mem_cgroup.
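
The structural consequence (mirroring the mem_cgroup_css_online() hunk
below) is that the objcg allocation now happens for every memcg,
including root_mem_cgroup, and no longer depends on kmem accounting
being enabled:

```c
/* Shape of the allocation after this patch, as done in mem_cgroup_css_online(). */
objcg = obj_cgroup_alloc();
if (!objcg)
	goto free_shrinker;

objcg->memcg = memcg;
rcu_assign_pointer(memcg->objcg, objcg);
```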

Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 2 +-
mm/memcontrol.c | 66 +++++++++++++++++++++++++++++-----------------
2 files changed, 43 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0abbd685703b..81a2720653d0 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -314,10 +314,10 @@ struct mem_cgroup {

#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
+#endif
struct obj_cgroup __rcu *objcg;
/* list of inherited objcgs, protected by objcg_lock */
struct list_head objcg_list;
-#endif

MEMCG_PADDING(_pad2_);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 36e9f38c919d..6501f5b6df4b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -253,9 +253,9 @@ struct mem_cgroup *vmpressure_to_memcg(struct vmpressure *vmpr)
return container_of(vmpr, struct mem_cgroup, vmpressure);
}

-#ifdef CONFIG_MEMCG_KMEM
static DEFINE_SPINLOCK(objcg_lock);

+#ifdef CONFIG_MEMCG_KMEM
bool mem_cgroup_kmem_disabled(void)
{
return cgroup_memory_nokmem;
@@ -264,12 +264,10 @@ bool mem_cgroup_kmem_disabled(void)
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
unsigned int nr_pages);

-static void obj_cgroup_release(struct percpu_ref *ref)
+static void obj_cgroup_release_bytes(struct obj_cgroup *objcg)
{
- struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
unsigned int nr_bytes;
unsigned int nr_pages;
- unsigned long flags;

/*
* At this point all allocated objects are freed, and
@@ -283,9 +281,9 @@ static void obj_cgroup_release(struct percpu_ref *ref)
* 3) CPU1: a process from another memcg is allocating something,
* the stock if flushed,
* objcg->nr_charged_bytes = PAGE_SIZE - 92
- * 5) CPU0: we do release this object,
+ * 4) CPU0: we do release this object,
* 92 bytes are added to stock->nr_bytes
- * 6) CPU0: stock is flushed,
+ * 5) CPU0: stock is flushed,
* 92 bytes are added to objcg->nr_charged_bytes
*
* In the result, nr_charged_bytes == PAGE_SIZE.
@@ -297,6 +295,19 @@ static void obj_cgroup_release(struct percpu_ref *ref)

if (nr_pages)
obj_cgroup_uncharge_pages(objcg, nr_pages);
+}
+#else
+static inline void obj_cgroup_release_bytes(struct obj_cgroup *objcg)
+{
+}
+#endif
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+ struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+ unsigned long flags;
+
+ obj_cgroup_release_bytes(objcg);

spin_lock_irqsave(&objcg_lock, flags);
list_del(&objcg->list);
@@ -325,10 +336,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}

-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent;
+
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;

objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

@@ -347,6 +362,7 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
percpu_ref_kill(&objcg->refcnt);
}

+#ifdef CONFIG_MEMCG_KMEM
/*
* This will be used as a shrinker list's index.
* The main reason for not using cgroup id for this:
@@ -3624,7 +3640,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
int memcg_id;

if (cgroup_memory_nokmem)
@@ -3636,14 +3651,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
if (memcg_id < 0)
return memcg_id;

- objcg = obj_cgroup_alloc();
- if (!objcg) {
- memcg_free_cache_id(memcg_id);
- return -ENOMEM;
- }
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
-
static_branch_enable(&memcg_kmem_enabled_key);

memcg->kmemcg_id = memcg_id;
@@ -3663,8 +3670,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;

- memcg_reparent_objcgs(memcg, parent);
-
kmemcg_id = memcg->kmemcg_id;
BUG_ON(kmemcg_id < 0);

@@ -5166,8 +5171,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
- INIT_LIST_HEAD(&memcg->objcg_list);
#endif
+ INIT_LIST_HEAD(&memcg->objcg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
@@ -5239,16 +5244,22 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct obj_cgroup *objcg;

/*
* A memcg must be visible for expand_shrinker_info()
* by the time the maps are allocated. So, we allocate maps
* here, when for_each_mem_cgroup() can't skip it.
*/
- if (alloc_shrinker_info(memcg)) {
- mem_cgroup_id_remove(memcg);
- return -ENOMEM;
- }
+ if (alloc_shrinker_info(memcg))
+ goto remove_id;
+
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ goto free_shrinker;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);

/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
@@ -5258,6 +5269,12 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
queue_delayed_work(system_unbound_wq, &stats_flush_dwork,
2UL*HZ);
return 0;
+
+free_shrinker:
+ free_shrinker_info(memcg);
+remove_id:
+ mem_cgroup_id_remove(memcg);
+ return -ENOMEM;
}

static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
@@ -5281,6 +5298,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
page_counter_set_low(&memcg->memory, 0);

memcg_offline_kmem(memcg);
+ memcg_reparent_objcgs(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);

--
2.11.0