Since the following patchsets were applied, all kernel memory is charged
via the new obj_cgroup APIs:
[v17,00/19] The new cgroup slab memory controller
[v5,0/7] Use obj_cgroup APIs to charge kmem pages
But user memory allocations (LRU pages) can pin a memcg for a long time.
This happens at a larger scale and causes recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted into
a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.
We can fix this problem by converting LRU pages and most other raw memcg
pins to the objcg direction, so that the LRU pages no longer pin the memcgs.
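As a rough illustration (not the exact final code), the core of the change is
that page_memcg() becomes a lookup through an object cgroup; the helper below
is simplified from the last patches of this series:
```c
/*
 * Simplified sketch of the indirection introduced by this series: an LRU
 * page no longer points to its memcg directly. Instead, page->memcg_data
 * holds an obj_cgroup pointer, and the objcg is re-pointed to the parent
 * memcg when the original memcg is offlined, so the page never pins a
 * dying memcg.
 */
static inline struct mem_cgroup *page_memcg(struct page *page)
{
	struct obj_cgroup *objcg = page_objcg(page);

	return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
```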
This patchset makes the LRU pages drop their reference to the memory
cgroup by using the obj_cgroup APIs. With it applied, the number of dying
cgroups no longer increases when we run the following test script.
```bash
#!/bin/bash
cat /proc/cgroups | grep memory
cd /sys/fs/cgroup/memory
for i in {1..500}
do
	mkdir test
	echo $$ > test/cgroup.procs
	sleep 60 &
	echo $$ > cgroup.procs
	echo `cat test/cgroup.procs` > cgroup.procs
	rmdir test
done
cat /proc/cgroups | grep memory
```
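For reference, the /proc/cgroups line inspected above looks like the following
(values are illustrative only); the num_cgroups column is the one that keeps
growing across such a run when dying memory cgroups pile up, and it stays flat
once this series is applied:
```
#subsys_name	hierarchy	num_cgroups	enabled
memory	2	530	1
```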
Patch 1 fixes page charging in page replacement.
Patches 2-5 are code cleanup and simplification.
Patches 6-18 convert the LRU page pins to the objcg direction.
Any comments are welcome. Thanks.
Changelog in RFC v2:
1. Collect Acked-by tags from Johannes. Thanks.
2. Rework lruvec_holds_page_lru_lock() suggested by Johannes. Thanks.
3. Fix move_pages_to_lru().
Muchun Song (18):
mm: memcontrol: fix page charging in page replacement
mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
mm: memcontrol: simplify lruvec_holds_page_lru_lock
mm: memcontrol: simplify the logic of objcg pinning memcg
mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
mm: memcontrol: make lruvec lock safe when the LRU pages reparented
mm: vmscan: remove noinline_for_stack
mm: vmscan: rework move_pages_to_lru()
mm: thp: introduce lock/unlock_split_queue{_irqsave}()
mm: thp: make deferred split queue lock safe when the LRU pages
reparented
mm: memcontrol: make all the callers of page_memcg() safe
mm: memcontrol: introduce memcg_reparent_ops
mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
mm: lru: add VM_BUG_ON_PAGE to lru maintenance function
mm: lru: use lruvec lock to serialize memcg changes
Documentation/admin-guide/cgroup-v1/memory.rst | 2 +-
fs/buffer.c | 13 +-
fs/fs-writeback.c | 23 +-
fs/iomap/buffered-io.c | 4 +-
include/linux/memcontrol.h | 216 +++++----
include/linux/mm_inline.h | 6 +
mm/compaction.c | 40 +-
mm/filemap.c | 2 +-
mm/huge_memory.c | 171 ++++++-
mm/memcontrol.c | 622 ++++++++++++++++++-------
mm/migrate.c | 4 +
mm/page-writeback.c | 24 +-
mm/page_io.c | 5 +-
mm/rmap.c | 14 +-
mm/swap.c | 48 +-
mm/vmscan.c | 58 ++-
mm/workingset.c | 2 +-
17 files changed, 841 insertions(+), 413 deletions(-)
--
2.11.0
The pages aren't accounted at the root level, so do not charge the page
to the root memcg in page replacement. Although we do not display the
value (mem_cgroup_usage), so there shouldn't be an actual problem, there
is a WARN_ON_ONCE in page_counter_cancel() that could still trigger. It
is better to fix it.
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..f229de925aa5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6806,9 +6806,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
/* Force-charge the new page. The old one will be freed soon */
nr_pages = thp_nr_pages(newpage);
- page_counter_charge(&memcg->memory, nr_pages);
- if (do_memsw_account())
- page_counter_charge(&memcg->memsw, nr_pages);
+ if (!mem_cgroup_is_root(memcg)) {
+ page_counter_charge(&memcg->memory, nr_pages);
+ if (do_memsw_account())
+ page_counter_charge(&memcg->memsw, nr_pages);
+ }
css_get(&memcg->css);
commit_charge(newpage, memcg);
--
2.11.0
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
as the second parameter (except isolate_migratepages_block(), where the
local variable @pgdat is equal to page_pgdat(page) anyway). So
mem_cgroup_page_lruvec() does not need the pgdat parameter. Just remove
it to simplify the code.
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 10 +++++-----
mm/compaction.c | 2 +-
mm/memcontrol.c | 9 +++------
mm/swap.c | 2 +-
mm/workingset.c | 2 +-
5 files changed, 11 insertions(+), 14 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c960fd49c3e8..4f49865c9958 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -743,13 +743,12 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
/**
* mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
* @page: the page
- * @pgdat: pgdat of the page
*
* This function relies on page->mem_cgroup being stable.
*/
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
- struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
{
+ pg_data_t *pgdat = page_pgdat(page);
struct mem_cgroup *memcg = page_memcg(page);
VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
@@ -1223,9 +1222,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
return &pgdat->__lruvec;
}
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
- struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
{
+ pg_data_t *pgdat = page_pgdat(page);
+
return &pgdat->__lruvec;
}
diff --git a/mm/compaction.c b/mm/compaction.c
index caa4c36c1db3..e7da342003dd 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1033,7 +1033,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (!TestClearPageLRU(page))
goto isolate_fail_put;
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = mem_cgroup_page_lruvec(page);
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9cbfff59b171..1f807448233e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1177,9 +1177,8 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
struct lruvec *lock_page_lruvec(struct page *page)
{
struct lruvec *lruvec;
- struct pglist_data *pgdat = page_pgdat(page);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = mem_cgroup_page_lruvec(page);
spin_lock(&lruvec->lru_lock);
lruvec_memcg_debug(lruvec, page);
@@ -1190,9 +1189,8 @@ struct lruvec *lock_page_lruvec(struct page *page)
struct lruvec *lock_page_lruvec_irq(struct page *page)
{
struct lruvec *lruvec;
- struct pglist_data *pgdat = page_pgdat(page);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irq(&lruvec->lru_lock);
lruvec_memcg_debug(lruvec, page);
@@ -1203,9 +1201,8 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
{
struct lruvec *lruvec;
- struct pglist_data *pgdat = page_pgdat(page);
- lruvec = mem_cgroup_page_lruvec(page, pgdat);
+ lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
lruvec_memcg_debug(lruvec, page);
diff --git a/mm/swap.c b/mm/swap.c
index a75a8265302b..e0d5699213cc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -313,7 +313,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
void lru_note_cost_page(struct page *page)
{
- lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
+ lru_note_cost(mem_cgroup_page_lruvec(page),
page_is_file_lru(page), thp_nr_pages(page));
}
diff --git a/mm/workingset.c b/mm/workingset.c
index b7cdeca5a76d..4f7a306ce75a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -408,7 +408,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
- lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+ lruvec = mem_cgroup_page_lruvec(page);
workingset_age_nonresident(lruvec, thp_nr_pages(page));
out:
rcu_read_unlock();
--
2.11.0
We already have the helper lruvec_memcg() to get the memcg from a lruvec,
so there is no need to open-code it in lruvec_holds_page_lru_lock(). Use
lruvec_memcg() instead. Also, when mem_cgroup_disabled() returns false,
page_memcg(page) of an LRU page cannot be NULL, so remove the odd logic
of "memcg = page_memcg(page) ? : root_mem_cgroup". Finally, use
lruvec_pgdat() to simplify the code. This allows a single definition of
the function that works for !CONFIG_MEMCG, CONFIG_MEMCG +
mem_cgroup_disabled(), and CONFIG_MEMCG.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 31 +++++++------------------------
1 file changed, 7 insertions(+), 24 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 4f49865c9958..38b8d3fb24ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -755,22 +755,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
return mem_cgroup_lruvec(memcg, pgdat);
}
-static inline bool lruvec_holds_page_lru_lock(struct page *page,
- struct lruvec *lruvec)
-{
- pg_data_t *pgdat = page_pgdat(page);
- const struct mem_cgroup *memcg;
- struct mem_cgroup_per_node *mz;
-
- if (mem_cgroup_disabled())
- return lruvec == &pgdat->__lruvec;
-
- mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- memcg = page_memcg(page) ? : root_mem_cgroup;
-
- return lruvec->pgdat == pgdat && mz->memcg == memcg;
-}
-
struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1229,14 +1213,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
return &pgdat->__lruvec;
}
-static inline bool lruvec_holds_page_lru_lock(struct page *page,
- struct lruvec *lruvec)
-{
- pg_data_t *pgdat = page_pgdat(page);
-
- return lruvec == &pgdat->__lruvec;
-}
-
static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
{
}
@@ -1518,6 +1494,13 @@ static inline void unlock_page_lruvec_irqrestore(struct lruvec *lruvec,
spin_unlock_irqrestore(&lruvec->lru_lock, flags);
}
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+ struct lruvec *lruvec)
+{
+ return lruvec_pgdat(lruvec) == page_pgdat(page) &&
+ lruvec_memcg(lruvec) == page_memcg(page);
+}
+
/* Don't lock again iff page's lruvec locked */
static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
struct lruvec *locked_lruvec)
--
2.11.0
When mm is NULL, we do not need to hold the RCU lock and call
css_tryget() for the root memcg, nor do we need to re-check !mm on every
iteration of the while loop. So bail out early when !mm.
Signed-off-by: Muchun Song <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f229de925aa5..9cbfff59b171 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -901,20 +901,19 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (mem_cgroup_disabled())
return NULL;
+ /*
+ * Page cache insertions can happen without an
+ * actual mm context, e.g. during disk probing
+ * on boot, loopback IO, acct() writes etc.
+ */
+ if (unlikely(!mm))
+ return root_mem_cgroup;
+
rcu_read_lock();
do {
- /*
- * Page cache insertions can happen without an
- * actual mm context, e.g. during disk probing
- * on boot, loopback IO, acct() writes etc.
- */
- if (unlikely(!mm))
+ memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+ if (unlikely(!memcg))
memcg = root_mem_cgroup;
- else {
- memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (unlikely(!memcg))
- memcg = root_mem_cgroup;
- }
} while (!css_tryget(&memcg->css));
rcu_read_unlock();
return memcg;
--
2.11.0
obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
css_set_lock, so we do not need to worry about objcg->memcg being
released while obj_cgroup_release() runs. There is therefore no need to
pin the memcg before releasing the objcg. Remove that pinning logic to
simplify the code.
There are only two places that modify objcg->memcg: the initialization
in memcg_online_kmem() and the objcg reparenting in
memcg_reparent_objcgs(). The two cannot run in parallel, so xchg() is
unnecessary and WRITE_ONCE() is enough.
Signed-off-by: Muchun Song <[email protected]>
---
mm/memcontrol.c | 20 +++++++-------------
1 file changed, 7 insertions(+), 13 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1f807448233e..90c1ac58c64c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,7 +261,6 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
static void obj_cgroup_release(struct percpu_ref *ref)
{
struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
- struct mem_cgroup *memcg;
unsigned int nr_bytes;
unsigned int nr_pages;
unsigned long flags;
@@ -291,11 +290,9 @@ static void obj_cgroup_release(struct percpu_ref *ref)
nr_pages = nr_bytes >> PAGE_SHIFT;
spin_lock_irqsave(&css_set_lock, flags);
- memcg = obj_cgroup_memcg(objcg);
if (nr_pages)
obj_cgroup_uncharge_pages(objcg, nr_pages);
list_del(&objcg->list);
- mem_cgroup_put(memcg);
spin_unlock_irqrestore(&css_set_lock, flags);
percpu_ref_exit(ref);
@@ -330,17 +327,14 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
spin_lock_irq(&css_set_lock);
- /* Move active objcg to the parent's list */
- xchg(&objcg->memcg, parent);
- css_get(&parent->css);
- list_add(&objcg->list, &parent->objcg_list);
+ /* 1) Ready to reparent active objcg. */
+ list_add(&objcg->list, &memcg->objcg_list);
- /* Move already reparented objcgs to the parent's list */
- list_for_each_entry(iter, &memcg->objcg_list, list) {
- css_get(&parent->css);
- xchg(&iter->memcg, parent);
- css_put(&memcg->css);
- }
+ /* 2) Reparent active objcg and already reparented objcgs to parent. */
+ list_for_each_entry(iter, &memcg->objcg_list, list)
+ WRITE_ONCE(iter->memcg, parent);
+
+ /* 3) Move already reparented objcgs to the parent's list */
list_splice(&memcg->objcg_list, &parent->objcg_list);
spin_unlock_irq(&css_set_lock);
--
2.11.0
Memory allocations pinning memcgs for a long time happen at a large
scale and cause recurring problems in the real world: page cache doesn't
get reclaimed for a long time, or is used by the second, third,
fourth, ... instance of the same job that was restarted into a new
cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.
We can fix this problem by converting LRU pages and most other raw memcg
pins to the objcg direction, so that page->memcg will always point to an
object cgroup.
Therefore, the objcg infrastructure no longer serves CONFIG_MEMCG_KMEM
only. This patch moves the objcg infrastructure out of the scope of
CONFIG_MEMCG_KMEM so that the LRU pages can reuse it for charging.
We know that LRU pages are not accounted at the root level, yet their
page->memcg_data points to root_mem_cgroup, so page->memcg_data of an
LRU page always points to a valid pointer. But root_mem_cgroup does not
have an object cgroup. If we use the obj_cgroup APIs to charge LRU
pages, page->memcg_data must be set to a root object cgroup. So also
allocate an object cgroup for root_mem_cgroup and introduce
root_obj_cgroup to cache its value, just like root_mem_cgroup.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 4 ++-
mm/memcontrol.c | 71 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 60 insertions(+), 15 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 38b8d3fb24ff..ab948eb5f62e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -223,7 +223,9 @@ struct memcg_cgwb_frn {
struct obj_cgroup {
struct percpu_ref refcnt;
struct mem_cgroup *memcg;
+#ifdef CONFIG_MEMCG_KMEM
atomic_t nr_charged_bytes;
+#endif
union {
struct list_head list;
struct rcu_head rcu;
@@ -321,9 +323,9 @@ struct mem_cgroup {
#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
enum memcg_kmem_state kmem_state;
+#endif
struct obj_cgroup __rcu *objcg;
struct list_head objcg_list; /* list of inherited objcgs */
-#endif
MEMCG_PADDING(_pad2_);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90c1ac58c64c..27caf24bb0c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -75,6 +75,7 @@ struct cgroup_subsys memory_cgrp_subsys __read_mostly;
EXPORT_SYMBOL(memory_cgrp_subsys);
struct mem_cgroup *root_mem_cgroup __read_mostly;
+static struct obj_cgroup *root_obj_cgroup __read_mostly;
/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
@@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
}
-#ifdef CONFIG_MEMCG_KMEM
extern spinlock_t css_set_lock;
+static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
+{
+ return objcg == root_obj_cgroup;
+}
+
+#ifdef CONFIG_MEMCG_KMEM
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
unsigned int nr_pages);
@@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
percpu_ref_exit(ref);
kfree_rcu(objcg, rcu);
}
+#else
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+ struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+ unsigned long flags;
+
+ spin_lock_irqsave(&css_set_lock, flags);
+ list_del(&objcg->list);
+ spin_unlock_irqrestore(&css_set_lock, flags);
+
+ percpu_ref_exit(ref);
+ kfree_rcu(objcg, rcu);
+}
+#endif
static struct obj_cgroup *obj_cgroup_alloc(void)
{
@@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent;
+
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
@@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
percpu_ref_kill(&objcg->refcnt);
}
+static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg;
+
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ return -ENOMEM;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);
+
+ return 0;
+}
+
+static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
+{
+ if (unlikely(memcg->objcg))
+ memcg_reparent_objcgs(memcg);
+}
+
+#ifdef CONFIG_MEMCG_KMEM
/*
* This will be used as a shrinker list's index.
* The main reason for not using cgroup id for this:
@@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
int memcg_id;
if (cgroup_memory_nokmem)
@@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
if (memcg_id < 0)
return memcg_id;
- objcg = obj_cgroup_alloc();
- if (!objcg) {
- memcg_free_cache_id(memcg_id);
- return -ENOMEM;
- }
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
-
static_branch_enable(&memcg_kmem_enabled_key);
memcg->kmemcg_id = memcg_id;
@@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;
- memcg_reparent_objcgs(memcg, parent);
+ memcg_reparent_objcgs(memcg);
kmemcg_id = memcg->kmemcg_id;
BUG_ON(kmemcg_id < 0);
@@ -4978,6 +5014,7 @@ static void mem_cgroup_free(struct mem_cgroup *memcg)
{
int cpu;
+ memcg_obj_cgroup_free(memcg);
memcg_wb_domain_exit(memcg);
/*
* Flush percpu lruvec stats to guarantee the value
@@ -5023,6 +5060,9 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (memcg_wb_domain_init(memcg, GFP_KERNEL))
goto fail;
+ if (memcg_obj_cgroup_alloc(memcg))
+ goto free_wb;
+
INIT_WORK(&memcg->high_work, high_work_func);
INIT_LIST_HEAD(&memcg->oom_notify);
mutex_init(&memcg->thresholds_lock);
@@ -5033,8 +5073,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
- INIT_LIST_HEAD(&memcg->objcg_list);
#endif
+ INIT_LIST_HEAD(&memcg->objcg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
@@ -5048,6 +5088,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
#endif
idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
return memcg;
+free_wb:
+ memcg_wb_domain_exit(memcg);
fail:
mem_cgroup_id_remove(memcg);
__mem_cgroup_free(memcg);
@@ -5085,6 +5127,7 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
page_counter_init(&memcg->tcpmem, NULL);
root_mem_cgroup = memcg;
+ root_obj_cgroup = memcg->objcg;
return &memcg->css;
}
--
2.11.0
If we reuse the objcg APIs to charge LRU pages, page_memcg() can change
when an LRU page is reparented. In that case, we need to acquire the new
lruvec lock.
lruvec = mem_cgroup_page_lruvec(page);
// The page is reparented.
compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
// Acquired the wrong lruvec lock and need to retry.
But compact_lock_irqsave() only takes the lruvec lock as a parameter, so
it cannot notice this change. If it took the page as a parameter to
acquire the lruvec lock, then whenever the page's memcg changed we could
use page_memcg() to detect that the new lruvec lock needs to be
reacquired. So compact_lock_irqsave() is not suitable for us.
Similar to lock_page_lruvec_irqsave(), introduce
compact_lock_page_lruvec_irqsave() to acquire the lruvec lock in
the compaction routine.
Signed-off-by: Muchun Song <[email protected]>
---
mm/compaction.c | 29 +++++++++++++++++++++++++----
1 file changed, 25 insertions(+), 4 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index e7da342003dd..c9efe3542b0a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -511,6 +511,29 @@ static bool compact_lock_irqsave(spinlock_t *lock, unsigned long *flags,
return true;
}
+static struct lruvec *
+compact_lock_page_lruvec_irqsave(struct page *page, unsigned long *flags,
+ struct compact_control *cc)
+{
+ struct lruvec *lruvec;
+
+ lruvec = mem_cgroup_page_lruvec(page);
+
+ /* Track if the lock is contended in async mode */
+ if (cc->mode == MIGRATE_ASYNC && !cc->contended) {
+ if (spin_trylock_irqsave(&lruvec->lru_lock, *flags))
+ goto out;
+
+ cc->contended = true;
+ }
+
+ spin_lock_irqsave(&lruvec->lru_lock, *flags);
+out:
+ lruvec_memcg_debug(lruvec, page);
+
+ return lruvec;
+}
+
/*
* Compaction requires the taking of some coarse locks that are potentially
* very heavily contended. The lock should be periodically unlocked to avoid
@@ -1040,10 +1063,8 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (locked)
unlock_page_lruvec_irqrestore(locked, flags);
- compact_lock_irqsave(&lruvec->lru_lock, &flags, cc);
- locked = lruvec;
-
- lruvec_memcg_debug(lruvec, page);
+ locked = compact_lock_page_lruvec_irqsave(page, &flags, cc);
+ lruvec = locked;
/* Try get exclusive access under lock */
if (!skip_updated) {
--
2.11.0
The diagram below shows how to make the page lruvec lock safe when the
LRU pages are reparented.
lock_page_lruvec(page)
retry:
lruvec = mem_cgroup_page_lruvec(page);
// The page is reparented at this time.
spin_lock(&lruvec->lru_lock);
if (unlikely(lruvec_memcg(lruvec) != page_memcg(page)))
// Acquired the wrong lruvec lock and need to retry.
// Because this page is on the parent memcg lruvec list.
goto retry;
// If we reach here, it means that page_memcg(page) is stable.
memcg_reparent_objcgs(memcg)
// lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
// Move all the pages from the lruvec list to the parent lruvec list.
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
After we acquire the lruvec lock, we need to check whether the page has
been reparented. If so, we need to reacquire the new lruvec lock. The
LRU page reparenting routine will also acquire the lruvec lock
(implemented in a later patch), so page_memcg() cannot change while we
hold the lruvec lock.
Since lruvec_memcg(lruvec) is always equal to page_memcg(page) once we
hold the lruvec lock, the lruvec_memcg_debug() check is pointless. So
remove it.
This is a preparation for reparenting the LRU pages.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 16 +++------------
mm/compaction.c | 10 +++++++++-
mm/memcontrol.c | 50 +++++++++++++++++++++++++++-------------------
mm/swap.c | 5 +++++
4 files changed, 47 insertions(+), 34 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ab948eb5f62e..93aa41600913 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -746,7 +746,9 @@ static inline struct lruvec *mem_cgroup_lruvec(struct mem_cgroup *memcg,
* mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
* @page: the page
*
- * This function relies on page->mem_cgroup being stable.
+ * The lruvec can be changed to its parent lruvec when the page reparented.
+ * The caller need to recheck if it cares about this change (just like
+ * lock_page_lruvec() does).
*/
static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
{
@@ -766,14 +768,6 @@ struct lruvec *lock_page_lruvec_irq(struct page *page);
struct lruvec *lock_page_lruvec_irqsave(struct page *page,
unsigned long *flags);
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page);
-#else
-static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
-{
-}
-#endif
-
static inline
struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css){
return css ? container_of(css, struct mem_cgroup, css) : NULL;
@@ -1215,10 +1209,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
return &pgdat->__lruvec;
}
-static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
-{
-}
-
static inline struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg)
{
return NULL;
diff --git a/mm/compaction.c b/mm/compaction.c
index c9efe3542b0a..5fd37e14404f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -517,6 +517,8 @@ compact_lock_page_lruvec_irqsave(struct page *page, unsigned long *flags,
{
struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
lruvec = mem_cgroup_page_lruvec(page);
/* Track if the lock is contended in async mode */
@@ -529,7 +531,13 @@ compact_lock_page_lruvec_irqsave(struct page *page, unsigned long *flags,
spin_lock_irqsave(&lruvec->lru_lock, *flags);
out:
- lruvec_memcg_debug(lruvec, page);
+ if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
+
+ /* See the comments in lock_page_lruvec(). */
+ rcu_read_unlock();
return lruvec;
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 27caf24bb0c1..3a2f5c43aed3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1186,23 +1186,6 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
return ret;
}
-#ifdef CONFIG_DEBUG_VM
-void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
-{
- struct mem_cgroup *memcg;
-
- if (mem_cgroup_disabled())
- return;
-
- memcg = page_memcg(page);
-
- if (!memcg)
- VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != root_mem_cgroup, page);
- else
- VM_BUG_ON_PAGE(lruvec_memcg(lruvec) != memcg, page);
-}
-#endif
-
/**
* lock_page_lruvec - lock and return lruvec for a given page.
* @page: the page
@@ -1217,10 +1200,21 @@ struct lruvec *lock_page_lruvec(struct page *page)
{
struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
lruvec = mem_cgroup_page_lruvec(page);
spin_lock(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, page);
+ if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
+ spin_unlock(&lruvec->lru_lock);
+ goto retry;
+ }
+
+ /*
+ * Preemption is disabled in the internal of spin_lock, which can serve
+ * as RCU read-side critical sections.
+ */
+ rcu_read_unlock();
return lruvec;
}
@@ -1229,10 +1223,18 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
{
struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irq(&lruvec->lru_lock);
- lruvec_memcg_debug(lruvec, page);
+ if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
+ spin_unlock_irq(&lruvec->lru_lock);
+ goto retry;
+ }
+
+ /* See the comments in lock_page_lruvec(). */
+ rcu_read_unlock();
return lruvec;
}
@@ -1241,10 +1243,18 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
{
struct lruvec *lruvec;
+ rcu_read_lock();
+retry:
lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
- lruvec_memcg_debug(lruvec, page);
+ if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
+ spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
+ goto retry;
+ }
+
+ /* See the comments in lock_page_lruvec(). */
+ rcu_read_unlock();
return lruvec;
}
diff --git a/mm/swap.c b/mm/swap.c
index e0d5699213cc..f3ce307d09fa 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -313,6 +313,11 @@ void lru_note_cost(struct lruvec *lruvec, bool file, unsigned int nr_pages)
void lru_note_cost_page(struct page *page)
{
+ /*
+ * The rcu read lock is held by the caller, so we do not need to
+ * care about the lruvec returned by mem_cgroup_page_lruvec() being
+ * released.
+ */
lru_note_cost(mem_cgroup_page_lruvec(page),
page_is_file_lru(page), thp_nr_pages(page));
}
--
2.11.0
noinline_for_stack was introduced by commit 666356297ec4 ("vmscan:
set up pagevec as late as possible in shrink_inactive_list()"); its
purpose was to delay the pagevec allocation as late as possible to save
stack memory. But commit 2bcf88796381 ("mm: take pagevecs off reclaim
stack") replaced the pagevecs with lists of pages_to_free, so
noinline_for_stack is no longer needed. Just remove it and let the
compiler decide whether to inline.
Signed-off-by: Muchun Song <[email protected]>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 64bf07cc20f2..e40b21298d77 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
*
* Returns the number of pages moved to the given lruvec.
*/
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_pages_to_lru(struct lruvec *lruvec,
+ struct list_head *list)
{
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
@@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
* shrink_inactive_list() is a helper for shrink_node(). It returns the number
* of reclaimed pages
*/
-static noinline_for_stack unsigned long
+static unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct scan_control *sc, enum lru_list lru)
{
--
2.11.0
We should make the THP deferred split queue lock safe when the LRU
pages are reparented. Similar to lock_page_lruvec{_irqsave,_irq}(),
introduce lock/unlock_split_queue{_irqsave}() so that the deferred split
queue lock is easier to reparent.
In the next patch, we will use a similar approach (just like the lruvec
lock) to make the THP deferred split queue lock safe when the LRU pages
are reparented.
Signed-off-by: Muchun Song <[email protected]>
---
mm/huge_memory.c | 96 +++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 74 insertions(+), 22 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 437178ddbedb..275dbfc8b2ae 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -487,25 +487,76 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
}
#ifdef CONFIG_MEMCG
-static inline struct deferred_split *get_deferred_split_queue(struct page *page)
+static inline struct mem_cgroup *split_queue_to_memcg(struct deferred_split *queue)
{
- struct mem_cgroup *memcg = page_memcg(compound_head(page));
- struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
+ return container_of(queue, struct mem_cgroup, deferred_split_queue);
+}
+
+static struct deferred_split *lock_split_queue(struct page *page)
+{
+ struct deferred_split *queue;
+ struct mem_cgroup *memcg;
+
+ memcg = page_memcg(compound_head(page));
+ if (memcg)
+ queue = &memcg->deferred_split_queue;
+ else
+ queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
+ spin_lock(&queue->split_queue_lock);
+
+ return queue;
+}
+static struct deferred_split *lock_split_queue_irqsave(struct page *page,
+ unsigned long *flags)
+{
+ struct deferred_split *queue;
+ struct mem_cgroup *memcg;
+
+ memcg = page_memcg(compound_head(page));
if (memcg)
- return &memcg->deferred_split_queue;
+ queue = &memcg->deferred_split_queue;
else
- return &pgdat->deferred_split_queue;
+ queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
+ spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+ return queue;
}
#else
-static inline struct deferred_split *get_deferred_split_queue(struct page *page)
+static struct deferred_split *lock_split_queue(struct page *page)
+{
+ struct deferred_split *queue;
+
+ queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
+ spin_lock(&queue->split_queue_lock);
+
+ return queue;
+}
+
+static struct deferred_split *lock_split_queue_irqsave(struct page *page,
+ unsigned long *flags)
+
{
- struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));
+ struct deferred_split *queue;
+
+ queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
+ spin_lock_irqsave(&queue->split_queue_lock, *flags);
- return &pgdat->deferred_split_queue;
+ return queue;
}
#endif
+static inline void unlock_split_queue(struct deferred_split *queue)
+{
+ spin_unlock(&queue->split_queue_lock);
+}
+
+static inline void unlock_split_queue_irqrestore(struct deferred_split *queue,
+ unsigned long flags)
+{
+ spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+}
+
void prep_transhuge_page(struct page *page)
{
/*
@@ -2668,7 +2719,7 @@ bool can_split_huge_page(struct page *page, int *pextra_pins)
int split_huge_page_to_list(struct page *page, struct list_head *list)
{
struct page *head = compound_head(page);
- struct deferred_split *ds_queue = get_deferred_split_queue(head);
+ struct deferred_split *ds_queue;
struct anon_vma *anon_vma = NULL;
struct address_space *mapping = NULL;
int count, mapcount, extra_pins, ret;
@@ -2747,7 +2798,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
}
/* Prevent deferred_split_scan() touching ->_refcount */
- spin_lock(&ds_queue->split_queue_lock);
+ ds_queue = lock_split_queue(head);
count = page_count(head);
mapcount = total_mapcount(head);
if (!mapcount && page_ref_freeze(head, 1 + extra_pins)) {
@@ -2755,7 +2806,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
ds_queue->split_queue_len--;
list_del(page_deferred_list(head));
}
- spin_unlock(&ds_queue->split_queue_lock);
+ unlock_split_queue(ds_queue);
if (mapping) {
int nr = thp_nr_pages(head);
@@ -2778,7 +2829,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
dump_page(page, "total_mapcount(head) > 0");
BUG();
}
- spin_unlock(&ds_queue->split_queue_lock);
+ unlock_split_queue(ds_queue);
fail: if (mapping)
xa_unlock(&mapping->i_pages);
local_irq_enable();
@@ -2800,24 +2851,21 @@ fail: if (mapping)
void free_transhuge_page(struct page *page)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(page);
+ struct deferred_split *ds_queue;
unsigned long flags;
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = lock_split_queue_irqsave(page, &flags);
if (!list_empty(page_deferred_list(page))) {
ds_queue->split_queue_len--;
list_del(page_deferred_list(page));
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ unlock_split_queue_irqrestore(ds_queue, flags);
free_compound_page(page);
}
void deferred_split_huge_page(struct page *page)
{
- struct deferred_split *ds_queue = get_deferred_split_queue(page);
-#ifdef CONFIG_MEMCG
- struct mem_cgroup *memcg = page_memcg(compound_head(page));
-#endif
+ struct deferred_split *ds_queue;
unsigned long flags;
VM_BUG_ON_PAGE(!PageTransHuge(page), page);
@@ -2835,18 +2883,22 @@ void deferred_split_huge_page(struct page *page)
if (PageSwapCache(page))
return;
- spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+ ds_queue = lock_split_queue_irqsave(page, &flags);
if (list_empty(page_deferred_list(page))) {
count_vm_event(THP_DEFERRED_SPLIT_PAGE);
list_add_tail(page_deferred_list(page), &ds_queue->split_queue);
ds_queue->split_queue_len++;
#ifdef CONFIG_MEMCG
- if (memcg)
+ if (page_memcg(page)) {
+ struct mem_cgroup *memcg;
+
+ memcg = split_queue_to_memcg(ds_queue);
set_shrinker_bit(memcg, page_to_nid(page),
deferred_split_shrinker.id);
+ }
#endif
}
- spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+ unlock_split_queue_irqrestore(ds_queue, flags);
}
static unsigned long deferred_split_count(struct shrinker *shrink,
--
2.11.0
In a later patch, we will reparent the LRU pages. The pages that are
being moved to the appropriate LRU list can be reparented while
move_pages_to_lru() is running, so it is wrong for the caller to hold a
single lruvec lock. Use the more general relock_page_lruvec_irq()
interface to acquire the correct lruvec lock instead.
Signed-off-by: Muchun Song <[email protected]>
---
mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e40b21298d77..4431007825ad 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2013,23 +2013,27 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
* move_pages_to_lru() moves pages from private @list to appropriate LRU list.
* On return, @list is reused as a list of pages to be freed by the caller.
*
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate LRU list.
+ *
+ * Note: The caller must not hold any lruvec lock.
*/
-static unsigned int move_pages_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_pages_to_lru(struct list_head *list)
{
- int nr_pages, nr_moved = 0;
+ int nr_moved = 0;
+ struct lruvec *lruvec = NULL;
LIST_HEAD(pages_to_free);
- struct page *page;
while (!list_empty(list)) {
- page = lru_to_page(list);
+ int nr_pages;
+ struct page *page = lru_to_page(list);
+
+ lruvec = relock_page_lruvec_irq(page, lruvec);
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
putback_lru_page(page);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
continue;
}
@@ -2050,19 +2054,15 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
__clear_page_lru_flags(page);
if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
destroy_compound_page(page);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
} else
list_add(&page->lru, &pages_to_free);
continue;
}
- /*
- * All pages were isolated from the same lruvec (and isolation
- * inhibits memcg migration).
- */
VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
add_page_to_lru_list(page, lruvec);
nr_pages = thp_nr_pages(page);
@@ -2071,6 +2071,8 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
workingset_age_nonresident(lruvec, nr_pages);
}
+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
/*
* To save our caller's stack, now use input list for pages to free.
*/
@@ -2144,16 +2146,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false);
- spin_lock_irq(&lruvec->lru_lock);
- move_pages_to_lru(lruvec, &page_list);
+ move_pages_to_lru(&page_list);
+ local_irq_disable();
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
lru_note_cost(lruvec, file, stat.nr_pageout);
mem_cgroup_uncharge_list(&page_list);
@@ -2280,18 +2282,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&lruvec->lru_lock);
-
- nr_activate = move_pages_to_lru(lruvec, &l_active);
- nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
+ nr_activate = move_pages_to_lru(&l_active);
+ nr_deactivate = move_pages_to_lru(&l_inactive);
/* Keep all free pages in l_active list */
list_splice(&l_inactive, &l_active);
+ local_irq_disable();
__count_vm_events(PGDEACTIVATE, nr_deactivate);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
-
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();
mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
--
2.11.0
The previous patch showed how to make the lruvec lock safe when the LRU
pages are reparented. We need to do something like the following:
memcg_reparent_objcgs(memcg)
1) lock
// lruvec belongs to memcg and lruvec_parent belongs to parent memcg.
spin_lock(&lruvec->lru_lock);
spin_lock(&lruvec_parent->lru_lock);
2) do reparent
// Move all the pages from the lruvec list to the parent lruvec list.
3) unlock
spin_unlock(&lruvec_parent->lru_lock);
spin_unlock(&lruvec->lru_lock);
Apart from the page lruvec lock, the deferred split queue lock (THP only)
also needs to do something similar. So we extract the necessary three
steps into memcg_reparent_objcgs():
memcg_reparent_objcgs(memcg)
1) lock
memcg_reparent_ops->lock(memcg, parent);
2) reparent
memcg_reparent_ops->reparent(memcg, parent);
3) unlock
memcg_reparent_ops->unlock(memcg, parent);
Now two different locks (the lruvec lock and the deferred split queue
lock) need to use this infrastructure. In the next patch, we will use
these APIs to make both locks safe when the LRU pages are reparented.
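To illustrate how a lock user is expected to hook into this infrastructure,
here is a minimal, hypothetical sketch. The memcg_reparent_ops structure and
register_memcg_repatent() are taken from this patch; everything named foo_*
is a placeholder rather than code from this series:
```c
#include <linux/init.h>
#include <linux/memcontrol.h>

/* Hypothetical subsystem with per-memcg state protected by its own lock. */
static void foo_reparent_lock(struct mem_cgroup *memcg,
			      struct mem_cgroup *parent)
{
	/* Take the subsystem locks of both the child and the parent. */
}

static void foo_reparent_unlock(struct mem_cgroup *memcg,
				struct mem_cgroup *parent)
{
	/* Release the locks in the reverse order. */
}

static void foo_reparent(struct mem_cgroup *memcg,
			 struct mem_cgroup *parent)
{
	/* Move the subsystem's per-memcg state from @memcg to @parent. */
}

static struct memcg_reparent_ops foo_reparent_ops = {
	.lock		= foo_reparent_lock,
	.unlock		= foo_reparent_unlock,
	.reparent	= foo_reparent,
};

static int __init foo_reparent_init(void)
{
	register_memcg_repatent(&foo_reparent_ops);
	return 0;
}
core_initcall(foo_reparent_init);
```
Per the comment added to the header, all three callbacks are called with
interrupts disabled, so they must not sleep.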
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 11 +++++++++++
mm/memcontrol.c | 49 ++++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 58 insertions(+), 2 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7e15be2bd47a..fdc385ecff55 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -355,6 +355,17 @@ struct mem_cgroup {
/* WARNING: nodeinfo must be the last member here */
};
+struct memcg_reparent_ops {
+ struct list_head list;
+
+ /* Irq is disabled before calling those functions. */
+ void (*lock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+ void (*unlock)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+ void (*reparent)(struct mem_cgroup *memcg, struct mem_cgroup *parent);
+};
+
+void __init register_memcg_repatent(struct memcg_reparent_ops *ops);
+
/*
* size of first charge trial. "32" comes from vmscan.c's magic value.
* TODO: maybe necessary to use big numbers in big irons.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2f4fcb182883..536be7bdc98f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -338,6 +338,41 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
+static LIST_HEAD(reparent_ops_head);
+
+static void memcg_reparent_lock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ struct memcg_reparent_ops *ops;
+
+ list_for_each_entry(ops, &reparent_ops_head, list)
+ ops->lock(memcg, parent);
+}
+
+static void memcg_reparent_unlock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ struct memcg_reparent_ops *ops;
+
+ list_for_each_entry(ops, &reparent_ops_head, list)
+ ops->unlock(memcg, parent);
+}
+
+static void memcg_do_reparent(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ struct memcg_reparent_ops *ops;
+
+ list_for_each_entry(ops, &reparent_ops_head, list)
+ ops->reparent(memcg, parent);
+}
+
+void __init register_memcg_repatent(struct memcg_reparent_ops *ops)
+{
+ BUG_ON(!ops->lock || !ops->unlock || !ops->reparent);
+ list_add(&ops->list, &reparent_ops_head);
+}
+
static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
@@ -347,9 +382,13 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;
+ local_irq_disable();
+
+ memcg_reparent_lock(memcg, parent);
+
objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
- spin_lock_irq(&css_set_lock);
+ spin_lock(&css_set_lock);
/* 1) Ready to reparent active objcg. */
list_add(&objcg->list, &memcg->objcg_list);
@@ -361,7 +400,13 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
/* 3) Move already reparented objcgs to the parent's list */
list_splice(&memcg->objcg_list, &parent->objcg_list);
- spin_unlock_irq(&css_set_lock);
+ spin_unlock(&css_set_lock);
+
+ memcg_do_reparent(memcg, parent);
+
+ memcg_reparent_unlock(memcg, parent);
+
+ local_irq_enable();
percpu_ref_kill(&objcg->refcnt);
}
--
2.11.0
We will reuse the obj_cgroup APIs to charge the LRU pages. In the end,
page->memcg_data will have two different meanings:
- For slab pages, page->memcg_data points to an object cgroup vector.
- For kmem pages (excluding slab pages) and LRU pages, page->memcg_data
  points to an object cgroup.
With this patch, LRU pages are charged through the obj_cgroup APIs, so
long-living page cache pages no longer pin the original memory cgroup
in memory.
At the same time, we also change the rules for page-to-objcg and
page-to-memcg binding stability. The new rules are as follows.
For a page any of the following ensures page and objcg binding stability:
- the page lock
- LRU isolation
- lock_page_memcg()
- exclusive reference
Based on the stable binding of page and objcg, for a page any of the
following ensures page and memcg binding stability:
- css_set_lock
- cgroup_mutex
- the lruvec lock
- the split queue lock (only THP page)
If the caller only wants the memcg page counters to be updated
correctly, ensuring the stability of the page-to-objcg binding is
sufficient.
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 96 ++++++---------
mm/huge_memory.c | 48 ++++++++
mm/memcontrol.c | 283 ++++++++++++++++++++++++++++++++-------------
3 files changed, 287 insertions(+), 140 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index fdc385ecff55..8c56baacc255 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -385,8 +385,6 @@ enum page_memcg_data_flags {
#define MEMCG_DATA_FLAGS_MASK (__NR_MEMCG_DATA_FLAGS - 1)
-static inline bool PageMemcgKmem(struct page *page);
-
/*
* After the initialization objcg->memcg is always pointing at
* a valid memcg, but can be atomically swapped to the parent memcg.
@@ -400,43 +398,19 @@ static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
}
/*
- * __page_memcg - get the memory cgroup associated with a non-kmem page
- * @page: a pointer to the page struct
- *
- * Returns a pointer to the memory cgroup associated with the page,
- * or NULL. This function assumes that the page is known to have a
- * proper memory cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * kmem pages.
- */
-static inline struct mem_cgroup *__page_memcg(struct page *page)
-{
- unsigned long memcg_data = page->memcg_data;
-
- VM_BUG_ON_PAGE(PageSlab(page), page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
- VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_KMEM, page);
-
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
-}
-
-/*
- * __page_objcg - get the object cgroup associated with a kmem page
+ * page_objcg - get the object cgroup associated with page
* @page: a pointer to the page struct
*
* Returns a pointer to the object cgroup associated with the page,
* or NULL. This function assumes that the page is known to have a
- * proper object cgroup pointer. It's not safe to call this function
- * against some type of pages, e.g. slab pages or ex-slab pages or
- * LRU pages.
+ * proper object cgroup pointer.
*/
-static inline struct obj_cgroup *__page_objcg(struct page *page)
+static inline struct obj_cgroup *page_objcg(struct page *page)
{
unsigned long memcg_data = page->memcg_data;
VM_BUG_ON_PAGE(PageSlab(page), page);
VM_BUG_ON_PAGE(memcg_data & MEMCG_DATA_OBJCGS, page);
- VM_BUG_ON_PAGE(!(memcg_data & MEMCG_DATA_KMEM), page);
return (struct obj_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
}
@@ -450,23 +424,35 @@ static inline struct obj_cgroup *__page_objcg(struct page *page)
* proper memory cgroup pointer. It's not safe to call this function
* against some type of pages, e.g. slab pages or ex-slab pages.
*
- * For a non-kmem page any of the following ensures page and memcg binding
- * stability:
+ * For a page any of the following ensures page and objcg binding stability:
*
* - the page lock
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
*
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * Based on the stable binding of page and objcg, for a page any of the
+ * following ensures page and memcg binding stability:
+ *
+ * - css_set_lock
+ * - cgroup_mutex
+ * - the lruvec lock
+ * - the split queue lock (only THP page)
+ *
+ * If the caller only want to ensure that the page counters of memcg are
+ * updated correctly, ensure that the binding stability of page and objcg
+ * is sufficient.
+ *
+ * A caller should hold an rcu read lock (In addition, regions of code across
+ * which interrupts, preemption, or softirqs have been disabled also serve as
+ * RCU read-side critical sections) to protect memcg associated with a page
+ * from being released.
*/
static inline struct mem_cgroup *page_memcg(struct page *page)
{
- if (PageMemcgKmem(page))
- return obj_cgroup_memcg(__page_objcg(page));
- else
- return __page_memcg(page);
+ struct obj_cgroup *objcg = page_objcg(page);
+
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
/*
@@ -479,6 +465,8 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
* is known to have a proper memory cgroup pointer. It's not safe to call
* this function against some type of pages, e.g. slab pages or ex-slab
* pages.
+ *
+ * The page and objcg or memcg binding rules can refer to page_memcg().
*/
static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
{
@@ -502,22 +490,20 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
* or NULL. This function assumes that the page is known to have a
* proper memory cgroup pointer. It's not safe to call this function
* against some type of pages, e.g. slab pages or ex-slab pages.
+ *
+ * The page and objcg or memcg binding rules can refer to page_memcg().
*/
static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
{
unsigned long memcg_data = READ_ONCE(page->memcg_data);
+ struct obj_cgroup *objcg;
VM_BUG_ON_PAGE(PageSlab(page), page);
WARN_ON_ONCE(!rcu_read_lock_held());
- if (memcg_data & MEMCG_DATA_KMEM) {
- struct obj_cgroup *objcg;
-
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
- return obj_cgroup_memcg(objcg);
- }
+ objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
/*
@@ -530,16 +516,10 @@ static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
* has an associated memory cgroup pointer or an object cgroups vector or
* an object cgroup.
*
- * For a non-kmem page any of the following ensures page and memcg binding
- * stability:
- *
- * - the page lock
- * - LRU isolation
- * - lock_page_memcg()
- * - exclusive reference
+ * The page and objcg or memcg binding rules can refer to page_memcg().
*
- * For a kmem page a caller should hold an rcu read lock to protect memcg
- * associated with a kmem page from being released.
+ * A caller should hold an rcu read lock to protect memcg associated with a
+ * page from being released.
*/
static inline struct mem_cgroup *page_memcg_check(struct page *page)
{
@@ -548,18 +528,14 @@ static inline struct mem_cgroup *page_memcg_check(struct page *page)
* for slab pages, READ_ONCE() should be used here.
*/
unsigned long memcg_data = READ_ONCE(page->memcg_data);
+ struct obj_cgroup *objcg;
if (memcg_data & MEMCG_DATA_OBJCGS)
return NULL;
- if (memcg_data & MEMCG_DATA_KMEM) {
- struct obj_cgroup *objcg;
-
- objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
- return obj_cgroup_memcg(objcg);
- }
+ objcg = (void *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
- return (struct mem_cgroup *)(memcg_data & ~MEMCG_DATA_FLAGS_MASK);
+ return objcg ? obj_cgroup_memcg(objcg) : NULL;
}
#ifdef CONFIG_MEMCG_KMEM
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa5d7b72d5fc..4c0a36dc67ea 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -487,6 +487,8 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
}
#ifdef CONFIG_MEMCG
+static struct shrinker deferred_split_shrinker;
+
static inline struct mem_cgroup *split_queue_to_memcg(struct deferred_split *queue)
{
return container_of(queue, struct mem_cgroup, deferred_split_queue);
@@ -545,6 +547,52 @@ static struct deferred_split *lock_split_queue_irqsave(struct page *page,
return queue;
}
+
+static void memcg_reparent_split_queue_lock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ spin_lock(&memcg->deferred_split_queue.split_queue_lock);
+ spin_lock(&parent->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue_unlock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ spin_unlock(&parent->deferred_split_queue.split_queue_lock);
+ spin_unlock(&memcg->deferred_split_queue.split_queue_lock);
+}
+
+static void memcg_reparent_split_queue(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int nid;
+ struct deferred_split *src, *dst;
+
+ src = &memcg->deferred_split_queue;
+ dst = &parent->deferred_split_queue;
+
+ if (!src->split_queue_len)
+ return;
+
+ list_splice_tail_init(&src->split_queue, &dst->split_queue);
+ dst->split_queue_len += src->split_queue_len;
+ src->split_queue_len = 0;
+
+ for_each_node(nid)
+ set_shrinker_bit(parent, nid, deferred_split_shrinker.id);
+}
+
+static struct memcg_reparent_ops split_queue_reparent_ops = {
+ .lock = memcg_reparent_split_queue_lock,
+ .unlock = memcg_reparent_split_queue_unlock,
+ .reparent = memcg_reparent_split_queue,
+};
+
+static void __init split_queue_reparent_init(void)
+{
+ register_memcg_repatent(&split_queue_reparent_ops);
+}
+core_initcall(split_queue_reparent_init);
#else
static struct deferred_split *lock_split_queue(struct page *page)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 536be7bdc98f..35a4c768dacc 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -338,6 +338,77 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}
+static void memcg_reparent_lruvec_lock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ spin_lock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
+ spin_lock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
+ }
+}
+
+static void memcg_reparent_lruvec_unlock(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ spin_unlock(&mem_cgroup_lruvec(parent, NODE_DATA(nid))->lru_lock);
+ spin_unlock(&mem_cgroup_lruvec(memcg, NODE_DATA(nid))->lru_lock);
+ }
+}
+
+static void lruvec_reparent_lru(struct lruvec *src, struct lruvec *dst,
+ enum lru_list lru)
+{
+ int zid;
+ struct mem_cgroup_per_node *mz_src, *mz_dst;
+
+ mz_src = container_of(src, struct mem_cgroup_per_node, lruvec);
+ mz_dst = container_of(dst, struct mem_cgroup_per_node, lruvec);
+
+ list_splice_tail_init(&src->lists[lru], &dst->lists[lru]);
+
+ for (zid = 0; zid < MAX_NR_ZONES; zid++) {
+ mz_dst->lru_zone_size[zid][lru] += mz_src->lru_zone_size[zid][lru];
+ mz_src->lru_zone_size[zid][lru] = 0;
+ }
+}
+
+static void memcg_reparent_lruvec(struct mem_cgroup *memcg,
+ struct mem_cgroup *parent)
+{
+ int nid;
+
+ for_each_node(nid) {
+ enum lru_list lru;
+ struct lruvec *src, *dst;
+
+ src = mem_cgroup_lruvec(memcg, NODE_DATA(nid));
+ dst = mem_cgroup_lruvec(parent, NODE_DATA(nid));
+
+ dst->anon_cost += src->anon_cost;
+ dst->file_cost += src->file_cost;
+
+ for_each_lru(lru)
+ lruvec_reparent_lru(src, dst, lru);
+ }
+}
+
+static struct memcg_reparent_ops lruvec_reparent_ops = {
+ .lock = memcg_reparent_lruvec_lock,
+ .unlock = memcg_reparent_lruvec_unlock,
+ .reparent = memcg_reparent_lruvec,
+};
+
+static void __init lruvec_reparent_init(void)
+{
+ register_memcg_repatent(&lruvec_reparent_ops);
+}
+core_initcall(lruvec_reparent_init);
+
static LIST_HEAD(reparent_ops_head);
static void memcg_reparent_lock(struct mem_cgroup *memcg,
@@ -2804,18 +2875,18 @@ static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages)
}
#endif
-static void commit_charge(struct page *page, struct mem_cgroup *memcg)
+static void commit_charge(struct page *page, struct obj_cgroup *objcg)
{
- VM_BUG_ON_PAGE(page_memcg(page), page);
+ VM_BUG_ON_PAGE(page_objcg(page), page);
/*
- * Any of the following ensures page's memcg stability:
+ * Any of the following ensures page's objcg stability:
*
* - the page lock
* - LRU isolation
* - lock_page_memcg()
* - exclusive reference
*/
- page->memcg_data = (unsigned long)memcg;
+ page->memcg_data = (unsigned long)objcg;
}
static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
@@ -2832,6 +2903,21 @@ static struct mem_cgroup *get_mem_cgroup_from_objcg(struct obj_cgroup *objcg)
return memcg;
}
+static struct obj_cgroup *get_obj_cgroup_from_memcg(struct mem_cgroup *memcg)
+{
+ struct obj_cgroup *objcg = NULL;
+
+ rcu_read_lock();
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ objcg = rcu_dereference(memcg->objcg);
+ if (objcg && obj_cgroup_tryget(objcg))
+ break;
+ }
+ rcu_read_unlock();
+
+ return objcg;
+}
+
#ifdef CONFIG_MEMCG_KMEM
int memcg_alloc_page_obj_cgroups(struct page *page, struct kmem_cache *s,
gfp_t gfp, bool new_page)
@@ -2929,12 +3015,15 @@ __always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
else
memcg = mem_cgroup_from_task(current);
- for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
- objcg = rcu_dereference(memcg->objcg);
- if (objcg && obj_cgroup_tryget(objcg))
- break;
+ if (mem_cgroup_is_root(memcg))
+ goto out;
+
+ objcg = get_obj_cgroup_from_memcg(memcg);
+ if (obj_cgroup_is_root(objcg)) {
+ obj_cgroup_put(objcg);
objcg = NULL;
}
+out:
rcu_read_unlock();
return objcg;
@@ -3077,13 +3166,14 @@ int __memcg_kmem_charge_page(struct page *page, gfp_t gfp, int order)
*/
void __memcg_kmem_uncharge_page(struct page *page, int order)
{
- struct obj_cgroup *objcg;
+ struct obj_cgroup *objcg = page_objcg(page);
unsigned int nr_pages = 1 << order;
- if (!PageMemcgKmem(page))
+ if (!objcg)
return;
- objcg = __page_objcg(page);
+ VM_BUG_ON_PAGE(!PageMemcgKmem(page), page);
+ objcg = page_objcg(page);
obj_cgroup_uncharge_pages(objcg, nr_pages);
page->memcg_data = 0;
obj_cgroup_put(objcg);
@@ -3215,23 +3305,20 @@ void obj_cgroup_uncharge(struct obj_cgroup *objcg, size_t size)
#endif /* CONFIG_MEMCG_KMEM */
/*
- * Because page_memcg(head) is not set on tails, set it now.
+ * Because page_objcg(head) is not set on tails, set it now.
*/
void split_page_memcg(struct page *head, unsigned int nr)
{
- struct mem_cgroup *memcg = page_memcg(head);
+ struct obj_cgroup *objcg = page_objcg(head);
int i;
- if (mem_cgroup_disabled() || !memcg)
+ if (mem_cgroup_disabled() || !objcg)
return;
for (i = 1; i < nr; i++)
head[i].memcg_data = head->memcg_data;
- if (PageMemcgKmem(head))
- obj_cgroup_get_many(__page_objcg(head), nr - 1);
- else
- css_get_many(&memcg->css, nr - 1);
+ obj_cgroup_get_many(objcg, nr - 1);
}
#ifdef CONFIG_MEMCG_SWAP
@@ -5588,10 +5675,10 @@ static int mem_cgroup_move_account(struct page *page,
*/
smp_mb();
- css_get(&to->css);
- css_put(&from->css);
+ obj_cgroup_get(to->objcg);
+ obj_cgroup_put(from->objcg);
- page->memcg_data = (unsigned long)to;
+ page->memcg_data = (unsigned long)to->objcg;
__unlock_page_memcg(from);
@@ -6063,6 +6150,42 @@ static void mem_cgroup_move_charge(void)
mmap_read_unlock(mc.mm);
atomic_dec(&mc.from->moving_account);
+
+ /*
+ * Moving its pages to another memcg is finished. Wait for already
+ * started RCU-only updates to finish to make sure that the caller
+ * of lock_page_memcg() can unlock the correct move_lock. The
+ * possible bad scenario would be like:
+ *
+ * CPU0: CPU1:
+ * mem_cgroup_move_charge()
+ * walk_page_range()
+ *
+ * lock_page_memcg(page)
+ * memcg = page_memcg(page)
+ * spin_lock_irqsave(&memcg->move_lock)
+ * memcg->move_lock_task = current
+ *
+ * atomic_dec(&mc.from->moving_account)
+ *
+ * mem_cgroup_css_offline()
+ * memcg_offline_kmem()
+ * memcg_reparent_objcgs() <== reparented
+ *
+ * unlock_page_memcg(page)
+ * memcg = page_memcg(page) <== memcg has been changed
+ * if (memcg->move_lock_task == current) <== false
+ * spin_unlock_irqrestore(&memcg->move_lock)
+ *
+ * Once mem_cgroup_move_charge() returns (it means that the cgroup_mutex
+ * would be released soon), the page can be reparented to its parent
+ * memcg. When the unlock_page_memcg() is called for the page, we will
+ * miss unlocking the move_lock. So use synchronize_rcu() to wait for
+ * already started RCU-only updates to finish before this function
+ * returns (mem_cgroup_move_charge() and mem_cgroup_css_offline() are
+ * serialized by cgroup_mutex).
+ */
+ synchronize_rcu();
}
/*
@@ -6618,21 +6741,27 @@ void mem_cgroup_calculate_protection(struct mem_cgroup *root,
static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
gfp_t gfp)
{
+ struct obj_cgroup *objcg;
unsigned int nr_pages = thp_nr_pages(page);
int ret;
- ret = try_charge(memcg, gfp, nr_pages);
- if (ret)
- goto out;
+ objcg = get_obj_cgroup_from_memcg(memcg);
+ /* Do not account at the root objcg level. */
+ if (!obj_cgroup_is_root(objcg)) {
+ ret = try_charge(memcg, gfp, nr_pages);
+ if (ret)
+ goto out;
+ }
- css_get(&memcg->css);
- commit_charge(page, memcg);
+ obj_cgroup_get(objcg);
+ commit_charge(page, objcg);
local_irq_disable();
mem_cgroup_charge_statistics(memcg, page, nr_pages);
memcg_check_events(memcg, page);
local_irq_enable();
out:
+ obj_cgroup_put(objcg);
return ret;
}
@@ -6733,7 +6862,7 @@ void mem_cgroup_swapin_uncharge_swap(swp_entry_t entry)
}
struct uncharge_gather {
- struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
unsigned long nr_memory;
unsigned long pgpgout;
unsigned long nr_kmem;
@@ -6748,63 +6877,56 @@ static inline void uncharge_gather_clear(struct uncharge_gather *ug)
static void uncharge_batch(const struct uncharge_gather *ug)
{
unsigned long flags;
+ struct mem_cgroup *memcg;
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(ug->objcg);
if (ug->nr_memory) {
- page_counter_uncharge(&ug->memcg->memory, ug->nr_memory);
+ page_counter_uncharge(&memcg->memory, ug->nr_memory);
if (do_memsw_account())
- page_counter_uncharge(&ug->memcg->memsw, ug->nr_memory);
+ page_counter_uncharge(&memcg->memsw, ug->nr_memory);
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) && ug->nr_kmem)
- page_counter_uncharge(&ug->memcg->kmem, ug->nr_kmem);
- memcg_oom_recover(ug->memcg);
+ page_counter_uncharge(&memcg->kmem, ug->nr_kmem);
+ memcg_oom_recover(memcg);
}
local_irq_save(flags);
- __count_memcg_events(ug->memcg, PGPGOUT, ug->pgpgout);
- __this_cpu_add(ug->memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
- memcg_check_events(ug->memcg, ug->dummy_page);
+ __count_memcg_events(memcg, PGPGOUT, ug->pgpgout);
+ __this_cpu_add(memcg->vmstats_percpu->nr_page_events, ug->nr_memory);
+ memcg_check_events(memcg, ug->dummy_page);
local_irq_restore(flags);
+ rcu_read_unlock();
/* drop reference from uncharge_page */
- css_put(&ug->memcg->css);
+ obj_cgroup_put(ug->objcg);
}
static void uncharge_page(struct page *page, struct uncharge_gather *ug)
{
unsigned long nr_pages;
- struct mem_cgroup *memcg;
struct obj_cgroup *objcg;
VM_BUG_ON_PAGE(PageLRU(page), page);
/*
* Nobody should be changing or seriously looking at
- * page memcg or objcg at this point, we have fully
- * exclusive access to the page.
+ * page objcg at this point, we have fully exclusive
+ * access to the page.
*/
- if (PageMemcgKmem(page)) {
- objcg = __page_objcg(page);
- /*
- * This get matches the put at the end of the function and
- * kmem pages do not hold memcg references anymore.
- */
- memcg = get_mem_cgroup_from_objcg(objcg);
- } else {
- memcg = __page_memcg(page);
- }
-
- if (!memcg)
+ objcg = page_objcg(page);
+ if (!objcg)
return;
- if (ug->memcg != memcg) {
- if (ug->memcg) {
+ if (ug->objcg != objcg) {
+ if (ug->objcg) {
uncharge_batch(ug);
uncharge_gather_clear(ug);
}
- ug->memcg = memcg;
+ ug->objcg = objcg;
ug->dummy_page = page;
- /* pairs with css_put in uncharge_batch */
- css_get(&memcg->css);
+ /* pairs with obj_cgroup_put in uncharge_batch */
+ obj_cgroup_get(objcg);
}
nr_pages = compound_nr(page);
@@ -6812,19 +6934,15 @@ static void uncharge_page(struct page *page, struct uncharge_gather *ug)
if (PageMemcgKmem(page)) {
ug->nr_memory += nr_pages;
ug->nr_kmem += nr_pages;
-
- page->memcg_data = 0;
- obj_cgroup_put(objcg);
} else {
/* LRU pages aren't accounted at the root level */
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
ug->nr_memory += nr_pages;
ug->pgpgout++;
-
- page->memcg_data = 0;
}
- css_put(&memcg->css);
+ page->memcg_data = 0;
+ obj_cgroup_put(objcg);
}
/**
@@ -6841,7 +6959,7 @@ void mem_cgroup_uncharge(struct page *page)
return;
/* Don't touch page->lru of any random page, pre-check: */
- if (!page_memcg(page))
+ if (!page_objcg(page))
return;
uncharge_gather_clear(&ug);
@@ -6867,7 +6985,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
uncharge_gather_clear(&ug);
list_for_each_entry(page, page_list, lru)
uncharge_page(page, &ug);
- if (ug.memcg)
+ if (ug.objcg)
uncharge_batch(&ug);
}
@@ -6884,6 +7002,7 @@ void mem_cgroup_uncharge_list(struct list_head *page_list)
void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
{
struct mem_cgroup *memcg;
+ struct obj_cgroup *objcg;
unsigned int nr_pages;
unsigned long flags;
@@ -6897,32 +7016,34 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
return;
/* Page cache replacement: new page already charged? */
- if (page_memcg(newpage))
+ if (page_objcg(newpage))
return;
- memcg = get_mem_cgroup_from_page(oldpage);
- VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
- if (!memcg)
+ objcg = page_objcg(oldpage);
+ VM_WARN_ON_ONCE_PAGE(!objcg, oldpage);
+ if (!objcg)
return;
/* Force-charge the new page. The old one will be freed soon */
nr_pages = thp_nr_pages(newpage);
- if (!mem_cgroup_is_root(memcg)) {
+ rcu_read_lock();
+ memcg = obj_cgroup_memcg(objcg);
+
+ if (!obj_cgroup_is_root(objcg)) {
page_counter_charge(&memcg->memory, nr_pages);
if (do_memsw_account())
page_counter_charge(&memcg->memsw, nr_pages);
}
- css_get(&memcg->css);
- commit_charge(newpage, memcg);
+ obj_cgroup_get(objcg);
+ commit_charge(newpage, objcg);
local_irq_save(flags);
mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
memcg_check_events(memcg, newpage);
local_irq_restore(flags);
-
- css_put(&memcg->css);
+ rcu_read_unlock();
}
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7099,6 +7220,7 @@ static struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg)
void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
{
struct mem_cgroup *memcg, *swap_memcg;
+ struct obj_cgroup *objcg;
unsigned int nr_entries;
unsigned short oldid;
@@ -7111,15 +7233,16 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ objcg = page_objcg(page);
+ VM_WARN_ON_ONCE_PAGE(!objcg, page);
+ if (!objcg)
+ return;
+
/*
* Interrupts should be disabled by the caller (see the comments below),
* which can serve as RCU read-side critical sections.
*/
- memcg = page_memcg(page);
-
- VM_WARN_ON_ONCE_PAGE(!memcg, page);
- if (!memcg)
- return;
+ memcg = obj_cgroup_memcg(objcg);
/*
* In case the memcg owning these pages has been offlined and doesn't
@@ -7138,7 +7261,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
page->memcg_data = 0;
- if (!mem_cgroup_is_root(memcg))
+ if (!obj_cgroup_is_root(objcg))
page_counter_uncharge(&memcg->memory, nr_entries);
if (!cgroup_memory_noswap && memcg != swap_memcg) {
@@ -7157,7 +7280,7 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
mem_cgroup_charge_statistics(memcg, page, -nr_entries);
memcg_check_events(memcg, page);
- css_put(&memcg->css);
+ obj_cgroup_put(objcg);
}
/**
--
2.11.0
We need to make sure that the page is deleted from or added to the
correct lruvec list. So add a VM_BUG_ON_PAGE() to catch invalid
users.
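For context, the assertion relies on a helper simplified in an earlier patch of
this series; a rough sketch of its expected shape (an assumption based on that
patch's description, not a copy of it) is:
```c
/* Sketch: does @lruvec match the lruvec that @page currently belongs to? */
static inline bool lruvec_holds_page_lru_lock(struct page *page,
					      struct lruvec *lruvec)
{
	return lruvec_pgdat(lruvec) == page_pgdat(page) &&
	       lruvec_memcg(lruvec) == page_memcg(page);
}
```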
Signed-off-by: Muchun Song <[email protected]>
---
include/linux/mm_inline.h | 6 ++++++
mm/vmscan.c | 1 -
2 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 355ea1ee32bd..d19870448287 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -84,6 +84,8 @@ static __always_inline void add_page_to_lru_list(struct page *page,
{
enum lru_list lru = page_lru(page);
+ VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
+
update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
list_add(&page->lru, &lruvec->lists[lru]);
}
@@ -93,6 +95,8 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
{
enum lru_list lru = page_lru(page);
+ VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
+
update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
list_add_tail(&page->lru, &lruvec->lists[lru]);
}
@@ -100,6 +104,8 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec)
{
+ VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
+
list_del(&page->lru);
update_lru_size(lruvec, page_lru(page), page_zonenum(page),
-thp_nr_pages(page));
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4431007825ad..af0fc8110bdc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2063,7 +2063,6 @@ static unsigned int move_pages_to_lru(struct list_head *list)
continue;
}
- VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
add_page_to_lru_list(page, lruvec);
nr_pages = thp_nr_pages(page);
nr_moved += nr_pages;
--
2.11.0
As described by commit fc574c23558c ("mm/swap.c: serialize memcg
changes in pagevec_lru_move_fn"), TestClearPageLRU() aims to
serialize mem_cgroup_move_account() during pagevec_lru_move_fn().
Now lock_page_lruvec*() has the ability to detect whether page
memcg has been changed. So we can use lruvec lock to serialize
mem_cgroup_move_account() during pagevec_lru_move_fn(). This
change is a partial revert of the commit fc574c23558c ("mm/swap.c:
serialize memcg changes in pagevec_lru_move_fn").
pagevec_lru_move_fn() is also hotter than mem_cgroup_move_account(),
so removing an atomic operation from it is a worthwhile optimization.
In addition, this change no longer dirties the cacheline of a page
which isn't on the LRU.
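As an illustration of the new serialization rule (this mirrors the
pagevec_move_tail_fn() hunk below and is not an additional change), a move
function now only needs a plain PageLRU() check under the lruvec lock instead
of TestClearPageLRU():
```c
static void example_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
	/*
	 * The caller holds lruvec->lru_lock and has verified that it is
	 * the page's current lruvec, so the page memcg cannot change here.
	 */
	if (!PageLRU(page) || PageUnevictable(page))
		return;

	del_page_from_lru_list(page, lruvec);
	ClearPageActive(page);
	add_page_to_lru_list_tail(page, lruvec);
}
```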
Signed-off-by: Muchun Song <[email protected]>
---
mm/compaction.c | 1 +
mm/memcontrol.c | 31 +++++++++++++++++++++++++++++++
mm/swap.c | 41 +++++++++++------------------------------
mm/vmscan.c | 9 ++++-----
4 files changed, 47 insertions(+), 35 deletions(-)
diff --git a/mm/compaction.c b/mm/compaction.c
index 5fd37e14404f..382e40ccc694 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -531,6 +531,7 @@ compact_lock_page_lruvec_irqsave(struct page *page, unsigned long *flags,
spin_lock_irqsave(&lruvec->lru_lock, *flags);
out:
+ /* See the comments in lock_page_lruvec(). */
if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
goto retry;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 403b0743338b..1017e92f1d82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1323,12 +1323,38 @@ struct lruvec *lock_page_lruvec(struct page *page)
lruvec = mem_cgroup_page_lruvec(page);
spin_lock(&lruvec->lru_lock);
+ /*
+ * The memcg of the page can be changed by any of the following routines:
+ *
+ * 1) mem_cgroup_move_account() or
+ * 2) memcg_reparent_objcgs()
+ *
+ * The possible bad scenario would be like:
+ *
+ * CPU0: CPU1: CPU2:
+ * lruvec = mem_cgroup_page_lruvec()
+ *
+ * if (!isolate_lru_page())
+ * mem_cgroup_move_account()
+ *
+ * memcg_reparent_objcgs()
+ *
+ * spin_lock(&lruvec->lru_lock)
+ * ^^^^^^
+ * wrong lock
+ *
+ * Either CPU1 or CPU2 can change the page memcg, so we need to check
+ * whether the page memcg has changed and, if so, reacquire the
+ * new lruvec lock.
+ */
if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
spin_unlock(&lruvec->lru_lock);
goto retry;
}
/*
+ * When we reach here, it means that the page_memcg(page) is stable.
+ *
* Preemption is disabled in the internal of spin_lock, which can serve
* as RCU read-side critical sections.
*/
@@ -1346,6 +1372,7 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irq(&lruvec->lru_lock);
+ /* See the comments in lock_page_lruvec(). */
if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
spin_unlock_irq(&lruvec->lru_lock);
goto retry;
@@ -1366,6 +1393,7 @@ struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long *flags)
lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irqsave(&lruvec->lru_lock, *flags);
+ /* See the comments in lock_page_lruvec(). */
if (unlikely(lruvec_memcg(lruvec) != page_memcg(page))) {
spin_unlock_irqrestore(&lruvec->lru_lock, *flags);
goto retry;
@@ -5687,7 +5715,10 @@ static int mem_cgroup_move_account(struct page *page,
obj_cgroup_get(to->objcg);
obj_cgroup_put(from->objcg);
+ /* See the comments in lock_page_lruvec(). */
+ spin_lock(&from_vec->lru_lock);
page->memcg_data = (unsigned long)to->objcg;
+ spin_unlock(&from_vec->lru_lock);
__unlock_page_objcg(from->objcg);
diff --git a/mm/swap.c b/mm/swap.c
index f3ce307d09fa..48e66a05c913 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -211,14 +211,8 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
for (i = 0; i < pagevec_count(pvec); i++) {
struct page *page = pvec->pages[i];
- /* block memcg migration during page moving between lru */
- if (!TestClearPageLRU(page))
- continue;
-
lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
(*move_fn)(page, lruvec);
-
- SetPageLRU(page);
}
if (lruvec)
unlock_page_lruvec_irqrestore(lruvec, flags);
@@ -228,7 +222,7 @@ static void pagevec_lru_move_fn(struct pagevec *pvec,
static void pagevec_move_tail_fn(struct page *page, struct lruvec *lruvec)
{
- if (!PageUnevictable(page)) {
+ if (PageLRU(page) && !PageUnevictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageActive(page);
add_page_to_lru_list_tail(page, lruvec);
@@ -324,7 +318,7 @@ void lru_note_cost_page(struct page *page)
static void __activate_page(struct page *page, struct lruvec *lruvec)
{
- if (!PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
int nr_pages = thp_nr_pages(page);
del_page_from_lru_list(page, lruvec);
@@ -377,12 +371,9 @@ static void activate_page(struct page *page)
struct lruvec *lruvec;
page = compound_head(page);
- if (TestClearPageLRU(page)) {
- lruvec = lock_page_lruvec_irq(page);
- __activate_page(page, lruvec);
- unlock_page_lruvec_irq(lruvec);
- SetPageLRU(page);
- }
+ lruvec = lock_page_lruvec_irq(page);
+ __activate_page(page, lruvec);
+ unlock_page_lruvec_irq(lruvec);
}
#endif
@@ -537,6 +528,9 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
bool active = PageActive(page);
int nr_pages = thp_nr_pages(page);
+ if (!PageLRU(page))
+ return;
+
if (PageUnevictable(page))
return;
@@ -574,7 +568,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
{
- if (PageActive(page) && !PageUnevictable(page)) {
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
int nr_pages = thp_nr_pages(page);
del_page_from_lru_list(page, lruvec);
@@ -590,7 +584,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec)
{
- if (PageAnon(page) && PageSwapBacked(page) &&
+ if (PageLRU(page) && PageAnon(page) && PageSwapBacked(page) &&
!PageSwapCache(page) && !PageUnevictable(page)) {
int nr_pages = thp_nr_pages(page);
@@ -1055,20 +1049,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec)
*/
void __pagevec_lru_add(struct pagevec *pvec)
{
- int i;
- struct lruvec *lruvec = NULL;
- unsigned long flags = 0;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
-
- lruvec = relock_page_lruvec_irqsave(page, lruvec, &flags);
- __pagevec_lru_add_fn(page, lruvec);
- }
- if (lruvec)
- unlock_page_lruvec_irqrestore(lruvec, flags);
- release_pages(pvec->pages, pvec->nr);
- pagevec_reinit(pvec);
+ pagevec_lru_move_fn(pvec, __pagevec_lru_add_fn);
}
/**
diff --git a/mm/vmscan.c b/mm/vmscan.c
index af0fc8110bdc..12c2ea6cb6f3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4469,18 +4469,17 @@ void check_move_unevictable_pages(struct pagevec *pvec)
nr_pages = thp_nr_pages(page);
pgscanned += nr_pages;
- /* block memcg migration during page moving between lru */
- if (!TestClearPageLRU(page))
+ lruvec = relock_page_lruvec_irq(page, lruvec);
+
+ if (!PageLRU(page) || !PageUnevictable(page))
continue;
- lruvec = relock_page_lruvec_irq(page, lruvec);
- if (page_evictable(page) && PageUnevictable(page)) {
+ if (page_evictable(page)) {
del_page_from_lru_list(page, lruvec);
ClearPageUnevictable(page);
add_page_to_lru_list(page, lruvec);
pgrescued += nr_pages;
}
- SetPageLRU(page);
}
if (lruvec) {
--
2.11.0
Now lock_page_memcg() no longer locks a page and memcg binding; it
actually locks a page and objcg binding. So rename lock_page_memcg()
to lock_page_objcg().
This is just a code cleanup without any functional change.
Signed-off-by: Muchun Song <[email protected]>
---
Documentation/admin-guide/cgroup-v1/memory.rst | 2 +-
fs/buffer.c | 10 +++----
fs/iomap/buffered-io.c | 4 +--
include/linux/memcontrol.h | 18 +++++++----
mm/filemap.c | 2 +-
mm/huge_memory.c | 4 +--
mm/memcontrol.c | 41 ++++++++++++++++----------
mm/page-writeback.c | 24 +++++++--------
mm/rmap.c | 14 ++++-----
9 files changed, 67 insertions(+), 52 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v1/memory.rst b/Documentation/admin-guide/cgroup-v1/memory.rst
index 41191b5fb69d..dd582312b91a 100644
--- a/Documentation/admin-guide/cgroup-v1/memory.rst
+++ b/Documentation/admin-guide/cgroup-v1/memory.rst
@@ -291,7 +291,7 @@ Lock order is as follows:
Page lock (PG_locked bit of page->flags)
mm->page_table_lock or split pte_lock
- lock_page_memcg (memcg->move_lock)
+ lock_page_objcg (memcg->move_lock)
mapping->i_pages lock
lruvec->lru_lock.
diff --git a/fs/buffer.c b/fs/buffer.c
index a542a47f6e27..6935f12d23f8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -595,7 +595,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*
- * The caller must hold lock_page_memcg().
+ * The caller must hold lock_page_objcg().
*/
void __set_page_dirty(struct page *page, struct address_space *mapping,
int warn)
@@ -660,14 +660,14 @@ int __set_page_dirty_buffers(struct page *page)
* Lock out page's memcg migration to keep PageDirty
* synchronized with per-memcg dirty page counters.
*/
- lock_page_memcg(page);
+ lock_page_objcg(page);
newly_dirty = !TestSetPageDirty(page);
spin_unlock(&mapping->private_lock);
if (newly_dirty)
__set_page_dirty(page, mapping, 1);
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
@@ -1164,13 +1164,13 @@ void mark_buffer_dirty(struct buffer_head *bh)
struct page *page = bh->b_page;
struct address_space *mapping = NULL;
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (!TestSetPageDirty(page)) {
mapping = page_mapping(page);
if (mapping)
__set_page_dirty(page, mapping, 0);
}
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
if (mapping)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 414769a6ad11..106082a5109f 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -653,11 +653,11 @@ iomap_set_page_dirty(struct page *page)
* Lock out page's memcg migration to keep PageDirty
* synchronized with per-memcg dirty page counters.
*/
- lock_page_memcg(page);
+ lock_page_objcg(page);
newly_dirty = !TestSetPageDirty(page);
if (newly_dirty)
__set_page_dirty(page, mapping, 0);
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
if (newly_dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8c56baacc255..35c83e49ac42 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -424,11 +424,12 @@ static inline struct obj_cgroup *page_objcg(struct page *page)
* proper memory cgroup pointer. It's not safe to call this function
* against some type of pages, e.g. slab pages or ex-slab pages.
*
- * For a page any of the following ensures page and objcg binding stability:
+ * For a page any of the following ensures page and objcg binding stability
+ * (but the page can still be reparented to its parent memcg):
*
* - the page lock
* - LRU isolation
- * - lock_page_memcg()
+ * - lock_page_objcg()
* - exclusive reference
*
* Based on the stable binding of page and objcg, for a page any of the
@@ -948,8 +949,8 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg);
extern bool cgroup_memory_noswap;
#endif
-void lock_page_memcg(struct page *page);
-void unlock_page_memcg(struct page *page);
+void lock_page_objcg(struct page *page);
+void unlock_page_objcg(struct page *page);
void __mod_memcg_state(struct mem_cgroup *memcg, int idx, int val);
@@ -1120,6 +1121,11 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
struct mem_cgroup;
+static inline struct obj_cgroup *page_objcg(struct page *page)
+{
+ return NULL;
+}
+
static inline struct mem_cgroup *page_memcg(struct page *page)
{
return NULL;
@@ -1342,11 +1348,11 @@ mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg)
{
}
-static inline void lock_page_memcg(struct page *page)
+static inline void lock_page_objcg(struct page *page)
{
}
-static inline void unlock_page_memcg(struct page *page)
+static inline void unlock_page_objcg(struct page *page)
{
}
diff --git a/mm/filemap.c b/mm/filemap.c
index c03463cb72d6..aeb38953d68b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -110,7 +110,7 @@
* ->i_pages lock (page_remove_rmap->set_page_dirty)
* bdi.wb->list_lock (page_remove_rmap->set_page_dirty)
* ->inode->i_lock (page_remove_rmap->set_page_dirty)
- * ->memcg->move_lock (page_remove_rmap->lock_page_memcg)
+ * ->memcg->move_lock (page_remove_rmap->lock_page_objcg)
* bdi.wb->list_lock (zap_pte_range->set_page_dirty)
* ->inode->i_lock (zap_pte_range->set_page_dirty)
* ->private_lock (zap_pte_range->__set_page_dirty_buffers)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4c0a36dc67ea..d94cc7916253 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2306,7 +2306,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
atomic_inc(&page[i]._mapcount);
}
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (atomic_add_negative(-1, compound_mapcount_ptr(page))) {
/* Last compound_mapcount is gone. */
__mod_lruvec_page_state(page, NR_ANON_THPS,
@@ -2317,7 +2317,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
atomic_dec(&page[i]._mapcount);
}
}
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
}
smp_wmb(); /* make pte visible before pmd */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 35a4c768dacc..403b0743338b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1311,7 +1311,7 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
* These functions are safe to use under any of the following conditions:
* - page locked
* - PageLRU cleared
- * - lock_page_memcg()
+ * - lock_page_objcg()
* - page->_refcount is zero
*/
struct lruvec *lock_page_lruvec(struct page *page)
@@ -2122,16 +2122,16 @@ void mem_cgroup_print_oom_group(struct mem_cgroup *memcg)
}
/**
- * lock_page_memcg - lock a page and memcg binding
+ * lock_page_objcg - lock a page and objcg binding
* @page: the page
*
* This function protects unlocked LRU pages from being moved to
* another cgroup.
*
- * It ensures lifetime of the locked memcg. Caller is responsible
+ * It ensures lifetime of the locked objcg. Caller is responsible
* for the lifetime of the page.
*/
-void lock_page_memcg(struct page *page)
+void lock_page_objcg(struct page *page)
{
struct page *head = compound_head(page); /* rmap on tail pages */
struct mem_cgroup *memcg;
@@ -2169,18 +2169,27 @@ void lock_page_memcg(struct page *page)
}
/*
+ * The cgroup migration and memory cgroup offlining are serialized by
+ * cgroup_mutex. If we reach here, it means that we are racing with cgroup
+ * migration (or we are the cgroup migration path) and the @page cannot be
+ * reparented to its parent memory cgroup. So during the whole process
+ * from lock_page_objcg(page) to unlock_page_objcg(page), page_memcg(page)
+ * and obj_cgroup_memcg(objcg) are stable.
+ *
* When charge migration first begins, we can have multiple
* critical sections holding the fast-path RCU lock and one
* holding the slowpath move_lock. Track the task who has the
- * move_lock for unlock_page_memcg().
+ * move_lock for unlock_page_objcg().
*/
memcg->move_lock_task = current;
memcg->move_lock_flags = flags;
}
-EXPORT_SYMBOL(lock_page_memcg);
+EXPORT_SYMBOL(lock_page_objcg);
-static void __unlock_page_memcg(struct mem_cgroup *memcg)
+static void __unlock_page_objcg(struct obj_cgroup *objcg)
{
+ struct mem_cgroup *memcg = objcg ? obj_cgroup_memcg(objcg) : NULL;
+
if (memcg && memcg->move_lock_task == current) {
unsigned long flags = memcg->move_lock_flags;
@@ -2194,16 +2203,16 @@ static void __unlock_page_memcg(struct mem_cgroup *memcg)
}
/**
- * unlock_page_memcg - unlock a page and memcg binding
+ * unlock_page_objcg - unlock a page and objcg binding
* @page: the page
*/
-void unlock_page_memcg(struct page *page)
+void unlock_page_objcg(struct page *page)
{
struct page *head = compound_head(page);
- __unlock_page_memcg(page_memcg(head));
+ __unlock_page_objcg(page_objcg(head));
}
-EXPORT_SYMBOL(unlock_page_memcg);
+EXPORT_SYMBOL(unlock_page_objcg);
struct memcg_stock_pcp {
struct mem_cgroup *cached; /* this never be root cgroup */
@@ -2883,7 +2892,7 @@ static void commit_charge(struct page *page, struct obj_cgroup *objcg)
*
* - the page lock
* - LRU isolation
- * - lock_page_memcg()
+ * - lock_page_objcg()
* - exclusive reference
*/
page->memcg_data = (unsigned long)objcg;
@@ -5616,7 +5625,7 @@ static int mem_cgroup_move_account(struct page *page,
from_vec = mem_cgroup_lruvec(from, pgdat);
to_vec = mem_cgroup_lruvec(to, pgdat);
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (PageAnon(page)) {
if (page_mapped(page)) {
@@ -5668,7 +5677,7 @@ static int mem_cgroup_move_account(struct page *page,
* with (un)charging, migration, LRU putback, or anything else
* that would rely on a stable page's memory cgroup.
*
- * Note that lock_page_memcg is a memcg lock, not a page lock,
+ * Note that lock_page_objcg is a memcg lock, not a page lock,
* to save space. As soon as we switch page's memory cgroup to a
* new memcg that isn't locked, the above state can change
* concurrently again. Make sure we're truly done with it.
@@ -5680,7 +5689,7 @@ static int mem_cgroup_move_account(struct page *page,
page->memcg_data = (unsigned long)to->objcg;
- __unlock_page_memcg(from);
+ __unlock_page_objcg(from->objcg);
ret = 0;
@@ -6122,7 +6131,7 @@ static void mem_cgroup_move_charge(void)
{
lru_add_drain_all();
/*
- * Signal lock_page_memcg() to take the memcg's move_lock
+ * Signal lock_page_objcg() to take the memcg's move_lock
* while we're moving its pages to another memcg. Then wait
* for already started RCU-only updates to finish.
*/
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0062d5c57d41..d5d9feb05a2e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2413,7 +2413,7 @@ int __set_page_dirty_no_writeback(struct page *page)
/*
* Helper function for set_page_dirty family.
*
- * Caller must hold lock_page_memcg().
+ * Caller must hold lock_page_objcg().
*
* NOTE: This relies on being atomic wrt interrupts.
*/
@@ -2445,7 +2445,7 @@ void account_page_dirtied(struct page *page, struct address_space *mapping)
/*
* Helper function for deaccounting dirty page without writeback.
*
- * Caller must hold lock_page_memcg().
+ * Caller must hold lock_page_objcg().
*/
void account_page_cleaned(struct page *page, struct address_space *mapping,
struct bdi_writeback *wb)
@@ -2472,13 +2472,13 @@ void account_page_cleaned(struct page *page, struct address_space *mapping,
*/
int __set_page_dirty_nobuffers(struct page *page)
{
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
unsigned long flags;
if (!mapping) {
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
return 1;
}
@@ -2489,7 +2489,7 @@ int __set_page_dirty_nobuffers(struct page *page)
__xa_set_mark(&mapping->i_pages, page_index(page),
PAGECACHE_TAG_DIRTY);
xa_unlock_irqrestore(&mapping->i_pages, flags);
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
if (mapping->host) {
/* !PageAnon && !swapper_space */
@@ -2497,7 +2497,7 @@ int __set_page_dirty_nobuffers(struct page *page)
}
return 1;
}
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
return 0;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2630,14 +2630,14 @@ void __cancel_dirty_page(struct page *page)
struct bdi_writeback *wb;
struct wb_lock_cookie cookie = {};
- lock_page_memcg(page);
+ lock_page_objcg(page);
wb = unlocked_inode_to_wb_begin(inode, &cookie);
if (TestClearPageDirty(page))
account_page_cleaned(page, mapping, wb);
unlocked_inode_to_wb_end(inode, &cookie);
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
} else {
ClearPageDirty(page);
}
@@ -2724,7 +2724,7 @@ int test_clear_page_writeback(struct page *page)
struct address_space *mapping = page_mapping(page);
int ret;
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (mapping && mapping_use_writeback_tags(mapping)) {
struct inode *inode = mapping->host;
struct backing_dev_info *bdi = inode_to_bdi(inode);
@@ -2756,7 +2756,7 @@ int test_clear_page_writeback(struct page *page)
dec_zone_page_state(page, NR_ZONE_WRITE_PENDING);
inc_node_page_state(page, NR_WRITTEN);
}
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
return ret;
}
@@ -2765,7 +2765,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
struct address_space *mapping = page_mapping(page);
int ret, access_ret;
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (mapping && mapping_use_writeback_tags(mapping)) {
XA_STATE(xas, &mapping->i_pages, page_index(page));
struct inode *inode = mapping->host;
@@ -2805,7 +2805,7 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
inc_lruvec_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
}
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
access_ret = arch_make_page_accessible(page);
/*
* If writeback has been triggered on a page that cannot be made
diff --git a/mm/rmap.c b/mm/rmap.c
index b0fc27e77d6d..3c2488e1081c 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -31,7 +31,7 @@
* swap_lock (in swap_duplicate, swap_info_get)
* mmlist_lock (in mmput, drain_mmlist and others)
* mapping->private_lock (in __set_page_dirty_buffers)
- * lock_page_memcg move_lock (in __set_page_dirty_buffers)
+ * lock_page_objcg move_lock (in __set_page_dirty_buffers)
* i_pages lock (widely used)
* lruvec->lru_lock (in lock_page_lruvec_irq)
* inode->i_lock (in set_page_dirty's __mark_inode_dirty)
@@ -1127,7 +1127,7 @@ void do_page_add_anon_rmap(struct page *page,
bool first;
if (unlikely(PageKsm(page)))
- lock_page_memcg(page);
+ lock_page_objcg(page);
else
VM_BUG_ON_PAGE(!PageLocked(page), page);
@@ -1155,7 +1155,7 @@ void do_page_add_anon_rmap(struct page *page,
}
if (unlikely(PageKsm(page))) {
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
return;
}
@@ -1215,7 +1215,7 @@ void page_add_file_rmap(struct page *page, bool compound)
int i, nr = 1;
VM_BUG_ON_PAGE(compound && !PageTransHuge(page), page);
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (compound && PageTransHuge(page)) {
int nr_pages = thp_nr_pages(page);
@@ -1244,7 +1244,7 @@ void page_add_file_rmap(struct page *page, bool compound)
}
__mod_lruvec_page_state(page, NR_FILE_MAPPED, nr);
out:
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
}
static void page_remove_file_rmap(struct page *page, bool compound)
@@ -1345,7 +1345,7 @@ static void page_remove_anon_compound_rmap(struct page *page)
*/
void page_remove_rmap(struct page *page, bool compound)
{
- lock_page_memcg(page);
+ lock_page_objcg(page);
if (!PageAnon(page)) {
page_remove_file_rmap(page, compound);
@@ -1384,7 +1384,7 @@ void page_remove_rmap(struct page *page, bool compound)
* faster for those pages still in swapcache.
*/
out:
- unlock_page_memcg(page);
+ unlock_page_objcg(page);
}
/*
--
2.11.0
Similar to the lruvec lock, use the same approach to make the deferred
split queue lock safe when the LRU pages are reparented.
Signed-off-by: Muchun Song <[email protected]>
---
mm/huge_memory.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 275dbfc8b2ae..aa5d7b72d5fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -497,6 +497,8 @@ static struct deferred_split *lock_split_queue(struct page *page)
struct deferred_split *queue;
struct mem_cgroup *memcg;
+ rcu_read_lock();
+retry:
memcg = page_memcg(compound_head(page));
if (memcg)
queue = &memcg->deferred_split_queue;
@@ -504,6 +506,17 @@ static struct deferred_split *lock_split_queue(struct page *page)
queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
spin_lock(&queue->split_queue_lock);
+ if (unlikely(memcg != page_memcg(page))) {
+ spin_unlock(&queue->split_queue_lock);
+ goto retry;
+ }
+
+ /*
+ * Preemption is disabled in the internal of spin_lock, which can serve
+ * as RCU read-side critical sections.
+ */
+ rcu_read_unlock();
+
return queue;
}
@@ -513,6 +526,8 @@ static struct deferred_split *lock_split_queue_irqsave(struct page *page,
struct deferred_split *queue;
struct mem_cgroup *memcg;
+ rcu_read_lock();
+retry:
memcg = page_memcg(compound_head(page));
if (memcg)
queue = &memcg->deferred_split_queue;
@@ -520,6 +535,14 @@ static struct deferred_split *lock_split_queue_irqsave(struct page *page,
queue = &NODE_DATA(page_to_nid(page))->deferred_split_queue;
spin_lock_irqsave(&queue->split_queue_lock, *flags);
+ if (unlikely(memcg != page_memcg(page))) {
+ spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
+ goto retry;
+ }
+
+ /* See the comments in lock_split_queue(). */
+ rcu_read_unlock();
+
return queue;
}
#else
--
2.11.0
When we use objcg APIs to charge the LRU pages, the page will no longer
hold a reference to the memcg associated with the page. So a caller of
page_memcg() must hold an rcu read lock or obtain a reference to the
memcg to protect it from being released. Introduce
get_mem_cgroup_from_page() to obtain such a reference to the memory
cgroup associated with a page.
In this patch, make all the callers hold an rcu read lock or obtain a
reference to the memcg to protect the memcg from being released when the
LRU pages are reparented.
We do not need to adjust the callers of page_memcg() during the whole
process of mem_cgroup_move_task(), because the cgroup migration and
memory cgroup offlining are serialized by @cgroup_mutex. In this
routine, the LRU pages cannot be reparented to their parent memory
cgroup. So page_memcg(page) is stable and cannot be released.
This is a preparation for reparenting the LRU pages.
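To make the two caller patterns concrete, here is an illustrative sketch (the
example functions are hypothetical; the helpers are the ones introduced or used
in the hunks below):
```c
/* Short, non-sleeping access: an RCU read lock is enough. */
static void example_short_use(struct page *page)
{
	struct mem_cgroup *memcg;

	rcu_read_lock();
	memcg = page_memcg(page);
	if (memcg)
		count_memcg_events(memcg, PGMAJFAULT, 1);
	rcu_read_unlock();
}

/* Longer-lived access: pin the memcg with a css reference. */
static void example_long_use(struct page *page)
{
	struct mem_cgroup *memcg = get_mem_cgroup_from_page(page);

	if (!memcg)
		return;
	/* ... may sleep or stash the pointer here ... */
	css_put(&memcg->css);
}
```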
Signed-off-by: Muchun Song <[email protected]>
---
fs/buffer.c | 3 ++-
fs/fs-writeback.c | 23 +++++++++++----------
include/linux/memcontrol.h | 34 ++++++++++++++++++++++++++++---
mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++++++++----------
mm/migrate.c | 4 ++++
mm/page_io.c | 5 +++--
6 files changed, 92 insertions(+), 28 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 673cfbef9eec..a542a47f6e27 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -848,7 +848,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
gfp |= __GFP_NOFAIL;
/* The page lock pins the memcg */
- memcg = page_memcg(page);
+ memcg = get_mem_cgroup_from_page(page);
old_memcg = set_active_memcg(memcg);
head = NULL;
@@ -868,6 +868,7 @@ struct buffer_head *alloc_page_buffers(struct page *page, unsigned long size,
set_bh_page(bh, page, offset);
}
out:
+ mem_cgroup_put(memcg);
set_active_memcg(old_memcg);
return head;
/*
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index e91980f49388..3ac002561327 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -255,15 +255,13 @@ void __inode_attach_wb(struct inode *inode, struct page *page)
if (inode_cgwb_enabled(inode)) {
struct cgroup_subsys_state *memcg_css;
- if (page) {
- memcg_css = mem_cgroup_css_from_page(page);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- } else {
- /* must pin memcg_css, see wb_get_create() */
+ /* must pin memcg_css, see wb_get_create() */
+ if (page)
+ memcg_css = get_mem_cgroup_css_from_page(page);
+ else
memcg_css = task_get_css(current, memory_cgrp_id);
- wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
- css_put(memcg_css);
- }
+ wb = wb_get_create(bdi, memcg_css, GFP_ATOMIC);
+ css_put(memcg_css);
}
if (!wb)
@@ -736,16 +734,16 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
if (!wbc->wb || wbc->no_cgroup_owner)
return;
- css = mem_cgroup_css_from_page(page);
+ css = get_mem_cgroup_css_from_page(page);
/* dead cgroups shouldn't contribute to inode ownership arbitration */
if (!(css->flags & CSS_ONLINE))
- return;
+ goto out;
id = css->id;
if (id == wbc->wb_id) {
wbc->wb_bytes += bytes;
- return;
+ goto out;
}
if (id == wbc->wb_lcand_id)
@@ -758,6 +756,9 @@ void wbc_account_cgroup_owner(struct writeback_control *wbc, struct page *page,
wbc->wb_tcand_bytes += bytes;
else
wbc->wb_tcand_bytes -= min(bytes, wbc->wb_tcand_bytes);
+
+out:
+ css_put(css);
}
EXPORT_SYMBOL_GPL(wbc_account_cgroup_owner);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 93aa41600913..7e15be2bd47a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -381,7 +381,7 @@ static inline bool PageMemcgKmem(struct page *page);
* a valid memcg, but can be atomically swapped to the parent memcg.
*
* The caller must ensure that the returned memcg won't be released:
- * e.g. acquire the rcu_read_lock or css_set_lock.
+ * e.g. acquire the rcu_read_lock or css_set_lock or cgroup_mutex.
*/
static inline struct mem_cgroup *obj_cgroup_memcg(struct obj_cgroup *objcg)
{
@@ -459,6 +459,31 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
}
/*
+ * get_mem_cgroup_from_page - Obtain a reference on the memory cgroup associated
+ * with a page
+ * @page: a pointer to the page struct
+ *
+ * Returns a pointer to the memory cgroup (and obtain a reference on it)
+ * associated with the page, or NULL. This function assumes that the page
+ * is known to have a proper memory cgroup pointer. It's not safe to call
+ * this function against some type of pages, e.g. slab pages or ex-slab
+ * pages.
+ */
+static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+ struct mem_cgroup *memcg;
+
+ rcu_read_lock();
+retry:
+ memcg = page_memcg(page);
+ if (unlikely(memcg && !css_tryget(&memcg->css)))
+ goto retry;
+ rcu_read_unlock();
+
+ return memcg;
+}
+
+/*
* page_memcg_rcu - locklessly get the memory cgroup associated with a page
* @page: a pointer to the page struct
*
@@ -871,7 +896,7 @@ static inline bool mm_match_cgroup(struct mm_struct *mm,
return match;
}
-struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page);
+struct cgroup_subsys_state *get_mem_cgroup_css_from_page(struct page *page);
ino_t page_cgroup_ino(struct page *page);
static inline bool mem_cgroup_online(struct mem_cgroup *memcg)
@@ -1031,10 +1056,13 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
static inline void count_memcg_page_event(struct page *page,
enum vm_event_item idx)
{
- struct mem_cgroup *memcg = page_memcg(page);
+ struct mem_cgroup *memcg;
+ rcu_read_lock();
+ memcg = page_memcg(page);
if (memcg)
count_memcg_events(memcg, idx, 1);
+ rcu_read_unlock();
}
static inline void count_memcg_event_mm(struct mm_struct *mm,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 3a2f5c43aed3..2f4fcb182883 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -440,7 +440,7 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
#endif
/**
- * mem_cgroup_css_from_page - css of the memcg associated with a page
+ * get_mem_cgroup_css_from_page - get css of the memcg associated with a page
* @page: page of interest
*
* If memcg is bound to the default hierarchy, css of the memcg associated
@@ -450,13 +450,15 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
* If memcg is bound to a traditional hierarchy, the css of root_mem_cgroup
* is returned.
*/
-struct cgroup_subsys_state *mem_cgroup_css_from_page(struct page *page)
+struct cgroup_subsys_state *get_mem_cgroup_css_from_page(struct page *page)
{
struct mem_cgroup *memcg;
- memcg = page_memcg(page);
+ if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ return &root_mem_cgroup->css;
- if (!memcg || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
+ memcg = get_mem_cgroup_from_page(page);
+ if (!memcg)
memcg = root_mem_cgroup;
return &memcg->css;
@@ -2023,7 +2025,9 @@ void lock_page_memcg(struct page *page)
* The RCU lock is held throughout the transaction. The fast
* path can get away without acquiring the memcg->move_lock
* because page moving starts with an RCU grace period.
- */
+ *
+ * The RCU lock also protects the memcg from being freed.
+ */
rcu_read_lock();
if (mem_cgroup_disabled())
@@ -4443,7 +4447,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
struct bdi_writeback *wb)
{
- struct mem_cgroup *memcg = page_memcg(page);
+ struct mem_cgroup *memcg;
struct memcg_cgwb_frn *frn;
u64 now = get_jiffies_64();
u64 oldest_at = now;
@@ -4452,6 +4456,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
trace_track_foreign_dirty(page, wb);
+ memcg = get_mem_cgroup_from_page(page);
/*
* Pick the slot to use. If there is already a slot for @wb, keep
* using it. If not replace the oldest one which isn't being
@@ -4490,6 +4495,7 @@ void mem_cgroup_track_foreign_dirty_slowpath(struct page *page,
frn->memcg_id = wb->memcg_css->id;
frn->at = now;
}
+ css_put(&memcg->css);
}
/* issue foreign writeback flushes for recorded foreign dirtying events */
@@ -6014,6 +6020,14 @@ static void mem_cgroup_move_charge(void)
atomic_dec(&mc.from->moving_account);
}
+/*
+ * The cgroup migration and memory cgroup offlining are serialized by
+ * @cgroup_mutex. If we reach here, it means that the LRU pages cannot
+ * be reparented to their parent memory cgroup. So during the whole process
+ * of mem_cgroup_move_task(), page_memcg(page) is stable. So we do not
+ * need to worry about the memcg (returned from page_memcg()) being
+ * released even if we do not hold an rcu read lock.
+ */
static void mem_cgroup_move_task(void)
{
if (mc.to) {
@@ -6841,7 +6855,7 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
if (page_memcg(newpage))
return;
- memcg = page_memcg(oldpage);
+ memcg = get_mem_cgroup_from_page(oldpage);
VM_WARN_ON_ONCE_PAGE(!memcg, oldpage);
if (!memcg)
return;
@@ -6862,6 +6876,8 @@ void mem_cgroup_migrate(struct page *oldpage, struct page *newpage)
mem_cgroup_charge_statistics(memcg, newpage, nr_pages);
memcg_check_events(memcg, newpage);
local_irq_restore(flags);
+
+ css_put(&memcg->css);
}
DEFINE_STATIC_KEY_FALSE(memcg_sockets_enabled_key);
@@ -7050,6 +7066,10 @@ void mem_cgroup_swapout(struct page *page, swp_entry_t entry)
if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return;
+ /*
+ * Interrupts should be disabled by the caller (see the comments below),
+ * which can serve as RCU read-side critical sections.
+ */
memcg = page_memcg(page);
VM_WARN_ON_ONCE_PAGE(!memcg, page);
@@ -7117,15 +7137,16 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return 0;
+ rcu_read_lock();
memcg = page_memcg(page);
VM_WARN_ON_ONCE_PAGE(!memcg, page);
if (!memcg)
- return 0;
+ goto out;
if (!entry.val) {
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
- return 0;
+ goto out;
}
memcg = mem_cgroup_id_get_online(memcg);
@@ -7135,6 +7156,7 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
memcg_memory_event(memcg, MEMCG_SWAP_MAX);
memcg_memory_event(memcg, MEMCG_SWAP_FAIL);
mem_cgroup_id_put(memcg);
+ rcu_read_unlock();
return -ENOMEM;
}
@@ -7144,6 +7166,8 @@ int mem_cgroup_try_charge_swap(struct page *page, swp_entry_t entry)
oldid = swap_cgroup_record(entry, mem_cgroup_id(memcg), nr_pages);
VM_BUG_ON_PAGE(oldid, page);
mod_memcg_state(memcg, MEMCG_SWAP, nr_pages);
+out:
+ rcu_read_unlock();
return 0;
}
@@ -7198,17 +7222,22 @@ bool mem_cgroup_swap_full(struct page *page)
if (cgroup_memory_noswap || !cgroup_subsys_on_dfl(memory_cgrp_subsys))
return false;
+ rcu_read_lock();
memcg = page_memcg(page);
if (!memcg)
- return false;
+ goto out;
for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
unsigned long usage = page_counter_read(&memcg->swap);
if (usage * 2 >= READ_ONCE(memcg->swap.high) ||
- usage * 2 >= READ_ONCE(memcg->swap.max))
+ usage * 2 >= READ_ONCE(memcg->swap.max)) {
+ rcu_read_unlock();
return true;
+ }
}
+out:
+ rcu_read_unlock();
return false;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..9256693a9979 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -463,6 +463,10 @@ int migrate_page_move_mapping(struct address_space *mapping,
struct lruvec *old_lruvec, *new_lruvec;
struct mem_cgroup *memcg;
+ /*
+ * Irq is disabled, which can serve as RCU read-side critical
+ * sections.
+ */
memcg = page_memcg(page);
old_lruvec = mem_cgroup_lruvec(memcg, oldzone->zone_pgdat);
new_lruvec = mem_cgroup_lruvec(memcg, newzone->zone_pgdat);
diff --git a/mm/page_io.c b/mm/page_io.c
index c493ce9ebcf5..81744777ab76 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -269,13 +269,14 @@ static void bio_associate_blkg_from_page(struct bio *bio, struct page *page)
struct cgroup_subsys_state *css;
struct mem_cgroup *memcg;
+ rcu_read_lock();
memcg = page_memcg(page);
if (!memcg)
- return;
+ goto out;
- rcu_read_lock();
css = cgroup_e_css(memcg->css.cgroup, &io_cgrp_subsys);
bio_associate_blkg_from_css(bio, css);
+out:
rcu_read_unlock();
}
#else
--
2.11.0
On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> to simplify the code. We can have a single definition for this function
> that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> CONFIG_MEMCG.
>
> Signed-off-by: Muchun Song <[email protected]>
Looks good to me.
Acked-by: Johannes Weiner <[email protected]>
If you haven't done so yet, please make sure to explicitly test with
all three config combinations, just because the dummy abstractions for
memcg disabled or compiled out tend to be paper thin and don't always
behave the way you might expect when you do more complicated things.
Something like
boot
echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
echo 1 >/proc/sys/vm/compact_memory
should exercise this new function in a couple of important scenarios.
On Fri, Apr 09, 2021 at 08:29:46PM +0800, Muchun Song wrote:
> The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
> the css_set_lock. We do not need to care about objcg->memcg being
> released in the process of obj_cgroup_release(). So there is no need
> to pin memcg before releasing objcg. Remove that pinning logic to
> simplify the code.
Hm yeah, it's not clear to me why inherited objcgs pinned the memcg in
the first place, since they are reparented during memcg deletion and
therefore have no actual impact on the memcg's lifetime.
> There are only two places that modifies the objcg->memcg. One is the
> initialization to objcg->memcg in the memcg_online_kmem(), another
> is objcgs reparenting in the memcg_reparent_objcgs(). It is also
> impossible for the two to run in parallel. So xchg() is unnecessary
> and it is enough to use WRITE_ONCE().
Good catch.
> Signed-off-by: Muchun Song <[email protected]>
Looks like a nice cleanup / simplification:
Acked-by: Johannes Weiner <[email protected]>
On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> Because memory allocations pinning memcgs for a long time - it exists
> at a larger scale and is causing recurring problems in the real world:
> page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted
> into a new cgroup every time. Unreclaimable dying cgroups pile up,
> waste memory, and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then the page->memcg will always
> point to an object cgroup pointer.
>
> Therefore, the infrastructure of objcg no longer only serves
> CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> can reuse it to charge pages.
Just an observation on this:
We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
point. It used to be an optional feature, but nowadays it's not
configurable anymore, and always on unless slob is configured.
We've also added more than just slab accounting to it, like kernel
stack pages, and it all gets disabled on slob configs just because it
doesn't support slab object tracking.
We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.
But that's beyond the scope of your patch series, so I'm also okay
with this patch here.
> We know that the LRU pages are not accounted at the root level. But the
> page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> does not have an object cgroup. If we use obj_cgroup APIs to charge the
> LRU pages, we should set the page->memcg_data to a root object cgroup. So
> we also allocate an object cgroup for the root_mem_cgroup and introduce
> root_obj_cgroup to cache its value just like root_mem_cgroup.
>
> Signed-off-by: Muchun Song <[email protected]>
Overall, the patch makes sense to me. A few comments below:
> @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
> }
>
> -#ifdef CONFIG_MEMCG_KMEM
> extern spinlock_t css_set_lock;
>
> +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> +{
> + return objcg == root_obj_cgroup;
> +}
This function, and by extension root_obj_cgroup, aren't used by this
patch. Please move them to the patch that adds users for them.
> @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> percpu_ref_exit(ref);
> kfree_rcu(objcg, rcu);
> }
> +#else
> +static void obj_cgroup_release(struct percpu_ref *ref)
> +{
> + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> + unsigned long flags;
> +
> + spin_lock_irqsave(&css_set_lock, flags);
> + list_del(&objcg->list);
> + spin_unlock_irqrestore(&css_set_lock, flags);
> +
> + percpu_ref_exit(ref);
> + kfree_rcu(objcg, rcu);
> +}
> +#endif
Having two separate functions for if and else is good when the else
branch is a completely empty dummy function. In this case you end up
duplicating code, so it's better to have just one function and put the
ifdef around the nr_charged_bytes handling in it.
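
For illustration, a rough sketch of what such a merged function could look like, keeping only the byte-charge flush under the ifdef; nr_charged_bytes and obj_cgroup_uncharge_pages() are assumed from the existing CONFIG_MEMCG_KMEM version and may differ in detail:

```c
static void obj_cgroup_release(struct percpu_ref *ref)
{
	struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
	unsigned long flags;
#ifdef CONFIG_MEMCG_KMEM
	unsigned int nr_bytes = atomic_read(&objcg->nr_charged_bytes);
	unsigned int nr_pages = nr_bytes >> PAGE_SHIFT;

	/* Flush any leftover byte-sized charges before the objcg is freed. */
	WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
	if (nr_pages)
		obj_cgroup_uncharge_pages(objcg, nr_pages);
#endif
	spin_lock_irqsave(&css_set_lock, flags);
	list_del(&objcg->list);
	spin_unlock_irqrestore(&css_set_lock, flags);

	percpu_ref_exit(ref);
	kfree_rcu(objcg, rcu);
}
```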
> @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> return objcg;
> }
>
> -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> - struct mem_cgroup *parent)
> +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> {
> struct obj_cgroup *objcg, *iter;
> + struct mem_cgroup *parent;
> +
> + parent = parent_mem_cgroup(memcg);
> + if (!parent)
> + parent = root_mem_cgroup;
>
> objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
>
> @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> percpu_ref_kill(&objcg->refcnt);
> }
>
> +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> +{
> + struct obj_cgroup *objcg;
> +
> + objcg = obj_cgroup_alloc();
> + if (!objcg)
> + return -ENOMEM;
> +
> + objcg->memcg = memcg;
> + rcu_assign_pointer(memcg->objcg, objcg);
> +
> + return 0;
> +}
> +
> +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> +{
> + if (unlikely(memcg->objcg))
> + memcg_reparent_objcgs(memcg);
> +}
It's confusing to have a 'free' function not free the object it's
called on.
But rather than search for a fitting name, I think it might be better
to just fold both of these short functions into their only callsites.
Also, since memcg->objcg is reparented, and the pointer cleared, on
offlining, when could this ever be non-NULL? This deserves a comment.
> @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> #ifdef CONFIG_MEMCG_KMEM
> static int memcg_online_kmem(struct mem_cgroup *memcg)
> {
> - struct obj_cgroup *objcg;
> int memcg_id;
>
> if (cgroup_memory_nokmem)
> @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
> if (memcg_id < 0)
> return memcg_id;
>
> - objcg = obj_cgroup_alloc();
> - if (!objcg) {
> - memcg_free_cache_id(memcg_id);
> - return -ENOMEM;
> - }
> - objcg->memcg = memcg;
> - rcu_assign_pointer(memcg->objcg, objcg);
> -
> static_branch_enable(&memcg_kmem_enabled_key);
>
> memcg->kmemcg_id = memcg_id;
> @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
> if (!parent)
> parent = root_mem_cgroup;
>
> - memcg_reparent_objcgs(memcg, parent);
> + memcg_reparent_objcgs(memcg);
Since the objcg is no longer tied to kmem, this should move to
mem_cgroup_css_offline() instead.
On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
>
> Signed-off-by: Muchun Song <[email protected]>
Good catch.
Acked-by: Johannes Weiner <[email protected]>
Since this patch is somewhat independent of the rest of the series,
you may want to put it in the very beginning, or even submit it
separately, to keep the main series as compact as possible. Reviewers
can be more hesitant to get involved with larger series ;)
On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> set up pagevec as late as possible in shrink_inactive_list()"), its
> purpose is to delay the allocation of pagevec as late as possible to
> save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> not need noinline_for_stack, just remove it (let the compiler decide
> whether to inline).
>
> Signed-off-by: Muchun Song <[email protected]>
Acked-by: Roman Gushchin <[email protected]>
> ---
> mm/vmscan.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 64bf07cc20f2..e40b21298d77 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2015,8 +2015,8 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
> *
> * Returns the number of pages moved to the given lruvec.
> */
> -static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
> - struct list_head *list)
> +static unsigned int move_pages_to_lru(struct lruvec *lruvec,
> + struct list_head *list)
> {
> int nr_pages, nr_moved = 0;
> LIST_HEAD(pages_to_free);
> @@ -2096,7 +2096,7 @@ static int current_may_throttle(void)
> * shrink_inactive_list() is a helper for shrink_node(). It returns the number
> * of reclaimed pages
> */
> -static noinline_for_stack unsigned long
> +static unsigned long
> shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
> struct scan_control *sc, enum lru_list lru)
> {
> --
> 2.11.0
>
On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> Since the following patchsets applied. All the kernel memory are charged
> with the new APIs of obj_cgroup.
>
> [v17,00/19] The new cgroup slab memory controller
> [v5,0/7] Use obj_cgroup APIs to charge kmem pages
>
> But user memory allocations (LRU pages) pinning memcgs for a long time -
> it exists at a larger scale and is causing recurring problems in the real
> world: page cache doesn't get reclaimed for a long time, or is used by the
> second, third, fourth, ... instance of the same job that was restarted into
> a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg direction
> to fix this problem, and then the LRU pages will not pin the memcgs.
>
> This patchset aims to make the LRU pages to drop the reference to memory
> cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> of the dying cgroups will not increase if we run the following test script.
>
> ```bash
> #!/bin/bash
>
> cat /proc/cgroups | grep memory
>
> cd /sys/fs/cgroup/memory
>
> for i in range{1..500}
> do
> mkdir test
> echo $$ > test/cgroup.procs
> sleep 60 &
> echo $$ > cgroup.procs
> echo `cat test/cgroup.procs` > cgroup.procs
> rmdir test
> done
>
> cat /proc/cgroups | grep memory
> ```
>
> Patch 1 aims to fix page charging in page replacement.
> Patch 2-5 are code cleanup and simplification.
> Patch 6-18 convert LRU pages pin to the objcg direction.
>
> Any comments are welcome. Thanks.
Indeed the problem has existed for a long time and it would be nice to fix it.
However I'm against merging the patchset in the current form (there are some
nice fixes/clean-ups, which can/must be applied independently). Let me explain
my concerns:
Back to the new slab controller discussion obj_cgroup was suggested by Johannes
as a union of two concepts:
1) reparenting (basically an auto-pointer to a memcg in c++ terms)
2) byte-sized accounting
I was initially against this union because I anticipated that the reparenting
part would be useful separately. And time told it was true.
I still think obj_cgroup API must be significantly reworked before being
applied outside of the kmem area: reparenting part must be separated
and moved to the cgroup core level to be used not only in the memcg
context but also for other controllers, which are facing similar problems.
Spilling obj_cgroup API in the current form over all memcg code will
make it more complicated and will delay it, given the amount of changes
and the number of potential code conflicts.
I'm working on the generalization of obj_cgroup API (as described above)
and expect to have some patches next week.
Thanks!
On Sat, Apr 10, 2021 at 2:31 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> > The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> > set up pagevec as late as possible in shrink_inactive_list()"), its
> > purpose is to delay the allocation of pagevec as late as possible to
> > save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> > reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> > not need noinline_for_stack, just remove it (let the compiler decide
> > whether to inline).
> >
> > Signed-off-by: Muchun Song <[email protected]>
>
> Good catch.
>
> Acked-by: Johannes Weiner <[email protected]>
>
> Since this patch is somewhat independent of the rest of the series,
> you may want to put it in the very beginning, or even submit it
> separately, to keep the main series as compact as possible. Reviewers
> can be more hesitant to get involved with larger series ;)
OK. I will gather all the cleanup patches into a separate series.
Thanks for your suggestion.
On Fri, Apr 9, 2021 at 5:32 AM Muchun Song <[email protected]> wrote:
>
> The pages aren't accounted at the root level, so do not charge the page
> to the root memcg in page replacement. Although we do not display the
> value (mem_cgroup_usage) so there shouldn't be any actual problem, but
> there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
> will trigger? So it is better to fix it.
>
> Signed-off-by: Muchun Song <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
On Fri, Apr 9, 2021 at 9:34 PM Muchun Song <[email protected]> wrote:
>
> On Sat, Apr 10, 2021 at 2:31 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Fri, Apr 09, 2021 at 08:29:50PM +0800, Muchun Song wrote:
> > > The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
> > > set up pagevec as late as possible in shrink_inactive_list()"), its
> > > purpose is to delay the allocation of pagevec as late as possible to
> > > save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
> > > reclaim stack") replace pagevecs by lists of pages_to_free. So we do
> > > not need noinline_for_stack, just remove it (let the compiler decide
> > > whether to inline).
> > >
> > > Signed-off-by: Muchun Song <[email protected]>
> >
> > Good catch.
> >
> > Acked-by: Johannes Weiner <[email protected]>
> >
> > Since this patch is somewhat independent of the rest of the series,
> > you may want to put it in the very beginning, or even submit it
> > separately, to keep the main series as compact as possible. Reviewers
> > can be more hesitant to get involved with larger series ;)
>
> OK. I will gather all the cleanup patches into a separate series.
> Thanks for your suggestion.
That would be best.
For this patch:
Reviewed-by: Shakeel Butt <[email protected]>
On Fri, Apr 9, 2021 at 5:32 AM Muchun Song <[email protected]> wrote:
>
> When mm is NULL, we do not need to hold rcu lock and call css_tryget for
> the root memcg. And we also do not need to check !mm on every iteration
> of the while loop. So bail out early when !mm.
>
> Signed-off-by: Muchun Song <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
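
For reference, a minimal sketch of the early bail-out described in the quoted changelog; it is assumed to mirror the patch, and comments or surrounding details may differ from the actual code:

```c
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
	struct mem_cgroup *memcg;

	if (mem_cgroup_disabled())
		return NULL;

	/*
	 * Charges without an mm context (disk probing, loopback IO, etc.)
	 * go to the root memcg. No css_get is needed: reference counting
	 * is disabled on the root level (CSS_NO_REF), so return directly
	 * instead of looping with css_tryget() below.
	 */
	if (unlikely(!mm))
		return root_mem_cgroup;

	rcu_read_lock();
	do {
		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
		if (unlikely(!memcg))
			memcg = root_mem_cgroup;
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();

	return memcg;
}
```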
On Sat, Apr 10, 2021 at 12:00 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:45PM +0800, Muchun Song wrote:
> > We already have a helper lruvec_memcg() to get the memcg from lruvec, we
> > do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
> > lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
> > page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
> > of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
> > to simplify the code. We can have a single definition for this function
> > that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
> > CONFIG_MEMCG.
> >
> > Signed-off-by: Muchun Song <[email protected]>
>
> Looks good to me.
>
> Acked-by: Johannes Weiner <[email protected]>
Thanks for your review.
>
> If you haven't done so yet, please make sure to explicitly test with
> all three config combinations, just because the dummy abstractions for
> memcg disabled or compiled out tend to be paper thin and don't always
> behave the way you might expect when you do more complicated things.
I have tested. There is no problem. Thanks :-)
>
> Something like
>
> boot
> echo sparsefile >/dev/null (> ram size to fill memory and reclaim)
> echo 1 >/proc/sys/vm/compact_memory
>
> should exercise this new function in a couple of important scenarios.
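
For reference, a sketch of what the simplified helper boils down to, assuming the final shape implied by the quoted changelog (the real definition lives in include/linux/memcontrol.h):

```c
static inline bool lruvec_holds_page_lru_lock(struct page *page,
					      struct lruvec *lruvec)
{
	/*
	 * A single definition works for !CONFIG_MEMCG, for CONFIG_MEMCG with
	 * mem_cgroup_disabled(), and for CONFIG_MEMCG: compare the node and
	 * the memcg the lruvec belongs to against the page's.
	 */
	return lruvec_pgdat(lruvec) == page_pgdat(page) &&
	       lruvec_memcg(lruvec) == page_memcg(page);
}
```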
On Sat, Apr 10, 2021 at 12:56 AM Johannes Weiner <[email protected]> wrote:
>
> On Fri, Apr 09, 2021 at 08:29:47PM +0800, Muchun Song wrote:
> > Because memory allocations pinning memcgs for a long time - it exists
> > at a larger scale and is causing recurring problems in the real world:
> > page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted
> > into a new cgroup every time. Unreclaimable dying cgroups pile up,
> > waste memory, and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg
> > direction to fix this problem, and then the page->memcg will always
> > point to an object cgroup pointer.
> >
> > Therefore, the infrastructure of objcg no longer only serves
> > CONFIG_MEMCG_KMEM. In this patch, we move the infrastructure of the
> > objcg out of the scope of the CONFIG_MEMCG_KMEM so that the LRU pages
> > can reuse it to charge pages.
>
> Just an observation on this:
>
> We actually may want to remove CONFIG_MEMCG_KMEM altogether at this
> point. It used to be an optional feature, but nowadays it's not
> configurable anymore, and always on unless slob is configured.
>
> We've also added more than just slab accounting to it, like kernel
> stack pages, and it all gets disabled on slob configs just because it
> doesn't support slab object tracking.
>
> We could probably replace CONFIG_MEMCG_KMEM with CONFIG_MEMCG in most
> places, and add a couple of !CONFIG_SLOB checks in the slab callbacks.
>
> But that's beyond the scope of your patch series, so I'm also okay
> with this patch here.
>
> > We know that the LRU pages are not accounted at the root level. But the
> > page->memcg_data points to the root_mem_cgroup. So the page->memcg_data
> > of the LRU pages always points to a valid pointer. But the root_mem_cgroup
> > does not have an object cgroup. If we use obj_cgroup APIs to charge the
> > LRU pages, we should set the page->memcg_data to a root object cgroup. So
> > we also allocate an object cgroup for the root_mem_cgroup and introduce
> > root_obj_cgroup to cache its value just like root_mem_cgroup.
> >
> > Signed-off-by: Muchun Song <[email protected]>
>
> Overall, the patch makes sense to me. A few comments below:
>
> > @@ -252,9 +253,14 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
> > }
> >
> > -#ifdef CONFIG_MEMCG_KMEM
> > extern spinlock_t css_set_lock;
> >
> > +static inline bool obj_cgroup_is_root(struct obj_cgroup *objcg)
> > +{
> > + return objcg == root_obj_cgroup;
> > +}
>
> This function, and by extension root_obj_cgroup, aren't used by this
> patch. Please move them to the patch that adds users for them.
OK. Will do.
>
> > @@ -298,6 +304,20 @@ static void obj_cgroup_release(struct percpu_ref *ref)
> > percpu_ref_exit(ref);
> > kfree_rcu(objcg, rcu);
> > }
> > +#else
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > + unsigned long flags;
> > +
> > + spin_lock_irqsave(&css_set_lock, flags);
> > + list_del(&objcg->list);
> > + spin_unlock_irqrestore(&css_set_lock, flags);
> > +
> > + percpu_ref_exit(ref);
> > + kfree_rcu(objcg, rcu);
> > +}
> > +#endif
>
> Having two separate functions for if and else is good when the else
> branch is a completely empty dummy function. In this case you end up
> duplicating code, so it's better to have just one function and put the
> ifdef around the nr_charged_bytes handling in it.
Makes sense. I will rework the code here.
>
> > @@ -318,10 +338,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
> > return objcg;
> > }
> >
> > -static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > - struct mem_cgroup *parent)
> > +static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
> > {
> > struct obj_cgroup *objcg, *iter;
> > + struct mem_cgroup *parent;
> > +
> > + parent = parent_mem_cgroup(memcg);
> > + if (!parent)
> > + parent = root_mem_cgroup;
> >
> > objcg = rcu_replace_pointer(memcg->objcg, NULL, true);
> >
> > @@ -342,6 +366,27 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
> > percpu_ref_kill(&objcg->refcnt);
> > }
> >
> > +static int memcg_obj_cgroup_alloc(struct mem_cgroup *memcg)
> > +{
> > + struct obj_cgroup *objcg;
> > +
> > + objcg = obj_cgroup_alloc();
> > + if (!objcg)
> > + return -ENOMEM;
> > +
> > + objcg->memcg = memcg;
> > + rcu_assign_pointer(memcg->objcg, objcg);
> > +
> > + return 0;
> > +}
> > +
> > +static void memcg_obj_cgroup_free(struct mem_cgroup *memcg)
> > +{
> > + if (unlikely(memcg->objcg))
> > + memcg_reparent_objcgs(memcg);
> > +}
>
> It's confusing to have a 'free' function not free the object it's
> called on.
>
> But rather than search for a fitting name, I think it might be better
> to just fold both of these short functions into their only callsites.
OK. Will do.
>
> Also, since memcg->objcg is reparented, and the pointer cleared, on
> offlining, when could this ever be non-NULL? This deserves a comment.
If css_alloc() fails, offlining never happens. In that case, memcg->objcg
could be non-NULL (just as memcg_free_kmem() handles it today). I will move
memcg_obj_cgroup_alloc() to mem_cgroup_css_online() so that
we do not need memcg_obj_cgroup_free().
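
For what it's worth, a rough sketch of that direction, as an assumption of how it could look rather than the posted patch; the ID-pinning part mirrors the existing mem_cgroup_css_online():

```c
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
	struct obj_cgroup *objcg;

	/*
	 * Allocate the objcg here: if this fails, onlining fails and the
	 * free path never sees a memcg with a live objcg, so no special
	 * "free" helper is needed.
	 */
	objcg = obj_cgroup_alloc();
	if (!objcg)
		return -ENOMEM;

	objcg->memcg = memcg;
	rcu_assign_pointer(memcg->objcg, objcg);

	/* Online state pins the memcg ID, the memcg ID pins the CSS. */
	refcount_set(&memcg->id.ref, 1);
	css_get(css);
	return 0;
}
```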
>
> > @@ -3444,7 +3489,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
> > #ifdef CONFIG_MEMCG_KMEM
> > static int memcg_online_kmem(struct mem_cgroup *memcg)
> > {
> > - struct obj_cgroup *objcg;
> > int memcg_id;
> >
> > if (cgroup_memory_nokmem)
> > @@ -3457,14 +3501,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
> > if (memcg_id < 0)
> > return memcg_id;
> >
> > - objcg = obj_cgroup_alloc();
> > - if (!objcg) {
> > - memcg_free_cache_id(memcg_id);
> > - return -ENOMEM;
> > - }
> > - objcg->memcg = memcg;
> > - rcu_assign_pointer(memcg->objcg, objcg);
> > -
> > static_branch_enable(&memcg_kmem_enabled_key);
> >
> > memcg->kmemcg_id = memcg_id;
> > @@ -3488,7 +3524,7 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
> > if (!parent)
> > parent = root_mem_cgroup;
> >
> > - memcg_reparent_objcgs(memcg, parent);
> > + memcg_reparent_objcgs(memcg);
>
> Since the objcg is no longer tied to kmem, this should move to
> mem_cgroup_css_offline() instead.
LGTM, will do.
Thanks for all your suggestions.
On Mon, Apr 12, 2021 at 01:14:57PM -0400, Johannes Weiner wrote:
> On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> > On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > > Since the following patchsets applied. All the kernel memory are charged
> > > with the new APIs of obj_cgroup.
> > >
> > > [v17,00/19] The new cgroup slab memory controller
> > > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> > >
> > > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > > it exists at a larger scale and is causing recurring problems in the real
> > > world: page cache doesn't get reclaimed for a long time, or is used by the
> > > second, third, fourth, ... instance of the same job that was restarted into
> > > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > > and make page reclaim very inefficient.
> > >
> > > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > > to fix this problem, and then the LRU pages will not pin the memcgs.
> > >
> > > This patchset aims to make the LRU pages to drop the reference to memory
> > > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > > of the dying cgroups will not increase if we run the following test script.
> > >
> > > ```bash
> > > #!/bin/bash
> > >
> > > cat /proc/cgroups | grep memory
> > >
> > > cd /sys/fs/cgroup/memory
> > >
> > > for i in range{1..500}
> > > do
> > > mkdir test
> > > echo $$ > test/cgroup.procs
> > > sleep 60 &
> > > echo $$ > cgroup.procs
> > > echo `cat test/cgroup.procs` > cgroup.procs
> > > rmdir test
> > > done
> > >
> > > cat /proc/cgroups | grep memory
> > > ```
> > >
> > > Patch 1 aims to fix page charging in page replacement.
> > > Patch 2-5 are code cleanup and simplification.
> > > Patch 6-18 convert LRU pages pin to the objcg direction.
> > >
> > > Any comments are welcome. Thanks.
> >
> > Indeed the problem has existed for a long time and it would be nice to fix it.
> > However I'm against merging the patchset in the current form (there are some
> > nice fixes/clean-ups, which can/must be applied independently). Let me explain
> > my concerns:
> >
> > Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> > as a union of two concepts:
> > 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> > 2) byte-sized accounting
> >
> > I was initially against this union because I anticipated that the reparenting
> > part would be useful separately. And time told it was true.
>
> "The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
> level is really good."
>
> https://lore.kernel.org/lkml/[email protected]/
>
> If you recall, the main concern was how the byte charging interface
> was added to the existing page charging interface, instead of being
> layered on top of it. I suggested to do that and, since there was no
> other user for the indirection pointer, just include it in the API.
>
> It made sense at the time, and you seemed to agree. But I also agree
> it makes sense to factor it out now that more users are materializing.
Agreed.
>
> > I still think obj_cgroup API must be significantly reworked before being
> > applied outside of the kmem area: reparenting part must be separated
> > and moved to the cgroup core level to be used not only in the memcg
> > context but also for other controllers, which are facing similar problems.
> > Spilling obj_cgroup API in the current form over all memcg code will
> > make it more complicated and will delay it, given the amount of changes
> > and the number of potential code conflicts.
> >
> > I'm working on the generalization of obj_cgroup API (as described above)
> > and expect to have some patches next week.
>
> Yeah, splitting the byte charging API from the reference API and
> making the latter cgroup-generic makes sense. I'm looking forward to
> your patches.
>
> And yes, the conflicts between that work and Muchun's patches would be
> quite large. However, most of them would come down to renames, since
> the access rules and refcounting sites will remain the same, so it
> shouldn't be too bad to rebase Muchun's patches on yours. And we can
> continue reviewing his patches for correctness for now.
Sounds good to me!
Thanks
On Fri, Apr 09, 2021 at 06:29:46PM -0700, Roman Gushchin wrote:
> On Fri, Apr 09, 2021 at 08:29:41PM +0800, Muchun Song wrote:
> > Since the following patchsets applied. All the kernel memory are charged
> > with the new APIs of obj_cgroup.
> >
> > [v17,00/19] The new cgroup slab memory controller
> > [v5,0/7] Use obj_cgroup APIs to charge kmem pages
> >
> > But user memory allocations (LRU pages) pinning memcgs for a long time -
> > it exists at a larger scale and is causing recurring problems in the real
> > world: page cache doesn't get reclaimed for a long time, or is used by the
> > second, third, fourth, ... instance of the same job that was restarted into
> > a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg direction
> > to fix this problem, and then the LRU pages will not pin the memcgs.
> >
> > This patchset aims to make the LRU pages to drop the reference to memory
> > cgroup by using the APIs of obj_cgroup. Finally, we can see that the number
> > of the dying cgroups will not increase if we run the following test script.
> >
> > ```bash
> > #!/bin/bash
> >
> > cat /proc/cgroups | grep memory
> >
> > cd /sys/fs/cgroup/memory
> >
> > for i in range{1..500}
> > do
> > mkdir test
> > echo $$ > test/cgroup.procs
> > sleep 60 &
> > echo $$ > cgroup.procs
> > echo `cat test/cgroup.procs` > cgroup.procs
> > rmdir test
> > done
> >
> > cat /proc/cgroups | grep memory
> > ```
> >
> > Patch 1 aims to fix page charging in page replacement.
> > Patch 2-5 are code cleanup and simplification.
> > Patch 6-18 convert LRU pages pin to the objcg direction.
> >
> > Any comments are welcome. Thanks.
>
> Indeed the problem has existed for a long time and it would be nice to fix it.
> However I'm against merging the patchset in the current form (there are some
> nice fixes/clean-ups, which can/must be applied independently). Let me explain
> my concerns:
>
> Back to the new slab controller discussion obj_cgroup was suggested by Johannes
> as a union of two concepts:
> 1) reparenting (basically an auto-pointer to a memcg in c++ terms)
> 2) byte-sized accounting
>
> I was initially against this union because I anticipated that the reparenting
> part would be useful separately. And time told it was true.
"The idea of moving stocks and leftovers to the memcg_ptr/obj_cgroup
level is really good."
https://lore.kernel.org/lkml/[email protected]/
If you recall, the main concern was how the byte charging interface
was added to the existing page charging interface, instead of being
layered on top of it. I suggested to do that and, since there was no
other user for the indirection pointer, just include it in the API.
It made sense at the time, and you seemed to agree. But I also agree
it makes sense to factor it out now that more users are materializing.
> I still think obj_cgroup API must be significantly reworked before being
> applied outside of the kmem area: reparenting part must be separated
> and moved to the cgroup core level to be used not only in the memcg
> context but also for other controllers, which are facing similar problems.
> Spilling obj_cgroup API in the current form over all memcg code will
> make it more complicated and will delay it, given the amount of changes
> and the number of potential code conflicts.
>
> I'm working on the generalization of obj_cgroup API (as described above)
> and expect to have some patches next week.
Yeah, splitting the byte charging API from the reference API and
making the latter cgroup-generic makes sense. I'm looking forward to
your patches.
And yes, the conflicts between that work and Muchun's patches would be
quite large. However, most of them would come down to renames, since
the access rules and refcounting sites will remain the same, so it
shouldn't be too bad to rebase Muchun's patches on yours. And we can
continue reviewing his patches for correctness for now.
Thanks