2021-04-21 07:46:10

by Muchun Song

Subject: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

This is v3, based on top of the series[1] (memcontrol code cleanup and
simplification). Roman is working on the generalization of the obj_cgroup
API, but before that lands, I hope someone can review these patches for
correctness.

Since the following patchsets were applied, all kernel memory is charged
with the new obj_cgroup APIs:

[v17,00/19] The new cgroup slab memory controller[2]
[v5,0/7] Use obj_cgroup APIs to charge kmem pages[3]

But user memory allocations (LRU pages) can pin memcgs for a long time -
this happens at a larger scale and causes recurring problems in the real
world: page cache doesn't get reclaimed for a long time, or is used by the
second, third, fourth, ... instance of the same job that was restarted into
a new cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.

We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem; after that, the LRU pages will no longer
pin the memcgs.
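
The objcg direction can be pictured with a minimal sketch (hypothetical
sketch_* names, much simplified from the real structures and their RCU
and refcount handling):

```c
struct sketch_mem_cgroup;

struct sketch_obj_cgroup {
	/* Switched to the parent memcg when the cgroup goes offline. */
	struct sketch_mem_cgroup *memcg;
};

struct sketch_page {
	/*
	 * Instead of pointing at a mem_cgroup directly (and pinning it
	 * until the page is reclaimed), the page references an
	 * obj_cgroup that can be reparented.
	 */
	struct sketch_obj_cgroup *objcg;
};

/*
 * On cgroup offline, one pointer update redirects every page that
 * referenced this objcg to the parent, so the dying memcg can be freed.
 */
static void sketch_reparent(struct sketch_obj_cgroup *objcg,
			    struct sketch_mem_cgroup *parent)
{
	objcg->memcg = parent;
}
```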

This patchset makes the LRU pages drop their reference to the memory
cgroup by using the obj_cgroup APIs. As a result, the number of dying
cgroups no longer grows when we run the following test script.

```bash
#!/bin/bash

# Memory cgroup count before the test.
cat /proc/cgroups | grep memory

cd /sys/fs/cgroup/memory

for i in {1..500}
do
	mkdir test
	# Enter the new cgroup and charge some pages to it.
	echo $$ > test/cgroup.procs
	sleep 60 &
	# Move the shell back to the root cgroup.
	echo $$ > cgroup.procs
	# Move the background sleep out as well, so "test" becomes empty.
	echo `cat test/cgroup.procs` > cgroup.procs
	rmdir test
done

# Memory cgroup count after the test; dying "test" cgroups pinned by
# LRU pages keep this number inflated.
cat /proc/cgroups | grep memory
```
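
For reference, the two "cat /proc/cgroups | grep memory" invocations print,
among other fields, the num_cgroups count for the memory controller. That
count is what the script compares: if the removed "test" cgroups linger in
a dying state, pinned by their LRU pages, the second count stays inflated;
with this series applied, it returns to roughly the first count.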

Thanks.

[1] https://lore.kernel.org/linux-mm/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/
[3] https://lore.kernel.org/linux-mm/[email protected]/

Changelogs in RFC v3:
1. Drop the code cleanup and simplification patches. Gather those patches
into a separate series[1].
2. Rework patch #1 as suggested by Johannes.

Changelogs in RFC v2:
1. Collect Acked-by tags from Johannes. Thanks.
2. Rework lruvec_holds_page_lru_lock() as suggested by Johannes. Thanks.
3. Fix move_pages_to_lru().

Muchun Song (12):
mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM
mm: memcontrol: introduce compact_lock_page_lruvec_irqsave
mm: memcontrol: make lruvec lock safe when the LRU pages reparented
mm: vmscan: rework move_pages_to_lru()
mm: thp: introduce lock/unlock_split_queue{_irqsave}()
mm: thp: make deferred split queue lock safe when the LRU pages
reparented
mm: memcontrol: make all the callers of page_memcg() safe
mm: memcontrol: introduce memcg_reparent_ops
mm: memcontrol: use obj_cgroup APIs to charge the LRU pages
mm: memcontrol: rename {un}lock_page_memcg() to {un}lock_page_objcg()
mm: lru: add VM_BUG_ON_PAGE to lru maintenance function
mm: lru: use lruvec lock to serialize memcg changes

Documentation/admin-guide/cgroup-v1/memory.rst | 2 +-
fs/buffer.c | 13 +-
fs/fs-writeback.c | 23 +-
fs/iomap/buffered-io.c | 4 +-
include/linux/memcontrol.h | 182 ++++----
include/linux/mm_inline.h | 6 +
mm/compaction.c | 36 +-
mm/filemap.c | 2 +-
mm/huge_memory.c | 171 ++++++--
mm/memcontrol.c | 562 ++++++++++++++++++-------
mm/migrate.c | 4 +
mm/page-writeback.c | 24 +-
mm/page_io.c | 5 +-
mm/rmap.c | 14 +-
mm/swap.c | 46 +-
mm/vmscan.c | 56 ++-
16 files changed, 795 insertions(+), 355 deletions(-)

--
2.11.0


2021-04-21 08:32:48

by Muchun Song

Subject: [RFC PATCH v3 01/12] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM

Memory allocations can pin memcgs for a long time - this happens at a
large scale and causes recurring problems in the real world: page cache
doesn't get reclaimed for a long time, or is used by the second, third,
fourth, ... instance of the same job that was restarted into a new
cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
and make page reclaim very inefficient.

We can convert LRU pages and most other raw memcg pins to the objcg
direction to fix this problem, and then page->memcg will always point
to an object cgroup.

Therefore, the objcg infrastructure no longer serves only
CONFIG_MEMCG_KMEM. In this patch, we move the objcg infrastructure out
of the scope of CONFIG_MEMCG_KMEM so that the LRU pages can reuse it
to charge pages.

We know that the LRU pages are not accounted at the root level, but
the page->memcg_data of those pages still points to the root_mem_cgroup,
so the page->memcg_data of an LRU page always points to a valid pointer.
However, the root_mem_cgroup does not have an object cgroup. If we use
the obj_cgroup APIs to charge the LRU pages, we should set
page->memcg_data to a root object cgroup. So we also allocate an
object cgroup for the root_mem_cgroup.
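
As an illustration (a hedged sketch with a hypothetical helper; the real
lookup also masks flag bits out of page->memcg_data):

```c
static struct mem_cgroup *sketch_page_memcg(struct page *page)
{
	struct obj_cgroup *objcg = (struct obj_cgroup *)page->memcg_data;

	/*
	 * Root LRU pages must reference the root objcg rather than NULL,
	 * otherwise this dereference would crash -- hence allocating an
	 * objcg for the root_mem_cgroup as well.
	 */
	return objcg->memcg;
}
```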

Signed-off-by: Muchun Song <[email protected]>
---
include/linux/memcontrol.h | 4 ++-
mm/memcontrol.c | 62 ++++++++++++++++++++++++++++++----------------
2 files changed, 43 insertions(+), 23 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0ce97eff79e2..53a19db9f8eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -223,7 +223,9 @@ struct memcg_cgwb_frn {
struct obj_cgroup {
struct percpu_ref refcnt;
struct mem_cgroup *memcg;
+#ifdef CONFIG_MEMCG_KMEM
atomic_t nr_charged_bytes;
+#endif
union {
struct list_head list;
struct rcu_head rcu;
@@ -321,9 +323,9 @@ struct mem_cgroup {
#ifdef CONFIG_MEMCG_KMEM
int kmemcg_id;
enum memcg_kmem_state kmem_state;
+#endif
struct obj_cgroup __rcu *objcg;
struct list_head objcg_list; /* list of inherited objcgs */
-#endif

MEMCG_PADDING(_pad2_);

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e0c398fe7443..dae7ab2b1406 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -252,18 +252,16 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
return &container_of(vmpr, struct mem_cgroup, vmpressure)->css;
}

-#ifdef CONFIG_MEMCG_KMEM
extern spinlock_t css_set_lock;

+#ifdef CONFIG_MEMCG_KMEM
static void obj_cgroup_uncharge_pages(struct obj_cgroup *objcg,
unsigned int nr_pages);

-static void obj_cgroup_release(struct percpu_ref *ref)
+static void obj_cgroup_release_uncharge(struct obj_cgroup *objcg)
{
- struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
unsigned int nr_bytes;
unsigned int nr_pages;
- unsigned long flags;

/*
* At this point all allocated objects are freed, and
@@ -291,6 +289,19 @@ static void obj_cgroup_release(struct percpu_ref *ref)

if (nr_pages)
obj_cgroup_uncharge_pages(objcg, nr_pages);
+}
+#else
+static inline void obj_cgroup_release_uncharge(struct obj_cgroup *objcg)
+{
+}
+#endif
+
+static void obj_cgroup_release(struct percpu_ref *ref)
+{
+ struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
+ unsigned long flags;
+
+ obj_cgroup_release_uncharge(objcg);

spin_lock_irqsave(&css_set_lock, flags);
list_del(&objcg->list);
@@ -319,10 +330,14 @@ static struct obj_cgroup *obj_cgroup_alloc(void)
return objcg;
}

-static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
- struct mem_cgroup *parent)
+static void memcg_reparent_objcgs(struct mem_cgroup *memcg)
{
struct obj_cgroup *objcg, *iter;
+ struct mem_cgroup *parent;
+
+ parent = parent_mem_cgroup(memcg);
+ if (!parent)
+ parent = root_mem_cgroup;

objcg = rcu_replace_pointer(memcg->objcg, NULL, true);

@@ -341,6 +356,7 @@ static void memcg_reparent_objcgs(struct mem_cgroup *memcg,
percpu_ref_kill(&objcg->refcnt);
}

+#ifdef CONFIG_MEMCG_KMEM
/*
* This will be used as a shrinker list's index.
* The main reason for not using cgroup id for this:
@@ -3447,7 +3463,6 @@ static u64 mem_cgroup_read_u64(struct cgroup_subsys_state *css,
#ifdef CONFIG_MEMCG_KMEM
static int memcg_online_kmem(struct mem_cgroup *memcg)
{
- struct obj_cgroup *objcg;
int memcg_id;

if (cgroup_memory_nokmem)
@@ -3460,14 +3475,6 @@ static int memcg_online_kmem(struct mem_cgroup *memcg)
if (memcg_id < 0)
return memcg_id;

- objcg = obj_cgroup_alloc();
- if (!objcg) {
- memcg_free_cache_id(memcg_id);
- return -ENOMEM;
- }
- objcg->memcg = memcg;
- rcu_assign_pointer(memcg->objcg, objcg);
-
static_branch_enable(&memcg_kmem_enabled_key);

memcg->kmemcg_id = memcg_id;
@@ -3491,8 +3498,6 @@ static void memcg_offline_kmem(struct mem_cgroup *memcg)
if (!parent)
parent = root_mem_cgroup;

- memcg_reparent_objcgs(memcg, parent);
-
kmemcg_id = memcg->kmemcg_id;
BUG_ON(kmemcg_id < 0);

@@ -5036,8 +5041,8 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
memcg->socket_pressure = jiffies;
#ifdef CONFIG_MEMCG_KMEM
memcg->kmemcg_id = -1;
- INIT_LIST_HEAD(&memcg->objcg_list);
#endif
+ INIT_LIST_HEAD(&memcg->objcg_list);
#ifdef CONFIG_CGROUP_WRITEBACK
INIT_LIST_HEAD(&memcg->cgwb_list);
for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
@@ -5109,21 +5114,33 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css)
static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct obj_cgroup *objcg;

/*
* A memcg must be visible for expand_shrinker_info()
* by the time the maps are allocated. So, we allocate maps
* here, when for_each_mem_cgroup() can't skip it.
*/
- if (alloc_shrinker_info(memcg)) {
- mem_cgroup_id_remove(memcg);
- return -ENOMEM;
- }
+ if (alloc_shrinker_info(memcg))
+ goto remove_id;
+
+ objcg = obj_cgroup_alloc();
+ if (!objcg)
+ goto free_shrinker;
+
+ objcg->memcg = memcg;
+ rcu_assign_pointer(memcg->objcg, objcg);

/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
css_get(css);
return 0;
+
+free_shrinker:
+ free_shrinker_info(memcg);
+remove_id:
+ mem_cgroup_id_remove(memcg);
+ return -ENOMEM;
}

static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
@@ -5147,6 +5164,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
page_counter_set_low(&memcg->memory, 0);

memcg_offline_kmem(memcg);
+ memcg_reparent_objcgs(memcg);
reparent_shrinker_deferred(memcg);
wb_memcg_offline(memcg);

--
2.11.0

2021-04-21 08:32:52

by Muchun Song

Subject: [RFC PATCH v3 04/12] mm: vmscan: rework move_pages_to_lru()

In a later patch, we will reparent the LRU pages. A page that is being
moved to the appropriate LRU list can be reparented in the middle of
move_pages_to_lru(), so it is wrong for the caller to hold a single
lruvec lock across the whole operation. Instead, use the more general
interface relock_page_lruvec_irq() to acquire the correct lruvec lock
for each page.
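
In condensed form, the reworked loop follows this pattern (a sketch only;
the unevictable- and compound-page handling is in the diff below):

```c
static unsigned int sketch_move_pages_to_lru(struct list_head *list)
{
	struct lruvec *lruvec = NULL;
	unsigned int nr_moved = 0;

	while (!list_empty(list)) {
		struct page *page = lru_to_page(list);

		/*
		 * Switch to the lruvec this page currently belongs to,
		 * unlocking the previously held one if they differ. This
		 * remains correct even if the page was just reparented.
		 */
		lruvec = relock_page_lruvec_irq(page, lruvec);
		list_del(&page->lru);
		SetPageLRU(page);
		add_page_to_lru_list(page, lruvec);
		nr_moved += thp_nr_pages(page);
	}

	if (lruvec)
		unlock_page_lruvec_irq(lruvec);
	return nr_moved;
}
```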

Signed-off-by: Muchun Song <[email protected]>
---
mm/vmscan.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d2727b78df9..02d14a377b0e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2012,23 +2012,27 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
* move_pages_to_lru() moves pages from private @list to appropriate LRU list.
* On return, @list is reused as a list of pages to be freed by the caller.
*
- * Returns the number of pages moved to the given lruvec.
+ * Returns the number of pages moved to the appropriate LRU list.
+ *
+ * Note: The caller must not hold any lruvec lock.
*/
-static unsigned int move_pages_to_lru(struct lruvec *lruvec,
- struct list_head *list)
+static unsigned int move_pages_to_lru(struct list_head *list)
{
- int nr_pages, nr_moved = 0;
+ int nr_moved = 0;
+ struct lruvec *lruvec = NULL;
LIST_HEAD(pages_to_free);
- struct page *page;

while (!list_empty(list)) {
- page = lru_to_page(list);
+ int nr_pages;
+ struct page *page = lru_to_page(list);
+
+ lruvec = relock_page_lruvec_irq(page, lruvec);
VM_BUG_ON_PAGE(PageLRU(page), page);
list_del(&page->lru);
if (unlikely(!page_evictable(page))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
putback_lru_page(page);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
continue;
}

@@ -2049,19 +2053,15 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
__clear_page_lru_flags(page);

if (unlikely(PageCompound(page))) {
- spin_unlock_irq(&lruvec->lru_lock);
+ unlock_page_lruvec_irq(lruvec);
destroy_compound_page(page);
- spin_lock_irq(&lruvec->lru_lock);
+ lruvec = NULL;
} else
list_add(&page->lru, &pages_to_free);

continue;
}

- /*
- * All pages were isolated from the same lruvec (and isolation
- * inhibits memcg migration).
- */
VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page);
add_page_to_lru_list(page, lruvec);
nr_pages = thp_nr_pages(page);
@@ -2070,6 +2070,8 @@ static unsigned int move_pages_to_lru(struct lruvec *lruvec,
workingset_age_nonresident(lruvec, nr_pages);
}

+ if (lruvec)
+ unlock_page_lruvec_irq(lruvec);
/*
* To save our caller's stack, now use input list for pages to free.
*/
@@ -2143,16 +2145,16 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, &stat, false);

- spin_lock_irq(&lruvec->lru_lock);
- move_pages_to_lru(lruvec, &page_list);
+ move_pages_to_lru(&page_list);

+ local_irq_disable();
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
if (!cgroup_reclaim(sc))
__count_vm_events(item, nr_reclaimed);
__count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
__count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();

lru_note_cost(lruvec, file, stat.nr_pageout);
mem_cgroup_uncharge_list(&page_list);
@@ -2279,18 +2281,16 @@ static void shrink_active_list(unsigned long nr_to_scan,
/*
* Move pages back to the lru list.
*/
- spin_lock_irq(&lruvec->lru_lock);
-
- nr_activate = move_pages_to_lru(lruvec, &l_active);
- nr_deactivate = move_pages_to_lru(lruvec, &l_inactive);
+ nr_activate = move_pages_to_lru(&l_active);
+ nr_deactivate = move_pages_to_lru(&l_inactive);
/* Keep all free pages in l_active list */
list_splice(&l_inactive, &l_active);

+ local_irq_disable();
__count_vm_events(PGDEACTIVATE, nr_deactivate);
__count_memcg_events(lruvec_memcg(lruvec), PGDEACTIVATE, nr_deactivate);
-
__mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
- spin_unlock_irq(&lruvec->lru_lock);
+ local_irq_enable();

mem_cgroup_uncharge_list(&l_active);
free_unref_page_list(&l_active);
--
2.11.0

2021-05-19 18:15:08

by Roman Gushchin

Subject: Re: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

Hi Muchun!

It looks like the writeback problem will be solved in a different way, one which will not require generalizing the obj_cgroup API to the cgroup level. It's not fully confirmed yet, though. We still might want to do this generalization long-term, but as of now I have no objections to continuing the work on your patchset. I'm on PTO this week, but will take a deeper look at your patches early next week. Sorry for the delay.

Thanks!

Sent from my iPhone

> On May 18, 2021, at 06:50, Muchun Song <[email protected]> wrote:
>
> Ping...
>
> Hi Johannes and Roman,
>
> Any suggestions on this patch set?
>
> Thanks.
>
>> On Wed, Apr 21, 2021 at 3:01 PM Muchun Song <[email protected]> wrote:
>>
>> [...]

2021-05-19 18:15:35

by Muchun Song

Subject: Re: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

Ping...

Hi Johannes and Roman,

Any suggestions on this patch set?

Thanks.

On Wed, Apr 21, 2021 at 3:01 PM Muchun Song <[email protected]> wrote:
>
> [...]

2021-05-20 03:24:36

by Muchun Song

Subject: Re: [External] Re: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

On Tue, May 18, 2021 at 10:17 PM Roman Gushchin <[email protected]> wrote:
>
> Hi Muchun!
>
> It looks like the writeback problem will be solved in a different way, one which will not require generalizing the obj_cgroup API to the cgroup level. It's not fully confirmed yet, though. We still might want to do this generalization long-term, but as of now I have no objections to continuing the work on your patchset. I'm on PTO this week, but will take a deeper look at your patches early next week. Sorry for the delay.

Waiting on your review. Thanks Roman.

>
> Thanks!
>
> [...]
2021-05-25 18:02:40

by Roman Gushchin

Subject: Re: [RFC PATCH v3 01/12] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM

On Wed, Apr 21, 2021 at 03:00:48PM +0800, Muchun Song wrote:
> Memory allocations can pin memcgs for a long time - this happens at a
> large scale and causes recurring problems in the real world: page cache
> doesn't get reclaimed for a long time, or is used by the second, third,
> fourth, ... instance of the same job that was restarted into a new
> cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> and make page reclaim very inefficient.
>
> We can convert LRU pages and most other raw memcg pins to the objcg
> direction to fix this problem, and then page->memcg will always point
> to an object cgroup.
>
> Therefore, the objcg infrastructure no longer serves only
> CONFIG_MEMCG_KMEM. In this patch, we move the objcg infrastructure out
> of the scope of CONFIG_MEMCG_KMEM so that the LRU pages can reuse it
> to charge pages.
>
> We know that the LRU pages are not accounted at the root level, but
> the page->memcg_data of those pages still points to the root_mem_cgroup,
> so the page->memcg_data of an LRU page always points to a valid pointer.
> However, the root_mem_cgroup does not have an object cgroup. If we use
> the obj_cgroup APIs to charge the LRU pages, we should set
> page->memcg_data to a root object cgroup. So we also allocate an
> object cgroup for the root_mem_cgroup.

Overall the patch looks very good to me. There are a few small things to enhance:

1) I'd rename it. Looking at the title I expect a trivial code move;
however, the patch is doing more than this: e.g. allocating an objcg
for the root memcg. Something like "prepare objcg API for non-kmem usage".
2) How about obj_cgroup_release_kmem() instead of obj_cgroup_release_uncharge()?
3) The first paragraph of the commit log looks a bit vague: which allocations
are pinning memcgs? How about something like this?

Pagecache pages are charged at allocation time and hold a reference
to the original memory cgroup until being reclaimed. Depending on the memory
pressure, specific patterns of page sharing between different cgroups, and
the cgroup creation and destruction rates, a large number of dying memory
cgroups can be pinned by pagecache pages. This makes page reclaim less
efficient and wastes memory.


Thanks!

2021-05-25 18:20:26

by Roman Gushchin

Subject: Re: [External] Re: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

On Thu, May 20, 2021 at 11:20:47AM +0800, Muchun Song wrote:
> On Tue, May 18, 2021 at 10:17 PM Roman Gushchin <[email protected]> wrote:
> >
> > Hi Muchun!
> >
> > > It looks like the writeback problem will be solved in a different way, one which will not require generalizing the obj_cgroup API to the cgroup level. It's not fully confirmed yet, though. We still might want to do this generalization long-term, but as of now I have no objections to continuing the work on your patchset. I'm on PTO this week, but will take a deeper look at your patches early next week. Sorry for the delay.
>
> Waiting on your review. Thanks Roman.

It looks like the mm tree went ahead and I can't cleanly apply the whole patchset.
Would you mind rebasing it and resending?

Thank you!

2021-05-26 02:59:41

by Muchun Song

Subject: Re: [External] Re: [RFC PATCH v3 01/12] mm: memcontrol: move the objcg infrastructure out of CONFIG_MEMCG_KMEM

On Wed, May 26, 2021 at 12:27 AM Roman Gushchin <[email protected]> wrote:
>
> On Wed, Apr 21, 2021 at 03:00:48PM +0800, Muchun Song wrote:
> > Memory allocations can pin memcgs for a long time - this happens at a
> > large scale and causes recurring problems in the real world: page cache
> > doesn't get reclaimed for a long time, or is used by the second, third,
> > fourth, ... instance of the same job that was restarted into a new
> > cgroup every time. Unreclaimable dying cgroups pile up, waste memory,
> > and make page reclaim very inefficient.
> >
> > We can convert LRU pages and most other raw memcg pins to the objcg
> > direction to fix this problem, and then page->memcg will always point
> > to an object cgroup.
> >
> > Therefore, the objcg infrastructure no longer serves only
> > CONFIG_MEMCG_KMEM. In this patch, we move the objcg infrastructure out
> > of the scope of CONFIG_MEMCG_KMEM so that the LRU pages can reuse it
> > to charge pages.
> >
> > We know that the LRU pages are not accounted at the root level, but
> > the page->memcg_data of those pages still points to the root_mem_cgroup,
> > so the page->memcg_data of an LRU page always points to a valid pointer.
> > However, the root_mem_cgroup does not have an object cgroup. If we use
> > the obj_cgroup APIs to charge the LRU pages, we should set
> > page->memcg_data to a root object cgroup. So we also allocate an
> > object cgroup for the root_mem_cgroup.
>
> Overall the patch looks very good to me. There are a few small things to enhance:
>
> 1) I'd rename it. Looking at the title I expect a trivial code move;
> however, the patch is doing more than this: e.g. allocating an objcg
> for the root memcg. Something like "prepare objcg API for non-kmem usage".

OK. I will rename it.

> 2) How about obj_cgroup_release_kmem() instead of obj_cgroup_release_uncharge()?

LGTM. Will use this name.

> 3) The first paragraph of the commit log looks a bit vague: which allocations
> are pinning memcgs? How about something like this?
>
> Pagecache pages are charged at allocation time and hold a reference
> to the original memory cgroup until being reclaimed. Depending on the memory
> pressure, specific patterns of page sharing between different cgroups, and
> the cgroup creation and destruction rates, a large number of dying memory
> cgroups can be pinned by pagecache pages. This makes page reclaim less
> efficient and wastes memory.

Much clearer. I would love to use this.

Thanks Roman.

>
>
> Thanks!

2021-05-26 05:34:01

by Muchun Song

Subject: Re: [External] Re: [RFC PATCH v3 00/12] Use obj_cgroup APIs to charge the LRU pages

On Wed, May 26, 2021 at 1:35 AM Roman Gushchin <[email protected]> wrote:
>
> On Thu, May 20, 2021 at 11:20:47AM +0800, Muchun Song wrote:
> > On Tue, May 18, 2021 at 10:17 PM Roman Gushchin <[email protected]> wrote:
> > >
> > > Hi Muchun!
> > >
> > > It looks like the writeback problem will be solved in a different way, one which will not require generalizing the obj_cgroup API to the cgroup level. It's not fully confirmed yet, though. We still might want to do this generalization long-term, but as of now I have no objections to continuing the work on your patchset. I'm on PTO this week, but will take a deeper look at your patches early next week. Sorry for the delay.
> >
> > Waiting on your review. Thanks Roman.
>
> It looks like the mm tree went ahead and I can't cleanly apply the whole patchset.
> Would you mind rebasing it and resending?

Got it. Will do that. Thanks.

>
> Thank you!