LinuxLists.cc - [PATCH V12 0/3] Charge loop device i/o to issuing cgroup

2021-04-02 19:18:42

Subject: [PATCH V12 0/3] Charge loop device i/o to issuing cgroup

No major changes, rebased on top of latest mm tree

Changes since V12:

* Small change to get_mem_cgroup_from_mm to avoid needing
get_active_memcg

Changes since V11:

* Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
can be driven by writeback, but this was causing a warning in xfs
and likely other filesystems aren't equipped to be driven by reclaim
at the VFS layer.
* Included a small fix from Colin Ian King.
* reworked get_mem_cgroup_from_mm to institute the necessary charge
priority.

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against linus's branch which now includes Roman Gushchin's
patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.

Changes since V7:

* Rebased against linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device. This patch series does some
minor modification to the loop driver so that each cgroup can make
forward progress independently to avoid this inversion.

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg

drivers/block/loop.c | 244 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 49 +++++---
mm/shmem.c | 4 +-
7 files changed, 253 insertions(+), 68 deletions(-)

--
2.30.2

2021-04-02 19:18:52

by Dan Schatzberg

[permalink] [raw]

Subject: [PATCH 2/3] mm: Charge active memcg when no mm is set

set_active_memcg() worked for kernel allocations but was silently
ignored for user pages.

This patch establishes a precedence order for who gets charged:

1. If there is a memcg associated with the page already, that memcg is
charged. This happens during swapin.

2. If an explicit mm is passed, mm->memcg is charged. This happens
during page faults, which can be triggered in remote VMs (eg gup).

3. Otherwise consult the current process context. If there is an
active_memcg, use that. Otherwise, current->mm->memcg.

Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it
would always charge the root cgroup. Now it looks up the active_memcg
first (falling back to charging the root cgroup if not set).

Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Chris Down <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
---
mm/filemap.c | 2 +-
mm/memcontrol.c | 48 +++++++++++++++++++++++++++++++-----------------
mm/shmem.c | 4 ++--
3 files changed, 34 insertions(+), 20 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c03463cb72d6..38648f7d2106 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -872,7 +872,7 @@ noinline int __add_to_page_cache_locked(struct page *page,
page->index = offset;

if (!huge) {
- error = mem_cgroup_charge(page, current->mm, gfp);
+ error = mem_cgroup_charge(page, NULL, gfp);
if (error)
goto error;
charged = true;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c0b83a396299..d2939d6602b3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -886,13 +886,24 @@ struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p)
}
EXPORT_SYMBOL(mem_cgroup_from_task);

+static __always_inline struct mem_cgroup *active_memcg(void)
+{
+ if (in_interrupt())
+ return this_cpu_read(int_active_memcg);
+ else
+ return current->active_memcg;
+}
+
/**
* get_mem_cgroup_from_mm: Obtain a reference on given mm_struct's memcg.
* @mm: mm from which memcg should be extracted. It can be NULL.
*
- * Obtain a reference on mm->memcg and returns it if successful. Otherwise
- * root_mem_cgroup is returned. However if mem_cgroup is disabled, NULL is
- * returned.
+ * Obtain a reference on mm->memcg and returns it if successful. If mm
+ * is NULL, then the memcg is chosen as follows:
+ * 1) The active memcg, if set.
+ * 2) current->mm->memcg, if available
+ * 3) root memcg
+ * If mem_cgroup is disabled, NULL is returned.
*/
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
@@ -901,13 +912,23 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
if (mem_cgroup_disabled())
return NULL;

+ /*
+ * Page cache insertions can happen without an
+ * actual mm context, e.g. during disk probing
+ * on boot, loopback IO, acct() writes etc.
+ */
+ if (unlikely(!mm)) {
+ memcg = active_memcg();
+ if (unlikely(memcg)) {
+ /* remote memcg must hold a ref */
+ css_get(&memcg->css);
+ return memcg;
+ }
+ mm = current->mm;
+ }
+
rcu_read_lock();
do {
- /*
- * Page cache insertions can happen without an
- * actual mm context, e.g. during disk probing
- * on boot, loopback IO, acct() writes etc.
- */
if (unlikely(!mm))
memcg = root_mem_cgroup;
else {
@@ -921,14 +942,6 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
}
EXPORT_SYMBOL(get_mem_cgroup_from_mm);

-static __always_inline struct mem_cgroup *active_memcg(void)
-{
- if (in_interrupt())
- return this_cpu_read(int_active_memcg);
- else
- return current->active_memcg;
-}
-
static __always_inline bool memcg_kmem_bypass(void)
{
/* Allow remote memcg charging from any context. */
@@ -6537,7 +6550,8 @@ static int __mem_cgroup_charge(struct page *page, struct mem_cgroup *memcg,
* @gfp_mask: reclaim mode
*
* Try to charge @page to the memcg that @mm belongs to, reclaiming
- * pages according to @gfp_mask if necessary.
+ * pages according to @gfp_mask if necessary. if @mm is NULL, try to
+ * charge to the active memcg.
*
* Do not use this for pages allocated for swapin.
*
diff --git a/mm/shmem.c b/mm/shmem.c
index 5cfd2fb6e52b..524fa5aa0459 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1694,7 +1694,7 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
{
struct address_space *mapping = inode->i_mapping;
struct shmem_inode_info *info = SHMEM_I(inode);
- struct mm_struct *charge_mm = vma ? vma->vm_mm : current->mm;
+ struct mm_struct *charge_mm = vma ? vma->vm_mm : NULL;
struct page *page;
swp_entry_t swap;
int error;
@@ -1815,7 +1815,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
}

sbinfo = SHMEM_SB(inode->i_sb);
- charge_mm = vma ? vma->vm_mm : current->mm;
+ charge_mm = vma ? vma->vm_mm : NULL;

page = pagecache_get_page(mapping, index,
FGP_ENTRY | FGP_HEAD | FGP_LOCK, 0);
--
2.30.2

2021-04-02 19:19:43

by Dan Schatzberg

[permalink] [raw]

Subject: [PATCH 3/3] loop: Charge i/o to mem and blk cg

The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so it
can be used by the loop module.

Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 51 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 4750b373d4bb..d2759f8a7c2a 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -78,6 +78,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>

#include "loop.h"

@@ -516,8 +517,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);

- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -578,8 +577,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);

if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -587,7 +584,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);

lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);

if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -928,7 +924,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};

@@ -945,7 +941,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)

spin_lock_irq(&lo->lo_work_lock);

- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;

node = &lo->worker_tree.rb_node;
@@ -953,10 +949,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -968,13 +964,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }

- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1294,7 +1295,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2099,13 +2100,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}

/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);

return BLK_STS_OK;
@@ -2117,13 +2123,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;

if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}

+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2202,7 +2223,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};

/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b8b0a802852c..a92500734f90 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1249,6 +1249,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}

+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index e049edd66776..8c84a5374238 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);

static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d2939d6602b3..f12886a85e8b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;

/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);

/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2

2021-04-03 05:50:05

by Muchun Song

[permalink] [raw]

Subject: Re: [External] [PATCH 2/3] mm: Charge active memcg when no mm is set

On Sat, Apr 3, 2021 at 3:17 AM Dan Schatzberg <[email protected]> wrote:
>
> set_active_memcg() worked for kernel allocations but was silently
> ignored for user pages.
>
> This patch establishes a precedence order for who gets charged:
>
> 1. If there is a memcg associated with the page already, that memcg is
> charged. This happens during swapin.
>
> 2. If an explicit mm is passed, mm->memcg is charged. This happens
> during page faults, which can be triggered in remote VMs (eg gup).
>
> 3. Otherwise consult the current process context. If there is an
> active_memcg, use that. Otherwise, current->mm->memcg.
>
> Previously, if a NULL mm was passed to mem_cgroup_charge (case 3) it
> would always charge the root cgroup. Now it looks up the active_memcg
> first (falling back to charging the root cgroup if not set).
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>
> Acked-by: Tejun Heo <[email protected]>
> Acked-by: Chris Down <[email protected]>
> Reviewed-by: Shakeel Butt <[email protected]>

Reviewed-by: Muchun Song <[email protected]>

Thanks.

2021-04-06 13:05:54

by Ming Lei

[permalink] [raw]

Subject: Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg

On Sat, Apr 3, 2021 at 3:18 AM Dan Schatzberg <[email protected]> wrote:
>
> The current code only associates with the existing blkcg when aio is
> used to access the backing file. This patch covers all types of i/o to
> the backing file and also associates the memcg so if the backing file is
> on tmpfs, memory is charged appropriately.
>
> This patch also exports cgroup_get_e_css and int_active_memcg so it
> can be used by the loop module.
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

Reviewed-by: Ming Lei <[email protected]>

--
Ming Lei

2021-04-13 04:30:34

by Johannes Weiner

[permalink] [raw]

Subject: Re: [PATCH V12 0/3] Charge loop device i/o to issuing cgroup

It looks like all feedback has been addressed and there hasn't been
any new activity on it in a while.

As per the suggestion last time [1], Andrew, Jens, could this go
through the -mm tree to deal with the memcg conflicts?

[1] https://lore.kernel.org/lkml/CALvZod6FMQQC17Zsu9xoKs=dFWaJdMC2Qk3YiDPUUQHx8teLYg@mail.gmail.com/

On Fri, Apr 02, 2021 at 12:16:31PM -0700, Dan Schatzberg wrote:
> No major changes, rebased on top of latest mm tree
>
> Changes since V12:
>
> * Small change to get_mem_cgroup_from_mm to avoid needing
> get_active_memcg
>
> Changes since V11:
>
> * Removed WQ_MEM_RECLAIM flag from loop workqueue. Technically, this
> can be driven by writeback, but this was causing a warning in xfs
> and likely other filesystems aren't equipped to be driven by reclaim
> at the VFS layer.
> * Included a small fix from Colin Ian King.
> * reworked get_mem_cgroup_from_mm to institute the necessary charge
> priority.
>
> Changes since V10:
>
> * Added page-cache charging to mm: Charge active memcg when no mm is set
>
> Changes since V9:
>
> * Rebased against linus's branch which now includes Roman Gushchin's
> patch this series is based off of
>
> Changes since V8:
>
> * Rebased on top of Roman Gushchin's patch
> (https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
> support for setting active memcg. Dropped the patch from this series
> that did the same thing.
>
> Changes since V7:
>
> * Rebased against linus's branch
>
> Changes since V6:
>
> * Added separate spinlock for worker synchronization
> * Minor style changes
>
> Changes since V5:
>
> * Fixed a missing css_put when failing to allocate a worker
> * Minor style changes
>
> Changes since V4:
>
> Only patches 1 and 2 have changed.
>
> * Fixed irq lock ordering bug
> * Simplified loop detach
> * Added support for nesting memalloc_use_memcg
>
> Changes since V3:
>
> * Fix race on loop device destruction and deferred worker cleanup
> * Ensure charge on shmem_swapin_page works just like getpage
> * Minor style changes
>
> Changes since V2:
>
> * Deferred destruction of workqueue items so in the common case there
> is no allocation needed
>
> Changes since V1:
>
> * Split out and reordered patches so cgroup charging changes are
> separate from kworker -> workqueue change
>
> * Add mem_css to struct loop_cmd to simplify logic
>
> The loop device runs all i/o to the backing file on a separate kworker
> thread which results in all i/o being charged to the root cgroup. This
> allows a loop device to be used to trivially bypass resource limits
> and other policy. This patch series fixes this gap in accounting.
>
> A simple script to demonstrate this behavior on cgroupv2 machine:
>
> '''
> #!/bin/bash
> set -e
>
> CGROUP=/sys/fs/cgroup/test.slice
> LOOP_DEV=/dev/loop0
>
> if [[ ! -d $CGROUP ]]
> then
> sudo mkdir $CGROUP
> fi
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit to tmpfs -> OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> dd if=/dev/zero of=/tmp/file bs=1M count=256" || true
>
> grep oom_kill $CGROUP/memory.events
>
> # Set a memory limit, write more than that limit through loopback
> # device -> no OOM kill
> sudo unshare -m bash -c "
> echo \$\$ > $CGROUP/cgroup.procs;
> echo 0 > $CGROUP/memory.swap.max;
> echo 64M > $CGROUP/memory.max;
> mount -t tmpfs -o size=512m tmpfs /tmp;
> truncate -s 512m /tmp/backing_file
> losetup $LOOP_DEV /tmp/backing_file
> dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
> losetup -D $LOOP_DEV" || true
>
> grep oom_kill $CGROUP/memory.events
> '''
>
> Naively charging cgroups could result in priority inversions through
> the single kworker thread in the case where multiple cgroups are
> reading/writing to the same loop device. This patch series does some
> minor modification to the loop driver so that each cgroup can make
> forward progress independently to avoid this inversion.
>
> With this patch series applied, the above script triggers OOM kills
> when writing through the loop device as expected.
>
> Dan Schatzberg (3):
> loop: Use worker per cgroup instead of kworker
> mm: Charge active memcg when no mm is set
> loop: Charge i/o to mem and blk cg
>
> drivers/block/loop.c | 244 ++++++++++++++++++++++++++++++-------
> drivers/block/loop.h | 15 ++-
> include/linux/memcontrol.h | 6 +
> kernel/cgroup/cgroup.c | 1 +
> mm/filemap.c | 2 +-
> mm/memcontrol.c | 49 +++++---
> mm/shmem.c | 4 +-
> 7 files changed, 253 insertions(+), 68 deletions(-)
>
> --
> 2.30.2
>
>

2021-04-13 04:30:43

by Jens Axboe

[permalink] [raw]

Subject: Re: [PATCH V12 0/3] Charge loop device i/o to issuing cgroup

On 4/12/21 9:45 AM, Johannes Weiner wrote:
> It looks like all feedback has been addressed and there hasn't been
> any new activity on it in a while.
>
> As per the suggestion last time [1], Andrew, Jens, could this go
> through the -mm tree to deal with the memcg conflicts?

Yep, I think that would make it the most painless for everyone.

Dan/Andrew, you can add my Acked-by to the series.

--
Jens Axboe