2021-03-16 20:45:22

by Dan Schatzberg

Subject: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

No major changes, just rebasing and resubmitting

Changes since V10:

* Added page-cache charging to mm: Charge active memcg when no mm is set

Changes since V9:

* Rebased against Linus's branch, which now includes Roman Gushchin's
patch this series is based off of

Changes since V8:

* Rebased on top of Roman Gushchin's patch
(https://lkml.org/lkml/2020/8/21/1464) which provides the nesting
support for setting active memcg. Dropped the patch from this series
that did the same thing.

Changes since V7:

* Rebased against Linus's branch

Changes since V6:

* Added separate spinlock for worker synchronization
* Minor style changes

Changes since V5:

* Fixed a missing css_put when failing to allocate a worker
* Minor style changes

Changes since V4:

Only patches 1 and 2 have changed.

* Fixed irq lock ordering bug
* Simplified loop detach
* Added support for nesting memalloc_use_memcg

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread, which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on a cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through the
single kworker thread when multiple cgroups are reading/writing to the
same loop device. This patch series makes some minor modifications to
the loop driver so that each cgroup can make forward progress
independently, avoiding this inversion (roughly as sketched below).
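
A rough sketch of the per-cgroup worker idea, for orientation only. The
struct follows the loop_worker hunks quoted later in this thread;
loop_lookup_or_alloc_worker(), cmd->list_entry, lo->workqueue and the
rootcg fallback fields belong to patch 1 (not quoted here) and should be
read as illustrative names rather than the exact code:

'''
/*
 * Each cgroup issuing i/o to a loop device gets its own worker, kept in
 * an rb-tree keyed by the blkcg css pointer, so a heavily throttled
 * cgroup only stalls its own commands and not everyone else's.
 */
struct loop_worker {
	struct rb_node rb_node;		/* keyed by blkcg_css */
	struct work_struct work;	/* drains cmd_list */
	struct list_head cmd_list;	/* pending loop_cmds for this cgroup */
	struct list_head idle_list;	/* reaped after a period of inactivity */
	struct loop_device *lo;
	struct cgroup_subsys_state *blkcg_css;
	unsigned long last_ran_at;
};

static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
	struct loop_worker *worker;

	spin_lock_irq(&lo->lo_work_lock);
	/* hypothetical helper: rb-tree lookup, allocating a worker on a miss */
	worker = loop_lookup_or_alloc_worker(lo, cmd->blkcg_css);
	if (worker) {
		list_add_tail(&cmd->list_entry, &worker->cmd_list);
		queue_work(lo->workqueue, &worker->work);
	} else {
		/* allocation failed: fall back to the shared rootcg worker */
		list_add_tail(&cmd->list_entry, &lo->rootcg_cmd_list);
		queue_work(lo->workqueue, &lo->rootcg_work);
	}
	spin_unlock_irq(&lo->lo_work_lock);
}
'''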

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg

drivers/block/loop.c | 248 ++++++++++++++++++++++++++++++-------
drivers/block/loop.h | 15 ++-
include/linux/memcontrol.h | 11 ++
kernel/cgroup/cgroup.c | 1 +
mm/filemap.c | 2 +-
mm/memcontrol.c | 15 ++-
mm/shmem.c | 4 +-
7 files changed, 242 insertions(+), 54 deletions(-)

--
2.30.2


2021-03-16 20:45:52

by Dan Schatzberg

Subject: [PATCH 3/3] loop: Charge i/o to mem and blk cg

The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css and int_active_memcg so they
can be used by the loop module.

Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
drivers/block/loop.c | 61 +++++++++++++++++++++++++-------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 11 +++++++
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 1 +
5 files changed, 56 insertions(+), 21 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 5c080af73caa..6cf3086a2e75 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>

#include "loop.h"

@@ -515,8 +516,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);

- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -577,8 +576,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);

if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -586,7 +583,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);

lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);

if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -927,7 +923,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};

@@ -944,7 +940,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)

spin_lock_irq(&lo->lo_work_lock);

- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;

node = &lo->worker_tree.rb_node;
@@ -952,10 +948,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -967,13 +963,18 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ if (cmd->memcg_css)
+ css_put(cmd->memcg_css);
+ cmd->memcg_css = NULL;
goto queue_work;
+ }

- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1297,7 +1298,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_work_lock);
@@ -2102,13 +2103,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}

/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);

return BLK_STS_OK;
@@ -2120,13 +2126,28 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
const bool write = op_is_write(req_op(rq));
struct loop_device *lo = rq->q->queuedata;
int ret = 0;
+ struct mem_cgroup *old_memcg = NULL;

if (write && (lo->lo_flags & LO_FLAGS_READ_ONLY)) {
ret = -EIO;
goto failed;
}

+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ old_memcg = set_active_memcg(
+ mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ set_active_memcg(old_memcg);
+ css_put(cmd->memcg_css);
+ }
failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
@@ -2205,7 +2226,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 9289c1cd6374..cd24a81e00e6 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -76,7 +76,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};

/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0c04d39a7967..fd5dd961d01f 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1187,6 +1187,17 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
return NULL;
}

+static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
+{
+ return NULL;
+}
+
+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 9153b20e5cc6..94c88f7221c5 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -577,6 +577,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);

static void cgroup_get_live(struct cgroup *cgrp)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a1b23ed3412..f05501669e29 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -78,6 +78,7 @@ struct mem_cgroup *root_mem_cgroup __read_mostly;

/* Active memory cgroup to use from an interrupt context */
DEFINE_PER_CPU(struct mem_cgroup *, int_active_memcg);
+EXPORT_PER_CPU_SYMBOL_GPL(int_active_memcg);

/* Socket memory accounting disabled? */
static bool cgroup_memory_nosocket;
--
2.30.2

2021-03-16 20:53:01

by Shakeel Butt

Subject: Re: [PATCH 3/3] loop: Charge i/o to mem and blk cg

On Tue, Mar 16, 2021 at 8:37 AM Dan Schatzberg <[email protected]> wrote:
>
[...]
>
> /* Support for loadable transfer modules */
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 0c04d39a7967..fd5dd961d01f 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -1187,6 +1187,17 @@ static inline struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
> return NULL;
> }
>
> +static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
> +{
> + return NULL;
> +}

The above function has been removed.

2021-03-17 22:34:36

by Jens Axboe

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> No major changes, just rebasing and resubmitting

Applied for 5.13, thanks.

--
Jens Axboe

2021-03-18 15:55:20

by Shakeel Butt

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <[email protected]> wrote:
>
> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > No major changes, just rebasing and resubmitting
>
> Applied for 5.13, thanks.
>

I have requested a couple of changes in the patch series. Can this
applied series still be changed, or are new patches required?

2021-03-18 16:03:52

by Jens Axboe

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On 3/18/21 9:53 AM, Shakeel Butt wrote:
> On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <[email protected]> wrote:
>>
>> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
>>> No major changes, just rebasing and resubmitting
>>
>> Applied for 5.13, thanks.
>>
>
> I have requested a couple of changes in the patch series. Can this
> applied series still be changed, or are new patches required?

I have nothing sitting on top of it for now, so as far as I'm concerned
we can apply a new series instead. Then we can also fold in that fix
from Colin that he posted this morning...

--
Jens Axboe

2021-03-18 23:48:00

by Andrew Morton

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe <[email protected]> wrote:

> On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <[email protected]> wrote:
> >>
> >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> >>> No major changes, just rebasing and resubmitting
> >>
> >> Applied for 5.13, thanks.
> >>
> >
> > I have requested a couple of changes in the patch series. Can this
> > applied series still be changed, or are new patches required?
>
> I have nothing sitting on top of it for now, so as far as I'm concerned
> we can apply a new series instead. Then we can also fold in that fix
> from Colin that he posted this morning...

The collision in memcontrol.c is a pain, but I guess as this is mainly
a loop patch, the block tree is an appropriate route.

Here's the collision between "mm: Charge active memcg when no mm is
set" and Shakeels's
https://lkml.kernel.org/r/[email protected]


--- mm/memcontrol.c
+++ mm/memcontrol.c
@@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
rcu_read_unlock();
}

- if (!memcg)
- memcg = get_mem_cgroup_from_mm(mm);
+ if (!memcg) {
+ if (!mm) {
+ memcg = get_mem_cgroup_from_current();
+ if (!memcg)
+ memcg = get_mem_cgroup_from_mm(current->mm);
+ } else {
+ memcg = get_mem_cgroup_from_mm(mm);
+ }
+ }

ret = try_charge(memcg, gfp_mask, nr_pages);
if (ret)


Which I resolved thusly:

int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
{
struct mem_cgroup *memcg;
int ret;

if (mem_cgroup_disabled())
return 0;

if (!mm) {
memcg = get_mem_cgroup_from_current();
if (!memcg)
memcg = get_mem_cgroup_from_mm(current->mm);
} else {
memcg = get_mem_cgroup_from_mm(mm);
}

ret = __mem_cgroup_charge(page, memcg, gfp_mask);
css_put(&memcg->css);

return ret;
}


2021-03-19 01:01:55

by Shakeel Butt

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Thu, Mar 18, 2021 at 4:46 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 18 Mar 2021 10:00:17 -0600 Jens Axboe <[email protected]> wrote:
>
> > On 3/18/21 9:53 AM, Shakeel Butt wrote:
> > > On Wed, Mar 17, 2021 at 3:30 PM Jens Axboe <[email protected]> wrote:
> > >>
> > >> On 3/16/21 9:36 AM, Dan Schatzberg wrote:
> > >>> No major changes, just rebasing and resubmitting
> > >>
> > >> Applied for 5.13, thanks.
> > >>
> > >
> > > I have requested a couple of changes in the patch series. Can this
> > > applied series still be changed, or are new patches required?
> >
> > I have nothing sitting on top of it for now, so as far as I'm concerned
> > we can apply a new series instead. Then we can also fold in that fix
> > from Colin that he posted this morning...
>
> The collision in memcontrol.c is a pain, but I guess as this is mainly
> a loop patch, the block tree is an appropriate route.
>
> Here's the collision between "mm: Charge active memcg when no mm is
> set" and Shakeels's
> https://lkml.kernel.org/r/[email protected]
>
>
> --- mm/memcontrol.c
> +++ mm/memcontrol.c
> @@ -6728,8 +6730,15 @@ int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> rcu_read_unlock();
> }
>
> - if (!memcg)
> - memcg = get_mem_cgroup_from_mm(mm);
> + if (!memcg) {
> + if (!mm) {
> + memcg = get_mem_cgroup_from_current();
> + if (!memcg)
> + memcg = get_mem_cgroup_from_mm(current->mm);
> + } else {
> + memcg = get_mem_cgroup_from_mm(mm);
> + }
> + }
>
> ret = try_charge(memcg, gfp_mask, nr_pages);
> if (ret)
>
>
> Which I resolved thusly:
>
> int mem_cgroup_charge(struct page *page, struct mm_struct *mm, gfp_t gfp_mask)
> {
> struct mem_cgroup *memcg;
> int ret;
>
> if (mem_cgroup_disabled())
> return 0;
>
> if (!mm) {
> memcg = get_mem_cgroup_from_current();
> if (!memcg)
> memcg = get_mem_cgroup_from_mm(current->mm);
> } else {
> memcg = get_mem_cgroup_from_mm(mm);
> }
>
> ret = __mem_cgroup_charge(page, memcg, gfp_mask);
> css_put(&memcg->css);
>
> return ret;
> }
>

We need something similar for mem_cgroup_swapin_charge_page() as well.

It is better to take this series through the mm tree, and Jens is OK with that [1].

[1] https://lore.kernel.org/linux-next/[email protected]/
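
For reference, a minimal sketch of what the matching swapin-side handling
could look like, mirroring the !mm fallback from the resolution above. The
mem_cgroup_swapin_charge_page() signature and the lookup_swap_cgroup_id()/
mem_cgroup_from_id() steps are assumptions based on Shakeel's swapin
charging patch, not code from this series:

'''
int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm,
				  gfp_t gfp, swp_entry_t entry)
{
	struct mem_cgroup *memcg;
	unsigned short id;
	int ret;

	if (mem_cgroup_disabled())
		return 0;

	/* prefer the memcg recorded at swapout time, if it is still online */
	id = lookup_swap_cgroup_id(entry);
	rcu_read_lock();
	memcg = mem_cgroup_from_id(id);
	if (!memcg || !css_tryget_online(&memcg->css)) {
		if (!mm) {
			/* no mm: charge the active (e.g. loop-issuing) memcg */
			memcg = get_mem_cgroup_from_current();
			if (!memcg)
				memcg = get_mem_cgroup_from_mm(current->mm);
		} else {
			memcg = get_mem_cgroup_from_mm(mm);
		}
	}
	rcu_read_unlock();

	ret = __mem_cgroup_charge(page, memcg, gfp);

	css_put(&memcg->css);
	return ret;
}
'''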

2021-03-19 15:55:08

by Dan Schatzberg

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
>
> We need something similar for mem_cgroup_swapin_charge_page() as well.
>
> It is better to take this series through the mm tree, and Jens is OK with that [1].
>
> [1] https://lore.kernel.org/linux-next/[email protected]/

It sounds like there are no concerns about the loop-related work in
the patch series. I'll rebase on the mm tree and resubmit.

2021-03-19 16:22:36

by Shakeel Butt

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Fri, Mar 19, 2021 at 8:51 AM Dan Schatzberg <[email protected]> wrote:
>
> On Thu, Mar 18, 2021 at 05:56:28PM -0700, Shakeel Butt wrote:
> >
> > We need something similar for mem_cgroup_swapin_charge_page() as well.
> >
> > It is better to take this series through the mm tree, and Jens is OK with that [1].
> >
> > [1] https://lore.kernel.org/linux-next/[email protected]/
>
> It sounds like there are no concerns about the loop-related work in
> the patch series. I'll rebase on the mm tree and resubmit.

One suggestion would be to make get_mem_cgroup_from_mm() more generic
(i.e. handle !mm && active_memcg() case) and avoid
get_mem_cgroup_from_current() as it might go away.

2021-03-19 16:31:48

by Dan Schatzberg

Subject: Re: [PATCH v10 0/3] Charge loop device i/o to issuing cgroup

On Fri, Mar 19, 2021 at 09:20:16AM -0700, Shakeel Butt wrote:
> One suggestion would be to make get_mem_cgroup_from_mm() more generic
> (i.e. handle !mm && active_memcg() case) and avoid
> get_mem_cgroup_from_current() as it might go away.

Yeah, that occurred to me as well. I'll take a stab at doing that.
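
A minimal sketch of that generalization, assuming active_memcg() is the
internal helper returning the memcg installed by set_active_memcg() and
that the usual mm->owner walk stays as-is; illustrative only, not the
patch that was eventually posted:

'''
struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm)
{
	struct mem_cgroup *memcg;

	if (mem_cgroup_disabled())
		return NULL;

	/*
	 * Page-cache insertions can happen without an actual mm context,
	 * e.g. during disk probing on boot, loopback i/o, acct() writes.
	 */
	if (unlikely(!mm)) {
		memcg = active_memcg();
		if (unlikely(memcg)) {
			/* take a ref for the caller; the setter keeps it alive */
			css_get(&memcg->css);
			return memcg;
		}
		mm = current->mm;
		if (unlikely(!mm))
			return root_mem_cgroup;	/* root memcg is never freed */
	}

	rcu_read_lock();
	do {
		memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
		if (unlikely(!memcg))
			memcg = root_mem_cgroup;
	} while (!css_tryget(&memcg->css));
	rcu_read_unlock();

	return memcg;
}
'''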