2020-02-24 22:20:01

by Dan Schatzberg

Subject: [PATCH v3 0/3] Charge loop device i/o to issuing cgroup

Changes since V3:

* Fix race on loop device destruction and deferred worker cleanup
* Ensure charge on shmem_swapin_page works just like getpage
* Minor style changes

Changes since V2:

* Deferred destruction of workqueue items so in the common case there
is no allocation needed

Changes since V1:

* Split out and reordered patches so cgroup charging changes are
separate from kworker -> workqueue change

* Add mem_css to struct loop_cmd to simplify logic

The loop device runs all i/o to the backing file on a separate kworker
thread, which results in all i/o being charged to the root cgroup. This
allows a loop device to be used to trivially bypass resource limits
and other policy. This patch series fixes this gap in accounting.

A simple script to demonstrate this behavior on a cgroupv2 machine:

'''
#!/bin/bash
set -e

CGROUP=/sys/fs/cgroup/test.slice
LOOP_DEV=/dev/loop0

if [[ ! -d $CGROUP ]]
then
sudo mkdir $CGROUP
fi

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit to tmpfs -> OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
dd if=/dev/zero of=/tmp/file bs=1M count=256" || true

grep oom_kill $CGROUP/memory.events

# Set a memory limit, write more than that limit through loopback
# device -> no OOM kill
sudo unshare -m bash -c "
echo \$\$ > $CGROUP/cgroup.procs;
echo 0 > $CGROUP/memory.swap.max;
echo 64M > $CGROUP/memory.max;
mount -t tmpfs -o size=512m tmpfs /tmp;
truncate -s 512m /tmp/backing_file
losetup $LOOP_DEV /tmp/backing_file
dd if=/dev/zero of=$LOOP_DEV bs=1M count=256;
losetup -D $LOOP_DEV" || true

grep oom_kill $CGROUP/memory.events
'''

Naively charging cgroups could result in priority inversions through
the single kworker thread in the case where multiple cgroups are
reading/writing to the same loop device: one cgroup that gets blocked
on its limits would stall I/O submission for every other cgroup using
the device. This patch series makes minor modifications to the loop
driver so that each cgroup can make forward progress independently,
avoiding this inversion.
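
In rough pseudo-C, the per-cgroup queueing path looks like the sketch
below (condensed from patch 1: locking, the idle-worker bookkeeping and
allocation details are omitted, and loop_find_or_create_worker() is a
made-up stand-in for the rbtree lookup/insert that patch 1 adds):

static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
{
        struct loop_worker *worker = NULL;

        if (cmd->css)
                /* stand-in for the blkcg css -> loop_worker rbtree walk */
                worker = loop_find_or_create_worker(lo, cmd->css);

        if (worker) {
                /* per-cgroup queue, so one blocked cgroup can't stall the rest */
                list_add_tail(&cmd->list_entry, &worker->cmd_list);
                queue_work(lo->workqueue, &worker->work);
        } else {
                /* no cgroup, or no memory for a worker: shared rootcg queue */
                list_add_tail(&cmd->list_entry, &lo->rootcg_cmd_list);
                queue_work(lo->workqueue, &lo->rootcg_work);
        }
}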

With this patch series applied, the above script triggers OOM kills
when writing through the loop device as expected.

Dan Schatzberg (3):
loop: Use worker per cgroup instead of kworker
mm: Charge active memcg when no mm is set
loop: Charge i/o to mem and blk cg

drivers/block/loop.c | 246 +++++++++++++++++++++++++++++++------
drivers/block/loop.h | 14 ++-
include/linux/memcontrol.h | 6 +
kernel/cgroup/cgroup.c | 1 +
mm/memcontrol.c | 11 +-
mm/shmem.c | 4 +-
6 files changed, 235 insertions(+), 47 deletions(-)

--
2.17.1


2020-02-24 22:21:10

by Dan Schatzberg

Subject: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

Existing uses of loop device may have multiple cgroups reading/writing
to the same device. Simply charging resources for I/O to the backing
file could result in priority inversion where one cgroup gets
synchronously blocked, holding up all other I/O to the loop device.

In order to avoid this priority inversion, we use a single workqueue
where each work item is a "struct loop_worker" which contains a queue of
struct loop_cmds to issue. The loop device maintains a tree mapping blk
css_id -> loop_worker. This allows each cgroup to independently make
forward progress issuing I/O to the backing file.

There is also a single queue for I/O associated with the rootcg which
can be used in cases of extreme memory shortage where we cannot allocate
a loop_worker.

The locking for the tree and queues is fairly heavy handed - we acquire
the per-loop-device spinlock any time either is accessed. The existing
implementation serializes all I/O through a single thread anyways, so I
don't believe this is any worse.

Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
drivers/block/loop.c | 207 +++++++++++++++++++++++++++++++++++++------
drivers/block/loop.h | 11 ++-
2 files changed, 188 insertions(+), 30 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 739b372a5112..a9b05cacc393 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -70,7 +70,6 @@
#include <linux/writeback.h>
#include <linux/completion.h>
#include <linux/highmem.h>
-#include <linux/kthread.h>
#include <linux/splice.h>
#include <linux/sysfs.h>
#include <linux/miscdevice.h>
@@ -83,6 +82,8 @@

#include <linux/uaccess.h>

+#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
+
static DEFINE_IDR(loop_index_idr);
static DEFINE_MUTEX(loop_ctl_mutex);

@@ -891,27 +892,100 @@ static void loop_config_discard(struct loop_device *lo)

static void loop_unprepare_queue(struct loop_device *lo)
{
- kthread_flush_worker(&lo->worker);
- kthread_stop(lo->worker_task);
-}
-
-static int loop_kthread_worker_fn(void *worker_ptr)
-{
- current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
- return kthread_worker_fn(worker_ptr);
+ destroy_workqueue(lo->workqueue);
}

static int loop_prepare_queue(struct loop_device *lo)
{
- kthread_init_worker(&lo->worker);
- lo->worker_task = kthread_run(loop_kthread_worker_fn,
- &lo->worker, "loop%d", lo->lo_number);
- if (IS_ERR(lo->worker_task))
+ lo->workqueue = alloc_workqueue("loop%d",
+ WQ_UNBOUND | WQ_FREEZABLE |
+ WQ_MEM_RECLAIM,
+ lo->lo_number);
+ if (IS_ERR(lo->workqueue))
return -ENOMEM;
- set_user_nice(lo->worker_task, MIN_NICE);
+
return 0;
}

+struct loop_worker {
+ struct rb_node rb_node;
+ struct work_struct work;
+ struct list_head cmd_list;
+ struct list_head idle_list;
+ struct loop_device *lo;
+ struct cgroup_subsys_state *css;
+ unsigned long last_ran_at;
+};
+
+static void loop_workfn(struct work_struct *work);
+static void loop_rootcg_workfn(struct work_struct *work);
+static void loop_free_idle_workers(struct timer_list *timer);
+
+static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
+{
+ struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
+ struct loop_worker *cur_worker, *worker = NULL;
+ struct work_struct *work;
+ struct list_head *cmd_list;
+
+ spin_lock_irq(&lo->lo_lock);
+
+ if (!cmd->css)
+ goto queue_work;
+
+ node = &lo->worker_tree.rb_node;
+
+ while (*node) {
+ parent = *node;
+ cur_worker = container_of(*node, struct loop_worker, rb_node);
+ if (cur_worker->css == cmd->css) {
+ worker = cur_worker;
+ break;
+ } else if ((long)cur_worker->css < (long)cmd->css) {
+ node = &(*node)->rb_left;
+ } else {
+ node = &(*node)->rb_right;
+ }
+ }
+ if (worker)
+ goto queue_work;
+
+ worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
+ /*
+ * In the event we cannot allocate a worker, just queue on the
+ * rootcg worker
+ */
+ if (!worker)
+ goto queue_work;
+
+ worker->css = cmd->css;
+ css_get(worker->css);
+ INIT_WORK(&worker->work, loop_workfn);
+ INIT_LIST_HEAD(&worker->cmd_list);
+ INIT_LIST_HEAD(&worker->idle_list);
+ worker->lo = lo;
+ rb_link_node(&worker->rb_node, parent, node);
+ rb_insert_color(&worker->rb_node, &lo->worker_tree);
+queue_work:
+ if (worker) {
+ /*
+ * We need to remove from the idle list here while
+ * holding the lock so that the idle timer doesn't
+ * free the worker
+ */
+ if (!list_empty(&worker->idle_list))
+ list_del_init(&worker->idle_list);
+ work = &worker->work;
+ cmd_list = &worker->cmd_list;
+ } else {
+ work = &lo->rootcg_work;
+ cmd_list = &lo->rootcg_cmd_list;
+ }
+ list_add_tail(&cmd->list_entry, cmd_list);
+ queue_work(lo->workqueue, work);
+ spin_unlock_irq(&lo->lo_lock);
+}
+
static void loop_update_rotational(struct loop_device *lo)
{
struct file *file = lo->lo_backing_file;
@@ -993,6 +1067,12 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,

set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);

+ INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
+ INIT_LIST_HEAD(&lo->rootcg_cmd_list);
+ INIT_LIST_HEAD(&lo->idle_worker_list);
+ lo->worker_tree = RB_ROOT;
+ timer_setup(&lo->timer, loop_free_idle_workers,
+ TIMER_DEFERRABLE);
lo->use_dio = false;
lo->lo_device = bdev;
lo->lo_flags = lo_flags;
@@ -1101,6 +1181,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
int err = 0;
bool partscan = false;
int lo_number;
+ struct loop_worker *pos, *worker;

mutex_lock(&loop_ctl_mutex);
if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
@@ -1117,9 +1198,28 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
/* freeze request queue during the transition */
blk_mq_freeze_queue(lo->lo_queue);

+ /*
+ * Ordering here is a bit tricky:
+ *
+ * 1) flush/destroy workqueue without lock held so outstanding
+ * I/O is issued and all active workers go idle
+ *
+ * 2) Grab lock, free all idle workers
+ *
+ * 3) unlock, del_timer_sync so if timer raced it will be a no-op
+ */
+ loop_unprepare_queue(lo);
spin_lock_irq(&lo->lo_lock);
lo->lo_backing_file = NULL;
+ list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
+ idle_list) {
+ list_del(&worker->idle_list);
+ rb_erase(&worker->rb_node, &lo->worker_tree);
+ css_put(worker->css);
+ kfree(worker);
+ }
spin_unlock_irq(&lo->lo_lock);
+ del_timer_sync(&lo->timer);

loop_release_xfer(lo);
lo->transfer = NULL;
@@ -1154,7 +1254,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)

partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
lo_number = lo->lo_number;
- loop_unprepare_queue(lo);
out_unlock:
mutex_unlock(&loop_ctl_mutex);
if (partscan) {
@@ -1932,7 +2031,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
} else
#endif
cmd->css = NULL;
- kthread_queue_work(&lo->worker, &cmd->work);
+ loop_queue_work(lo, cmd);

return BLK_STS_OK;
}
@@ -1958,26 +2057,82 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
}
}

-static void loop_queue_work(struct kthread_work *work)
+static void loop_set_timer(struct loop_device *lo)
+{
+ timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT);
+}
+
+static void loop_process_work(struct loop_worker *worker,
+ struct list_head *cmd_list, struct loop_device *lo)
{
- struct loop_cmd *cmd =
- container_of(work, struct loop_cmd, work);
+ int orig_flags = current->flags;
+ struct loop_cmd *cmd;

- loop_handle_cmd(cmd);
+ current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
+ spin_lock_irq(&lo->lo_lock);
+ while (!list_empty(cmd_list)) {
+ cmd = container_of(
+ cmd_list->next, struct loop_cmd, list_entry);
+ list_del(cmd_list->next);
+ spin_unlock_irq(&lo->lo_lock);
+
+ loop_handle_cmd(cmd);
+ cond_resched();
+
+ spin_lock_irq(&lo->lo_lock);
+ }
+
+ /*
+ * We only add to the idle list if there are no pending cmds
+ * *and* the worker will not run again which ensures that it
+ * is safe to free any worker on the idle list
+ */
+ if (worker && !work_pending(&worker->work)) {
+ worker->last_ran_at = jiffies;
+ list_add_tail(&worker->idle_list, &lo->idle_worker_list);
+ loop_set_timer(lo);
+ }
+ spin_unlock_irq(&lo->lo_lock);
+ current->flags = orig_flags;
}

-static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq,
- unsigned int hctx_idx, unsigned int numa_node)
+static void loop_workfn(struct work_struct *work)
{
- struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
+ struct loop_worker *worker =
+ container_of(work, struct loop_worker, work);
+ loop_process_work(worker, &worker->cmd_list, worker->lo);
+}

- kthread_init_work(&cmd->work, loop_queue_work);
- return 0;
+static void loop_rootcg_workfn(struct work_struct *work)
+{
+ struct loop_device *lo =
+ container_of(work, struct loop_device, rootcg_work);
+ loop_process_work(NULL, &lo->rootcg_cmd_list, lo);
+}
+
+static void loop_free_idle_workers(struct timer_list *timer)
+{
+ struct loop_device *lo = container_of(timer, struct loop_device, timer);
+ struct loop_worker *pos, *worker;
+
+ spin_lock_irq(&lo->lo_lock);
+ list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
+ idle_list) {
+ if (time_is_after_jiffies(worker->last_ran_at +
+ LOOP_IDLE_WORKER_TIMEOUT))
+ break;
+ list_del(&worker->idle_list);
+ rb_erase(&worker->rb_node, &lo->worker_tree);
+ css_put(worker->css);
+ kfree(worker);
+ }
+ if (!list_empty(&lo->idle_worker_list))
+ loop_set_timer(lo);
+ spin_unlock_irq(&lo->lo_lock);
}

static const struct blk_mq_ops loop_mq_ops = {
.queue_rq = loop_queue_rq,
- .init_request = loop_init_request,
.complete = lo_complete_rq,
};

diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index af75a5ee4094..87fd0e372227 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -14,7 +14,6 @@
#include <linux/blk-mq.h>
#include <linux/spinlock.h>
#include <linux/mutex.h>
-#include <linux/kthread.h>
#include <uapi/linux/loop.h>

/* Possible states of device */
@@ -54,8 +53,12 @@ struct loop_device {

spinlock_t lo_lock;
int lo_state;
- struct kthread_worker worker;
- struct task_struct *worker_task;
+ struct workqueue_struct *workqueue;
+ struct work_struct rootcg_work;
+ struct list_head rootcg_cmd_list;
+ struct list_head idle_worker_list;
+ struct rb_root worker_tree;
+ struct timer_list timer;
bool use_dio;
bool sysfs_inited;

@@ -65,7 +68,7 @@ struct loop_device {
};

struct loop_cmd {
- struct kthread_work work;
+ struct list_head list_entry;
bool use_aio; /* use AIO interface to handle I/O */
atomic_t ref; /* only for aio */
long ret;
--
2.17.1

2020-02-24 22:21:21

by Dan Schatzberg

Subject: [PATCH v3 3/3] loop: Charge i/o to mem and blk cg

The current code only associates with the existing blkcg when aio is
used to access the backing file. This patch covers all types of i/o to
the backing file and also associates the memcg so if the backing file is
on tmpfs, memory is charged appropriately.

This patch also exports cgroup_get_e_css so it can be used by the loop
module.

Signed-off-by: Dan Schatzberg <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
---
drivers/block/loop.c | 59 ++++++++++++++++++++++++--------------
drivers/block/loop.h | 3 +-
include/linux/memcontrol.h | 6 ++++
kernel/cgroup/cgroup.c | 1 +
4 files changed, 47 insertions(+), 22 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index a9b05cacc393..da8ec0b9d909 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -77,6 +77,7 @@
#include <linux/uio.h>
#include <linux/ioprio.h>
#include <linux/blk-cgroup.h>
+#include <linux/sched/mm.h>

#include "loop.h"

@@ -504,8 +505,6 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret, long ret2)
{
struct loop_cmd *cmd = container_of(iocb, struct loop_cmd, iocb);

- if (cmd->css)
- css_put(cmd->css);
cmd->ret = ret;
lo_rw_aio_do_completion(cmd);
}
@@ -566,8 +565,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);
- if (cmd->css)
- kthread_associate_blkcg(cmd->css);

if (rw == WRITE)
ret = call_write_iter(file, &cmd->iocb, &iter);
@@ -575,7 +572,6 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
ret = call_read_iter(file, &cmd->iocb, &iter);

lo_rw_aio_do_completion(cmd);
- kthread_associate_blkcg(NULL);

if (ret != -EIOCBQUEUED)
cmd->iocb.ki_complete(&cmd->iocb, ret, 0);
@@ -913,7 +909,7 @@ struct loop_worker {
struct list_head cmd_list;
struct list_head idle_list;
struct loop_device *lo;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
unsigned long last_ran_at;
};

@@ -930,7 +926,7 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)

spin_lock_irq(&lo->lo_lock);

- if (!cmd->css)
+ if (!cmd->blkcg_css)
goto queue_work;

node = &lo->worker_tree.rb_node;
@@ -938,10 +934,10 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
while (*node) {
parent = *node;
cur_worker = container_of(*node, struct loop_worker, rb_node);
- if (cur_worker->css == cmd->css) {
+ if (cur_worker->blkcg_css == cmd->blkcg_css) {
worker = cur_worker;
break;
- } else if ((long)cur_worker->css < (long)cmd->css) {
+ } else if ((long)cur_worker->blkcg_css < (long)cmd->blkcg_css) {
node = &(*node)->rb_left;
} else {
node = &(*node)->rb_right;
@@ -953,13 +949,16 @@ static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
/*
* In the event we cannot allocate a worker, just queue on the
- * rootcg worker
+ * rootcg worker and issue the I/O as the rootcg
*/
- if (!worker)
+ if (!worker) {
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
goto queue_work;
+ }

- worker->css = cmd->css;
- css_get(worker->css);
+ worker->blkcg_css = cmd->blkcg_css;
+ css_get(worker->blkcg_css);
INIT_WORK(&worker->work, loop_workfn);
INIT_LIST_HEAD(&worker->cmd_list);
INIT_LIST_HEAD(&worker->idle_list);
@@ -1215,7 +1214,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
idle_list) {
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
spin_unlock_irq(&lo->lo_lock);
@@ -2024,13 +2023,18 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
}

/* always use the first bio's css */
+ cmd->blkcg_css = NULL;
+ cmd->memcg_css = NULL;
#ifdef CONFIG_BLK_CGROUP
- if (cmd->use_aio && rq->bio && rq->bio->bi_blkg) {
- cmd->css = &bio_blkcg(rq->bio)->css;
- css_get(cmd->css);
- } else
+ if (rq->bio && rq->bio->bi_blkg) {
+ cmd->blkcg_css = &bio_blkcg(rq->bio)->css;
+#ifdef CONFIG_MEMCG
+ cmd->memcg_css =
+ cgroup_get_e_css(cmd->blkcg_css->cgroup,
+ &memory_cgrp_subsys);
+#endif
+ }
#endif
- cmd->css = NULL;
loop_queue_work(lo, cmd);

return BLK_STS_OK;
@@ -2048,8 +2052,21 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
goto failed;
}

+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(cmd->blkcg_css);
+ if (cmd->memcg_css)
+ memalloc_use_memcg(mem_cgroup_from_css(cmd->memcg_css));
+
ret = do_req_filebacked(lo, rq);
- failed:
+
+ if (cmd->blkcg_css)
+ kthread_associate_blkcg(NULL);
+
+ if (cmd->memcg_css) {
+ memalloc_unuse_memcg();
+ css_put(cmd->memcg_css);
+ }
+failed:
/* complete non-aio request */
if (!cmd->use_aio || ret) {
cmd->ret = ret ? -EIO : 0;
@@ -2123,7 +2140,7 @@ static void loop_free_idle_workers(struct timer_list *timer)
break;
list_del(&worker->idle_list);
rb_erase(&worker->rb_node, &lo->worker_tree);
- css_put(worker->css);
+ css_put(worker->blkcg_css);
kfree(worker);
}
if (!list_empty(&lo->idle_worker_list))
diff --git a/drivers/block/loop.h b/drivers/block/loop.h
index 87fd0e372227..3e65acf7a0e9 100644
--- a/drivers/block/loop.h
+++ b/drivers/block/loop.h
@@ -74,7 +74,8 @@ struct loop_cmd {
long ret;
struct kiocb iocb;
struct bio_vec *bvec;
- struct cgroup_subsys_state *css;
+ struct cgroup_subsys_state *blkcg_css;
+ struct cgroup_subsys_state *memcg_css;
};

/* Support for loadable transfer modules */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a7a0a1a5c8d5..aeb51f2ded46 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -922,6 +922,12 @@ static inline struct mem_cgroup *get_mem_cgroup_from_page(struct page *page)
return NULL;
}

+static inline
+struct mem_cgroup *mem_cgroup_from_css(struct cgroup_subsys_state *css)
+{
+ return NULL;
+}
+
static inline void mem_cgroup_put(struct mem_cgroup *memcg)
{
}
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 75f687301bbf..c3896c2e0942 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -587,6 +587,7 @@ struct cgroup_subsys_state *cgroup_get_e_css(struct cgroup *cgrp,
rcu_read_unlock();
return css;
}
+EXPORT_SYMBOL_GPL(cgroup_get_e_css);

static void cgroup_get_live(struct cgroup *cgrp)
{
--
2.17.1

2020-02-26 17:03:27

by Qian Cai

Subject: Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

On Mon, 2020-02-24 at 17:17 -0500, Dan Schatzberg wrote:
> Existing uses of loop device may have multiple cgroups reading/writing
> to the same device. Simply charging resources for I/O to the backing
> file could result in priority inversion where one cgroup gets
> synchronously blocked, holding up all other I/O to the loop device.
>
> In order to avoid this priority inversion, we use a single workqueue
> where each work item is a "struct loop_worker" which contains a queue of
> struct loop_cmds to issue. The loop device maintains a tree mapping blk
> css_id -> loop_worker. This allows each cgroup to independently make
> forward progress issuing I/O to the backing file.
>
> There is also a single queue for I/O associated with the rootcg which
> can be used in cases of extreme memory shortage where we cannot allocate
> a loop_worker.
>
> The locking for the tree and queues is fairly heavy handed - we acquire
> the per-loop-device spinlock any time either is accessed. The existing
> implementation serializes all I/O through a single thread anyways, so I
> don't believe this is any worse.
>
> Signed-off-by: Dan Schatzberg <[email protected]>
> Acked-by: Johannes Weiner <[email protected]>

The locking in loop_free_idle_workers() will trigger this with sysfs reading,

[ 7080.047167] LTP: starting read_all_sys (read_all -d /sys -q -r 10)
[ 7239.842276] cpufreq transition table exceeds PAGE_SIZE. Disabling

[ 7247.054961] =====================================================
[ 7247.054971] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[ 7247.054983] 5.6.0-rc3-next-20200226 #2 Tainted: G           O     
[ 7247.054992] -----------------------------------------------------
[ 7247.055002] read_all/8513 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
[ 7247.055014] c0000006844864c8 (&fs->seq){+.+.}, at: file_path+0x24/0x40
[ 7247.055041] 
               and this task is already holding:
[ 7247.055061] c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at:
loop_attr_do_show_backing_file+0x3c/0x120 [loop]
[ 7247.055078] which would create a new lock dependency:
[ 7247.055105]  (&(&lo->lo_lock)->rlock){..-.} -> (&fs->seq){+.+.}
[ 7247.055125] 
               but this new dependency connects a SOFTIRQ-irq-safe lock:
[ 7247.055155]  (&(&lo->lo_lock)->rlock){..-.}
[ 7247.055156] 
               ... which became SOFTIRQ-irq-safe at:
[ 7247.055196]   lock_acquire+0x130/0x360
[ 7247.055221]   _raw_spin_lock_irq+0x68/0x90
[ 7247.055230]   loop_free_idle_workers+0x44/0x3f0 [loop]
[ 7247.055242]   call_timer_fn+0x110/0x5f0
[ 7247.055260]   run_timer_softirq+0x8f8/0x9f0
[ 7247.055278]   __do_softirq+0x34c/0x8c8
[ 7247.055288]   irq_exit+0x16c/0x1d0
[ 7247.055298]   timer_interrupt+0x1f0/0x680
[ 7247.055308]   decrementer_common+0x124/0x130
[ 7247.055328]   arch_local_irq_restore.part.8+0x34/0x90
[ 7247.055352]   cpuidle_enter_state+0x11c/0x8f0
[ 7247.055361]   cpuidle_enter+0x50/0x70
[ 7247.055389]   call_cpuidle+0x4c/0x90
[ 7247.055398]   do_idle+0x378/0x470
[ 7247.055414]   cpu_startup_entry+0x3c/0x40
[ 7247.055442]   start_secondary+0x7a8/0xa80
[ 7247.055461]   start_secondary_prolog+0x10/0x14
[ 7247.055478] 
               to a SOFTIRQ-irq-unsafe lock:
[ 7247.055506]  (&fs->seq){+.+.}
[ 7247.055507] 
               ... which became SOFTIRQ-irq-unsafe at:
[ 7247.055541] ...
[ 7247.055542]   lock_acquire+0x130/0x360
[ 7247.055577]   set_fs_pwd+0x98/0x140
[ 7247.055593]   ksys_chdir+0x8c/0x100
[ 7247.055611]   devtmpfsd+0xac/0xe0
[ 7247.055629]   kthread+0x1a8/0x1b0
[ 7247.055645]   ret_from_kernel_thread+0x5c/0x74
[ 7247.055663] 
               other info that might help us debug this:

[ 7247.055685]  Possible interrupt unsafe locking scenario:

[ 7247.055695]        CPU0                    CPU1
[ 7247.055712]        ----                    ----
[ 7247.055719]   lock(&fs->seq);
[ 7247.055748]                                local_irq_disable();
[ 7247.055776]                                lock(&(&lo->lo_lock)->rlock);
[ 7247.055805]                                lock(&fs->seq);
[ 7247.055825]   <Interrupt>
[ 7247.055840]     lock(&(&lo->lo_lock)->rlock);
[ 7247.055867] 
                *** DEADLOCK ***

[ 7247.055896] 5 locks held by read_all/8513:
[ 7247.055913]  #0: c000001afc46d708 (&p->lock){+.+.}, at: seq_read+0x60/0x620
[ 7247.055944]  #1: c000000f6ce00e80 (&of->mutex){+.+.}, at:
kernfs_seq_start+0x40/0x130
[ 7247.055981]  #2: c0002011b81e0010 (kn->count#685){.+.+}, at:
kernfs_seq_start+0x4c/0x130
[ 7247.056013]  #3: c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at:
loop_attr_do_show_backing_file+0x3c/0x120 [loop]
[ 7247.056048]  #4: c000000001500888 (rcu_read_lock){....}, at:
d_path+0xa0/0x3f0
[ 7247.056072] 
               the dependencies between SOFTIRQ-irq-safe lock and the holding
lock:
[ 7247.056104] -> (&(&lo->lo_lock)->rlock){..-.} ops: 196198 {
[ 7247.056144]    IN-SOFTIRQ-W at:
[ 7247.056161]                     lock_acquire+0x130/0x360
[ 7247.056180]                     _raw_spin_lock_irq+0x68/0x90
[ 7247.056208]                     loop_free_idle_workers+0x44/0x3f0 [loop]
[ 7247.056238]                     call_timer_fn+0x110/0x5f0
[ 7247.056271]                     run_timer_softirq+0x8f8/0x9f0
[ 7247.056290]                     __do_softirq+0x34c/0x8c8
[ 7247.056318]                     irq_exit+0x16c/0x1d0
[ 7247.056337]                     timer_interrupt+0x1f0/0x680
[ 7247.056366]                     decrementer_common+0x124/0x130
[ 7247.056394]                     arch_local_irq_restore.part.8+0x34/0x90
[ 7247.056432]                     cpuidle_enter_state+0x11c/0x8f0
[ 7247.056460]                     cpuidle_enter+0x50/0x70
[ 7247.056480]                     call_cpuidle+0x4c/0x90
[ 7247.056498]                     do_idle+0x378/0x470
[ 7247.056516]                     cpu_startup_entry+0x3c/0x40
[ 7247.056535]                     start_secondary+0x7a8/0xa80
[ 7247.056554]                     start_secondary_prolog+0x10/0x14
[ 7247.056572]    INITIAL USE at:
[ 7247.056589]                    lock_acquire+0x130/0x360
[ 7247.056618]                    _raw_spin_lock_irq+0x68/0x90
[ 7247.056671]                    loop_queue_rq+0xf4/0x540 [loop]
[ 7247.056737]                    __blk_mq_try_issue_directly+0x1ac/0x2e0
[ 7247.056850]                    blk_mq_request_issue_directly+0x64/0xb0
[ 7247.056924]                    blk_mq_try_issue_list_directly+0x90/0x140
[ 7247.056988]                    blk_mq_sched_insert_requests+0x1c0/0x4b0
[ 7247.057070]                    blk_mq_flush_plug_list+0x1d0/0x4b0
[ 7247.057155]                    blk_flush_plug_list+0x148/0x1a0
[ 7247.057221]                    blk_finish_plug+0x50/0x6c
[ 7247.057285]                    read_pages+0xac/0x320
[ 7247.057344]                    __do_page_cache_readahead+0x198/0x350
[ 7247.057428]                    force_page_cache_readahead+0xcc/0x1b0
[ 7247.057505]                    generic_file_buffered_read+0x7d4/0xdd0
[ 7247.057577]                    blkdev_read_iter+0x50/0x80
[ 7247.057658]                    new_sync_read+0x15c/0x1f0
[ 7247.057723]                    vfs_read+0xac/0x170
[ 7247.057786]                    ksys_read+0x7c/0x140
[ 7247.057856]                    system_call+0x5c/0x68
[ 7247.057902]  }
[ 7247.057935]  ... key      at: [<c00800000fa6bbc8>]
__key.50783+0x0/0xffffffffffffc9b8 [loop]
[ 7247.058028]  ... acquired at:
[ 7247.058074]    lock_acquire+0x130/0x360
[ 7247.058126]    d_path+0x194/0x3f0
[ 7247.058174]    file_path+0x24/0x40
[ 7247.058225]    loop_attr_do_show_backing_file+0x5c/0x120 [loop]
[ 7247.058305]    dev_attr_show+0x44/0xb0
[ 7247.058354]    sysfs_kf_seq_show+0xbc/0x1d0
[ 7247.058421]    kernfs_seq_show+0x44/0x60
[ 7247.058458]    seq_read+0x1c4/0x620
[ 7247.058504]    kernfs_fop_read+0x40/0x2b0
[ 7247.058559]    __vfs_read+0x3c/0x70
[ 7247.058607]    vfs_read+0xac/0x170
[ 7247.058659]    ksys_read+0x7c/0x140
[ 7247.058722]    system_call+0x5c/0x68

[ 7247.058778] 
               the dependencies between the lock to be acquired
[ 7247.058779]  and SOFTIRQ-irq-unsafe lock:
[ 7247.058924] -> (&fs->seq){+.+.} ops: 22599070 {
[ 7247.058992]    HARDIRQ-ON-W at:
[ 7247.059043]                     lock_acquire+0x130/0x360
[ 7247.059108]                     set_fs_pwd+0x98/0x140
[ 7247.059179]                     ksys_chdir+0x8c/0x100
[ 7247.059262]                     devtmpfsd+0xac/0xe0
[ 7247.059318]                     kthread+0x1a8/0x1b0
[ 7247.059369]                     ret_from_kernel_thread+0x5c/0x74
[ 7247.059453]    SOFTIRQ-ON-W at:
[ 7247.059498]                     lock_acquire+0x130/0x360
[ 7247.059545]                     set_fs_pwd+0x98/0x140
[ 7247.059614]                     ksys_chdir+0x8c/0x100
[ 7247.059675]                     devtmpfsd+0xac/0xe0
[ 7247.059745]                     kthread+0x1a8/0x1b0
[ 7247.059804]                     ret_from_kernel_thread+0x5c/0x74
[ 7247.059884]    INITIAL USE at:
[ 7247.059929]                    lock_acquire+0x130/0x360
[ 7247.059994]                    set_root+0x118/0x220
[ 7247.060066]                    nd_jump_root+0x100/0x170
[ 7247.060111]                    path_init+0x640/0x7b0
[ 7247.060193]                    path_lookupat+0x3c/0x1b0
[ 7247.060257]                    filename_lookup.part.60+0xa0/0x170
[ 7247.060323]                    do_mount+0xa8/0xc00
[ 7247.060381]                    devtmpfsd+0x94/0xe0
[ 7247.060436]                    kthread+0x1a8/0x1b0
[ 7247.060481]                    ret_from_kernel_thread+0x5c/0x74
[ 7247.060563]  }
[ 7247.060597]  ... key      at: [<c00000000509c9e8>] __key.27999+0x0/0x10
[ 7247.060676]  ... acquired at:
[ 7247.060723]    lock_acquire+0x130/0x360
[ 7247.060768]    d_path+0x194/0x3f0
[ 7247.060810]    file_path+0x24/0x40
[ 7247.060869]    loop_attr_do_show_backing_file+0x5c/0x120 [loop]
[ 7247.060931]    dev_attr_show+0x44/0xb0
[ 7247.060988]    sysfs_kf_seq_show+0xbc/0x1d0
[ 7247.061042]    kernfs_seq_show+0x44/0x60
[ 7247.061116]    seq_read+0x1c4/0x620
[ 7247.061155]    kernfs_fop_read+0x40/0x2b0
[ 7247.061204]    __vfs_read+0x3c/0x70
[ 7247.061249]    vfs_read+0xac/0x170
[ 7247.061292]    ksys_read+0x7c/0x140
[ 7247.061341]    system_call+0x5c/0x68

> ---
> drivers/block/loop.c | 207 +++++++++++++++++++++++++++++++++++++------
> drivers/block/loop.h | 11 ++-
> 2 files changed, 188 insertions(+), 30 deletions(-)
>
> diff --git a/drivers/block/loop.c b/drivers/block/loop.c
> index 739b372a5112..a9b05cacc393 100644
> --- a/drivers/block/loop.c
> +++ b/drivers/block/loop.c
> @@ -70,7 +70,6 @@
> #include <linux/writeback.h>
> #include <linux/completion.h>
> #include <linux/highmem.h>
> -#include <linux/kthread.h>
> #include <linux/splice.h>
> #include <linux/sysfs.h>
> #include <linux/miscdevice.h>
> @@ -83,6 +82,8 @@
>
> #include <linux/uaccess.h>
>
> +#define LOOP_IDLE_WORKER_TIMEOUT (60 * HZ)
> +
> static DEFINE_IDR(loop_index_idr);
> static DEFINE_MUTEX(loop_ctl_mutex);
>
> @@ -891,27 +892,100 @@ static void loop_config_discard(struct loop_device *lo)
>
> static void loop_unprepare_queue(struct loop_device *lo)
> {
> - kthread_flush_worker(&lo->worker);
> - kthread_stop(lo->worker_task);
> -}
> -
> -static int loop_kthread_worker_fn(void *worker_ptr)
> -{
> - current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
> - return kthread_worker_fn(worker_ptr);
> + destroy_workqueue(lo->workqueue);
> }
>
> static int loop_prepare_queue(struct loop_device *lo)
> {
> - kthread_init_worker(&lo->worker);
> - lo->worker_task = kthread_run(loop_kthread_worker_fn,
> - &lo->worker, "loop%d", lo->lo_number);
> - if (IS_ERR(lo->worker_task))
> + lo->workqueue = alloc_workqueue("loop%d",
> + WQ_UNBOUND | WQ_FREEZABLE |
> + WQ_MEM_RECLAIM,
> + lo->lo_number);
> + if (IS_ERR(lo->workqueue))
> return -ENOMEM;
> - set_user_nice(lo->worker_task, MIN_NICE);
> +
> return 0;
> }
>
> +struct loop_worker {
> + struct rb_node rb_node;
> + struct work_struct work;
> + struct list_head cmd_list;
> + struct list_head idle_list;
> + struct loop_device *lo;
> + struct cgroup_subsys_state *css;
> + unsigned long last_ran_at;
> +};
> +
> +static void loop_workfn(struct work_struct *work);
> +static void loop_rootcg_workfn(struct work_struct *work);
> +static void loop_free_idle_workers(struct timer_list *timer);
> +
> +static void loop_queue_work(struct loop_device *lo, struct loop_cmd *cmd)
> +{
> + struct rb_node **node = &(lo->worker_tree.rb_node), *parent = NULL;
> + struct loop_worker *cur_worker, *worker = NULL;
> + struct work_struct *work;
> + struct list_head *cmd_list;
> +
> + spin_lock_irq(&lo->lo_lock);
> +
> + if (!cmd->css)
> + goto queue_work;
> +
> + node = &lo->worker_tree.rb_node;
> +
> + while (*node) {
> + parent = *node;
> + cur_worker = container_of(*node, struct loop_worker, rb_node);
> + if (cur_worker->css == cmd->css) {
> + worker = cur_worker;
> + break;
> + } else if ((long)cur_worker->css < (long)cmd->css) {
> + node = &(*node)->rb_left;
> + } else {
> + node = &(*node)->rb_right;
> + }
> + }
> + if (worker)
> + goto queue_work;
> +
> + worker = kzalloc(sizeof(struct loop_worker), GFP_NOWAIT | __GFP_NOWARN);
> + /*
> + * In the event we cannot allocate a worker, just queue on the
> + * rootcg worker
> + */
> + if (!worker)
> + goto queue_work;
> +
> + worker->css = cmd->css;
> + css_get(worker->css);
> + INIT_WORK(&worker->work, loop_workfn);
> + INIT_LIST_HEAD(&worker->cmd_list);
> + INIT_LIST_HEAD(&worker->idle_list);
> + worker->lo = lo;
> + rb_link_node(&worker->rb_node, parent, node);
> + rb_insert_color(&worker->rb_node, &lo->worker_tree);
> +queue_work:
> + if (worker) {
> + /*
> + * We need to remove from the idle list here while
> + * holding the lock so that the idle timer doesn't
> + * free the worker
> + */
> + if (!list_empty(&worker->idle_list))
> + list_del_init(&worker->idle_list);
> + work = &worker->work;
> + cmd_list = &worker->cmd_list;
> + } else {
> + work = &lo->rootcg_work;
> + cmd_list = &lo->rootcg_cmd_list;
> + }
> + list_add_tail(&cmd->list_entry, cmd_list);
> + queue_work(lo->workqueue, work);
> + spin_unlock_irq(&lo->lo_lock);
> +}
> +
> static void loop_update_rotational(struct loop_device *lo)
> {
> struct file *file = lo->lo_backing_file;
> @@ -993,6 +1067,12 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
>
> set_device_ro(bdev, (lo_flags & LO_FLAGS_READ_ONLY) != 0);
>
> + INIT_WORK(&lo->rootcg_work, loop_rootcg_workfn);
> + INIT_LIST_HEAD(&lo->rootcg_cmd_list);
> + INIT_LIST_HEAD(&lo->idle_worker_list);
> + lo->worker_tree = RB_ROOT;
> + timer_setup(&lo->timer, loop_free_idle_workers,
> + TIMER_DEFERRABLE);
> lo->use_dio = false;
> lo->lo_device = bdev;
> lo->lo_flags = lo_flags;
> @@ -1101,6 +1181,7 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> int err = 0;
> bool partscan = false;
> int lo_number;
> + struct loop_worker *pos, *worker;
>
> mutex_lock(&loop_ctl_mutex);
> if (WARN_ON_ONCE(lo->lo_state != Lo_rundown)) {
> @@ -1117,9 +1198,28 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
> /* freeze request queue during the transition */
> blk_mq_freeze_queue(lo->lo_queue);
>
> + /*
> + * Ordering here is a bit tricky:
> + *
> + * 1) flush/destroy workqueue without lock held so outstanding
> + * I/O is issued and all active workers go idle
> + *
> + * 2) Grab lock, free all idle workers
> + *
> + * 3) unlock, del_timer_sync so if timer raced it will be a no-op
> + */
> + loop_unprepare_queue(lo);
> spin_lock_irq(&lo->lo_lock);
> lo->lo_backing_file = NULL;
> + list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
> + idle_list) {
> + list_del(&worker->idle_list);
> + rb_erase(&worker->rb_node, &lo->worker_tree);
> + css_put(worker->css);
> + kfree(worker);
> + }
> spin_unlock_irq(&lo->lo_lock);
> + del_timer_sync(&lo->timer);
>
> loop_release_xfer(lo);
> lo->transfer = NULL;
> @@ -1154,7 +1254,6 @@ static int __loop_clr_fd(struct loop_device *lo, bool release)
>
> partscan = lo->lo_flags & LO_FLAGS_PARTSCAN && bdev;
> lo_number = lo->lo_number;
> - loop_unprepare_queue(lo);
> out_unlock:
> mutex_unlock(&loop_ctl_mutex);
> if (partscan) {
> @@ -1932,7 +2031,7 @@ static blk_status_t loop_queue_rq(struct blk_mq_hw_ctx *hctx,
> } else
> #endif
> cmd->css = NULL;
> - kthread_queue_work(&lo->worker, &cmd->work);
> + loop_queue_work(lo, cmd);
>
> return BLK_STS_OK;
> }
> @@ -1958,26 +2057,82 @@ static void loop_handle_cmd(struct loop_cmd *cmd)
> }
> }
>
> -static void loop_queue_work(struct kthread_work *work)
> +static void loop_set_timer(struct loop_device *lo)
> +{
> + timer_reduce(&lo->timer, jiffies + LOOP_IDLE_WORKER_TIMEOUT);
> +}
> +
> +static void loop_process_work(struct loop_worker *worker,
> + struct list_head *cmd_list, struct loop_device *lo)
> {
> - struct loop_cmd *cmd =
> - container_of(work, struct loop_cmd, work);
> + int orig_flags = current->flags;
> + struct loop_cmd *cmd;
>
> - loop_handle_cmd(cmd);
> + current->flags |= PF_LESS_THROTTLE | PF_MEMALLOC_NOIO;
> + spin_lock_irq(&lo->lo_lock);
> + while (!list_empty(cmd_list)) {
> + cmd = container_of(
> + cmd_list->next, struct loop_cmd, list_entry);
> + list_del(cmd_list->next);
> + spin_unlock_irq(&lo->lo_lock);
> +
> + loop_handle_cmd(cmd);
> + cond_resched();
> +
> + spin_lock_irq(&lo->lo_lock);
> + }
> +
> + /*
> + * We only add to the idle list if there are no pending cmds
> + * *and* the worker will not run again which ensures that it
> + * is safe to free any worker on the idle list
> + */
> + if (worker && !work_pending(&worker->work)) {
> + worker->last_ran_at = jiffies;
> + list_add_tail(&worker->idle_list, &lo->idle_worker_list);
> + loop_set_timer(lo);
> + }
> + spin_unlock_irq(&lo->lo_lock);
> + current->flags = orig_flags;
> }
>
> -static int loop_init_request(struct blk_mq_tag_set *set, struct request *rq,
> - unsigned int hctx_idx, unsigned int numa_node)
> +static void loop_workfn(struct work_struct *work)
> {
> - struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
> + struct loop_worker *worker =
> + container_of(work, struct loop_worker, work);
> + loop_process_work(worker, &worker->cmd_list, worker->lo);
> +}
>
> - kthread_init_work(&cmd->work, loop_queue_work);
> - return 0;
> +static void loop_rootcg_workfn(struct work_struct *work)
> +{
> + struct loop_device *lo =
> + container_of(work, struct loop_device, rootcg_work);
> + loop_process_work(NULL, &lo->rootcg_cmd_list, lo);
> +}
> +
> +static void loop_free_idle_workers(struct timer_list *timer)
> +{
> + struct loop_device *lo = container_of(timer, struct loop_device, timer);
> + struct loop_worker *pos, *worker;
> +
> + spin_lock_irq(&lo->lo_lock);
> + list_for_each_entry_safe(worker, pos, &lo->idle_worker_list,
> + idle_list) {
> + if (time_is_after_jiffies(worker->last_ran_at +
> + LOOP_IDLE_WORKER_TIMEOUT))
> + break;
> + list_del(&worker->idle_list);
> + rb_erase(&worker->rb_node, &lo->worker_tree);
> + css_put(worker->css);
> + kfree(worker);
> + }
> + if (!list_empty(&lo->idle_worker_list))
> + loop_set_timer(lo);
> + spin_unlock_irq(&lo->lo_lock);
> }
>
> static const struct blk_mq_ops loop_mq_ops = {
> .queue_rq = loop_queue_rq,
> - .init_request = loop_init_request,
> .complete = lo_complete_rq,
> };
>
> diff --git a/drivers/block/loop.h b/drivers/block/loop.h
> index af75a5ee4094..87fd0e372227 100644
> --- a/drivers/block/loop.h
> +++ b/drivers/block/loop.h
> @@ -14,7 +14,6 @@
> #include <linux/blk-mq.h>
> #include <linux/spinlock.h>
> #include <linux/mutex.h>
> -#include <linux/kthread.h>
> #include <uapi/linux/loop.h>
>
> /* Possible states of device */
> @@ -54,8 +53,12 @@ struct loop_device {
>
> spinlock_t lo_lock;
> int lo_state;
> - struct kthread_worker worker;
> - struct task_struct *worker_task;
> + struct workqueue_struct *workqueue;
> + struct work_struct rootcg_work;
> + struct list_head rootcg_cmd_list;
> + struct list_head idle_worker_list;
> + struct rb_root worker_tree;
> + struct timer_list timer;
> bool use_dio;
> bool sysfs_inited;
>
> @@ -65,7 +68,7 @@ struct loop_device {
> };
>
> struct loop_cmd {
> - struct kthread_work work;
> + struct list_head list_entry;
> bool use_aio; /* use AIO interface to handle I/O */
> atomic_t ref; /* only for aio */
> long ret;

2020-02-27 18:16:09

by Johannes Weiner

Subject: Re: [PATCH v3 1/3] loop: Use worker per cgroup instead of kworker

On Wed, Feb 26, 2020 at 12:02:38PM -0500, Qian Cai wrote:
> On Mon, 2020-02-24 at 17:17 -0500, Dan Schatzberg wrote:
> > Existing uses of loop device may have multiple cgroups reading/writing
> > to the same device. Simply charging resources for I/O to the backing
> > file could result in priority inversion where one cgroup gets
> > synchronously blocked, holding up all other I/O to the loop device.
> >
> > In order to avoid this priority inversion, we use a single workqueue
> > where each work item is a "struct loop_worker" which contains a queue of
> > struct loop_cmds to issue. The loop device maintains a tree mapping blk
> > css_id -> loop_worker. This allows each cgroup to independently make
> > forward progress issuing I/O to the backing file.
> >
> > There is also a single queue for I/O associated with the rootcg which
> > can be used in cases of extreme memory shortage where we cannot allocate
> > a loop_worker.
> >
> > The locking for the tree and queues is fairly heavy handed - we acquire
> > the per-loop-device spinlock any time either is accessed. The existing
> > implementation serializes all I/O through a single thread anyways, so I
> > don't believe this is any worse.
> >
> > Signed-off-by: Dan Schatzberg <[email protected]>
> > Acked-by: Johannes Weiner <[email protected]>
>
> The locking in loop_free_idle_workers() will trigger this with sysfs reading,
>
> [ 7080.047167] LTP: starting read_all_sys (read_all -d /sys -q -r 10)
> [ 7239.842276] cpufreq transition table exceeds PAGE_SIZE. Disabling
>
> [ 7247.054961] =====================================================
> [ 7247.054971] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
> [ 7247.054983] 5.6.0-rc3-next-20200226 #2 Tainted: G           O
> [ 7247.054992] -----------------------------------------------------
> [ 7247.055002] read_all/8513 [HC0[0]:SC0[0]:HE0:SE1] is trying to acquire:
> [ 7247.055014] c0000006844864c8 (&fs->seq){+.+.}, at: file_path+0x24/0x40
> [ 7247.055041]
>                and this task is already holding:
> [ 7247.055061] c0002006bab8b928 (&(&lo->lo_lock)->rlock){..-.}, at:
> loop_attr_do_show_backing_file+0x3c/0x120 [loop]
> [ 7247.055078] which would create a new lock dependency:
> [ 7247.055105]  (&(&lo->lo_lock)->rlock){..-.} -> (&fs->seq){+.+.}
> [ 7247.055125]
>                but this new dependency connects a SOFTIRQ-irq-safe lock:
> [ 7247.055155]  (&(&lo->lo_lock)->rlock){..-.}
> [ 7247.055156]
>                ... which became SOFTIRQ-irq-safe at:
> [ 7247.055196]   lock_acquire+0x130/0x360
> [ 7247.055221]   _raw_spin_lock_irq+0x68/0x90
> [ 7247.055230]   loop_free_idle_workers+0x44/0x3f0 [loop]
> [ 7247.055242]   call_timer_fn+0x110/0x5f0
> [ 7247.055260]   run_timer_softirq+0x8f8/0x9f0
> [ 7247.055278]   __do_softirq+0x34c/0x8c8
> [ 7247.055288]   irq_exit+0x16c/0x1d0
> [ 7247.055298]   timer_interrupt+0x1f0/0x680
> [ 7247.055308]   decrementer_common+0x124/0x130
> [ 7247.055328]   arch_local_irq_restore.part.8+0x34/0x90
> [ 7247.055352]   cpuidle_enter_state+0x11c/0x8f0
> [ 7247.055361]   cpuidle_enter+0x50/0x70
> [ 7247.055389]   call_cpuidle+0x4c/0x90
> [ 7247.055398]   do_idle+0x378/0x470
> [ 7247.055414]   cpu_startup_entry+0x3c/0x40
> [ 7247.055442]   start_secondary+0x7a8/0xa80
> [ 7247.055461]   start_secondary_prolog+0x10/0x14

That's kind of hilarious.

So even though it's a spin_lock_irq(), suggesting it's used from both
process and irq context, Dan appears to be adding the first user that
actually runs from irq context. It looks like it should have been a
regular spinlock all along. Until now, anyway.

Fixing it should be straightforward. Use get_file() under the lock to
pin the file, drop the lock to do file_path(), then release the file
with fput().
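
Something like the below should do it (untested sketch of
loop_attr_backing_file_show(); it assumes the current function body and
only changes the pinning and lock ordering):

static ssize_t loop_attr_backing_file_show(struct loop_device *lo, char *buf)
{
        ssize_t ret;
        struct file *filp = NULL;
        char *p = NULL;

        /*
         * Pin the backing file under lo_lock instead of calling
         * file_path() with the lock held.
         */
        spin_lock_irq(&lo->lo_lock);
        if (lo->lo_backing_file)
                filp = get_file(lo->lo_backing_file);
        spin_unlock_irq(&lo->lo_lock);

        /* fs->seq can now be taken without lo_lock held */
        if (filp) {
                p = file_path(filp, buf, PAGE_SIZE - 1);
                fput(filp);
        }

        if (IS_ERR_OR_NULL(p))
                ret = PTR_ERR(p);
        else {
                ret = strlen(p);
                memmove(buf, p, ret);
                buf[ret++] = '\n';
                buf[ret] = 0;
        }

        return ret;
}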