Hello, Jens.
While reading through blk-mq, I spotted several issues and this
patchset addresses them. The biggest one is how freezing is
implemented. Coupling it with bypassing doesn't seem to work well and
there's a subtle bug in the percpu switch implementation.
I don't think open-coding this level of percpu logic is a good idea.
It tends to be very error-prone and difficult to follow. The barrier
problem is the only thing I spotted but it's very difficult to say
that it's correct. percpu_ref already implements most of what's
necessary to implement this sort of percpu switch and I added the
missing bits in a recent patchset and converted the blk-mq freezing
mechanism to use it in this patchset.
It's far simpler and easier to wrap one's head around, and, as it's
shared with other mechanisms including aio and cgroups, we should have
better testing coverage and more eyes scrutinizing it.
This patchset contains the following six patches.
0001-blk-mq-make-blk_mq_stop_hw_queue-reliably-block-queu.patch
0002-blk-mq-fix-a-memory-ordering-bug-in-blk_mq_queue_ent.patch
0003-block-blk-mq-draining-can-t-be-skipped-even-if-bypas.patch
0004-blk-mq-decouble-blk-mq-freezing-from-generic-bypassi.patch
0005-blk-mq-collapse-__blk_mq_drain_queue-into-blk_mq_fre.patch
0006-blk-mq-use-percpu_ref-for-mq-usage-count.patch
0001-0003 are fix patches that can be applied separately.
0004 decouples blk-mq freezing from queue bypassing.
0005-0006 replace the custom percpu switch with percpu_ref.
This patchset is on top of
percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
+[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
and available in the following git branch.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
block/blk-core.c | 13 ++++---
block/blk-mq.c | 90 ++++++++++++++++---------------------------------
block/blk-mq.h | 2 -
block/blk-sysfs.c | 2 -
include/linux/blkdev.h | 4 +-
5 files changed, 44 insertions(+), 67 deletions(-)
Thanks.
--
tejun
[1] http://lkml.kernel.org/g/[email protected]
blk-mq freezing is entangled with generic bypassing, which bypasses
blkcg and the io scheduler and lets IO requests fall through the block
layer to the drivers in FIFO order. This allows forward progress on
IOs with the advanced features disabled so that those features can be
configured or altered without worrying about stalling IO, which may
lead to deadlock through memory allocation.
However, generic bypassing doesn't quite fit blk-mq. blk-mq currently
doesn't make use of blkcg or ioscheds and it maps bypassing to
freezing, which blocks request processing and drains all the in-flight
ones. This causes problems as bypassing assumes that request
processing is online. blk-mq works around this by conditionally
allowing request processing for the problem case - during queue
initialization.
Another oddity is that, except during queue cleanup, bypassing
started on the generic side prevents blk-mq from processing new
requests but doesn't drain the in-flight ones. This shouldn't break
anything but again highlights that something isn't quite right here.
The root cause is conflating blk-mq freezing and generic bypassing
which are two different mechanisms. The only intersecting purpose
that they serve is during queue cleanup. Let's properly separate
blk-mq freezing from generic bypassing and simply use it where
necessary.
* request_queue->mq_freeze_depth is added and
blk_mq_[un]freeze_queue() now operate on this counter instead of
->bypass_depth. The replacement for QUEUE_FLAG_BYPASS isn't added
but the counter is tested directly. This will be further updated by
later changes.
* blk_mq_drain_queue() is dropped and "__" prefix is dropped from
blk_mq_freeze_queue(). Queue cleanup path now calls
blk_mq_freeze_queue() directly.
* blk_queue_enter()'s fast path condition is simplified to simply
check @q->mq_freeze_depth. Previously, the condition was
!blk_queue_dying(q) &&
(!blk_queue_bypass(q) || !blk_queue_init_done(q))
mq_freeze_depth is incremented right after dying is set, and the
blk_queue_init_done() exception isn't necessary as blk-mq doesn't
start frozen. That leaves only the blk_queue_bypass() test, which
can be replaced by the @q->mq_freeze_depth test.
This change simplifies the code and reduces confusion in the area.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
---
block/blk-core.c | 2 +-
block/blk-mq.c | 22 ++++++----------------
block/blk-mq.h | 2 +-
include/linux/blkdev.h | 1 +
4 files changed, 9 insertions(+), 18 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 2375130..0251947 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -514,7 +514,7 @@ void blk_cleanup_queue(struct request_queue *q)
* prevent that q->request_fn() gets invoked after draining finished.
*/
if (q->mq_ops) {
- blk_mq_drain_queue(q);
+ blk_mq_freeze_queue(q);
spin_lock_irq(lock);
} else {
spin_lock_irq(lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index c0122a4..f49b2ab 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -84,15 +84,14 @@ static int blk_mq_queue_enter(struct request_queue *q)
smp_mb();
/* we have problems freezing the queue if it's initializing */
- if (!blk_queue_dying(q) &&
- (!blk_queue_bypass(q) || !blk_queue_init_done(q)))
+ if (!q->mq_freeze_depth)
return 0;
__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
spin_lock_irq(q->queue_lock);
ret = wait_event_interruptible_lock_irq(q->mq_freeze_wq,
- !blk_queue_bypass(q) || blk_queue_dying(q),
+ !q->mq_freeze_depth || blk_queue_dying(q),
*q->queue_lock);
/* inc usage with lock hold to avoid freeze_queue runs here */
if (!ret && !blk_queue_dying(q))
@@ -129,31 +128,22 @@ static void __blk_mq_drain_queue(struct request_queue *q)
* Guarantee no request is in use, so we can change any data structure of
* the queue afterward.
*/
-static void blk_mq_freeze_queue(struct request_queue *q)
+void blk_mq_freeze_queue(struct request_queue *q)
{
spin_lock_irq(q->queue_lock);
- q->bypass_depth++;
- queue_flag_set(QUEUE_FLAG_BYPASS, q);
+ q->mq_freeze_depth++;
spin_unlock_irq(q->queue_lock);
__blk_mq_drain_queue(q);
}
-void blk_mq_drain_queue(struct request_queue *q)
-{
- __blk_mq_drain_queue(q);
-}
-
static void blk_mq_unfreeze_queue(struct request_queue *q)
{
bool wake = false;
spin_lock_irq(q->queue_lock);
- if (!--q->bypass_depth) {
- queue_flag_clear(QUEUE_FLAG_BYPASS, q);
- wake = true;
- }
- WARN_ON_ONCE(q->bypass_depth < 0);
+ wake = !--q->mq_freeze_depth;
+ WARN_ON_ONCE(q->mq_freeze_depth < 0);
spin_unlock_irq(q->queue_lock);
if (wake)
wake_up_all(&q->mq_freeze_wq);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 2646088..ca4964a 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -28,7 +28,7 @@ struct blk_mq_ctx {
void __blk_mq_complete_request(struct request *rq);
void blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_init_flush(struct request_queue *q);
-void blk_mq_drain_queue(struct request_queue *q);
+void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_free_queue(struct request_queue *q);
void blk_mq_clone_flush_request(struct request *flush_rq,
struct request *orig_rq);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 31e1105..2efe26a 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -470,6 +470,7 @@ struct request_queue {
struct mutex sysfs_lock;
int bypass_depth;
+ int mq_freeze_depth;
#if defined(CONFIG_BLK_DEV_BSG)
bsg_job_fn *bsg_job_fn;
--
1.9.3
Currently, blk-mq uses a percpu_counter to keep track of how many
usages are in flight. The percpu_counter is drained while freezing to
ensure that no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism.
This type of code has a relatively high chance of subtle bugs which
are extremely difficult to trigger and it's way too hairy to be open
coded in blk-mq. percpu_ref can serve the same purpose after the
recent changes. This patch replaces the open-coded per-cpu usage
counting and draining mechanism with percpu_ref.
blk_mq_queue_enter() performs tryget_live on the ref and exit()
performs put. blk_mq_freeze_queue() kills the ref and waits until the
reference count reaches zero. blk_mq_unfreeze_queue() revives the ref
and wakes up the waiters.
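As a rough sketch, the resulting gating pattern looks like this
(simplified: locking, the dying check and the retry loop are omitted,
and release_cb / wait_or_bail() are illustrative stand-ins, not the
actual identifiers):

  /* at init time: the ref starts out in per-cpu mode */
  percpu_ref_init(&q->mq_usage_counter, release_cb);

  /* enter: cheap per-cpu get which fails once the ref is killed */
  if (!percpu_ref_tryget_live(&q->mq_usage_counter))
          wait_or_bail();

  /* exit: matching put; after kill, the final put fires release_cb */
  percpu_ref_put(&q->mq_usage_counter);

  /* freeze: kill the per-cpu part and wait for the count to drain */
  percpu_ref_kill(&q->mq_usage_counter);
  wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));

  /* unfreeze: back to per-cpu mode, then release any waiters */
  percpu_ref_reinit(&q->mq_usage_counter);
  wake_up_all(&q->mq_freeze_wq);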
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
Cc: Kent Overstreet <[email protected]>
---
block/blk-mq.c | 68 +++++++++++++++++++++-----------------------------
include/linux/blkdev.h | 3 ++-
2 files changed, 31 insertions(+), 40 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index eda78c8..7b5701a 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -78,34 +78,32 @@ static void blk_mq_hctx_clear_pending(struct blk_mq_hw_ctx *hctx,
static int blk_mq_queue_enter(struct request_queue *q)
{
- int ret;
-
- __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
- smp_mb();
-
- /* we have problems freezing the queue if it's initializing */
- if (!q->mq_freeze_depth)
- return 0;
-
- __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
+ while (true) {
+ int ret;
- spin_lock_irq(q->queue_lock);
- ret = wait_event_interruptible_lock_irq(q->mq_freeze_wq,
- !q->mq_freeze_depth || blk_queue_dying(q),
- *q->queue_lock);
- /* inc usage with lock hold to avoid freeze_queue runs here */
- if (!ret && !blk_queue_dying(q))
- __percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
- else if (blk_queue_dying(q))
- ret = -ENODEV;
- spin_unlock_irq(q->queue_lock);
+ if (percpu_ref_tryget_live(&q->mq_usage_counter))
+ return 0;
- return ret;
+ ret = wait_event_interruptible(q->mq_freeze_wq,
+ !q->mq_freeze_depth || blk_queue_dying(q));
+ if (blk_queue_dying(q))
+ return -ENODEV;
+ if (ret)
+ return ret;
+ }
}
static void blk_mq_queue_exit(struct request_queue *q)
{
- __percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
+ percpu_ref_put(&q->mq_usage_counter);
+}
+
+static void blk_mq_usage_counter_release(struct percpu_ref *ref)
+{
+ struct request_queue *q =
+ container_of(ref, struct request_queue, mq_usage_counter);
+
+ wake_up_all(&q->mq_freeze_wq);
}
/*
@@ -118,18 +116,9 @@ void blk_mq_freeze_queue(struct request_queue *q)
q->mq_freeze_depth++;
spin_unlock_irq(q->queue_lock);
- while (true) {
- s64 count;
-
- spin_lock_irq(q->queue_lock);
- count = percpu_counter_sum(&q->mq_usage_counter);
- spin_unlock_irq(q->queue_lock);
-
- if (count == 0)
- break;
- blk_mq_run_queues(q, false);
- msleep(10);
- }
+ percpu_ref_kill(&q->mq_usage_counter);
+ blk_mq_run_queues(q, false);
+ wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->mq_usage_counter));
}
static void blk_mq_unfreeze_queue(struct request_queue *q)
@@ -140,8 +129,10 @@ static void blk_mq_unfreeze_queue(struct request_queue *q)
wake = !--q->mq_freeze_depth;
WARN_ON_ONCE(q->mq_freeze_depth < 0);
spin_unlock_irq(q->queue_lock);
- if (wake)
+ if (wake) {
+ percpu_ref_reinit(&q->mq_usage_counter);
wake_up_all(&q->mq_freeze_wq);
+ }
}
bool blk_mq_can_queue(struct blk_mq_hw_ctx *hctx)
@@ -1785,7 +1776,7 @@ struct request_queue *blk_mq_init_queue(struct blk_mq_tag_set *set)
if (!q)
goto err_hctxs;
- if (percpu_counter_init(&q->mq_usage_counter, 0))
+ if (percpu_ref_init(&q->mq_usage_counter, blk_mq_usage_counter_release))
goto err_map;
setup_timer(&q->timeout, blk_mq_rq_timer, (unsigned long) q);
@@ -1878,7 +1869,7 @@ void blk_mq_free_queue(struct request_queue *q)
blk_mq_exit_hw_queues(q, set, set->nr_hw_queues);
blk_mq_free_hw_queues(q, set);
- percpu_counter_destroy(&q->mq_usage_counter);
+ percpu_ref_exit(&q->mq_usage_counter);
free_percpu(q->queue_ctx);
kfree(q->queue_hw_ctx);
@@ -2037,8 +2028,7 @@ static int __init blk_mq_init(void)
{
blk_mq_cpu_init();
- /* Must be called after percpu_counter_hotcpu_callback() */
- hotcpu_notifier(blk_mq_queue_reinit_notify, -10);
+ hotcpu_notifier(blk_mq_queue_reinit_notify, 0);
return 0;
}
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 2efe26a..d5933dd 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -21,6 +21,7 @@
#include <linux/bsg.h>
#include <linux/smp.h>
#include <linux/rcupdate.h>
+#include <linux/percpu-refcount.h>
#include <asm/scatterlist.h>
@@ -484,7 +485,7 @@ struct request_queue {
#endif
struct rcu_head rcu_head;
wait_queue_head_t mq_freeze_wq;
- struct percpu_counter mq_usage_counter;
+ struct percpu_ref mq_usage_counter;
struct list_head all_q_node;
struct blk_mq_tag_set *tag_set;
--
1.9.3
blk_mq_stop_hw_queue() has the following two issues.
* BLK_MQ_S_STOPPED should be set before canceling the work items;
otherwise, a new instance may proceed beyond the STOPPED check
in between.
* cancel_delayed_work() doesn't do anything to a work instance that
is already executing. Use cancel_delayed_work_sync() to ensure that
the currently in-flight one is finished.
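To illustrate the first issue, the window left open by the current
ordering is roughly the following (simplified; the queue-run side is a
sketch of what blk_mq_run_hw_queue() does, not the exact code):

  stop:  cancel_delayed_work(&hctx->run_work);
  run:   if (test_bit(BLK_MQ_S_STOPPED, &hctx->state))  /* not yet set */
                 return;
  run:   /* ... schedules &hctx->run_work again via kblockd ... */
  stop:  set_bit(BLK_MQ_S_STOPPED, &hctx->state);
         /* run_work is pending or executing on a stopped queue */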
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
---
block/blk-mq.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e11f5f8..4787c3d 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -862,9 +862,9 @@ EXPORT_SYMBOL(blk_mq_run_queues);
void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
{
- cancel_delayed_work(&hctx->run_work);
- cancel_delayed_work(&hctx->delay_work);
set_bit(BLK_MQ_S_STOPPED, &hctx->state);
+ cancel_delayed_work_sync(&hctx->run_work);
+ cancel_delayed_work_sync(&hctx->delay_work);
}
EXPORT_SYMBOL(blk_mq_stop_hw_queue);
--
1.9.3
Keeping __blk_mq_drain_queue() as a separate function doesn't buy us
anything and it's gonna be further simplified. Let's flatten it into
its caller.
This patch doesn't make any functional change.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
---
block/blk-mq.c | 23 +++++++++--------------
1 file changed, 9 insertions(+), 14 deletions(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index f49b2ab..eda78c8 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -108,8 +108,16 @@ static void blk_mq_queue_exit(struct request_queue *q)
__percpu_counter_add(&q->mq_usage_counter, -1, 1000000);
}
-static void __blk_mq_drain_queue(struct request_queue *q)
+/*
+ * Guarantee no request is in use, so we can change any data structure of
+ * the queue afterward.
+ */
+void blk_mq_freeze_queue(struct request_queue *q)
{
+ spin_lock_irq(q->queue_lock);
+ q->mq_freeze_depth++;
+ spin_unlock_irq(q->queue_lock);
+
while (true) {
s64 count;
@@ -124,19 +132,6 @@ static void __blk_mq_drain_queue(struct request_queue *q)
}
}
-/*
- * Guarantee no request is in use, so we can change any data structure of
- * the queue afterward.
- */
-void blk_mq_freeze_queue(struct request_queue *q)
-{
- spin_lock_irq(q->queue_lock);
- q->mq_freeze_depth++;
- spin_unlock_irq(q->queue_lock);
-
- __blk_mq_drain_queue(q);
-}
-
static void blk_mq_unfreeze_queue(struct request_queue *q)
{
bool wake = false;
--
1.9.3
Currently, both blk_queue_bypass_start() and blk_mq_freeze_queue()
skip queue draining if bypass_depth was already above zero. The
assumption is that the one which bumped the bypass_depth should have
performed draining already; however, there's nothing which prevents a
new instance of bypassing/freezing from starting before the previous
one finishes draining. The current code may allow the later
bypassing/freezing instances to complete while there still are
in-flight requests which haven't finished draining.
Fix it by draining regardless of bypass_depth. We still skip draining
from blk_queue_bypass_start() while the queue is initializing to avoid
introducing excessive delays during boot. INIT_DONE setting is moved
above the initial blk_queue_bypass_end() so that bypassing attempts
can't slip in between.
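Roughly, the race being closed looks like this (illustrative timeline,
not actual code):

  task A                               task B
  ------                               ------
  bypass_depth 0 -> 1, drain == true
  __blk_drain_queue() still running
                                       bypass_depth 1 -> 2, drain == false
                                       blk_queue_bypass_start() returns with
                                       requests still in flight
  draining finally completes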
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
---
block/blk-core.c | 11 +++++++----
block/blk-mq.c | 7 ++-----
block/blk-sysfs.c | 2 +-
3 files changed, 10 insertions(+), 10 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index f6f6b9a..2375130 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -438,14 +438,17 @@ static void __blk_drain_queue(struct request_queue *q, bool drain_all)
*/
void blk_queue_bypass_start(struct request_queue *q)
{
- bool drain;
-
spin_lock_irq(q->queue_lock);
- drain = !q->bypass_depth++;
+ q->bypass_depth++;
queue_flag_set(QUEUE_FLAG_BYPASS, q);
spin_unlock_irq(q->queue_lock);
- if (drain) {
+ /*
+ * Queues start drained. Skip actual draining till init is
+ * complete. This avoids lenghty delays during queue init which
+ * can happen many times during boot.
+ */
+ if (blk_queue_init_done(q)) {
spin_lock_irq(q->queue_lock);
__blk_drain_queue(q, false);
spin_unlock_irq(q->queue_lock);
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 1c09d57..c0122a4 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -131,15 +131,12 @@ static void __blk_mq_drain_queue(struct request_queue *q)
*/
static void blk_mq_freeze_queue(struct request_queue *q)
{
- bool drain;
-
spin_lock_irq(q->queue_lock);
- drain = !q->bypass_depth++;
+ q->bypass_depth++;
queue_flag_set(QUEUE_FLAG_BYPASS, q);
spin_unlock_irq(q->queue_lock);
- if (drain)
- __blk_mq_drain_queue(q);
+ __blk_mq_drain_queue(q);
}
void blk_mq_drain_queue(struct request_queue *q)
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 23321fb..4db5abf 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -554,8 +554,8 @@ int blk_register_queue(struct gendisk *disk)
* Initialization must be complete by now. Finish the initial
* bypass from queue allocation.
*/
- blk_queue_bypass_end(q);
queue_flag_set_unlocked(QUEUE_FLAG_INIT_DONE, q);
+ blk_queue_bypass_end(q);
ret = blk_trace_init_sysfs(dev);
if (ret)
--
1.9.3
blk-mq uses a percpu_counter to keep track of how many usages are in
flight. The percpu_counter is drained while freezing to ensure that
no usage is left in-flight after freezing is complete.
blk_mq_queue_enter/exit() and blk_mq_[un]freeze_queue() implement this
per-cpu gating mechanism; unfortunately, it contains a subtle bug -
smp_wmb() in blk_mq_queue_enter() doesn't prevent the cpu from
fetching @q->bypass_depth before incrementing @q->mq_usage_counter.
If freezing happens in between, the caller can slip through and
freezing can complete while there are active users.
Use smp_mb() instead so that bypass_depth and mq_usage_counter
modifications and tests are properly interlocked.
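In other words, the counter increment (a store) on the enter side must
be ordered before the read of the bypass state (a load), and only a
full barrier provides that store-load ordering; smp_wmb() only orders
stores against later stores. A rough sketch of the pairing
(simplified, not the exact code):

  blk_mq_queue_enter()                    freezer
  --------------------                    -------
  __percpu_counter_add(&usage, 1, ...);   bypass_depth++; set BYPASS;
  smp_mb();  /* was smp_wmb() */          /* ...locking etc... */
  if (!blk_queue_bypass(q))               if (percpu_counter_sum(&usage) == 0)
          return 0;  /* proceed */                /* draining complete */

With only smp_wmb(), the bypass check may be satisfied before the
counter increment becomes visible, so the freezer can observe a zero
sum while this caller proceeds into a frozen queue.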
Signed-off-by: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Nicholas A. Bellinger <[email protected]>
---
block/blk-mq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4787c3d..1c09d57 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -81,7 +81,7 @@ static int blk_mq_queue_enter(struct request_queue *q)
int ret;
__percpu_counter_add(&q->mq_usage_counter, 1, 1000000);
- smp_wmb();
+ smp_mb();
/* we have problems freezing the queue if it's initializing */
if (!blk_queue_dying(q) &&
--
1.9.3
On 2014-06-18 08:21, Tejun Heo wrote:
> Hello, Jens.
>
> While reading through blk-mq, I spotted several issues and this
> patchset addresses them. The biggest one is how freezing is
> implemented. Coupling it with bypassing doesn't seem to work well and
> there's a subtle bug in the percpu switch implementation.
>
> I don't think open-coding this level of percpu logic is a good idea.
> It tends to be very error-prone and difficult to follow. The barrier
> problem is the only thing I spotted but it's very difficult to say
> that it's correct. percpu_ref already implements most of what's
> necessary to implement this sort of percpu switch and I added the
> missing bits in a recent patchset and converted the blk-mq freezing
> mechanism to use it in this patchset.
>
> It's far simpler and easier to wrap one's head around, and, as it's
> shared with other mechanisms including aio and cgroups, we should have
> better testing coverage and more eyes scrutinizing it.
>
> This patchset contains the following six patches.
>
> 0001-blk-mq-make-blk_mq_stop_hw_queue-reliably-block-queu.patch
> 0002-blk-mq-fix-a-memory-ordering-bug-in-blk_mq_queue_ent.patch
> 0003-block-blk-mq-draining-can-t-be-skipped-even-if-bypas.patch
> 0004-blk-mq-decouble-blk-mq-freezing-from-generic-bypassi.patch
> 0005-blk-mq-collapse-__blk_mq_drain_queue-into-blk_mq_fre.patch
> 0006-blk-mq-use-percpu_ref-for-mq-usage-count.patch
>
> 0001-0003 are fix patches that can be applied separately.
>
> 0004 decouples blk-mq freezing from queue bypassing.
>
> 0005-0006 replace the custom percpu switch with percpu_ref.
>
> This patchset is on top of
>
> percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
> +[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
>
> and available in the following git branch.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
>
> block/blk-core.c | 13 ++++---
> block/blk-mq.c | 90 ++++++++++++++++---------------------------------
> block/blk-mq.h | 2 -
> block/blk-sysfs.c | 2 -
> include/linux/blkdev.h | 4 +-
> 5 files changed, 44 insertions(+), 67 deletions(-)
Thanks Tejun, this looks pretty good. I was worried the tryget_live()
would be too expensive, but that looks not to be the case. I like the
cleanups to using a general mechanism. I'll run this through some
functional and performance testing.
--
Jens Axboe
On Wed, Jun 18, 2014 at 11:21 PM, Tejun Heo <[email protected]> wrote:
> blk_mq_stop_hw_queue() has the following two issues.
>
> * BLK_MQ_S_STOPPED should be set before canceling the work items;
> otherwise, a new instance may proceed beyond the STOPPED check
> in between.
>
> * cancel_delayed_work() doesn't do anything to a work instance that
> is already executing. Use cancel_delayed_work_sync() to ensure that
> the currently in-flight one is finished.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Jens Axboe <[email protected]>
> Cc: Nicholas A. Bellinger <[email protected]>
> ---
> block/blk-mq.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index e11f5f8..4787c3d 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -862,9 +862,9 @@ EXPORT_SYMBOL(blk_mq_run_queues);
>
> void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
> {
> - cancel_delayed_work(&hctx->run_work);
> - cancel_delayed_work(&hctx->delay_work);
> set_bit(BLK_MQ_S_STOPPED, &hctx->state);
> + cancel_delayed_work_sync(&hctx->run_work);
> + cancel_delayed_work_sync(&hctx->delay_work);
The function can be called from atomic context, as in virtio-blk.
Currently blk_mq_stop_hw_queue() is called when the queue is busy, so
it looks OK with cancel_delayed_work().
Thanks,
--
Ming Lei
Hello, Ming.
On Thu, Jun 19, 2014 at 06:10:28PM +0800, Ming Lei wrote:
> > void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
> > {
> > - cancel_delayed_work(&hctx->run_work);
> > - cancel_delayed_work(&hctx->delay_work);
> > set_bit(BLK_MQ_S_STOPPED, &hctx->state);
> > + cancel_delayed_work_sync(&hctx->run_work);
> > + cancel_delayed_work_sync(&hctx->delay_work);
>
> The function can be called from atomic context, like virtio-blk.
>
> Currently blk_mq_stop_hw_queue() is called in case of queue
> being busy, so looks it is ok with cancel_delayed_work().
Hmm... you're right. So the function is used to stop future
invocations of the queue processing function and thus the order
doesn't matter. We probably wanna document that. I don't understand
why virtblk_freeze() is invoking it tho. If freezing is complete,
there's no request in-flight on the driver side, why does it need to
invoke stop too?
Jens, please drop this one.
Thanks.
--
tejun
On Wed, Jun 18, 2014 at 09:19:26AM -0700, Jens Axboe wrote:
> > percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
> >+[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
> >
> >and available in the following git branch.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
> >
> > block/blk-core.c | 13 ++++---
> > block/blk-mq.c | 90 ++++++++++++++++---------------------------------
> > block/blk-mq.h | 2 -
> > block/blk-sysfs.c | 2 -
> > include/linux/blkdev.h | 4 +-
> > 5 files changed, 44 insertions(+), 67 deletions(-)
>
> Thanks Tejun, this looks pretty good. I was worried the tryget_live() would
> be too expensive, but that looks not to be the case. I like the cleanups to
> using a general mechanism. I'll run this through some functional and
> performance testing.
FYI, the percpu_ref changes needed by this patchset are applied to
percpu/for-3.17 branch which is stable and can be pulled into the
block tree.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-3.17
Thanks.
--
tejun
On 2014-06-28 06:13, Tejun Heo wrote:
> On Wed, Jun 18, 2014 at 09:19:26AM -0700, Jens Axboe wrote:
>>> percpu/for-3.17 6fbc07bbe2b5 ("percpu: invoke __verify_pcpu_ptr() from the generic part of accessors and operations")
>>> +[1] [PATCHSET percpu/for-3.17] percpu: implement percpu_ref_reinit()
>>>
>>> and available in the following git branch.
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git review-mq-percpu_ref
>>>
>>> block/blk-core.c | 13 ++++---
>>> block/blk-mq.c | 90 ++++++++++++++++---------------------------------
>>> block/blk-mq.h | 2 -
>>> block/blk-sysfs.c | 2 -
>>> include/linux/blkdev.h | 4 +-
>>> 5 files changed, 44 insertions(+), 67 deletions(-)
>>
>> Thanks Tejun, this looks pretty good. I was worried the tryget_live() would
>> be too expensive, but that looks not to be the case. I like the cleanups to
>> using a general mechanism. I'll run this through some functional and
>> performance testing.
>
> FYI, the percpu_ref changes needed by this patchset are applied to
> percpu/for-3.17 branch which is stable and can be pulled into the
> block tree.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu.git for-3.17
Great, thanks! Was just mulling over this yesterday. I'll get it pulled
in as a base.
--
Jens Axboe