Hi Jens
As we know, there is a risk of accessing stale requests when iterating
in-flight requests with tags->rqs[], and this has been discussed in the
following threads:
[1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
[2] https://marc.info/?l=linux-block&m=154526189023236&w=2
A typical scenario could be:
blk_mq_get_request                      blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                                          -> bt_for_each
                                            -> bt_iter
                                              -> rq = tags->rqs[]
                                              -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;
The root cause is that there is a window between setting the bit in the
tag sbitmap and setting tags->rqs[].
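In other words, the allocation side orders roughly as below (a simplified
sketch; names follow the call chain above, argument lists elided/approximate):

	tag = blk_mq_get_tag(data);	/* sbitmap bit set: bt_iter() can
					 * already find this tag */
	rq = tags->static_rqs[tag];
	/* <-- window: tags->rqs[tag] still points at the tag's previous user */
	blk_mq_rq_ctx_init(data, rq, op);	/* ... which finally does:
						 * tags->rqs[rq->tag] = rq */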
This patchset fixes the issue by iterating requests with
tags->static_rqs[] instead of tags->rqs[], which is changed dynamically.
Moreover, we try to get a non-zero q_usage_counter before accessing hctxs
and tags, and thus avoid racing with updating nr_hw_queues, switching the
io scheduler, and even queue cleanup, which are all done under a frozen
and drained queue.
The 1st patch gets rid of the now-useless synchronize_rcu in
__blk_mq_update_nr_hw_queues.
The 2nd patch modifies blk_mq_queue_tag_busy_iter to use tags->static_rqs[]
instead of tags->rqs[] to iterate the busy tags.
The 3rd ~ 7th patches convert blk_mq_tagset_busy_iter users to
blk_mq_queue_tag_busy_iter, which is safer; a usage sketch follows below.
The 8th patch gets rid of blk_mq_tagset_busy_iter.
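For reference, after this series a driver iterates the in-flight requests
of one request_queue roughly as below (a minimal sketch; my_cancel_rq is a
hypothetical callback whose prototype matches the new busy_iter_fn):

	static bool my_cancel_rq(struct request *rq, void *data, bool reserved)
	{
		/* handle one request; 'reserved' marks reserved tags */
		return true;	/* true: keep iterating, false: stop */
	}

	/* last parameter: true = in-flight requests only, false = all busy tags */
	blk_mq_queue_tag_busy_iter(q, my_cancel_rq, NULL, true);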
Jianchao Wang (8):
blk-mq: get rid of the synchronize_rcu in
blk-mq: change the method of iterating busy tags of a
blk-mq: use blk_mq_queue_tag_busy_iter in debugfs
mtip32xx: use blk_mq_queue_tag_busy_iter
nbd: use blk_mq_queue_tag_busy_iter
skd: use blk_mq_queue_tag_busy_iter
nvme: use blk_mq_queue_tag_busy_iter
blk-mq: remove blk_mq_tagset_busy_iter
diffstat:
block/blk-mq-debugfs.c | 4 +-
block/blk-mq-tag.c | 173 +++++++++++++++++++++++++-------------------------------------------------------------
block/blk-mq-tag.h | 2 -
block/blk-mq.c | 35 ++++++------------
drivers/block/mtip32xx/mtip32xx.c | 8 ++--
drivers/block/nbd.c | 2 +-
drivers/block/skd_main.c | 4 +-
drivers/nvme/host/core.c | 12 ++++++
drivers/nvme/host/fc.c | 12 +++---
drivers/nvme/host/nvme.h | 2 +
drivers/nvme/host/pci.c | 5 ++-
drivers/nvme/host/rdma.c | 6 +--
drivers/nvme/host/tcp.c | 5 ++-
drivers/nvme/target/loop.c | 6 +--
include/linux/blk-mq.h | 7 ++--
15 files changed, 105 insertions(+), 178 deletions(-)
Thanks
Jianchao
blk_mq_tagset_busy_iter is not safe in that it could get a stale request
in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/block/nbd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
index 7c9a949..9e7e828 100644
--- a/drivers/block/nbd.c
+++ b/drivers/block/nbd.c
@@ -747,7 +747,7 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
static void nbd_clear_que(struct nbd_device *nbd)
{
blk_mq_quiesce_queue(nbd->disk->queue);
- blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
+ blk_mq_queue_tag_busy_iter(nbd->disk->queue, nbd_clear_req, NULL, true);
blk_mq_unquiesce_queue(nbd->disk->queue);
dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
}
--
2.7.4
tags->rqs[] is not cleared when the driver tag is freed, and there
is a window between getting a driver tag and writing tags->rqs[], so
we may see a stale rq in tags->rqs[] which may have been freed, as in
the following case:
blk_mq_get_request                      blk_mq_queue_tag_busy_iter
  -> blk_mq_get_tag
                                          -> bt_for_each
                                            -> bt_iter
                                              -> rq = tags->rqs[]
                                              -> rq->q
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq;
To fix this, blk_mq_queue_tag_busy_iter is changed in this patch to
use tags->static_rqs[] instead of tags->rqs[]. We have to identify
whether an io scheduler is attached to decide between hctx->tags and
hctx->sched_tags, and we try to get a non-zero q_usage_counter before
that, so it is safe to access them. Add an 'inflight' parameter to
determine whether to iterate in-flight requests or just busy tags. A
correction here is that part_in_flight should count the busy tags
instead of the rqs that have got driver tags.
Moreover, get rid of the 'hctx' parameter of the iter function, as we
can get it from rq->mq_hctx, and export this interface for drivers.
Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq-tag.c | 76 +++++++++++++++++++++++++++++++++-----------------
block/blk-mq-tag.h | 2 --
block/blk-mq.c | 31 ++++++++------------
include/linux/blk-mq.h | 5 ++--
4 files changed, 64 insertions(+), 50 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index b792537..cdec2cd 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -216,26 +216,38 @@ struct bt_iter_data {
busy_iter_fn *fn;
void *data;
bool reserved;
+ bool inflight;
};
static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
{
struct bt_iter_data *iter_data = data;
struct blk_mq_hw_ctx *hctx = iter_data->hctx;
- struct blk_mq_tags *tags = hctx->tags;
bool reserved = iter_data->reserved;
+ struct blk_mq_tags *tags;
struct request *rq;
+ tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
+
if (!reserved)
bitnr += tags->nr_reserved_tags;
- rq = tags->rqs[bitnr];
+ /*
+ * tags->rqs[] is not cleared when the driver tag is freed, and
+ * there is a window between getting a driver tag and writing
+ * tags->rqs[], so we may see a stale rq in tags->rqs[] which may
+ * have been freed. Using static_rqs[] is safer.
+ */
+ rq = tags->static_rqs[bitnr];
/*
- * We can hit rq == NULL here, because the tagging functions
- * test and set the bit before assigning ->rqs[].
+ * There is a small window between getting the tag and blk_mq_rq_ctx_init,
+ * so rq->q and rq->mq_hctx may be different.
*/
- if (rq && rq->q == hctx->queue)
- return iter_data->fn(hctx, rq, iter_data->data, reserved);
+ if (rq && rq->q == hctx->queue &&
+ rq->mq_hctx == hctx &&
+ (!iter_data->inflight ||
+ blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
+ return iter_data->fn(rq, iter_data->data, reserved);
return true;
}
@@ -246,7 +258,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
* or the bitmap_tags member of struct blk_mq_tags.
* @fn: Pointer to the function that will be called for each request
* associated with @hctx that has been assigned a driver tag.
- * @fn will be called as follows: @fn(@hctx, rq, @data, @reserved)
+ * @fn will be called as follows: @fn(rq, @data, @reserved)
* where rq is a pointer to a request. Return true to continue
* iterating tags, false to stop.
* @data: Will be passed as third argument to @fn.
@@ -254,13 +266,14 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
* bitmap_tags member of struct blk_mq_tags.
*/
static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
- busy_iter_fn *fn, void *data, bool reserved)
+ busy_iter_fn *fn, void *data, bool reserved, bool inflight)
{
struct bt_iter_data iter_data = {
.hctx = hctx,
.fn = fn,
.data = data,
.reserved = reserved,
+ .inflight = inflight,
};
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
@@ -362,36 +375,33 @@ void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
/**
- * blk_mq_queue_tag_busy_iter - iterate over all requests with a driver tag
+ * blk_mq_queue_tag_busy_iter - iterate over all busy tags
* @q: Request queue to examine.
- * @fn: Pointer to the function that will be called for each request
- * on @q. @fn will be called as follows: @fn(hctx, rq, @priv,
- * reserved) where rq is a pointer to a request and hctx points
- * to the hardware queue associated with the request. 'reserved'
- * indicates whether or not @rq is a reserved request.
+ * @fn: Pointer to the function that will be called for each
+ * in-flight request issued by @q. @fn will be called as
+ * follows:
+ * @fn(rq, @priv, reserved)
+ * rq is a pointer to a request. 'reserved' indicates whether or
+ * not @rq is a reserved request.
* @priv: Will be passed as third argument to @fn.
- *
- * Note: if @q->tag_set is shared with other request queues then @fn will be
- * called for all requests on all queues that share that tag set and not only
- * for requests associated with @q.
*/
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
- void *priv)
+ void *priv, bool inflight)
{
struct blk_mq_hw_ctx *hctx;
int i;
/*
- * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
- * while the queue is frozen. So we can use q_usage_counter to avoid
- * racing with it.
+ * Get a reference on the queue unless it has gone to zero. We use
+ * this to avoid racing with code that modifies the hctxs after
+ * freezing and draining the queue, including updating nr_hw_queues,
+ * io scheduler switching and queue cleanup.
*/
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
queue_for_each_hw_ctx(q, hctx, i) {
- struct blk_mq_tags *tags = hctx->tags;
-
+ struct blk_mq_tags *tags;
/*
* If no software queues are currently mapped to this
* hardware queue, there's nothing to check
@@ -399,12 +409,26 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
if (!blk_mq_hw_queue_mapped(hctx))
continue;
+ tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
+
if (tags->nr_reserved_tags)
- bt_for_each(hctx, &tags->breserved_tags, fn, priv, true);
- bt_for_each(hctx, &tags->bitmap_tags, fn, priv, false);
+ bt_for_each(hctx, &tags->breserved_tags,
+ fn, priv, true, inflight);
+ bt_for_each(hctx, &tags->bitmap_tags,
+ fn, priv, false, inflight);
+ /*
+ * flush_rq represents the rq with REQ_PREFLUSH and REQ_FUA
+ * (if FUA is not supported by the device) to be issued to
+ * the device. So we need to consider it when iterating in-flight
+ * rqs, but need not count it when iterating busy tags.
+ */
+ if (inflight &&
+ blk_mq_rq_state(hctx->fq->flush_rq) == MQ_RQ_IN_FLIGHT)
+ fn(hctx->fq->flush_rq, priv, false);
}
blk_queue_exit(q);
}
+EXPORT_SYMBOL(blk_mq_queue_tag_busy_iter);
static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
bool round_robin, int node)
diff --git a/block/blk-mq-tag.h b/block/blk-mq-tag.h
index 61deab0..5af7ff9 100644
--- a/block/blk-mq-tag.h
+++ b/block/blk-mq-tag.h
@@ -33,8 +33,6 @@ extern int blk_mq_tag_update_depth(struct blk_mq_hw_ctx *hctx,
struct blk_mq_tags **tags,
unsigned int depth, bool can_grow);
extern void blk_mq_tag_wakeup_all(struct blk_mq_tags *tags, bool);
-void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
- void *priv);
static inline struct sbq_wait_state *bt_wait_ptr(struct sbitmap_queue *bt,
struct blk_mq_hw_ctx *hctx)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 07584a9..9144ec1 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -93,8 +93,7 @@ struct mq_inflight {
unsigned int *inflight;
};
-static bool blk_mq_check_inflight(struct blk_mq_hw_ctx *hctx,
- struct request *rq, void *priv,
+static bool blk_mq_check_inflight(struct request *rq, void *priv,
bool reserved)
{
struct mq_inflight *mi = priv;
@@ -114,13 +113,12 @@ unsigned int blk_mq_in_flight(struct request_queue *q, struct hd_struct *part)
struct mq_inflight mi = { .part = part, .inflight = inflight, };
inflight[0] = inflight[1] = 0;
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi);
+ blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight, &mi, false);
return inflight[0];
}
-static bool blk_mq_check_inflight_rw(struct blk_mq_hw_ctx *hctx,
- struct request *rq, void *priv,
+static bool blk_mq_check_inflight_rw(struct request *rq, void *priv,
bool reserved)
{
struct mq_inflight *mi = priv;
@@ -137,7 +135,7 @@ void blk_mq_in_flight_rw(struct request_queue *q, struct hd_struct *part,
struct mq_inflight mi = { .part = part, .inflight = inflight, };
inflight[0] = inflight[1] = 0;
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi);
+ blk_mq_queue_tag_busy_iter(q, blk_mq_check_inflight_rw, &mi, false);
}
void blk_freeze_queue_start(struct request_queue *q)
@@ -813,28 +811,22 @@ struct request *blk_mq_tag_to_rq(struct blk_mq_tags *tags, unsigned int tag)
}
EXPORT_SYMBOL(blk_mq_tag_to_rq);
-static bool blk_mq_rq_inflight(struct blk_mq_hw_ctx *hctx, struct request *rq,
- void *priv, bool reserved)
+static bool blk_mq_rq_inflight(struct request *rq, void *priv, bool reserved)
{
+ bool *busy = priv;
/*
* If we find a request that is inflight and the queue matches,
* we know the queue is busy. Return false to stop the iteration.
*/
- if (rq->state == MQ_RQ_IN_FLIGHT && rq->q == hctx->queue) {
- bool *busy = priv;
-
- *busy = true;
- return false;
- }
-
- return true;
+ *busy = true;
+ return false;
}
bool blk_mq_queue_inflight(struct request_queue *q)
{
bool busy = false;
- blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy);
+ blk_mq_queue_tag_busy_iter(q, blk_mq_rq_inflight, &busy, true);
return busy;
}
EXPORT_SYMBOL_GPL(blk_mq_queue_inflight);
@@ -874,8 +866,7 @@ static bool blk_mq_req_expired(struct request *rq, unsigned long *next)
return false;
}
-static bool blk_mq_check_expired(struct blk_mq_hw_ctx *hctx,
- struct request *rq, void *priv, bool reserved)
+static bool blk_mq_check_expired(struct request *rq, void *priv, bool reserved)
{
unsigned long *next = priv;
@@ -936,7 +927,7 @@ static void blk_mq_timeout_work(struct work_struct *work)
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
- blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next);
+ blk_mq_queue_tag_busy_iter(q, blk_mq_check_expired, &next, true);
if (next != 0) {
mod_timer(&q->timeout, next);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 0e030f5..d6beeb5 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -132,8 +132,7 @@ typedef int (init_request_fn)(struct blk_mq_tag_set *set, struct request *,
typedef void (exit_request_fn)(struct blk_mq_tag_set *set, struct request *,
unsigned int);
-typedef bool (busy_iter_fn)(struct blk_mq_hw_ctx *, struct request *, void *,
- bool);
+typedef bool (busy_iter_fn)(struct request *, void *, bool);
typedef bool (busy_tag_iter_fn)(struct request *, void *, bool);
typedef int (poll_fn)(struct blk_mq_hw_ctx *);
typedef int (map_queues_fn)(struct blk_mq_tag_set *set);
@@ -322,6 +321,8 @@ bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_run_hw_queues(struct request_queue *q, bool async);
void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
busy_tag_iter_fn *fn, void *priv);
+void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
+ void *priv, bool inflight);
void blk_mq_freeze_queue(struct request_queue *q);
void blk_mq_unfreeze_queue(struct request_queue *q);
void blk_freeze_queue_start(struct request_queue *q);
--
2.7.4
blk_mq_tagset_busy_iter is not safe in that it could get a stale request
in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/block/mtip32xx/mtip32xx.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/block/mtip32xx/mtip32xx.c b/drivers/block/mtip32xx/mtip32xx.c
index 88e8440..6e356f7 100644
--- a/drivers/block/mtip32xx/mtip32xx.c
+++ b/drivers/block/mtip32xx/mtip32xx.c
@@ -2770,13 +2770,13 @@ static int mtip_service_thread(void *data)
blk_mq_quiesce_queue(dd->queue);
- blk_mq_tagset_busy_iter(&dd->tags, mtip_queue_cmd, dd);
+ blk_mq_queue_tag_busy_iter(dd->queue, mtip_queue_cmd, dd, true);
set_bit(MTIP_PF_ISSUE_CMDS_BIT, &dd->port->flags);
if (mtip_device_reset(dd))
- blk_mq_tagset_busy_iter(&dd->tags,
- mtip_abort_cmd, dd);
+ blk_mq_queue_tag_busy_iter(dd->queue,
+ mtip_abort_cmd, dd, true);
clear_bit(MTIP_PF_TO_ACTIVE_BIT, &dd->port->flags);
@@ -3907,7 +3907,7 @@ static int mtip_block_remove(struct driver_data *dd)
blk_freeze_queue_start(dd->queue);
blk_mq_quiesce_queue(dd->queue);
- blk_mq_tagset_busy_iter(&dd->tags, mtip_no_dev_cleanup, dd);
+ blk_mq_queue_tag_busy_iter(dd->queue, mtip_no_dev_cleanup, dd, true);
blk_mq_unquiesce_queue(dd->queue);
/*
--
2.7.4
As nobody uses blk_mq_tagset_busy_iter, remove it.
Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq-tag.c | 95 --------------------------------------------------
include/linux/blk-mq.h | 2 --
2 files changed, 97 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index cdec2cd..b5a14ce 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -279,101 +279,6 @@ static void bt_for_each(struct blk_mq_hw_ctx *hctx, struct sbitmap_queue *bt,
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
}
-struct bt_tags_iter_data {
- struct blk_mq_tags *tags;
- busy_tag_iter_fn *fn;
- void *data;
- bool reserved;
-};
-
-static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
-{
- struct bt_tags_iter_data *iter_data = data;
- struct blk_mq_tags *tags = iter_data->tags;
- bool reserved = iter_data->reserved;
- struct request *rq;
-
- if (!reserved)
- bitnr += tags->nr_reserved_tags;
-
- /*
- * We can hit rq == NULL here, because the tagging functions
- * test and set the bit before assining ->rqs[].
- */
- rq = tags->rqs[bitnr];
- if (rq && blk_mq_request_started(rq))
- return iter_data->fn(rq, iter_data->data, reserved);
-
- return true;
-}
-
-/**
- * bt_tags_for_each - iterate over the requests in a tag map
- * @tags: Tag map to iterate over.
- * @bt: sbitmap to examine. This is either the breserved_tags member
- * or the bitmap_tags member of struct blk_mq_tags.
- * @fn: Pointer to the function that will be called for each started
- * request. @fn will be called as follows: @fn(rq, @data,
- * @reserved) where rq is a pointer to a request. Return true
- * to continue iterating tags, false to stop.
- * @data: Will be passed as second argument to @fn.
- * @reserved: Indicates whether @bt is the breserved_tags member or the
- * bitmap_tags member of struct blk_mq_tags.
- */
-static void bt_tags_for_each(struct blk_mq_tags *tags, struct sbitmap_queue *bt,
- busy_tag_iter_fn *fn, void *data, bool reserved)
-{
- struct bt_tags_iter_data iter_data = {
- .tags = tags,
- .fn = fn,
- .data = data,
- .reserved = reserved,
- };
-
- if (tags->rqs)
- sbitmap_for_each_set(&bt->sb, bt_tags_iter, &iter_data);
-}
-
-/**
- * blk_mq_all_tag_busy_iter - iterate over all started requests in a tag map
- * @tags: Tag map to iterate over.
- * @fn: Pointer to the function that will be called for each started
- * request. @fn will be called as follows: @fn(rq, @priv,
- * reserved) where rq is a pointer to a request. 'reserved'
- * indicates whether or not @rq is a reserved request. Return
- * true to continue iterating tags, false to stop.
- * @priv: Will be passed as second argument to @fn.
- */
-static void blk_mq_all_tag_busy_iter(struct blk_mq_tags *tags,
- busy_tag_iter_fn *fn, void *priv)
-{
- if (tags->nr_reserved_tags)
- bt_tags_for_each(tags, &tags->breserved_tags, fn, priv, true);
- bt_tags_for_each(tags, &tags->bitmap_tags, fn, priv, false);
-}
-
-/**
- * blk_mq_tagset_busy_iter - iterate over all started requests in a tag set
- * @tagset: Tag set to iterate over.
- * @fn: Pointer to the function that will be called for each started
- * request. @fn will be called as follows: @fn(rq, @priv,
- * reserved) where rq is a pointer to a request. 'reserved'
- * indicates whether or not @rq is a reserved request. Return
- * true to continue iterating tags, false to stop.
- * @priv: Will be passed as second argument to @fn.
- */
-void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
- busy_tag_iter_fn *fn, void *priv)
-{
- int i;
-
- for (i = 0; i < tagset->nr_hw_queues; i++) {
- if (tagset->tags && tagset->tags[i])
- blk_mq_all_tag_busy_iter(tagset->tags[i], fn, priv);
- }
-}
-EXPORT_SYMBOL(blk_mq_tagset_busy_iter);
-
/**
* blk_mq_queue_tag_busy_iter - iterate over all busy tags
* @q: Request queue to examine.
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index d6beeb5..3a048db 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -319,8 +319,6 @@ void blk_mq_unquiesce_queue(struct request_queue *q);
void blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, unsigned long msecs);
bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async);
void blk_mq_run_hw_queues(struct request_queue *q, bool async);
-void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
- busy_tag_iter_fn *fn, void *priv);
void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
void *priv, bool inflight);
void blk_mq_freeze_queue(struct request_queue *q);
--
2.7.4
In commit 530ca2c ("blk-mq: Allow blocking queue tag iter callbacks"),
we try to get a non-zero q_usage_counter to avoid accessing hctxs that
are being modified. So the synchronize_rcu is useless and should be
removed.
Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq-tag.c | 4 +---
block/blk-mq.c | 4 ----
2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 2089c6c..b792537 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -384,9 +384,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
/*
* __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
* while the queue is frozen. So we can use q_usage_counter to avoid
- * racing with it. __blk_mq_update_nr_hw_queues() uses
- * synchronize_rcu() to ensure this function left the critical section
- * below.
+ * racing with it.
*/
if (!percpu_ref_tryget(&q->q_usage_counter))
return;
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 9437a5e..07584a9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3208,10 +3208,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
list_for_each_entry(q, &set->tag_list, tag_set_list)
blk_mq_freeze_queue(q);
/*
- * Sync with blk_mq_queue_tag_busy_iter.
- */
- synchronize_rcu();
- /*
* Switch IO scheduler to 'none', cleaning up the data associated
* with the previous scheduler. We will switch back once we are done
* updating the new sw to hw queue mappings.
--
2.7.4
blk_mq_tagset_busy_iter is not safe in that it could get a stale
request in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here. A new
helper interface nvme_iterate_inflight_rqs is introduced to iterate
all of the namespaces under a ctrl.
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/nvme/host/core.c | 12 ++++++++++++
drivers/nvme/host/fc.c | 12 ++++++------
drivers/nvme/host/nvme.h | 2 ++
drivers/nvme/host/pci.c | 5 +++--
drivers/nvme/host/rdma.c | 6 +++---
drivers/nvme/host/tcp.c | 5 +++--
drivers/nvme/target/loop.c | 6 +++---
7 files changed, 32 insertions(+), 16 deletions(-)
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 6a9dd68..5760fad 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -3816,6 +3816,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_start_queues);
+void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
+ busy_iter_fn *fn, void *data)
+{
+ struct nvme_ns *ns;
+
+ down_read(&ctrl->namespaces_rwsem);
+ list_for_each_entry(ns, &ctrl->namespaces, list)
+ blk_mq_queue_tag_busy_iter(ns->queue, fn, data, true);
+ up_read(&ctrl->namespaces_rwsem);
+}
+EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
+
int __init nvme_core_init(void)
{
int result = -ENOMEM;
diff --git a/drivers/nvme/host/fc.c b/drivers/nvme/host/fc.c
index 89accc7..02b2de0 100644
--- a/drivers/nvme/host/fc.c
+++ b/drivers/nvme/host/fc.c
@@ -2368,7 +2368,7 @@ nvme_fc_complete_rq(struct request *rq)
/*
* This routine is used by the transport when it needs to find active
* io on a queue that is to be terminated. The transport uses
- * blk_mq_tagset_busy_itr() to find the busy requests, which then invoke
+ * blk_mq_queue_tag_busy_iter() to find the busy requests, which then invoke
* this routine to kill them on a 1 by 1 basis.
*
* As FC allocates FC exchange for each io, the transport must contact
@@ -2729,7 +2729,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
* If io queues are present, stop them and terminate all outstanding
* ios on them. As FC allocates FC exchange for each io, the
* transport must contact the LLDD to terminate the exchange,
- * thus releasing the FC exchange. We use blk_mq_tagset_busy_itr()
+ * thus releasing the FC exchange. We use blk_mq_queue_tag_busy_iter()
* to tell us what io's are busy and invoke a transport routine
* to kill them with the LLDD. After terminating the exchange
* the LLDD will call the transport's normal io done path, but it
@@ -2739,7 +2739,7 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
*/
if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
- blk_mq_tagset_busy_iter(&ctrl->tag_set,
+ nvme_iterate_inflight_rqs(&ctrl->ctrl,
nvme_fc_terminate_exchange, &ctrl->ctrl);
}
@@ -2757,12 +2757,12 @@ nvme_fc_delete_association(struct nvme_fc_ctrl *ctrl)
/*
* clean up the admin queue. Same thing as above.
- * use blk_mq_tagset_busy_itr() and the transport routine to
+ * use blk_mq_queue_tag_busy_iter() and the transport routine to
* terminate the exchanges.
*/
blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
- blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
- nvme_fc_terminate_exchange, &ctrl->ctrl);
+ blk_mq_queue_tag_busy_iter(ctrl->ctrl.admin_q,
+ nvme_fc_terminate_exchange, &ctrl->ctrl, true);
/* kill the aens as they are a separate path */
nvme_fc_abort_aen_ops(ctrl);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index c4a1bb4..6dc4ff8 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -441,6 +441,8 @@ void nvme_unfreeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze(struct nvme_ctrl *ctrl);
void nvme_wait_freeze_timeout(struct nvme_ctrl *ctrl, long timeout);
void nvme_start_freeze(struct nvme_ctrl *ctrl);
+void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
+ busy_iter_fn *fn, void *data);
#define NVME_QID_ANY -1
struct request *nvme_alloc_request(struct request_queue *q,
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 7fee665..0f0e361 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -2476,8 +2476,9 @@ static void nvme_dev_disable(struct nvme_dev *dev, bool shutdown)
nvme_suspend_queue(&dev->queues[0]);
nvme_pci_disable(dev);
- blk_mq_tagset_busy_iter(&dev->tagset, nvme_cancel_request, &dev->ctrl);
- blk_mq_tagset_busy_iter(&dev->admin_tagset, nvme_cancel_request, &dev->ctrl);
+ nvme_iterate_inflight_rqs(&dev->ctrl, nvme_cancel_request, &dev->ctrl);
+ blk_mq_queue_tag_busy_iter(dev->ctrl.admin_q,
+ nvme_cancel_request, &dev->ctrl, true);
/*
* The driver will not be starting up queues again if shutting down so
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 52abc3a..1be5688 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -922,8 +922,8 @@ static void nvme_rdma_teardown_admin_queue(struct nvme_rdma_ctrl *ctrl,
{
blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
nvme_rdma_stop_queue(&ctrl->queues[0]);
- blk_mq_tagset_busy_iter(&ctrl->admin_tag_set, nvme_cancel_request,
- &ctrl->ctrl);
+ blk_mq_queue_tag_busy_iter(ctrl->ctrl.admin_q, nvme_cancel_request,
+ &ctrl->ctrl, true);
blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
nvme_rdma_destroy_admin_queue(ctrl, remove);
}
@@ -934,7 +934,7 @@ static void nvme_rdma_teardown_io_queues(struct nvme_rdma_ctrl *ctrl,
if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
nvme_rdma_stop_io_queues(ctrl);
- blk_mq_tagset_busy_iter(&ctrl->tag_set, nvme_cancel_request,
+ nvme_iterate_inflight_rqs(&ctrl->ctrl, nvme_cancel_request,
&ctrl->ctrl);
if (remove)
nvme_start_queues(&ctrl->ctrl);
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 5f0a004..75eb069 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1686,7 +1686,8 @@ static void nvme_tcp_teardown_admin_queue(struct nvme_ctrl *ctrl,
{
blk_mq_quiesce_queue(ctrl->admin_q);
nvme_tcp_stop_queue(ctrl, 0);
- blk_mq_tagset_busy_iter(ctrl->admin_tagset, nvme_cancel_request, ctrl);
+ blk_mq_queue_tag_busy_iter(ctrl->admin_q,
+ nvme_cancel_request, ctrl, true);
blk_mq_unquiesce_queue(ctrl->admin_q);
nvme_tcp_destroy_admin_queue(ctrl, remove);
}
@@ -1698,7 +1699,7 @@ static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl,
return;
nvme_stop_queues(ctrl);
nvme_tcp_stop_io_queues(ctrl);
- blk_mq_tagset_busy_iter(ctrl->tagset, nvme_cancel_request, ctrl);
+ nvme_iterate_inflight_rqs(ctrl, nvme_cancel_request, ctrl);
if (remove)
nvme_start_queues(ctrl);
nvme_tcp_destroy_io_queues(ctrl, remove);
diff --git a/drivers/nvme/target/loop.c b/drivers/nvme/target/loop.c
index 4aac1b4..0d25083 100644
--- a/drivers/nvme/target/loop.c
+++ b/drivers/nvme/target/loop.c
@@ -429,7 +429,7 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
{
if (ctrl->ctrl.queue_count > 1) {
nvme_stop_queues(&ctrl->ctrl);
- blk_mq_tagset_busy_iter(&ctrl->tag_set,
+ nvme_iterate_inflight_rqs(&ctrl->ctrl,
nvme_cancel_request, &ctrl->ctrl);
nvme_loop_destroy_io_queues(ctrl);
}
@@ -438,8 +438,8 @@ static void nvme_loop_shutdown_ctrl(struct nvme_loop_ctrl *ctrl)
nvme_shutdown_ctrl(&ctrl->ctrl);
blk_mq_quiesce_queue(ctrl->ctrl.admin_q);
- blk_mq_tagset_busy_iter(&ctrl->admin_tag_set,
- nvme_cancel_request, &ctrl->ctrl);
+ blk_mq_queue_tag_busy_iter(ctrl->ctrl.admin_q,
+ nvme_cancel_request, &ctrl->ctrl, true);
blk_mq_unquiesce_queue(ctrl->ctrl.admin_q);
nvme_loop_destroy_admin_queue(ctrl);
}
--
2.7.4
blk_mq_tagset_busy_iter is not safe in that it could get a stale request
in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
Signed-off-by: Jianchao Wang <[email protected]>
---
drivers/block/skd_main.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
index ab893a7..60c34ff 100644
--- a/drivers/block/skd_main.c
+++ b/drivers/block/skd_main.c
@@ -395,7 +395,7 @@ static int skd_in_flight(struct skd_device *skdev)
{
int count = 0;
- blk_mq_tagset_busy_iter(&skdev->tag_set, skd_inc_in_flight, &count);
+ blk_mq_queue_tag_busy_iter(skdev->queue, skd_inc_in_flight, &count, true);
return count;
}
@@ -1916,7 +1916,7 @@ static bool skd_recover_request(struct request *req, void *data, bool reserved)
static void skd_recover_requests(struct skd_device *skdev)
{
- blk_mq_tagset_busy_iter(&skdev->tag_set, skd_recover_request, skdev);
+ blk_mq_queue_tag_busy_iter(skdev->queue, skd_recover_request, skdev, true);
}
static void skd_isr_msg_from_dev(struct skd_device *skdev)
--
2.7.4
blk_mq_tagset_busy_iter is not safe in that it could get a stale request
in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
Signed-off-by: Jianchao Wang <[email protected]>
---
block/blk-mq-debugfs.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 7921573..7c4e771 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -442,8 +442,8 @@ static int hctx_busy_show(void *data, struct seq_file *m)
struct blk_mq_hw_ctx *hctx = data;
struct show_busy_params params = { .m = m, .hctx = hctx };
- blk_mq_tagset_busy_iter(hctx->queue->tag_set, hctx_show_busy_rq,
- &params);
+ blk_mq_queue_tag_busy_iter(hctx->queue, hctx_show_busy_rq,
+ &params, true);
return 0;
}
--
2.7.4
On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> Hi Jens
>
> As we know, there is a risk of accessing stale requests when iterating
> in-flight requests with tags->rqs[], and this has been discussed in the
> following threads:
> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
I'd rather take one step back and figure out why we are iterating
the busy requests. There really shouldn't be any reason why a driver
is even doing that (vs some error handling helpers in the core
block code that can properly synchronize).
On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
>> Hi Jens
>>
>> As we know, there is a risk of accessing stale requests when iterating
>> in-flight requests with tags->rqs[], and this has been discussed in the
>> following threads:
>> [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> I'd rather take one step back and figure out why we are iterating
> the busy requests. There really shouldn't be any reason why a driver
> is even doing that (vs some error handling helpers in the core
> block code that can properly synchronize).
>
A typical scenario is blk_mq_in_flight:

blk_mq_get_request                      blk_mq_in_flight
  -> blk_mq_get_tag                       -> blk_mq_queue_tag_busy_iter
                                            -> bt_for_each
                                              -> bt_iter
                                                -> rq = tags->rqs[]
                                                -> rq->q //---> get a stale request
  -> blk_mq_rq_ctx_init
    -> data->hctx->tags->rqs[rq->tag] = rq
This stale request may be something that has been freed because the io
scheduler was detached or a q using a shared tagset is gone.

Also, blk_mq_timeout_work could use it to pick up an expired request,
and the driver would use it to requeue the in-flight requests when the
device is dead.

Compared with adding more synchronization, using static_rqs[] directly
may be simpler :)
Thanks
Jianchao
On Fri, Mar 15, 2019 at 10:20:20AM +0100, Christoph Hellwig wrote:
> On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > Hi Jens
> >
> > As we know, there is a risk of accessing stale requests when iterating
> > in-flight requests with tags->rqs[], and this has been discussed in the
> > following threads:
> > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> I'd rather take one step back and figure out why we are iterating
> the busy requests. There really shouldn't be any reason why a driver
> is even doing that (vs some error handling helpers in the core
> block code that can properly synchronize).
I use it in NBD to error out any pending requests for forced disconnects. I'd
be happy to use some other blessed interface, I'm not married to this one.
Thanks,
Josef
On Fri, Mar 15, 2019 at 01:57:38AM -0700, Jianchao Wang wrote:
> tags->rqs[] is not cleared when the driver tag is freed, and there
> is a window between getting a driver tag and writing tags->rqs[], so
> we may see a stale rq in tags->rqs[] which may have been freed, as in
> the following case:
> blk_mq_get_request                      blk_mq_queue_tag_busy_iter
>   -> blk_mq_get_tag
>                                           -> bt_for_each
>                                             -> bt_iter
>                                               -> rq = tags->rqs[]
>                                               -> rq->q
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq;
>
> To fix this, blk_mq_queue_tag_busy_iter is changed in this patch to
> use tags->static_rqs[] instead of tags->rqs[]. We have to identify
> whether an io scheduler is attached to decide between hctx->tags and
> hctx->sched_tags, and we try to get a non-zero q_usage_counter before
> that, so it is safe to access them. Add an 'inflight' parameter to
> determine whether to iterate in-flight requests or just busy tags. A
> correction here is that part_in_flight should count the busy tags
> instead of the rqs that have got driver tags.
>
> Moreover, get rid of the 'hctx' parameter of the iter function, as we
> can get it from rq->mq_hctx, and export this interface for drivers.
>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> block/blk-mq-tag.c | 76 +++++++++++++++++++++++++++++++++-----------------
> block/blk-mq-tag.h | 2 --
> block/blk-mq.c | 31 ++++++++------------
> include/linux/blk-mq.h | 5 ++--
> 4 files changed, 64 insertions(+), 50 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index b792537..cdec2cd 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -216,26 +216,38 @@ struct bt_iter_data {
> busy_iter_fn *fn;
> void *data;
> bool reserved;
> + bool inflight;
> };
>
> static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> {
> struct bt_iter_data *iter_data = data;
> struct blk_mq_hw_ctx *hctx = iter_data->hctx;
> - struct blk_mq_tags *tags = hctx->tags;
> bool reserved = iter_data->reserved;
> + struct blk_mq_tags *tags;
> struct request *rq;
>
> + tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
> +
> if (!reserved)
> bitnr += tags->nr_reserved_tags;
> - rq = tags->rqs[bitnr];
> + /*
> + * tags->rqs[] is not cleared when the driver tag is freed, and
> + * there is a window between getting a driver tag and writing
> + * tags->rqs[], so we may see a stale rq in tags->rqs[] which may
> + * have been freed. Using static_rqs[] is safer.
> + */
> + rq = tags->static_rqs[bitnr];
>
> /*
> - * We can hit rq == NULL here, because the tagging functions
> - * test and set the bit before assigning ->rqs[].
> + * There is a small window between getting the tag and blk_mq_rq_ctx_init,
> + * so rq->q and rq->mq_hctx may be different.
> */
> - if (rq && rq->q == hctx->queue)
> - return iter_data->fn(hctx, rq, iter_data->data, reserved);
> + if (rq && rq->q == hctx->queue &&
> + rq->mq_hctx == hctx &&
> + (!iter_data->inflight ||
> + blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> + return iter_data->fn(rq, iter_data->data, reserved);
There is still a window where the check may succeed, but the request is
being assigned to a completely different request_queue. The callback
then operates on a request it doesn't own.
Tag iteration from a driver can be safe only if the driver initiates a
freeze and quiesced all queues sharing the same tagset first. The nvme
driver does that, but I think there may need to be an additional grace
period to wait for the allocation's queue_enter to call queue_exit to
ensure the request is initialized through blk_mq_rq_ctx_init().
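A hypothetical interleaving of that window (tagset shared between q1 and
q2; names as in the patch):

	/*
	 * CPU0: bt_iter() on a q1 hctx        CPU1
	 *
	 * rq = tags->static_rqs[bitnr];
	 * rq->q == hctx->queue     -> true
	 * state == MQ_RQ_IN_FLIGHT -> true
	 *                                     rq completes, tag is freed
	 *                                     tag re-allocated for a q2 request
	 *                                     blk_mq_rq_ctx_init(): rq->q = q2
	 * iter_data->fn(rq, ...);
	 *	fn() now operates on a request owned by q2
	 */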
On Fri, 2019-03-15 at 17:44 +0800, jianchao.wang wrote:
> On 3/15/19 5:20 PM, Christoph Hellwig wrote:
> > On Fri, Mar 15, 2019 at 04:57:36PM +0800, Jianchao Wang wrote:
> > > Hi Jens
> > >
> > > As we know, there is a risk of accessing stale requests when iterating
> > > in-flight requests with tags->rqs[], and this has been discussed in the
> > > following threads:
> > > [1] https://marc.info/?l=linux-scsi&m=154511693912752&w=2
> > > [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
> >
> > I'd rather take one step back and figure out why we are iterating
> > the busy requests. There really shouldn't be any reason why a driver
> > is even doing that (vs some error handling helpers in the core
> > block code that can properly synchronize).
> >
>
> A typical scenario is blk_mq_in_flight:
>
> blk_mq_get_request                      blk_mq_in_flight
>   -> blk_mq_get_tag                       -> blk_mq_queue_tag_busy_iter
>                                             -> bt_for_each
>                                               -> bt_iter
>                                                 -> rq = tags->rqs[]
>                                                 -> rq->q //---> get a stale request
>   -> blk_mq_rq_ctx_init
>     -> data->hctx->tags->rqs[rq->tag] = rq
>
> This stale request may be something that has been freed because the io
> scheduler was detached or a q using a shared tagset is gone.
>
> Also, blk_mq_timeout_work could use it to pick up an expired request,
> and the driver would use it to requeue the in-flight requests when the
> device is dead.
>
> Compared with adding more synchronization, using static_rqs[] directly
> may be simpler :)
Hi Jianchao,
Although I appreciate your work, I agree with Christoph that we should avoid races
like this rather than modify the block layer to make sure that such races are
handled safely.
Thanks,
Bart.
On 3/15/2019 1:57 AM, Jianchao Wang wrote:
> blk_mq_tagset_busy_iter is not safe in that it could get a stale
> request in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here. A new
> helper interface nvme_iterate_inflight_rqs is introduced to iterate
> all of the namespaces under a ctrl.
>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> drivers/nvme/host/core.c | 12 ++++++++++++
> drivers/nvme/host/fc.c | 12 ++++++------
> drivers/nvme/host/nvme.h | 2 ++
> drivers/nvme/host/pci.c | 5 +++--
> drivers/nvme/host/rdma.c | 6 +++---
> drivers/nvme/host/tcp.c | 5 +++--
> drivers/nvme/target/loop.c | 6 +++---
> 7 files changed, 32 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 6a9dd68..5760fad 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3816,6 +3816,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
> }
> EXPORT_SYMBOL_GPL(nvme_start_queues);
>
> +void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
> + busy_iter_fn *fn, void *data)
> +{
> + struct nvme_ns *ns;
> +
> + down_read(&ctrl->namespaces_rwsem);
> + list_for_each_entry(ns, &ctrl->namespaces, list)
> + blk_mq_queue_tag_busy_iter(ns->queue, fn, data, true);
> + up_read(&ctrl->namespaces_rwsem);
> +}
> +EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
Hmm... so this aborts ios for namespaces. How about ios that aren't for
a namespace but rather for the controller itself, for example, anything
on the admin queue? Rather critical for those ios to be terminated as well.
-- james
On 3/15/2019 9:33 AM, James Smart wrote:
> On 3/15/2019 1:57 AM, Jianchao Wang wrote:
>> blk_mq_tagset_busy_iter is not safe in that it could get a stale
>> request in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here. A new
>> helper interface nvme_iterate_inflight_rqs is introduced to iterate
>> all of the namespaces under a ctrl.
>>
>> Signed-off-by: Jianchao Wang <[email protected]>
>> ---
>> drivers/nvme/host/core.c | 12 ++++++++++++
>> drivers/nvme/host/fc.c | 12 ++++++------
>> drivers/nvme/host/nvme.h | 2 ++
>> drivers/nvme/host/pci.c | 5 +++--
>> drivers/nvme/host/rdma.c | 6 +++---
>> drivers/nvme/host/tcp.c | 5 +++--
>> drivers/nvme/target/loop.c | 6 +++---
>> 7 files changed, 32 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 6a9dd68..5760fad 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3816,6 +3816,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>> }
>> EXPORT_SYMBOL_GPL(nvme_start_queues);
>> +void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
>> + busy_iter_fn *fn, void *data)
>> +{
>> + struct nvme_ns *ns;
>> +
>> + down_read(&ctrl->namespaces_rwsem);
>> + list_for_each_entry(ns, &ctrl->namespaces, list)
>> + blk_mq_queue_tag_busy_iter(ns->queue, fn, data, true);
>> + up_read(&ctrl->namespaces_rwsem);
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
>
> Hmm... so this aborts ios for namespaces. How about ios that aren't
> for a namespace but rather for the controller itself, for example,
> anything on the admin queue? Rather critical for those ios to be
> terminated as well.
NVM - didn't look at which tag set. But it does highlight that this
doesn't cover anything issued by the core/transport on the io queues.
-- james
On 3/15/19 5:39 PM, James Smart wrote:
>
>
> On 3/15/2019 9:33 AM, James Smart wrote:
>> On 3/15/2019 1:57 AM, Jianchao Wang wrote:
>>> blk_mq_tagset_busy_iter is not safe in that it could get a stale
>>> request in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here. A new
>>> helper interface nvme_iterate_inflight_rqs is introduced to iterate
>>> all of the namespaces under a ctrl.
>>>
>>> Signed-off-by: Jianchao Wang <[email protected]>
>>> ---
>>> drivers/nvme/host/core.c | 12 ++++++++++++
>>> drivers/nvme/host/fc.c | 12 ++++++------
>>> drivers/nvme/host/nvme.h | 2 ++
>>> drivers/nvme/host/pci.c | 5 +++--
>>> drivers/nvme/host/rdma.c | 6 +++---
>>> drivers/nvme/host/tcp.c | 5 +++--
>>> drivers/nvme/target/loop.c | 6 +++---
>>> 7 files changed, 32 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>>> index 6a9dd68..5760fad 100644
>>> --- a/drivers/nvme/host/core.c
>>> +++ b/drivers/nvme/host/core.c
>>> @@ -3816,6 +3816,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>>> }
>>> EXPORT_SYMBOL_GPL(nvme_start_queues);
>>> +void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
>>> + busy_iter_fn *fn, void *data)
>>> +{
>>> + struct nvme_ns *ns;
>>> +
>>> + down_read(&ctrl->namespaces_rwsem);
>>> + list_for_each_entry(ns, &ctrl->namespaces, list)
>>> + blk_mq_queue_tag_busy_iter(ns->queue, fn, data, true);
>>> + up_read(&ctrl->namespaces_rwsem);
>>> +}
>>> +EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
>>
>> Hmm... so this aborts ios for namespaces. How about ios that aren't
>> for a namespace but rather for the controller itself, for example,
>> anything on the admin queue? Rather critical for those ios to be
>> terminated as well.
> NVM - didn't look at which tag set. But it does highlight that this
> doesn't cover anything issued by the core/transport on the io queues.
>
Actually, I would consider that an error.
_All_ commands on io queues (internal or block-layer generated) should
be tracked by the sbitmap and hence the tag iterator.
That's what we have the 'private' tags for, and why the iter callback
has this ominous 'bool' argument...
Cheers,
Hannes
On Fri, Mar 15, 2019 at 5:07 PM Jianchao Wang
<[email protected]> wrote:
>
> In commit 530ca2c ("blk-mq: Allow blocking queue tag iter callbacks"),
> we try to get a non-zero q_usage_counter to avoid accessing hctxs that
> are being modified. So the synchronize_rcu is useless and should be
> removed.
> removed.
>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> block/blk-mq-tag.c | 4 +---
> block/blk-mq.c | 4 ----
> 2 files changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> index 2089c6c..b792537 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -384,9 +384,7 @@ void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
> /*
> * __blk_mq_update_nr_hw_queues() updates nr_hw_queues and queue_hw_ctx
> * while the queue is frozen. So we can use q_usage_counter to avoid
> - * racing with it. __blk_mq_update_nr_hw_queues() uses
> - * synchronize_rcu() to ensure this function left the critical section
> - * below.
> + * racing with it.
> */
> if (!percpu_ref_tryget(&q->q_usage_counter))
> return;
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 9437a5e..07584a9 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -3208,10 +3208,6 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
> list_for_each_entry(q, &set->tag_list, tag_set_list)
> blk_mq_freeze_queue(q);
> /*
> - * Sync with blk_mq_queue_tag_busy_iter.
> - */
> - synchronize_rcu();
> - /*
> * Switch IO scheduler to 'none', cleaning up the data associated
> * with the previous scheduler. We will switch back once we are done
> * updating the new sw to hw queue mappings.
> --
> 2.7.4
>
Reviewed-by: Ming Lei <[email protected]>
Thanks,
Ming Lei
On Sat, Mar 16, 2019 at 12:15 AM Keith Busch <[email protected]> wrote:
>
> On Fri, Mar 15, 2019 at 01:57:38AM -0700, Jianchao Wang wrote:
> > tags->rqs[] is not cleared when the driver tag is freed, and there
> > is a window between getting a driver tag and writing tags->rqs[], so
> > we may see a stale rq in tags->rqs[] which may have been freed, as in
> > the following case:
> > blk_mq_get_request                      blk_mq_queue_tag_busy_iter
> >   -> blk_mq_get_tag
> >                                           -> bt_for_each
> >                                             -> bt_iter
> >                                               -> rq = tags->rqs[]
> >                                               -> rq->q
> >   -> blk_mq_rq_ctx_init
> >     -> data->hctx->tags->rqs[rq->tag] = rq;
> >
> > To fix this, blk_mq_queue_tag_busy_iter is changed in this patch to
> > use tags->static_rqs[] instead of tags->rqs[]. We have to identify
> > whether an io scheduler is attached to decide between hctx->tags and
> > hctx->sched_tags, and we try to get a non-zero q_usage_counter before
> > that, so it is safe to access them. Add an 'inflight' parameter to
> > determine whether to iterate in-flight requests or just busy tags. A
> > correction here is that part_in_flight should count the busy tags
> > instead of the rqs that have got driver tags.
> >
> > Moreover, get rid of the 'hctx' parameter of the iter function, as we
> > can get it from rq->mq_hctx, and export this interface for drivers.
> >
> > Signed-off-by: Jianchao Wang <[email protected]>
> > ---
> > block/blk-mq-tag.c | 76 +++++++++++++++++++++++++++++++++-----------------
> > block/blk-mq-tag.h | 2 --
> > block/blk-mq.c | 31 ++++++++------------
> > include/linux/blk-mq.h | 5 ++--
> > 4 files changed, 64 insertions(+), 50 deletions(-)
> >
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
> > index b792537..cdec2cd 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -216,26 +216,38 @@ struct bt_iter_data {
> > busy_iter_fn *fn;
> > void *data;
> > bool reserved;
> > + bool inflight;
> > };
> >
> > static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
> > {
> > struct bt_iter_data *iter_data = data;
> > struct blk_mq_hw_ctx *hctx = iter_data->hctx;
> > - struct blk_mq_tags *tags = hctx->tags;
> > bool reserved = iter_data->reserved;
> > + struct blk_mq_tags *tags;
> > struct request *rq;
> >
> > + tags = hctx->sched_tags ? hctx->sched_tags : hctx->tags;
> > +
> > if (!reserved)
> > bitnr += tags->nr_reserved_tags;
> > - rq = tags->rqs[bitnr];
> > + /*
> > + * tags->rqs[] is not cleared when the driver tag is freed, and
> > + * there is a window between getting a driver tag and writing
> > + * tags->rqs[], so we may see a stale rq in tags->rqs[] which may
> > + * have been freed. Using static_rqs[] is safer.
> > + */
> > + rq = tags->static_rqs[bitnr];
> >
> > /*
> > - * We can hit rq == NULL here, because the tagging functions
> > - * test and set the bit before assigning ->rqs[].
> > + * There is a small window between getting the tag and blk_mq_rq_ctx_init,
> > + * so rq->q and rq->mq_hctx may be different.
> > */
> > - if (rq && rq->q == hctx->queue)
> > - return iter_data->fn(hctx, rq, iter_data->data, reserved);
> > + if (rq && rq->q == hctx->queue &&
> > + rq->mq_hctx == hctx &&
> > + (!iter_data->inflight ||
> > + blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> > + return iter_data->fn(rq, iter_data->data, reserved);
>
> There is still a window where the check may succeed, but the request is
> being assigned to a completely different request_queue. The callback
> then operates on a request it doesn't own.
The situation above is only related to the 'none' io scheduler.
However, this patch doesn't change the behavior for none, given that
.rqs[tag] is the same as .static_rqs[tag], so I guess your concern may
not be related to this patch.
>
> Tag iteration from a driver can be safe only if the driver initiates a
> freeze and quiesced all queues sharing the same tagset first. The nvme
> driver does that, but I think there may need to be an additional grace
> period to wait for the allocation's queue_enter to call queue_exit to
> ensure the request is initialized through blk_mq_rq_ctx_init().
blk_mq_queue_tag_busy_iter() is used by blk-mq core code only for
io accounting or timeouts on the specified request_queue, while NVMe's
uses actually cover the whole set of driver tags.
Thanks,
Ming Lei
Hi Keith
On 3/16/19 12:16 AM, Keith Busch wrote:
> There is still a window where the check may succeed, but the request is
> being assigned to a completely different request_queue. The callback
> then operates on a request it doesn't own.
>
> Tag iteration from a driver can be safe only if the driver initiates a
> freeze and quiesced all queues sharing the same tagset first. The nvme
> driver does that, but I think there may need to be an additional grace
> period to wait for the allocation's queue_enter to call queue_exit to
> ensure the request is initialized through blk_mq_rq_ctx_init().
This patch is to avoid the use-after-free case described in the comment.
The helper interface is used in the following cases:
1. in-flight request accounting
   this case should not require so much accuracy
2. timeout check
   an extra reference is held there to avoid the request being recycled
3. use by a driver, where it should be enough to quiesce the
   request_queue first, as sketched below
   the driver usually wants to handle the in-flight ones; quiescing the
   request_queue ensures all tasks have quit the hctx_lock and no one
   issues requests any more.
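A minimal sketch of that quiesce-then-iterate pattern (cancel_fn is a
hypothetical callback; this mirrors what the nbd patch in this series does):

	blk_mq_quiesce_queue(q);	/* no new requests enter ->queue_rq() */
	blk_mq_queue_tag_busy_iter(q, cancel_fn, NULL, true);
	blk_mq_unquiesce_queue(q);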
Thanks
Jianchao
Hi Bart
On 3/16/19 12:19 AM, Bart Van Assche wrote:
>> This stale request may be something that has been freed because the io
>> scheduler was detached or a q using a shared tagset is gone.
>>
>> Also, blk_mq_timeout_work could use it to pick up an expired request,
>> and the driver would use it to requeue the in-flight requests when the
>> device is dead.
>>
>> Compared with adding more synchronization, using static_rqs[] directly
>> may be simpler :)
> Hi Jianchao,
>
> Although I appreciate your work, I agree with Christoph that we should avoid races
> like this rather than modify the block layer to make sure that such races are
> handled safely.
The root cause here is that there is a window between setting the tag
sbitmap bit and setting tags->rqs[], and we don't clear tags->rqs[] in
the free tags path. So when we iterate the busy tags, we could see
stale requests in tags->rqs[], and these stale requests may have been
freed.
It looks difficult to close the window above, so we previously tried
to clear tags->rqs[] in two ways:
1. clear the tags->rqs[] in the request free path
Jens doesn't like it
https://marc.info/?l=linux-block&m=154515671524877&w=2
"
It's an extra store, and it's a store to an area that's then now shared
between issue and completion. Those are never a good idea. Besides, it's
the kind of issue you solve in the SLOW path, not in the fast path. Since
that's doable, it would be silly to do it for every IO.
"
2. clear the associated slots in tags->rqs[] when blk_mq_free_rqs
https://marc.info/?l=linux-block&m=154534605914798&w=2
+ rcu_read_lock();
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+ rcu_read_unlock();
The busy_iter_fn could sleep:
blk_mq_check_expired
  -> blk_mq_rq_timed_out
    -> q->mq_ops->timeout
       nvme_timeout
         -> nvme_dev_disable
           -> mutex_lock dev->shutdown_lock
Since it is not so easy to fix this on the tags->rqs[] side, why not try
to use tags->static_rqs[]? Then we would never care about the stale
requests any more.
Thanks
Jianchao
Hi James
On 3/16/19 12:33 AM, James Smart wrote:
> On 3/15/2019 1:57 AM, Jianchao Wang wrote:
>> blk_mq_tagset_busy_iter is not safe in that it could get a stale
>> request in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here. A new
>> helper interface nvme_iterate_inflight_rqs is introduced to iterate
>> all of the namespaces under a ctrl.
>>
>> Signed-off-by: Jianchao Wang <[email protected]>
>> ---
>> drivers/nvme/host/core.c | 12 ++++++++++++
>> drivers/nvme/host/fc.c | 12 ++++++------
>> drivers/nvme/host/nvme.h | 2 ++
>> drivers/nvme/host/pci.c | 5 +++--
>> drivers/nvme/host/rdma.c | 6 +++---
>> drivers/nvme/host/tcp.c | 5 +++--
>> drivers/nvme/target/loop.c | 6 +++---
>> 7 files changed, 32 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index 6a9dd68..5760fad 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -3816,6 +3816,18 @@ void nvme_start_queues(struct nvme_ctrl *ctrl)
>> }
>> EXPORT_SYMBOL_GPL(nvme_start_queues);
>> +void nvme_iterate_inflight_rqs(struct nvme_ctrl *ctrl,
>> + busy_iter_fn *fn, void *data)
>> +{
>> + struct nvme_ns *ns;
>> +
>> + down_read(&ctrl->namespaces_rwsem);
>> + list_for_each_entry(ns, &ctrl->namespaces, list)
>> + blk_mq_queue_tag_busy_iter(ns->queue, fn, data, true);
>> + up_read(&ctrl->namespaces_rwsem);
>> +}
>> +EXPORT_SYMBOL_GPL(nvme_iterate_inflight_rqs);
>
> Hmm... so this aborts ios for namespaces. How about ios that aren't for a namespace but rather for the controller itself, for example, anything on the admin queue? Rather critical for those ios to be terminated as well.
>
The blk_mq_tagset_busy_iter which is currently used by nvme iterates
the in-flight ios of the whole controller. But since
blk_mq_tagset_busy_iter is not safe, as discussed in patch 0/8, change
it to blk_mq_queue_tag_busy_iter, which works on a single
namespace/request_queue.
Thanks
Jianchao
On Sat, Mar 16, 2019 at 11:50:27PM -0700, Ming Lei wrote:
> On Sat, Mar 16, 2019 at 12:15 AM Keith Busch <[email protected]> wrote:
> > On Fri, Mar 15, 2019 at 01:57:38AM -0700, Jianchao Wang wrote:
> > > + if (rq && rq->q == hctx->queue &&
> > > + rq->mq_hctx == hctx &&
> > > + (!iter_data->inflight ||
> > > + blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT))
> > > + return iter_data->fn(rq, iter_data->data, reserved);
> >
> > There is still a window where the check may succeed, but the request is
> > being assigned to a completely different request_queue. The callback
> > then operates on a request it doesn't own.
>
> The situation above is only related to the 'none' io scheduler.
>
> However, this patch doesn't change behavior for none, given .rqs[tag] is the same as
> .static_rqs[tag], so I guess your concern may not be related with this patch.
Right, it's not new, but queue iteration wasn't exported for drivers
before this.
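The window Keith describes can be pictured with the same kind of call chains used
earlier in this thread (illustrative; assumes no io scheduler and a shared tagset,
so static_rqs[tag] can be recycled for another queue mid-iteration):

blk_mq_queue_tag_busy_iter          blk_mq_get_request (other queue, shared tagset)
-> bt_iter
-> rq = tags->static_rqs[]
-> rq->q == hctx->queue (old value, check passes)
                                    -> blk_mq_rq_ctx_init
                                    -> rq->q = other queue
-> iter_data->fn(rq) (the callback now operates on a request it doesn't own)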
On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
> blk_mq_tagset_busy_iter is not safe in that it could get a stale request
> in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> drivers/block/nbd.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
> index 7c9a949..9e7e828 100644
> --- a/drivers/block/nbd.c
> +++ b/drivers/block/nbd.c
> @@ -747,7 +747,7 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
> static void nbd_clear_que(struct nbd_device *nbd)
> {
> blk_mq_quiesce_queue(nbd->disk->queue);
> - blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
> + blk_mq_queue_tag_busy_iter(nbd->disk->queue, nbd_clear_req, NULL, true);
> blk_mq_unquiesce_queue(nbd->disk->queue);
> dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
> }
Hi Jianchao,
The nbd driver calls nbd_clear_que() after having called sock_shutdown(). So
what makes you think that it's not safe to call blk_mq_tagset_busy_iter()
from nbd_clear_que()?
Thanks,
Bart.
On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
> blk_mq_tagset_busy_iter is not safe in that it could get a stale request
> in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
>
> Signed-off-by: Jianchao Wang <[email protected]>
> ---
> drivers/block/skd_main.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
> index ab893a7..60c34ff 100644
> --- a/drivers/block/skd_main.c
> +++ b/drivers/block/skd_main.c
> @@ -395,7 +395,7 @@ static int skd_in_flight(struct skd_device *skdev)
> {
> int count = 0;
>
> - blk_mq_tagset_busy_iter(&skdev->tag_set, skd_inc_in_flight, &count);
> + blk_mq_queue_tag_busy_iter(skdev->queue, skd_inc_in_flight, &count, true);
>
> return count;
> }
Hi Jianchao,
If you have a look at the skd_in_flight() callers you will see that the above
change is not necessary.
> @@ -1916,7 +1916,7 @@ static bool skd_recover_request(struct request *req, void *data, bool reserved)
>
> static void skd_recover_requests(struct skd_device *skdev)
> {
> - blk_mq_tagset_busy_iter(&skdev->tag_set, skd_recover_request, skdev);
> + blk_mq_queue_tag_busy_iter(skdev->queue, skd_recover_request, skdev, true);
> }
Same comment here. If you have a look at the callers of this function you will
see that this change is not necessary.
Thanks,
Bart.
On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
Hi Jianchao,
That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
I think that issue can be fixed in another way than modifying all code that
iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
used by the tag iteration functions into kfree_rcu() or call_rcu() calls.
Thanks,
Bart.
Hi Bart
Thanks for your kind and detailed comment on this.
On 3/19/19 1:28 AM, Bart Van Assche wrote:
> On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
>> [2] https://marc.info/?l=linux-block&m=154526189023236&w=2
>
> Hi Jianchao,
>
> That is a reference to the "BUG: KASAN: use-after-free in bt_iter" issue.
> I think that issue can be fixed in another way than modifying all code that
> iterates over tags, namely by adding an rcu_read_lock() / rcu_read_unlock()
> pair in bt_for_each() and bt_tags_for_each() and by changing the calls in
> blk_mq_free_rqs() and blk_free_flush_queue() that free the data structures
> used by the tag iteration functions into kfree_rcu() or call_rcu() calls.
Do you mean this patch from Jens ?
https://marc.info/?l=linux-block&m=154534605914798&w=2
+ rcu_read_lock();
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
+ rcu_read_unlock();
The busy_iter_fn could sleep for nvme
blk_mq_check_expired
-> blk_mq_rq_timed_out
-> q->mq_ops->timeout
nvme_timeout
-> nvme_dev_disable
-> mutex_lock dev->shutdown_lock
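Spelled out, the problem with that approach is that the callback may block inside
an RCU read-side critical section (annotated version of the snippet above):

rcu_read_lock();
sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
/* bt_iter -> busy_iter_fn may reach nvme_timeout -> nvme_dev_disable
 * -> mutex_lock(&dev->shutdown_lock), i.e. it may sleep, and sleeping
 * is illegal under rcu_read_lock(). */
rcu_read_unlock();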
Thanks
Jianchao
Hi Bart
Thanks so much for your comment on this.
After making blk_mq_queue_tag_busy_iter safer in patch 2, the patches that
replace blk_mq_tagset_busy_iter with blk_mq_queue_tag_busy_iter in drivers
are an effort to unify the tag iteration interface so that we can finally
remove blk_mq_tagset_busy_iter, which is not safe.
blk_mq_queue_tag_busy_iter will try to get a non-zero q_usage_counter
on the request_queue before it tries to access the tags, so it avoids
racing with the procedures that need to freeze and drain the queue, such as
updating nr_hw_queues, switching the io scheduler and even queue cleanup.
It also iterates static_rqs and so needn't worry about the stale request issue.
So it is a safer interface. For the shared tagset case the driver does need to
loop over every request_queue itself with blk_mq_queue_tag_busy_iter, but all
of that work is in the error handler path, so it should not be a big deal.
The q_usage_counter guard looks roughly like the sketch below.
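A sketch of that guard (the per-hctx iteration body is elided;
percpu_ref_tryget/blk_queue_exit are the existing q_usage_counter helpers):

void blk_mq_queue_tag_busy_iter(struct request_queue *q, busy_iter_fn *fn,
		void *priv, bool inflight)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	/*
	 * Queue freezing (nr_hw_queues update, io scheduler switch,
	 * queue cleanup) waits for q_usage_counter to drop to zero,
	 * so holding a reference keeps hctxs and tags stable here.
	 */
	if (!percpu_ref_tryget(&q->q_usage_counter))
		return;

	queue_for_each_hw_ctx(q, hctx, i) {
		/* ... walk this hctx's tag bitmaps with bt_for_each ... */
	}

	blk_queue_exit(q);
}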
Thanks
Jianchao
On 3/19/19 1:20 AM, Bart Van Assche wrote:
> On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
>> blk_mq_tagset_busy_iter is not safe in that it could get a stale request
>> in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
>>
>> Signed-off-by: Jianchao Wang <[email protected]>
>> ---
>> drivers/block/skd_main.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/skd_main.c b/drivers/block/skd_main.c
>> index ab893a7..60c34ff 100644
>> --- a/drivers/block/skd_main.c
>> +++ b/drivers/block/skd_main.c
>> @@ -395,7 +395,7 @@ static int skd_in_flight(struct skd_device *skdev)
>> {
>> int count = 0;
>>
>> - blk_mq_tagset_busy_iter(&skdev->tag_set, skd_inc_in_flight, &count);
>> + blk_mq_queue_tag_busy_iter(skdev->queue, skd_inc_in_flight, &count, true);
>>
>> return count;
>> }
>
> Hi Jianchao,
>
> If you have a look at the skd_in_flight() callers you will see that the above
> change is not necessary.
>
>> @@ -1916,7 +1916,7 @@ static bool skd_recover_request(struct request *req, void *data, bool reserved)
>>
>> static void skd_recover_requests(struct skd_device *skdev)
>> {
>> - blk_mq_tagset_busy_iter(&skdev->tag_set, skd_recover_request, skdev);
>> + blk_mq_queue_tag_busy_iter(skdev->queue, skd_recover_request, skdev, true);
>> }
>
> Same comment here. If you have a look at the callers of this function you will
> see that this change is not necessary.
>
> Thanks,
>
> Bart.
Hi Bart
Thanks for your comment on this.
On 3/19/19 1:16 AM, Bart Van Assche wrote:
> On Fri, 2019-03-15 at 16:57 +0800, Jianchao Wang wrote:
>> blk_mq_tagset_busy_iter is not safe in that it could get a stale request
>> in tags->rqs[]. Use blk_mq_queue_tag_busy_iter here.
>>
>> Signed-off-by: Jianchao Wang <[email protected]>
>> ---
>> drivers/block/nbd.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/block/nbd.c b/drivers/block/nbd.c
>> index 7c9a949..9e7e828 100644
>> --- a/drivers/block/nbd.c
>> +++ b/drivers/block/nbd.c
>> @@ -747,7 +747,7 @@ static bool nbd_clear_req(struct request *req, void *data, bool reserved)
>> static void nbd_clear_que(struct nbd_device *nbd)
>> {
>> blk_mq_quiesce_queue(nbd->disk->queue);
>> - blk_mq_tagset_busy_iter(&nbd->tag_set, nbd_clear_req, NULL);
>> + blk_mq_queue_tag_busy_iter(nbd->disk->queue, nbd_clear_req, NULL, true);
>> blk_mq_unquiesce_queue(nbd->disk->queue);
>> dev_dbg(disk_to_dev(nbd->disk), "queue cleared\n");
>> }
>
> Hi Jianchao,
>
> The nbd driver calls nbd_clear_que() after having called sock_shutdown(). So
> what makes you think that it's not safe to call blk_mq_tagset_busy_iter()
> from nbd_clear_que()?
>
The request_queue is not frozen, so someone could still enter and allocate a
request. If no io scheduler is attached, a driver tag could be allocated and
tags->rqs[] would be set. During this, blk_mq_tagset_busy_iter could pick up
stale requests in tags->rqs[] that may already have been freed, e.g. due to
switching the io scheduler to none, as illustrated below.
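Put as a call chain (representative, not an observed trace):

(while an io scheduler was attached)
tags->rqs[tag] = sched rq            switch io scheduler to none
                                     -> blk_mq_free_rqs (sched requests freed)
nbd_clear_que
-> blk_mq_tagset_busy_iter
-> bt_tags_iter
-> rq = tags->rqs[tag] (stale pointer)
-> nbd_clear_req(rq) (use after free)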
Thanks
Jianchao
On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> Do you mean this patch from Jens ?
> https://marc.info/?l=linux-block&m=154534605914798&w=2
>
> + rcu_read_lock();
> sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> + rcu_read_unlock();
>
> The busy_iter_fn could sleep for nvme
> blk_mq_check_expired
> -> blk_mq_rq_timed_out
> -> q->mq_ops->timeout
> nvme_timeout
> -> nvme_dev_disable
> -> mutex_lock dev->shutdown_lock
Hi Jianchao,
I think that's an additional reason to rewrite NVMe error handling ...
Best regards,
Bart.
On Tue, Mar 19, 2019 at 08:10:22AM -0700, Bart Van Assche wrote:
> On Tue, 2019-03-19 at 09:25 +0800, jianchao.wang wrote:
> > Do you mean this patch from Jens ?
> > https://marc.info/?l=linux-block&m=154534605914798&w=2
> >
> > + rcu_read_lock();
> > sbitmap_for_each_set(&bt->sb, bt_iter, &iter_data);
> > + rcu_read_unlock();
> >
> > The busy_iter_fn could sleep for nvme
> > blk_mq_check_expired
> > -> blk_mq_rq_timed_out
> > -> q->mq_ops->timeout
> > nvme_timeout
> > -> nvme_dev_disable
> > -> mutex_lock dev->shutdown_lock
>
> Hi Jianchao,
>
> I think that's an additional reason to rewrite NVMe error handling ...
Nonsense. Block timeout handling runs in a work queue precisely so
handlers can actually do useful work in line with the notification.
>> Hi Jianchao,
>>
>> I think that's an additional reason to rewrite NVMe error handling ...
>
> Nonsense. Block timeout handling runs in a work queue precisely so
> handlers can actually do useful work in line with the notification.
I have to agree with Keith on this one; there is absolutely no reason to
force this constraint on the error handler. If we want to tear down stuff
to guarantee that we can free the request safely, we may very well need
to block on queue flushing.
On 3/15/19 1:57 AM, Jianchao Wang wrote:
> tags->rqs[] will not be cleaned when freeing a driver tag, and there
> is a window between getting a driver tag and writing tags->rqs[], so we
> may see a stale rq in tags->rqs[] which may have been freed, as in the
> following case,
> blk_mq_get_request blk_mq_queue_tag_busy_iter
> -> blk_mq_get_tag
> -> bt_for_each
> -> bt_iter
> -> rq = tags->rqs[]
> -> rq->q
> -> blk_mq_rq_ctx_init
> -> data->hctx->tags->rqs[rq->tag] = rq;
>
> To fix this, the blk_mq_queue_tag_busy_iter is changed in this
> patch to use tags->static_rqs[] instead of tags->rqs[]. We have
> to identify whether an io scheduler is attached to decide whether
> to use hctx->tags or hctx->sched_tags. And we will try to get a
> non-zero q_usage_counter before that, so it is safe to access
> them. Add an 'inflight' parameter to determine whether to iterate
> in-flight requests or just busy tags. A correction here is that
> part_in_flight should count the busy tags instead of the rqs that
> have got driver tags.
IMO, instead of this parameter, add a wrapper like
blk_mq_queue_tag_inflight_iter() or keep the parameter out until
we actually have a user that calls it for busy and not inflight.
Other than that, I think that iterating over static_rqs is a good
solution to the problem described.
Hi Sagi
Thanks for your kind response.
On 3/21/19 2:52 AM, Sagi Grimberg wrote:
>
>
> On 3/15/19 1:57 AM, Jianchao Wang wrote:
>> tags->rqs[] will not be cleaned when freeing a driver tag, and there
>> is a window between getting a driver tag and writing tags->rqs[], so we
>> may see a stale rq in tags->rqs[] which may have been freed, as in the
>> following case,
>> blk_mq_get_request blk_mq_queue_tag_busy_iter
>> -> blk_mq_get_tag
>> -> bt_for_each
>> -> bt_iter
>> -> rq = tags->rqs[]
>> -> rq->q
>> -> blk_mq_rq_ctx_init
>> -> data->hctx->tags->rqs[rq->tag] = rq;
>>
>> To fix this, the blk_mq_queue_tag_busy_iter is changed in this
>> patch to use tags->static_rqs[] instead of tags->rqs[]. We have
>> to identify whether an io scheduler is attached to decide whether
>> to use hctx->tags or hctx->sched_tags. And we will try to get a
>> non-zero q_usage_counter before that, so it is safe to access
>> them. Add an 'inflight' parameter to determine whether to iterate
>> in-flight requests or just busy tags. A correction here is that
>> part_in_flight should count the busy tags instead of the rqs that
>> have got driver tags.
>
> IMO, instead of this parameter, add a wrapper like
> blk_mq_queue_tag_inflight_iter() or keep the parameter out until
> we actually have a user that calls it for busy and not inflight.
Using a wrapper instead of exposing the parameter to the caller is indeed better.
I will change it in the next version, roughly along the lines of the sketch below.
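One possible shape for the wrapper, using the name Sagi suggested (a sketch,
not the final patch):

static inline void blk_mq_queue_tag_inflight_iter(struct request_queue *q,
		busy_iter_fn *fn, void *priv)
{
	/* same walk as blk_mq_queue_tag_busy_iter, fixed to in-flight only */
	blk_mq_queue_tag_busy_iter(q, fn, priv, true);
}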
When the 'inflight' parameter is true, we iterate the in-flight requests;
otherwise, we iterate all the busy requests whether they are in-flight
or not. The latter is useful for part_in_flight. Currently, we only
account the requests that hold driver tags, which is inconsistent between
the io-scheduler and non-io-scheduler cases; see the filter sketch below.
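The difference between the two modes comes down to the state filter in the hunk
quoted earlier in the thread:

/* inflight == true: only requests in MQ_RQ_IN_FLIGHT state, e.g. for
 * timeout handling; inflight == false: every allocated (busy) tag,
 * e.g. for part_in_flight accounting. */
if (!iter_data->inflight ||
    blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
	return iter_data->fn(rq, iter_data->data, reserved);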
>
> Other than that, I think that iterating over static_rqs is a good
> solution to the problem described.
>
Sincerely
Jianchao