2022-09-13 04:20:10

by Ziyang Zhang

Subject: [PATCH V3 0/7] ublk_drv: add USER_RECOVERY support

ublk_drv is a driver that simply passes all blk-mq rqs to a userspace
target (such as ublksrv[1]). For each ublk queue, there is one
ubq_daemon (pthread). All ubq_daemons share the same process,
which opens /dev/ublkcX. The ubq_daemon code loops forever on
io_uring_enter() to send/receive io_uring cmds that carry
information about blk-mq rqs.

Since the real IO handler (the process/thread opening /dev/ublkcX) is
in userspace, it could crash if:
(1) the user kills it with -9 because of an IO hang on the backend,
a system reboot, etc.
(2) the process/thread hits an exception (segfault, division error,
oom...)
Therefore, the kernel driver has to deal with a dying ubq_daemon or
process.

Now, if one ubq_daemon (pthread) or the process crashes, ublk_drv
must abort the dying ubq, stop the device and release everything.
This is not a good choice in practice because users do not expect
aborted requests, I/O errors and a released device. They may want
a recovery mechanism so that no requests are aborted and no I/O
error occurs. In short, users just want everything to work as usual.

This patchset implements USER_RECOVERY support. If the process
or any ubq_daemon (pthread) crashes (exits accidentally), we allow
the user to provide a new process and new ubq_daemons.

Note: The responsibility for recovery belongs to the user who opens
/dev/ublkcX. After a crash, the kernel driver only switches the
device's state to be ready for recovery (START_USER_RECOVERY) or
termination (STOP_DEV). The state is defined as UBLK_S_DEV_QUIESCED.
This patchset does not prescribe how to detect such a crash in
userspace. The user has many ways to do so. For example, the user may:
(1) send GET_DEV_INFO on a specific dev_id and check if its state is
UBLK_S_DEV_QUIESCED.
(2) run 'ps' on ublksrv_pid.
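As an illustration of (2), a minimal liveness probe might look like the
sketch below. This is only an example; the helper name is ours and not
part of ublksrv. Note that kill -0 delivers no signal, it only checks
that the pid exists and that we may signal it:

```shell
# Hypothetical helper: report "alive" or "dead" for a given ublksrv pid.
# kill -0 sends no signal; it only checks pid existence/permission.
ublksrv_state() {
    local pid="$1"
    if kill -0 "$pid" 2>/dev/null; then
        echo "alive"
    else
        echo "dead"
    fi
}
```

If the probe reports "dead" (or GET_DEV_INFO reports
UBLK_S_DEV_QUIESCED), the user can start recovery.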

The recovery feature is quite useful for real products. In detail,
we support this scenario:
(1) /dev/ublkc0 is opened by process 0.
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
rqs are handled by process 0.
(3) Process 0 suddenly crashes (e.g. segfault).
(4) Fio keeps running and submitting IOs (but these IOs cannot
be dispatched now).
(5) The user starts process 1 and attaches it to /dev/ublkc0.
(6) All rqs are handled by process 1 now and IOs can be
completed.

Note: The backend must tolerate double-writes because we may re-issue
a rq that was already sent to the old process 0.

We provide a sample script here to simulate the above steps:

***************************script***************************
LOOPS=10

__ublk_get_pid() {
    pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
    echo $pid
}

ublk_recover_kill()
{
    for CNT in `seq $LOOPS`; do
        dmesg -C
        pid=`__ublk_get_pid`
        echo -e "*** kill $pid now ***"
        kill -9 $pid
        sleep 6
        echo -e "*** recover now ***"
        ./ublk recover -n 0
        sleep 6
    done
}

ublk_test()
{
    echo -e "*** add ublk device ***"
    ./ublk add -t null -d 4 -i 1
    sleep 2
    echo -e "*** start fio ***"
    fio --bs=4k \
        --filename=/dev/ublkb0 \
        --runtime=140s \
        --rw=read &
    sleep 4
    ublk_recover_kill
    wait
    echo -e "*** delete ublk device ***"
    ./ublk del -n 0
}

for CNT in `seq 4`; do
    modprobe -rv ublk_drv
    modprobe ublk_drv
    echo -e "************ round $CNT ************"
    ublk_test
    sleep 5
done
***************************script***************************

You may run it with our modified ublksrv[2], which supports the
recovery feature. No I/O error occurs, and you can verify this by
running:
$ perf-tools/bin/tpoint block:block_rq_error
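The same check can also be done with the in-kernel tracepoint directly.
This is a sketch assuming tracefs is mounted at the usual debugfs
location; override TRACEFS if your setup differs:

```shell
# Enable the block:block_rq_error tracepoint and watch trace output for
# I/O errors. The TRACEFS path is an assumption; adjust per distro.
TRACEFS="${TRACEFS:-/sys/kernel/debug/tracing}"

enable_rq_error_tracing() {
    echo 1 > "$TRACEFS/events/block/block_rq_error/enable"
}

# After the test, any line in $TRACEFS/trace mentioning block_rq_error
# indicates a failed request; an empty trace means no I/O error.
```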

The basic idea of USER_RECOVERY is quite straightforward:
(1) quiesce ublk queues and requeue/abort rqs.
(2) release/free everything belonging to the dying process.
Note: Since ublk_drv saves information about the user process,
this work is important because we don't expect any resource
leakage. In particular, ioucmds from the dying ubq_daemons
need to be completed (freed).
(3) allow new ubq_daemons to issue FETCH_REQ.
Note: ublk_ch_uring_cmd() checks some statuses and flags. We
have to set them to correct values.

Here are the steps to recover:
(1) monitor_work finds one dying ubq_daemon and schedules
quiesce_work.
(2) quiesce_work must (a) quiesce the request queue to ban any incoming
ublk_queue_rq(), (b) requeue/abort inflight rqs, and (c) complete old
ioucmds. Then the ublk device is ready for recovery or stop.
(3) The user sends the START_USER_RECOVERY ctrl-cmd to /dev/ublk-control
with a dev_id X (such as 3 for /dev/ublkc3).
(4) Then ublk_drv prepares for a new process to attach to /dev/ublkcX.
All ublk_io structures are cleared and ubq_daemons are reset.
(5) Then, the user should start a new process and ubq_daemons (pthreads)
and send FETCH_REQ via io_uring_enter() to make all ubqs ready. The
user must correctly set up queues, flags and so on (how to persist the
user's information is not related to this patchset).
(6) The user sends the END_USER_RECOVERY ctrl-cmd to /dev/ublk-control
with a dev_id X.
(7) After receiving END_USER_RECOVERY, ublk_drv waits for all
ubq_daemons to get ready. Then it unquiesces the request queue and new
rqs are allowed.
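From the user's point of view, steps (3)-(6) boil down to one
"recover" action. Below is a dry-runnable sketch with the ublk tool
from our modified ublksrv[2]; the wrapper is illustrative, not the
tool's real implementation, and UBLK=echo turns it into a dry run:

```shell
# Sketch of the user-side recovery flow (steps (3)-(6) above), assuming
# the 'ublk' tool from the modified ublksrv[2]. 'ublk recover' is
# expected to send START_USER_RECOVERY, start a new process whose
# ubq_daemons re-issue FETCH_REQ, and finish with END_USER_RECOVERY.
UBLK="${UBLK:-./ublk}"

recover_dev() {
    local dev_id="$1"
    $UBLK recover -n "$dev_id"
}
```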

You should use the ublksrv[2] and tests[3] provided by us. We add 3
additional tests to verify that the recovery feature works. Our code
will be PR-ed to Ming's repo soon.

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic

Since V2:
(1) run ublk_quiesce_dev() in a standalone work.
(2) do not run monitor_work after START_USER_RECOVERY is handled.
(3) refactor recovery feature code so that it does not affect current code.

Since V1:
(1) refactor cover letter. Add an introduction on "how to detect a crash"
and "why we need the recovery feature".
(2) do not refactor task_work and ublk_queue_rq().
(3) allow users to freely stop/recover the device.
(4) add a comment on ublk_cancel_queue().
(5) refactor monitor_work and the aborting mechanism since we add the
recovery mechanism in monitor_work.

ZiyangZhang (7):
ublk_drv: check 'current' instead of 'ubq_daemon'
ublk_drv: refactor ublk_cancel_queue()
ublk_drv: define macros for recovery feature and check them
ublk_drv: requeue rqs with recovery feature enabled
ublk_drv: consider recovery feature in aborting mechanism
ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
ublk_drv: do not run monitor_work while ub's state is QUIESCED

drivers/block/ublk_drv.c | 409 ++++++++++++++++++++++++++++++++--
include/uapi/linux/ublk_cmd.h | 7 +
2 files changed, 398 insertions(+), 18 deletions(-)

--
2.27.0


2022-09-13 04:20:35

by Ziyang Zhang

Subject: [PATCH V3 2/7] ublk_drv: refactor ublk_cancel_queue()

Assume only a few FETCH_REQ ioucmds are sent to ublk_drv, then the
ubq_daemon exits. We have to call io_uring_cmd_done() for all ioucmds
received so that the io_uring ctx does not leak.

Since ublk_cancel_queue() may be called before START_DEV or after
STOP_DEV, we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE
so that io_uring_cmd_done() is not called twice for one ioucmd,
avoiding a UAF. Also, clearing UBLK_IO_FLAG_ACTIVE makes the code more
reasonable.

Signed-off-by: ZiyangZhang <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 23 ++++++++++++++++++++---
1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c39b67d7133d..0c6db0978ed0 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -963,22 +963,39 @@ static inline bool ublk_queue_ready(struct ublk_queue *ubq)
return ubq->nr_io_ready == ubq->q_depth;
}

+/* If ublk_cancel_queue() is called before sending START_DEV(), ->mutex
+ * provides protection on above update.
+ *
+ * If ublk_cancel_queue() is called after sending START_DEV(), disk is
+ * deleted first, UBLK_IO_RES_ABORT is returned so that any new io
+ * command can't be issued to driver, so updating on io flags and
+ * nr_io_ready is safe here.
+ *
+ * Also ->nr_io_ready is guaranteed to become zero after ublk_cancel_queue()
+ * returns since the request queue is either frozen or not present in both
+ * cases.
+ */
static void ublk_cancel_queue(struct ublk_queue *ubq)
{
int i;

- if (!ublk_queue_ready(ubq))
+ if (!ubq->nr_io_ready)
return;

for (i = 0; i < ubq->q_depth; i++) {
struct ublk_io *io = &ubq->ios[i];

- if (io->flags & UBLK_IO_FLAG_ACTIVE)
+ if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, i);
io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+ io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+ ubq->nr_io_ready--;
+ }
}

/* all io commands are canceled */
- ubq->nr_io_ready = 0;
+ WARN_ON_ONCE(ubq->nr_io_ready);
}

/* Cancel all pending commands, must be called after del_gendisk() returns */
--
2.27.0

2022-09-13 04:26:43

by Ziyang Zhang

Subject: [PATCH V3 1/7] ublk_drv: check 'current' instead of 'ubq_daemon'

This check is not atomic. So with the recovery feature, ubq_daemon may
be modified simultaneously by the recovery task. Instead, checking
'current' is safe here because 'current' never changes.

Also add a comment explaining this check, which is really important for
understanding the recovery feature.

Signed-off-by: ZiyangZhang <[email protected]>
Reviewed-by: Ming Lei <[email protected]>
---
drivers/block/ublk_drv.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6a4a94b4cdf4..c39b67d7133d 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req)
struct ublk_device *ub = ubq->dev;
int tag = req->tag;
struct ublk_io *io = &ubq->ios[tag];
- bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq);
unsigned int mapped_bytes;

pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
ublk_get_iod(ubq, req->tag)->addr);

- if (unlikely(task_exiting)) {
+ /*
+ * Task is exiting if either:
+ *
+ * (1) current != ubq_daemon.
+ * io_uring_cmd_complete_in_task() tries to run task_work
+ * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING.
+ *
+ * (2) current->flags & PF_EXITING.
+ */
+ if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
blk_mq_end_request(req, BLK_STS_IOERR);
mod_delayed_work(system_wq, &ub->monitor_work, 0);
return;
--
2.27.0

2022-09-13 04:26:54

by Ziyang Zhang

Subject: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

With the recovery feature enabled, in ublk_queue_rq or task work
(in exit_task_work or fallback wq), we requeue rqs instead of
ending (aborting) them. Besides, no matter whether the recovery
feature is enabled or disabled, we schedule monitor_work immediately.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
1 file changed, 32 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 23337bd7c105..b067f33a1913 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)

#define UBLK_REQUEUE_DELAY_MS 3

+static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
+ struct request *rq)
+{
+ pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
+ (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
+ ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
+ /* We cannot process this rq so just requeue it. */
+ if (ublk_queue_can_use_recovery(ubq)) {
+ blk_mq_requeue_request(rq, false);
+ blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
+ } else {
+ blk_mq_end_request(rq, BLK_STS_IOERR);
+ }
+}
+
static inline void __ublk_rq_task_work(struct request *req)
{
struct ublk_queue *ubq = req->mq_hctx->driver_data;
@@ -704,7 +719,7 @@ static inline void __ublk_rq_task_work(struct request *req)
* (2) current->flags & PF_EXITING.
*/
if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
- blk_mq_end_request(req, BLK_STS_IOERR);
+ __ublk_abort_rq_in_task_work(ubq, req);
mod_delayed_work(system_wq, &ub->monitor_work, 0);
return;
}
@@ -779,6 +794,21 @@ static void ublk_rq_task_work_fn(struct callback_head *work)
__ublk_rq_task_work(req);
}

+static inline blk_status_t __ublk_abort_rq(struct ublk_queue *ubq,
+ struct request *rq)
+{
+ pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
+ (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
+ ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
+ /* We cannot process this rq so just requeue it. */
+ if (ublk_queue_can_use_recovery(ubq)) {
+ blk_mq_requeue_request(rq, false);
+ blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
+ return BLK_STS_OK;
+ }
+ return BLK_STS_IOERR;
+}
+
static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
const struct blk_mq_queue_data *bd)
{
@@ -796,7 +826,7 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(ubq_daemon_is_dying(ubq))) {
fail:
mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
- return BLK_STS_IOERR;
+ return __ublk_abort_rq(ubq, rq);
}

if (ublk_can_use_task_work(ubq)) {
--
2.27.0

2022-09-13 04:26:56

by Ziyang Zhang

Subject: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

With the USER_RECOVERY feature enabled, the monitor_work schedules
quiesce_work after finding a dying ubq_daemon. The quiesce_work's job
is to:
(1) quiesce the request queue.
(2) check if there is any INFLIGHT rq with UBLK_IO_FLAG_ACTIVE unset.
If so, we retry until all these rqs are requeued by ublk_queue_rq()
and task_work and become IDLE.
(3) requeue/abort inflight rqs issued to the crashed ubq_daemon. If
UBLK_F_USER_RECOVERY_REISSUE is set, the rq is requeued; otherwise
it is aborted.
(4) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
do so because no ioucmd can be referenced now.
(5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
recovery. This state is exposed to userspace via GET_DEV_INFO.

The driver can always handle STOP_DEV and clean up everything no matter
whether ub's state is LIVE or QUIESCED. After ub's state is
UBLK_S_DEV_QUIESCED, the user can recover with a new process by
sending START_USER_RECOVERY.

Note: we do not change the default behavior with the recovery feature
disabled. monitor_work still schedules stop_work and aborts inflight
rqs. Finally ublk_device is released.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 168 +++++++++++++++++++++++++++++++++++++--
1 file changed, 161 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index b067f33a1913..4409a130d0b6 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -121,7 +121,7 @@ struct ublk_queue {

unsigned long io_addr; /* mapped vm address */
unsigned int max_io_sz;
- bool abort_work_pending;
+ bool force_abort;
unsigned short nr_io_ready; /* how many ios setup */
struct ublk_device *dev;
struct ublk_io ios[0];
@@ -163,6 +163,7 @@ struct ublk_device {
* monitor each queue's daemon periodically
*/
struct delayed_work monitor_work;
+ struct work_struct quiesce_work;
struct work_struct stop_work;
};

@@ -660,6 +661,11 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);

if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
+ pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
+ __func__,
+ ((struct ublk_queue *)req->mq_hctx->driver_data)->q_id,
+ req->tag,
+ io->flags);
io->flags |= UBLK_IO_FLAG_ABORTED;
blk_mq_end_request(req, BLK_STS_IOERR);
}
@@ -820,6 +826,21 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
res = ublk_setup_iod(ubq, rq);
if (unlikely(res != BLK_STS_OK))
return BLK_STS_IOERR;
+ /* With recovery feature enabled, force_abort is set in
+ * ublk_stop_dev() before calling del_gendisk() if ub's state
+ * is QUIESCED. We have to abort all requeued and new rqs here
+ * to let del_gendisk() move on. Besides, we do not call
+ * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
+ *
+ * Note: force_abort is guaranteed to be seen because it is set
+ * before request queue is unquiesced.
+ */
+ if (unlikely(ubq->force_abort)) {
+ pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
+ __func__, ubq->q_id, rq->tag,
+ ubq->ios[rq->tag].flags);
+ return BLK_STS_IOERR;
+ }

blk_mq_start_request(bd->rq);

@@ -1003,6 +1024,101 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
ublk_put_device(ub);
}

+static bool ublk_check_inflight_rq(struct request *rq, void *data)
+{
+ struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+ struct ublk_io *io = &ubq->ios[rq->tag];
+ bool *busy = data;
+
+ if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+ *busy = true;
+ return false;
+ }
+ return true;
+}
+
+static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
+{
+ bool busy = false;
+
+ WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
+ while (true) {
+ blk_mq_tagset_busy_iter(&ub->tag_set,
+ ublk_check_inflight_rq, &busy);
+ if (busy)
+ msleep(UBLK_REQUEUE_DELAY_MS);
+ else
+ break;
+ }
+}
+
+static void ublk_quiesce_queue(struct ublk_device *ub,
+ struct ublk_queue *ubq)
+{
+ int i;
+
+ for (i = 0; i < ubq->q_depth; i++) {
+ struct ublk_io *io = &ubq->ios[i];
+
+ if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
+ struct request *rq = blk_mq_tag_to_rq(
+ ub->tag_set.tags[ubq->q_id], i);
+
+ WARN_ON_ONCE(!rq);
+ pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
+ ublk_queue_can_use_recovery_reissue(ubq) ?
+ "requeue" : "abort",
+ ubq->q_id, i, io->flags);
+ if (ublk_queue_can_use_recovery_reissue(ubq))
+ blk_mq_requeue_request(rq, false);
+ else
+ __ublk_fail_req(io, rq);
+ } else {
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, i);
+ io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+ io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+ }
+ ubq->nr_io_ready--;
+ }
+ WARN_ON_ONCE(ubq->nr_io_ready);
+}
+
+static void ublk_quiesce_dev(struct ublk_device *ub)
+{
+ int i;
+
+ mutex_lock(&ub->mutex);
+ if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+ goto unlock;
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ if (!ubq_daemon_is_dying(ubq))
+ goto unlock;
+ }
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+ ublk_wait_tagset_rqs_idle(ub);
+ pr_devel("%s: quiesce ub: dev_id %d\n",
+ __func__, ub->dev_info.dev_id);
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_quiesce_queue(ub, ublk_get_queue(ub, i));
+
+ ub->dev_info.state = UBLK_S_DEV_QUIESCED;
+ unlock:
+ mutex_unlock(&ub->mutex);
+}
+
+static void ublk_quiesce_work_fn(struct work_struct *work)
+{
+ struct ublk_device *ub =
+ container_of(work, struct ublk_device, quiesce_work);
+
+ ublk_quiesce_dev(ub);
+}
+
static void ublk_daemon_monitor_work(struct work_struct *work)
{
struct ublk_device *ub =
@@ -1013,10 +1129,14 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
struct ublk_queue *ubq = ublk_get_queue(ub, i);

if (ubq_daemon_is_dying(ubq)) {
- schedule_work(&ub->stop_work);
-
- /* abort queue is for making forward progress */
- ublk_abort_queue(ub, ubq);
+ if (ublk_queue_can_use_recovery(ubq)) {
+ schedule_work(&ub->quiesce_work);
+ } else {
+ schedule_work(&ub->stop_work);
+
+ /* abort queue is for making forward progress */
+ ublk_abort_queue(ub, ubq);
+ }
}
}

@@ -1080,12 +1200,43 @@ static void ublk_cancel_dev(struct ublk_device *ub)
ublk_cancel_queue(ublk_get_queue(ub, i));
}

+static void ublk_unquiesce_dev(struct ublk_device *ub)
+{
+ int i;
+
+ pr_devel("%s: ub state %s\n", __func__,
+ ub->dev_info.state == UBLK_S_DEV_LIVE ?
+ "LIVE" : "QUIESCED");
+ if (ub->dev_info.state == UBLK_S_DEV_LIVE) {
+ /*
+ * quiesce_work cannot be running. We let monitor_work,
+ * ublk_queue_rq() and task_work abort rqs instead of
+ * requeuing them with a dying ubq_daemon. Then
+ * del_gendisk() can move on.
+ */
+ ublk_disable_recovery(ub);
+ } else {
+ /* quiesce_work has run. We let requeued rqs be aborted
+ * before running fallback_wq. "force_abort" must be seen
+ * after request queue is unquiesced. Then del_gendisk()
+ * can move on.
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->force_abort = true;
+
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+ /* We may have requeued some rqs in ublk_quiesce_queue() */
+ blk_mq_kick_requeue_list(ub->ub_disk->queue);
+ }
+}
+
static void ublk_stop_dev(struct ublk_device *ub)
{
mutex_lock(&ub->mutex);
- if (ub->dev_info.state != UBLK_S_DEV_LIVE)
+ if (ub->dev_info.state == UBLK_S_DEV_DEAD)
goto unlock;
-
+ if (ublk_can_use_recovery(ub))
+ ublk_unquiesce_dev(ub);
del_gendisk(ub->ub_disk);
ub->dev_info.state = UBLK_S_DEV_DEAD;
ub->dev_info.ublksrv_pid = -1;
@@ -1409,6 +1560,7 @@ static void ublk_remove(struct ublk_device *ub)
{
ublk_stop_dev(ub);
cancel_work_sync(&ub->stop_work);
+ cancel_work_sync(&ub->quiesce_work);
cdev_device_del(&ub->cdev, &ub->cdev_dev);
put_device(&ub->cdev_dev);
}
@@ -1585,6 +1737,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_unlock;
mutex_init(&ub->mutex);
spin_lock_init(&ub->mm_lock);
+ INIT_WORK(&ub->quiesce_work, ublk_quiesce_work_fn);
INIT_WORK(&ub->stop_work, ublk_stop_work_fn);
INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work);

@@ -1705,6 +1858,7 @@ static int ublk_ctrl_stop_dev(struct io_uring_cmd *cmd)

ublk_stop_dev(ub);
cancel_work_sync(&ub->stop_work);
+ cancel_work_sync(&ub->quiesce_work);

ublk_put_device(ub);
return 0;
--
2.27.0

2022-09-13 04:37:44

by Ziyang Zhang

Subject: [PATCH V3 3/7] ublk_drv: define macros for recovery feature and check them

Define some macros for recovery feature. Especially define a new state:
UBLK_S_DEV_QUIESCED which implies that ublk_device is quiesced
and is ready for recovery. This state can be observed by userspace.

UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables the recovery feature. It won't let monitor_work
automatically abort rqs and release the device.
(2) With a dying ubq_daemon, ublk_drv ends (aborts) rqs issued to
userspace (ublksrv) before the crash.
(3) With a dying ubq_daemon, in task work and ublk_queue_rq(),
ublk_drv requeues rqs.

UBLK_F_USER_RECOVERY_REISSUE implies that:
(1) everything UBLK_F_USER_RECOVERY implies, except that
(2) with a dying ubq_daemon, ublk_drv requeues rqs issued to
userspace (ublksrv) before the crash.

UBLK_F_USER_RECOVERY_REISSUE is designed for backends that:
(1) tolerate double-writes because ublk_drv may issue the same rq
twice.
(2) do not let frontend users get I/O errors, such as a read-only FS
or a VM backend.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 45 ++++++++++++++++++++++++++++++++++-
include/uapi/linux/ublk_cmd.h | 7 ++++++
2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0c6db0978ed0..23337bd7c105 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -49,7 +49,9 @@
/* All UBLK_F_* have to be included into UBLK_F_ALL */
#define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
| UBLK_F_URING_CMD_COMP_IN_TASK \
- | UBLK_F_NEED_GET_DATA)
+ | UBLK_F_NEED_GET_DATA \
+ | UBLK_F_USER_RECOVERY \
+ | UBLK_F_USER_RECOVERY_REISSUE)

/* All UBLK_PARAM_TYPE_* should be included here */
#define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
@@ -323,6 +325,47 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
PAGE_SIZE);
}

+static inline bool ublk_queue_can_use_recovery(
+ struct ublk_queue *ubq)
+{
+ if (ubq->flags & UBLK_F_USER_RECOVERY)
+ return true;
+ return false;
+}
+
+static inline void ublk_disable_recovery(struct ublk_device *ub)
+{
+ int i;
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ ubq->flags &= ~UBLK_F_USER_RECOVERY;
+ }
+}
+
+static inline bool ublk_can_use_recovery(struct ublk_device *ub)
+{
+ int i;
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ if (!ublk_queue_can_use_recovery(ubq))
+ return false;
+ }
+ return true;
+}
+
+static inline bool ublk_queue_can_use_recovery_reissue(
+ struct ublk_queue *ubq)
+{
+ if (ublk_queue_can_use_recovery(ubq) &&
+ (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
+ return true;
+ return false;
+}
+
static void ublk_free_disk(struct gendisk *disk)
{
struct ublk_device *ub = disk->private_data;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 677edaab2b66..87204c39f1ee 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -17,6 +17,8 @@
#define UBLK_CMD_STOP_DEV 0x07
#define UBLK_CMD_SET_PARAMS 0x08
#define UBLK_CMD_GET_PARAMS 0x09
+#define UBLK_CMD_START_USER_RECOVERY 0x10
+#define UBLK_CMD_END_USER_RECOVERY 0x11

/*
* IO commands, issued by ublk server, and handled by ublk driver.
@@ -74,9 +76,14 @@
*/
#define UBLK_F_NEED_GET_DATA (1UL << 2)

+#define UBLK_F_USER_RECOVERY (1UL << 3)
+
+#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4)
+
/* device state */
#define UBLK_S_DEV_DEAD 0
#define UBLK_S_DEV_LIVE 1
+#define UBLK_S_DEV_QUIESCED 2

/* shipped via sqe->cmd of io_uring command */
struct ublksrv_ctrl_cmd {
--
2.27.0

2022-09-13 04:38:47

by Ziyang Zhang

Subject: [PATCH V3 6/7] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support

START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
to support user recovery feature.

After a crash, the user should send START_USER_RECOVERY, which will:
(1) check that (a) the current ublk_device state is UBLK_S_DEV_QUIESCED,
which was set by quiesce_work, and (b) the file struct is released.
(2) reinit all ubqs, including:
(a) put the task_struct and reset ->ubq_daemon to NULL.
(b) reset all ublk_io.
(3) reset ub->mm to NULL.

Then, the user should start a new process and send FETCH_REQ on each
ubq_daemon.

Finally, the user should send END_USER_RECOVERY, which will:
(1) wait for all new ubq_daemons to get ready.
(2) update ublksrv_pid.
(3) unquiesce the request queue and expect incoming ublk_queue_rq().
(4) convert ub's state to UBLK_S_DEV_LIVE.

Note: we can handle STOP_DEV between START_USER_RECOVERY and
END_USER_RECOVERY. This is helpful for users who cannot start a new
process after sending the START_USER_RECOVERY ctrl-cmd.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 116 +++++++++++++++++++++++++++++++++++++++
1 file changed, 116 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 4409a130d0b6..3a3af80ee938 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1961,6 +1961,116 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)
return ret;
}

+static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
+{
+ int i;
+
+ WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
+ /* All old ioucmds have to be completed */
+ WARN_ON_ONCE(ubq->nr_io_ready);
+ pr_devel("%s: prepare for recovering qid %d\n", __func__, ubq->q_id);
+ /* old daemon is PF_EXITING, put it now */
+ put_task_struct(ubq->ubq_daemon);
+ /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
+ ubq->ubq_daemon = NULL;
+
+ for (i = 0; i < ubq->q_depth; i++) {
+ struct ublk_io *io = &ubq->ios[i];
+
+ /* forget everything now and be ready for new FETCH_REQ */
+ io->flags = 0;
+ io->cmd = NULL;
+ io->addr = 0;
+ }
+}
+
+static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ struct ublk_device *ub;
+ int ret = -EINVAL;
+ int i;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return ret;
+
+ mutex_lock(&ub->mutex);
+ if (!ublk_can_use_recovery(ub))
+ goto out_unlock;
+ /*
+ * START_RECOVERY is only allowed after:
+ *
+ * (1) UB_STATE_OPEN is not set, which means the dying process is exited
+ * and related io_uring ctx is freed so file struct of /dev/ublkcX is
+ * released.
+ *
+ * (2) UBLK_S_DEV_QUIESCED is set, which means the quiesce_work:
+ * (a) has quiesced the request queue
+ * (b) has requeued every inflight rq whose io_flags is ACTIVE
+ * (c) has requeued/aborted every inflight rq whose io_flags is NOT ACTIVE
+ * (d) has completed/canceled all ioucmds owned by the dying process
+ */
+ if (test_bit(UB_STATE_OPEN, &ub->state) ||
+ ub->dev_info.state != UBLK_S_DEV_QUIESCED) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_queue_reinit(ub, ublk_get_queue(ub, i));
+ /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
+ ub->mm = NULL;
+ ub->nr_queues_ready = 0;
+ init_completion(&ub->completion);
+ ret = 0;
+ out_unlock:
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return ret;
+}
+
+static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ int ublksrv_pid = (int)header->data[0];
+ struct ublk_device *ub;
+ int ret = -EINVAL;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return ret;
+
+ pr_devel("%s: Waiting for new ubq_daemons(nr: %d) are ready, dev id %d...\n",
+ __func__, ub->dev_info.nr_hw_queues, header->dev_id);
+ /* wait until new ubq_daemon sending all FETCH_REQ */
+ wait_for_completion_interruptible(&ub->completion);
+ pr_devel("%s: All new ubq_daemons(nr: %d) are ready, dev id %d\n",
+ __func__, ub->dev_info.nr_hw_queues, header->dev_id);
+
+ mutex_lock(&ub->mutex);
+ if (!ublk_can_use_recovery(ub))
+ goto out_unlock;
+
+ if (ub->dev_info.state != UBLK_S_DEV_QUIESCED) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ ub->dev_info.ublksrv_pid = ublksrv_pid;
+ pr_devel("%s: new ublksrv_pid %d, dev id %d\n",
+ __func__, ublksrv_pid, header->dev_id);
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+ pr_devel("%s: queue unquiesced, dev id %d.\n",
+ __func__, header->dev_id);
+ blk_mq_kick_requeue_list(ub->ub_disk->queue);
+ ub->dev_info.state = UBLK_S_DEV_LIVE;
+ ret = 0;
+ out_unlock:
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return ret;
+}
+
static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
@@ -2002,6 +2112,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
case UBLK_CMD_SET_PARAMS:
ret = ublk_ctrl_set_params(cmd);
break;
+ case UBLK_CMD_START_USER_RECOVERY:
+ ret = ublk_ctrl_start_recovery(cmd);
+ break;
+ case UBLK_CMD_END_USER_RECOVERY:
+ ret = ublk_ctrl_end_recovery(cmd);
+ break;
default:
break;
}
--
2.27.0

2022-09-13 04:44:19

by Ziyang Zhang

Subject: [PATCH V3 7/7] ublk_drv: do not run monitor_work while ub's state is QUIESCED

START_USER_RECOVERY releases the task_struct of the ubq_daemon and
resets ->ubq_daemon to NULL. So in monitor_work, the check on
ubq_daemon causes a UAF. Besides, monitor_work is not necessary in the
QUIESCED state since we have already scheduled quiesce_work and
quiesced all ubqs.

Do not let monitor_work schedule itself if the state is QUIESCED. And
we cancel it in START_USER_RECOVERY and re-schedule it in
END_USER_RECOVERY to avoid UAF.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 3a3af80ee938..044f9b4a0d08 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1143,10 +1143,13 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
/*
* We can't schedule monitor work after ublk_remove() is started.
*
- * No need ub->mutex, monitor work are canceled after state is marked
- * as DEAD, so DEAD state is observed reliably.
+ * We can't schedule monitor work after ub is QUIESCED because
+ * ubq_daemon may be NULL during user recovery.
+ *
+ * No need ub->mutex, monitor work are canceled after state is not
+ * UBLK_S_DEV_LIVE, so new state is observed reliably.
*/
- if (ub->dev_info.state != UBLK_S_DEV_DEAD)
+ if (ub->dev_info.state == UBLK_S_DEV_LIVE)
schedule_delayed_work(&ub->monitor_work,
UBLK_DAEMON_MONITOR_PERIOD);
}
@@ -2016,6 +2019,7 @@ static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
ret = -EBUSY;
goto out_unlock;
}
+ cancel_delayed_work_sync(&ub->monitor_work);
pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
ublk_queue_reinit(ub, ublk_get_queue(ub, i));
@@ -2064,6 +2068,7 @@ static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
__func__, header->dev_id);
blk_mq_kick_requeue_list(ub->ub_disk->queue);
ub->dev_info.state = UBLK_S_DEV_LIVE;
+ schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);
ret = 0;
out_unlock:
mutex_unlock(&ub->mutex);
--
2.27.0

2022-09-19 04:29:26

by Ming Lei

Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
> With recovery feature enabled, in ublk_queue_rq or task work
> (in exit_task_work or fallback wq), we requeue rqs instead of
> ending(aborting) them. Besides, No matter recovery feature is enabled
> or disabled, we schedule monitor_work immediately.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
> 1 file changed, 32 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 23337bd7c105..b067f33a1913 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>
> #define UBLK_REQUEUE_DELAY_MS 3
>
> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
> + struct request *rq)
> +{
> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> + /* We cannot process this rq so just requeue it. */
> + if (ublk_queue_can_use_recovery(ubq)) {
> + blk_mq_requeue_request(rq, false);
> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);

Here you needn't kick the requeue list, since we know it can't make
progress. You can do that once before deleting the gendisk, or when
the queue is recovered.

> + } else {
> + blk_mq_end_request(rq, BLK_STS_IOERR);
> + }
> +}
> +
> static inline void __ublk_rq_task_work(struct request *req)
> {
> struct ublk_queue *ubq = req->mq_hctx->driver_data;
> @@ -704,7 +719,7 @@ static inline void __ublk_rq_task_work(struct request *req)
> * (2) current->flags & PF_EXITING.
> */
> if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
> - blk_mq_end_request(req, BLK_STS_IOERR);
> + __ublk_abort_rq_in_task_work(ubq, req);
> mod_delayed_work(system_wq, &ub->monitor_work, 0);
> return;
> }
> @@ -779,6 +794,21 @@ static void ublk_rq_task_work_fn(struct callback_head *work)
> __ublk_rq_task_work(req);
> }
>
> +static inline blk_status_t __ublk_abort_rq(struct ublk_queue *ubq,
> + struct request *rq)
> +{
> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> + /* We cannot process this rq so just requeue it. */
> + if (ublk_queue_can_use_recovery(ubq)) {
> + blk_mq_requeue_request(rq, false);
> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);

Same with above.


Thanks,
Ming

2022-09-19 09:47:10

by Ziyang Zhang

Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On 2022/9/19 11:55, Ming Lei wrote:
> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
>> With recovery feature enabled, in ublk_queue_rq or task work
>> (in exit_task_work or fallback wq), we requeue rqs instead of
>> ending(aborting) them. Besides, No matter recovery feature is enabled
>> or disabled, we schedule monitor_work immediately.
>>
>> Signed-off-by: ZiyangZhang <[email protected]>
>> ---
>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index 23337bd7c105..b067f33a1913 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>>
>> #define UBLK_REQUEUE_DELAY_MS 3
>>
>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
>> + struct request *rq)
>> +{
>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
>> + /* We cannot process this rq so just requeue it. */
>> + if (ublk_queue_can_use_recovery(ubq)) {
>> + blk_mq_requeue_request(rq, false);
>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
>
> Here you needn't kick the requeue list, since we know it can't make
> progress. You can do that once before deleting the gendisk, or when
> the queue is recovered.

No, kicking the requeue list here is necessary.

Consider USER_RECOVERY is enabled and everything goes well.
The user sends STOP_DEV; we have kicked the requeue list in
ublk_stop_dev() and are about to call del_gendisk().
However, a crash happens now. Then rqs may still be requeued
by ublk_queue_rq() because ublk_queue_rq() sees a dying
ubq_daemon. So del_gendisk() will hang because rqs are left
in the requeue list and no one kicks them.

BTW, kicking the requeue list after requeuing rqs is harmless,
since we schedule quiesce_work immediately after finding a
dying ubq_daemon, so few rqs have a chance to be re-dispatched.

>
>> + } else {
>> + blk_mq_end_request(rq, BLK_STS_IOERR);
>> + }
>> +}
>> +
>> static inline void __ublk_rq_task_work(struct request *req)
>> {
>> struct ublk_queue *ubq = req->mq_hctx->driver_data;
>> @@ -704,7 +719,7 @@ static inline void __ublk_rq_task_work(struct request *req)
>> * (2) current->flags & PF_EXITING.
>> */
>> if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
>> - blk_mq_end_request(req, BLK_STS_IOERR);
>> + __ublk_abort_rq_in_task_work(ubq, req);
>> mod_delayed_work(system_wq, &ub->monitor_work, 0);
>> return;
>> }
>> @@ -779,6 +794,21 @@ static void ublk_rq_task_work_fn(struct callback_head *work)
>> __ublk_rq_task_work(req);
>> }
>>
>> +static inline blk_status_t __ublk_abort_rq(struct ublk_queue *ubq,
>> + struct request *rq)
>> +{
>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
>> + /* We cannot process this rq so just requeue it. */
>> + if (ublk_queue_can_use_recovery(ubq)) {
>> + blk_mq_requeue_request(rq, false);
>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
>
> Same with above.
>
>
> Thanks,
> Ming

2022-09-19 10:00:48

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/19 17:32, Ming Lei wrote:
> On Tue, Sep 13, 2022 at 12:17:05PM +0800, ZiyangZhang wrote:
>> With USER_RECOVERY feature enabled, the monitor_work schedules
>> quiesce_work after finding a dying ubq_daemon. The quiesce_work's job
>> is to:
>> (1) quiesce request queue.
>> (2) check if there is any INFLIGHT rq with UBLK_IO_FLAG_ACTIVE unset.
>> If so, we retry until all these rqs are requeued by ublk_queue_rq()
>> and task_work and become IDLE.
>
> These requests should be being handled by task work or the io_uring
> fallback wq; I suggest adding that here.

Will do so.

>
>> (3) requeue/abort inflight rqs issued to the crash ubq_daemon before. If
>> UBLK_F_USER_RECOVERY_REISSUE is set, rq is requeued; or it is
>> aborted.
>> (4) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
>> do so because no ioucmd can be referenced now.
>> (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
>> recovery. This state is exposed to userspace by GET_DEV_INFO.
>>
>> The driver can always handle STOP_DEV and cleanup everything no matter
>> ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
>> user can recover with new process by sending START_USER_RECOVERY.
>>
>> Note: we do not change the default behavior with recovery feature
>> disabled. monitor_work still schedules stop_work and abort inflight
>> rqs. Finally ublk_device is released.
>
> This version looks much better than before.

Thanks :)

>
>>
>> Signed-off-by: ZiyangZhang <[email protected]>
>> ---
>> drivers/block/ublk_drv.c | 168 +++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 161 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index b067f33a1913..4409a130d0b6 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -121,7 +121,7 @@ struct ublk_queue {
>>
>> unsigned long io_addr; /* mapped vm address */
>> unsigned int max_io_sz;
>> - bool abort_work_pending;
>> + bool force_abort;
>> unsigned short nr_io_ready; /* how many ios setup */
>> struct ublk_device *dev;
>> struct ublk_io ios[0];
>> @@ -163,6 +163,7 @@ struct ublk_device {
>> * monitor each queue's daemon periodically
>> */
>> struct delayed_work monitor_work;
>> + struct work_struct quiesce_work;
>> struct work_struct stop_work;
>> };
>>
>> @@ -660,6 +661,11 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>>
>> if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
>> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
>> + __func__,
>> + ((struct ublk_queue *)req->mq_hctx->driver_data)->q_id,
>
> req->mq_hctx->queue_num is cleaner.

Ok.

>
>> + req->tag,
>> + io->flags);
>> io->flags |= UBLK_IO_FLAG_ABORTED;
>> blk_mq_end_request(req, BLK_STS_IOERR);
>> }
>> @@ -820,6 +826,21 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
>> res = ublk_setup_iod(ubq, rq);
>> if (unlikely(res != BLK_STS_OK))
>> return BLK_STS_IOERR;
>> + /* With recovery feature enabled, force_abort is set in
>> + * ublk_stop_dev() before calling del_gendisk() if ub's state
>> + * is QUIESCED. We have to abort all requeued and new rqs here
>> + * to let del_gendisk() move on. Besides, we do not call
>> + * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
>> + *
>> + * Note: force_abort is guaranteed to be seen because it is set
>> + * before request queue is unquiesced.
>> + */
>> + if (unlikely(ubq->force_abort)) {
>> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
>> + __func__, ubq->q_id, rq->tag,
>> + ubq->ios[rq->tag].flags);
>> + return BLK_STS_IOERR;
>> + }
>>
>> blk_mq_start_request(bd->rq);
>>
>> @@ -1003,6 +1024,101 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>> ublk_put_device(ub);
>> }
>>
>> +static bool ublk_check_inflight_rq(struct request *rq, void *data)
>> +{
>> + struct ublk_queue *ubq = rq->mq_hctx->driver_data;
>> + struct ublk_io *io = &ubq->ios[rq->tag];
>> + bool *busy = data;
>> +
>> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {
>> + *busy = true;
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
>> +{
>> + bool busy = false;
>> +
>> + WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
>> + while (true) {
>> + blk_mq_tagset_busy_iter(&ub->tag_set,
>> + ublk_check_inflight_rq, &busy);
>> + if (busy)
>> + msleep(UBLK_REQUEUE_DELAY_MS);
>> + else
>> + break;
>> + }
>> +}
>> +
>> +static void ublk_quiesce_queue(struct ublk_device *ub,
>> + struct ublk_queue *ubq)
>> +{
>> + int i;
>> +
>> + for (i = 0; i < ubq->q_depth; i++) {
>> + struct ublk_io *io = &ubq->ios[i];
>> +
>> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
>> + struct request *rq = blk_mq_tag_to_rq(
>> + ub->tag_set.tags[ubq->q_id], i);
>> +
>> + WARN_ON_ONCE(!rq);
>> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
>> + ublk_queue_can_use_recovery_reissue(ubq) ?
>> + "requeue" : "abort",
>> + ubq->q_id, i, io->flags);
>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>> + blk_mq_requeue_request(rq, false);
>
> This way is too violent.
>
> There may be just one queue dying, but you requeue all requests
> from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
> such as, just requeuing requests in dying queue.

If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
must exit and rqs of all queues have to be requeued/aborted. We cannot let live
ubq_daemons run any more because they do not belong to the new process.

BTW, I really wonder why there could be just one queue dying? All queues must be dying
shortly after any ubq_daemon is dying since they are all pthreads in the same process.

>
> That said you still can re-use the logic in ublk_abort_queue()/ublk_daemon_monitor_work()
> for making progress, just changing aborting request with requeue in
> ublk_abort_queue().

I get your point, but it may be hard to reuse the logic in ublk_daemon_monitor_work()
because:
(1) we have to quiesce request queue in ublk_quiesce_dev(). This has to be done with
ub_mutex.
(2) ublk_quiesce_dev() cannot be run after rqs are requeued/aborted.

>
>> + else
>> + __ublk_fail_req(io, rq);
>> + } else {
>> + pr_devel("%s: done old cmd: qid %d tag %d\n",
>> + __func__, ubq->q_id, i);
>> + io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
>> + io->flags &= ~UBLK_IO_FLAG_ACTIVE;
>> + }
>> + ubq->nr_io_ready--;
>> + }
>> + WARN_ON_ONCE(ubq->nr_io_ready);
>> +}
>> +
>> +static void ublk_quiesce_dev(struct ublk_device *ub)
>> +{
>> + int i;
>> +
>> + mutex_lock(&ub->mutex);
>> + if (ub->dev_info.state != UBLK_S_DEV_LIVE)
>> + goto unlock;
>> +
>> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
>> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
>> +
>> + if (!ubq_daemon_is_dying(ubq))
>> + goto unlock;
>> + }
>> + blk_mq_quiesce_queue(ub->ub_disk->queue);
>> + ublk_wait_tagset_rqs_idle(ub);
>> + pr_devel("%s: quiesce ub: dev_id %d\n",
>> + __func__, ub->dev_info.dev_id);
>> +
>> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
>> + ublk_quiesce_queue(ub, ublk_get_queue(ub, i));
>> +
>> + ub->dev_info.state = UBLK_S_DEV_QUIESCED;
>> + unlock:
>> + mutex_unlock(&ub->mutex);
>> +}
>> +
>> +static void ublk_quiesce_work_fn(struct work_struct *work)
>> +{
>> + struct ublk_device *ub =
>> + container_of(work, struct ublk_device, quiesce_work);
>> +
>> + ublk_quiesce_dev(ub);
>> +}
>> +
>> static void ublk_daemon_monitor_work(struct work_struct *work)
>> {
>> struct ublk_device *ub =
>> @@ -1013,10 +1129,14 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
>> struct ublk_queue *ubq = ublk_get_queue(ub, i);
>>
>> if (ubq_daemon_is_dying(ubq)) {
>> - schedule_work(&ub->stop_work);
>> -
>> - /* abort queue is for making forward progress */
>> - ublk_abort_queue(ub, ubq);
>> + if (ublk_queue_can_use_recovery(ubq)) {
>> + schedule_work(&ub->quiesce_work);
>> + } else {
>> + schedule_work(&ub->stop_work);
>> +
>> + /* abort queue is for making forward progress */
>> + ublk_abort_queue(ub, ubq);
>> + }
>
> If quiesce work are always scheduled exclusively with stop work,
> the two can be defined as union, but not one big deal.

OK.

Regards,
Zhang

2022-09-19 10:16:12

by Ming Lei

Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On Tue, Sep 13, 2022 at 12:17:05PM +0800, ZiyangZhang wrote:
> With USER_RECOVERY feature enabled, the monitor_work schedules
> quiesce_work after finding a dying ubq_daemon. The quiesce_work's job
> is to:
> (1) quiesce request queue.
> (2) check if there is any INFLIGHT rq with UBLK_IO_FLAG_ACTIVE unset.
> If so, we retry until all these rqs are requeued by ublk_queue_rq()
> and task_work and become IDLE.

These requests should be being handled by task work or the io_uring
fallback wq; I suggest adding that here.

> (3) requeue/abort inflight rqs issued to the crash ubq_daemon before. If
> UBLK_F_USER_RECOVERY_REISSUE is set, rq is requeued; or it is
> aborted.
> (4) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
> do so because no ioucmd can be referenced now.
> (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
> recovery. This state is exposed to userspace by GET_DEV_INFO.
>
> The driver can always handle STOP_DEV and cleanup everything no matter
> ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
> user can recover with new process by sending START_USER_RECOVERY.
>
> Note: we do not change the default behavior with recovery feature
> disabled. monitor_work still schedules stop_work and abort inflight
> rqs. Finally ublk_device is released.

This version looks much better than before.

>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 168 +++++++++++++++++++++++++++++++++++++--
> 1 file changed, 161 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index b067f33a1913..4409a130d0b6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -121,7 +121,7 @@ struct ublk_queue {
>
> unsigned long io_addr; /* mapped vm address */
> unsigned int max_io_sz;
> - bool abort_work_pending;
> + bool force_abort;
> unsigned short nr_io_ready; /* how many ios setup */
> struct ublk_device *dev;
> struct ublk_io ios[0];
> @@ -163,6 +163,7 @@ struct ublk_device {
> * monitor each queue's daemon periodically
> */
> struct delayed_work monitor_work;
> + struct work_struct quiesce_work;
> struct work_struct stop_work;
> };
>
> @@ -660,6 +661,11 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>
> if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> + __func__,
> + ((struct ublk_queue *)req->mq_hctx->driver_data)->q_id,

req->mq_hctx->queue_num is cleaner.

> + req->tag,
> + io->flags);
> io->flags |= UBLK_IO_FLAG_ABORTED;
> blk_mq_end_request(req, BLK_STS_IOERR);
> }
> @@ -820,6 +826,21 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> res = ublk_setup_iod(ubq, rq);
> if (unlikely(res != BLK_STS_OK))
> return BLK_STS_IOERR;
> + /* With recovery feature enabled, force_abort is set in
> + * ublk_stop_dev() before calling del_gendisk() if ub's state
> + * is QUIESCED. We have to abort all requeued and new rqs here
> + * to let del_gendisk() move on. Besides, we do not call
> + * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
> + *
> + * Note: force_abort is guaranteed to be seen because it is set
> + * before request queue is unquiesced.
> + */
> + if (unlikely(ubq->force_abort)) {
> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> + __func__, ubq->q_id, rq->tag,
> + ubq->ios[rq->tag].flags);
> + return BLK_STS_IOERR;
> + }
>
> blk_mq_start_request(bd->rq);
>
> @@ -1003,6 +1024,101 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
> ublk_put_device(ub);
> }
>
> +static bool ublk_check_inflight_rq(struct request *rq, void *data)
> +{
> + struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> + struct ublk_io *io = &ubq->ios[rq->tag];
> + bool *busy = data;
> +
> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> + *busy = true;
> + return false;
> + }
> + return true;
> +}
> +
> +static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
> +{
> + bool busy = false;
> +
> + WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
> + while (true) {
> + blk_mq_tagset_busy_iter(&ub->tag_set,
> + ublk_check_inflight_rq, &busy);
> + if (busy)
> + msleep(UBLK_REQUEUE_DELAY_MS);
> + else
> + break;
> + }
> +}
> +
> +static void ublk_quiesce_queue(struct ublk_device *ub,
> + struct ublk_queue *ubq)
> +{
> + int i;
> +
> + for (i = 0; i < ubq->q_depth; i++) {
> + struct ublk_io *io = &ubq->ios[i];
> +
> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> + struct request *rq = blk_mq_tag_to_rq(
> + ub->tag_set.tags[ubq->q_id], i);
> +
> + WARN_ON_ONCE(!rq);
> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
> + ublk_queue_can_use_recovery_reissue(ubq) ?
> + "requeue" : "abort",
> + ubq->q_id, i, io->flags);
> + if (ublk_queue_can_use_recovery_reissue(ubq))
> + blk_mq_requeue_request(rq, false);

This way is too violent.

There may be just one queue dying, but you requeue all requests
from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
such as, just requeuing requests in dying queue.

That said you still can re-use the logic in ublk_abort_queue()/ublk_daemon_monitor_work()
for making progress, just changing aborting request with requeue in
ublk_abort_queue().

> + else
> + __ublk_fail_req(io, rq);
> + } else {
> + pr_devel("%s: done old cmd: qid %d tag %d\n",
> + __func__, ubq->q_id, i);
> + io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
> + io->flags &= ~UBLK_IO_FLAG_ACTIVE;
> + }
> + ubq->nr_io_ready--;
> + }
> + WARN_ON_ONCE(ubq->nr_io_ready);
> +}
> +
> +static void ublk_quiesce_dev(struct ublk_device *ub)
> +{
> + int i;
> +
> + mutex_lock(&ub->mutex);
> + if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> + goto unlock;
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> + if (!ubq_daemon_is_dying(ubq))
> + goto unlock;
> + }
> + blk_mq_quiesce_queue(ub->ub_disk->queue);
> + ublk_wait_tagset_rqs_idle(ub);
> + pr_devel("%s: quiesce ub: dev_id %d\n",
> + __func__, ub->dev_info.dev_id);
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> + ublk_quiesce_queue(ub, ublk_get_queue(ub, i));
> +
> + ub->dev_info.state = UBLK_S_DEV_QUIESCED;
> + unlock:
> + mutex_unlock(&ub->mutex);
> +}
> +
> +static void ublk_quiesce_work_fn(struct work_struct *work)
> +{
> + struct ublk_device *ub =
> + container_of(work, struct ublk_device, quiesce_work);
> +
> + ublk_quiesce_dev(ub);
> +}
> +
> static void ublk_daemon_monitor_work(struct work_struct *work)
> {
> struct ublk_device *ub =
> @@ -1013,10 +1129,14 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
> struct ublk_queue *ubq = ublk_get_queue(ub, i);
>
> if (ubq_daemon_is_dying(ubq)) {
> - schedule_work(&ub->stop_work);
> -
> - /* abort queue is for making forward progress */
> - ublk_abort_queue(ub, ubq);
> + if (ublk_queue_can_use_recovery(ubq)) {
> + schedule_work(&ub->quiesce_work);
> + } else {
> + schedule_work(&ub->stop_work);
> +
> + /* abort queue is for making forward progress */
> + ublk_abort_queue(ub, ubq);
> + }

If quiesce work are always scheduled exclusively with stop work,
the two can be defined as union, but not one big deal.


Thanks,
Ming

2022-09-19 12:56:09

by Ming Lei

Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On Mon, Sep 19, 2022 at 05:55:05PM +0800, Ziyang Zhang wrote:
> On 2022/9/19 17:32, Ming Lei wrote:
> > On Tue, Sep 13, 2022 at 12:17:05PM +0800, ZiyangZhang wrote:
> >> With USER_RECOVERY feature enabled, the monitor_work schedules
> >> quiesce_work after finding a dying ubq_daemon. The quiesce_work's job
> >> is to:
> >> (1) quiesce request queue.
> >> (2) check if there is any INFLIGHT rq with UBLK_IO_FLAG_ACTIVE unset.
> >> If so, we retry until all these rqs are requeued by ublk_queue_rq()
> >> and task_work and become IDLE.
> >
> > These requests should be being handled by task work or the io_uring
> > fallback wq; I suggest adding that here.
>
> Will do so.
>
> >
> >> (3) requeue/abort inflight rqs issued to the crash ubq_daemon before. If
> >> UBLK_F_USER_RECOVERY_REISSUE is set, rq is requeued; or it is
> >> aborted.
> >> (4) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
> >> do so because no ioucmd can be referenced now.
> >> (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
> >> recovery. This state is exposed to userspace by GET_DEV_INFO.
> >>
> >> The driver can always handle STOP_DEV and cleanup everything no matter
> >> ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
> >> user can recover with new process by sending START_USER_RECOVERY.
> >>
> >> Note: we do not change the default behavior with recovery feature
> >> disabled. monitor_work still schedules stop_work and abort inflight
> >> rqs. Finally ublk_device is released.
> >
> > This version looks much better than before.
>
> Thanks :)
>
> >
> >>
> >> Signed-off-by: ZiyangZhang <[email protected]>
> >> ---
> >> drivers/block/ublk_drv.c | 168 +++++++++++++++++++++++++++++++++++++--
> >> 1 file changed, 161 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >> index b067f33a1913..4409a130d0b6 100644
> >> --- a/drivers/block/ublk_drv.c
> >> +++ b/drivers/block/ublk_drv.c
> >> @@ -121,7 +121,7 @@ struct ublk_queue {
> >>
> >> unsigned long io_addr; /* mapped vm address */
> >> unsigned int max_io_sz;
> >> - bool abort_work_pending;
> >> + bool force_abort;
> >> unsigned short nr_io_ready; /* how many ios setup */
> >> struct ublk_device *dev;
> >> struct ublk_io ios[0];
> >> @@ -163,6 +163,7 @@ struct ublk_device {
> >> * monitor each queue's daemon periodically
> >> */
> >> struct delayed_work monitor_work;
> >> + struct work_struct quiesce_work;
> >> struct work_struct stop_work;
> >> };
> >>
> >> @@ -660,6 +661,11 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> >> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> >>
> >> if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
> >> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> >> + __func__,
> >> + ((struct ublk_queue *)req->mq_hctx->driver_data)->q_id,
> >
> > req->mq_hctx->queue_num is cleaner.
>
> Ok.
>
> >
> >> + req->tag,
> >> + io->flags);
> >> io->flags |= UBLK_IO_FLAG_ABORTED;
> >> blk_mq_end_request(req, BLK_STS_IOERR);
> >> }
> >> @@ -820,6 +826,21 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
> >> res = ublk_setup_iod(ubq, rq);
> >> if (unlikely(res != BLK_STS_OK))
> >> return BLK_STS_IOERR;
> >> + /* With recovery feature enabled, force_abort is set in
> >> + * ublk_stop_dev() before calling del_gendisk() if ub's state
> >> + * is QUIESCED. We have to abort all requeued and new rqs here
> >> + * to let del_gendisk() move on. Besides, we do not call
> >> + * io_uring_cmd_complete_in_task() to avoid UAF on io_uring ctx.
> >> + *
> >> + * Note: force_abort is guaranteed to be seen because it is set
> >> + * before request queue is unquiesced.
> >> + */
> >> + if (unlikely(ubq->force_abort)) {
> >> + pr_devel("%s: abort rq: qid %d tag %d io_flags %x\n",
> >> + __func__, ubq->q_id, rq->tag,
> >> + ubq->ios[rq->tag].flags);
> >> + return BLK_STS_IOERR;
> >> + }
> >>
> >> blk_mq_start_request(bd->rq);
> >>
> >> @@ -1003,6 +1024,101 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
> >> ublk_put_device(ub);
> >> }
> >>
> >> +static bool ublk_check_inflight_rq(struct request *rq, void *data)
> >> +{
> >> + struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> >> + struct ublk_io *io = &ubq->ios[rq->tag];
> >> + bool *busy = data;
> >> +
> >> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> >> + *busy = true;
> >> + return false;
> >> + }
> >> + return true;
> >> +}
> >> +
> >> +static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
> >> +{
> >> + bool busy = false;
> >> +
> >> + WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
> >> + while (true) {
> >> + blk_mq_tagset_busy_iter(&ub->tag_set,
> >> + ublk_check_inflight_rq, &busy);
> >> + if (busy)
> >> + msleep(UBLK_REQUEUE_DELAY_MS);
> >> + else
> >> + break;
> >> + }
> >> +}
> >> +
> >> +static void ublk_quiesce_queue(struct ublk_device *ub,
> >> + struct ublk_queue *ubq)
> >> +{
> >> + int i;
> >> +
> >> + for (i = 0; i < ubq->q_depth; i++) {
> >> + struct ublk_io *io = &ubq->ios[i];
> >> +
> >> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> >> + struct request *rq = blk_mq_tag_to_rq(
> >> + ub->tag_set.tags[ubq->q_id], i);
> >> +
> >> + WARN_ON_ONCE(!rq);
> >> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
> >> + ublk_queue_can_use_recovery_reissue(ubq) ?
> >> + "requeue" : "abort",
> >> + ubq->q_id, i, io->flags);
> >> + if (ublk_queue_can_use_recovery_reissue(ubq))
> >> + blk_mq_requeue_request(rq, false);
> >
> > This way is too violent.
> >
> > There may be just one queue dying, but you requeue all requests
> > from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
> > such as, just requeuing requests in dying queue.
>
> If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
> must exit and rqs of all queues have to be requeued/aborted. We cannot let live
> ubq_daemons run any more because they do not belong to the new process.

IMO, the old process really can still exist; recently I even got such a
requirement, for switching a queue from one pthread to another.

What we should do is get all inflight requests done and cancel all io
commands, no matter whether the ubq pthread is dead or live.

>
> BTW, I really wonder why there could be just one queue dying? All queues must be dying
> shortly after any ubq_daemon is dying since they are all pthreads in the same process.

You can't assume it is always so. Maybe one pthread is dead first and
others are dying later; maybe just one is dead.

If one queue's pthread is live, you may get into trouble by simply
requeuing its requests, which is why I suggest re-using the logic of
ublk_daemon_monitor_work()/ublk_abort_queue().

For stopping the device, the request queue is frozen in del_gendisk()
and all in-flight requests are drained, and the monitor work provides
that guarantee.

For user recovery, the monitor work should help you too, by aborting a
queue if it is dying, until all requests are drained.

>
> >
> > That said you still can re-use the logic in ublk_abort_queue()/ublk_daemon_monitor_work()
> > for making progress, just changing aborting request with requeue in
> > ublk_abort_queue().
>
> I get your point, but it may be hard to reuse the logic in ublk_daemon_monitor_work()
> because:
> (1) we have to quiesce request queue in ublk_quiesce_dev(). This has to be done with
> ub_mutex.
> (2) ublk_quiesce_dev() cannot be run after rqs are requeued/aborted.

I don't get your point. The request queue needs to be quiesced once;
then inflight requests are either requeued if their queue is dying, or
completed by the queue's pthread if it is live. As you mentioned, in
reality all pthreads will usually be killed, but the timing can differ,
and I think you cannot requeue a request if its ubq pthread isn't dying.



Thanks,
Ming

2022-09-19 13:16:37

by Ming Lei

Subject: Re: [PATCH V3 6/7] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support

On Tue, Sep 13, 2022 at 12:17:06PM +0800, ZiyangZhang wrote:
> START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
> to support user recovery feature.
>
> After a crash, user should send START_USER_RECOVERY, it will:
> (1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was
> set by quiesce_work and (b)the file struct is released.
> (2) reinit all ubqs, including:
> (a) put the task_struct and reset ->ubq_daemon to NULL.
> (b) reset all ublk_io.
> (3) reset ub->mm to NULL.
>
> Then, user should start a new process and send FETCH_REQ on each
> ubq_daemon.
>
> Finally, user should send END_USER_RECOVERY, it will:
> (1) wait for all new ubq_daemons getting ready.
> (2) update ublksrv_pid
> (3) unquiesce the request queue and expect incoming ublk_queue_rq()
> (4) convert ub's state to UBLK_S_DEV_LIVE
>
> Note: we can handle STOP_DEV between START_USER_RECOVERY and
> END_USER_RECOVERY. This is helpful to users who cannot start new process
> after sending START_USER_RECOVERY ctrl-cmd.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 116 +++++++++++++++++++++++++++++++++++++++
> 1 file changed, 116 insertions(+)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 4409a130d0b6..3a3af80ee938 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -1961,6 +1961,116 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)
> return ret;
> }
>
> +static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
> +{
> + int i;
> +
> + WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
> + /* All old ioucmds have to be completed */
> + WARN_ON_ONCE(ubq->nr_io_ready);
> + pr_devel("%s: prepare for recovering qid %d\n", __func__, ubq->q_id);
> + /* old daemon is PF_EXITING, put it now */
> + put_task_struct(ubq->ubq_daemon);
> + /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
> + ubq->ubq_daemon = NULL;
> +
> + for (i = 0; i < ubq->q_depth; i++) {
> + struct ublk_io *io = &ubq->ios[i];
> +
> + /* forget everything now and be ready for new FETCH_REQ */
> + io->flags = 0;
> + io->cmd = NULL;
> + io->addr = 0;
> + }
> +}
> +
> +static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
> +{
> + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
> + struct ublk_device *ub;
> + int ret = -EINVAL;
> + int i;
> +
> + ub = ublk_get_device_from_id(header->dev_id);
> + if (!ub)
> + return ret;
> +
> + mutex_lock(&ub->mutex);
> + if (!ublk_can_use_recovery(ub))
> + goto out_unlock;
> + /*
> + * START_RECOVERY is only allowd after:
> + *
> + * (1) UB_STATE_OPEN is not set, which means the dying process is exited
> + * and related io_uring ctx is freed so file struct of /dev/ublkcX is
> + * released.
> + *
> + * (2) UBLK_S_DEV_QUIESCED is set, which means the quiesce_work:
> + * (a)has quiesced request queue
> + * (b)has requeued every inflight rqs whose io_flags is ACTIVE
> + * (c)has requeued/aborted every inflight rqs whose io_flags is NOT ACTIVE
> + * (d)has completed/camceled all ioucmds owned by ther dying process
> + */
> + if (test_bit(UB_STATE_OPEN, &ub->state) ||
> + ub->dev_info.state != UBLK_S_DEV_QUIESCED) {
> + ret = -EBUSY;
> + goto out_unlock;
> + }
> + pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
> + ublk_queue_reinit(ub, ublk_get_queue(ub, i));
> + /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
> + ub->mm = NULL;
> + ub->nr_queues_ready = 0;

I am wondering why you don't move the above (queue reinit, clearing ub->mm)
into ublk_ch_release(); it looks cleaner to clear this stuff there.
Meanwhile, I guess one control command might be enough for user
recovery.


Thanks,
Ming
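The START/END_USER_RECOVERY flow discussed in the commit message above can be modeled as a small state machine. This is an illustrative sketch under assumed names (dev_model, daemons_ready, the DEV_* states mirror UBLK_S_DEV_*), not the driver's actual implementation; in particular the real END_USER_RECOVERY waits for the new daemons, while the model simply refuses until they are ready:

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical model of the recovery control commands. */
enum dev_state { DEV_LIVE, DEV_QUIESCED };

struct dev_model {
	enum dev_state state;
	bool chardev_open;	/* models UB_STATE_OPEN */
	bool daemons_ready;	/* all new ubq_daemons sent FETCH_REQ */
};

/* START_USER_RECOVERY: only allowed once the dying process is fully
 * gone (file struct released) and quiesce_work has finished. */
static int start_user_recovery(struct dev_model *ub)
{
	if (ub->chardev_open || ub->state != DEV_QUIESCED)
		return -EBUSY;
	/* here the real driver reinits all ubqs and clears ub->mm */
	ub->daemons_ready = false;
	return 0;
}

/* END_USER_RECOVERY: once the new daemons are ready, unquiesce and
 * flip the device back to LIVE (the real driver waits instead). */
static int end_user_recovery(struct dev_model *ub)
{
	if (ub->state != DEV_QUIESCED || !ub->daemons_ready)
		return -EBUSY;
	ub->state = DEV_LIVE;
	return 0;
}
```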

2022-09-19 13:47:09

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
> On 2022/9/19 11:55, Ming Lei wrote:
> > On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
> >> With recovery feature enabled, in ublk_queue_rq or task work
> >> (in exit_task_work or fallback wq), we requeue rqs instead of
> >> ending(aborting) them. Besides, No matter recovery feature is enabled
> >> or disabled, we schedule monitor_work immediately.
> >>
> >> Signed-off-by: ZiyangZhang <[email protected]>
> >> ---
> >> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
> >> 1 file changed, 32 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >> index 23337bd7c105..b067f33a1913 100644
> >> --- a/drivers/block/ublk_drv.c
> >> +++ b/drivers/block/ublk_drv.c
> >> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
> >>
> >> #define UBLK_REQUEUE_DELAY_MS 3
> >>
> >> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
> >> + struct request *rq)
> >> +{
> >> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> >> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> >> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> >> + /* We cannot process this rq so just requeue it. */
> >> + if (ublk_queue_can_use_recovery(ubq)) {
> >> + blk_mq_requeue_request(rq, false);
> >> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
> >
> > Here you needn't to kick requeue list since we know it can't make
> > progress. And you can do that once before deleting gendisk
> > or the queue is recovered.
>
> No, kicking rq here is necessary.
>
> Consider USER_RECOVERY is enabled and everything goes well.
> User sends STOP_DEV, and we have kicked requeue list in
> ublk_stop_dev() and are going to call del_gendisk().
> However, a crash happens now. Then rqs may be still requeued
> by ublk_queue_rq() because ublk_queue_rq() sees a dying
> ubq_daemon. So del_gendisk() will hang because there are
> rqs leaving in requeue list and no one kicks them.

Why can't you kick the requeue list before calling del_gendisk()?

>
> BTW, kicking requeue list after requeue rqs is really harmless
> since we schedule quiesce_work immediately after finding a
> dying ubq_daemon. So few rqs have chance to be re-dispatched.

Do you think it makes sense to kick the requeue list when the queue
can't handle any request?


Thanks,
Ming

2022-09-20 02:04:14

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On 2022/9/19 20:39, Ming Lei wrote:
> On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
>> On 2022/9/19 11:55, Ming Lei wrote:
>>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
>>>> With recovery feature enabled, in ublk_queue_rq or task work
>>>> (in exit_task_work or fallback wq), we requeue rqs instead of
>>>> ending(aborting) them. Besides, No matter recovery feature is enabled
>>>> or disabled, we schedule monitor_work immediately.
>>>>
>>>> Signed-off-by: ZiyangZhang <[email protected]>
>>>> ---
>>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>> index 23337bd7c105..b067f33a1913 100644
>>>> --- a/drivers/block/ublk_drv.c
>>>> +++ b/drivers/block/ublk_drv.c
>>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>>>>
>>>> #define UBLK_REQUEUE_DELAY_MS 3
>>>>
>>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
>>>> + struct request *rq)
>>>> +{
>>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
>>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
>>>> + /* We cannot process this rq so just requeue it. */
>>>> + if (ublk_queue_can_use_recovery(ubq)) {
>>>> + blk_mq_requeue_request(rq, false);
>>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
>>>
>>> Here you needn't to kick requeue list since we know it can't make
>>> progress. And you can do that once before deleting gendisk
>>> or the queue is recovered.
>>
>> No, kicking rq here is necessary.
>>
>> Consider USER_RECOVERY is enabled and everything goes well.
>> User sends STOP_DEV, and we have kicked requeue list in
>> ublk_stop_dev() and are going to call del_gendisk().
>> However, a crash happens now. Then rqs may be still requeued
>> by ublk_queue_rq() because ublk_queue_rq() sees a dying
>> ubq_daemon. So del_gendisk() will hang because there are
>> rqs leaving in requeue list and no one kicks them.
>
> Why can't you kick requeue list before calling del_gendisk().

Yes, we can kick the requeue list once before calling del_gendisk().
But a crash may happen just after kicking and before del_gendisk(),
so some rqs may be requeued at that moment, after we have already
kicked the requeue list. Then del_gendisk() will hang, right?
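The race in question can be modeled with a toy requeue-list counter. This is purely illustrative (the names are made up, and the real requeue list is drained by the block layer, not zeroed): a request that lands on the list after the final kick has nobody left to kick it, so a later del_gendisk() would wait forever.

```c
#include <stdbool.h>

/* Toy model of the kick-vs-requeue race; not real block-layer code. */
struct requeue_model {
	int pending;	/* requests parked on the requeue list */
};

static void requeue_rq(struct requeue_model *m)
{
	m->pending++;
}

static void kick_requeue_list(struct requeue_model *m)
{
	/* a kick re-dispatches everything currently on the list */
	m->pending = 0;
}

/* del_gendisk() can only drain the queue when nothing is parked on the
 * requeue list still waiting for a kick */
static bool del_gendisk_would_hang(const struct requeue_model *m)
{
	return m->pending > 0;
}
```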

>
>>
>> BTW, kicking requeue list after requeue rqs is really harmless
>> since we schedule quiesce_work immediately after finding a
>> dying ubq_daemon. So few rqs have chance to be re-dispatched.
>
> Do you think it makes sense to kick requeue list when the queue
> can't handle any request?

I know it does not make sense while the ubq_daemon is dying, but it is good
for handling the situation I described before.

Regards,
Zhang.

2022-09-20 02:40:52

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/19 20:33, Ming Lei wrote:
>>>> +
>>>> +static void ublk_quiesce_queue(struct ublk_device *ub,
>>>> + struct ublk_queue *ubq)
>>>> +{
>>>> + int i;
>>>> +
>>>> + for (i = 0; i < ubq->q_depth; i++) {
>>>> + struct ublk_io *io = &ubq->ios[i];
>>>> +
>>>> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
>>>> + struct request *rq = blk_mq_tag_to_rq(
>>>> + ub->tag_set.tags[ubq->q_id], i);
>>>> +
>>>> + WARN_ON_ONCE(!rq);
>>>> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
>>>> + ublk_queue_can_use_recovery_reissue(ubq) ?
>>>> + "requeue" : "abort",
>>>> + ubq->q_id, i, io->flags);
>>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>>>> + blk_mq_requeue_request(rq, false);
>>>
>>> This way is too violent.
>>>
>>> There may be just one queue dying, but you requeue all requests
>>> from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
>>> such as, just requeuing requests in dying queue.
>>
>> If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
>> must exit and rqs of all queues have to be requeued/aborted. We cannot let live
>> ubq_daemons run any more because they do not belong to the new process.
>
> IMO, the old process really can exist, and recently even I got such
> requirement for switching queue from one thread to another.

For now, only one process can open /dev/ublkcX, so a new process is necessary.

If you think "per ubq_daemon" recovery is reasonable, I can do that in the future
if multiple processes are supported. But I really suggest that we keep the current
design as the first step, which assumes all ubq_daemons have exited and a new process
is started, and that really meets our requirement.

BTW, START_USER_RECOVERY has to be reconsidered because we may need to pass a ubq_id
with it.

>
> What we should do is to get all inflight requests done, and cancel all io
> commands, no matter if the ubq pthread is dead or live.
>
>>
>> BTW, I really wonder why there could be just one queue dying? All queues must be dying
>> shortly after any ubq_daemon is dying since they are all pthreads in the same process.
>
> You can't assume it is always so. Maybe one pthread is dead first, and
> others are dying later, maybe just one is dead.

Yes, I know only one pthread may be dead while others keep running, but ublk_drv
currently supports only one process opening the same /dev/ublkcX, so the other
pthreads must die later (whether they are killed by a signal or exit on their own).

>
> If one queue's pthread is live, you may get trouble by simply requeuing
> the request, that is why I suggest to re-use the logic of
> ublk_daemon_monitor_work/ublk_abort_queue().

Actually, if any ubq_daemon is live, no rqs are requeued; please see the check in
ublk_quiesce_dev(). It always makes sure that ALL ubq_daemons are dying before it
starts the quiesce work.

>
> For stopping device, request queue is frozen in del_gendisk() and all
> in-flight requests are drained, and monitor work provides such
> guarantee.
>
> For user recovery, monitor work should help you too by aborting one
> queue if it is dying until all requests are drained.

Monitor work can schedule quiesce_work if it finds a dying ubq_daemon.
Then quiesce_work calls ublk_quiesce_dev(). I do this because ublk_quiesce_dev()
has to wait for all inflight rqs with ACTIVE set to be requeued.
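The guarantee quiesce_work needs — no inflight rq still carries the ACTIVE flag — boils down to a simple predicate. A toy sketch under assumed names (the real driver walks the tag set with blk_mq_tagset_busy_iter() and msleep()s between passes; here the io flags are just an array):

```c
#include <stdbool.h>

/* Illustrative only: does any inflight io still have ACTIVE set?
 * quiesce_work must keep polling until this returns false, meaning
 * every such rq has been requeued. */
static bool any_active_io(const unsigned int *io_flags, int depth,
			  unsigned int active_bit)
{
	for (int i = 0; i < depth; i++)
		if (io_flags[i] & active_bit)
			return true;
	return false;
}
```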

>
>>
>>>
>>> That said you still can re-use the logic in ublk_abort_queue()/ublk_daemon_monitor_work()
>>> for making progress, just changing aborting request with requeue in
>>> ublk_abort_queue().
>>
>> I get your point, but it may be hard to reuse the logic in ublk_daemon_monitor_work()
>> because:
>> (1) we have to quiesce request queue in ublk_quiesce_dev(). This has to be done with
>> ub_mutex.
>> (2) ublk_quiesce_dev() cannot be run after rqs are requeued/aborted.
>
> I don't get your point, the request queue needs to be quiesced once, then
> either inflight requests are requeued if the queue is dying, or completed by
> the queue's pthread if it is live. As you mentioned, in reality, most times,
> all pthreads will be killed, but timing can be different, and I think
> you can not requeue one request if the ubq pthread isn't dying.

I do not requeue rqs with a live ubq_daemon. ublk_quiesce_dev() always starts
after all ubq_daemons are dying.

Regards,
Zhang.

2022-09-20 03:04:38

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On Tue, Sep 20, 2022 at 09:31:54AM +0800, Ziyang Zhang wrote:
> On 2022/9/19 20:39, Ming Lei wrote:
> > On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
> >> On 2022/9/19 11:55, Ming Lei wrote:
> >>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
> >>>> With recovery feature enabled, in ublk_queue_rq or task work
> >>>> (in exit_task_work or fallback wq), we requeue rqs instead of
> >>>> ending(aborting) them. Besides, No matter recovery feature is enabled
> >>>> or disabled, we schedule monitor_work immediately.
> >>>>
> >>>> Signed-off-by: ZiyangZhang <[email protected]>
> >>>> ---
> >>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
> >>>> 1 file changed, 32 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>>> index 23337bd7c105..b067f33a1913 100644
> >>>> --- a/drivers/block/ublk_drv.c
> >>>> +++ b/drivers/block/ublk_drv.c
> >>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
> >>>>
> >>>> #define UBLK_REQUEUE_DELAY_MS 3
> >>>>
> >>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
> >>>> + struct request *rq)
> >>>> +{
> >>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> >>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> >>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> >>>> + /* We cannot process this rq so just requeue it. */
> >>>> + if (ublk_queue_can_use_recovery(ubq)) {
> >>>> + blk_mq_requeue_request(rq, false);
> >>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
> >>>
> >>> Here you needn't to kick requeue list since we know it can't make
> >>> progress. And you can do that once before deleting gendisk
> >>> or the queue is recovered.
> >>
> >> No, kicking rq here is necessary.
> >>
> >> Consider USER_RECOVERY is enabled and everything goes well.
> >> User sends STOP_DEV, and we have kicked requeue list in
> >> ublk_stop_dev() and are going to call del_gendisk().
> >> However, a crash happens now. Then rqs may be still requeued
> >> by ublk_queue_rq() because ublk_queue_rq() sees a dying
> >> ubq_daemon. So del_gendisk() will hang because there are
> >> rqs leaving in requeue list and no one kicks them.
> >
> > Why can't you kick requeue list before calling del_gendisk().
>
> Yes, we can kick requeue list once before calling del_gendisk().
> But a crash may happen just after kicking but before del_gendisk().
> So some rqs may be requeued at this moment. But we have already
> kicked the requeue list! Then del_gendisk() will hang, right?

->force_abort is set before kicking in ublk_unquiesce_dev(), so
all new requests are failed immediately instead of being requeued,
right?
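The effect of ->force_abort described above can be sketched as a dispatch decision. This is a toy model with hypothetical names, not ublk_queue_rq() itself: once force_abort is set (before the final kick), a new request is failed with an I/O error rather than requeued, so it can never land on the requeue list after the last kick.

```c
#include <stdbool.h>

/* Toy model of the dispatch decision; names are made up. */
enum dispatch { DISPATCH_TO_DAEMON, FAIL_IOERR, REQUEUE };

static enum dispatch queue_rq_model(bool force_abort, bool daemon_dying,
				    bool recovery_enabled)
{
	/* force_abort wins: fail immediately, never requeue */
	if (force_abort)
		return FAIL_IOERR;
	/* dying daemon: requeue if recovery is on, else error out */
	if (daemon_dying)
		return recovery_enabled ? REQUEUE : FAIL_IOERR;
	return DISPATCH_TO_DAEMON;
}
```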


Thanks,
Ming

2022-09-20 03:16:22

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 6/7] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support

On 2022/9/19 21:03, Ming Lei wrote:
> On Tue, Sep 13, 2022 at 12:17:06PM +0800, ZiyangZhang wrote:
>> START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
>> to support user recovery feature.
>>
>> After a crash, user should send START_USER_RECOVERY, it will:
>> (1) check if (a)current ublk_device is UBLK_S_DEV_QUIESCED which was
>> set by quiesce_work and (b)the file struct is released.
>> (2) reinit all ubqs, including:
>> (a) put the task_struct and reset ->ubq_daemon to NULL.
>> (b) reset all ublk_io.
>> (3) reset ub->mm to NULL.
>>
>> Then, user should start a new process and send FETCH_REQ on each
>> ubq_daemon.
>>
>> Finally, user should send END_USER_RECOVERY, it will:
>> (1) wait for all new ubq_daemons getting ready.
>> (2) update ublksrv_pid
>> (3) unquiesce the request queue and expect incoming ublk_queue_rq()
>> (4) convert ub's state to UBLK_S_DEV_LIVE
>>
>> Note: we can handle STOP_DEV between START_USER_RECOVERY and
>> END_USER_RECOVERY. This is helpful to users who cannot start new process
>> after sending START_USER_RECOVERY ctrl-cmd.
>>
>> Signed-off-by: ZiyangZhang <[email protected]>
>> ---
>> drivers/block/ublk_drv.c | 116 +++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 116 insertions(+)
>>
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index 4409a130d0b6..3a3af80ee938 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -1961,6 +1961,116 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)
>> return ret;
>> }
>>
>> +static void ublk_queue_reinit(struct ublk_device *ub, struct ublk_queue *ubq)
>> +{
>> + int i;
>> +
>> + WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
>> + /* All old ioucmds have to be completed */
>> + WARN_ON_ONCE(ubq->nr_io_ready);
>> + pr_devel("%s: prepare for recovering qid %d\n", __func__, ubq->q_id);
>> + /* old daemon is PF_EXITING, put it now */
>> + put_task_struct(ubq->ubq_daemon);
>> + /* We have to reset it to NULL, otherwise ub won't accept new FETCH_REQ */
>> + ubq->ubq_daemon = NULL;
>> +
>> + for (i = 0; i < ubq->q_depth; i++) {
>> + struct ublk_io *io = &ubq->ios[i];
>> +
>> + /* forget everything now and be ready for new FETCH_REQ */
>> + io->flags = 0;
>> + io->cmd = NULL;
>> + io->addr = 0;
>> + }
>> +}
>> +
>> +static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
>> +{
>> + struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
>> + struct ublk_device *ub;
>> + int ret = -EINVAL;
>> + int i;
>> +
>> + ub = ublk_get_device_from_id(header->dev_id);
>> + if (!ub)
>> + return ret;
>> +
>> + mutex_lock(&ub->mutex);
>> + if (!ublk_can_use_recovery(ub))
>> + goto out_unlock;
>> + /*
>> + * START_RECOVERY is only allowd after:
>> + *
>> + * (1) UB_STATE_OPEN is not set, which means the dying process is exited
>> + * and related io_uring ctx is freed so file struct of /dev/ublkcX is
>> + * released.
>> + *
>> + * (2) UBLK_S_DEV_QUIESCED is set, which means the quiesce_work:
>> + * (a)has quiesced request queue
>> + * (b)has requeued every inflight rqs whose io_flags is ACTIVE
>> + * (c)has requeued/aborted every inflight rqs whose io_flags is NOT ACTIVE
>> + * (d)has completed/camceled all ioucmds owned by ther dying process
>> + */
>> + if (test_bit(UB_STATE_OPEN, &ub->state) ||
>> + ub->dev_info.state != UBLK_S_DEV_QUIESCED) {
>> + ret = -EBUSY;
>> + goto out_unlock;
>> + }
>> + pr_devel("%s: start recovery for dev id %d.\n", __func__, header->dev_id);
>> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
>> + ublk_queue_reinit(ub, ublk_get_queue(ub, i));
>> + /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
>> + ub->mm = NULL;
>> + ub->nr_queues_ready = 0;
>
> I am wondering why you don't move the above(queue reinit, clearing ub->mm)
> into ublk_ch_release(), and looks it is more clean to clear this stuff
> there. Meantime I guess one control command might be enough for user
> recovery.


OK, START_USER_RECOVERY just does cleanup stuff for a new process, and
ublk_ch_release() does a similar thing since it is always called when the chardev
is released. And the new process must open the chardev only after it is released.

So (queue reinit, clearing ub->mm) can be done in ublk_ch_release().
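The ordering argument here can be shown with a toy open/release lifecycle (hypothetical names; not the real chardev code): because release always runs before any new open, doing the cleanup in release guarantees the new process sees a clean device.

```c
#include <stdbool.h>
#include <errno.h>

/* Toy chardev lifecycle model; purely illustrative. */
struct chardev_model {
	bool open;
	bool needs_reinit;	/* stale ubq state from the old process */
};

static int ch_open_model(struct chardev_model *cd)
{
	/* single opener, and never expose stale state to a new process */
	if (cd->open || cd->needs_reinit)
		return -EBUSY;
	cd->open = true;
	return 0;
}

static void ch_release_model(struct chardev_model *cd)
{
	cd->open = false;
	/* queue reinit + clearing ub->mm would happen here, before any
	 * new process can possibly open the chardev */
	cd->needs_reinit = false;
}
```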

Regards,
Zhang.

2022-09-20 03:19:01

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
> On 2022/9/19 20:33, Ming Lei wrote:
> >>>> +
> >>>> +static void ublk_quiesce_queue(struct ublk_device *ub,
> >>>> + struct ublk_queue *ubq)
> >>>> +{
> >>>> + int i;
> >>>> +
> >>>> + for (i = 0; i < ubq->q_depth; i++) {
> >>>> + struct ublk_io *io = &ubq->ios[i];
> >>>> +
> >>>> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
> >>>> + struct request *rq = blk_mq_tag_to_rq(
> >>>> + ub->tag_set.tags[ubq->q_id], i);
> >>>> +
> >>>> + WARN_ON_ONCE(!rq);
> >>>> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
> >>>> + ublk_queue_can_use_recovery_reissue(ubq) ?
> >>>> + "requeue" : "abort",
> >>>> + ubq->q_id, i, io->flags);
> >>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
> >>>> + blk_mq_requeue_request(rq, false);
> >>>
> >>> This way is too violent.
> >>>
> >>> There may be just one queue dying, but you requeue all requests
> >>> from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
> >>> such as, just requeuing requests in dying queue.
> >>
> >> If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
> >> must exit and rqs of all queues have to be requeued/aborted. We cannot let live
> >> ubq_daemons run any more because they do not belong to the new process.
> >
> > IMO, the old process really can exist, and recently even I got such
> > requirement for switching queue from one thread to another.
>
> For now, only one process can open /dev/ublkcX, so a new process is necessary now.
>
> If you think "per ubq_daemon" recovery is reasonable, I can do that in the future
> if multiple processes is supported. But I really suggest that we can keep current
> design as the first step which assumes all ubq_daemons are exited and a new process
> is started, and that really meets our requirement.
>
> BTW, START_USER_RECOVERY has to be reconsidered because we may need to pass a ubq_id
> with it.
>
> >
> > What we should do is to get all inflight requests done, and cancel all io
> > commands, no matter if the ubq pthread is dead or live.
> >
> >>
> >> BTW, I really wonder why there could be just one queue dying? All queues must be dying
> >> shortly after any ubq_daemon is dying since they are all pthreads in the same process.
> >
> > You can't assume it is always so. Maybe one pthread is dead first, and
> > others are dying later, maybe just one is dead.
>
> Yes, I know there may be only one pthread is dead while others keep running, but now
> ublk_drv only support one process opening the same /dev/ublkcX, so other pthreads
> must dead(no matter they are aborted by signal or themselves) later.
>
> >
> > If one queue's pthread is live, you may get trouble by simply requeuing
> > the request, that is why I suggest to re-use the logic of
> > ublk_daemon_monitor_work/ublk_abort_queue().
>
> Actually, if any ubq_daemon is live, no rqs are requeued, please see the check in
> ublk_quiesce_dev(). It always makes sure that ALL ubq_daemons are dying, then it
> starts quiesce jobs.

OK, it looks like I missed this point, but you should quiesce the queue at the
beginning of ublk_quiesce_dev(), so the transition period is kept as short as
possible. Otherwise, if one queue pthread isn't dying, the device can be stuck
in this partly-working state forever.

>
> >
> > For stopping device, request queue is frozen in del_gendisk() and all
> > in-flight requests are drained, and monitor work provides such
> > guarantee.
> >
> > For user recovery, monitor work should help you too by aborting one
> > queue if it is dying until all requests are drained.
>
> Monitor work can schedule quiesce_work if it finds a dying ubq_daemon.
> Then quiesce_work calls ublk_quiesce_dev(). I do this because ublk_quiesce_dev()
> has to wait all inflight rqs with ACTIVE set being requeued.
>
> >
> >>
> >>>
> >>> That said you still can re-use the logic in ublk_abort_queue()/ublk_daemon_monitor_work()
> >>> for making progress, just changing aborting request with requeue in
> >>> ublk_abort_queue().
> >>
> >> I get your point, but it may be hard to reuse the logic in ublk_daemon_monitor_work()
> >> because:
> >> (1) we have to quiesce request queue in ublk_quiesce_dev(). This has to be done with
> >> ub_mutex.
> >> (2) ublk_quiesce_dev() cannot be run after rqs are requeued/aborted.

The following delta patch against patch 5 shows the idea:


diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 4409a130d0b6..60c5786c4711 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
* Also aborting may not be started yet, keep in mind that one failed
* request may be issued by block layer again.
*/
-static void __ublk_fail_req(struct ublk_io *io, struct request *req)
+static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
+ struct request *req)
{
WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);

@@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
req->tag,
io->flags);
io->flags |= UBLK_IO_FLAG_ABORTED;
- blk_mq_end_request(req, BLK_STS_IOERR);
+ if (ublk_queue_can_use_recovery_reissue(ubq))
+ blk_mq_requeue_request(req, false);
+ else
+ blk_mq_end_request(req, BLK_STS_IOERR);
}
}

@@ -1018,7 +1022,7 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
*/
rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
if (rq)
- __ublk_fail_req(io, rq);
+ __ublk_fail_req(ubq, io, rq);
}
}
ublk_put_device(ub);
@@ -1026,12 +1030,10 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)

static bool ublk_check_inflight_rq(struct request *rq, void *data)
{
- struct ublk_queue *ubq = rq->mq_hctx->driver_data;
- struct ublk_io *io = &ubq->ios[rq->tag];
- bool *busy = data;
+ bool *idle = data;

- if (io->flags & UBLK_IO_FLAG_ACTIVE) {
- *busy = true;
+ if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
+ *idle = false;
return false;
}
return true;
@@ -1039,16 +1041,15 @@ static bool ublk_check_inflight_rq(struct request *rq, void *data)

static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
{
- bool busy = false;
+ bool idle = true;

WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
while (true) {
blk_mq_tagset_busy_iter(&ub->tag_set,
- ublk_check_inflight_rq, &busy);
- if (busy)
- msleep(UBLK_REQUEUE_DELAY_MS);
- else
+ ublk_check_inflight_rq, &idle);
+ if (idle)
break;
+ msleep(UBLK_REQUEUE_DELAY_MS);
}
}

@@ -1069,10 +1070,7 @@ static void ublk_quiesce_queue(struct ublk_device *ub,
ublk_queue_can_use_recovery_reissue(ubq) ?
"requeue" : "abort",
ubq->q_id, i, io->flags);
- if (ublk_queue_can_use_recovery_reissue(ubq))
- blk_mq_requeue_request(rq, false);
- else
- __ublk_fail_req(io, rq);
+ __ublk_fail_req(ubq, io, rq);
} else {
pr_devel("%s: done old cmd: qid %d tag %d\n",
__func__, ubq->q_id, i);
@@ -1092,12 +1090,6 @@ static void ublk_quiesce_dev(struct ublk_device *ub)
if (ub->dev_info.state != UBLK_S_DEV_LIVE)
goto unlock;

- for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
- struct ublk_queue *ubq = ublk_get_queue(ub, i);
-
- if (!ubq_daemon_is_dying(ubq))
- goto unlock;
- }
blk_mq_quiesce_queue(ub->ub_disk->queue);
ublk_wait_tagset_rqs_idle(ub);
pr_devel("%s: quiesce ub: dev_id %d\n",
@@ -1129,14 +1121,13 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
struct ublk_queue *ubq = ublk_get_queue(ub, i);

if (ubq_daemon_is_dying(ubq)) {
- if (ublk_queue_can_use_recovery(ubq)) {
+ if (ublk_queue_can_use_recovery(ubq))
schedule_work(&ub->quiesce_work);
- } else {
+ else
schedule_work(&ub->stop_work);

- /* abort queue is for making forward progress */
- ublk_abort_queue(ub, ubq);
- }
+ /* abort queue is for making forward progress */
+ ublk_abort_queue(ub, ubq);
}
}





Thanks,
Ming

2022-09-20 03:21:04

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On 2022/9/20 10:39, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 09:31:54AM +0800, Ziyang Zhang wrote:
>> On 2022/9/19 20:39, Ming Lei wrote:
>>> On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
>>>> On 2022/9/19 11:55, Ming Lei wrote:
>>>>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
>>>>>> With recovery feature enabled, in ublk_queue_rq or task work
>>>>>> (in exit_task_work or fallback wq), we requeue rqs instead of
>>>>>> ending(aborting) them. Besides, No matter recovery feature is enabled
>>>>>> or disabled, we schedule monitor_work immediately.
>>>>>>
>>>>>> Signed-off-by: ZiyangZhang <[email protected]>
>>>>>> ---
>>>>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
>>>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>>> index 23337bd7c105..b067f33a1913 100644
>>>>>> --- a/drivers/block/ublk_drv.c
>>>>>> +++ b/drivers/block/ublk_drv.c
>>>>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>>>>>>
>>>>>> #define UBLK_REQUEUE_DELAY_MS 3
>>>>>>
>>>>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
>>>>>> + struct request *rq)
>>>>>> +{
>>>>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>>>>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
>>>>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
>>>>>> + /* We cannot process this rq so just requeue it. */
>>>>>> + if (ublk_queue_can_use_recovery(ubq)) {
>>>>>> + blk_mq_requeue_request(rq, false);
>>>>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
>>>>>
>>>>> Here you needn't to kick requeue list since we know it can't make
>>>>> progress. And you can do that once before deleting gendisk
>>>>> or the queue is recovered.
>>>>
>>>> No, kicking rq here is necessary.
>>>>
>>>> Consider USER_RECOVERY is enabled and everything goes well.
>>>> User sends STOP_DEV, and we have kicked requeue list in
>>>> ublk_stop_dev() and are going to call del_gendisk().
>>>> However, a crash happens now. Then rqs may be still requeued
>>>> by ublk_queue_rq() because ublk_queue_rq() sees a dying
>>>> ubq_daemon. So del_gendisk() will hang because there are
>>>> rqs leaving in requeue list and no one kicks them.
>>>
>>> Why can't you kick requeue list before calling del_gendisk().
>>
>> Yes, we can kick requeue list once before calling del_gendisk().
>> But a crash may happen just after kicking but before del_gendisk().
>> So some rqs may be requeued at this moment. But we have already
>> kicked the requeue list! Then del_gendisk() will hang, right?
>
> ->force_abort is set before kicking in ublk_unquiesce_dev(), so
> all new requests are failed immediately instead of being requeued,
> right?
>

->force_abort is not helpful here because the fallback wq may still be running
and can requeue rqs after the requeue list is kicked.

In ublk_unquiesce_dev(), I simply disable the recovery feature
if ub's state is UBLK_S_DEV_LIVE while stopping the ublk_dev.

Note: we can make sure the fallback wq does not run if we wait until all rqs with
the ACTIVE flag set are requeued. This is done in quiesce_work(), but it cannot
run while ublk_stop_dev() is running...
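The workaround described here amounts to one rule, sketched below as a toy predicate (illustrative, hypothetical names): only a fully quiesced device guarantees the fallback wq is no longer requeuing, so recovery is kept across stop only in that state.

```c
#include <stdbool.h>

/* Toy model: should recovery stay enabled when ublk_stop_dev() runs? */
enum stop_state { STOP_FROM_LIVE, STOP_FROM_QUIESCED };

static bool keep_recovery_on_stop(enum stop_state s)
{
	/* from LIVE, the fallback wq may still requeue rqs after the
	 * final kick, so recovery must be disabled; from QUIESCED,
	 * quiesce_work already waited out every ACTIVE rq */
	return s == STOP_FROM_QUIESCED;
}
```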


Regards,
Zhang.

2022-09-20 03:38:12

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On Tue, Sep 20, 2022 at 11:04:30AM +0800, Ziyang Zhang wrote:
> On 2022/9/20 10:39, Ming Lei wrote:
> > On Tue, Sep 20, 2022 at 09:31:54AM +0800, Ziyang Zhang wrote:
> >> On 2022/9/19 20:39, Ming Lei wrote:
> >>> On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
> >>>> On 2022/9/19 11:55, Ming Lei wrote:
> >>>>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
> >>>>>> With recovery feature enabled, in ublk_queue_rq or task work
> >>>>>> (in exit_task_work or fallback wq), we requeue rqs instead of
> >>>>>> ending(aborting) them. Besides, No matter recovery feature is enabled
> >>>>>> or disabled, we schedule monitor_work immediately.
> >>>>>>
> >>>>>> Signed-off-by: ZiyangZhang <[email protected]>
> >>>>>> ---
> >>>>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
> >>>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>>>>> index 23337bd7c105..b067f33a1913 100644
> >>>>>> --- a/drivers/block/ublk_drv.c
> >>>>>> +++ b/drivers/block/ublk_drv.c
> >>>>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
> >>>>>>
> >>>>>> #define UBLK_REQUEUE_DELAY_MS 3
> >>>>>>
> >>>>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
> >>>>>> + struct request *rq)
> >>>>>> +{
> >>>>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> >>>>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> >>>>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> >>>>>> + /* We cannot process this rq so just requeue it. */
> >>>>>> + if (ublk_queue_can_use_recovery(ubq)) {
> >>>>>> + blk_mq_requeue_request(rq, false);
> >>>>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
> >>>>>
> >>>>> Here you needn't to kick requeue list since we know it can't make
> >>>>> progress. And you can do that once before deleting gendisk
> >>>>> or the queue is recovered.
> >>>>
> >>>> No, kicking rq here is necessary.
> >>>>
> >>>> Consider USER_RECOVERY is enabled and everything goes well.
> >>>> User sends STOP_DEV, and we have kicked requeue list in
> >>>> ublk_stop_dev() and are going to call del_gendisk().
> >>>> However, a crash happens now. Then rqs may be still requeued
> >>>> by ublk_queue_rq() because ublk_queue_rq() sees a dying
> >>>> ubq_daemon. So del_gendisk() will hang because there are
> >>>> rqs leaving in requeue list and no one kicks them.
> >>>
> >>> Why can't you kick requeue list before calling del_gendisk().
> >>
> >> Yes, we can kick requeue list once before calling del_gendisk().
> >> But a crash may happen just after kicking but before del_gendisk().
> >> So some rqs may be requeued at this moment. But we have already
> >> kicked the requeue list! Then del_gendisk() will hang, right?
> >
> > ->force_abort is set before kicking in ublk_unquiesce_dev(), so
> > all new requests are failed immediately instead of being requeued,
> > right?
> >
>
> ->force_abort is not heplful here because there may be fallback wq running
> which can requeue rqs after kicking requeue list.

After ublk_wait_tagset_rqs_idle() returns, there can't be any
pending requests in fallback wq or task work, can there?

Please look at the 2nd point of the comment log, and I did ask you
to add the words for fallback wq and task work.

>
> In ublk_unquiesce_dev(), I simply disable recovery feature
> if ub's state is UBLK_S_DEV_LIVE while stopping ublk_dev.

monitor work will provide forward progress when user recovery is not
applied.

>
> Note: We can make sure fallback wq does not run if we wait until all rqs with
> ACTIVE flag set are requeued. This is done in quiesce_work(). But it cannot
> run while ublk_stop_dev() is running...

Please take a look at the delta patch I just sent, which is supposed to be
simpler for addressing this corner case.


Thanks,
Ming

2022-09-20 03:53:01

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/20 11:04, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
>
> Follows the delta patch against patch 5 for showing the idea:
>
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 4409a130d0b6..60c5786c4711 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
> * Also aborting may not be started yet, keep in mind that one failed
> * request may be issued by block layer again.
> */
> -static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
> + struct request *req)
> {
> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>
> @@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> req->tag,
> io->flags);
> io->flags |= UBLK_IO_FLAG_ABORTED;
> - blk_mq_end_request(req, BLK_STS_IOERR);
> + if (ublk_queue_can_use_recovery_reissue(ubq))
> + blk_mq_requeue_request(req, false);

Here is one problem:
We reset io->flags to 0 in ublk_queue_reinit() and it is called before the new
ubq_daemon with FETCH_REQ is accepted. ublk_abort_queue() is not protected by
ub_mutex and it is called many times in monitor_work. So the same rq may be
requeued multiple times.

With recovery disabled, there is no such problem since io->flags does not change
until ublk_dev is released.

In my patch 5 I only requeue the same rq once. So re-using ublk_abort_queue() is
hard for recovery feature.

> + else
> + blk_mq_end_request(req, BLK_STS_IOERR);
> }
> }
>
> @@ -1018,7 +1022,7 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
> */
> rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
> if (rq)
> - __ublk_fail_req(io, rq);
> + __ublk_fail_req(ubq, io, rq);
> }
> }
> ublk_put_device(ub);
> @@ -1026,12 +1030,10 @@ static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
>
> static bool ublk_check_inflight_rq(struct request *rq, void *data)
> {
> - struct ublk_queue *ubq = rq->mq_hctx->driver_data;
> - struct ublk_io *io = &ubq->ios[rq->tag];
> - bool *busy = data;
> + bool *idle = data;
>
> - if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> - *busy = true;
> + if (blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT) {
> + *idle = false;
> return false;
> }
> return true;
> @@ -1039,16 +1041,15 @@ static bool ublk_check_inflight_rq(struct request *rq, void *data)
>
> static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
> {
> - bool busy = false;
> + bool idle = true;
>
> WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
> while (true) {
> blk_mq_tagset_busy_iter(&ub->tag_set,
> - ublk_check_inflight_rq, &busy);
> - if (busy)
> - msleep(UBLK_REQUEUE_DELAY_MS);
> - else
> + ublk_check_inflight_rq, &idle);
> + if (idle)
> break;
> + msleep(UBLK_REQUEUE_DELAY_MS);
> }
> }
>
> @@ -1069,10 +1070,7 @@ static void ublk_quiesce_queue(struct ublk_device *ub,
> ublk_queue_can_use_recovery_reissue(ubq) ?
> "requeue" : "abort",
> ubq->q_id, i, io->flags);
> - if (ublk_queue_can_use_recovery_reissue(ubq))
> - blk_mq_requeue_request(rq, false);
> - else
> - __ublk_fail_req(io, rq);
> + __ublk_fail_req(ubq, io, rq);
> } else {
> pr_devel("%s: done old cmd: qid %d tag %d\n",
> __func__, ubq->q_id, i);
> @@ -1092,12 +1090,6 @@ static void ublk_quiesce_dev(struct ublk_device *ub)
> if (ub->dev_info.state != UBLK_S_DEV_LIVE)
> goto unlock;
>
> - for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> - struct ublk_queue *ubq = ublk_get_queue(ub, i);
> -
> - if (!ubq_daemon_is_dying(ubq))
> - goto unlock;
> - }
> blk_mq_quiesce_queue(ub->ub_disk->queue);
> ublk_wait_tagset_rqs_idle(ub);
> pr_devel("%s: quiesce ub: dev_id %d\n",
> @@ -1129,14 +1121,13 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
> struct ublk_queue *ubq = ublk_get_queue(ub, i);
>
> if (ubq_daemon_is_dying(ubq)) {
> - if (ublk_queue_can_use_recovery(ubq)) {
> + if (ublk_queue_can_use_recovery(ubq))
> schedule_work(&ub->quiesce_work);
> - } else {
> + else
> schedule_work(&ub->stop_work);
>
> - /* abort queue is for making forward progress */
> - ublk_abort_queue(ub, ubq);
> - }
> + /* abort queue is for making forward progress */
> + ublk_abort_queue(ub, ubq);
> }
> }
>
>
>
>
>
> Thanks,
> Ming

2022-09-20 04:03:32

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On 2022/9/20 11:18, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 11:04:30AM +0800, Ziyang Zhang wrote:
>> On 2022/9/20 10:39, Ming Lei wrote:
>>> On Tue, Sep 20, 2022 at 09:31:54AM +0800, Ziyang Zhang wrote:
>>>> On 2022/9/19 20:39, Ming Lei wrote:
>>>>> On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
>>>>>> On 2022/9/19 11:55, Ming Lei wrote:
>>>>>>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
>>>>>>>> With recovery feature enabled, in ublk_queue_rq or task work
>>>>>>>> (in exit_task_work or fallback wq), we requeue rqs instead of
>>>>>>>> ending(aborting) them. Besides, No matter recovery feature is enabled
>>>>>>>> or disabled, we schedule monitor_work immediately.
>>>>>>>>
>>>>>>>> Signed-off-by: ZiyangZhang <[email protected]>
>>>>>>>> ---
>>>>>>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
>>>>>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>>>>> index 23337bd7c105..b067f33a1913 100644
>>>>>>>> --- a/drivers/block/ublk_drv.c
>>>>>>>> +++ b/drivers/block/ublk_drv.c
>>>>>>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
>>>>>>>>
>>>>>>>> #define UBLK_REQUEUE_DELAY_MS 3
>>>>>>>>
>>>>>>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
>>>>>>>> + struct request *rq)
>>>>>>>> +{
>>>>>>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
>>>>>>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
>>>>>>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
>>>>>>>> + /* We cannot process this rq so just requeue it. */
>>>>>>>> + if (ublk_queue_can_use_recovery(ubq)) {
>>>>>>>> + blk_mq_requeue_request(rq, false);
>>>>>>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
>>>>>>>
>>>>>>> Here you needn't to kick requeue list since we know it can't make
>>>>>>> progress. And you can do that once before deleting gendisk
>>>>>>> or the queue is recovered.
>>>>>>
>>>>>> No, kicking rq here is necessary.
>>>>>>
>>>>>> Consider USER_RECOVERY is enabled and everything goes well.
>>>>>> User sends STOP_DEV, and we have kicked requeue list in
>>>>>> ublk_stop_dev() and are going to call del_gendisk().
>>>>>> However, a crash happens now. Then rqs may be still requeued
>>>>>> by ublk_queue_rq() because ublk_queue_rq() sees a dying
>>>>>> ubq_daemon. So del_gendisk() will hang because there are
>>>>>> rqs leaving in requeue list and no one kicks them.
>>>>>
>>>>> Why can't you kick requeue list before calling del_gendisk().
>>>>
>>>> Yes, we can kick requeue list once before calling del_gendisk().
>>>> But a crash may happen just after kicking but before del_gendisk().
>>>> So some rqs may be requeued at this moment. But we have already
>>>> kicked the requeue list! Then del_gendisk() will hang, right?
>>>
>>> ->force_abort is set before kicking in ublk_unquiesce_dev(), so
>>> all new requests are failed immediately instead of being requeued,
>>> right?
>>>
>>
>> ->force_abort is not helpful here because there may be fallback wq running
>> which can requeue rqs after kicking requeue list.
>
> After ublk_wait_tagset_rqs_idle() returns, there can't be any
> pending requests in fallback wq or task work, can there?

Please consider this case: a crash happens while ublk_stop_dev() is
being called. In such a case I cannot schedule quiesce_work or call
ublk_wait_tagset_rqs_idle(). This is because quiesce_work has to
acquire ub_mutex to quiesce the request queue.

>
> Please look at the 2nd point of the comment log, and I did ask you
> to add the words for fallback wq and task work.
>
>>
>> In ublk_unquiesce_dev(), I simply disable recovery feature
>> if ub's state is UBLK_S_DEV_LIVE while stopping ublk_dev.
>
> monitor work will provide forward progress in case of not applying
> user recovery.

Yes, that's why I disable the recovery feature in ublk_stop_dev() if quiesce_work
has not run. In this case monitor_work can abort rqs.

>
>>
>> Note: We can make sure fallback wq does not run if we wait until all rqs with
>> ACTIVE flag set are requeued. This is done in quiesce_work(). But it cannot
>> run while ublk_stop_dev() is running...
>
> Please take a look at the delta patch I just sent, which is supposed to be
> simpler for addressing this corner case.

I saw your patch; it handles rqs with ACTIVE unset, but my concern is rqs
with ACTIVE set.

Regards,
Zhang.

2022-09-20 04:52:25

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On Tue, Sep 20, 2022 at 11:24:12AM +0800, Ziyang Zhang wrote:
> On 2022/9/20 11:04, Ming Lei wrote:
> > On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
> >
> > Follows the delta patch against patch 5 for showing the idea:
> >
> >
> > diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> > index 4409a130d0b6..60c5786c4711 100644
> > --- a/drivers/block/ublk_drv.c
> > +++ b/drivers/block/ublk_drv.c
> > @@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
> > * Also aborting may not be started yet, keep in mind that one failed
> > * request may be issued by block layer again.
> > */
> > -static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> > +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
> > + struct request *req)
> > {
> > WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> >
> > @@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> > req->tag,
> > io->flags);
> > io->flags |= UBLK_IO_FLAG_ABORTED;
> > - blk_mq_end_request(req, BLK_STS_IOERR);
> > + if (ublk_queue_can_use_recovery_reissue(ubq))
> > + blk_mq_requeue_request(req, false);
>
> Here is one problem:
> We reset io->flags to 0 in ublk_queue_reinit() and it is called before new

As we agreed, ublk_queue_reinit() will be moved to ublk_ch_release(), when there isn't
any inflight request, which is completed by either ublk server or __ublk_fail_req().

So clearing io->flags isn't related to quiescing the device.

> ubq_daemon with FETCH_REQ is accepted. ublk_abort_queue() is not protected with
> ub_mutex and it is called many times in monitor_work. So same rq may be requeued
> multiple times.

UBLK_IO_FLAG_ABORTED is set for the slot, so one req is ended or
requeued just once.

>
> With recovery disabled, there is no such problem since io->flags does not change
> until ublk_dev is released.

But we have agreed that ublk_queue_reinit() can be moved to release
handler of /dev/ublkcN.

>
> In my patch 5 I only requeue the same rq once. So re-using ublk_abort_queue() is
> hard for recovery feature.

No, the same rq is just requeued once. Here the point is:

1) reuse the previous pattern in ublk_stop_dev(), which has proven to
work reliably

2) avoid staying in a half-working state forever

3) the idea behind it is simpler.


Thanks.
Ming

2022-09-20 04:56:51

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/20 11:04, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
>> On 2022/9/19 20:33, Ming Lei wrote:
>>>>>> +
>>>>>> +static void ublk_quiesce_queue(struct ublk_device *ub,
>>>>>> + struct ublk_queue *ubq)
>>>>>> +{
>>>>>> + int i;
>>>>>> +
>>>>>> + for (i = 0; i < ubq->q_depth; i++) {
>>>>>> + struct ublk_io *io = &ubq->ios[i];
>>>>>> +
>>>>>> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
>>>>>> + struct request *rq = blk_mq_tag_to_rq(
>>>>>> + ub->tag_set.tags[ubq->q_id], i);
>>>>>> +
>>>>>> + WARN_ON_ONCE(!rq);
>>>>>> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
>>>>>> + ublk_queue_can_use_recovery_reissue(ubq) ?
>>>>>> + "requeue" : "abort",
>>>>>> + ubq->q_id, i, io->flags);
>>>>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>>>>>> + blk_mq_requeue_request(rq, false);
>>>>>
>>>>> This way is too violent.
>>>>>
>>>>> There may be just one queue dying, but you requeue all requests
>>>>> from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
>>>>> such as, just requeuing requests in dying queue.
>>>>
>>>> If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
>>>> must exit and rqs of all queues have to be requeued/aborted. We cannot let live
>>>> ubq_daemons run any more because they do not belong to the new process.
>>>
>>> IMO, the old process really can exist, and recently even I got such
>>> requirement for switching queue from one thread to another.
>>
>> For now, only one process can open /dev/ublkcX, so a new process is necessary now.
>>
>> If you think "per ubq_daemon" recovery is reasonable, I can do that in the future
>> if multiple processes is supported. But I really suggest that we can keep current
>> design as the first step which assumes all ubq_daemons are exited and a new process
>> is started, and that really meets our requirement.
>>
>> BTW, START_USER_RECOVERY has to be reconsidered because we may need to pass a ubq_id
>> with it.
>>
>>>
>>> What we should do is to get all inflight requests done, and cancel all io
>>> commands, no matter if the ubq pthread is dead or live.
>>>
>>>>
>>>> BTW, I really wonder why there could be just one queue dying? All queues must be dying
>>>> shortly after any ubq_daemon is dying since they are all pthreads in the same process.
>>>
>>> You can't assume it is always so. Maybe one pthread is dead first, and
>>> others are dying later, maybe just one is dead.
>>
>> Yes, I know there may be only one pthread is dead while others keep running, but now
>> ublk_drv only support one process opening the same /dev/ublkcX, so other pthreads
>> must dead(no matter they are aborted by signal or themselves) later.
>>
>>>
>>> If one queue's pthread is live, you may get trouble by simply requeuing
>>> the request, that is why I suggest to re-use the logic of
>>> ublk_daemon_monitor_work/ublk_abort_queue().
>>
>> Actually, if any ubq_daemon is live, no rqs are requeued, please see the check in
>> ublk_quiesce_dev(). It always makes sure that ALL ubq_daemons are dying, then it
>> starts quiesce jobs.
>
> OK, looks I miss this point, but you should have quiesced queue at the
> beginning of ublk_quiesce_dev(), then the transition period can be kept
> as short as possible. Otherwise, if one queue pthread isn't dying, the
> device can be kept in this part-working state forever.
>

Ming, this is what you said in PATCH V2:
"
The simplest handling might be to exit all ublk queues first, and re-create one
new process to recover all since the request queue is required to be
quiesced first, and all ublk queue is actually quiesced too. So from user
viewpoint, there is nothing visible comparing with just recovering
single ubq daemon/pthread.
"

So I assume that quiesce_work starts only after all ubq_daemons are dying.
Note that current ublk does not support multiple processes opening the same chardev.

We really should agree on this. My suggestion is that we only consider the
"all ubq_daemons are dying" case.

You mention that someone wants to enable "switch ubq_daemon pthread to another one" and
I think it is a separate feature, not the recovery feature.

Regards,
Zhang.

2022-09-20 04:56:52

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/20 12:01, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 11:24:12AM +0800, Ziyang Zhang wrote:
>> On 2022/9/20 11:04, Ming Lei wrote:
>>> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
>>>
>>> Follows the delta patch against patch 5 for showing the idea:
>>>
>>>
>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>> index 4409a130d0b6..60c5786c4711 100644
>>> --- a/drivers/block/ublk_drv.c
>>> +++ b/drivers/block/ublk_drv.c
>>> @@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
>>> * Also aborting may not be started yet, keep in mind that one failed
>>> * request may be issued by block layer again.
>>> */
>>> -static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>>> +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
>>> + struct request *req)
>>> {
>>> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>>>
>>> @@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>>> req->tag,
>>> io->flags);
>>> io->flags |= UBLK_IO_FLAG_ABORTED;
>>> - blk_mq_end_request(req, BLK_STS_IOERR);
>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>>> + blk_mq_requeue_request(req, false);
>>
>> Here is one problem:
>> We reset io->flags to 0 in ublk_queue_reinit() and it is called before new
>
> As we agreed, ublk_queue_reinit() will be moved to ublk_ch_release(), when there isn't
> any inflight request, which is completed by either ublk server or __ublk_fail_req().
>
> So clearing io->flags isn't related with quisceing device.
>
>> ubq_daemon with FETCH_REQ is accepted. ublk_abort_queue() is not protected with
>> ub_mutex and it is called many times in monitor_work. So same rq may be requeued
>> multiple times.
>
> UBLK_IO_FLAG_ABORTED is set for the slot, so one req is only ended or
> requeued just once.

Yes, we can move ublk_queue_reinit() into ublk_ch_release(), but monitor_work is scheduled
periodically so ublk_abort_queue() is called multiple times. As ublk_queue_reinit() clears
io->flags, ublk_abort_queue() can requeue the same rq twice. Note that monitor_work can be
scheduled after ublk_ch_release().

>
>>
>> With recovery disabled, there is no such problem since io->flags does not change
>> until ublk_dev is released.
>
> But we have agreed that ublk_queue_reinit() can be moved to release
> handler of /dev/ublkcN.
>
>>
>> In my patch 5 I only requeue the same rq once. So re-using ublk_abort_queue() is
>> hard for recovery feature.
>
> No, the same rq is just requeued once. Here the point is:
>
> 1) reuse previous pattern in ublk_stop_dev(), which is proved as
> workable reliably
>
> 2) avoid to stay in half-working state forever
>
> 3) the behind idea is more simpler.

Ming, your patch requeues rqs with ACTIVE unset; these rqs have been issued to the
dying ubq_daemon. What I am concerned about is inflight rqs with ACTIVE set.

Regards,
Zhang.

2022-09-20 05:21:50

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/20 12:49, Ming Lei wrote:
> On Tue, Sep 20, 2022 at 12:39:31PM +0800, Ziyang Zhang wrote:
>> On 2022/9/20 12:01, Ming Lei wrote:
>>> On Tue, Sep 20, 2022 at 11:24:12AM +0800, Ziyang Zhang wrote:
>>>> On 2022/9/20 11:04, Ming Lei wrote:
>>>>> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
>>>>>
>>>>> Follows the delta patch against patch 5 for showing the idea:
>>>>>
>>>>>
>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>>>>> index 4409a130d0b6..60c5786c4711 100644
>>>>> --- a/drivers/block/ublk_drv.c
>>>>> +++ b/drivers/block/ublk_drv.c
>>>>> @@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
>>>>> * Also aborting may not be started yet, keep in mind that one failed
>>>>> * request may be issued by block layer again.
>>>>> */
>>>>> -static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>>>>> +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
>>>>> + struct request *req)
>>>>> {
>>>>> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
>>>>>
>>>>> @@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
>>>>> req->tag,
>>>>> io->flags);
>>>>> io->flags |= UBLK_IO_FLAG_ABORTED;
>>>>> - blk_mq_end_request(req, BLK_STS_IOERR);
>>>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>>>>> + blk_mq_requeue_request(req, false);
>>>>
>>>> Here is one problem:
>>>> We reset io->flags to 0 in ublk_queue_reinit() and it is called before new
>>>
>>> As we agreed, ublk_queue_reinit() will be moved to ublk_ch_release(), when there isn't
>>> any inflight request, which is completed by either ublk server or __ublk_fail_req().
>>>
>>> So clearing io->flags isn't related with quisceing device.
>>>
>>>> ubq_daemon with FETCH_REQ is accepted. ublk_abort_queue() is not protected with
>>>> ub_mutex and it is called many times in monitor_work. So same rq may be requeued
>>>> multiple times.
>>>
>>> UBLK_IO_FLAG_ABORTED is set for the slot, so one req is only ended or
>>> requeued just once.
>>
>> Yes, we can move ublk_queue_reinit() into ublk_ch_release(), but monitor_work is scheduled
>> periodically so ublk_abort_queue() is called multiple times. As ublk_queue_reinit() clear
>> io->flags, ublk_abort_queue() can requeue the same rq twice. Note that monitor_work can be
>> scheduled after ublk_ch_release().
>
> No, monitor work is supposed to be shutdown after in-flight requests are
> drained.


Let's add cancel_delayed_work_sync(&ub->monitor_work) in ublk_ch_release().
monitor_work should not be scheduled after ub's state is QUIESCED.

Regards,
Zhang.

2022-09-20 05:22:41

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On Tue, Sep 20, 2022 at 12:39:31PM +0800, Ziyang Zhang wrote:
> On 2022/9/20 12:01, Ming Lei wrote:
> > On Tue, Sep 20, 2022 at 11:24:12AM +0800, Ziyang Zhang wrote:
> >> On 2022/9/20 11:04, Ming Lei wrote:
> >>> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
> >>>
> >>> Follows the delta patch against patch 5 for showing the idea:
> >>>
> >>>
> >>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>> index 4409a130d0b6..60c5786c4711 100644
> >>> --- a/drivers/block/ublk_drv.c
> >>> +++ b/drivers/block/ublk_drv.c
> >>> @@ -656,7 +656,8 @@ static void ublk_complete_rq(struct request *req)
> >>> * Also aborting may not be started yet, keep in mind that one failed
> >>> * request may be issued by block layer again.
> >>> */
> >>> -static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> >>> +static void __ublk_fail_req(struct ublk_queue *ubq, struct ublk_io *io,
> >>> + struct request *req)
> >>> {
> >>> WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
> >>>
> >>> @@ -667,7 +668,10 @@ static void __ublk_fail_req(struct ublk_io *io, struct request *req)
> >>> req->tag,
> >>> io->flags);
> >>> io->flags |= UBLK_IO_FLAG_ABORTED;
> >>> - blk_mq_end_request(req, BLK_STS_IOERR);
> >>> + if (ublk_queue_can_use_recovery_reissue(ubq))
> >>> + blk_mq_requeue_request(req, false);
> >>
> >> Here is one problem:
> >> We reset io->flags to 0 in ublk_queue_reinit() and it is called before new
> >
> > As we agreed, ublk_queue_reinit() will be moved to ublk_ch_release(), when there isn't
> > any inflight request, which is completed by either ublk server or __ublk_fail_req().
> >
> > So clearing io->flags isn't related with quisceing device.
> >
> >> ubq_daemon with FETCH_REQ is accepted. ublk_abort_queue() is not protected with
> >> ub_mutex and it is called many times in monitor_work. So same rq may be requeued
> >> multiple times.
> >
> > UBLK_IO_FLAG_ABORTED is set for the slot, so one req is only ended or
> > requeued just once.
>
> Yes, we can move ublk_queue_reinit() into ublk_ch_release(), but monitor_work is scheduled
> periodically so ublk_abort_queue() is called multiple times. As ublk_queue_reinit() clear
> io->flags, ublk_abort_queue() can requeue the same rq twice. Note that monitor_work can be
> scheduled after ublk_ch_release().

No, monitor work is supposed to be shut down after in-flight requests are
drained.

>
> >
> >>
> >> With recovery disabled, there is no such problem since io->flags does not change
> >> until ublk_dev is released.
> >
> > But we have agreed that ublk_queue_reinit() can be moved to release
> > handler of /dev/ublkcN.
> >
> >>
> >> In my patch 5 I only requeue the same rq once. So re-using ublk_abort_queue() is
> >> hard for recovery feature.
> >
> > No, the same rq is just requeued once. Here the point is:
> >
> > 1) reuse previous pattern in ublk_stop_dev(), which is proved as
> > workable reliably
> >
> > 2) avoid to stay in half-working state forever
> >
> > 3) the behind idea is more simpler.
>
> Ming, your patch requeues rqs with ACTIVE unset; these rqs have been issued to the
> dying ubq_daemon. What I concern about is inflight rqs with ACTIVE set.

My patch drains all inflight requests no matter if ACTIVE is set or not,
and that is the reason why it is simpler.

Thanks,
Ming

2022-09-20 05:23:14

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 4/7] ublk_drv: requeue rqs with recovery feature enabled

On Tue, Sep 20, 2022 at 11:34:32AM +0800, Ziyang Zhang wrote:
> On 2022/9/20 11:18, Ming Lei wrote:
> > On Tue, Sep 20, 2022 at 11:04:30AM +0800, Ziyang Zhang wrote:
> >> On 2022/9/20 10:39, Ming Lei wrote:
> >>> On Tue, Sep 20, 2022 at 09:31:54AM +0800, Ziyang Zhang wrote:
> >>>> On 2022/9/19 20:39, Ming Lei wrote:
> >>>>> On Mon, Sep 19, 2022 at 05:12:21PM +0800, Ziyang Zhang wrote:
> >>>>>> On 2022/9/19 11:55, Ming Lei wrote:
> >>>>>>> On Tue, Sep 13, 2022 at 12:17:04PM +0800, ZiyangZhang wrote:
> >>>>>>>> With recovery feature enabled, in ublk_queue_rq or task work
> >>>>>>>> (in exit_task_work or fallback wq), we requeue rqs instead of
> >>>>>>>> ending(aborting) them. Besides, No matter recovery feature is enabled
> >>>>>>>> or disabled, we schedule monitor_work immediately.
> >>>>>>>>
> >>>>>>>> Signed-off-by: ZiyangZhang <[email protected]>
> >>>>>>>> ---
> >>>>>>>> drivers/block/ublk_drv.c | 34 ++++++++++++++++++++++++++++++++--
> >>>>>>>> 1 file changed, 32 insertions(+), 2 deletions(-)
> >>>>>>>>
> >>>>>>>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> >>>>>>>> index 23337bd7c105..b067f33a1913 100644
> >>>>>>>> --- a/drivers/block/ublk_drv.c
> >>>>>>>> +++ b/drivers/block/ublk_drv.c
> >>>>>>>> @@ -682,6 +682,21 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
> >>>>>>>>
> >>>>>>>> #define UBLK_REQUEUE_DELAY_MS 3
> >>>>>>>>
> >>>>>>>> +static inline void __ublk_abort_rq_in_task_work(struct ublk_queue *ubq,
> >>>>>>>> + struct request *rq)
> >>>>>>>> +{
> >>>>>>>> + pr_devel("%s: %s q_id %d tag %d io_flags %x.\n", __func__,
> >>>>>>>> + (ublk_queue_can_use_recovery(ubq)) ? "requeue" : "abort",
> >>>>>>>> + ubq->q_id, rq->tag, ubq->ios[rq->tag].flags);
> >>>>>>>> + /* We cannot process this rq so just requeue it. */
> >>>>>>>> + if (ublk_queue_can_use_recovery(ubq)) {
> >>>>>>>> + blk_mq_requeue_request(rq, false);
> >>>>>>>> + blk_mq_delay_kick_requeue_list(rq->q, UBLK_REQUEUE_DELAY_MS);
> >>>>>>>
> >>>>>>> Here you needn't to kick requeue list since we know it can't make
> >>>>>>> progress. And you can do that once before deleting gendisk
> >>>>>>> or the queue is recovered.
> >>>>>>
> >>>>>> No, kicking rq here is necessary.
> >>>>>>
> >>>>>> Consider USER_RECOVERY is enabled and everything goes well.
> >>>>>> User sends STOP_DEV, and we have kicked requeue list in
> >>>>>> ublk_stop_dev() and are going to call del_gendisk().
> >>>>>> However, a crash happens now. Then rqs may be still requeued
> >>>>>> by ublk_queue_rq() because ublk_queue_rq() sees a dying
> >>>>>> ubq_daemon. So del_gendisk() will hang because there are
> >>>>>> rqs leaving in requeue list and no one kicks them.
> >>>>>
> >>>>> Why can't you kick requeue list before calling del_gendisk().
> >>>>
> >>>> Yes, we can kick requeue list once before calling del_gendisk().
> >>>> But a crash may happen just after kicking but before del_gendisk().
> >>>> So some rqs may be requeued at this moment. But we have already
> >>>> kicked the requeue list! Then del_gendisk() will hang, right?
> >>>
> >>> ->force_abort is set before kicking in ublk_unquiesce_dev(), so
> >>> all new requests are failed immediately instead of being requeued,
> >>> right?
> >>>
> >>
> >> ->force_abort is not helpful here because there may be a fallback wq running
> >> which can requeue rqs after the requeue list is kicked.
> >
> > After ublk_wait_tagset_rqs_idle() returns, there can't be any
> > pending requests in the fallback wq or task work, can there?
> Please consider this case: a crash happens while ublk_stop_dev() is
> running. In that case I cannot schedule quiesce_work or call
> ublk_wait_tagset_rqs_idle(), because quiesce_work has to
> acquire ub_mutex to quiesce the request queue.

The issue can be addressed more reliably, cleanly and consistently in
the following way; then you needn't switch between the two modes.

ublk_stop_dev()

	if (ublk_can_use_recovery(ub)) {
		if (ub->dev_info.state == UBLK_S_DEV_LIVE)
			__ublk_quiesce_dev(ub); /* lockless version */
		ublk_unquiesce_dev();
	}

Meanwhile, it is no longer necessary to disable the recovery feature in
ublk_unquiesce_dev().
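As an illustration only, the suggested control flow can be modeled in plain userspace C. This is NOT driver code: the struct, the helper bodies, and the boolean fields are stand-ins for the kernel state discussed in this thread, and only the ordering of the quiesce/unquiesce steps is the point.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the suggested flow -- names mirror this thread,
 * bodies are stand-ins. */
enum ublk_state { UBLK_S_DEV_LIVE, UBLK_S_DEV_QUIESCED, UBLK_S_DEV_DEAD };

struct ublk_dev_model {
	enum ublk_state state;
	bool can_use_recovery;	/* UBLK_F_USER_RECOVERY set */
	bool force_abort;	/* new rqs fail instead of being requeued */
	bool requeue_kicked;	/* requeue list kicked exactly once */
};

/* lockless version: caller (ublk_stop_dev) already holds ub_mutex */
static void __ublk_quiesce_dev(struct ublk_dev_model *ub)
{
	ub->state = UBLK_S_DEV_QUIESCED;
}

static void ublk_unquiesce_dev(struct ublk_dev_model *ub)
{
	/* set ->force_abort before the final kick, so nothing can land
	 * in the requeue list afterwards */
	ub->force_abort = true;
	ub->requeue_kicked = true;
}

static void ublk_stop_dev(struct ublk_dev_model *ub)
{
	if (ub->can_use_recovery) {
		if (ub->state == UBLK_S_DEV_LIVE)
			__ublk_quiesce_dev(ub);
		ublk_unquiesce_dev(ub);
	}
	ub->state = UBLK_S_DEV_DEAD;	/* del_gendisk() etc. happen here */
}
```

Whether a crash arrives before or after STOP_DEV, the device walks the same quiesce/unquiesce sequence before del_gendisk(), which is the consistency the suggestion is after.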




thanks,
Ming

2022-09-20 05:53:43

by Ming Lei

[permalink] [raw]
Subject: Re: [PATCH V3 3/7] ublk_drv: define macros for recovery feature and check them

On Tue, Sep 13, 2022 at 12:17:03PM +0800, ZiyangZhang wrote:
> Define some macros for recovery feature. Especially define a new state:
> UBLK_S_DEV_QUIESCED which implies that ublk_device is quiesced
> and is ready for recovery. This state can be observed by userspace.
>
> UBLK_F_USER_RECOVERY implies that:
> (1) ublk_drv enables the recovery feature. It won't let monitor_work
> automatically abort rqs and release the device.
> (2) With a dying ubq_daemon, ublk_drv ends (aborts) rqs issued to
> userspace (ublksrv) before the crash.
> (3) With a dying ubq_daemon, ublk_drv requeues rqs in task work and
> ublk_queue_rq().
>
> UBLK_F_USER_RECOVERY_REISSUE implies that:
> (1) everything UBLK_F_USER_RECOVERY implies, except
> (2) with a dying ubq_daemon, ublk_drv requeues rqs issued to
> userspace (ublksrv) before the crash.
>
> UBLK_F_USER_RECOVERY_REISSUE is designed for backends which:
> (1) tolerate double-writes, because ublk_drv may issue the same rq
> twice;
> (2) do not let frontend users see I/O errors, such as a read-only FS
> or a VM backend.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 45 ++++++++++++++++++++++++++++++++++-
> include/uapi/linux/ublk_cmd.h | 7 ++++++
> 2 files changed, 51 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 0c6db0978ed0..23337bd7c105 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -49,7 +49,9 @@
> /* All UBLK_F_* have to be included into UBLK_F_ALL */
> #define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
> | UBLK_F_URING_CMD_COMP_IN_TASK \
> - | UBLK_F_NEED_GET_DATA)
> + | UBLK_F_NEED_GET_DATA \
> + | UBLK_F_USER_RECOVERY \
> + | UBLK_F_USER_RECOVERY_REISSUE)
>
> /* All UBLK_PARAM_TYPE_* should be included here */
> #define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
> @@ -323,6 +325,47 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
> PAGE_SIZE);
> }
>
> +static inline bool ublk_queue_can_use_recovery(
> + struct ublk_queue *ubq)
> +{
> + if (ubq->flags & UBLK_F_USER_RECOVERY)
> + return true;
> + return false;
> +}
> +
> +static inline void ublk_disable_recovery(struct ublk_device *ub)
> +{
> + int i;
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> + ubq->flags &= ~UBLK_F_USER_RECOVERY;
> + }
> +}

Flags are not supposed to change; in particular, ublk_disable_recovery
isn't necessary with my suggestion in the following link:

https://lore.kernel.org/linux-block/YylEjEply6y+bs0B@T590/T/#u


> +
> +static inline bool ublk_can_use_recovery(struct ublk_device *ub)
> +{
> + int i;
> +
> + for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
> + struct ublk_queue *ubq = ublk_get_queue(ub, i);
> +
> + if (!ublk_queue_can_use_recovery(ubq))
> + return false;
> + }
> + return true;
> +}

The above is too tricky; why not check ub->dev_info.flags &
UBLK_F_USER_RECOVERY directly?
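For illustration, the direct check could look like the sketch below. This is a userspace mock-up, not driver code: the structs are trimmed-down stand-ins, and the field name dev_info.flags is assumed from the UAPI header in this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Flag values from include/uapi/linux/ublk_cmd.h in this series */
#define UBLK_F_USER_RECOVERY		(1UL << 3)
#define UBLK_F_USER_RECOVERY_REISSUE	(1UL << 4)

/* Trimmed-down stand-ins for the kernel structs */
struct ublksrv_ctrl_dev_info_model {
	uint64_t flags;
};

struct ublk_dev_model {
	struct ublksrv_ctrl_dev_info_model dev_info;
};

/*
 * One device-level test replaces the per-queue loop: every queue copies
 * dev_info.flags at init time, so checking the device flags once is enough.
 */
static inline bool ublk_can_use_recovery(const struct ublk_dev_model *ub)
{
	return ub->dev_info.flags & UBLK_F_USER_RECOVERY;
}
```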

> +
> +static inline bool ublk_queue_can_use_recovery_reissue(
> + struct ublk_queue *ubq)
> +{
> + if (ublk_queue_can_use_recovery(ubq) &&
> + (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
> + return true;
> + return false;
> +}
> +
> static void ublk_free_disk(struct gendisk *disk)
> {
> struct ublk_device *ub = disk->private_data;
> diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
> index 677edaab2b66..87204c39f1ee 100644
> --- a/include/uapi/linux/ublk_cmd.h
> +++ b/include/uapi/linux/ublk_cmd.h
> @@ -17,6 +17,8 @@
> #define UBLK_CMD_STOP_DEV 0x07
> #define UBLK_CMD_SET_PARAMS 0x08
> #define UBLK_CMD_GET_PARAMS 0x09
> +#define UBLK_CMD_START_USER_RECOVERY 0x10
> +#define UBLK_CMD_END_USER_RECOVERY 0x11
>
> /*
> * IO commands, issued by ublk server, and handled by ublk driver.
> @@ -74,9 +76,14 @@
> */
> #define UBLK_F_NEED_GET_DATA (1UL << 2)
>
> +#define UBLK_F_USER_RECOVERY (1UL << 3)
> +
> +#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4)

The above are two features. I'd suggest adding UBLK_F_USER_RECOVERY
and its implementation first, then adding one delta patch for supporting
the new UBLK_F_USER_RECOVERY_REISSUE feature.

Not only is that more helpful for reviewing, it also makes the
difference between the two easier to understand.
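The behavioral difference between the two flags, as described in the commit message quoted above, boils down to a three-way decision for rqs already issued to the dying ublksrv. A sketch of that decision (flag values from the patch; the function itself is an illustration, not driver code):

```c
#include <assert.h>
#include <stdint.h>

#define UBLK_F_USER_RECOVERY		(1UL << 3)
#define UBLK_F_USER_RECOVERY_REISSUE	(1UL << 4)

enum rq_action {
	RQ_ABORT,	/* no recovery: monitor_work aborts rqs, device is released */
	RQ_END_IO_ERR,	/* recovery only: rq is ended (aborted) with an error */
	RQ_REQUEUE,	/* reissue: rq is requeued; backend must tolerate double-writes */
};

/* Decide what happens to an rq that was issued to userspace before the crash */
static enum rq_action handle_inflight_rq(uint64_t flags)
{
	if (!(flags & UBLK_F_USER_RECOVERY))
		return RQ_ABORT;
	if (flags & UBLK_F_USER_RECOVERY_REISSUE)
		return RQ_REQUEUE;
	return RQ_END_IO_ERR;
}
```

Splitting the series along this decision boundary, as suggested, would make each case reviewable on its own.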


thanks,
Ming

2022-09-20 06:15:20

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [PATCH V3 5/7] ublk_drv: consider recovery feature in aborting mechanism

On 2022/9/20 12:45, Ziyang Zhang wrote:
> On 2022/9/20 11:04, Ming Lei wrote:
>> On Tue, Sep 20, 2022 at 09:49:33AM +0800, Ziyang Zhang wrote:
>>> On 2022/9/19 20:33, Ming Lei wrote:
>>>>>>> +
>>>>>>> +static void ublk_quiesce_queue(struct ublk_device *ub,
>>>>>>> + struct ublk_queue *ubq)
>>>>>>> +{
>>>>>>> + int i;
>>>>>>> +
>>>>>>> + for (i = 0; i < ubq->q_depth; i++) {
>>>>>>> + struct ublk_io *io = &ubq->ios[i];
>>>>>>> +
>>>>>>> + if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
>>>>>>> + struct request *rq = blk_mq_tag_to_rq(
>>>>>>> + ub->tag_set.tags[ubq->q_id], i);
>>>>>>> +
>>>>>>> + WARN_ON_ONCE(!rq);
>>>>>>> + pr_devel("%s: %s rq: qid %d tag %d io_flags %x\n", __func__,
>>>>>>> + ublk_queue_can_use_recovery_reissue(ubq) ?
>>>>>>> + "requeue" : "abort",
>>>>>>> + ubq->q_id, i, io->flags);
>>>>>>> + if (ublk_queue_can_use_recovery_reissue(ubq))
>>>>>>> + blk_mq_requeue_request(rq, false);
>>>>>>
>>>>>> This way is too violent.
>>>>>>
>>>>>> There may be just one queue dying, but you requeue all requests
>>>>>> from any queue. I'd suggest to take the approach in ublk_daemon_monitor_work(),
>>>>>> such as, just requeuing requests in dying queue.
>>>>>
>>>>> If we want to start a new process after a crash for USER_RECOVERY, all old ubq_daemons
>>>>> must exit and rqs of all queues have to be requeued/aborted. We cannot let live
>>>>> ubq_daemons run any more because they do not belong to the new process.
>>>>
>>>> IMO, the old process really can stay alive, and recently I even got
>>>> such a requirement for switching a queue from one thread to another.
>>>
>>> For now, only one process can open /dev/ublkcX, so a new process is necessary.
>>>
>>> If you think "per ubq_daemon" recovery is reasonable, I can do that in the future
>>> if multiple processes are supported. But I really suggest that we keep the current
>>> design as the first step, which assumes all ubq_daemons have exited and a new
>>> process is started; that really meets our requirement.
>>>
>>> BTW, START_USER_RECOVERY has to be reconsidered because we may need to pass a ubq_id
>>> with it.
>>>
>>>>
>>>> What we should do is to get all inflight requests done, and cancel all io
>>>> commands, no matter if the ubq pthread is dead or live.
>>>>
>>>>>
>>>>> BTW, I really wonder how there could be just one dying queue. All queues must be
>>>>> dying shortly after any ubq_daemon dies, since they are all pthreads in the same process.
>>>>
>>>> You can't assume it is always so. Maybe one pthread dies first and
>>>> the others die later, or maybe just one dies.
>>>
>>> Yes, I know only one pthread may be dead while others keep running, but currently
>>> ublk_drv only supports one process opening the same /dev/ublkcX, so the other pthreads
>>> must die later (whether they are aborted by a signal or exit on their own).
>>>
>>>>
>>>> If one queue's pthread is live, you may get trouble by simply requeuing
>>>> the request, that is why I suggest to re-use the logic of
>>>> ublk_daemon_monitor_work/ublk_abort_queue().
>>>
>>> Actually, if any ubq_daemon is live, no rqs are requeued; please see the check in
>>> ublk_quiesce_dev(). It always makes sure that ALL ubq_daemons are dying before it
>>> starts the quiesce work.
>>
>> OK, it looks like I missed this point, but you should quiesce the queue at the
>> beginning of ublk_quiesce_dev(), so that the transition period is kept
>> as short as possible. Otherwise, if one queue pthread isn't dying, the
>> device can be kept in this part-working state forever.
>>
>
> Ming, this is what you said in PATCH V2:
> "
> The simplest handling might be to exit all ublk queues first, and re-create one
> new process to recover all since the request queue is required to be
> quiesced first, and all ublk queue is actually quiesced too. So from user
> viewpoint, there is nothing visible comparing with just recovering
> single ubq daemon/pthread.
> "
>
> So I assume that quiesce_work starts only after all ubq_daemons are dying.
> Note that current ublk does not support multiple processes opening the same chardev.
>
> We really should agree on this. My suggestion is that we only consider "all ubq_daemons
> are dying".
>
> You mention that someone wants to enable "switch ubq_daemon pthread to another one", and
> I think that is a separate feature, not part of the recovery feature.
>
> Regards,
> Zhang.

This should be considered very carefully, Ming.