2022-08-24 05:51:36

by Ziyang Zhang

Subject: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

ublk_drv is a driver that simply passes all blk-mq rqs to ublksrv[1] in
userspace. For each ublk queue, there is one ubq_daemon(pthread).
All ubq_daemons share the same process which opens /dev/ublkcX.
The ubq_daemon code infinitely loops on io_uring_enter() to
send/receive io_uring cmds which pass information of blk-mq
rqs.
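
For readers unfamiliar with ublksrv, here is a minimal sketch of what
each ubq_daemon's event loop conceptually looks like. This is an
illustration only, not ublksrv's actual code: handle_io_and_requeue_cmd()
is a hypothetical helper, and the real daemon in [1] drives
UBLK_IO_FETCH_REQ / UBLK_IO_COMMIT_AND_FETCH_REQ io_uring cmds.

```
#include <liburing.h>

/* Hypothetical helper: map cqe->user_data to a tag, serve the I/O in
 * userspace, then prepare a COMMIT_AND_FETCH_REQ sqe for that tag. */
static void handle_io_and_requeue_cmd(struct io_uring *ring,
				      struct io_uring_cqe *cqe)
{
	(void)ring;
	(void)cqe;
}

static void ubq_daemon_loop(struct io_uring *ring)
{
	/* One FETCH_REQ io_uring cmd per tag was queued beforehand. */
	for (;;) {
		struct io_uring_cqe *cqe;

		/* Submit pending cmds and block until the driver
		 * passes us at least one blk-mq rq. */
		io_uring_submit_and_wait(ring, 1);

		while (io_uring_peek_cqe(ring, &cqe) == 0) {
			handle_io_and_requeue_cmd(ring, cqe);
			io_uring_cqe_seen(ring, cqe);
		}
	}
}
```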

Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
must abort the dying ubq, stop the device and release everything.
This is not a good choice in practice because users do not expect
aborted requests, I/O errors and a released device. They may want
a recovery mechanism so that no requests are aborted and no I/O
errors occur. In short, users just want everything to work as usual.

This RFC patchset implements USER_RECOVERY support. If the process
crashes, we allow ublksrv to provide a new process and new ubq_daemons.
We do not support single-ubq_daemon(pthread) recovery because a
pthread rarely crashes.

The recovery feature is quite useful for products that do not expect
to return any I/O error to frontend users. In detail, we support
this scenario:
(1) The /dev/ublkc0 is opened by process 0;
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
    rqs are handled by process 0;
(3) Process 0 suddenly crashes (e.g. segfault);
(4) Fio is still running and submits IOs (but these IOs cannot
    complete now);
(5) The user recovers with process 1 and attaches it to /dev/ublkc0;
(6) All rqs are handled by process 1 now and IOs can be
    completed.

Note: The backend must tolerate double-writes because we may re-issue
a rq that was already sent to the old (dying) process. We allow users
to choose whether to re-issue these rqs or not; please see patch 7 for
more detail.
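
To make the double-write constraint concrete, below is a hedged sketch
of a backend write path that tolerates re-issued rqs (the names are
illustrative, not ublksrv's API): a positional write is naturally
idempotent, whereas an append-style log backend would need its own
deduplication logic.

```
#include <unistd.h>
#include <stdint.h>

/*
 * Idempotent: writing the same data at the same offset twice leaves the
 * backing file in the same state, so a re-issued rq is harmless.
 */
static int handle_write(int backing_fd, const void *buf, size_t len,
			uint64_t off)
{
	return pwrite(backing_fd, buf, len, (off_t)off) == (ssize_t)len
		? 0 : -1;
}
```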

We provide a sample script here to simulate the above steps:

***************************script***************************
LOOPS=10

__ublk_get_pid() {
	pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
	echo $pid
}

ublk_recover_kill()
{
	for CNT in `seq $LOOPS`; do
		dmesg -C
		pid=`__ublk_get_pid`
		echo -e "*** kill $pid now ***"
		kill -9 $pid
		sleep 2
		echo -e "*** recover now ***"
		./ublk recover -n 0
		sleep 4
	done
}

ublk_test()
{
	dmesg -C
	echo -e "*** add ublk device ***"
	./ublk add -t null -d 4 -i 1
	sleep 2
	echo -e "*** start fio ***"
	fio --bs=4k \
		--filename=/dev/ublkb0 \
		--runtime=100s \
		--rw=read &
	sleep 4
	ublk_recover_kill
	wait
	echo -e "*** delete ublk device ***"
	./ublk del -n 0
}

for CNT in `seq 4`; do
	modprobe -rv ublk_drv
	modprobe ublk_drv
	echo -e "************ round $CNT ************"
	ublk_test
	sleep 5
done
***************************script***************************

You may run it with our modified ublksrv[3] which supports the
recovery feature. No I/O error occurs and you can verify this
by typing
$ perf-tools/bin/tpoint block:block_rq_error

The basic idea of USER_RECOVERY is quite straightforward:

(1) release/free everything that belongs to the dying process.

Note: Since ublk_drv saves information about the user process, this
work is important because we don't expect any resource leakage.
Particularly, ioucmds from the dying ubq_daemons need to be
completed(freed). The current ublk_drv code cannot satisfy our needs
while considering USER_RECOVERY, so we refactor some code, shown in
patches 1-5, to gracefully free all ioucmds. A condensed sketch of this
cleanup follows the list below.

(2) re-init ublk queues, including requeuing/aborting rqs.

(3) allow new ubq_daemons to issue FETCH_REQ.
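
The condensed sketch mentioned above, in the spirit of what patches 1-5
and 8 implement for step (1): complete every ioucmd still owned by a
dying ubq_daemon so that its io_uring ctx can be freed. This is a
simplification, not the literal driver code; locking and request-state
handling are omitted.

```
static void free_dying_ubq_cmds(struct ublk_queue *ubq)
{
	int i;

	for (i = 0; i < ubq->q_depth; i++) {
		struct ublk_io *io = &ubq->ios[i];

		if (!io->cmd)	/* no outstanding ioucmd on this tag */
			continue;
		/* complete(free) the stale ioucmd */
		io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
		/* forget everything and be ready for a new FETCH_REQ */
		io->flags = 0;
		io->cmd = NULL;
		ubq->nr_io_ready--;
	}
}
```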

Here are the steps to recover:

(1) For a user, after a process crash (how the crash is detected is not
related to this patchset), he sends the START_USER_RECOVERY ctrl-cmd to
/dev/ublk-control with a dev_id X (such as 3 for /dev/ublkc3). A sketch
of sending these ctrl-cmds from userspace is shown after the steps below.

(2) Then ublk_drv should prepare for a new process to attach /dev/ublkcX.
We have described this before. The driver must quiesce the request
queue to ban any incoming ublk_queue_rq().

(3) Then, the user should start a new process with new ubq_daemons
(pthreads) and send FETCH_REQ by io_uring_enter() to make all ubqs
ready. The user must correctly set up queues, flags and so on (how to
persist ublksrv information is not related to this patchset).

(4) The user sends END_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a
dev_id X.

(5) ublk_drv waits for all ubq_daemons to get ready. Then it unquiesces
the request queue and new rqs are allowed.
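
For illustration, here is a hedged sketch of how steps (1) and (4) could
be driven from userspace, modeled on how ublksrv issues other ctrl-cmds:
an IORING_OP_URING_CMD on /dev/ublk-control with a ring created with
IORING_SETUP_SQE128. The sqe field layout follows the uapi headers as we
understand them, and error handling is omitted.

```
#include <string.h>
#include <liburing.h>
#include <linux/ublk_cmd.h>

static int send_ctrl_cmd(struct io_uring *ring, int ctrl_fd,
			 unsigned int cmd_op, unsigned int dev_id,
			 unsigned long long data0)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct ublksrv_ctrl_cmd *cmd;
	struct io_uring_cqe *cqe;
	int ret;

	memset(sqe, 0, 2 * sizeof(*sqe));	/* one SQE128 slot */
	sqe->opcode = IORING_OP_URING_CMD;
	sqe->fd = ctrl_fd;
	sqe->cmd_op = cmd_op;
	cmd = (struct ublksrv_ctrl_cmd *)sqe->cmd;
	cmd->dev_id = dev_id;
	cmd->data[0] = data0;	/* new ublksrv pid for END_USER_RECOVERY */

	io_uring_submit(ring);
	io_uring_wait_cqe(ring, &cqe);
	ret = cqe->res;
	io_uring_cqe_seen(ring, cqe);
	return ret;
}

/*
 * Usage, matching steps (1)-(5) above:
 *   send_ctrl_cmd(ring, ctrl_fd, UBLK_CMD_START_USER_RECOVERY, 0, 0);
 *   ... start the new process; re-send FETCH_REQ on every ubq ...
 *   send_ctrl_cmd(ring, ctrl_fd, UBLK_CMD_END_USER_RECOVERY, 0, new_pid);
 */
```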

After applying the refactor patches (patches 1-5), with the current
ublksrv[1], all tests[2] pass. Note that the refactor patches DO NOT
involve any recovery feature.

After applying all patches (patches 1-9), you should use the ublksrv[3]
and tests[4] provided by us. We add 2 additional tests to verify that
the recovery feature works. Our code will be PR-ed to Ming's repo soon.

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/ming1/ubdsrv/tree/master/tests
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[4] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic

ZiyangZhang (9):
ublk_drv: check 'current' instead of 'ubq_daemon'
ublk_drv: refactor ublk_cancel_queue()
ublk_drv: add a helper to get ioucmd from pdu
ublk_drv: refactor __ublk_rq_task_work() and aborting mechanism
ublk_drv: refactor ublk_stop_dev()
ublk_drv: add pr_devel() to prepare for recovery feature
ublk_drv: define macros for recovery feature and check them
ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
ublk_drv: do not schedule monitor_work with recovery feature enabled

drivers/block/ublk_drv.c | 517 ++++++++++++++++++++++++++--------
include/uapi/linux/ublk_cmd.h | 7 +
2 files changed, 408 insertions(+), 116 deletions(-)

--
2.27.0


2022-08-24 05:52:34

by Ziyang Zhang

Subject: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting mechanism

If one rq is handled by io_uring_cmd_complete_in_task(), then after a
crash this rq is actually handled by an io_uring fallback wq. We have
to end(abort) this rq since the fallback wq runs in a task other than
the crashed one. However, the current code does not call
io_uring_cmd_done() at the same time but does it in ublk_cancel_queue().
With the current design this works, because ublk_cancel_queue() is
called AFTER del_gendisk(), which waits for the rq to be ended(aborted)
in the fallback wq. This implies that the fallback wq for this rq is
scheduled BEFORE io_uring_cmd_done() is called on the corresponding
ioucmd in ublk_cancel_queue().

However, when considering the recovery feature, we cannot rely on
del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
rqs because we may not want any rq to be aborted. Besides, io_uring
does not provide a "flush fallback" mechanism, so we cannot trace this
ioucmd.

The recovery mechanism needs to complete all ioucmds of a dying ubq
to avoid leaking the io_uring ctx. But as discussed above, it is unsafe
to call io_uring_cmd_done() in the recovery task if the fallback wq
happens to run simultaneously. This is a UAF case because the io_uring
ctx may be freed. Actually a similar case happens in
(5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
re-issued request aborted previously to ioucmd's task_work).

Besides, in order to implement the recovery mechanism, in
ublk_queue_rq() and __ublk_rq_task_work() we should not end(abort) the
current rq while the ubq_daemon is dying. Instead, we should wait for
the new ubq_daemon to get ready and requeue the rq.

In summary, we refactor some code to avoid the UAF problem and prepare
for the recovery mechanism:

(1) Refactor monitor_work
monitor_work is only used without the recovery feature; it aborts rqs
and stops the device. Now ublk_abort_queue() is the only function that
ends(aborts) inflight rqs of a dying ubq_daemon. For a not-inflight
(idle) rq, its ioucmd is completed safely by io_uring_cmd_done(). We do
not rely on UBLK_IO_FLAG_ACTIVE anymore. monitor_work also sets
'force_abort' for a dying ubq.

(2) Refactor ublk_queue_rq()
Check 'force_abort' before blk_mq_start_request(). If 'force_abort' is
true, end(abort) the current rq immediately. 'force_abort' is set by
monitor_work for a dying ubq. Aborting rqs ASAP ensures that the rqs'
status does not change while we traverse them in monitor_work.

(3) Refactor __ublk_rq_task_work()
Whether we use task_work_add() or io_uring_cmd_complete_in_task(),
__ublk_rq_task_work() now accepts only one arg: the ioucmd, which is
set in ublk_queue_rq() earlier. And whether the ubq_daemon is dying or
not, we always call io_uring_cmd_done(ioucmd). Note that ioucmd might
not be the same as io->cmd because a new ubq_daemon may set new ioucmds
before the old fallback wq or exit_task_work runs. In this way the old
ioucmd can be safely freed eventually and io->cmd can be updated
without races.

(4) Refactor ublk_cancel_dev()
ublk_cancel_dev() must not complete ioucmds for a dying ubq because we
have already done this in monitor_work.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 216 +++++++++++++++++++++------------------
1 file changed, 115 insertions(+), 101 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 8add6e3ae15f..1b1c0190bba4 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -54,12 +54,10 @@
/* All UBLK_PARAM_TYPE_* should be included here */
#define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)

-struct ublk_rq_data {
- struct callback_head work;
-};
-
struct ublk_uring_cmd_pdu {
- struct request *req;
+ int q_id;
+ int tag;
+ struct callback_head work;
};

/*
@@ -122,6 +120,7 @@ struct ublk_queue {
bool abort_work_pending;
unsigned short nr_io_ready; /* how many ios setup */
struct ublk_device *dev;
+ bool force_abort;
struct ublk_io ios[0];
};

@@ -610,24 +609,6 @@ static void ublk_complete_rq(struct request *req)
__blk_mq_end_request(req, BLK_STS_OK);
}

-/*
- * Since __ublk_rq_task_work always fails requests immediately during
- * exiting, __ublk_fail_req() is only called from abort context during
- * exiting. So lock is unnecessary.
- *
- * Also aborting may not be started yet, keep in mind that one failed
- * request may be issued by block layer again.
- */
-static void __ublk_fail_req(struct ublk_io *io, struct request *req)
-{
- WARN_ON_ONCE(io->flags & UBLK_IO_FLAG_ACTIVE);
-
- if (!(io->flags & UBLK_IO_FLAG_ABORTED)) {
- io->flags |= UBLK_IO_FLAG_ABORTED;
- blk_mq_end_request(req, BLK_STS_IOERR);
- }
-}
-
static void ubq_complete_io_cmd(struct ublk_io *io, int res)
{
/* mark this cmd owned by ublksrv */
@@ -645,18 +626,14 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)

#define UBLK_REQUEUE_DELAY_MS 3

-static inline void __ublk_rq_task_work(struct request *req)
+static inline void __ublk_rq_task_work(struct io_uring_cmd *cmd)
{
- struct ublk_queue *ubq = req->mq_hctx->driver_data;
- struct ublk_device *ub = ubq->dev;
- int tag = req->tag;
- struct ublk_io *io = &ubq->ios[tag];
+ struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+ struct ublk_device *ub = cmd->file->private_data;
+ struct ublk_queue *ubq = ublk_get_queue(ub, pdu->q_id);
+ struct ublk_io *io;
+ struct request *req;
unsigned int mapped_bytes;
-
- pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
- __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
- ublk_get_iod(ubq, req->tag)->addr);
-
/*
* Task is exiting if either:
*
@@ -667,11 +644,22 @@ static inline void __ublk_rq_task_work(struct request *req)
* (2) current->flags & PF_EXITING.
*/
if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
- blk_mq_end_request(req, BLK_STS_IOERR);
- mod_delayed_work(system_wq, &ub->monitor_work, 0);
+ pr_devel("%s: (%s) done old cmd: qid %d tag %d\n",
+ __func__,
+ current != ubq->ubq_daemon ? "fallback wq" : "exit_task_work",
+ pdu->q_id, pdu->tag);
+ io_uring_cmd_done(cmd, UBLK_IO_RES_ABORT, 0);
return;
}

+ io = &ubq->ios[pdu->tag];
+ req = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], pdu->tag);
+ WARN_ON_ONCE(!req);
+
+ pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
+ __func__, cmd->cmd_op, ubq->q_id, req->tag, io->flags,
+ ublk_get_iod(ubq, req->tag)->addr);
+
if (ublk_need_get_data(ubq) &&
(req_op(req) == REQ_OP_WRITE ||
req_op(req) == REQ_OP_FLUSH)) {
@@ -728,18 +716,16 @@ static inline void __ublk_rq_task_work(struct request *req)

static void ublk_rq_task_work_cb(struct io_uring_cmd *cmd)
{
- struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
-
- __ublk_rq_task_work(pdu->req);
+ __ublk_rq_task_work(cmd);
}

static void ublk_rq_task_work_fn(struct callback_head *work)
{
- struct ublk_rq_data *data = container_of(work,
- struct ublk_rq_data, work);
- struct request *req = blk_mq_rq_from_pdu(data);
+ struct ublk_uring_cmd_pdu *pdu = container_of(work,
+ struct ublk_uring_cmd_pdu, work);
+ struct io_uring_cmd *cmd = ublk_uring_cmd_from_pdu(pdu);

- __ublk_rq_task_work(req);
+ __ublk_rq_task_work(cmd);
}

static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
@@ -747,6 +733,8 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
{
struct ublk_queue *ubq = hctx->driver_data;
struct request *rq = bd->rq;
+ struct ublk_io *io = &ubq->ios[rq->tag];
+ struct ublk_uring_cmd_pdu *pdu;
blk_status_t res;

/* fill iod to slot in io cmd buffer */
@@ -754,43 +742,43 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
if (unlikely(res != BLK_STS_OK))
return BLK_STS_IOERR;

- blk_mq_start_request(bd->rq);
-
- if (unlikely(ubq_daemon_is_dying(ubq))) {
- fail:
- mod_delayed_work(system_wq, &ubq->dev->monitor_work, 0);
+ /*
+ * We set force_abort because we want to abort this rq ASAP.
+ *
+ * NOTE: At this moment, rq is NOT inflight. So after the ubq is marked
+ * as force_abort, no new rq can be inflight and we are safe to check
+ * all rqs' status in monitor_work and choose whether:
+ *
+ * (1) if inflight, end(abort) this rq;
+ * (2) if not inflight, io_uring_cmd_done() the ioucmd.
+ */
+ if (ubq->force_abort) {
+ pr_devel("%s: force abort: qid %d tag %d io_flag %d\n",
+ __func__, ubq->q_id, rq->tag, io->flags);
return BLK_STS_IOERR;
}

+ blk_mq_start_request(bd->rq);
+ pr_devel("%s: start req: q_id %d tag %d io_flags %d\n",
+ __func__, ubq->q_id, rq->tag, io->flags);
+ pdu = ublk_get_uring_cmd_pdu(io->cmd);
+ pdu->q_id = ubq->q_id;
+ pdu->tag = rq->tag;
+
if (ublk_can_use_task_work(ubq)) {
- struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
enum task_work_notify_mode notify_mode = bd->last ?
TWA_SIGNAL_NO_IPI : TWA_NONE;

- if (task_work_add(ubq->ubq_daemon, &data->work, notify_mode))
- goto fail;
- } else {
- struct ublk_io *io = &ubq->ios[rq->tag];
- struct io_uring_cmd *cmd = io->cmd;
- struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+ init_task_work(&pdu->work, ublk_rq_task_work_fn);

- /*
- * If the check pass, we know that this is a re-issued request aborted
- * previously in monitor_work because the ubq_daemon(cmd's task) is
- * PF_EXITING. We cannot call io_uring_cmd_complete_in_task() anymore
- * because this ioucmd's io_uring context may be freed now if no inflight
- * ioucmd exists. Otherwise we may cause null-deref in ctx->fallback_work.
- *
- * Note: monitor_work sets UBLK_IO_FLAG_ABORTED and ends this request(releasing
- * the tag). Then the request is re-started(allocating the tag) and we are here.
- * Since releasing/allocating a tag implies smp_mb(), finding UBLK_IO_FLAG_ABORTED
- * guarantees that here is a re-issued request aborted previously.
- */
- if ((io->flags & UBLK_IO_FLAG_ABORTED))
- goto fail;
+ /* If task_work_add() fails, we know that ubq_daemon(cmd's task) is PF_EXITING. */
+ if (task_work_add(ubq->ubq_daemon, &pdu->work, notify_mode)) {
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, rq->tag);
+ io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+ }

- pdu->req = rq;
- io_uring_cmd_complete_in_task(cmd, ublk_rq_task_work_cb);
+ } else {
+ io_uring_cmd_complete_in_task(io->cmd, ublk_rq_task_work_cb);
}

return BLK_STS_OK;
@@ -814,20 +802,10 @@ static int ublk_init_hctx(struct blk_mq_hw_ctx *hctx, void *driver_data,
return 0;
}

-static int ublk_init_rq(struct blk_mq_tag_set *set, struct request *req,
- unsigned int hctx_idx, unsigned int numa_node)
-{
- struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);
-
- init_task_work(&data->work, ublk_rq_task_work_fn);
- return 0;
-}
-
static const struct blk_mq_ops ublk_mq_ops = {
.queue_rq = ublk_queue_rq,
.commit_rqs = ublk_commit_rqs,
.init_hctx = ublk_init_hctx,
- .init_request = ublk_init_rq,
};

static int ublk_ch_open(struct inode *inode, struct file *filp)
@@ -906,31 +884,46 @@ static void ublk_commit_completion(struct ublk_device *ub,
ublk_complete_rq(req);
}

-/*
- * When ->ubq_daemon is exiting, either new request is ended immediately,
- * or any queued io command is drained, so it is safe to abort queue
- * lockless
+/* do cleanup work for a dying(PF_EXITING) ubq_daemon:
+ * (1) end(abort) all 'inflight' rqs here.
+ * (2) complete ioucmd of all not 'inflight' rqs here.
+ *
+ * Note: we have set ubq->force_abort before. So ublk_queue_rq() must fail
+ * a rq immediately before it calls blk_mq_start_request() which will
+ * set a rq's status as 'inflight'. Therefore, set of 'inflight'
+ * rqs must not change for a dying ubq.
+ *
+ * Note: If we fail one rq in ublk_queue_rq(), we cannot fail it here again.
+ * This is because checking 'force_abort' happens before blk_mq_start_request()
+ * so its status is always 'idle' and never changes to 'inflight'.
+ *
+ * Also aborting may not be started yet, keep in mind that one failed
+ * request may be issued by block layer again.
*/
static void ublk_abort_queue(struct ublk_device *ub, struct ublk_queue *ubq)
{
int i;

+ WARN_ON_ONCE(!ubq->force_abort);
+
if (!ublk_get_device(ub))
return;

for (i = 0; i < ubq->q_depth; i++) {
- struct ublk_io *io = &ubq->ios[i];
-
- if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
- struct request *rq;
+ struct request *rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);

+ if (rq && blk_mq_request_started(rq)) {
/*
- * Either we fail the request or ublk_rq_task_work_fn
+ * Either we fail the request or ublk_queue_rq
* will do it
*/
- rq = blk_mq_tag_to_rq(ub->tag_set.tags[ubq->q_id], i);
- if (rq)
- __ublk_fail_req(io, rq);
+ pr_devel("%s: abort request: qid %d tag %d\n",
+ __func__, ubq->q_id, i);
+ blk_mq_end_request(rq, BLK_STS_IOERR);
+ } else {
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, i);
+ io_uring_cmd_done(ubq->ios[i].cmd, UBLK_IO_RES_ABORT, 0);
}
}
ublk_put_device(ub);
@@ -945,7 +938,18 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
struct ublk_queue *ubq = ublk_get_queue(ub, i);

- if (ubq_daemon_is_dying(ubq)) {
+ /* check force_abort so we do not abort a queue two times! */
+ if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq) && !ubq->force_abort) {
+ struct request_queue *q = ub->ub_disk->queue;
+
+ ubq->force_abort = true;
+
+ /* ensure that all ublk_queue_rq() calls see force_abort */
+ if (blk_queue_has_srcu(q))
+ synchronize_srcu(q->srcu);
+ else
+ synchronize_rcu();
+
schedule_work(&ub->stop_work);

/* abort queue is for making forward progress */
@@ -997,8 +1001,18 @@ static void ublk_cancel_dev(struct ublk_device *ub)
{
int i;

- for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ /*
+ * for ubq with a dying daemon, we have io_uring_cmd_done() all cmd in
+ * ublk_abort_queue(). Do not io_uring_cmd_done() cmd two times!
+ */
+ if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq))
+ continue;
+
ublk_cancel_queue(ublk_get_queue(ub, i));
+ }
}

static void ublk_stop_dev(struct ublk_device *ub)
@@ -1037,17 +1051,17 @@ static void ublk_handle_need_get_data(struct ublk_device *ub, int q_id,
int tag, struct io_uring_cmd *cmd)
{
struct ublk_queue *ubq = ublk_get_queue(ub, q_id);
- struct request *req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag);
+ struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
+
+ pdu->q_id = q_id;
+ pdu->tag = tag;

if (ublk_can_use_task_work(ubq)) {
- struct ublk_rq_data *data = blk_mq_rq_to_pdu(req);

+ init_task_work(&pdu->work, ublk_rq_task_work_fn);
/* should not fail since we call it just in ubq->ubq_daemon */
- task_work_add(ubq->ubq_daemon, &data->work, TWA_SIGNAL_NO_IPI);
+ task_work_add(ubq->ubq_daemon, &pdu->work, TWA_SIGNAL_NO_IPI);
} else {
- struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_cmd_pdu(cmd);
-
- pdu->req = req;
io_uring_cmd_complete_in_task(cmd, ublk_rq_task_work_cb);
}
}
@@ -1320,7 +1334,7 @@ static int ublk_add_tag_set(struct ublk_device *ub)
ub->tag_set.nr_hw_queues = ub->dev_info.nr_hw_queues;
ub->tag_set.queue_depth = ub->dev_info.queue_depth;
ub->tag_set.numa_node = NUMA_NO_NODE;
- ub->tag_set.cmd_size = sizeof(struct ublk_rq_data);
+ ub->tag_set.cmd_size = 0;
ub->tag_set.flags = BLK_MQ_F_SHOULD_MERGE;
ub->tag_set.driver_data = ub;
return blk_mq_alloc_tag_set(&ub->tag_set);
--
2.27.0

2022-08-24 05:52:37

by Ziyang Zhang

Subject: [RFC PATCH 2/9] ublk_drv: refactor ublk_cancel_queue()

Assume only a few FETCH_REQ ioucmds are sent to ublk_drv and then the
ubq_daemon exits. We have to call io_uring_cmd_done() for all received
ioucmds so that the io_uring ctx will not leak.

Since ublk_cancel_queue() may be called before START_DEV or after
STOP_DEV, we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so
that we won't call io_uring_cmd_done() twice for one ioucmd, avoiding a
UAF. Clearing UBLK_IO_FLAG_ACTIVE also makes the code more reasonable.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index c39b67d7133d..e08f636b0b9d 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -967,18 +967,23 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
{
int i;

- if (!ublk_queue_ready(ubq))
+ if (!ubq->nr_io_ready)
return;

for (i = 0; i < ubq->q_depth; i++) {
struct ublk_io *io = &ubq->ios[i];

- if (io->flags & UBLK_IO_FLAG_ACTIVE)
+ if (io->flags & UBLK_IO_FLAG_ACTIVE) {
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, i);
io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+ io->flags &= ~UBLK_IO_FLAG_ACTIVE;
+ ubq->nr_io_ready--;
+ }
}

/* all io commands are canceled */
- ubq->nr_io_ready = 0;
+ WARN_ON_ONCE(ubq->nr_io_ready);
}

/* Cancel all pending commands, must be called after del_gendisk() returns */
--
2.27.0

2022-08-24 05:52:43

by Ziyang Zhang

Subject: [RFC PATCH 9/9] ublk_drv: do not schedule monitor_work with recovery feature enabled

monitor_work is used to end(abort) rqs of all dying ubqs and clean up
the device. With the recovery feature enabled, we re-init dying ubqs in
the START_USER_RECOVERY command and we do not clean up the ublk device.
So scheduling monitor_work in START_DEV is unnecessary.

We manually schedule monitor_work in STOP_DEV regardless of whether the
recovery feature is enabled or not, so we can clean up everything in
the end.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index ccaf3761de74..f8bb2e818d33 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1435,7 +1435,8 @@ static int ublk_ctrl_start_dev(struct io_uring_cmd *cmd)

wait_for_completion_interruptible(&ub->completion);

- schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);
+ if (!ublk_can_use_recovery(ub))
+ schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);

mutex_lock(&ub->mutex);
if (ub->dev_info.state != UBLK_S_DEV_DEAD ||
--
2.27.0

2022-08-24 05:52:46

by Ziyang Zhang

Subject: [RFC PATCH 8/9] ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support

START_USER_RECOVERY and END_USER_RECOVERY are two new control commands
to support user recovery feature.

After a crash, the user should send START_USER_RECOVERY, which will:
(1) check the current ublk_device state and make sure all ubq_daemons
    are dying. We always expect a crash of the ublksrv 'process', not
    the exit of a single ubq_daemon 'thread'.
(2) quiesce the request queue so there is no incoming ublk_queue_rq().
(3) re-init all ubqs, including:
    (a) put the dying task(thread).
    (b) requeue/abort (depending on the user's setting; some backends
        can tolerate double writes) rqs issued to ublksrv before the
        crash.
    (c) requeue rqs dispatched after the crash.
    (d) complete ioucmds on tags not used.
(4) convert the state to UBLK_S_DEV_RECOVERING and reset ub->mm to NULL.

Then, the user should start a new 'process' and send FETCH_REQ on each
ubq_daemon.

After that, the user should send END_USER_RECOVERY, which will:
(1) wait for all new ubq_daemons to get ready.
(2) unquiesce the request queue and expect incoming ublk_queue_rq().
(3) convert the state to UBLK_S_DEV_LIVE.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 197 +++++++++++++++++++++++++++++++++++++++
1 file changed, 197 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 0ee871fa3f92..ccaf3761de74 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1796,6 +1796,197 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)
return ret;
}

+/*
+ * reinit a ubq with a dying ubq_daemon for recovery. It must be called after ub is quiesced.
+ */
+static bool ublk_recover_rq(struct request *rq, void *data)
+{
+ struct ublk_queue *ubq = rq->mq_hctx->driver_data;
+ struct ublk_device *ub = ubq->dev;
+ struct ublk_io *io = &ubq->ios[rq->tag];
+
+ /* The request has been sent to ublksrv, and it is inflight. */
+ if (!(io->flags & UBLK_IO_FLAG_ACTIVE)) {
+ pr_devel("%s: req has sent to ublksrv, %s it: qid %d tag %d io_flags %x\n",
+ __func__,
+ ublk_can_use_recovery_reissue(ub) ? "requeue" : "abort",
+ ubq->q_id, rq->tag, io->flags);
+ /* requeue rq but never call ublk_queue_rq() until the queue is unquiesced.*/
+ if (ublk_can_use_recovery_reissue(ub)) {
+ blk_mq_requeue_request(rq, false);
+ blk_mq_delay_kick_requeue_list(rq->q,
+ UBLK_REQUEUE_DELAY_MS);
+ /* abort this req */
+ } else {
+ /*
+ * Later, rq may be issued again but ublk_queue_rq()
+ * cannot be called until we unquiesce the queue.
+ */
+ blk_mq_end_request(rq, BLK_STS_IOERR);
+ }
+ /*
+ * The request has not been sent to ublksrv, but it is inflight. So:
+ *
+ * (1) corresponding ublk_queue_rq() call happens after ubq_daemon is dying.
+ *
+ * (2) exit_task_work() runs task_work after ubq_daemon is dying; or
+ *
+ * (3) io_uring fallback_work is running in workqueue after ubq_daemon is dying.
+ *
+ * Therefore, we simply requeue the request.
+ *
+ * Note:
+ * (1) We do not io_uring_cmd_done() the old ioucmd here because (2) or (3)
+ * described above can do this.
+ *
+ * (2) We are safe to requeue rq since (2) or (3) only touches ioucmd(passed as
+ * an argument) first. Then they may detect that ubq_daemon is dying, return
+ * in advance and never touch rq(struct request), so no race on rq.
+ *
+ * (3) We requeue rq but ublk_queue_rq() is not called until the queue is
+ * unquiesced.
+ */
+ } else {
+ pr_devel("%s: requeue req: qid %d tag %d io_flags %x\n",
+ __func__, ubq->q_id, rq->tag, io->flags);
+ blk_mq_requeue_request(rq, false);
+ blk_mq_delay_kick_requeue_list(rq->q,
+ UBLK_REQUEUE_DELAY_MS);
+ }
+
+ /* forget everything now and be ready for new FETCH_REQ */
+ io->flags = 0;
+ io->cmd = NULL;
+ io->addr = 0;
+ ubq->nr_io_ready--;
+ return true;
+}
+
+static int ublk_ctrl_start_recovery(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ struct ublk_device *ub;
+ int ret = -EINVAL;
+ int i, j;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return ret;
+
+ mutex_lock(&ub->mutex);
+
+ if (!ublk_can_use_recovery(ub))
+ goto out_unlock;
+
+ if (ub->dev_info.state != UBLK_S_DEV_LIVE) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+
+ /* Recovery feature is for 'process' crash, so all ubq_daemon should be PF_EXITING */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ if (!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)))
+ goto out_unlock;
+ }
+
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+ pr_devel("%s: queue quiesced.\n", __func__);
+ ub->dev_info.state = UBLK_S_DEV_RECOVERING;
+ /* set to NULL, otherwise new ubq_daemon cannot mmap the io_cmd_buf */
+ ub->mm = NULL;
+ init_completion(&ub->completion);
+
+ blk_mq_tagset_busy_iter(&ub->tag_set, ublk_recover_rq, NULL);
+
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
+ struct ublk_queue *ubq = ublk_get_queue(ub, i);
+
+ WARN_ON_ONCE(!(ubq->ubq_daemon && ubq_daemon_is_dying(ubq)));
+
+ pr_devel("%s: dying ubq_daemon: qid %d\n", __func__, ubq->q_id);
+
+ /* old daemon is PF_EXITING, put it now */
+ put_task_struct(ubq->ubq_daemon);
+ /* otherwise ublk_ch_uring_cmd() won't accept new FETCH_REQ */
+ ubq->ubq_daemon = NULL;
+
+ for (j = 0; j < ubq->q_depth; j++) {
+ struct ublk_io *io = &ubq->ios[j];
+
+ /* This is an inflight rq, pass */
+ if (!io->cmd)
+ continue;
+ /*
+ * No request has been issued(allocated) on this tag. We are safe to
+ * io_uring_cmd_done() the old ioucmd because no one else can do this.
+ *
+ * Note that ublk_cancel_queue() calling io_uring_cmd_done() must
+ * only be called with NOT dying ubq_daemon.
+ */
+ pr_devel("%s: done old cmd: qid %d tag %d\n",
+ __func__, ubq->q_id, j);
+ io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
+
+ /* forget everything now and be ready for new FETCH_REQ */
+ io->flags = 0;
+ io->cmd = NULL;
+ io->addr = 0;
+ ubq->nr_io_ready--;
+ }
+
+ WARN_ON_ONCE(ubq->nr_io_ready);
+ ub->nr_queues_ready--;
+ }
+
+ WARN_ON_ONCE(ub->nr_queues_ready);
+ ret = 0;
+
+out_unlock:
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return ret;
+}
+
+static int ublk_ctrl_end_recovery(struct io_uring_cmd *cmd)
+{
+ struct ublksrv_ctrl_cmd *header = (struct ublksrv_ctrl_cmd *)cmd->cmd;
+ int ublksrv_pid = (int)header->data[0];
+ struct ublk_device *ub;
+ int ret = -EINVAL;
+
+ ub = ublk_get_device_from_id(header->dev_id);
+ if (!ub)
+ return ret;
+
+ pr_devel("%s: Waiting for new ubq_daemon is ready...\n", __func__);
+ /* wait until new ubq_daemon sending all FETCH_REQ */
+ wait_for_completion_interruptible(&ub->completion);
+ pr_devel("%s: All new ubq_daemon is ready.\n", __func__);
+
+ mutex_lock(&ub->mutex);
+
+ if (!ublk_can_use_recovery(ub))
+ goto out_unlock;
+
+ if (ub->dev_info.state != UBLK_S_DEV_RECOVERING) {
+ ret = -EBUSY;
+ goto out_unlock;
+ }
+ ub->dev_info.ublksrv_pid = ublksrv_pid;
+ pr_devel("%s: new ublksrv_pid %d\n", __func__,
+ ub->dev_info.ublksrv_pid);
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+ pr_devel("%s: queue unquiesced.\n", __func__);
+ ub->dev_info.state = UBLK_S_DEV_LIVE;
+ ret = 0;
+out_unlock:
+ mutex_unlock(&ub->mutex);
+ ublk_put_device(ub);
+ return ret;
+}
+
static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
unsigned int issue_flags)
{
@@ -1837,6 +2028,12 @@ static int ublk_ctrl_uring_cmd(struct io_uring_cmd *cmd,
case UBLK_CMD_SET_PARAMS:
ret = ublk_ctrl_set_params(cmd);
break;
+ case UBLK_CMD_START_USER_RECOVERY:
+ ret = ublk_ctrl_start_recovery(cmd);
+ break;
+ case UBLK_CMD_END_USER_RECOVERY:
+ ret = ublk_ctrl_end_recovery(cmd);
+ break;
default:
break;
}
--
2.27.0

2022-08-24 05:52:54

by Ziyang Zhang

Subject: [RFC PATCH 7/9] ublk_drv: define macros for recovery feature and check them

Define some macros for the recovery feature. In particular, define a
new state, UBLK_S_DEV_RECOVERING, which implies that the ublk_device is
recovering.

UBLK_F_USER_RECOVERY implies that:
(1) ublk_drv enables the recovery feature. It won't schedule
    monitor_work to automatically abort rqs and release the device.
    Instead, it waits for the user's START_USER_RECOVERY ctrl-cmd.
(2) while re-initing a ubq, ublk_drv ends(aborts) rqs issued to
    userspace(ublksrv) before the crash.
(3) while re-initing a ubq, ublk_drv requeues rqs dispatched after the
    crash.

UBLK_F_USER_RECOVERY_REISSUE implies that:
(1) everything UBLK_F_USER_RECOVERY implies, except that
(2) ublk_drv requeues (instead of aborting) rqs issued to
    userspace(ublksrv) before the crash.

UBLK_F_USER_RECOVERY_REISSUE is designed for backends that:
(1) tolerate double-writes, because we may issue the same rq twice.
(2) cannot let frontend users get an I/O error, such as a read-only
    system.

For now, we do not allow STOP_DEV while we are in
UBLK_S_DEV_RECOVERING. This means that the user must assign a new
ubq_daemon for each ubq after sending the START_USER_RECOVERY ctrl-cmd.

Also modify checks on state in START_DEV and SET_PARAMS because now we
have three states.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 47 ++++++++++++++++++++++++++++++-----
include/uapi/linux/ublk_cmd.h | 7 ++++++
2 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 4bbd97ccaedf..0ee871fa3f92 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -49,7 +49,9 @@
/* All UBLK_F_* have to be included into UBLK_F_ALL */
#define UBLK_F_ALL (UBLK_F_SUPPORT_ZERO_COPY \
| UBLK_F_URING_CMD_COMP_IN_TASK \
- | UBLK_F_NEED_GET_DATA)
+ | UBLK_F_NEED_GET_DATA \
+ | UBLK_F_USER_RECOVERY \
+ | UBLK_F_USER_RECOVERY_REISSUE)

/* All UBLK_PARAM_TYPE_* should be included here */
#define UBLK_PARAM_TYPE_ALL (UBLK_PARAM_TYPE_BASIC | UBLK_PARAM_TYPE_DISCARD)
@@ -322,6 +324,33 @@ static inline int ublk_queue_cmd_buf_size(struct ublk_device *ub, int q_id)
PAGE_SIZE);
}

+/*
+ * TODO: UBLK_F_USER_RECOVERY should be a flag for device, not for queue,
+ * since "some queues are aborted while others are recovered" is really weird.
+ */
+static inline bool ublk_can_use_recovery(struct ublk_device *ub)
+{
+ struct ublk_queue *ubq = ublk_get_queue(ub, 0);
+
+ if (ubq->flags & UBLK_F_USER_RECOVERY)
+ return true;
+ return false;
+}
+
+/*
+ * TODO: UBLK_F_USER_RECOVERY_REISSUE should be a flag for device, not for queue,
+ * since "some queues are aborted while others are recovered" is really weird.
+ */
+static inline bool ublk_can_use_recovery_reissue(struct ublk_device *ub)
+{
+ struct ublk_queue *ubq = ublk_get_queue(ub, 0);
+
+ if ((ubq->flags & UBLK_F_USER_RECOVERY) &&
+ (ubq->flags & UBLK_F_USER_RECOVERY_REISSUE))
+ return true;
+ return false;
+}
+
static void ublk_free_disk(struct gendisk *disk)
{
struct ublk_device *ub = disk->private_data;
@@ -1029,10 +1058,15 @@ static void ublk_stop_dev(struct ublk_device *ub)
{
mutex_lock(&ub->mutex);
/* No gendisk is live. ubq may be ready or not */
- if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+ if (ub->dev_info.state == UBLK_S_DEV_DEAD) {
goto out_cancel_dev;
-
- mod_delayed_work(system_wq, &ub->monitor_work, 0);
+ /* TODO: support stop_dev just after start_recovery */
+ } else if (ub->dev_info.state == UBLK_S_DEV_RECOVERING) {
+ goto out_unlock;
+ /* schedule monitor_work to abort any dying queue */
+ } else {
+ mod_delayed_work(system_wq, &ub->monitor_work, 0);
+ }
pr_devel("%s: Wait for all requests ended...\n", __func__);
blk_mq_freeze_queue(ub->ub_disk->queue);
ub->dev_info.state = UBLK_S_DEV_DEAD;
@@ -1044,6 +1078,7 @@ static void ublk_stop_dev(struct ublk_device *ub)
ub->ub_disk = NULL;
out_cancel_dev:
ublk_cancel_dev(ub);
+ out_unlock:
mutex_unlock(&ub->mutex);
}

@@ -1403,7 +1438,7 @@ static int ublk_ctrl_start_dev(struct io_uring_cmd *cmd)
schedule_delayed_work(&ub->monitor_work, UBLK_DAEMON_MONITOR_PERIOD);

mutex_lock(&ub->mutex);
- if (ub->dev_info.state == UBLK_S_DEV_LIVE ||
+ if (ub->dev_info.state != UBLK_S_DEV_DEAD ||
test_bit(UB_STATE_USED, &ub->state)) {
ret = -EEXIST;
goto out_unlock;
@@ -1746,7 +1781,7 @@ static int ublk_ctrl_set_params(struct io_uring_cmd *cmd)

/* parameters can only be changed when device isn't live */
mutex_lock(&ub->mutex);
- if (ub->dev_info.state == UBLK_S_DEV_LIVE) {
+ if (ub->dev_info.state != UBLK_S_DEV_DEAD) {
ret = -EACCES;
} else if (copy_from_user(&ub->params, argp, ph.len)) {
ret = -EFAULT;
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index 677edaab2b66..7f7e6f44cec5 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -17,6 +17,8 @@
#define UBLK_CMD_STOP_DEV 0x07
#define UBLK_CMD_SET_PARAMS 0x08
#define UBLK_CMD_GET_PARAMS 0x09
+#define UBLK_CMD_START_USER_RECOVERY 0x10
+#define UBLK_CMD_END_USER_RECOVERY 0x11

/*
* IO commands, issued by ublk server, and handled by ublk driver.
@@ -74,9 +76,14 @@
*/
#define UBLK_F_NEED_GET_DATA (1UL << 2)

+#define UBLK_F_USER_RECOVERY (1UL << 3)
+
+#define UBLK_F_USER_RECOVERY_REISSUE (1UL << 4)
+
/* device state */
#define UBLK_S_DEV_DEAD 0
#define UBLK_S_DEV_LIVE 1
+#define UBLK_S_DEV_RECOVERING 2

/* shipped via sqe->cmd of io_uring command */
struct ublksrv_ctrl_cmd {
--
2.27.0

2022-08-24 05:53:20

by Ziyang Zhang

Subject: [RFC PATCH 6/9] ublk_drv: add pr_devel() to prepare for recovery feature

The recovery feature needs to correctly complete all old ioucmds, so
print some messages for each ioucmd to track its status.

After the io_uring ctx is freed, the /dev/ublkcX file can be released.
Print while opening/releasing the file struct to make sure there is no
resource leakage.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index df8751ea3711..4bbd97ccaedf 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -572,6 +572,9 @@ static void ublk_complete_rq(struct request *req)
struct ublk_io *io = &ubq->ios[req->tag];
unsigned int unmapped_bytes;

+ pr_devel("%s complete req: qid %d tag %d io_flags %d\n",
+ __func__, ubq->q_id, req->tag, io->flags);
+
/* failed read IO if nothing is read */
if (!io->res && req_op(req) == REQ_OP_READ)
io->res = -EIO;
@@ -620,6 +623,9 @@ static void ubq_complete_io_cmd(struct ublk_io *io, int res)
*/
io->flags &= ~UBLK_IO_FLAG_ACTIVE;

+ pr_devel("%s: complete: op %d, res %d io_flags %x\n",
+ __func__, io->cmd->cmd_op, res, io->flags);
+
/* tell ublksrv one io request is coming */
io_uring_cmd_done(io->cmd, res, 0);
}
@@ -816,6 +822,7 @@ static int ublk_ch_open(struct inode *inode, struct file *filp)
if (test_and_set_bit(UB_STATE_OPEN, &ub->state))
return -EBUSY;
filp->private_data = ub;
+ pr_devel("%s: /dev/ublkc%d opened.\n", __func__, ub->dev_info.dev_id);
return 0;
}

@@ -824,6 +831,7 @@ static int ublk_ch_release(struct inode *inode, struct file *filp)
struct ublk_device *ub = filp->private_data;

clear_bit(UB_STATE_OPEN, &ub->state);
+ pr_devel("%s: /dev/ublkc%d released.\n", __func__, ub->dev_info.dev_id);
return 0;
}

@@ -942,6 +950,8 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
if (ubq->ubq_daemon && ubq_daemon_is_dying(ubq) && !ubq->force_abort) {
struct request_queue *q = ub->ub_disk->queue;

+ pr_devel("%s: find dying ubq_daemon: qid %d\n",
+ __func__, ubq->q_id);
ubq->force_abort = true;

/* ensure that all ublk_queue_rq() calls see force_abort */
@@ -1043,12 +1053,16 @@ static void ublk_mark_io_ready(struct ublk_device *ub, struct ublk_queue *ubq)
mutex_lock(&ub->mutex);
ubq->nr_io_ready++;
if (ublk_queue_ready(ubq)) {
+ pr_devel("%s: ubq %d ready\n", __func__, ubq->q_id);
ubq->ubq_daemon = current;
get_task_struct(ubq->ubq_daemon);
ub->nr_queues_ready++;
}
- if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues)
+ if (ub->nr_queues_ready == ub->dev_info.nr_hw_queues) {
+ pr_devel("%s: all queues(%d) ready\n",
+ __func__, ub->dev_info.nr_hw_queues);
complete_all(&ub->completion);
+ }
mutex_unlock(&ub->mutex);
}

--
2.27.0

2022-08-24 05:59:52

by Ziyang Zhang

Subject: [RFC PATCH 1/9] ublk_drv: check 'current' instead of 'ubq_daemon'

This check is not atomic, so with the recovery feature, ubq_daemon may
be updated simultaneously by the recovery task. Instead, checking
'current' is safe here because 'current' never changes.

Also add a comment explaining this check, which is really important for
understanding the recovery feature.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 6a4a94b4cdf4..c39b67d7133d 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req)
struct ublk_device *ub = ubq->dev;
int tag = req->tag;
struct ublk_io *io = &ubq->ios[tag];
- bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq);
unsigned int mapped_bytes;

pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
__func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
ublk_get_iod(ubq, req->tag)->addr);

- if (unlikely(task_exiting)) {
+ /*
+ * Task is exiting if either:
+ *
+ * (1) current != ubq_daemon.
+ * io_uring_cmd_complete_in_task() tries to run task_work
+ * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING.
+ *
+ * (2) current->flags & PF_EXITING.
+ */
+ if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {
blk_mq_end_request(req, BLK_STS_IOERR);
mod_delayed_work(system_wq, &ub->monitor_work, 0);
return;
--
2.27.0

2022-08-24 06:00:42

by Ziyang Zhang

Subject: [RFC PATCH 5/9] ublk_drv: refactor ublk_stop_dev()

Since we rely on a rq's status to end(abort) it in monitor_work, calling
del_gendisk() too early can result in a UAF. So add blk_mq_freeze_queue()
to end all inflight rqs first. del_gendisk() can then be safely called
with a frozen queue.

Note: blk_mq_freeze_queue() blocks until:
(1) all rqs are aborted and ioucmds are completed(freed) on a dying ubq
    by the scheduled monitor_work.
(2) all rqs are ended on other ubqs.

We cancel monitor_work before del_gendisk() because monitor_work also
touches rq status. monitor_work is guaranteed to be canceled because we
mark the ub's state as UBLK_S_DEV_DEAD.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 1b1c0190bba4..df8751ea3711 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -1018,18 +1018,23 @@ static void ublk_cancel_dev(struct ublk_device *ub)
static void ublk_stop_dev(struct ublk_device *ub)
{
mutex_lock(&ub->mutex);
- if (ub->dev_info.state != UBLK_S_DEV_LIVE)
- goto unlock;
+ /* No gendisk is live. ubq may be ready or not */
+ if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+ goto out_cancel_dev;

- del_gendisk(ub->ub_disk);
+ mod_delayed_work(system_wq, &ub->monitor_work, 0);
+ pr_devel("%s: Wait for all requests ended...\n", __func__);
+ blk_mq_freeze_queue(ub->ub_disk->queue);
ub->dev_info.state = UBLK_S_DEV_DEAD;
+ cancel_delayed_work_sync(&ub->monitor_work);
+ pr_devel("%s: All requests are ended.\n", __func__);
+ del_gendisk(ub->ub_disk);
ub->dev_info.ublksrv_pid = -1;
put_disk(ub->ub_disk);
ub->ub_disk = NULL;
- unlock:
+ out_cancel_dev:
ublk_cancel_dev(ub);
mutex_unlock(&ub->mutex);
- cancel_delayed_work_sync(&ub->monitor_work);
}

/* device can only be started after all IOs are ready */
--
2.27.0

2022-08-24 06:14:53

by Ziyang Zhang

Subject: [RFC PATCH 3/9] ublk_drv: add a helper to get ioucmd from pdu

We store the pointer of the task_work in the pdu, and we need a helper
to get the ioucmd back from the pdu since we are preparing to pass only
the ioucmd to the task_work function.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e08f636b0b9d..8add6e3ae15f 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -555,6 +555,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
}

+static inline struct io_uring_cmd *ublk_uring_cmd_from_pdu(
+ struct ublk_uring_cmd_pdu *pdu)
+{
+ return container_of((u8 *)pdu, struct io_uring_cmd, pdu[0]);
+}
+
static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
{
return ubq->ubq_daemon->flags & PF_EXITING;
--
2.27.0

2022-08-29 02:12:56

by Ming Lei

Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
> ublk_drv is a driver that simply passes all blk-mq rqs to ublksrv[1] in
> userspace. For each ublk queue, there is one ubq_daemon(pthread).
> All ubq_daemons share the same process which opens /dev/ublkcX.
> The ubq_daemon code infinitely loops on io_uring_enter() to
> send/receive io_uring cmds which pass information of blk-mq
> rqs.
>
> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
> must abort the dying ubq, stop the device and release everything.
> This is not a good choice in practice because users do not expect
> aborted requests, I/O errors and a released device. They may want
> a recovery mechanism so that no requests are aborted and no I/O
> errors occur. In short, users just want everything to work as usual.

I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
won't be deleted & re-added from user viewpoint after user recovery,
so the device context won't be lost.

>
> This RFC patchset implements USER_RECOVERY support. If the process
> crashes, we allow ublksrv to provide new process and ubq_daemons.
> We do not support single ubq_daemon(pthread) recovery because a
> pthread rarely crashes.
>
> The recovery feature is quite useful for products that do not expect
> to return any I/O error to frontend users.

That looks like one very ideal requirement. To be honest, no block driver
can guarantee that 100%, so is it just one soft requirement?

Cause memory allocation may fail, network may be disconnected,
re-creating pthread or process may fail too, ...

> In detail, we support
> this scenario:
> (1) The /dev/ublkc0 is opened by process 0;
> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
>     rqs are handled by process 0;
> (3) Process 0 suddenly crashes (e.g. segfault);
> (4) Fio is still running and submits IOs (but these IOs cannot
>     complete now);
> (5) The user recovers with process 1 and attaches it to /dev/ublkc0;
> (6) All rqs are handled by process 1 now and IOs can be
>     completed.
>
> Note: The backend must tolerate double-writes because we may re-issue
> a rq that was already sent to the old (dying) process. We allow users
> to choose whether to re-issue these rqs or not; please see patch 7 for
> more detail.
>
> We provide a sample script here to simulate the above steps:
>
> ***************************script***************************
> LOOPS=10
>
> __ublk_get_pid() {
> 	pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
> 	echo $pid
> }
>
> ublk_recover_kill()
> {
> 	for CNT in `seq $LOOPS`; do
> 		dmesg -C
> 		pid=`__ublk_get_pid`
> 		echo -e "*** kill $pid now ***"
> 		kill -9 $pid
> 		sleep 2
> 		echo -e "*** recover now ***"
> 		./ublk recover -n 0

The current behavior is that /dev/ublkb* is removed after device is
aborted because ubq daemon is killed.

What if 'ublk recover' command isn't sent? So the current behavior
without recovery is changed? Or just changed with this feature enabled?

BTW, I do not mean the change isn't reasonable, but I suggest documenting
the user-visible change, so it can get reviewed from both the user
viewpoint and the technical viewpoint.

> 		sleep 4
> 	done
> }
>
> ublk_test()
> {
> 	dmesg -C
> 	echo -e "*** add ublk device ***"
> 	./ublk add -t null -d 4 -i 1
> 	sleep 2
> 	echo -e "*** start fio ***"
> 	fio --bs=4k \
> 		--filename=/dev/ublkb0 \
> 		--runtime=100s \
> 		--rw=read &
> 	sleep 4
> 	ublk_recover_kill
> 	wait
> 	echo -e "*** delete ublk device ***"
> 	./ublk del -n 0
> }
>
> for CNT in `seq 4`; do
> 	modprobe -rv ublk_drv
> 	modprobe ublk_drv
> 	echo -e "************ round $CNT ************"
> 	ublk_test
> 	sleep 5
> done
> ***************************script***************************
>
> You may run it with our modified ublksrv[3] which supports the
> recovery feature. No I/O error occurs and you can verify this
> by typing
> $ perf-tools/bin/tpoint block:block_rq_error
>
> The basic idea of USER_RECOVERY is quite straightforward:
>
> (1) release/free everything that belongs to the dying process.
>
> Note: Since ublk_drv saves information about the user process, this
> work is important because we don't expect any resource leakage.
> Particularly, ioucmds from the dying ubq_daemons need to be
> completed(freed). The current ublk_drv code cannot satisfy our needs
> while considering USER_RECOVERY, so we refactor some code, shown in
> patches 1-5, to gracefully free all ioucmds.
>
> (2) re-init ublk queues, including requeuing/aborting rqs.
>
> (3) allow new ubq_daemons to issue FETCH_REQ.
>
> Here are the steps to recover:
>
> (1) For a user, after a process crash (how the crash is detected is not
> related to this patchset), he sends the START_USER_RECOVERY ctrl-cmd to

I'd suggest describing the crash detector a bit at least. As one whole use
case, the crash detector should be the input of the user recovery use case,
which is usually one part of the use case when modeling software
requirements/design.

Such as: is a crash detected after the ubq daemon pthread/process has
crashed? Will you consider io hang in the daemon pthread/process? IMO,
long-term, the crash detector utility should be part of ublksrv.

We don't implement the ublk driver's IO timeout yet, but that implementation
may be closely related to this recovery feature. For example, one simple
approach is to kill the ubq-daemon if we can't move on after retrying
several times, then let userspace detect & recover.


Thanks,
Ming

2022-08-29 02:15:16

by Ming Lei

Subject: Re: [RFC PATCH 1/9] ublk_drv: check 'current' instead of 'ubq_daemon'

On Wed, Aug 24, 2022 at 01:47:36PM +0800, ZiyangZhang wrote:
> This check is not atomic, so with the recovery feature, ubq_daemon may
> be updated simultaneously by the recovery task. Instead, checking
> 'current' is safe here because 'current' never changes.
>
> Also add a comment explaining this check, which is really important for
> understanding the recovery feature.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 12 ++++++++++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index 6a4a94b4cdf4..c39b67d7133d 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -645,14 +645,22 @@ static inline void __ublk_rq_task_work(struct request *req)
> struct ublk_device *ub = ubq->dev;
> int tag = req->tag;
> struct ublk_io *io = &ubq->ios[tag];
> - bool task_exiting = current != ubq->ubq_daemon || ubq_daemon_is_dying(ubq);
> unsigned int mapped_bytes;
>
> pr_devel("%s: complete: op %d, qid %d tag %d io_flags %x addr %llx\n",
> __func__, io->cmd->cmd_op, ubq->q_id, req->tag, io->flags,
> ublk_get_iod(ubq, req->tag)->addr);
>
> - if (unlikely(task_exiting)) {
> + /*
> + * Task is exiting if either:
> + *
> + * (1) current != ubq_daemon.
> + * io_uring_cmd_complete_in_task() tries to run task_work
> + * in a workqueue if ubq_daemon(cmd's task) is PF_EXITING.
> + *
> + * (2) current->flags & PF_EXITING.
> + */
> + if (unlikely(current != ubq->ubq_daemon || current->flags & PF_EXITING)) {

Reviewed-by: Ming Lei <[email protected]>


Thanks,
Ming

2022-08-29 03:07:14

by Ming Lei

Subject: Re: [RFC PATCH 2/9] ublk_drv: refactor ublk_cancel_queue()

On Wed, Aug 24, 2022 at 01:47:37PM +0800, ZiyangZhang wrote:
> Assume only a few FETCH_REQ ioucmds are sent to ublk_drv and then the
> ubq_daemon exits. We have to call io_uring_cmd_done() for all received
> ioucmds so that the io_uring ctx will not leak.
>
> Since ublk_cancel_queue() may be called before START_DEV or after
> STOP_DEV, we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so
> that we won't call io_uring_cmd_done() twice for one ioucmd, avoiding a
> UAF. Clearing UBLK_IO_FLAG_ACTIVE also makes the code more reasonable.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index c39b67d7133d..e08f636b0b9d 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -967,18 +967,23 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
> {
> int i;
>
> - if (!ublk_queue_ready(ubq))
> + if (!ubq->nr_io_ready)
> return;
>
> for (i = 0; i < ubq->q_depth; i++) {
> struct ublk_io *io = &ubq->ios[i];
>
> - if (io->flags & UBLK_IO_FLAG_ACTIVE)
> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {
> + pr_devel("%s: done old cmd: qid %d tag %d\n",
> + __func__, ubq->q_id, i);
> io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
> + io->flags &= ~UBLK_IO_FLAG_ACTIVE;
> + ubq->nr_io_ready--;
> + }
> }
>
> /* all io commands are canceled */
> - ubq->nr_io_ready = 0;
> + WARN_ON_ONCE(ubq->nr_io_ready);

The change looks fine, but suggest to add comment like the
following given the above WARN_ON_ONCE() change isn't obvious.

```
1) ublk_cancel_dev() is called before sending START_DEV(), ->mutex
provides protection on above update.

2) ublk_cancel_dev() is called after sending START_DEV(), disk is
deleted first, UBLK_IO_RES_ABORT is returned so that any new io
command can't be issued to driver, so updating on io flags and
nr_io_ready is safe here

Also ->nr_io_ready is guaranteed to become zero after ublk_cancel_queue
returns since request queue is either frozen or not present in both two cases.

```


Thanks,
Ming

2022-08-29 03:23:40

by Ming Lei

Subject: Re: [RFC PATCH 3/9] ublk_drv: add a helper to get ioucmd from pdu

On Wed, Aug 24, 2022 at 01:47:38PM +0800, ZiyangZhang wrote:
> We store pointer of task_work in pdu. And we should get ioucmd from pdu
> since we prepare to only pass ioucmd to task_work function.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> drivers/block/ublk_drv.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index e08f636b0b9d..8add6e3ae15f 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -555,6 +555,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
> return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
> }
>
> +static inline struct io_uring_cmd *ublk_uring_cmd_from_pdu(
> + struct ublk_uring_cmd_pdu *pdu)
> +{
> + return container_of((u8 *)pdu, struct io_uring_cmd, pdu[0]);
> +}
> +

Patch isn't supposed to be written in this way, it is one helper, either
change its caller in this patch, or merge this one with the patch which
applies it.

Also it looks like this change belongs in include/linux/io_uring.h if you think
it is useful.

thanks,
Ming

2022-08-29 04:37:15

by Ziyang Zhang

Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

On 2022/8/29 10:08, Ming Lei wrote:
> On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
>> ublk_drv is a driver that simply passes all blk-mq rqs to ublksrv[1] in
>> userspace. For each ublk queue, there is one ubq_daemon(pthread).
>> All ubq_daemons share the same process which opens /dev/ublkcX.
>> The ubq_daemon code infinitely loops on io_uring_enter() to
>> send/receive io_uring cmds which pass information of blk-mq
>> rqs.
>>
>> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
>> must abort the dying ubq, stop the device and release everything.
>> This is not a good choice in practice because users do not expect
>> aborted requests, I/O errors and a released device. They may want
>> a recovery mechanism so that no requests are aborted and no I/O
>> errors occur. In short, users just want everything to work as usual.
>
> I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
> won't be deleted & re-added from user viewpoint after user recovery,
> so the device context won't be lost.

Yes, after the 'process' is killed or crashes (such as on a segmentation
fault), neither /dev/ublkb0 nor /dev/ublkc0 is deleted.

>
>>
>> This RFC patchset implements USER_RECOVERY support. If the process
>> crashes, we allow ublksrv to provide new process and ubq_daemons.
>> We do not support single ubq_daemon(pthread) recovery because a
>> pthread rarely crashes.
>>
>> The recovery feature is quite useful for products that do not expect
>> to return any I/O error to frontend users.
>
> That looks one very ideal requirement. To be honest, no any block driver
> can guarantee that 100%, so it is just one soft requirement?
>
> Cause memory allocation may fail, network may be disconnected,
> re-creating pthread or process may fail too, ...

Yes, I know there are many other problems which may cause a failure.

The recovery mechanism only guarantees that rqs sent to ublksrv
before the crash are not aborted. Instead, ublk_drv re-issues the request
itself and fio does not know about it. Of course the backend must
tolerate a double-write/read.

>
>> In detail, we support
>> this scenario:
>> (1) The /dev/ublkc0 is opened by process 0;
>> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
>> rqs are handled by process 0.
>> (3) Process 0 suddenly crashes(e.g. segfault);
>> (4) Fio is still running and submit IOs(but these IOs cannot
>> complete now)
>> (5) User recovers with process 1 and attach it to /dev/ublkc0
>> (6) All rqs are handled by process 1 now and IOs can be
>> completed now.
>>
>> Note: The backend must tolerate double-write because we re-issue
>> a rq sent to the old(dying) process before. We allow users to
>> choose whether re-issue these rqs or not, please see patch 7 for
>> more detail.
>>
>> We provide a sample script here to simulate the above steps:
>>
>> ***************************script***************************
>> LOOPS=10
>>
>> __ublk_get_pid() {
>> pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
>> echo $pid
>> }
>>
>> ublk_recover_kill()
>> {
>> for CNT in `seq $LOOPS`; do
>> dmesg -C
>> pid=`__ublk_get_pid`
>> echo -e "*** kill $pid now ***"
>> kill -9 $pid
>> sleep 2
>> echo -e "*** recover now ***"
>> ./ublk recover -n 0
>
> The current behavior is that /dev/ublkb* is removed after device is
> aborted because ubq daemon is killed.
>
> What if 'ublk recover' command isn't sent? So the current behavior
> without recovery is changed? Or just changed with this feature enabled?

No, I do not change the default behavior. You can verify this by running
generic/002 and generic/003. These tests pass with either recovery enabled
or disabled.


(1) With recovery disabled, the monitor_work scheduled periodically or
the STOP_DEV ctrl-cmd issued manually can clean up everything and remove the
gendisk.

(2) With recovery enabled, the monitor_work is not scheduled anymore, see
patch 9. So after a crash, all resources are still in the kernel.
Then, there are two options for a user:
(a) You don't want to recover it, so just send the STOP_DEV ctrl-cmd. This will
schedule monitor_work once and clean up everything. Please see patch 5.
(b) You want to recover it, so just send the START_RECOVERY ctrl-cmd. Then you
HAVE TO start a new process and send END_RECOVERY.

Note: Actually I am thinking about what happens if a user has sent START_RECOVERY
but fails to start a new process. I have a rough idea: just abort all rqs after we
unquiesce the request queue. But that is not included in this RFC patchset because
I want to make it simpler. Maybe we can consider it later?

>
> BTW, I do not mean the change isn't reasonable, but suggest to document
> the user visible change, so it can get reviewed from either user
> viewpoint and technical point.
>
>> sleep 4
>> done
>> }
>>
>> ublk_test()
>> {
>> dmesg -C
>> echo -e "*** add ublk device ***"
>> ./ublk add -t null -d 4 -i 1
>> sleep 2
>> echo -e "*** start fio ***"
>> fio --bs=4k \
>> --filename=/dev/ublkb0 \
>> --runtime=100s \
>> --rw=read &
>> sleep 4
>> ublk_recover_kill
>> wait
>> echo -e "*** delete ublk device ***"
>> ./ublk del -n 0
>> }
>>
>> for CNT in `seq 4`; do
>> modprobe -rv ublk_drv
>> modprobe ublk_drv
>> echo -e "************ round $CNT ************"
>> ublk_test
>> sleep 5
>> done
>> ***************************script***************************
>>
>> You may run it with our modified ublksrv[3] which supports
>> recovey feature. No I/O error occurs and you can verify it
>> by typing
>> $ perf-tools/bin/tpoint block:block_rq_error
>>
>> The basic idea of USER_RECOVERY is quite straightfoward:
>>
>> (1) release/free everything belongs to the dying process.
>>
>> Note: Since ublk_drv does save information about user process,
>> this work is important because we don't expect any resource
>> lekage. Particularly, ioucmds from the dying ubq_daemons
>> need to be completed(freed). Current ublk_drv code cannot satisfy
>> our need while considering USER_RECOVERY. So we refactor some code
>> shown in patch 1-5 to gracefully free all ioucmds.
>>
>> (2) init ublk queues including requeuing/aborting rqs.
>>
>> (3) allow new ubq_daemons issue FETCH_REQ.
>>
>> Here is steps to reocver:
>>
>> (1) For a user, after a process crash(how he detect a crash is not related
>> to this patchset), he sends START_USER_RECOVERY ctrl-cmd to
>
> I'd suggest to describe crash detector a bit at least, as one whole use case,
> crash detector should be the input of the use case of user recovery, which is
> usually one part of use case when modeling software requirement/design.

This patchset tries to answer only one question: after a process crash, how to
re-attach the device with another process. So I do not consider other questions
too much, such as:
(1) How to detect a crash?
(2) Is an IO hang a crash? Should we kill the process?
(3) What if a blk-mq rq times out? Does the process die? Should we kill the process?

I think we can answer them after kernel support of USER_RECOVERY is available.

For now I only try to directly kill the process in testcases and manually inject
a crash in handle_io_async().

>
> Such as, crash is detected after the ubq daemon pthread/process is crashed?
> Will you consider io hang in the daemon pthread/process? IMO, long-term,
> the crash detector utility should be part of ublksrv.

Yes, we should design a crash detector in ublksrv. For an IO hang, my idea is:

(1) the ublksrv_tgt code should handle it if the user runs ublksrv directly.

(2) the backend should handle it if the user only uses libublksrv and embeds it
inside the backend code.

>
> We don't implement ublk driver's IO timeout yet, but that implementation may be
> related with this recovery feature closely, such as, one simple approach is to
> kill ubq-daemon if we can't move on after retrying several times, then
> let userspace detect & recovery.

You mean ublk_drv can kill the ubq_daemon? I have not considered this case yet...

BTW, I don't think we should put too much logic (IO hang, detector) in ublk_drv
because it should only pass rqs through to userspace. We should keep ublk_drv simple.
Accepting a new daemon and re-attaching it to /dev/ublkb0 is what it can do, I think.


Regards,
Zhang

2022-08-29 05:02:46

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 3/9] ublk_drv: add a helper to get ioucmd from pdu

On 2022/8/29 11:06, Ming Lei wrote:
> On Wed, Aug 24, 2022 at 01:47:38PM +0800, ZiyangZhang wrote:
>> We store pointer of task_work in pdu. And we should get ioucmd from pdu
>> since we prepare to only pass ioucmd to task_work function.
>>
>> Signed-off-by: ZiyangZhang <[email protected]>
>> ---
>> drivers/block/ublk_drv.c | 6 ++++++
>> 1 file changed, 6 insertions(+)
>>
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index e08f636b0b9d..8add6e3ae15f 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -555,6 +555,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
>> return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
>> }
>>
>> +static inline struct io_uring_cmd *ublk_uring_cmd_from_pdu(
>> + struct ublk_uring_cmd_pdu *pdu)
>> +{
>> + return container_of((u8 *)pdu, struct io_uring_cmd, pdu[0]);
>> +}
>> +
>
> Patch isn't supposed to be written in this way, it is one helper, either
> change its caller in this patch, or merge this one wth the patch which
> applies it.
>
> Also looks this change belong to include/linux/io_uring.h if you think
> it is useful.

Maybe adding a helper in include/linux/io_uring.h is good since ioucmd and pdu are
only used by NVMe and ublk_drv. NVMe does not need to transform a pdu to an ioucmd.
But in ublk_drv, if we want to get the ioucmd in task work, this transformation
is needed:

struct callback_head *work
--> struct ublk_uring_cmd_pdu *pdu
--> struct io_uring_cmd *cmd
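
To make the chain concrete, here is a minimal sketch, assuming the pdu embeds
the callback_head as a member named 'work' (the layout and function below are
illustrative, not the exact patchset code):

```c
struct ublk_uring_cmd_pdu {
	struct callback_head work;	/* assumed member for illustration */
	/* ... */
};

static void ublk_rq_task_work_fn(struct callback_head *work)
{
	struct ublk_uring_cmd_pdu *pdu =
		container_of(work, struct ublk_uring_cmd_pdu, work);
	struct io_uring_cmd *cmd = ublk_uring_cmd_from_pdu(pdu);

	/* from here on, only 'cmd' needs to be passed around */
}
```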

Regards,
Zhang

2022-08-29 05:51:45

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 2/9] ublk_drv: refactor ublk_cancel_queue()

On 2022/8/29 11:01, Ming Lei wrote:
> On Wed, Aug 24, 2022 at 01:47:37PM +0800, ZiyangZhang wrote:
>> Assume only a few FETCH_REQ ioucmds are sent to ublk_drv, then the
>> ubq_daemon exits, We have to call io_uring_cmd_done() for all ioucmds
>> received so that io_uring ctx will not leak.
>>
>> ublk_cancel_queue() may be called before START_DEV or after STOP_DEV,
>> we decrease ubq->nr_io_ready and clear UBLK_IO_FLAG_ACTIVE so that we
>> won't call io_uring_cmd_done() twice for one ioucmd to avoid UAF. Also
>> clearing UBLK_IO_FLAG_ACTIVE makes the code more reasonable.
>>
>> Signed-off-by: ZiyangZhang <[email protected]>
>> ---
>> drivers/block/ublk_drv.c | 11 ++++++++---
>> 1 file changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index c39b67d7133d..e08f636b0b9d 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -967,18 +967,23 @@ static void ublk_cancel_queue(struct ublk_queue *ubq)
>> {
>> int i;
>>
>> - if (!ublk_queue_ready(ubq))
>> + if (!ubq->nr_io_ready)
>> return;
>>
>> for (i = 0; i < ubq->q_depth; i++) {
>> struct ublk_io *io = &ubq->ios[i];
>>
>> - if (io->flags & UBLK_IO_FLAG_ACTIVE)
>> + if (io->flags & UBLK_IO_FLAG_ACTIVE) {
>> + pr_devel("%s: done old cmd: qid %d tag %d\n",
>> + __func__, ubq->q_id, i);
>> io_uring_cmd_done(io->cmd, UBLK_IO_RES_ABORT, 0);
>> + io->flags &= ~UBLK_IO_FLAG_ACTIVE;
>> + ubq->nr_io_ready--;
>> + }
>> }
>>
>> /* all io commands are canceled */
>> - ubq->nr_io_ready = 0;
>> + WARN_ON_ONCE(ubq->nr_io_ready);
>
> The change looks fine, but suggest to add comment like the
> following given the above WARN_ON_ONCE() change isn't obvious.
>
> ```
> 1) ublk_cancel_dev() is called before sending START_DEV(), ->mutex
> provides protection on above update.
>
> 2) ublk_cancel_dev() is called after sending START_DEV(), disk is
> deleted first, UBLK_IO_RES_ABORT is returned so that any new io
> command can't be issued to driver, so updating on io flags and
> nr_io_ready is safe here
>
> Also ->nr_io_ready is guaranteed to become zero after ublk_cance_queue
> returns since request queue is either frozen or not present in both two cases.
>
> ```
>

Thanks for your advice, Ming.

Regards,
Zhang

2022-08-29 06:24:32

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

On Mon, Aug 29, 2022 at 12:00:49PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 10:08, Ming Lei wrote:
> > On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
> >> ublk_drv is a driver simply passes all blk-mq rqs to ublksrv[1] in
> >> userspace. For each ublk queue, there is one ubq_daemon(pthread).
> >> All ubq_daemons share the same process which opens /dev/ublkcX.
> >> The ubq_daemon code infinitely loops on io_uring_enter() to
> >> send/receive io_uring cmds which pass information of blk-mq
> >> rqs.
> >>
> >> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
> >> must abort the dying ubq, stop the device and release everything.
> >> This is not a good choice in practice because users do not expect
> >> aborted requests, I/O errors and a released device. They may want
> >> a recovery machenism so that no requests are aborted and no I/O
> >> error occurs. Anyway, users just want everything works as uaual.
> >
> > I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
> > won't be deleted & re-added from user viewpoint after user recovery,
> > so the device context won't be lost.
>
> Yes, after the 'process' is killed or crashed(such as segmentation fault)
> both /dev/ublkb0 and /dev/ublkc0 is not deleted.
>
> >
> >>
> >> This RFC patchset implements USER_RECOVERY support. If the process
> >> crashes, we allow ublksrv to provide new process and ubq_daemons.
> >> We do not support single ubq_daemon(pthread) recovery because a
> >> pthread rarely crashes.
> >>
> >> Recovery feature is quite useful for products do not expect to
> >> return any I/O error to frontend users.
> >
> > That looks one very ideal requirement. To be honest, no any block driver
> > can guarantee that 100%, so it is just one soft requirement?
> >
> > Cause memory allocation may fail, network may be disconnected,
> > re-creating pthread or process may fail too, ...
>
> Yes, I know there are many other problem which may cause a failure.
>
> The recovery mechanism only guarantees that rqs sent to ublksrv
> before crash are not aborted. Instead, ublk_drv re-issues the request
> itself and fio does not konw about it. Of course the backend must
> tolerate a double-write/read.

My comment is about 'do not expect to return any I/O error to frontend users',
and I still think it is just one soft requirement; no one can really
guarantee there isn't any error for frontend users.

>
> >
> >> In detail, we support
> >> this scenario:
> >> (1) The /dev/ublkc0 is opened by process 0;
> >> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
> >> rqs are handled by process 0.
> >> (3) Process 0 suddenly crashes(e.g. segfault);
> >> (4) Fio is still running and submit IOs(but these IOs cannot
> >> complete now)
> >> (5) User recovers with process 1 and attach it to /dev/ublkc0
> >> (6) All rqs are handled by process 1 now and IOs can be
> >> completed now.
> >>
> >> Note: The backend must tolerate double-write because we re-issue
> >> a rq sent to the old(dying) process before. We allow users to
> >> choose whether re-issue these rqs or not, please see patch 7 for
> >> more detail.
> >>
> >> We provide a sample script here to simulate the above steps:
> >>
> >> ***************************script***************************
> >> LOOPS=10
> >>
> >> __ublk_get_pid() {
> >> pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
> >> echo $pid
> >> }
> >>
> >> ublk_recover_kill()
> >> {
> >> for CNT in `seq $LOOPS`; do
> >> dmesg -C
> >> pid=`__ublk_get_pid`
> >> echo -e "*** kill $pid now ***"
> >> kill -9 $pid
> >> sleep 2
> >> echo -e "*** recover now ***"
> >> ./ublk recover -n 0
> >
> > The current behavior is that /dev/ublkb* is removed after device is
> > aborted because ubq daemon is killed.
> >
> > What if 'ublk recover' command isn't sent? So the current behavior
> > without recovery is changed? Or just changed with this feature enabled?
>
> No, I do not change the default behavior. You can verify this by running
> generic/002 and generic/003. These tests passes with either recovery enabled
> or disabled.
>
>
> (1) With recovery disabled, the monitor_work scheduled periodically or
> STOP_DEV ctrl-cmd issued manually can cleanup everything and remove the
> gendisk.
>
> (2)With recovery enabled, the monitor_work is not scheduled anymore, see
> patch 9. So after a crash,all resources are still in kernel.
> Then, there are two options for a user:
> (a) You don't want to recover it, so just send STOP_DEV ctrl-cmd. This will
> schedule monitor_work once and cleanup everything. Please see patch 5.

But what if people send nothing and start to reboot? Then we hang forever without
monitor_work involved.

> (b) You want to recover it, so just send START_RECOVERY ctrl-cmd. Then you
> HAVE TO start a new process and send END_RECOVERY.
>
> Note: Actually I am thinking what if a user has sent START_RECOVERY but he fails
> to start a new process. I have a rough idea: just abort all rqs after we unqiuesce
> the request queue. But that is not included in this RFC patchset because I
> want to make it simpler. Maybe we can consider it later?

It is pretty easy to fail all in-queue requests when user recovery
can't move on.

>
> >
> > BTW, I do not mean the change isn't reasonable, but suggest to document
> > the user visible change, so it can get reviewed from either user
> > viewpoint and technical point.
> >
> >> sleep 4
> >> done
> >> }
> >>
> >> ublk_test()
> >> {
> >> dmesg -C
> >> echo -e "*** add ublk device ***"
> >> ./ublk add -t null -d 4 -i 1
> >> sleep 2
> >> echo -e "*** start fio ***"
> >> fio --bs=4k \
> >> --filename=/dev/ublkb0 \
> >> --runtime=100s \
> >> --rw=read &
> >> sleep 4
> >> ublk_recover_kill
> >> wait
> >> echo -e "*** delete ublk device ***"
> >> ./ublk del -n 0
> >> }
> >>
> >> for CNT in `seq 4`; do
> >> modprobe -rv ublk_drv
> >> modprobe ublk_drv
> >> echo -e "************ round $CNT ************"
> >> ublk_test
> >> sleep 5
> >> done
> >> ***************************script***************************
> >>
> >> You may run it with our modified ublksrv[3] which supports
> >> recovey feature. No I/O error occurs and you can verify it
> >> by typing
> >> $ perf-tools/bin/tpoint block:block_rq_error
> >>
> >> The basic idea of USER_RECOVERY is quite straightfoward:
> >>
> >> (1) release/free everything belongs to the dying process.
> >>
> >> Note: Since ublk_drv does save information about user process,
> >> this work is important because we don't expect any resource
> >> lekage. Particularly, ioucmds from the dying ubq_daemons
> >> need to be completed(freed). Current ublk_drv code cannot satisfy
> >> our need while considering USER_RECOVERY. So we refactor some code
> >> shown in patch 1-5 to gracefully free all ioucmds.
> >>
> >> (2) init ublk queues including requeuing/aborting rqs.
> >>
> >> (3) allow new ubq_daemons issue FETCH_REQ.
> >>
> >> Here is steps to reocver:
> >>
> >> (1) For a user, after a process crash(how he detect a crash is not related
> >> to this patchset), he sends START_USER_RECOVERY ctrl-cmd to
> >
> > I'd suggest to describe crash detector a bit at least, as one whole use case,
> > crash detector should be the input of the use case of user recovery, which is
> > usually one part of use case when modeling software requirement/design.
>
> This patchset tries to answer only one question: After a process crash, how to
> re-attach the device by another process. So I do not consider other questions
> too much, such as:
> (1) How to detect a crash?
> (2) Is IO hang a crash? Should we kill the process?
> (3) What if a blk-mq rq timeout? Does the process dies? Should we kill the process?

But you have to define what a 'crash' is, otherwise how can you define
what is to be recovered?

So far please just define the crash as the whole daemon being dead
abnormally (without sending the stop command) if you don't have a better idea.

>
> I think we can answer them after kernel-support of USER_RECOVERY is available.
>
>
> For now I only try to directly kill the process in testcases and manually inject
> a crash in handle_io_async().
>
> >
> > Such as, crash is detected after the ubq daemon pthread/process is crashed?
> > Will you consider io hang in the daemon pthread/process? IMO, long-term,
> > the crash detector utility should be part of ublksrv.
>
> Yes, we should design a crash detector in ublksrv. For IO hang, my idea is that:
>
> (1) the ublksrv_tgt code should handle it if user runs ublksrv directly.
>
> (2) the backend should handle it if user only uses libublksrv and embeds it inside
> the backend code.
>
> >
> > We don't implement ublk driver's IO timeout yet, but that implementation may be
> > related with this recovery feature closely, such as, one simple approach is to
> > kill ubq-daemon if we can't move on after retrying several times, then
> > let userspace detect & recovery.
>
> You mean the ublk_drv can kill the ubq_daemon? I have not consider this case yet...
>
> BTW, I don't think we should put too much logic(IO hang, detector) in ublk_drv
> because it should only pass-through rqs to userpsace. We should make ublk_drv simple.
> Accept a new daemon and re-attach it to /dev/ublkb0 is what it can do I think.

Actually I was thinking of the use case of a container-aware ublk device where the
ADMIN privilege requirement can be removed, so people can do whatever
they want in ublksrv. But sooner or later, request timeout needs to be
considered for the real ublk use case.

thanks,
Ming

2022-08-29 06:26:25

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

On 2022/8/29 13:40, Ming Lei wrote:
> On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
>> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
>> this rq is actually handled by an io_uring fallback wq. We have to
>> end(abort) this rq since this fallback wq is a task other than the
>> crashed task. However, current code does not call io_uring_cmd_done()
>> at the same time but do it in ublk_cancel_queue(). With current design,
>> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
>> which waits for the rq ended(aborted) in fallback wq. This implies that
>> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
>> on the corresponding ioucmd in ublk_cancel_queue().
>
> Right.
>
>>
>> However, while considering recovery feature, we cannot rely on
>> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
>> rqs because we may not want any aborted rq. Besides, io_uring does not
>> provide "flush fallback" machenism so we cannot trace this ioucmd.
>
> Why not?
>
> If user recovery is enabled, del_gendisk() can be replaced with
> blk_mq_quiesce_queue(), then let abort work function do:
>
> - cancel all in-flight requests by holding them into requeue list
> instead of finishing them as before, and this way is safe because
> abort worker does know the ubq daemon is dying
> - cancel pending commands as before, because the situation is same
> with disk deleted or queue frozen

The problem is: we cannot control when the fallback wq is scheduled.
So it is unsafe to call io_uring_cmd_done() in another process.
Otherwise, there is a UAF, just as in
(5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
re-issued request aborted previously to ioucmd's task_work).

Yeah I know the answer is very simple: flush the fallback wq.
But here are two more questions:

(1) Should ublk_drv rely on the fallback wq mechanism?
IMO, ublk_drv should not know the details of io_uring_cmd_complete_in_task()
because its implementation may change in the future.
BTW, I think the current ublk_rq_task_work_cb() is not correct because
it does not always call io_uring_cmd_done() before returning.
nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd
no matter whether the rq succeeds or fails (see the sketch after question (2)).

(2) Suppose io_uring does export the symbol 'flush_fallback_work'; should we call
it before starting a new process (recovery)?
What if the fallback wq is not scheduled immediately because many processes are
running and the system overhead is heavy? In this case the recovery process
may wait for too long. Really, we should not depend on the fallback wq; please
let the fallback wq complete the ioucmd itself.
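
For reference, here is the nvme-style pattern mentioned in (1), paraphrased and
simplified rather than copied from the driver: the end_io handler completes the
ioucmd on every path, success or failure.

```c
static void uring_cmd_end_io(struct request *req, blk_status_t err)
{
	struct io_uring_cmd *ioucmd = req->end_io_data;

	/* complete the ioucmd whether the rq succeeded or failed */
	io_uring_cmd_done(ioucmd, blk_status_to_errno(err), 0);
	blk_mq_free_request(req);
}
```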

>
> With this way, the current abort logic won't be changed much.
>
> And user recovery should only be started _after_ ublk device is found
> as aborted.

START_RECOVERY will check whether all ubq_daemons (i.e. the process) are PF_EXITING.
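
Something along these lines, as a hedged sketch (the helper name is mine, not
the patch's):

```c
/* sketch: refuse to start recovery unless every ubq_daemon is dying */
static int ublk_check_daemons_exiting(struct ublk_device *ub)
{
	int i;

	for (i = 0; i < ub->dev_info.nr_hw_queues; i++) {
		struct ublk_queue *ubq = ublk_get_queue(ub, i);

		if (!ubq->ubq_daemon || !(ubq->ubq_daemon->flags & PF_EXITING))
			return -EBUSY;
	}
	return 0;
}
```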

>
>>
>> The recovery machenism needs to complete all ioucmds of a dying ubq
>> to avoid leaking io_uring ctx. But as talked above, we are unsafe
>> to call io_uring_cmd_done() in the recovery task if fallback wq happens
>> to run simultaneously. This is a UAF case because io_uring ctx may be
>> freed. Actually a similar case happens in
>> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
>> re-issued request aborted previously to ioucmd's task_work).
>
> If you take the above approach, I guess there isn't such problem because
> abort can handle the case well as before.

Ming, we did consider this approach (quiesce, requeue rq/complete ioucmd)
at the very beginning. But we decided to drop it because we do not want to
rely on the 'flush fallback wq' mechanism, which
makes ublk_drv depend on io_uring's internal implementation.

>
>>
>> Besides, in order to implement recovery machenism, in ublk_queue_rq()
>> and __ublk_rq_task_work(), we should not end(abort) current rq while
>> ubq_daemon is dying.
>
> Right, I believe one helper of ublk_abort_request() is helpful here.
>
>
> Thanks,
> Ming

2022-08-29 06:28:45

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
> this rq is actually handled by an io_uring fallback wq. We have to
> end(abort) this rq since this fallback wq is a task other than the
> crashed task. However, current code does not call io_uring_cmd_done()
> at the same time but do it in ublk_cancel_queue(). With current design,
> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
> which waits for the rq ended(aborted) in fallback wq. This implies that
> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
> on the corresponding ioucmd in ublk_cancel_queue().

Right.

>
> However, while considering recovery feature, we cannot rely on
> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> rqs because we may not want any aborted rq. Besides, io_uring does not
> provide "flush fallback" machenism so we cannot trace this ioucmd.

Why not?

If user recovery is enabled, del_gendisk() can be replaced with
blk_mq_quiesce_queue(), then let the abort work function do:

- cancel all in-flight requests by holding them on the requeue list
instead of finishing them as before; this way is safe because the
abort worker does know the ubq daemon is dying
- cancel pending commands as before, because the situation is the same
as with the disk deleted or the queue frozen

With this approach, the current abort logic won't be changed much.

And user recovery should only be started _after_ the ublk device is found
to be aborted.
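
A minimal sketch of the 'hold on the requeue list' idea above, assuming the
caller already knows the ubq daemon is dying (the helper names are illustrative):

```c
static void ublk_abort_or_requeue_rq(struct ublk_queue *ubq, struct request *rq)
{
	if (ublk_queue_can_use_recovery(ubq))	/* hypothetical flag check */
		/* park the rq; recovery kicks the requeue list later */
		blk_mq_requeue_request(rq, false);
	else
		blk_mq_end_request(rq, BLK_STS_IOERR);
}
```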

>
> The recovery machenism needs to complete all ioucmds of a dying ubq
> to avoid leaking io_uring ctx. But as talked above, we are unsafe
> to call io_uring_cmd_done() in the recovery task if fallback wq happens
> to run simultaneously. This is a UAF case because io_uring ctx may be
> freed. Actually a similar case happens in
> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work).

If you take the above approach, I guess there isn't such a problem because
abort can handle the case well, as before.

>
> Besides, in order to implement recovery machenism, in ublk_queue_rq()
> and __ublk_rq_task_work(), we should not end(abort) current rq while
> ubq_daemon is dying.

Right, I believe a helper like ublk_abort_request() would be helpful here.


Thanks,
Ming

2022-08-29 07:41:56

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

On 2022/8/29 13:56, Ming Lei wrote:
> On Mon, Aug 29, 2022 at 12:00:49PM +0800, Ziyang Zhang wrote:
>> On 2022/8/29 10:08, Ming Lei wrote:
>>> On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
>>>> ublk_drv is a driver simply passes all blk-mq rqs to ublksrv[1] in
>>>> userspace. For each ublk queue, there is one ubq_daemon(pthread).
>>>> All ubq_daemons share the same process which opens /dev/ublkcX.
>>>> The ubq_daemon code infinitely loops on io_uring_enter() to
>>>> send/receive io_uring cmds which pass information of blk-mq
>>>> rqs.
>>>>
>>>> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
>>>> must abort the dying ubq, stop the device and release everything.
>>>> This is not a good choice in practice because users do not expect
>>>> aborted requests, I/O errors and a released device. They may want
>>>> a recovery machenism so that no requests are aborted and no I/O
>>>> error occurs. Anyway, users just want everything works as uaual.
>>>
>>> I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
>>> won't be deleted & re-added from user viewpoint after user recovery,
>>> so the device context won't be lost.
>>
>> Yes, after the 'process' is killed or crashed(such as segmentation fault)
>> both /dev/ublkb0 and /dev/ublkc0 is not deleted.
>>
>>>
>>>>
>>>> This RFC patchset implements USER_RECOVERY support. If the process
>>>> crashes, we allow ublksrv to provide new process and ubq_daemons.
>>>> We do not support single ubq_daemon(pthread) recovery because a
>>>> pthread rarely crashes.
>>>>
>>>> Recovery feature is quite useful for products do not expect to
>>>> return any I/O error to frontend users.
>>>
>>> That looks one very ideal requirement. To be honest, no any block driver
>>> can guarantee that 100%, so it is just one soft requirement?
>>>
>>> Cause memory allocation may fail, network may be disconnected,
>>> re-creating pthread or process may fail too, ...
>>
>> Yes, I know there are many other problem which may cause a failure.
>>
>> The recovery mechanism only guarantees that rqs sent to ublksrv
>> before crash are not aborted. Instead, ublk_drv re-issues the request
>> itself and fio does not konw about it. Of course the backend must
>> tolerate a double-write/read.
>
> My comment is for 'do not expect to return any I/O error to frontend users',
> and I still think it is just one soft requirement, and no one can
> guarantee there isn't any error for frontend users really.

Yes, I get your point now. Indeed it is just one soft requirement.

>
>>
>>>
>>>> In detail, we support
>>>> this scenario:
>>>> (1) The /dev/ublkc0 is opened by process 0;
>>>> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
>>>> rqs are handled by process 0.
>>>> (3) Process 0 suddenly crashes(e.g. segfault);
>>>> (4) Fio is still running and submit IOs(but these IOs cannot
>>>> complete now)
>>>> (5) User recovers with process 1 and attach it to /dev/ublkc0
>>>> (6) All rqs are handled by process 1 now and IOs can be
>>>> completed now.
>>>>
>>>> Note: The backend must tolerate double-write because we re-issue
>>>> a rq sent to the old(dying) process before. We allow users to
>>>> choose whether re-issue these rqs or not, please see patch 7 for
>>>> more detail.
>>>>
>>>> We provide a sample script here to simulate the above steps:
>>>>
>>>> ***************************script***************************
>>>> LOOPS=10
>>>>
>>>> __ublk_get_pid() {
>>>> pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
>>>> echo $pid
>>>> }
>>>>
>>>> ublk_recover_kill()
>>>> {
>>>> for CNT in `seq $LOOPS`; do
>>>> dmesg -C
>>>> pid=`__ublk_get_pid`
>>>> echo -e "*** kill $pid now ***"
>>>> kill -9 $pid
>>>> sleep 2
>>>> echo -e "*** recover now ***"
>>>> ./ublk recover -n 0
>>>
>>> The current behavior is that /dev/ublkb* is removed after device is
>>> aborted because ubq daemon is killed.
>>>
>>> What if 'ublk recover' command isn't sent? So the current behavior
>>> without recovery is changed? Or just changed with this feature enabled?
>>
>> No, I do not change the default behavior. You can verify this by running
>> generic/002 and generic/003. These tests passes with either recovery enabled
>> or disabled.
>>
>>
>> (1) With recovery disabled, the monitor_work scheduled periodically or
>> STOP_DEV ctrl-cmd issued manually can cleanup everything and remove the
>> gendisk.
>>
>> (2)With recovery enabled, the monitor_work is not scheduled anymore, see
>> patch 9. So after a crash,all resources are still in kernel.
>> Then, there are two options for a user:
>> (a) You don't want to recover it, so just send STOP_DEV ctrl-cmd. This will
>> schedule monitor_work once and cleanup everything. Please see patch 5.
>
> But what if people sends nothing and starts to reboot, then hang forever without
> monitor_work involved.

Emm... you are right. But here is a conflict: I must reserve resources for a
potential recovery mission, but I may hang if the user directly reboots...

What about adding a systemd service?

>
>> (b) You want to recover it, so just send START_RECOVERY ctrl-cmd. Then you
>> HAVE TO start a new process and send END_RECOVERY.
>>
>> Note: Actually I am thinking what if a user has sent START_RECOVERY but he fails
>> to start a new process. I have a rough idea: just abort all rqs after we unqiuesce
>> the request queue. But that is not included in this RFC patchset because I
>> want to make it simpler. Maybe we can consider it later?
>
> It is pretty easy to fail all in-queue requests when user recovery
> can't move on.

Actually I wrote some code not included in the patchset:

(0) Now the request_queue is quiesced since we are after START_RECOVERY.

(1) mark all ubqs as 'force_abort', which makes ublk_queue_rq() fail
all IOs before calling blk_mq_start_request() after unquiescing the
request_queue.

(2) end (abort) all inflight rqs. Note that the set of inflight rqs does not
change since we are quiesced.

(3) unquiesce the request_queue. Note that the set of inflight rqs does not change
since we marked all ubqs as 'force_abort'. We have to unquiesce or del_gendisk()
will hang forever (not sure I am correct).

(4) del_gendisk()

(5) complete ALL ioucmds by calling io_uring_cmd_done()
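
In rough C form, the above could look like this sketch; ublk_force_abort_dev(),
the force_abort flag and ublk_abort_rq_fn() are hypothetical names taken from
the description, not code in this patchset:

```c
static bool ublk_abort_rq_fn(struct request *rq, void *data)
{
	blk_mq_end_request(rq, BLK_STS_IOERR);	/* (2) abort one inflight rq */
	return true;
}

static void ublk_force_abort_dev(struct ublk_device *ub)
{
	int i;

	/* (1) make ublk_queue_rq() fail new IOs once unquiesced */
	for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
		ublk_get_queue(ub, i)->force_abort = true;

	/* (2) end all inflight rqs; the set is stable while quiesced */
	blk_mq_tagset_busy_iter(&ub->tag_set, ublk_abort_rq_fn, NULL);

	/* (3) unquiesce so del_gendisk() below does not hang */
	blk_mq_unquiesce_queue(ub->ub_disk->queue);

	/* (4) remove the disk */
	del_gendisk(ub->ub_disk);

	/* (5) complete all remaining ioucmds */
	ublk_cancel_dev(ub);
}
```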

But this is really an ugly implementation since I do not consider:

(1) After START_RECOVERY, what if the new process crashes 'before' all FETCH_REQ
cmds are sent (some ubqs are ready while others are not)?

(2) After START_RECOVERY, what if the new process crashes 'after' all FETCH_REQ
cmds are sent but 'before' END_RECOVERY?

Ming, I suggest that this could be added in a future patch because I want to make
this one simple and easy to understand :)

If you do not agree, I can add this in the next version, though I really think we
should be more careful.


>
>>
>>>
>>> BTW, I do not mean the change isn't reasonable, but suggest to document
>>> the user visible change, so it can get reviewed from either user
>>> viewpoint and technical point.
>>>
>>>> sleep 4
>>>> done
>>>> }
>>>>
>>>> ublk_test()
>>>> {
>>>> dmesg -C
>>>> echo -e "*** add ublk device ***"
>>>> ./ublk add -t null -d 4 -i 1
>>>> sleep 2
>>>> echo -e "*** start fio ***"
>>>> fio --bs=4k \
>>>> --filename=/dev/ublkb0 \
>>>> --runtime=100s \
>>>> --rw=read &
>>>> sleep 4
>>>> ublk_recover_kill
>>>> wait
>>>> echo -e "*** delete ublk device ***"
>>>> ./ublk del -n 0
>>>> }
>>>>
>>>> for CNT in `seq 4`; do
>>>> modprobe -rv ublk_drv
>>>> modprobe ublk_drv
>>>> echo -e "************ round $CNT ************"
>>>> ublk_test
>>>> sleep 5
>>>> done
>>>> ***************************script***************************
>>>>
>>>> You may run it with our modified ublksrv[3] which supports
>>>> recovey feature. No I/O error occurs and you can verify it
>>>> by typing
>>>> $ perf-tools/bin/tpoint block:block_rq_error
>>>>
>>>> The basic idea of USER_RECOVERY is quite straightfoward:
>>>>
>>>> (1) release/free everything belongs to the dying process.
>>>>
>>>> Note: Since ublk_drv does save information about user process,
>>>> this work is important because we don't expect any resource
>>>> lekage. Particularly, ioucmds from the dying ubq_daemons
>>>> need to be completed(freed). Current ublk_drv code cannot satisfy
>>>> our need while considering USER_RECOVERY. So we refactor some code
>>>> shown in patch 1-5 to gracefully free all ioucmds.
>>>>
>>>> (2) init ublk queues including requeuing/aborting rqs.
>>>>
>>>> (3) allow new ubq_daemons issue FETCH_REQ.
>>>>
>>>> Here is steps to reocver:
>>>>
>>>> (1) For a user, after a process crash(how he detect a crash is not related
>>>> to this patchset), he sends START_USER_RECOVERY ctrl-cmd to
>>>
>>> I'd suggest to describe crash detector a bit at least, as one whole use case,
>>> crash detector should be the input of the use case of user recovery, which is
>>> usually one part of use case when modeling software requirement/design.
>>
>> This patchset tries to answer only one question: After a process crash, how to
>> re-attach the device by another process. So I do not consider other questions
>> too much, such as:
>> (1) How to detect a crash?
>> (2) Is IO hang a crash? Should we kill the process?
>> (3) What if a blk-mq rq timeout? Does the process dies? Should we kill the process?
>
> But you have to define what is 'crash', otherwise how can you define
> what to be recovered?
>
> So far please just define the crash as the whole daemon being dead
> abnormally(without sending stop command) if you don't have better idea.

Yes, I agree that a ublk_drv crash means the whole daemon (process) being dead
abnormally (without sending the stop command).

I only consider a 'process' crash, not a single 'pthread' (ubq_daemon) crash.
The process crashes if:

(1) The user kills it (the detector can do this, the backend can do this, or
ublksrv_tgt can do this...)

(2) It exits because of an exception (segfault, division error, oom...)

>
>>
>> I think we can answer them after kernel-support of USER_RECOVERY is available.
>>
>>
>> For now I only try to directly kill the process in testcases and manually inject
>> a crash in handle_io_async().
>>
>>>
>>> Such as, crash is detected after the ubq daemon pthread/process is crashed?
>>> Will you consider io hang in the daemon pthread/process? IMO, long-term,
>>> the crash detector utility should be part of ublksrv.
>>
>> Yes, we should design a crash detector in ublksrv. For IO hang, my idea is that:
>>
>> (1) the ublksrv_tgt code should handle it if user runs ublksrv directly.
>>
>> (2) the backend should handle it if user only uses libublksrv and embeds it inside
>> the backend code.
>>
>>>
>>> We don't implement ublk driver's IO timeout yet, but that implementation may be
>>> related with this recovery feature closely, such as, one simple approach is to
>>> kill ubq-daemon if we can't move on after retrying several times, then
>>> let userspace detect & recovery.
>>
>> You mean the ublk_drv can kill the ubq_daemon? I have not consider this case yet...
>>
>> BTW, I don't think we should put too much logic(IO hang, detector) in ublk_drv
>> because it should only pass-through rqs to userpsace. We should make ublk_drv simple.
>> Accept a new daemon and re-attach it to /dev/ublkb0 is what it can do I think.
>
> Actually I was thinking the use case of container-ware ublk device when
> ADMIN privilege requirement can be removed, so people can do whatever
> they want in ublksrv. But sooner or later, request timeout needs to be
> considered for real ublk use case.

Agree.

>
> thanks,
> Ming

2022-08-29 08:15:32

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 13:40, Ming Lei wrote:
> > On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> >> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
> >> this rq is actually handled by an io_uring fallback wq. We have to
> >> end(abort) this rq since this fallback wq is a task other than the
> >> crashed task. However, current code does not call io_uring_cmd_done()
> >> at the same time but do it in ublk_cancel_queue(). With current design,
> >> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
> >> which waits for the rq ended(aborted) in fallback wq. This implies that
> >> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
> >> on the corresponding ioucmd in ublk_cancel_queue().
> >
> > Right.
> >
> >>
> >> However, while considering recovery feature, we cannot rely on
> >> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> >> rqs because we may not want any aborted rq. Besides, io_uring does not
> >> provide "flush fallback" machenism so we cannot trace this ioucmd.
> >
> > Why not?
> >
> > If user recovery is enabled, del_gendisk() can be replaced with
> > blk_mq_quiesce_queue(), then let abort work function do:
> >
> > - cancel all in-flight requests by holding them into requeue list
> > instead of finishing them as before, and this way is safe because
> > abort worker does know the ubq daemon is dying
> > - cancel pending commands as before, because the situation is same
> > with disk deleted or queue frozen
>
> The problem is: we cannot control when fallback wq is scheduled.
> So we are unsafe to call io_uring_cmd_done() in another process.

What is the other process?

It can't be the fallback wq, since any ublk request is aborted at the beginning
of __ublk_rq_task_work().

It shouldn't be the process calling ublk_cancel_dev(), since it is
safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.

Or others?

> Otherwise, there is a UAF, just as
> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work).

As I mentioned, del_gendisk() can be replaced with
blk_mq_quiesce_queue() in the case of user recovery; then no new
request can be queued after blk_mq_quiesce_queue() returns.
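
As a hedged sketch of that substitution (ublk_can_use_recovery() is an assumed
helper, not existing code):

```c
/* on daemon death: keep the disk, just stop new IO from being queued */
if (ublk_can_use_recovery(ub))
	blk_mq_quiesce_queue(ub->ub_disk->queue);
else
	del_gendisk(ub->ub_disk);
```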

>
> Yeah I know the answer is very simple: flush the fallback wq.
> But here are two more questions:

I don't see why we need to flush the fallback wq; care to provide some
details?

>
> (1) Should ublk_drv rely on the fallback wq machenism?
> IMO, ublk_drv should not know detail of io_uring_cmd_complete_in_task()
> because its implementation may change in the future.
> BTW, I think current ublk_rq_task_work_cb() is not correct because
> it does not always call io_uring_cmd_done() before returning.
> nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd
> no matter the rq succeeds or fails.
>
> (2) Suppose io_uring does export the symbol 'flush_fallback_work', should we call
> it before starting a new process(recovery)?
> What if fallback wq is not scheduled immediately if there are many processes
> running and the system overhead is heavy. In this case the recovery process
> may wait for too long. Really we should not depend on fallback wq and please
> let the fallback wq complete the ioucmd itself.
>
> >
> > With this way, the current abort logic won't be changed much.
> >
> > And user recovery should only be started _after_ ublk device is found
> > as aborted.
>
> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING.

That is different. If START_RECOVERY is only run on an aborted device, the
recovery handler could be simplified.

>
> >
> >>
> >> The recovery machenism needs to complete all ioucmds of a dying ubq
> >> to avoid leaking io_uring ctx. But as talked above, we are unsafe
> >> to call io_uring_cmd_done() in the recovery task if fallback wq happens
> >> to run simultaneously. This is a UAF case because io_uring ctx may be
> >> freed. Actually a similar case happens in
> >> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> >> re-issued request aborted previously to ioucmd's task_work).
> >
> > If you take the above approach, I guess there isn't such problem because
> > abort can handle the case well as before.
>
> Ming, we did think this approach(quiesce, requeue rq/complete ioucmd)
> at the very beginning. But we decided to drop it because we don not want
> rely on 'flush fallback wq' machenism, which
> makes ublk_drv rely on io_uring's internal implementation.

Then the focus is 'flush fallback wq'; please see my question above.


Thanks,
Ming

2022-08-29 09:14:36

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 0/9] ublk_drv: add USER_RECOVERY support

On Mon, Aug 29, 2022 at 03:29:08PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 13:56, Ming Lei wrote:
> > On Mon, Aug 29, 2022 at 12:00:49PM +0800, Ziyang Zhang wrote:
> >> On 2022/8/29 10:08, Ming Lei wrote:
> >>> On Wed, Aug 24, 2022 at 01:47:35PM +0800, ZiyangZhang wrote:
> >>>> ublk_drv is a driver simply passes all blk-mq rqs to ublksrv[1] in
> >>>> userspace. For each ublk queue, there is one ubq_daemon(pthread).
> >>>> All ubq_daemons share the same process which opens /dev/ublkcX.
> >>>> The ubq_daemon code infinitely loops on io_uring_enter() to
> >>>> send/receive io_uring cmds which pass information of blk-mq
> >>>> rqs.
> >>>>
> >>>> Now, if one ubq_daemon(pthread) or the process crashes, ublk_drv
> >>>> must abort the dying ubq, stop the device and release everything.
> >>>> This is not a good choice in practice because users do not expect
> >>>> aborted requests, I/O errors and a released device. They may want
> >>>> a recovery machenism so that no requests are aborted and no I/O
> >>>> error occurs. Anyway, users just want everything works as uaual.
> >>>
> >>> I understand the requirement is that both /dev/ublkbN and /dev/ublkcN
> >>> won't be deleted & re-added from user viewpoint after user recovery,
> >>> so the device context won't be lost.
> >>
> >> Yes, after the 'process' is killed or crashed(such as segmentation fault)
> >> both /dev/ublkb0 and /dev/ublkc0 is not deleted.
> >>
> >>>
> >>>>
> >>>> This RFC patchset implements USER_RECOVERY support. If the process
> >>>> crashes, we allow ublksrv to provide new process and ubq_daemons.
> >>>> We do not support single ubq_daemon(pthread) recovery because a
> >>>> pthread rarely crashes.
> >>>>
> >>>> Recovery feature is quite useful for products do not expect to
> >>>> return any I/O error to frontend users.
> >>>
> >>> That looks one very ideal requirement. To be honest, no any block driver
> >>> can guarantee that 100%, so it is just one soft requirement?
> >>>
> >>> Cause memory allocation may fail, network may be disconnected,
> >>> re-creating pthread or process may fail too, ...
> >>
> >> Yes, I know there are many other problem which may cause a failure.
> >>
> >> The recovery mechanism only guarantees that rqs sent to ublksrv
> >> before crash are not aborted. Instead, ublk_drv re-issues the request
> >> itself and fio does not konw about it. Of course the backend must
> >> tolerate a double-write/read.
> >
> > My comment is for 'do not expect to return any I/O error to frontend users',
> > and I still think it is just one soft requirement, and no one can
> > guarantee there isn't any error for frontend users really.
>
> Yes, I get your point now. Indeed it is just one soft requirement.
>
> >
> >>
> >>>
> >>>> In detail, we support
> >>>> this scenario:
> >>>> (1) The /dev/ublkc0 is opened by process 0;
> >>>> (2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
> >>>> rqs are handled by process 0.
> >>>> (3) Process 0 suddenly crashes(e.g. segfault);
> >>>> (4) Fio is still running and submit IOs(but these IOs cannot
> >>>> complete now)
> >>>> (5) User recovers with process 1 and attach it to /dev/ublkc0
> >>>> (6) All rqs are handled by process 1 now and IOs can be
> >>>> completed now.
> >>>>
> >>>> Note: The backend must tolerate double-write because we re-issue
> >>>> a rq sent to the old(dying) process before. We allow users to
> >>>> choose whether re-issue these rqs or not, please see patch 7 for
> >>>> more detail.
> >>>>
> >>>> We provide a sample script here to simulate the above steps:
> >>>>
> >>>> ***************************script***************************
> >>>> LOOPS=10
> >>>>
> >>>> __ublk_get_pid() {
> >>>> pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
> >>>> echo $pid
> >>>> }
> >>>>
> >>>> ublk_recover_kill()
> >>>> {
> >>>> for CNT in `seq $LOOPS`; do
> >>>> dmesg -C
> >>>> pid=`__ublk_get_pid`
> >>>> echo -e "*** kill $pid now ***"
> >>>> kill -9 $pid
> >>>> sleep 2
> >>>> echo -e "*** recover now ***"
> >>>> ./ublk recover -n 0
> >>>
> >>> The current behavior is that /dev/ublkb* is removed after device is
> >>> aborted because ubq daemon is killed.
> >>>
> >>> What if 'ublk recover' command isn't sent? So the current behavior
> >>> without recovery is changed? Or just changed with this feature enabled?
> >>
> >> No, I do not change the default behavior. You can verify this by running
> >> generic/002 and generic/003. These tests passes with either recovery enabled
> >> or disabled.
> >>
> >>
> >> (1) With recovery disabled, the monitor_work scheduled periodically or
> >> STOP_DEV ctrl-cmd issued manually can cleanup everything and remove the
> >> gendisk.
> >>
> >> (2)With recovery enabled, the monitor_work is not scheduled anymore, see
> >> patch 9. So after a crash,all resources are still in kernel.
> >> Then, there are two options for a user:
> >> (a) You don't want to recover it, so just send STOP_DEV ctrl-cmd. This will
> >> schedule monitor_work once and cleanup everything. Please see patch 5.
> >
> > But what if people sends nothing and starts to reboot, then hang forever without
> > monitor_work involved.
>
> Emm... you are right. But here is a conflict: I must reserve resources for a
> potential recovery mission, but I may hang if the user directly reboots...
>
> What about add a systemd service?

That is definitely one solution, but maybe not the best one.

Another approach I thought of is:

1) after the monitor work aborts the device, send one KOBJ_CHANGE event for
/dev/ublkcN to userspace, and userspace should check whether the situation
needs to be recovered; no matter yes or no, it should tell the ublk driver
via one control command

2) meanwhile the monitor work starts a timer to check whether the expected
command is received within one reasonable period. If not, the device
will be stopped. Otherwise the driver decides whether to recover or not.
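
A rough sketch of 1) and 2), with all names illustrative (abort_timeout_work is
an assumed member and the 30s period is arbitrary):

```c
static void ublk_notify_aborted(struct ublk_device *ub)
{
	/* 1) KOBJ_CHANGE event on /dev/ublkcN for userspace to pick up */
	kobject_uevent(&ub->cdev_dev.kobj, KOBJ_CHANGE);

	/* 2) stop the device if no ctrl command arrives in time */
	schedule_delayed_work(&ub->abort_timeout_work, 30 * HZ);
}
```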

>
> >
> >> (b) You want to recover it, so just send START_RECOVERY ctrl-cmd. Then you
> >> HAVE TO start a new process and send END_RECOVERY.
> >>
> >> Note: Actually I am thinking what if a user has sent START_RECOVERY but he fails
> >> to start a new process. I have a rough idea: just abort all rqs after we unqiuesce
> >> the request queue. But that is not included in this RFC patchset because I
> >> want to make it simpler. Maybe we can consider it later?
> >
> > It is pretty easy to fail all in-queue requests when user recovery
> > can't move on.
>
> Actually I wrote some code not included in the patchset:
>
> (0) Now the request_queue is quiesced since we are after START_RECOVERY.
>
> (1) mark all ubqs as 'force_abort', which makes ublk_queue_rq() fail
> all IOs before calling blk_mq_atart_request() after unqiescing
> request_queue.
>
> (2) end(abort) all rqs inflight. Note that set of inflight does not change
> since we are quiesced.
>
> (3) unqiesce request_queue. Note that set of inflight rqs does not change
> since we marked all ubqs as 'force_abort'. We have to unqiesce or del_gendisk()
> will hang forever(not sure I am correct).
>
> (4) del_gendisk()
>
> (5) complete ALL ioucmds by calling io_uring_cmd_done()

The above should be done easily.

>
> But this is really a ugly implementation since I do not consider:
>
> (1) After START_RECOVERY, what if the new process crashes 'before' all FETCH_REQ
> cmds are sent(some ubqs are ready while others are not).
>
> (2) After START_RECOVERY, what if the new process crashes 'after' all FETCH_REQ
> cmds are sent but 'before' END_RECOVERY.

Let's start simple. If things don't work as expected, terminate
recovery and stop the disk.

>
>
> Ming, I suggest that this could be added in a future patch because I want to make
> this one simple and easy to understand :)
>
> If you do not agree, I can add this in next version though I really think we should
> be more careful.
>
>
> >
> >>
> >>>
> >>> BTW, I do not mean the change isn't reasonable, but suggest to document
> >>> the user visible change, so it can get reviewed from either user
> >>> viewpoint and technical point.
> >>>
> >>>> sleep 4
> >>>> done
> >>>> }
> >>>>
> >>>> ublk_test()
> >>>> {
> >>>> dmesg -C
> >>>> echo -e "*** add ublk device ***"
> >>>> ./ublk add -t null -d 4 -i 1
> >>>> sleep 2
> >>>> echo -e "*** start fio ***"
> >>>> fio --bs=4k \
> >>>> --filename=/dev/ublkb0 \
> >>>> --runtime=100s \
> >>>> --rw=read &
> >>>> sleep 4
> >>>> ublk_recover_kill
> >>>> wait
> >>>> echo -e "*** delete ublk device ***"
> >>>> ./ublk del -n 0
> >>>> }
> >>>>
> >>>> for CNT in `seq 4`; do
> >>>> modprobe -rv ublk_drv
> >>>> modprobe ublk_drv
> >>>> echo -e "************ round $CNT ************"
> >>>> ublk_test
> >>>> sleep 5
> >>>> done
> >>>> ***************************script***************************
> >>>>
> >>>> You may run it with our modified ublksrv[3] which supports
> >>>> recovey feature. No I/O error occurs and you can verify it
> >>>> by typing
> >>>> $ perf-tools/bin/tpoint block:block_rq_error
> >>>>
> >>>> The basic idea of USER_RECOVERY is quite straightfoward:
> >>>>
> >>>> (1) release/free everything belongs to the dying process.
> >>>>
> >>>> Note: Since ublk_drv does save information about user process,
> >>>> this work is important because we don't expect any resource
> >>>> lekage. Particularly, ioucmds from the dying ubq_daemons
> >>>> need to be completed(freed). Current ublk_drv code cannot satisfy
> >>>> our need while considering USER_RECOVERY. So we refactor some code
> >>>> shown in patch 1-5 to gracefully free all ioucmds.
> >>>>
> >>>> (2) init ublk queues including requeuing/aborting rqs.
> >>>>
> >>>> (3) allow new ubq_daemons issue FETCH_REQ.
> >>>>
> >>>> Here is steps to reocver:
> >>>>
> >>>> (1) For a user, after a process crash(how he detect a crash is not related
> >>>> to this patchset), he sends START_USER_RECOVERY ctrl-cmd to
> >>>
> >>> I'd suggest to describe crash detector a bit at least, as one whole use case,
> >>> crash detector should be the input of the use case of user recovery, which is
> >>> usually one part of use case when modeling software requirement/design.
> >>
> >> This patchset tries to answer only one question: After a process crash, how to
> >> re-attach the device by another process. So I do not consider other questions
> >> too much, such as:
> >> (1) How to detect a crash?
> >> (2) Is IO hang a crash? Should we kill the process?
> >> (3) What if a blk-mq rq timeout? Does the process dies? Should we kill the process?
> >
> > But you have to define what is 'crash', otherwise how can you define
> > what to be recovered?
> >
> > So far please just define the crash as the whole daemon being dead
> > abnormally(without sending stop command) if you don't have better idea.
>
> Yes, I argee that a ublk_drv crash means the whole daemon(process) being dead
> abnormally(without sending stop command).
>
> I only consider the 'process' crash, not a single 'pthread'(ubq_daemon) crash.
> The process crashes if:
>
> (1) The user kill it(the detector can do this, the backend can do this, or the
> ublksrv_tgt can do this...)

Yeah, sometimes 'kill -9' is very useful for me to handle an IO hang; then no
reboot is needed.

But it is just for debug or development of ublksrv, and that can be thought
of as an early shape of a container-aware block device; other kinds of block
devices can't have such a privilege, :-)

>
> (2) It exits because of exception(segfault, divisor error, oom...)

True, that is why I raised the question: the problem to be
solved needs to be defined first.


Thanks,
Ming

2022-08-29 09:16:13

by Ziyang Zhang

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting machenism

On 2022/8/29 16:11, Ming Lei wrote:
> On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
>> On 2022/8/29 13:40, Ming Lei wrote:
>>> On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
>>>> If one rq is handled by io_uring_cmd_complete_in_task(), after a crash
>>>> this rq is actually handled by an io_uring fallback wq. We have to
>>>> end(abort) this rq since this fallback wq is a task other than the
>>>> crashed task. However, current code does not call io_uring_cmd_done()
>>>> at the same time but do it in ublk_cancel_queue(). With current design,
>>>> this does work because ublk_cancel_queue() is called AFTER del_gendisk(),
>>>> which waits for the rq ended(aborted) in fallback wq. This implies that
>>>> fallback wq on this rq is scheduled BEFORE calling io_uring_cmd_done()
>>>> on the corresponding ioucmd in ublk_cancel_queue().
>>>
>>> Right.

[1]

>>>
>>>>
>>>> However, while considering recovery feature, we cannot rely on
>>>> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
>>>> rqs because we may not want any aborted rq. Besides, io_uring does not
>>>> provide "flush fallback" machenism so we cannot trace this ioucmd.
>>>
>>> Why not?
>>>
>>> If user recovery is enabled, del_gendisk() can be replaced with
>>> blk_mq_quiesce_queue(), then let abort work function do:
>>>
>>> - cancel all in-flight requests by holding them into requeue list
>>> instead of finishing them as before, and this way is safe because
>>> abort worker does know the ubq daemon is dying
>>> - cancel pending commands as before, because the situation is same
>>> with disk deleted or queue frozen
>>
>> The problem is: we cannot control when fallback wq is scheduled.
>> So we are unsafe to call io_uring_cmd_done() in another process.
>
> What is the other process?

Hi Ming.

Actually patches 1-5 are all preparations for patches 6-9. So I suggest
you read patches 6-9 at the same time if you are confused about why
I say there is a 'problem'. In the current ublk_drv we really do not need
to consider it (as I have explained in [1] and you replied 'Right').

To answer your question now: in the patchset, it is the START_RECOVERY process,
which calls ublk_recover_rq(). Please see patch 8.

>
> It can't be fallback wq since any ublk request is aborted at the beginning
> of __ublk_rq_task_work().

With the recovery feature enabled, we cannot abort the rq; we just let
__ublk_rq_task_work() return. We will requeue the rq soon.
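(To make the requeue-then-redispatch idea concrete, here is a minimal
sketch; it assumes the recovery path later kicks the requeue list once
the new daemon is ready, and that ub->ub_disk->queue is the request
queue as in the current driver:)

	/* exiting-daemon path in __ublk_rq_task_work(): park the rq
	 * instead of failing it ('false' = don't kick the requeue
	 * list yet)
	 */
	blk_mq_requeue_request(req, false);

	/* later, once the new daemon is attached and ready:
	 * redispatch every parked rq
	 */
	blk_mq_unquiesce_queue(ub->ub_disk->queue);
	blk_mq_kick_requeue_list(ub->ub_disk->queue);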

>
> It shouldn't be the process calling ublk_cancel_dev(), since it is
> safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.
>
> Or others?
It is the START_RECOVERY process (or the 'abort_work' you proposed).
It is whatever calls io_uring_cmd_done() on the old ioucmd
belonging to the dying ubq_daemon.

>
>> Otherwise, there is a UAF, just as
>> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
>> re-issued request aborted previously to ioucmd's task_work).
>
> As I mentioned, del_gendisk() can be replaced with
> blk_mq_quiesce_queue() in the case of user recovery; then no new
> request can be queued after blk_mq_quiesce_queue() returns.

For this, +1.
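(Concretely, I read the suggestion as something like the sketch below;
ublk_can_use_recovery() is an assumed helper that just tests the
user-recovery feature flag:)

	if (ublk_can_use_recovery(ub))
		/* keep the disk; no new rq can enter ->queue_rq() */
		blk_mq_quiesce_queue(ub->ub_disk->queue);
	else
		del_gendisk(ub->ub_disk);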

>
>>
>> Yeah I know the answer is very simple: flush the fallback wq.
>> But here are two more questions:
>
> I don't see why we need to flush fallback wq, care to provide some
> details?

Anyway, here is a case:

(1) Assume there is only one tag, only one ubq.

(2) The ublk_io is ACTIVE, which means an ioucmd (cmd_1) has been sent from ublksrv
and ublk_drv has not completed it yet (io_uring_cmd_done() has not been called).

(3) A new rq comes in, and ublk_queue_rq() is called.

(4) The ublksrv process crashes(PF_EXITING).

(5) io_uring_cmd_complete_in_task(cmd_1) is called in ublk_queue_rq(), and
cmd_1 is put into a fallback_list.

(6) We want to re-attach a new process and assign a new ioucmd (cmd_2) to the ublk_io.

(7) We try to complete the old cmd_1 now by io_uring_cmd_done(cmd_1)...

(8) Shortly after (7), io_uring exit work is scheduled, and it finds that no
in-flight ioucmd exists, so it starts to release some resources.

(9) Shortly after (8), in fallback wq, a null-deref on cmd_1 or ctx->refs may
happen, just like
(5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
re-issued request aborted previously to ioucmd's task_work).

If we flush fallback wq before (7), then everything is OKAY.

Why does this happen? The root cause is that we do not ALWAYS complete the
io_uring cmd in ublk_rq_task_work_cb(). The commit "ublk_drv: do not add a
re-issued request aborted previously to ioucmd's task_work" does fix a problem,
but I think we really need to refactor ublk_rq_task_work_cb(), which focuses
on the old ioucmd.
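(For illustration, the ordering that would make (7) safe, assuming
io_uring exported a flush helper; io_uring_cmd_flush_fallback() below
is purely hypothetical and does not exist today:)

	/* recovery side, before completing the old command: */
	io_uring_cmd_flush_fallback(cmd_1);	/* hypothetical: waits for (5) */

	/* only now can no other context touch cmd_1 */
	io_uring_cmd_done(cmd_1, UBLK_IO_RES_ABORT, 0);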

>
>>
>> (1) Should ublk_drv rely on the fallback wq mechanism?
>> IMO, ublk_drv should not know the details of io_uring_cmd_complete_in_task(),
>> because its implementation may change in the future.
>> BTW, I think the current ublk_rq_task_work_cb() is not correct because
>> it does not always call io_uring_cmd_done() before returning.
>> nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd,
>> no matter whether the rq succeeds or fails.
>>
>> (2) Suppose io_uring does export the symbol 'flush_fallback_work'; should we
>> call it before starting a new process (recovery)?
>> What if the fallback wq is not scheduled immediately because many processes
>> are running and the system load is heavy? In that case the recovery process
>> may wait too long. Really, we should not depend on the fallback wq; please
>> let the fallback wq complete the ioucmd itself.
>>
>>>
>>> This way, the current abort logic won't be changed much.
>>>
>>> And user recovery should only be started _after_ the ublk device is
>>> found to be aborted.
>>
>> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING.
>
> That is different. If START_RECOVERY is only run on an aborted device,
> the recovery handler could be simplified.

But stop_dev could become complicated since, with recovery enabled, stop_dev
has to do different things. Please see patch 5. Really, we do not have to
do anything after the crash until the user sends START_RECOVERY or STOP_DEV.

Regards,
Zhang

>
>>
>>>
>>>>
>>>> The recovery mechanism needs to complete all ioucmds of a dying ubq
>>>> to avoid leaking the io_uring ctx. But as discussed above, it is unsafe
>>>> to call io_uring_cmd_done() in the recovery task if the fallback wq happens
>>>> to run simultaneously. This is a UAF case because the io_uring ctx may be
>>>> freed. Actually a similar case happens in
>>>> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
>>>> re-issued request aborted previously to ioucmd's task_work).
>>>
>>> If you take the above approach, I guess there isn't such a problem,
>>> because abort can handle the case well as before.
>>
>> Ming, we did consider this approach (quiesce, requeue rq / complete ioucmd)
>> at the very beginning. But we decided to drop it because we do not want to
>> rely on the 'flush fallback wq' mechanism, which makes ublk_drv depend on
>> io_uring's internal implementation.
>
> Then the focus is 'flush fallback wq', please see my above question.
>
>
> Thanks,
> Ming

2022-08-29 10:35:05

by Ming Lei

[permalink] [raw]
Subject: Re: [RFC PATCH 4/9] ublk_drv: refactor __ublk_rq_task_work() and aborting mechanism

On Mon, Aug 29, 2022 at 05:09:45PM +0800, Ziyang Zhang wrote:
> On 2022/8/29 16:11, Ming Lei wrote:
> > On Mon, Aug 29, 2022 at 02:13:12PM +0800, Ziyang Zhang wrote:
> >> On 2022/8/29 13:40, Ming Lei wrote:
> >>> On Wed, Aug 24, 2022 at 01:47:39PM +0800, ZiyangZhang wrote:
> >>>> If one rq is handled by io_uring_cmd_complete_in_task(), then after a crash
> >>>> this rq is actually handled by an io_uring fallback wq. We have to
> >>>> end (abort) this rq since the fallback wq runs in a task other than the
> >>>> crashed task. However, the current code does not call io_uring_cmd_done()
> >>>> at the same time but does it in ublk_cancel_queue(). With the current design
> >>>> this works, because ublk_cancel_queue() is called AFTER del_gendisk(),
> >>>> which waits for the rq to be ended (aborted) in the fallback wq. This implies
> >>>> that the fallback wq for this rq is scheduled BEFORE io_uring_cmd_done()
> >>>> is called on the corresponding ioucmd in ublk_cancel_queue().
> >>>
> >>> Right.
>
> [1]
>
> >>>
> >>>>
> >>>> However, when considering the recovery feature, we cannot rely on
> >>>> del_gendisk() or blk_mq_freeze_queue() to wait for completion of all
> >>>> rqs, because we may not want any rq to be aborted. Besides, io_uring does
> >>>> not provide a "flush fallback" mechanism, so we cannot track this ioucmd.
> >>>
> >>> Why not?
> >>>
> >>> If user recovery is enabled, del_gendisk() can be replaced with
> >>> blk_mq_quiesce_queue(); then let the abort work function:
> >>>
> >>> - cancel all in-flight requests by holding them in the requeue list
> >>> instead of finishing them as before; this is safe because the
> >>> abort worker knows the ubq daemon is dying
> >>> - cancel pending commands as before, because the situation is the same
> >>> as with the disk deleted or the queue frozen
> >>
> >> The problem is: we cannot control when the fallback wq is scheduled,
> >> so it is unsafe for us to call io_uring_cmd_done() from another process.
> >
> > What is the other process?
>
> Hi Ming.
>
> Actually patches 1-5 are all preparations for patches 6-9. So I suggest
> reading patches 6-9 at the same time if you are confused about why
> I say there is a 'problem'. In current ublk_drv we really do not need
> to consider it (as I have explained in [1] and you replied 'Right').
>
> To answer your question now: in the patchset, it is the START_RECOVERY
> process, which calls ublk_recover_rq(). Please see patch 8.

Why does the START_RECOVERY process need to call io_uring_cmd_done()?

As I mentioned, clean up the old ubq_daemon with ublk_cancel_queue()
just like before. Then the recovery handling can be simplified a lot.
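(For reference, ublk_cancel_queue() today is basically the following,
lightly paraphrased:)

	static void ublk_cancel_queue(struct ublk_queue *ubq)
	{
		int i;

		if (!ublk_queue_ready(ubq))
			return;

		for (i = 0; i < ubq->q_depth; i++) {
			struct ublk_io *io = &ubq->ios[i];

			/* complete every command still owned by the driver */
			if (io->flags & UBLK_IO_FLAG_ACTIVE)
				io_uring_cmd_done(io->cmd,
						  UBLK_IO_RES_ABORT, 0);
		}

		/* all io commands are canceled */
		ubq->nr_io_ready = 0;
	}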

>
> >
> > It can't be fallback wq since any ublk request is aborted at the beginning
> > of __ublk_rq_task_work().
>
> With the recovery feature enabled, we cannot abort the rq; we just let
> __ublk_rq_task_work() return. We will requeue the rq soon.

Yeah, the request is requeued, and the queue is quiesced during
aborting in ublk_cancel_queue().

>
> >
> > It shouldn't be the process calling ublk_cancel_dev(), since it is
> > safe to call io_uring_cmd_done() if ubq->nr_io_ready > 0.
> >
> > Or others?
> It is the START_RECOVERY process (or the 'abort_work' you proposed).
> It is whatever calls io_uring_cmd_done() on the old ioucmd
> belonging to the dying ubq_daemon.
>
> >
> >> Otherwise, there is a UAF, just as
> >> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> >> re-issued request aborted previously to ioucmd's task_work).
> >
> > As I mentioned, del_gendisk() can be replaced with
> > blk_mq_quiesce_queue() in the case of user recovery; then no new
> > request can be queued after blk_mq_quiesce_queue() returns.
>
> For this, +1.
>
> >
> >>
> >> Yeah I know the answer is very simple: flush the fallback wq.
> >> But here are two more questions:
> >
> > I don't see why we need to flush fallback wq, care to provide some
> > details?
>
> Anyway, here is a case:
>
> (1) Assume there is only one tag, only one ubq.
>
> (2) The ublk_io is ACTIVE, which means an ioucmd (cmd_1) has been sent from ublksrv
> and ublk_drv has not completed it yet (io_uring_cmd_done() has not been called).
>
> (3) A new rq comes in, and ublk_queue_rq() is called.
>
> (4) The ublksrv process crashes(PF_EXITING).
>
> (5) io_uring_cmd_complete_in_task(cmd_1) is called in ublk_queue_rq(), and
> cmd_1 is put into a fallback_list.
>

What I suggested is to abort the ubq daemon in ublk_abort_device() like before,
but without failing in-flight requests: just quiesce the queue and requeue all
in-flight requests.

Specifically, you can wait until all in-flight requests are requeued; similar
handling has been used by NVMe for a long time. Then the fallback wq can be
thought of as flushed here.

There are two kinds of in-flight requests:

1) UBLK_IO_FLAG_ACTIVE is set: this is the case ublk_abort_queue() needs
to wait on, until req->state becomes IDLE, which can be done via the
following change at the entry of __ublk_rq_task_work():

	if (unlikely(task_exiting)) {
		if (user_recovery)
			/* park the rq; a new daemon will pick it up */
			blk_mq_requeue_request(req, false);
		else
			blk_mq_end_request(req, BLK_STS_IOERR);
		return;
	}

2) UBLK_IO_FLAG_ACTIVE is cleared
- no need to wait, since io_uring_cmd_done() has already been called for
this request
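(A sketch of the wait itself, NVMe-style; the two function names are
made up for illustration, the queue is assumed to be quiesced already,
and blk_mq_tagset_busy_iter() only visits started requests, so requeued
rqs no longer show up:)

	static bool ublk_count_inflight(struct request *rq, void *data)
	{
		unsigned int *cnt = data;

		(*cnt)++;
		return true;	/* keep iterating */
	}

	static void ublk_wait_requeue_done(struct ublk_device *ub)
	{
		unsigned int inflight;

		do {
			inflight = 0;
			blk_mq_tagset_busy_iter(&ub->tag_set,
						ublk_count_inflight,
						&inflight);
			if (inflight)
				msleep(100);
		} while (inflight);
	}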


> (6) We want to re-attach a new process and assign a new ioucmd (cmd_2) to the ublk_io.
>
> (7) We try to complete the old cmd_1 now by io_uring_cmd_done(cmd_1)...
>
> (8) Shortly after (7), io_uring exit work is scheduled, and it finds that no
> in-flight ioucmd exists, so it starts to release some resources.
>
> (9) Shortly after (8), in fallback wq, a null-deref on cmd_1 or ctx->refs may
> happen, just like
> (5804987b7272f437299011c76b7363b8df6f8515: ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work).
>
> If we flush fallback wq before (7), then everything is OKAY.
>
> Why does this happen? The root cause is that we do not ALWAYS complete the
> io_uring cmd in ublk_rq_task_work_cb(). The commit "ublk_drv: do not add a
> re-issued request aborted previously to ioucmd's task_work" does fix a problem,
> but I think we really need to refactor ublk_rq_task_work_cb(), which focuses
> on the old ioucmd.

The race happens because you drop the existing abort mechanism for user
recovery. If the existing abort is reused for recovery, there is no such
race at all.

>
> >
> >>
> >> (1) Should ublk_drv rely on the fallback wq mechanism?
> >> IMO, ublk_drv should not know the details of io_uring_cmd_complete_in_task(),
> >> because its implementation may change in the future.
> >> BTW, I think the current ublk_rq_task_work_cb() is not correct because
> >> it does not always call io_uring_cmd_done() before returning.
> >> nvme_uring_cmd_end_io() always calls io_uring_cmd_done() for each ioucmd,
> >> no matter whether the rq succeeds or fails.
> >>
> >> (2) Suppose io_uring does export the symbol 'flush_fallback_work'; should we
> >> call it before starting a new process (recovery)?
> >> What if the fallback wq is not scheduled immediately because many processes
> >> are running and the system load is heavy? In that case the recovery process
> >> may wait too long. Really, we should not depend on the fallback wq; please
> >> let the fallback wq complete the ioucmd itself.
> >>
> >>>
> >>> This way, the current abort logic won't be changed much.
> >>>
> >>> And user recovery should only be started _after_ the ublk device is
> >>> found to be aborted.
> >>
> >> START_RECOVERY will check if all ubq_daemons(the process) are PF_EXITING.
> >
> > That is different. If START_RECOVERY is only run on an aborted device,
> > the recovery handler could be simplified.
>
> But stop_dev could become complicated since, with recovery enabled, stop_dev
> has to do different things. Please see patch 5. Really, we do not have to
> do anything after the crash until the user sends START_RECOVERY or STOP_DEV.

99% is the same; the only extra thing is to fail all in-queue requests, by
unquiescing the queue, before stopping a to-be-recovered device.
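(Roughly the following at the start of the stop path for a
to-be-recovered device:)

	/* un-quiesce first so the parked/in-queue rqs can be dispatched
	 * and failed (the daemon is gone); otherwise del_gendisk() may
	 * hang waiting for them
	 */
	blk_mq_unquiesce_queue(ub->ub_disk->queue);
	del_gendisk(ub->ub_disk);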


Thanks,
Ming