2022-09-23 06:26:53

by Ziyang Zhang

Subject: [RESEND PATCH V5 0/7] ublk_drv: add USER_RECOVERY support

ublk_drv is a driver that simply passes all blk-mq rqs to a userspace
target (such as ublksrv[1]). For each ublk queue, there is one
ubq_daemon (pthread). All ubq_daemons share the same process,
which opens /dev/ublkcX. The ubq_daemon code loops forever on
io_uring_enter() to send/receive io_uring cmds which carry
information about blk-mq rqs.

Since the real IO handler (the process/thread opening /dev/ublkcX) is
in userspace, it could crash if:
(1) the user kills it with 'kill -9' because of an IO hang on the backend,
system reboot, etc.
(2) the process/thread hits an exception (segfault, divide error,
OOM...).

Therefore, the kernel driver has to deal with a dying
ubq_daemon or process.

Currently, if one ubq_daemon (pthread) or the process crashes, ublk_drv
must abort the dying ubq, stop the device and release everything.
This is not a good choice in practice because users do not expect
aborted requests, I/O errors and a released device. They may want
a recovery mechanism so that no requests are aborted and no I/O
errors occur. In short, users just want everything to work as usual.

This patchset implements USER_RECOVERY support. If the process
or any ubq_daemon (pthread) crashes (exits unexpectedly), we allow
the user to provide a new process and new ubq_daemons.

Note: The responsibility for recovery belongs to the user who opens
/dev/ublkcX. After a crash, the kernel driver only switches the
device's state so that it is ready for recovery (START_USER_RECOVERY) or
termination (STOP_DEV). This state is defined as UBLK_S_DEV_QUIESCED.
This patchset does not specify how to detect such a crash in userspace.
The user has many ways to do so. For example, the user may:
(1) send GET_DEV_INFO on specific dev_id and check if its state is
UBLK_S_DEV_QUIESCED.
(2) 'ps' on ublksrv_pid.
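
Option (2) can also be done programmatically. The following is only a
minimal sketch, assuming the caller already knows ublksrv_pid (for example
from GET_DEV_INFO). The helper name ublksrv_pid_alive() is hypothetical and
is not part of ublksrv; it relies only on the POSIX behavior of kill() with
signal 0.

```c
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of ublksrv): report whether a process
 * with the given pid currently exists. Signal 0 performs only the
 * existence and permission checks; no signal is actually delivered.
 */
bool ublksrv_pid_alive(pid_t pid)
{
	if (pid <= 0)
		return false;	/* e.g. ublksrv_pid is -1 once the device is stopped */
	if (kill(pid, 0) == 0)
		return true;
	/* EPERM: the process exists but we are not allowed to signal it */
	return errno == EPERM;
}
```

A pid check alone is racy because pids can be reused, so checking the
device state via GET_DEV_INFO, as in option (1), is likely the more
reliable signal.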

The recovery feature is quite useful for real products. In detail,
we support this scenario:
(1) /dev/ublkc0 is opened by process 0.
(2) Fio is running on /dev/ublkb0 exposed by ublk_drv and all
rqs are handled by process 0.
(3) Process 0 suddenly crashes (e.g. segfault).
(4) Fio is still running and submitting IOs (but these IOs cannot
be dispatched now).
(5) The user starts process 1 and attaches it to /dev/ublkc0.
(6) All rqs are now handled by process 1 and IOs can be
completed again.

Note: The backend must tolerate double-write because we re-issue
any rq that was already sent to the old process 0.

We provide a sample script here to simulate the above steps:

***************************script***************************
LOOPS=10

__ublk_get_pid() {
    pid=`./ublk list -n 0 | grep "pid" | awk '{print $7}'`
    echo $pid
}

ublk_recover_kill()
{
    for CNT in `seq $LOOPS`; do
        dmesg -C
        pid=`__ublk_get_pid`
        echo -e "*** kill $pid now ***"
        kill -9 $pid
        sleep 6
        echo -e "*** recover now ***"
        ./ublk recover -n 0
        sleep 6
    done
}

ublk_test()
{
    echo -e "*** add ublk device ***"
    ./ublk add -t null -d 4 -i 1
    sleep 2
    echo -e "*** start fio ***"
    fio --bs=4k \
        --filename=/dev/ublkb0 \
        --runtime=140s \
        --rw=read &
    sleep 4
    ublk_recover_kill
    wait
    echo -e "*** delete ublk device ***"
    ./ublk del -n 0
}

for CNT in `seq 4`; do
    modprobe -rv ublk_drv
    modprobe ublk_drv
    echo -e "************ round $CNT ************"
    ublk_test
    sleep 5
done
***************************script***************************

You may run it with our modified ublksrv[2], which supports the
recovery feature. No I/O error occurs, and you can verify this
by running:
$ perf-tools/bin/tpoint block:block_rq_error

The basic idea of USER_RECOVERY is quite straightforward:
(1) quiesce ublk queues and requeue/abort rqs.
(2) release/free everything belonging to the dying process.
Note: Since ublk_drv does save information about the user process,
this work is important because we don't want any resource
leakage. In particular, ioucmds from the dying ubq_daemons
need to be completed (freed).
(3) allow new ubq_daemons to issue FETCH_REQ.
Note: ublk_ch_uring_cmd() checks some states and flags. We
have to set them to correct values.

Here are the steps to recover:
(0) requests dispatched after the corresponding ubq_daemon starts dying
are requeued.
(1) monitor_work finds a dying ubq_daemon; it should then
schedule quiesce_work and requeue/abort requests that were issued to
userspace before the ubq_daemon started dying.
(2) quiesce_work must (a) quiesce the request queue to block any incoming
ublk_queue_rq(), (b) wait until all rqs are IDLE, (c) complete old
ioucmds. Then the ublk device is ready for recovery or stop.
(3) The user sends the START_USER_RECOVERY ctrl-cmd to /dev/ublk-control
with a dev_id X (such as 3 for /dev/ublkc3).
(4) ublk_drv then prepares for a new process to attach to /dev/ublkcX.
All ublk_io structures are cleared and ubq_daemons are reset.
(5) The user then starts a new process and ubq_daemons (pthreads) and
sends FETCH_REQ via io_uring_enter() to make all ubqs ready. The
user must correctly set up queues, flags and so on (how to persist
the user's information is outside the scope of this patchset).
(6) The user sends the END_USER_RECOVERY ctrl-cmd to /dev/ublk-control with a
dev_id X.
(7) After receiving END_USER_RECOVERY, ublk_drv waits for all ubq_daemons
to become ready. Then it unquiesces the request queue and new rqs are
allowed.
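
The recovery steps can be read as a small state machine. The sketch below
is a userspace toy model, not driver code: all model_* names are
hypothetical, the error codes are assumptions chosen for illustration, and
where the real END_USER_RECOVERY waits for the ubq_daemons, this model
simply returns -EAGAIN.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy model only; these names are illustrative, not taken from ublk_drv. */
enum model_state { MODEL_LIVE, MODEL_QUIESCED, MODEL_DEAD };

struct model_dev {
	enum model_state state;
	bool recovering;	/* START_USER_RECOVERY accepted, END not yet */
	int nr_queues;
	int nr_queues_ready;	/* queues whose new daemon has sent FETCH_REQ */
};

/* Steps (0)-(2): a dying ubq_daemon ends with the device QUIESCED. */
void model_daemon_died(struct model_dev *d)
{
	if (d->state == MODEL_LIVE)
		d->state = MODEL_QUIESCED;
}

/* Steps (3)-(4): only accepted when QUIESCED; resets per-queue state. */
int model_start_user_recovery(struct model_dev *d)
{
	if (d->state != MODEL_QUIESCED)
		return -EBUSY;	/* assumed error code */
	d->recovering = true;
	d->nr_queues_ready = 0;
	return 0;
}

/* Step (5): one new ubq_daemon has issued FETCH_REQ for its queue. */
int model_queue_ready(struct model_dev *d)
{
	if (!d->recovering || d->nr_queues_ready >= d->nr_queues)
		return -EINVAL;	/* assumed error code */
	d->nr_queues_ready++;
	return 0;
}

/* Steps (6)-(7): the driver waits here; the model reports -EAGAIN instead. */
int model_end_user_recovery(struct model_dev *d)
{
	if (!d->recovering)
		return -EINVAL;
	if (d->nr_queues_ready < d->nr_queues)
		return -EAGAIN;
	d->recovering = false;
	d->state = MODEL_LIVE;
	return 0;
}

/* STOP_DEV is accepted in both the LIVE and QUIESCED states. */
void model_stop_dev(struct model_dev *d)
{
	d->recovering = false;
	d->state = MODEL_DEAD;
}
```

In this model, a device with two queues goes LIVE -> QUIESCED on a crash,
accepts START_USER_RECOVERY, and returns to LIVE only after both queues
have reported ready and END_USER_RECOVERY is sent.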

You should use the ublksrv[2] and tests[3] provided by us. We add 3 additional
tests to verify that the recovery feature works. Our code will be submitted
as a PR to Ming's repo soon.

[1] https://github.com/ming1/ubdsrv
[2] https://github.com/old-memories/ubdsrv/tree/recovery-v1
[3] https://github.com/old-memories/ubdsrv/tree/recovery-v1/tests/generic

Since V4:
(1) remove ublk_cancel_dev() refactor patch
(2) keep START_USER_RECOVERY and END_USER_RECOVERY
(3) avoid UAF on ubq_daemon in monitor_work
(4) add one helper for requeuing/ending rqs

Since V3:
(1) do not kick requeue list in ublk_queue_rq() or io_uring fallback wq
with a dying ubq_daemon, but kick the list once while unquiescing dev
(2) add comment on requeuing rqs in ublk_queue_rq() or io_uring fallback wq
with a dying ubq_daemon
(3) split support for UBLK_F_USER_RECOVERY_REISSUE into a single patch
(4) let monitor_work abort/requeue rqs issued to userspace instead of
quiesce_work with recovery enabled
(5) always wait until no INFLIGHT rq exists in ublk_quiesce_dev()
(6) move ublk re-init stuff into ublk_ch_release()
(7) let ublk_quiesce_dev() go on as long as one ubq_daemon is dying
(8) add only one ctrl-cmd and rename it as RESTART_DEV
(9) check ub.dev_info->flags instead of iterating on all ubqs
(10) do not disable recovery feature, but always quiesce dev in
ublk_stop_dev() and then unquiesce it
(11) add doc on USER_RECOVERY feature

Since V2:
(1) run ublk_quiesce_dev() in a standalone work.
(2) do not run monitor_work after START_USER_RECOVERY is handled.
(3) refactor recovery feature code so that it does not affect current code.

Since V1:
(1) refactor cover letter. Add introduction on "how to detect a crash" and
"why we need recovery feature".
(2) do not refactor task_work and ublk_queue_rq().
(3) allow users to freely stop/recover the device.
(4) add comment on ublk_cancel_queue().
(5) refactor monitor_work and aborting mechanism since we add recovery
mechanism in monitor_work.

ZiyangZhang (7):
ublk_drv: check 'current' instead of 'ubq_daemon'
ublk_drv: define macros for recovery feature and check them
ublk_drv: requeue rqs with recovery feature enabled
ublk_drv: consider recovery feature in aborting mechanism
ublk_drv: support UBLK_F_USER_RECOVERY_REISSUE
ublk_drv: add START_USER_RECOVERY and END_USER_RECOVERY support
Documentation: document ublk user recovery feature

Documentation/block/ublk.rst | 32 ++++
drivers/block/ublk_drv.c | 297 ++++++++++++++++++++++++++++++++--
include/uapi/linux/ublk_cmd.h | 8 +-
3 files changed, 322 insertions(+), 15 deletions(-)

--
2.27.0


2022-09-23 06:27:32

by Ziyang Zhang

Subject: [RESEND PATCH V5 7/7] Documentation: document ublk user recovery feature

Add documentation for user recovery feature of ublk subsystem.

Signed-off-by: ZiyangZhang <[email protected]>
---
Documentation/block/ublk.rst | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)

diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
index 2122d1a4a541..c3dde087e601 100644
--- a/Documentation/block/ublk.rst
+++ b/Documentation/block/ublk.rst
@@ -144,6 +144,38 @@ managing and controlling ublk devices with help of several control commands:
For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's
responsibility to save IO target specific info in userspace.

+- ``UBLK_CMD_START_USER_RECOVERY``
+
+ This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+ command is accepted after the old process has exited, ublk device is quiesced
+ and ``/dev/ublkc*`` is closed. User should send this command before he starts
+ a new process which opens ``/dev/ublkc*``. When this command returns, the
+ ublk device is ready for the new process.
+
+- ``UBLK_CMD_END_USER_RECOVERY``
+
+ This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
+ command is accepted after a new process has opened ``/dev/ublkc*`` and get
+ all ublk queues be ready. When this command returns, ublk device is
+ unquiesced and new I/O requests are passed to the new process.
+
+- user recovery feature description
+
+ Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and
+ ``UBLK_F_USER_RECOVERY_REISSUE``.
+
+ With ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublksrv io handler) is
+ dying, ublk does not release ``/dev/ublkc*`` or ``/dev/ublkb*`` but requeues all
+ inflight requests which have not been issued to userspace. Requests which have
+ been issued to userspace are aborted.
+
+ With ``UBLK_F_USER_RECOVERY_REISSUE`` set, after one ubq_daemon(ublksrv io
+ handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``, requests which have been
+ issued to userspace are requeued and will be re-issued to the new process after
+ handling ``UBLK_CMD_END_USER_RECOVERY``. ``UBLK_F_USER_RECOVERY_REISSUE`` is
+ designed for backends who tolerate double-write since the driver may issue the
+ same I/O request twice. It might be useful to a read-only FS or a VM backend.
+
Data plane
----------

--
2.27.0
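
The difference between the two flags documented in this patch can be
summarized as a small decision function. This is a hedged sketch rather
than driver code: the MODEL_F_* bits and model_rq_action() are
hypothetical names, the bit values are placeholders (the real values live
in include/uapi/linux/ublk_cmd.h), and treating REISSUE without
USER_RECOVERY as "abort" is an assumption of this model.

```c
#include <stdbool.h>

/* Placeholder flag bits for illustration; not the real UAPI values. */
#define MODEL_F_USER_RECOVERY		(1u << 0)
#define MODEL_F_USER_RECOVERY_REISSUE	(1u << 1)

enum rq_action { RQ_ABORT, RQ_REQUEUE };

/*
 * Decide what happens to an inflight request once a ubq_daemon is dying.
 * issued_to_userspace: the request had already been handed to the old
 * process before it started dying.
 */
enum rq_action model_rq_action(unsigned int flags, bool issued_to_userspace)
{
	/* Recovery disabled: default behavior, inflight rqs are aborted. */
	if (!(flags & MODEL_F_USER_RECOVERY))
		return RQ_ABORT;

	/* Not yet seen by userspace: safe to requeue for the new process. */
	if (!issued_to_userspace)
		return RQ_REQUEUE;

	/* Already issued: re-issue only if the backend tolerates double-write. */
	return (flags & MODEL_F_USER_RECOVERY_REISSUE) ? RQ_REQUEUE : RQ_ABORT;
}
```

The only case where the two flags differ is a request that the old
process had already received: it is aborted with UBLK_F_USER_RECOVERY
alone and requeued when UBLK_F_USER_RECOVERY_REISSUE is also set.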

2022-09-23 06:52:03

by Ziyang Zhang

Subject: [RESEND PATCH V5 4/7] ublk_drv: consider recovery feature in aborting mechanism

With the USER_RECOVERY feature enabled, monitor_work schedules
quiesce_work after finding a dying ubq_daemon. monitor_work
should also abort all rqs issued to userspace before the ubq_daemon
started dying. The quiesce_work's job is to:
(1) quiesce the request queue.
(2) check if there is any INFLIGHT rq. If so, we retry until all these
rqs are requeued and become IDLE. These rqs should be requeued by
ublk_queue_rq(), task work, io_uring fallback wq or monitor_work.
(3) complete all ioucmds by calling io_uring_cmd_done(). It is safe to
do so because no ioucmd can be referenced now.
(4) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
recovery. This state is exposed to userspace by GET_DEV_INFO.

The driver can always handle STOP_DEV and clean up everything no matter
whether ub's state is LIVE or QUIESCED. After ub's state is
UBLK_S_DEV_QUIESCED, the user can recover with a new process.

Note: we do not change the default behavior with the recovery feature
disabled. monitor_work still schedules stop_work and aborts inflight
rqs. And finally ublk_device is released.

Signed-off-by: ZiyangZhang <[email protected]>
---
drivers/block/ublk_drv.c | 116 +++++++++++++++++++++++++++++++++++++--
1 file changed, 110 insertions(+), 6 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index 3ae13e46ece6..2a4891c3e5fd 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -120,7 +120,7 @@ struct ublk_queue {

unsigned long io_addr; /* mapped vm address */
unsigned int max_io_sz;
- bool abort_work_pending;
+ bool force_abort;
unsigned short nr_io_ready; /* how many ios setup */
struct ublk_device *dev;
struct ublk_io ios[0];
@@ -162,6 +162,7 @@ struct ublk_device {
* monitor each queue's daemon periodically
*/
struct delayed_work monitor_work;
+ struct work_struct quiesce_work;
struct work_struct stop_work;
};

@@ -773,6 +774,17 @@ static blk_status_t ublk_queue_rq(struct blk_mq_hw_ctx *hctx,
res = ublk_setup_iod(ubq, rq);
if (unlikely(res != BLK_STS_OK))
return BLK_STS_IOERR;
+ /* With recovery feature enabled, force_abort is set in
+ * ublk_stop_dev() before calling del_gendisk(). We have to
+ * abort all requeued and new rqs here to let del_gendisk()
+ * move on. Besides, we cannot call io_uring_cmd_complete_in_task()
+ * to avoid UAF on io_uring ctx.
+ *
+ * Note: force_abort is guaranteed to be seen because it is set
+ * before request queue is unqiuesced.
+ */
+ if (ublk_queue_can_use_recovery(ubq) && unlikely(ubq->force_abort))
+ return BLK_STS_IOERR;

blk_mq_start_request(bd->rq);

@@ -967,7 +979,10 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
struct ublk_queue *ubq = ublk_get_queue(ub, i);

if (ubq_daemon_is_dying(ubq)) {
- schedule_work(&ub->stop_work);
+ if (ublk_queue_can_use_recovery(ubq))
+ schedule_work(&ub->quiesce_work);
+ else
+ schedule_work(&ub->stop_work);

/* abort queue is for making forward progress */
ublk_abort_queue(ub, ubq);
@@ -975,12 +990,13 @@ static void ublk_daemon_monitor_work(struct work_struct *work)
}

/*
- * We can't schedule monitor work after ublk_remove() is started.
+ * We can't schedule monitor work after ub's state is not UBLK_S_DEV_LIVE.
+ * after ublk_remove() or __ublk_quiesce_dev() is started.
*
* No need ub->mutex, monitor work are canceled after state is marked
- * as DEAD, so DEAD state is observed reliably.
+ * as not LIVE, so new state is observed reliably.
*/
- if (ub->dev_info.state != UBLK_S_DEV_DEAD)
+ if (ub->dev_info.state == UBLK_S_DEV_LIVE)
schedule_delayed_work(&ub->monitor_work,
UBLK_DAEMON_MONITOR_PERIOD);
}
@@ -1017,12 +1033,97 @@ static void ublk_cancel_dev(struct ublk_device *ub)
ublk_cancel_queue(ublk_get_queue(ub, i));
}

-static void ublk_stop_dev(struct ublk_device *ub)
+static bool ublk_check_inflight_rq(struct request *rq, void *data)
+{
+ bool *idle = data;
+
+ if (blk_mq_request_started(rq)) {
+ *idle = false;
+ return false;
+ }
+ return true;
+}
+
+static void ublk_wait_tagset_rqs_idle(struct ublk_device *ub)
+{
+ bool idle;
+
+ WARN_ON_ONCE(!blk_queue_quiesced(ub->ub_disk->queue));
+ while (true) {
+ idle = true;
+ blk_mq_tagset_busy_iter(&ub->tag_set,
+ ublk_check_inflight_rq, &idle);
+ if (idle)
+ break;
+ msleep(UBLK_REQUEUE_DELAY_MS);
+ }
+}
+
+static void __ublk_quiesce_dev(struct ublk_device *ub)
{
+ pr_devel("%s: quiesce ub: dev_id %d state %s\n",
+ __func__, ub->dev_info.dev_id,
+ ub->dev_info.state == UBLK_S_DEV_LIVE ?
+ "LIVE" : "QUIESCED");
+ blk_mq_quiesce_queue(ub->ub_disk->queue);
+ ublk_wait_tagset_rqs_idle(ub);
+ ub->dev_info.state = UBLK_S_DEV_QUIESCED;
+ ublk_cancel_dev(ub);
+ /* we are going to release task_struct of ubq_daemon and resets
+ * ->ubq_daemon to NULL. So in monitor_work, check on ubq_daemon causes UAF.
+ * Besides, monitor_work is not necessary in QUIESCED state since we have
+ * already scheduled quiesce_work and quiesced all ubqs.
+ *
+ * Do not let monitor_work schedule itself if state it QUIESCED. And we cancel
+ * it here and re-schedule it in END_USER_RECOVERY to avoid UAF.
+ */
+ cancel_delayed_work_sync(&ub->monitor_work);
+}
+
+static void ublk_quiesce_work_fn(struct work_struct *work)
+{
+ struct ublk_device *ub =
+ container_of(work, struct ublk_device, quiesce_work);
+
mutex_lock(&ub->mutex);
if (ub->dev_info.state != UBLK_S_DEV_LIVE)
goto unlock;
+ __ublk_quiesce_dev(ub);
+ unlock:
+ mutex_unlock(&ub->mutex);
+}

+static void ublk_unquiesce_dev(struct ublk_device *ub)
+{
+ int i;
+
+ pr_devel("%s: unquiesce ub: dev_id %d state %s\n",
+ __func__, ub->dev_info.dev_id,
+ ub->dev_info.state == UBLK_S_DEV_LIVE ?
+ "LIVE" : "QUIESCED");
+ /* quiesce_work has run. We let requeued rqs be aborted
+ * before running fallback_wq. "force_abort" must be seen
+ * after request queue is unqiuesced. Then del_gendisk()
+ * can move on.
+ */
+ for (i = 0; i < ub->dev_info.nr_hw_queues; i++)
+ ublk_get_queue(ub, i)->force_abort = true;
+
+ blk_mq_unquiesce_queue(ub->ub_disk->queue);
+ /* We may have requeued some rqs in ublk_quiesce_queue() */
+ blk_mq_kick_requeue_list(ub->ub_disk->queue);
+}
+
+static void ublk_stop_dev(struct ublk_device *ub)
+{
+ mutex_lock(&ub->mutex);
+ if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+ goto unlock;
+ if (ublk_can_use_recovery(ub)) {
+ if (ub->dev_info.state == UBLK_S_DEV_LIVE)
+ __ublk_quiesce_dev(ub);
+ ublk_unquiesce_dev(ub);
+ }
del_gendisk(ub->ub_disk);
ub->dev_info.state = UBLK_S_DEV_DEAD;
ub->dev_info.ublksrv_pid = -1;
@@ -1346,6 +1447,7 @@ static void ublk_remove(struct ublk_device *ub)
{
ublk_stop_dev(ub);
cancel_work_sync(&ub->stop_work);
+ cancel_work_sync(&ub->quiesce_work);
cdev_device_del(&ub->cdev, &ub->cdev_dev);
put_device(&ub->cdev_dev);
}
@@ -1522,6 +1624,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
goto out_unlock;
mutex_init(&ub->mutex);
spin_lock_init(&ub->mm_lock);
+ INIT_WORK(&ub->quiesce_work, ublk_quiesce_work_fn);
INIT_WORK(&ub->stop_work, ublk_stop_work_fn);
INIT_DELAYED_WORK(&ub->monitor_work, ublk_daemon_monitor_work);

@@ -1642,6 +1745,7 @@ static int ublk_ctrl_stop_dev(struct io_uring_cmd *cmd)

ublk_stop_dev(ub);
cancel_work_sync(&ub->stop_work);
+ cancel_work_sync(&ub->quiesce_work);

ublk_put_device(ub);
return 0;
--
2.27.0
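
The polling pass inside ublk_wait_tagset_rqs_idle() in the patch above
can be modeled in userspace as follows. This is a simplified sketch under
stated assumptions: struct model_rq and the model_* functions are
hypothetical stand-ins, the iterator visits every array slot (whereas
blk_mq_tagset_busy_iter() only visits busy tags), and the msleep() retry
loop is left to the caller.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy request: "started" stands in for blk_mq_request_started(). */
struct model_rq {
	bool started;
};

typedef bool (*model_iter_fn)(struct model_rq *rq, void *data);

/* Visit each request until the callback returns false. */
void model_busy_iter(struct model_rq *rqs, size_t n,
		     model_iter_fn fn, void *data)
{
	for (size_t i = 0; i < n; i++)
		if (!fn(&rqs[i], data))
			break;
}

/* Same shape as ublk_check_inflight_rq(): clear *idle on a started rq. */
bool model_check_inflight_rq(struct model_rq *rq, void *data)
{
	bool *idle = data;

	if (rq->started) {
		*idle = false;
		return false;	/* one started rq is enough; stop iterating */
	}
	return true;
}

/* One polling pass; the driver repeats this with msleep() until true. */
bool model_all_rqs_idle(struct model_rq *rqs, size_t n)
{
	bool idle = true;

	model_busy_iter(rqs, n, model_check_inflight_rq, &idle);
	return idle;
}
```

The early "return false" from the callback is what lets a single started
request end the pass immediately, so the caller can sleep and retry
without scanning the remaining tags.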

2022-09-23 14:13:18

by Ming Lei

Subject: Re: [RESEND PATCH V5 7/7] Documentation: document ublk user recovery feature

On Fri, Sep 23, 2022 at 02:15:05PM +0800, ZiyangZhang wrote:
> Add documentation for user recovery feature of ublk subsystem.
>
> Signed-off-by: ZiyangZhang <[email protected]>
> ---
> Documentation/block/ublk.rst | 32 ++++++++++++++++++++++++++++++++
> 1 file changed, 32 insertions(+)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 2122d1a4a541..c3dde087e601 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -144,6 +144,38 @@ managing and controlling ublk devices with help of several control commands:
> For retrieving device info via ``ublksrv_ctrl_dev_info``. It is the server's
> responsibility to save IO target specific info in userspace.
>
> +- ``UBLK_CMD_START_USER_RECOVERY``
> +
> + This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
> + command is accepted after the old process has exited, ublk device is quiesced
> + and ``/dev/ublkc*`` is closed. User should send this command before he starts
> + a new process which opens ``/dev/ublkc*``. When this command returns, the
> + ublk device is ready for the new process.
> +
> +- ``UBLK_CMD_END_USER_RECOVERY``
> +
> + This command is valid if ``UBLK_F_USER_RECOVERY`` feature is enabled. This
> + command is accepted after a new process has opened ``/dev/ublkc*`` and get
> + all ublk queues be ready. When this command returns, ublk device is
> + unquiesced and new I/O requests are passed to the new process.
> +
> +- user recovery feature description
> +
> + Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and
> + ``UBLK_F_USER_RECOVERY_REISSUE``.
> +
> + With ``UBLK_F_USER_RECOVERY`` set, after one ubq_daemon(ublksrv io handler) is
> + dying, ublk does not release ``/dev/ublkc*`` or ``/dev/ublkb*`` but requeues all

The above does not look accurate: the old ubq daemon has to release
/dev/ublkc*, and the new ubq daemon needs to re-open it. Here
I think it is fine to just mention that /dev/ublkb* won't be
deleted during the whole recovery, or that the device ID is kept,
and that it is the ublk server's responsibility to recover the device
context using its own knowledge.



thanks,
Ming

2022-09-23 14:33:08

by Ming Lei

Subject: Re: [RESEND PATCH V5 4/7] ublk_drv: consider recovery feature in aborting mechanism

On Fri, Sep 23, 2022 at 02:15:02PM +0800, ZiyangZhang wrote:
> With USER_RECOVERY feature enabled, the monitor_work schedules
> quiesce_work after finding a dying ubq_daemon. The monitor_work
> should also abort all rqs issued to userspace before the ubq_daemon is
> dying. The quiesce_work's job is to:
> (1) quiesce request queue.
> (2) check if there is any INFLIGHT rq. If so, we retry until all these
> rqs are requeued and become IDLE. These rqs should be requeued by
> ublk_queue_rq(), task work, io_uring fallback wq or monitor_work.
> (3) complete all ioucmds by calling io_uring_cmd_done(). We are safe to
> do so because no ioucmd can be referenced now.
> (5) set ub's state to UBLK_S_DEV_QUIESCED, which means we are ready for
> recovery. This state is exposed to userspace by GET_DEV_INFO.
>
> The driver can always handle STOP_DEV and cleanup everything no matter
> ub's state is LIVE or QUIESCED. After ub's state is UBLK_S_DEV_QUIESCED,
> user can recover with new process.
>
> Note: we do not change the default behavior with reocvery feature
> disabled. monitor_work still schedules stop_work and abort inflight
> rqs. And finally ublk_device is released.
>
> Signed-off-by: ZiyangZhang <[email protected]>

Looks fine,

Reviewed-by: Ming Lei <[email protected]>


Thanks,
Ming