2022-10-22 15:48:07

by John Garry

Subject: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
biotail, __sector, and __data_len members, which blk_mq_alloc_request()
has.

Move init'ing of those members to common blk_mq_rq_ctx_init().

Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
Suggested-by: Bart Van Assche <[email protected]>
Signed-off-by: John Garry <[email protected]>

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 8070b6c10e8d..260adeb2e455 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
}
}

+ rq->__data_len = 0;
+ rq->__sector = (sector_t) -1;
+ rq->bio = rq->biotail = NULL;
+
return rq;
}

@@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
if (!rq)
goto out_queue_exit;
}
- rq->__data_len = 0;
- rq->__sector = (sector_t) -1;
- rq->bio = rq->biotail = NULL;
return rq;
out_queue_exit:
blk_queue_exit(q);
--
2.35.3


2022-10-23 13:55:46

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Sun, Oct 23, 2022 at 12:07:26AM +0800, John Garry wrote:
> Function blk_mq_alloc_request_hctx() is missing zeroing/init of rq->bio,
> biotail, __sector, and __data_len members, which blk_mq_alloc_request()
> has.
>
> Move init'ing of those members to common blk_mq_rq_ctx_init().
>
> Fixes: 1f5bd336b9150 ("blk-mq: add blk_mq_alloc_request_hctx")
> Suggested-by: Bart Van Assche <[email protected]>
> Signed-off-by: John Garry <[email protected]>
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8070b6c10e8d..260adeb2e455 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
> }
> }
>
> + rq->__data_len = 0;
> + rq->__sector = (sector_t) -1;
> + rq->bio = rq->biotail = NULL;
> +
> return rq;
> }
>
> @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> if (!rq)
> goto out_queue_exit;
> }
> - rq->__data_len = 0;
> - rq->__sector = (sector_t) -1;
> - rq->bio = rq->biotail = NULL;

This patch doesn't look good: why switch to initializing the three fields
twice in the fast path?

BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
avoid extending it to other use cases.

Thanks,
Ming

2022-10-24 11:33:24

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 23/10/2022 14:12, Ming Lei wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8070b6c10e8d..260adeb2e455 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
>> }
>> }
>>
>> + rq->__data_len = 0;
>> + rq->__sector = (sector_t) -1;
>> + rq->bio = rq->biotail = NULL;
>> +
>> return rq;
>> }
>>
>> @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
>> if (!rq)
>> goto out_queue_exit;
>> }
>> - rq->__data_len = 0;
>> - rq->__sector = (sector_t) -1;
>> - rq->bio = rq->biotail = NULL;
> This patch doesn't look good: why switch to initializing the three fields
> twice in the fast path?

Can you please show me how these are initialized twice?

If there is a real concern with this then we can go with my original
idea, which was to copy the init method of blk_mq_alloc_request() into
blk_mq_alloc_request_hctx().

>
> BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> avoid extending it to other use cases.

Yeah, I know this, but sometimes we just need to allocate for a specific
HW queue...

For my use case of interest, it should not matter if the cpumask of the
HW queue goes offline after selecting the cpu in
blk_mq_alloc_request_hctx(), so any race is ok ... I think.

However, it should still be possible to make blk_mq_alloc_request_hctx()
more robust. How about using something like work_on_cpu_safe() to
allocate and execute the request with blk_mq_alloc_request() on a cpu
associated with the HW queue, such that we know the cpu is online and
stays online until we execute it? Or also extend to a
work_on_cpumask_safe() variant, so that we don't need to try all cpus in
the mask (to see if they are online)?

Thanks,
John

2022-10-24 21:11:46

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Mon, Oct 24, 2022 at 11:56:21AM +0100, John Garry wrote:
> On 23/10/2022 14:12, Ming Lei wrote:
> > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > index 8070b6c10e8d..260adeb2e455 100644
> > > --- a/block/blk-mq.c
> > > +++ b/block/blk-mq.c
> > > @@ -402,6 +402,10 @@ static struct request *blk_mq_rq_ctx_init(struct blk_mq_alloc_data *data,
> > > }
> > > }
> > > + rq->__data_len = 0;
> > > + rq->__sector = (sector_t) -1;
> > > + rq->bio = rq->biotail = NULL;
> > > +
> > > return rq;
> > > }
> > > @@ -591,9 +595,6 @@ struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> > > if (!rq)
> > > goto out_queue_exit;
> > > }
> > > - rq->__data_len = 0;
> > > - rq->__sector = (sector_t) -1;
> > > - rq->bio = rq->biotail = NULL;
> > This patch doesn't look good: why switch to initializing the three fields
> > twice in the fast path?
>
> Can you please show me how these are initialized twice?

blk_mq_bio_to_request() is one place which sets up these fields, and
then you add another in blk_mq_rq_ctx_init().

>
> If there is a real concern with this then we go with my original idea, which
> was to copy the init method of blk_mq_alloc_request() (in
> blk_mq_alloc_request_hctx())
>
> >
> > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> > avoid extending it to other use cases.
>
> Yeah, I know this,

Do you know the exact issue that nvme-tcp, nvme-rdma or nvme-fc may
have with blk_mq_alloc_request_hctx()?

> but sometimes we just need to allocate for a specific HW
> queue...
>
> For my usecase of interest, it should not impact if the cpumask of the HW
> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> so any race is ok ... I think.
>
> However it should be still possible to make blk_mq_alloc_request_hctx() more
> robust. How about using something like work_on_cpu_safe() to allocate and
> execute the request with blk_mq_alloc_request() on a cpu associated with the
> HW queue, such that we know the cpu is online and stays online until we
> execute it? Or also extend to a work_on_cpumask_safe() variant, so that we
> don't need to try all cpus in the mask (to see if online)?

But all cpus in this hctx->cpumask could become offline.


Thanks,
Ming

2022-10-24 23:48:10

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 24/10/2022 14:27, Ming Lei wrote:
>>>> - rq->bio = rq->biotail = NULL;
>>> This patch doesn't look good: why switch to initializing the three fields
>>> twice in the fast path?
>> Can you please show me how these are initialized twice?
> blk_mq_bio_to_request() is one which setup these fields, then you add
> another one in blk_mq_rq_ctx_init().

ok, understood.

>
>> If there is a real concern with this then we go with my original idea, which
>> was to copy the init method of blk_mq_alloc_request() (in
>> blk_mq_alloc_request_hctx())
>>
>>> BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
>>> avoid extending it to other use cases.
>> Yeah, I know this,
> Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe
> with blk_mq_alloc_request_hctx()?

I thought that the original issue was an out-of-bounds issue, fixed in
commit 14dc7a18. There is still some issue described in the following
link, which remains unresolved as I understand it:
https://lore.kernel.org/linux-block/[email protected]/#t

But I think that 14dc7a18 may still leave an undesirable scenario:
- all cpus in the HW queue's cpumask may go offline after the
cpu_online_mask read in blk_mq_alloc_request_hctx() and before we get
the driver tag and set rq->hctx

>
>> but sometimes we just need to allocate for a specific HW
>> queue...
>>
>> For my usecase of interest, it should not impact if the cpumask of the HW
>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
>> so any race is ok ... I think.
>>
>> However it should be still possible to make blk_mq_alloc_request_hctx() more
>> robust. How about using something like work_on_cpu_safe() to allocate and
>> execute the request with blk_mq_alloc_request() on a cpu associated with the
>> HW queue, such that we know the cpu is online and stays online until we
>> execute it? Or also extend to a work_on_cpumask_safe() variant, so that we
>> don't need to try all cpus in the mask (to see if online)?
> But all cpus on this hctx->cpumask could become offline.

If all cpus in hctx->cpumask are offline then we should not allocate a
request, and this is acceptable. Maybe I am missing your point.

Thanks,
John

2022-10-25 02:32:50

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Mon, Oct 24, 2022 at 05:56:15PM +0100, John Garry wrote:
> On 24/10/2022 14:27, Ming Lei wrote:
> > > > > - rq->bio = rq->biotail = NULL;
> > > > This patch doesn't look good: why switch to initializing the three fields
> > > > twice in the fast path?
> > > Can you please show me how these are initialized twice?
> > blk_mq_bio_to_request() is one which setup these fields, then you add
> > another one in blk_mq_rq_ctx_init().
>
> ok, understood.
>
> >
> > > If there is a real concern with this then we go with my original idea, which
> > > was to copy the init method of blk_mq_alloc_request() (in
> > > blk_mq_alloc_request_hctx())
> > >
> > > > BTW, we know blk_mq_alloc_request_hctx() has big trouble, so please
> > > > avoid extending it to other use cases.
> > > Yeah, I know this,
> > Did you know the exact issue on nvme-tcp, nvme-rdma or nvme-fc maybe
> > with blk_mq_alloc_request_hctx()?
>
> I thought that the original issue was an OoO bounds issue, fixed in
> 14dc7a18. Now there is still some issue in the following link, which is
> still unresolved as I understand:
> https://lore.kernel.org/linux-block/[email protected]/#t
>
> But I think that 14dc7a18 may still leave undesirable scenario:
> - all cpus in HW queue cpumask may go offline after cpu_online_mask read in
> blk_mq_alloc_request_hctx() and before we get the driver tag and set
> rq->hctx

Yeah.

>
> >
> > > but sometimes we just need to allocate for a specific HW
> > > queue...
> > >
> > > For my usecase of interest, it should not impact if the cpumask of the HW
> > > queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> > > so any race is ok ... I think.
> > >
> > > However it should be still possible to make blk_mq_alloc_request_hctx() more
> > > robust. How about using something like work_on_cpu_safe() to allocate and
> > > execute the request with blk_mq_alloc_request() on a cpu associated with the
> > > HW queue, such that we know the cpu is online and stays online until we
> > > execute it? Or also extend to a work_on_cpumask_safe() variant, so that we
> > > don't need to try all cpus in the mask (to see if online)?
> > But all cpus on this hctx->cpumask could become offline.
>
> If all hctx->cpumask are offline then we should not allocate a request and
> this is acceptable. Maybe I am missing your point.

As you saw, this API has the above problem too, but any one of the CPUs
may come online later, maybe just during blk_mq_alloc_request_hctx(),
and it is easy to cause an inconsistency.

You didn't share your use case, but for the nvme connection request, if
it is a 1:1 mapping and any one of the CPUs goes offline, the controller
initialization could fail; that isn't good from the user's viewpoint at
all.


Thanks,
Ming

2022-10-25 08:37:58

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 25/10/2022 01:34, Ming Lei wrote:
>>>> but sometimes we just need to allocate for a specific HW
>>>> queue...
>>>>
>>>> For my usecase of interest, it should not impact if the cpumask of the HW
>>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
>>>> so any race is ok ... I think.
>>>>
>>>> However it should be still possible to make blk_mq_alloc_request_hctx() more
>>>> robust. How about using something like work_on_cpu_safe() to allocate and
>>>> execute the request with blk_mq_alloc_request() on a cpu associated with the
>>>> HW queue, such that we know the cpu is online and stays online until we
>>>> execute it? Or also extend to a work_on_cpumask_safe() variant, so that we
>>>> don't need to try all cpus in the mask (to see if online)?
>>> But all cpus on this hctx->cpumask could become offline.
>> If all hctx->cpumask are offline then we should not allocate a request and
>> this is acceptable. Maybe I am missing your point.
> As you saw, this API has the above problem too, but any one of CPUs
> may become online later, maybe just during blk_mq_alloc_request_hctx(),
> and it is easy to cause inconsistence.
>
> You didn't share your use case, but for nvme connection request, if it
> is 1:1 mapping, if any one of CPU becomes offline, the controller
> initialization could be failed, that isn't good from user viewpoint at
> all.

My use case is in the SCSI EH domain. For my HBA controller of
interest, to abort an erroneous IO we must send a controller-proprietary
abort command on the same HW queue as the original command. So we would
need to allocate this abort request for a specific HW queue.

I mentioned before that if no cpu in hctx->cpumask is online then we
don't need to allocate a request. That is because if no cpu in
hctx->cpumask is online, the original erroneous IO must have completed,
due to the nature of how the blk-mq cpu hotplug handler works, i.e. the
queue is drained, and then we don't actually need to abort it any
longer, so it is ok to not get a request.

I have an RFC series for this work in which I am using
blk_mq_alloc_request_hctx(). However, as I mentioned before, I can
experiment with using something like work_on_cpu_safe() to alloc and
execute the abort request to safeguard against cpu hotplug events.

Thanks,
John

2022-10-25 09:07:43

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Tue, Oct 25, 2022 at 3:40 PM John Garry <[email protected]> wrote:
>
> On 25/10/2022 01:34, Ming Lei wrote:
> >>>> but sometimes we just need to allocate for a specific HW
> >>>> queue...
> >>>>
> >>>> For my usecase of interest, it should not impact if the cpumask of the HW
> >>>> queue goes offline after selecting the cpu in blk_mq_alloc_request_hctx(),
> >>>> so any race is ok ... I think.
> >>>>
> >>>> However it should be still possible to make blk_mq_alloc_request_hctx() more
> >>>> robust. How about using something like work_on_cpu_safe() to allocate and
> >>>> execute the request with blk_mq_alloc_request() on a cpu associated with the
> >>>> HW queue, such that we know the cpu is online and stays online until we
> >>>> execute it? Or also extend to a work_on_cpumask_safe() variant, so that we
> >>>> don't need to try all cpus in the mask (to see if online)?
> >>> But all cpus on this hctx->cpumask could become offline.
> >> If all hctx->cpumask are offline then we should not allocate a request and
> >> this is acceptable. Maybe I am missing your point.
> > As you saw, this API has the above problem too, but any one of CPUs
> > may become online later, maybe just during blk_mq_alloc_request_hctx(),
> > and it is easy to cause inconsistence.
> >
> > You didn't share your use case, but for nvme connection request, if it
> > is 1:1 mapping, if any one of CPU becomes offline, the controller
> > initialization could be failed, that isn't good from user viewpoint at
> > all.
>
> My use case is in SCSI EH domain. For my HBA controller of interest, to
> abort an erroneous IO we must send a controller proprietary abort
> command on same HW queue as original command. So we would need to
> allocate this abort request for a specific HW queue.

IMO, it is a bad hw/sw interface.

First, such a request has to be reserved, since all inflight IOs can be
in error.

Second, error handling needs to provide forward progress, and it is
supposed not to require external dependencies, otherwise it is easy to
cause deadlock; but here a request from a specific HW queue depends on
this queue's cpumask.

Also, if it has to be reserved, it can be done as a device/driver
private command, so why bother blk-mq with this special use case?

>
> I mentioned before that if no hctx->cpumask is online then we don't need
> to allocate a request. That is because if no hctx->cpumask is online,
> this means that original erroneous IO must be completed due to nature of
> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> actually need to abort it any longer, so ok to not get a request.

No, it is really not OK: if all cpus in hctx->cpumask are offline, you
can't allocate a request on the specified hw queue, then the erroneous
IO can't be handled, and the cpu hotplug handler may hang forever.

Thanks,
Ming


2022-10-25 09:21:11

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 25/10/2022 10:00, Ming Lei wrote:
>> My use case is in SCSI EH domain. For my HBA controller of interest, to
>> abort an erroneous IO we must send a controller proprietary abort
>> command on same HW queue as original command. So we would need to
>> allocate this abort request for a specific HW queue.
> IMO, it is one bad hw/sw interface.
>
> First such request has to be reserved, since all inflight IOs can be in error.

Right

>
> Second error handling needs to provide forward-progress, and it is supposed
> to not require external dependency, otherwise easy to cause deadlock, but
> here request from specific HW queue just depends on this queue's cpumask.
>
> Also if it has to be reserved, it can be done as one device/driver private
> command, so why bother blk-mq for this special use case?

I have a series for reserved request support, which I will send later.
Please have a look. And as I mentioned, I would probably not end up
using blk_mq_alloc_request_hctx() anyway.

>
>> I mentioned before that if no hctx->cpumask is online then we don't need
>> to allocate a request. That is because if no hctx->cpumask is online,
>> this means that original erroneous IO must be completed due to nature of
>> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
>> actually need to abort it any longer, so ok to not get a request.
> No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> can't allocate
> request on the specified hw queue, then the erroneous IO can't be handled,
> then cpu hotplug handler may hang for ever.

If the erroneous IO is still in flight from a blk-mq perspective, then
how can all of hctx->cpumask be offline? I thought that we guarantee
that the cpus in hctx->cpumask cannot go offline until drained.

Thanks,
John


2022-10-25 09:29:54

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Tue, Oct 25, 2022 at 10:08:10AM +0100, John Garry wrote:
> On 25/10/2022 10:00, Ming Lei wrote:
> > > My use case is in SCSI EH domain. For my HBA controller of interest, to
> > > abort an erroneous IO we must send a controller proprietary abort
> > > command on same HW queue as original command. So we would need to
> > > allocate this abort request for a specific HW queue.
> > IMO, it is one bad hw/sw interface.
> >
> > First such request has to be reserved, since all inflight IOs can be in error.
>
> Right
>
> >
> > Second error handling needs to provide forward-progress, and it is supposed
> > to not require external dependency, otherwise easy to cause deadlock, but
> > here request from specific HW queue just depends on this queue's cpumask.
> >
> > Also if it has to be reserved, it can be done as one device/driver private
> > command, so why bother blk-mq for this special use case?
>
> I have a series for reserved request support, which I will send later.
> Please have a look. And as I mentioned, I would prob not end up using
> blk_mq_alloc_request_hctx() anyway.
>
> >
> > > I mentioned before that if no hctx->cpumask is online then we don't need
> > > to allocate a request. That is because if no hctx->cpumask is online,
> > > this means that original erroneous IO must be completed due to nature of
> > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> > > actually need to abort it any longer, so ok to not get a request.
> > No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> > can't allocate
> > request on the specified hw queue, then the erroneous IO can't be handled,
> > then cpu hotplug handler may hang for ever.
>
> If the erroneous IO is still in-flight from blk-mq perspective, then how can
> hctx->cpumask still be offline? I thought that we guarantee that
> hctx->cpumask cannot go offline until drained.

Yeah, the draining is done before the cpu goes offline. But the drain
simply waits for the inflight IO to be completed. If the IO fails during
the wait, you can't allocate such a reserved request for error handling,
and then blk_mq_hctx_notify_offline() hangs forever.

If you just make it a driver-private command, there can't be such an
issue. The block layer is meant to handle the common cases (normal io
and pt io); I'd suggest not putting such special cases into the block
layer.

thanks,
Ming


2022-10-25 09:49:08

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 25/10/2022 10:16, Ming Lei wrote:
>>>> I mentioned before that if no hctx->cpumask is online then we don't need
>>>> to allocate a request. That is because if no hctx->cpumask is online,
>>>> this means that original erroneous IO must be completed due to nature of
>>>> how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
>>>> actually need to abort it any longer, so ok to not get a request.
>>> No, it is really not OK, if all cpus in hctx->cpumask are offline, you
>>> can't allocate
>>> request on the specified hw queue, then the erroneous IO can't be handled,
>>> then cpu hotplug handler may hang for ever.
>> If the erroneous IO is still in-flight from blk-mq perspective, then how can
>> hctx->cpumask still be offline? I thought that we guarantee that
>> hctx->cpumask cannot go offline until drained.
> Yeah, the draining is done before the cpu is offline. But the drain is
> simply waiting for the inflight IO to be completed. If the IO is failed
> during the waiting, you can't allocate such reserved request for error
> handling, then hang ever in blk_mq_hctx_notify_offline().

Actually, if the final cpu in hctx->cpumask is going offline, then the
hctx won't queue any more requests, right? In that case I don't think we
can queue on that hctx anyway. I need to think about this more.

>
> If you just make it one driver private command, there can't be such
> issue.

Well, we're trying to use reserved requests for EH commands, which goes
against that.

> Block layer is supposed for handling common case(normal io and pt io),
> I'd suggest to not put such special cases into block layer.

It also supports reserved commands, which I would assume would be
suitable for EH scenarios.

Thanks,
John


2022-10-25 11:36:02

by Ming Lei

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Tue, Oct 25, 2022 at 10:32:28AM +0100, John Garry wrote:
> On 25/10/2022 10:16, Ming Lei wrote:
> > > > > I mentioned before that if no hctx->cpumask is online then we don't need
> > > > > to allocate a request. That is because if no hctx->cpumask is online,
> > > > > this means that original erroneous IO must be completed due to nature of
> > > > > how blk-mq cpu hotplug handler works, i.e. drained, and then we don't
> > > > > actually need to abort it any longer, so ok to not get a request.
> > > > No, it is really not OK, if all cpus in hctx->cpumask are offline, you
> > > > can't allocate
> > > > request on the specified hw queue, then the erroneous IO can't be handled,
> > > > then cpu hotplug handler may hang for ever.
> > > If the erroneous IO is still in-flight from blk-mq perspective, then how can
> > > hctx->cpumask still be offline? I thought that we guarantee that
> > > hctx->cpumask cannot go offline until drained.
> > Yeah, the draining is done before the cpu is offline. But the drain is
> > simply waiting for the inflight IO to be completed. If the IO is failed
> > during the waiting, you can't allocate such reserved request for error
> > handling, then hang ever in blk_mq_hctx_notify_offline().
>
> Actually if final cpu in hctx->cpumask is going offline, then hctx won't
> queue any more requests, right? In this case I don't think we can queue on
> that hctx anyway. I need to think about this more.

It can be queued, actually, but the interrupt may not be delivered if a
managed irq is used.

>
> >
> > If you just make it one driver private command, there can't be such
> > issue.
>
> Well we're trying to use reserved requests for EH commands, which that goes
> against.
>
> > Block layer is supposed for handling common case(normal io and pt io),
> > I'd suggest to not put such special cases into block layer.
>
> It also supports reserved commands, which I would assume would be suitable
> for EH scenarios.

Then you have to be careful: as I mentioned, EH has to provide forward
progress, and if you let blk-mq allocate & submit the EH request,
attention has to be paid to the implied dependency from blk-mq.


Thanks,
Ming


2022-10-25 11:45:27

by John Garry

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On 25/10/2022 12:21, Ming Lei wrote:
>> Actually if final cpu in hctx->cpumask is going offline, then hctx won't
>> queue any more requests, right? In this case I don't think we can queue on
>> that hctx anyway. I need to think about this more.
> It can be queued actually, but interrupt may not be delivered if managed
> irq is used.

Yes, I think it will be queued elsewhere. I would need to check the code
again.

>
>>> If you just make it one driver private command, there can't be such
>>> issue.
>> Well we're trying to use reserved requests for EH commands, which that goes
>> against.
>>
>>> Block layer is supposed for handling common case(normal io and pt io),
>>> I'd suggest to not put such special cases into block layer.
>> It also supports reserved commands, which I would assume would be suitable
>> for EH scenarios.
> Then you have to be careful, as I mentioned, EH has to provide forward
> progress, if you let blk-mq allocate & submit EH request, the implied
> dependency from blk-mq has to be paid attention.

OK, thanks. I know that this carries risk, but it seems the right approach.

I have been thinking about my HW queue allocation requirement, and
maybe we can solve it in the low-level driver instead.

The requirement is to send this abort command on the same queue as the
erroneous command, to ensure that they do not race in HW submission,
even though the chance of this is really tiny. Maybe we can make the
low-level driver wait until the erroneous command is really submitted to
the HW, by checking a HW register, etc., before issuing the abort on any
HW queue (and so we would not need blk_mq_alloc_request_hctx() or
similar).

BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly
init ->bio and other fields - ok?

Thanks,
John


2022-10-25 13:02:22

by Christoph Hellwig

Subject: Re: [PATCH] blk-mq: Properly init bios from blk_mq_alloc_request_hctx()

On Tue, Oct 25, 2022 at 12:36:10PM +0100, John Garry wrote:
> The requirement is to send this abort command on same queue as erroneous
> command to ensure that they do not race in HW submission, even though
> chance of this is really tiny. Maybe we can make low-level driver wait
> until erroneous command is really submitted to HW by checking HW register,
> etc. before issuing abort on any HW queue (and so would not need
> blk_mq_alloc_request_hctx() or similar).

I'm not sure this is a good idea. I can think of all kinds of interfaces
that could have similar requirements that absolutely do make sense from
the hardware / firmware side. So despite Ming not liking
blk_mq_alloc_request_hctx, there is very little chance of it going away,
and thus also very little need to avoid users, as more will eventually
pop up whether we want it or not.

> BTW, I would still like to fix blk_mq_alloc_request_hctx() to properly init
> ->bio and other fields - ok?

Yes, it should behave the same as blk_mq_alloc_request in that respect,
and we should just copy the assignments to bio, biotail, __sector
and __data_len from it as you did in your RFC patch.

>
> Thanks,
> John
---end quoted text---