Hi,
I have a question on $subject which I hope you can shed some light on.
According to commit c5cb83bb337c25 ("genirq/cpuhotplug: Handle managed
IRQs on CPU hotplug"), if we offline the last CPU in a managed IRQ
affinity mask, the IRQ is shutdown.
The reasoning is that this IRQ is thought to be associated with a
specific queue on a MQ device, and the CPUs in the IRQ affinity mask are
the same CPUs associated with the queue. So, if no CPU is using the
queue, then no need for the IRQ.
However, how does this handle the scenario of the last CPU in the IRQ
affinity mask being offlined while IO associated with the queue is still
in flight?
Or if we make the decision to use the queue associated with the current
CPU, and then that CPU (being the last CPU online in the queue's IRQ
affinity mask) goes offline and we finish the delivery with another CPU?
In these cases, when the IO completes, it would not be serviced and would
time out.
I have actually tried this on my arm64 system and I see IO timeouts.
Thanks in advance,
John
On 1/29/19 12:25 PM, John Garry wrote:
> Hi,
>
> I have a question on $subject which I hope you can shed some light on.
>
> According to commit c5cb83bb337c25 ("genirq/cpuhotplug: Handle managed
> IRQs on CPU hotplug"), if we offline the last CPU in a managed IRQ
> affinity mask, the IRQ is shutdown.
>
> The reasoning is that this IRQ is thought to be associated with a
> specific queue on a MQ device, and the CPUs in the IRQ affinity mask are
> the same CPUs associated with the queue. So, if no CPU is using the
> queue, then no need for the IRQ.
>
> However, how does this handle the scenario of the last CPU in the IRQ
> affinity mask being offlined while IO associated with the queue is still
> in flight?
>
> Or if we make the decision to use the queue associated with the current
> CPU, and then that CPU (being the last CPU online in the queue's IRQ
> affinity mask) goes offline and we finish the delivery with another CPU?
>
> In these cases, when the IO completes, it would not be serviced and would
> time out.
>
> I have actually tried this on my arm64 system and I see IO timeouts.
>
That actually is a very good question, and I have been wondering about
this for quite some time.
I find it a bit hard to envision a scenario where the IRQ affinity is
automatically (and, more importantly, atomically!) re-routed to one of
the other CPUs.
And even if it were, chances are that there are checks in the driver
_preventing_ them from handling those requests, seeing that they should
have been handled by another CPU ...
I guess the safest bet is to implement a 'cleanup' worker queue which is
responsible for looking through all the outstanding commands (on all
hardware queues), and then complete those for which no corresponding CPU
/ irqhandler can be found.
But I defer to the higher authorities here; maybe I'm totally wrong and
it's already been taken care of.
But if there is no generic mechanism this really is a fit topic for
LSF/MM, as most other drivers would be affected, too.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On Tue, 29 Jan 2019, Hannes Reinecke wrote:
> That actually is a very good question, and I have been wondering about this
> for quite some time.
>
> I find it a bit hard to envision a scenario where the IRQ affinity is
> automatically (and, more importantly, atomically!) re-routed to one of the
> other CPUs.
> And even if it were, chances are that there are checks in the driver
> _preventing_ them from handling those requests, seeing that they should have
> been handled by another CPU ...
>
> I guess the safest bet is to implement a 'cleanup' worker queue which is
> responsible for looking through all the outstanding commands (on all hardware
> queues), and then complete those for which no corresponding CPU / irqhandler
> can be found.
>
> But I defer to the higher authorities here; maybe I'm totally wrong and it's
> already been taken care of.
TBH, I don't know. I was merely involved in the genirq side of this. But
yes, in order to make this work correctly the basic contract for the CPU
hotplug case must be:
If the last CPU which is associated to a queue (and the corresponding
interrupt) goes offline, then the subsystem/driver code has to make sure
that:
1) No more requests can be queued on that queue
2) All outstanding requests of that queue have been completed or redirected
(don't know if that's possible at all) to some other queue.
That has to be done in that order obviously. Whether any of the
subsystems/drivers actually implements this, I can't tell.
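To make that contract concrete, here is a rough sketch of how a driver could
hook it up via a dynamic CPU hotplug state. The mydrv_queue structure and the
mydrv_* helpers are made up for illustration; only cpuhp_setup_state() and the
cpumask helpers are real kernel API.

/* Sketch only: mydrv_queue and all mydrv_* helpers are hypothetical. */
#include <linux/cpuhotplug.h>
#include <linux/cpumask.h>

static int mydrv_cpu_offline(unsigned int cpu)
{
	struct mydrv_queue *q = mydrv_cpu_to_queue(cpu);
	unsigned int other;

	/* Other online CPUs in the queue's spread can still service it. */
	for_each_cpu_and(other, q->affinity_mask, cpu_online_mask) {
		if (other != cpu)
			return 0;
	}

	/* 1) Stop new requests from being queued on this queue. */
	mydrv_stop_queueing(q);
	/* 2) Wait until everything already queued has completed (or has
	 *    been redirected, if the hardware can do that).
	 */
	mydrv_drain_queue(q);
	return 0;
}

static int mydrv_cpu_online(unsigned int cpu)
{
	mydrv_start_queueing(mydrv_cpu_to_queue(cpu));
	return 0;
}

static int mydrv_register_hotplug(void)
{
	int ret;

	ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "mydrv:online",
				mydrv_cpu_online, mydrv_cpu_offline);
	return ret < 0 ? ret : 0;
}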
Thanks,
tglx
Hi Hannes, Thomas,
On 29/01/2019 12:01, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, Hannes Reinecke wrote:
>> That actually is a very good question, and I have been wondering about this
>> for quite some time.
>>
>> I find it a bit hard to envision a scenario where the IRQ affinity is
>> automatically (and, more importantly, atomically!) re-routed to one of the
>> other CPUs.
Isn't this what happens today for non-managed IRQs?
>> And even if it were, chances are that there are checks in the driver
>> _preventing_ them from handling those requests, seeing that they should have
>> been handled by another CPU ...
Really? I would not think that it matters which CPU we service the
interrupt on.
>>
>> I guess the safest bet is to implement a 'cleanup' worker queue which is
>> responsible for looking through all the outstanding commands (on all hardware
>> queues), and then complete those for which no corresponding CPU / irqhandler
>> can be found.
>>
>> But I defer to the higher authorities here; maybe I'm totally wrong and it's
>> already been taken care of.
>
> TBH, I don't know. I was merely involved in the genirq side of this. But
> yes, in order to make this work correctly the basic contract for the CPU
> hotplug case must be:
>
> If the last CPU which is associated to a queue (and the corresponding
> interrupt) goes offline, then the subsystem/driver code has to make sure
> that:
>
> 1) No more requests can be queued on that queue
>
> 2) All outstanding requests of that queue have been completed or redirected
> (don't know if that's possible at all) to some other queue.
This may not be possible. For the HW I deal with, we have symmetrical
delivery and completion queues, and a command delivered on DQx will
always complete on CQx. Each completion queue has a dedicated IRQ.
>
> That has to be done in that order obviously. Whether any of the
> subsystems/drivers actually implements this, I can't tell.
Going back to c5cb83bb337c25, it seems to me that the change was made
with the idea that we can maintain the affinity for the IRQ while
shutting it down, as no interrupts should occur.
However, I don't see why we can't instead keep the IRQ up, set the
affinity to all online CPUs in the offline path, and restore the original
affinity in the online path. The reason we set the queue affinity to
specific CPUs is for performance, but I would not say that this matters
for handling residual IRQs.
Thanks,
John
>
> Thanks,
>
> tglx
>
> .
>
On Tue, Jan 29, 2019 at 03:25:48AM -0800, John Garry wrote:
> Hi,
>
> I have a question on $subject which I hope you can shed some light on.
>
> According to commit c5cb83bb337c25 ("genirq/cpuhotplug: Handle managed
> IRQs on CPU hotplug"), if we offline the last CPU in a managed IRQ
> affinity mask, the IRQ is shutdown.
>
> The reasoning is that this IRQ is thought to be associated with a
> specific queue on a MQ device, and the CPUs in the IRQ affinity mask are
> the same CPUs associated with the queue. So, if no CPU is using the
> queue, then no need for the IRQ.
>
> However, how does this handle the scenario of the last CPU in the IRQ
> affinity mask being offlined while IO associated with the queue is still
> in flight?
>
> Or if we make the decision to use the queue associated with the current
> CPU, and then that CPU (being the last CPU online in the queue's IRQ
> affinity mask) goes offline and we finish the delivery with another CPU?
>
> In these cases, when the IO completes, it would not be serviced and would
> time out.
>
> I have actually tried this on my arm64 system and I see IO timeouts.
Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
which would reap all outstanding commands before the CPU and IRQ are
taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
Create hctx for each present CPU"). It sounds like we should bring
something like that back, but make it more fine-grained to the per-cpu context.
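For reference, blk-mq still has a freeze primitive that blocks new submissions
and waits for every outstanding request on a request_queue; the fine-grained
per-hctx variant discussed here would need something similar per hardware
context. A minimal sketch of the coarse form (the mydrv_* name is made up, the
blk-mq calls are real):

#include <linux/blk-mq.h>
#include <linux/blkdev.h>

/* Sketch only: drain a whole request_queue around a topology change. */
static void mydrv_drain_queue_for_hotplug(struct request_queue *q)
{
	/*
	 * Blocks new submissions and waits until the queue's usage counter
	 * drops to zero, i.e. all outstanding requests have completed.
	 */
	blk_mq_freeze_queue(q);

	/* ... the CPU/IRQ can be taken offline here ... */

	/* Allow submissions again once the topology change is done. */
	blk_mq_unfreeze_queue(q);
}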
On Tue, 29 Jan 2019, John Garry wrote:
> On 29/01/2019 12:01, Thomas Gleixner wrote:
> > If the last CPU which is associated to a queue (and the corresponding
> > interrupt) goes offline, then the subsystem/driver code has to make sure
> > that:
> >
> > 1) No more requests can be queued on that queue
> >
> > 2) All outstanding requests of that queue have been completed or redirected
> > (don't know if that's possible at all) to some other queue.
>
> This may not be possible. For the HW I deal with, we have symmetrical delivery
> and completion queues, and a command delivered on DQx will always complete on
> CQx. Each completion queue has a dedicated IRQ.
So you can stop queueing on DQx and wait for all outstanding ones to come
in on CQx, right?
> > That has to be done in that order obviously. Whether any of the
> > subsystems/drivers actually implements this, I can't tell.
>
> Going back to c5cb83bb337c25, it seems to me that the change was made with
> the idea that we can maintain the affinity for the IRQ while shutting it
> down, as no interrupts should occur.
>
> However, I don't see why we can't instead keep the IRQ up, set the affinity
> to all online CPUs in the offline path, and restore the original affinity
> in the online path. The reason we set the queue affinity to specific CPUs
> is for performance, but I would not say that this matters for handling
> residual IRQs.
Oh yes it does. The problem is especially on x86, that if you have a large
number of queues and you take a large number of CPUs offline, then you run
into vector space exhaustion on the remaining online CPUs.
In the worst case a single CPU on x86 has only 186 vectors available for
device interrupts. So just take a quad socket machine with 144 CPUs and two
multiqueue devices with a queue per cpu. ---> FAIL
It probably fails already with one device because there are lots of other
devices which have regular interrupts which cannot be shut down.
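Spelling out the arithmetic for that example, using only the numbers above:

    2 devices * 144 queues (one per CPU)  = 288 managed vectors
    288 vectors on the last remaining CPU > ~186 usable device vectors

so migrating the managed interrupts instead of shutting them down cannot work
once enough CPUs are offline, even before the regular, non-managed interrupts
are counted.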
Thanks,
tglx
On 29/01/2019 15:44, Keith Busch wrote:
> On Tue, Jan 29, 2019 at 03:25:48AM -0800, John Garry wrote:
>> Hi,
>>
>> I have a question on $subject which I hope you can shed some light on.
>>
>> According to commit c5cb83bb337c25 ("genirq/cpuhotplug: Handle managed
>> IRQs on CPU hotplug"), if we offline the last CPU in a managed IRQ
>> affinity mask, the IRQ is shutdown.
>>
>> The reasoning is that this IRQ is thought to be associated with a
>> specific queue on a MQ device, and the CPUs in the IRQ affinity mask are
>> the same CPUs associated with the queue. So, if no CPU is using the
>> queue, then no need for the IRQ.
>>
>> However, how does this handle the scenario of the last CPU in the IRQ
>> affinity mask being offlined while IO associated with the queue is still
>> in flight?
>>
>> Or if we make the decision to use the queue associated with the current
>> CPU, and then that CPU (being the last CPU online in the queue's IRQ
>> affinity mask) goes offline and we finish the delivery with another CPU?
>>
>> In these cases, when the IO completes, it would not be serviced and would
>> time out.
>>
>> I have actually tried this on my arm64 system and I see IO timeouts.
>
> Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
> which would reap all outstanding commands before the CPU and IRQ are
> taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
> Create hctx for each present CPU"). It sounds like we should bring
> something like that back, but make it more fine-grained to the per-cpu context.
>
Seems reasonable. But we would need it to deal with drivers where they
only expose a single queue to BLK MQ, but use many queues internally. I
think megaraid sas does this, for example.
I would also be slightly concerned with commands being issued from the
driver unknown to blk mq, like SCSI TMF.
Thanks,
John
> .
>
On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
> On 29/01/2019 15:44, Keith Busch wrote:
> >
> > Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
> > which would reap all outstanding commands before the CPU and IRQ are
> > taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
> > Create hctx for each present CPU"). It sounds like we should bring
> > something like that back, but make it more fine-grained to the per-cpu context.
> >
>
> Seems reasonable. But we would need it to deal with drivers where they only
> expose a single queue to BLK MQ, but use many queues internally. I think
> megaraid sas does this, for example.
>
> I would also be slightly concerned with commands being issued from the
> driver unknown to blk mq, like SCSI TMF.
I don't think either of those descriptions sound like good candidates
for using managed IRQ affinities.
On 29/01/2019 16:27, Thomas Gleixner wrote:
> On Tue, 29 Jan 2019, John Garry wrote:
>> On 29/01/2019 12:01, Thomas Gleixner wrote:
>>> If the last CPU which is associated to a queue (and the corresponding
>>> interrupt) goes offline, then the subsystem/driver code has to make sure
>>> that:
>>>
>>> 1) No more requests can be queued on that queue
>>>
>>> 2) All outstanding requests of that queue have been completed or redirected
>>> (don't know if that's possible at all) to some other queue.
>>
>> This may not be possible. For the HW I deal with, we have symmetrical delivery
>> and completion queues, and a command delivered on DQx will always complete on
>> CQx. Each completion queue has a dedicated IRQ.
>
> So you can stop queueing on DQx and wait for all outstanding ones to come
> in on CQx, right?
Right, and this sounds like what Keith Busch mentioned in his reply.
>
>>> That has to be done in that order obviously. Whether any of the
>>> subsystems/drivers actually implements this, I can't tell.
>>
>> Going back to c5cb83bb337c25, it seems to me that the change was made with
>> the idea that we can maintain the affinity for the IRQ while shutting it
>> down, as no interrupts should occur.
>>
>> However, I don't see why we can't instead keep the IRQ up, set the affinity
>> to all online CPUs in the offline path, and restore the original affinity
>> in the online path. The reason we set the queue affinity to specific CPUs
>> is for performance, but I would not say that this matters for handling
>> residual IRQs.
>
> Oh yes it does. The problem is especially on x86, that if you have a large
> number of queues and you take a large number of CPUs offline, then you run
> into vector space exhaustion on the remaining online CPUs.
>
> In the worst case a single CPU on x86 has only 186 vectors available for
> device interrupts. So just take a quad socket machine with 144 CPUs and two
> multiqueue devices with a queue per cpu. ---> FAIL
>
> It probably fails already with one device because there are lots of other
> devices which have regular interrupts which cannot be shut down.
OK, understood.
Thanks,
John
>
> Thanks,
>
> tglx
>
>
> .
>
On 29/01/2019 17:20, Keith Busch wrote:
> On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
>> On 29/01/2019 15:44, Keith Busch wrote:
>>>
>>> Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
>>> which would reap all outstanding commands before the CPU and IRQ are
>>> taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
>>> Create hctx for each present CPU"). It sounds like we should bring
>>> something like that back, but make it more fine-grained to the per-cpu context.
>>>
>>
>> Seems reasonable. But we would need it to deal with drivers where they only
>> expose a single queue to BLK MQ, but use many queues internally. I think
>> megaraid sas does this, for example.
>>
>> I would also be slightly concerned with commands being issued from the
>> driver unknown to blk mq, like SCSI TMF.
>
> I don't think either of those descriptions sound like good candidates
> for using managed IRQ affinities.
I wouldn't say that this behaviour is obvious to the developer. I can't
see anything in Documentation/PCI/MSI-HOWTO.txt
It also seems that this policy of relying on the upper layer to
flush+freeze queues would cause issues if managed IRQs are used by drivers
in other subsystems. Network controllers may have multiple queues and
unsolicited interrupts.
Thanks,
John
>
> .
>
On Wed, 30 Jan 2019, John Garry wrote:
> On 29/01/2019 17:20, Keith Busch wrote:
> > On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
> > > On 29/01/2019 15:44, Keith Busch wrote:
> > > >
> > > > Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
> > > > which would reap all outstanding commands before the CPU and IRQ are
> > > > taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
> > > > Create hctx for each present CPU"). It sounds like we should bring
> > > > something like that back, but make it more fine-grained to the per-cpu
> > > > context.
> > > >
> > >
> > > Seems reasonable. But we would need it to deal with drivers where they
> > > only
> > > expose a single queue to BLK MQ, but use many queues internally. I think
> > > megaraid sas does this, for example.
> > >
> > > I would also be slightly concerned with commands being issued from the
> > > driver unknown to blk mq, like SCSI TMF.
> >
> > I don't think either of those descriptions sound like good candidates
> > for using managed IRQ affinities.
>
> I wouldn't say that this behaviour is obvious to the developer. I can't see
> anything in Documentation/PCI/MSI-HOWTO.txt
>
> It also seems that this policy of relying on the upper layer to flush+freeze
> queues would cause issues if managed IRQs are used by drivers in other
> subsystems. Network controllers may have multiple queues and unsolicited
> interrupts.
It doesn't matter which part is managing flush/freeze of queues as long
as something (either common subsystem code, upper layers or the driver
itself) does it.
So for the megaraid SAS example the BLK MQ layer obviously can't do
anything because it only sees a single request queue. But the driver could,
if the hardware supports it, tell the device to stop queueing
completions on the completion queue which is associated with a particular
CPU (or set of CPUs) during offline and then wait for the in-flight stuff
to be finished. If the hardware does not allow that, then managed
interrupts can't work for it.
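As an illustration of that per-completion-queue quiesce, a sketch with an
entirely invented hardware interface: the MYDRV_CQ_DISABLE bit, the cq_ctrl
register, the in_flight counter and the drain_wq wait queue are all made up,
and only the kernel primitives are real.

#include <linux/io.h>
#include <linux/wait.h>
#include <linux/atomic.h>

#define MYDRV_CQ_DISABLE	0x1	/* invented register bit */

/* Hypothetical per-CQ context; real drivers have their own layout. */
struct mydrv_cq {
	void __iomem		*cq_ctrl;	/* invented per-CQ control register */
	atomic_t		in_flight;	/* commands outstanding on this DQ/CQ pair */
	wait_queue_head_t	drain_wq;	/* woken by the completion handler */
};

static void mydrv_quiesce_cq(struct mydrv_cq *cq)
{
	/* Tell the device to stop taking new work for this DQ/CQ pair. */
	writel(MYDRV_CQ_DISABLE, cq->cq_ctrl);

	/*
	 * Wait for everything already in flight to come back on this CQ;
	 * the completion handler decrements in_flight and wakes drain_wq.
	 */
	wait_event(cq->drain_wq, atomic_read(&cq->in_flight) == 0);
}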
Thanks,
tglx
On 30/01/2019 12:43, Thomas Gleixner wrote:
> On Wed, 30 Jan 2019, John Garry wrote:
>> On 29/01/2019 17:20, Keith Busch wrote:
>>> On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
>>>> On 29/01/2019 15:44, Keith Busch wrote:
>>>>>
>>>>> Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
>>>>> which would reap all outstanding commands before the CPU and IRQ are
>>>>> taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
>>>>> Create hctx for each present CPU"). It sounds like we should bring
>>>>> something like that back, but make it more fine-grained to the per-cpu
>>>>> context.
>>>>>
>>>>
>>>> Seems reasonable. But we would need it to deal with drivers where they
>>>> only
>>>> expose a single queue to BLK MQ, but use many queues internally. I think
>>>> megaraid sas does this, for example.
>>>>
>>>> I would also be slightly concerned with commands being issued from the
>>>> driver unknown to blk mq, like SCSI TMF.
>>>
>>> I don't think either of those descriptions sound like good candidates
>>> for using managed IRQ affinities.
>>
>> I wouldn't say that this behaviour is obvious to the developer. I can't see
>> anything in Documentation/PCI/MSI-HOWTO.txt
>>
>> It also seems that this policy of relying on the upper layer to flush+freeze
>> queues would cause issues if managed IRQs are used by drivers in other
>> subsystems. Network controllers may have multiple queues and unsolicited
>> interrupts.
>
> It doesn't matter which part is managing flush/freeze of queues as long
> as something (either common subsystem code, upper layers or the driver
> itself) does it.
>
> So for the megaraid SAS example the BLK MQ layer obviously can't do
> anything because it only sees a single request queue. But the driver could,
> if the hardware supports it, tell the device to stop queueing
> completions on the completion queue which is associated with a particular
> CPU (or set of CPUs) during offline and then wait for the in-flight stuff
> to be finished. If the hardware does not allow that, then managed
> interrupts can't work for it.
>
A rough audit of current SCSI drivers shows that these set
PCI_IRQ_AFFINITY in some path but don't set Scsi_host.nr_hw_queues at all:
aacraid, be2iscsi, csiostor, megaraid, mpt3sas
I don't know specific driver details, like changing completion queue.
Thanks,
John
> Thanks,
>
> tglx
>
> .
>
On 1/31/19 6:48 PM, John Garry wrote:
> On 30/01/2019 12:43, Thomas Gleixner wrote:
>> On Wed, 30 Jan 2019, John Garry wrote:
>>> On 29/01/2019 17:20, Keith Busch wrote:
>>>> On Tue, Jan 29, 2019 at 05:12:40PM +0000, John Garry wrote:
>>>>> On 29/01/2019 15:44, Keith Busch wrote:
>>>>>>
>>>>>> Hm, we used to freeze the queues with CPUHP_BLK_MQ_PREPARE callback,
>>>>>> which would reap all outstanding commands before the CPU and IRQ are
>>>>>> taken offline. That was removed with commit 4b855ad37194f ("blk-mq:
>>>>>> Create hctx for each present CPU"). It sounds like we should bring
>>>>>> something like that back, but make it more fine-grained to the per-cpu
>>>>>> context.
>>>>>>
>>>>>
>>>>> Seems reasonable. But we would need it to deal with drivers where they
>>>>> only
>>>>> expose a single queue to BLK MQ, but use many queues internally. I
>>>>> think
>>>>> megaraid sas does this, for example.
>>>>>
>>>>> I would also be slightly concerned with commands being issued from the
>>>>> driver unknown to blk mq, like SCSI TMF.
>>>>
>>>> I don't think either of those descriptions sound like good candidates
>>>> for using managed IRQ affinities.
>>>
>>> I wouldn't say that this behaviour is obvious to the developer. I
>>> can't see
>>> anything in Documentation/PCI/MSI-HOWTO.txt
>>>
>>> It also seems that this policy of relying on the upper layer to
>>> flush+freeze queues would cause issues if managed IRQs are used by
>>> drivers in other subsystems. Network controllers may have multiple
>>> queues and unsolicited interrupts.
>>
>> It doesn't matter which part is managing flush/freeze of queues as long
>> as something (either common subsystem code, upper layers or the driver
>> itself) does it.
>>
>> So for the megaraid SAS example the BLK MQ layer obviously can't do
>> anything because it only sees a single request queue. But the driver
>> could,
>> if the hardware supports it, tell the device to stop queueing
>> completions on the completion queue which is associated with a particular
>> CPU (or set of CPUs) during offline and then wait for the in-flight stuff
>> to be finished. If the hardware does not allow that, then managed
>> interrupts can't work for it.
>>
>
> A rough audit of current SCSI drivers shows that these set
> PCI_IRQ_AFFINITY in some path but don't set Scsi_host.nr_hw_queues at all:
> aacraid, be2iscsi, csiostor, megaraid, mpt3sas
>
Megaraid and mpt3sas don't have that functionality (or, at least, not
that I'm aware of).
And in general I'm not sure if the above approach is feasible.
Thing is, if we have _managed_ CPU hotplug (ie if the hardware provides
some means of quiescing the CPU before hotplug) then the whole thing is
trivial; disable SQ and wait for all outstanding commands to complete.
Then trivially all requests are completed and the issue is resolved.
Even with today's infrastructure.
And I'm not sure if we can handle surprise CPU hotplug at all, given all
the possible race conditions.
But then I might be wrong.
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On Fri, 1 Feb 2019, Hannes Reinecke wrote:
> Thing is, if we have _managed_ CPU hotplug (ie if the hardware provides some
> means of quiescing the CPU before hotplug) then the whole thing is trivial;
> disable SQ and wait for all outstanding commands to complete.
> Then trivially all requests are completed and the issue is resolved.
> Even with today's infrastructure.
>
> And I'm not sure if we can handle surprise CPU hotplug at all, given all the
> possible race conditions.
> But then I might be wrong.
The kernel would completely fall apart when a CPU would vanish by surprise,
i.e. uncontrolled by the kernel. Then the SCSI driver exploding would be
the least of our problems.
Thanks,
tglx
On 2/1/19 10:57 PM, Thomas Gleixner wrote:
> On Fri, 1 Feb 2019, Hannes Reinecke wrote:
>> Thing is, if we have _managed_ CPU hotplug (ie if the hardware provides some
>> means of quiescing the CPU before hotplug) then the whole thing is trivial;
>> disable SQ and wait for all outstanding commands to complete.
>> Then trivially all requests are completed and the issue is resolved.
>> Even with today's infrastructure.
>>
>> And I'm not sure if we can handle surprise CPU hotplug at all, given all the
>> possible race conditions.
>> But then I might be wrong.
>
> The kernel would completely fall apart when a CPU would vanish by surprise,
> i.e. uncontrolled by the kernel. Then the SCSI driver exploding would be
> the least of our problems.
>
Hehe. As I thought.
So, as the user then has to wait for the system to declare 'ready for
CPU remove', why can't we just disable the SQ and wait for all I/O to
complete?
We can make it more fine-grained by just waiting on all outstanding I/O
on that SQ to complete, but waiting for all I/O should be good as an
initial try.
With that we wouldn't need to fiddle with driver internals, and could
make it pretty generic.
And we could always add more detailed logic if the driver has the means
for doing so.
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 04/02/2019 07:12, Hannes Reinecke wrote:
> On 2/1/19 10:57 PM, Thomas Gleixner wrote:
>> On Fri, 1 Feb 2019, Hannes Reinecke wrote:
>>> Thing is, if we have _managed_ CPU hotplug (ie if the hardware
>>> provides some
>>> means of quiescing the CPU before hotplug) then the whole thing is
>>> trivial;
>>> disable SQ and wait for all outstanding commands to complete.
>>> Then trivially all requests are completed and the issue is resolved.
>>> Even with today's infrastructure.
>>>
>>> And I'm not sure if we can handle surprise CPU hotplug at all, given
>>> all the
>>> possible race conditions.
>>> But then I might be wrong.
>>
>> The kernel would completely fall apart when a CPU would vanish by
>> surprise,
>> i.e. uncontrolled by the kernel. Then the SCSI driver exploding would be
>> the least of our problems.
>>
> Hehe. As I thought.
Hi Hannes,
>
> So, as the user then has to wait for the system to declare 'ready for
> CPU remove', why can't we just disable the SQ and wait for all I/O to
> complete?
> We can make it more fine-grained by just waiting on all outstanding I/O
> on that SQ to complete, but waiting for all I/O should be good as an
> initial try.
> With that we wouldn't need to fiddle with driver internals, and could
> make it pretty generic.
I don't fully understand this idea - specifically, at which layer would
we be waiting for all the IO to complete?
> And we could always add more detailed logic if the driver has the means
> for doing so.
>
Thanks,
John
> Cheers,
>
> Hannes
On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
> On 04/02/2019 07:12, Hannes Reinecke wrote:
>
> Hi Hannes,
>
> >
> > So, as the user then has to wait for the system to declare 'ready for
> > CPU remove', why can't we just disable the SQ and wait for all I/O to
> > complete?
> > We can make it more fine-grained by just waiting on all outstanding I/O
> > on that SQ to complete, but waiting for all I/O should be good as an
> > initial try.
> > With that we wouldn't need to fiddle with driver internals, and could
> > make it pretty generic.
>
> I don't fully understand this idea - specifically, at which layer would
> we be waiting for all the IO to complete?
Whichever layer dispatched the IO to a CPU specific context should
be the one to wait for its completion. That should be blk-mq for most
block drivers.
On 2/5/19 3:52 PM, Keith Busch wrote:
> On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
>> On 04/02/2019 07:12, Hannes Reinecke wrote:
>>
>> Hi Hannes,
>>
>>>
>>> So, as the user then has to wait for the system to declare 'ready for
>>> CPU remove', why can't we just disable the SQ and wait for all I/O to
>>> complete?
>>> We can make it more fine-grained by just waiting on all outstanding I/O
>>> on that SQ to complete, but waiting for all I/O should be good as an
>>> initial try.
>>> With that we wouldn't need to fiddle with driver internals, and could
>>> make it pretty generic.
>>
>> I don't fully understand this idea - specifically, at which layer would
>> we be waiting for all the IO to complete?
>
> Whichever layer dispatched the IO to a CPU specific context should
> be the one to wait for its completion. That should be blk-mq for most
> block drivers.
>
Indeed.
But we don't provide any mechanisms for that ATM, right?
Maybe this would be a topic fit for LSF/MM?
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On 05/02/2019 14:52, Keith Busch wrote:
> On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
>> On 04/02/2019 07:12, Hannes Reinecke wrote:
>>
>> Hi Hannes,
>>
>>>
>>> So, as the user then has to wait for the system to declare 'ready for
>>> CPU remove', why can't we just disable the SQ and wait for all I/O to
>>> complete?
>>> We can make it more fine-grained by just waiting on all outstanding I/O
>>> on that SQ to complete, but waiting for all I/O should be good as an
>>> initial try.
>>> With that we wouldn't need to fiddle with driver internals, and could
>>> make it pretty generic.
>>
>> I don't fully understand this idea - specifically, at which layer would
>> we be waiting for all the IO to complete?
>
> Whichever layer dispatched the IO to a CPU specific context should
> be the one to wait for its completion. That should be blk-mq for most
> block drivers.
For SCSI devices, unfortunately not all IO sent to the HW originates
from blk-mq or any other single entity.
Thanks,
John
>
> .
>
On Tue, Feb 05, 2019 at 03:09:28PM +0000, John Garry wrote:
> On 05/02/2019 14:52, Keith Busch wrote:
> > On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
> > > On 04/02/2019 07:12, Hannes Reinecke wrote:
> > >
> > > Hi Hannes,
> > >
> > > >
> > > > So, as the user then has to wait for the system to declare 'ready for
> > > > CPU remove', why can't we just disable the SQ and wait for all I/O to
> > > > complete?
> > > > We can make it more fine-grained by just waiting on all outstanding I/O
> > > > on that SQ to complete, but waiting for all I/O should be good as an
> > > > initial try.
> > > > With that we wouldn't need to fiddle with driver internals, and could
> > > > make it pretty generic.
> > >
> > > I don't fully understand this idea - specifically, at which layer would
> > > we be waiting for all the IO to complete?
> >
> > Whichever layer dispatched the IO to a CPU specific context should
> > be the one to wait for its completion. That should be blk-mq for most
> > block drivers.
>
> For SCSI devices, unfortunately not all IO sent to the HW originates from
> blk-mq or any other single entity.
Then they'll need to register their own CPU notifiers and handle the
ones they dispatched.
On 2/5/19 4:09 PM, John Garry wrote:
> On 05/02/2019 14:52, Keith Busch wrote:
>> On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
>>> On 04/02/2019 07:12, Hannes Reinecke wrote:
>>>
>>> Hi Hannes,
>>>
>>>>
>>>> So, as the user then has to wait for the system to declare 'ready for
>>>> CPU remove', why can't we just disable the SQ and wait for all I/O to
>>>> complete?
>>>> We can make it more fine-grained by just waiting on all outstanding I/O
>>>> on that SQ to complete, but waiting for all I/O should be good as an
>>>> initial try.
>>>> With that we wouldn't need to fiddle with driver internals, and could
>>>> make it pretty generic.
>>>
>>> I don't fully understand this idea - specifically, at which layer would
>>> we be waiting for all the IO to complete?
>>
>> Whichever layer dispatched the IO to a CPU specific context should
>> be the one to wait for its completion. That should be blk-mq for most
>> block drivers.
>
> For SCSI devices, unfortunately not all IO sent to the HW originates
> from blk-mq or any other single entity.
>
No, not as such.
But each IO sent to the HW requires a unique identification (ie a valid
tag). And as the tagspace is managed by block-mq (minus management
commands, but I'm working on that currently) we can easily figure out if
the device is busy by checking for an empty tag map.
Should be doable for most modern HBAs.
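A sketch of what that tag-map check could look like on top of the existing
tag iterator. The mydrv_* names are made up, the busy_tag_iter_fn prototype
has changed between kernel versions (the three-argument form is assumed
here), and management commands outside the blk-mq tag space would still need
separate accounting, as noted above.

#include <linux/blk-mq.h>

/* Sketch only: count in-flight requests by walking the driver's tag set. */
static bool mydrv_count_busy(struct request *rq, void *data, bool reserved)
{
	unsigned int *busy = data;

	(*busy)++;
	return true;	/* keep iterating */
}

static bool mydrv_tags_idle(struct blk_mq_tag_set *set)
{
	unsigned int busy = 0;

	blk_mq_tagset_busy_iter(set, mydrv_count_busy, &busy);
	return busy == 0;	/* empty tag map: no blk-mq owned IO in flight */
}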
Cheers,
Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
[email protected] +49 911 74053 688
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: F. Imendörffer, J. Smithard, J. Guild, D. Upmanyu, G. Norton
HRB 21284 (AG Nürnberg)
On Tue, Feb 05, 2019 at 04:10:47PM +0100, Hannes Reinecke wrote:
> On 2/5/19 3:52 PM, Keith Busch wrote:
> > Whichever layer dispatched the IO to a CPU specific context should
> > be the one to wait for its completion. That should be blk-mq for most
> > block drivers.
> >
> Indeed.
> But we don't provide any mechanisms for that ATM, right?
>
> Maybe this would be a topic fit for LSF/MM?
Right, there's nothing handling this now, and sounds like it'd be a good
discussion to bring to the storage track.
On 05/02/2019 15:15, Hannes Reinecke wrote:
> On 2/5/19 4:09 PM, John Garry wrote:
>> On 05/02/2019 14:52, Keith Busch wrote:
>>> On Tue, Feb 05, 2019 at 05:24:11AM -0800, John Garry wrote:
>>>> On 04/02/2019 07:12, Hannes Reinecke wrote:
>>>>
>>>> Hi Hannes,
>>>>
>>>>>
>>>>> So, as the user then has to wait for the system to declare 'ready for
>>>>> CPU remove', why can't we just disable the SQ and wait for all I/O to
>>>>> complete?
>>>>> We can make it more fine-grained by just waiting on all outstanding
>>>>> I/O
>>>>> on that SQ to complete, but waiting for all I/O should be good as an
>>>>> initial try.
>>>>> With that we wouldn't need to fiddle with driver internals, and could
>>>>> make it pretty generic.
>>>>
>>>> I don't fully understand this idea - specifically, at which layer would
>>>> we be waiting for all the IO to complete?
>>>
>>> Whichever layer dispatched the IO to a CPU specific context should
>>> be the one to wait for its completion. That should be blk-mq for most
>>> block drivers.
>>
>> For SCSI devices, unfortunately not all IO sent to the HW originates
>> from blk-mq or any other single entity.
>>
> No, not as such.
> But each IO sent to the HW requires a unique identification (ie a valid
> tag). And as the tagspace is managed by block-mq (minus management
> commands, but I'm working on that currently) we can easily figure out if
> the device is busy by checking for an empty tag map.
That sounds like a reasonable starting solution.
Thanks,
John
>
> Should be doable for most modern HBAs.
>
> Cheers,
>
> Hannes
On Tue, Feb 05, 2019 at 03:09:28PM +0000, John Garry wrote:
> For SCSI devices, unfortunately not all IO sent to the HW originates from
> blk-mq or any other single entity.
Where else would SCSI I/O originate from?
On 05/02/2019 18:23, Christoph Hellwig wrote:
> On Tue, Feb 05, 2019 at 03:09:28PM +0000, John Garry wrote:
>> For SCSI devices, unfortunately not all IO sent to the HW originates from
>> blk-mq or any other single entity.
>
> Where else would SCSI I/O originate from?
Please note that I was referring to other management IO, like SAS SMP,
TMFs, and other proprietary commands which the driver may generate for
the HBA - https://marc.info/?l=linux-scsi&m=154831889001973&w=2
discusses some of them also.
Thanks,
John
>
> .
>
On Wed, Feb 06, 2019 at 09:21:40AM +0000, John Garry wrote:
> On 05/02/2019 18:23, Christoph Hellwig wrote:
> > On Tue, Feb 05, 2019 at 03:09:28PM +0000, John Garry wrote:
> > > For SCSI devices, unfortunately not all IO sent to the HW originates from
> > > blk-mq or any other single entity.
> >
> > Where else would SCSI I/O originate from?
>
> Please note that I was referring to other management IO, like SAS SMP, TMFs,
> and other proprietary commands which the driver may generate for the HBA -
> https://marc.info/?l=linux-scsi&m=154831889001973&w=2 discusses some of them
> also.
>
Especially the TMFs sent via SCSI EH are a bit of a pain I guess,
because they are entirely managed by the device drivers, but depending
on the device driver they might not even qualify for the problem Hannes
is seeing.
--
With Best Regards, Benjamin Block / Linux on IBM Z Kernel Development
IBM Systems & Technology Group / IBM Deutschland Research & Development GmbH
Vorsitz. AufsR.: Matthias Hartmann / Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: AmtsG Stuttgart, HRB 243294