2022-11-15 04:02:45

by Angus Chen

Subject: IRQ affinity problem from virtio_blk

Hi all,
I tested Linux 6.1 and found that virtio_blk requests its interrupts with managed affinity (IRQD_AFFINITY_MANAGED).
The machine has 80 CPUs across two NUMA nodes.

Before probing one virtio_blk device:
crash_cts> p *vector_matrix
$44 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 15354,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 411,
online_maps = 80,
maps = 0x46100,
scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
After probing one virtio_blk device:
crash_cts> p *vector_matrix
$45 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 15273,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 413,
online_maps = 80,
maps = 0x46100,
scratch_map = {25769803776, 0, 0, 14680064},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

We can see that global_available drops from 15354 to 15273, a difference of 81,
while total_allocated increases from 411 to 413: one config irq and one vq irq.

It is easy to exhaust the irq resources, because there can be more than 512 virtio_blk devices.
I read the irq matrix code: with IRQD_AFFINITY_MANAGED set, this reservation is intentional, a kind of feature.

If we do exhaust the irq vectors, per_vq_vectors allocation breaks, so virtblk_map_queues will
eventually fall back to blk_mq_map_queues.
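
For illustration, here is a minimal sketch of that fallback pattern, modeled loosely on the
blk_mq_virtio_map_queues() helper that virtblk_map_queues() uses; the function name
my_map_queues() and the exact structure are illustrative, not the kernel's actual code:

#include <linux/blk-mq.h>
#include <linux/virtio_config.h>

/* Sketch only: map CPUs to queues, falling back to the generic spread. */
static void my_map_queues(struct blk_mq_queue_map *qmap,
			  struct virtio_device *vdev, int first_vec)
{
	unsigned int queue, cpu;

	for (queue = 0; queue < qmap->nr_queues; queue++) {
		const struct cpumask *mask;

		/* Only present when the vq really got its own vector */
		mask = vdev->config->get_vq_affinity ?
			vdev->config->get_vq_affinity(vdev, first_vec + queue) :
			NULL;
		if (!mask) {
			/* No per-vq vector: generic CPU-to-queue mapping */
			blk_mq_map_queues(qmap);
			return;
		}
		/* Map every CPU of this vector's affinity mask to the queue */
		for_each_cpu(cpu, mask)
			qmap->mq_map[cpu] = qmap->queue_offset + queue;
	}
}

Roughly: when per-vq vectors cannot be allocated, get_vq_affinity() reports no mask for the
queue and the generic blk_mq_map_queues() spread is used instead.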

Even if we do not exhaust the vectors completely, but merely use more vector bits on one CPU than on the others,
allocation with IRQD_AFFINITY_MANAGED can still fail, because the managed reservation has to be satisfied on every CPU of the mask and the usage is not balanced.

I'm not a native English speaker; any suggestions will be appreciated.


2022-11-15 22:49:19

by Michael S. Tsirkin

Subject: Re: IRQ affinity problem from virtio_blk

Thanks Thomas, I have a question:

On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> > Before probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $44 = {
> > matrix_bits = 256,
> > alloc_start = 32,
> > alloc_end = 236,
> > alloc_size = 204,
> > global_available = 15354,
> > global_reserved = 154,
> > systembits_inalloc = 3,
> > total_allocated = 411,
> > online_maps = 80,
> > maps = 0x46100,
> > scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
> > system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> > After probe one virtio_blk.
> > crash_cts> p *vector_matrix
> > $45 = {
> > matrix_bits = 256,
> > alloc_start = 32,
> > alloc_end = 236,
> > alloc_size = 204,
> > global_available = 15273,
> > global_reserved = 154,
> > systembits_inalloc = 3,
> > total_allocated = 413,
> > online_maps = 80,
> > maps = 0x46100,
> > scratch_map = {25769803776, 0, 0, 14680064},
> > system_map = {1125904739729407, 0, 1, 18435221191850459136}
> > }
> >
> > We can see global_available drop from 15354 to 15273, is 81.
> > And the total_allocated increase from 411 to 413. One config irq,and
> > one vq irq.
>
> Right. That's perfectly fine. At the point where you looking at it, the
> matrix allocator has given out 2 vectors as can be seen via
> total_allocated.
>
> But then it also has another 79 vectors put aside for the other queues,


What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?



> but those queues have not yet requested the interrupts so there is no
> allocation yet. But the vectors are guaranteed to be available when
> request_irq() for those queues runs, which does the actual allocation.
>
> Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
> content of /sys/kernel/debug/irq/domain/VECTOR which gives you a very
> clear picture of what's going on. No need for gdb.
>
> > It is easy to expend the irq resource ,because virtio_blk device could
> > be more than 512.
>
> How so? virtio_blk allocates a config interrupt and one queue interrupt
> per CPU. So in your case a total of 81.
>
> How would you exhaust the vector space? Each CPU has about ~200 (in your
> case exactly 204) vectors which can be handed out to devices. You'd need
> to instantiate about 200 virtio_blk devices to get to the point of
> vector exhaustion.
>
> So what are you actually worried about and which problem are you trying
> to solve?
>
> Thanks,
>
> tglx
>


2022-11-15 23:15:05

by Thomas Gleixner

Subject: Re: IRQ affinity problem from virtio_blk

On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>> > We can see global_available drop from 15354 to 15273, is 81.
>> > And the total_allocated increase from 411 to 413. One config irq,and
>> > one vq irq.
>>
>> Right. That's perfectly fine. At the point where you looking at it, the
>> matrix allocator has given out 2 vectors as can be seen via
>> total_allocated.
>>
>> But then it also has another 79 vectors put aside for the other queues,
>
> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?

init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()

init_vq() hands in a struct irq_affinity which means that
pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
for config and one per queue if vp_request_msix_vectors() is invoked
with per_vq_vectors == true, which is what the first invocation in
vp_find_vqs() does.
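
As a rough sketch of that pattern (illustrative only, with made-up names such as
setup_vectors(); this is not the virtio_pci code): pre_vectors = 1 keeps vector 0 (config)
out of the affinity spreading, PCI_IRQ_AFFINITY makes the queue vectors managed, and the
later request_irq() is the point where a reserved vector is actually allocated.

#include <linux/pci.h>
#include <linux/interrupt.h>

static irqreturn_t config_interrupt(int irq, void *data)
{
	return IRQ_HANDLED;
}

static irqreturn_t vq_interrupt(int irq, void *data)
{
	return IRQ_HANDLED;
}

static int setup_vectors(struct pci_dev *pdev, unsigned int nr_queues,
			 void *drvdata)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* vector 0: config, not spread/managed */
	};
	int nvecs, i, err;

	nvecs = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					       PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					       &affd);
	if (nvecs < 0)
		return nvecs;		/* e.g. -ENOSPC once the space is gone */

	/* Config vector: an ordinary, non-managed interrupt */
	err = request_irq(pci_irq_vector(pdev, 0), config_interrupt, 0,
			  "cfg", drvdata);
	if (err)
		return err;

	/*
	 * Queue vectors: the managed reservation made above is turned
	 * into a real per-CPU vector allocation here, in request_irq().
	 */
	for (i = 1; i < nvecs; i++) {
		err = request_irq(pci_irq_vector(pdev, i), vq_interrupt, 0,
				  "vq", drvdata);
		if (err)
			return err;
	}
	return 0;
}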

Thanks,

tglx

2022-11-15 23:30:30

by Thomas Gleixner

Subject: Re: IRQ affinity problem from virtio_blk

On Tue, Nov 15 2022 at 03:40, Angus Chen wrote:
> Before probe one virtio_blk.
> crash_cts> p *vector_matrix
> $44 = {
> matrix_bits = 256,
> alloc_start = 32,
> alloc_end = 236,
> alloc_size = 204,
> global_available = 15354,
> global_reserved = 154,
> systembits_inalloc = 3,
> total_allocated = 411,
> online_maps = 80,
> maps = 0x46100,
> scratch_map = {1160908723191807, 0, 1, 18435222497520517120},
> system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
> After probe one virtio_blk.
> crash_cts> p *vector_matrix
> $45 = {
> matrix_bits = 256,
> alloc_start = 32,
> alloc_end = 236,
> alloc_size = 204,
> global_available = 15273,
> global_reserved = 154,
> systembits_inalloc = 3,
> total_allocated = 413,
> online_maps = 80,
> maps = 0x46100,
> scratch_map = {25769803776, 0, 0, 14680064},
> system_map = {1125904739729407, 0, 1, 18435221191850459136}
> }
>
> We can see global_available drop from 15354 to 15273, is 81.
> And the total_allocated increase from 411 to 413. One config irq,and
> one vq irq.

Right. That's perfectly fine. At the point where you are looking at it, the
matrix allocator has given out 2 vectors, as can be seen via
total_allocated.

But then it also has another 79 vectors put aside for the other queues,
but those queues have not yet requested the interrupts so there is no
allocation yet. But the vectors are guaranteed to be available when
request_irq() for those queues runs, which does the actual allocation.

Btw, you can enable CONFIG_GENERIC_IRQ_DEBUGFS and then look at the
content of /sys/kernel/debug/irq/domains/VECTOR, which gives you a very
clear picture of what's going on. No need for gdb.

> It is easy to expend the irq resource ,because virtio_blk device could
> be more than 512.

How so? virtio_blk allocates a config interrupt and one queue interrupt
per CPU. So in your case a total of 81.

How would you exhaust the vector space? Each CPU has about ~200 (in your
case exactly 204) vectors which can be handed out to devices. You'd need
to instantiate about 200 virtio_blk devices to get to the point of
vector exhaustion.

So what are you actually worried about and which problem are you trying
to solve?

Thanks,

tglx



2022-11-15 23:40:25

by Thomas Gleixner

Subject: Re: IRQ affinity problem from virtio_blk

On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:

> On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
>> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
>>> > We can see global_available drop from 15354 to 15273, is 81.
>>> > And the total_allocated increase from 411 to 413. One config irq,and
>>> > one vq irq.
>>>
>>> Right. That's perfectly fine. At the point where you looking at it, the
>>> matrix allocator has given out 2 vectors as can be seen via
>>> total_allocated.
>>>
>>> But then it also has another 79 vectors put aside for the other queues,
>>
>> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
>
> init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
>
> init_vq() hands in a struct irq_affinity which means that
> pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> for config and one per queue if vp_request_msix_vectors() is invoked
> with per_vq_vectors == true, which is what the first invocation in
> vp_find_vqs() does.

I just checked on a random VM. The PCI device as advertised to the guest
does not expose that many vectors. One has 2 and the other 4.

But as the interrupts are requested 'managed' the core ends up setting
the vectors aside. That's a fundamental property of managed interrupts.

Assume you have less queues than CPUs, which is the case with 2 vectors
and tons of CPUs, i.e. one ends up for config and the other for the
actual queue. So the affinity spreading code will end up having the full
cpumask for the queue vector, which is marked managed. And managed means
that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
be migrated to a still online CPU.

So we end up setting 79 vectors aside (one per CPU) in the case that the
virtio device only provides two vectors.
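
A tiny illustrative snippet of that outcome (assuming a context where
irq_create_affinity_masks() can be called, e.g. built-in test code, since it may not be
exported to modules; show_spread() is a made-up name): with 2 vectors and pre_vectors = 1,
the single spread vector covers every CPU and is marked managed.

#include <linux/interrupt.h>
#include <linux/printk.h>
#include <linux/slab.h>

static void show_spread(void)
{
	struct irq_affinity affd = { .pre_vectors = 1 };
	struct irq_affinity_desc *masks;

	masks = irq_create_affinity_masks(2, &affd);
	if (!masks)
		return;

	/* masks[0] is the config vector: default affinity, not managed */
	pr_info("config: managed=%u mask=%*pbl\n",
		masks[0].is_managed, cpumask_pr_args(&masks[0].mask));
	/* masks[1] is the queue vector: managed, full cpumask */
	pr_info("queue : managed=%u mask=%*pbl\n",
		masks[1].is_managed, cpumask_pr_args(&masks[1].mask));

	kfree(masks);
}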

But that's not the end of the world as you really would need ~200 such
devices to exhaust the vector space...

Thanks,

tglx



2022-11-15 23:55:21

by Michael S. Tsirkin

Subject: Re: IRQ affinity problem from virtio_blk

On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>>
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>>
> >>> But then it also has another 79 vectors put aside for the other queues,
> >>
> >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
>
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
>
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
>
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
>
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
>
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...
>
> Thanks,
>
> tglx

Let's say we have 20 queues - then just 10 devices will exhaust the
vector space right?

--
MST


2022-11-16 01:20:14

by Angus Chen

Subject: RE: IRQ affinity problem from virtio_blk



> -----Original Message-----
> From: Michael S. Tsirkin <[email protected]>
> Sent: Wednesday, November 16, 2022 7:37 AM
> To: Thomas Gleixner <[email protected]>
> Cc: Angus Chen <[email protected]>; [email protected];
> Ming Lei <[email protected]>; Jason Wang <[email protected]>
> Subject: Re: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
> >
> > > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> > >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> > >>> > We can see global_available drop from 15354 to 15273, is 81.
> > >>> > And the total_allocated increase from 411 to 413. One config irq,and
> > >>> > one vq irq.
> > >>>
> > >>> Right. That's perfectly fine. At the point where you looking at it, the
> > >>> matrix allocator has given out 2 vectors as can be seen via
> > >>> total_allocated.
> > >>>
> > >>> But then it also has another 79 vectors put aside for the other queues,
> > >>
> > >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> > >
> > > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> > >
> > > init_vq() hands in a struct irq_affinity which means that
> > > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > > for config and one per queue if vp_request_msix_vectors() is invoked
> > > with per_vq_vectors == true, which is what the first invocation in
> > > vp_find_vqs() does.
> >
> > I just checked on a random VM. The PCI device as advertised to the guest
> > does not expose that many vectors. One has 2 and the other 4.
> >
> > But as the interrupts are requested 'managed' the core ends up setting
> > the vectors aside. That's a fundamental property of managed interrupts.
> >
> > Assume you have less queues than CPUs, which is the case with 2 vectors
> > and tons of CPUs, i.e. one ends up for config and the other for the
> > actual queue. So the affinity spreading code will end up having the full
> > cpumask for the queue vector, which is marked managed. And managed
> means
> > that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> > be migrated to a still online CPU.
> >
> > So we end up setting 79 vectors aside (one per CPU) in the case that the
> > virtio device only provides two vectors.
> >
> > But that's not the end of the world as you really would need ~200 such
> > devices to exhaust the vector space...

Here is the VECTOR domain information:
[root@localhost domains]# cat VECTOR
name: VECTOR
size: 0
mapped: 2015
flags: 0x00000003
Online bitmaps: 80
Global available: 0
Global reserved: 154
Total allocated: 1861
System: 39: 0-19,29,32,50,128,236,240-242,244,246-255
| CPU | avl | man | mac | act | vectors
0 0 180 2 23 33-46,48,110,132,162,185,206-207,228-229
1 0 180 2 23 33-37,41,44,124,134,156-157,167,180,186-187,198,225,228-233
2 0 180 2 23 33-40,123,154-155,164,177,186,202,221-224,227,232-233,235
3 0 180 2 23 33-36,70,123-124,140,156,168,174,197,199,201,207,225-228,232-235
4 0 180 2 23 33-39,101,122,133,147,207,217-221,227-228,231,233-235
5 0 180 2 23 33-38,83,115,156,165-166,177,207-209,220-222,228,231-234
6 0 180 2 23 33-38,55,91,146,154,160,164,187-188,209,217-218,221-222,232-235
7 0 180 2 23 33-37,81-82,113,145,154,186-188,207,221-224,226,229,232,234-235
8 0 180 2 23 33-37,81,91,148-149,189,198-199,201,210,217-218,222,229-232,234-235
9 0 180 2 23 33-38,59,133,146,157,165,174,196,205,207,220-221,225-226,232-235
10 0 180 2 23 33-36,87,133-134,142,174,188,198-199,206,214,217-220,228-230,234-235
11 0 180 2 23 33-35,83,94,113,127,129,157,187-188,209,219-224,229-230,233-235
12 0 180 2 23 33-34,36,55,113-114,129,158-159,168,175,189-190,197,208-209,219-220,227,232-235
13 0 180 2 23 33-34,37-38,83,94,156-158,186-187,207,221-222,225-227,230-235
14 0 180 2 23 33-35,43,70,101-102,170,175-177,215,217-218,220,226-230,232-233,235
15 0 180 2 23 33-35,104,112,134,144,158,167-168,170,175-176,187,198,208,221-222,228-229,233-235
16 0 180 2 23 34-36,71,91,146,155-156,189-190,217-219,223,225-228,231-235
17 0 180 2 23 33-34,49,92,101,134,144,187,195-197,207-209,216-217,221,230-235
18 0 180 2 23 33-34,135-136,146,174,198,206-209,217,224-231,233-235
19 0 180 2 23 33-34,58,91,101,113,122,135,165,197-199,206,221-223,228-229,231-235
20 0 180 2 23 33-34,215-235
21 0 180 2 23 33-34,214,216-235
22 0 180 2 23 33-34,215-235
23 0 180 2 23 33-34,215-235
24 0 180 2 23 33-35,216-235
25 0 180 2 23 33-35,216-235
26 0 180 2 23 33-35,216-235
27 0 180 2 23 33-35,216-235
28 0 180 2 23 33-35,216-235
29 0 180 2 23 33-35,216-235
30 0 180 2 23 33-35,216-235
31 0 180 2 23 33-35,216-235
32 0 180 2 23 33-34,215-235
33 0 180 2 23 33-34,215-235
34 0 180 2 23 33-34,215-235
35 0 180 2 23 33-34,215-235
36 0 180 2 23 33-34,211,216-235
37 0 180 2 23 33-34,215-235
38 0 180 2 23 33-34,215-235
39 0 180 2 23 33-34,215-235
40 0 180 2 23 33-34,56,65,134,170,176-178,207-210,225-229,231-235
41 0 180 2 23 33-34,54,113,135-137,143,169,195-198,216-217,224,228-230,232-235
42 0 180 2 23 33,36,57,111-112,126,164,175-176,199-200,207-210,225-226,230-235
43 0 180 2 23 33-34,70,82,133-135,145,155,166,174,188-189,207,209,218,226-229,233-235
44 0 180 2 23 33-34,59,103,111,126,166-167,185-186,207-208,217-218,226-232,234-235
45 0 180 2 23 33,35-36,81,106,145-146,165,176,187,195,220-221,226-235
46 0 180 2 23 33-34,69,137,143,155,176,180,185-187,197,206-207,212-213,225-228,230,234-235
47 0 180 2 23 34,36,71,91-92,103-104,143,165,179,185-186,195,208-209,220-221,230-235
48 0 180 2 23 33-34,36,93,122,157,174,186-188,198,208-209,225,227-235
49 0 180 2 23 34-35,132-133,147-148,156,176-177,194-197,212,226-228,230-235
50 0 180 2 23 33-34,45,123,138,162,164-166,195-196,208-209,219,224-226,228,230-231,233-235
51 0 180 2 23 33-34,55,69-70,110,167,179-181,197-198,217-220,228-230,232-235
52 0 180 2 23 33-34,70,132,145,156,178,186-188,190,210-212,218-219,228-230,232-235
53 0 180 2 23 33,35,70,111,144,194-195,197,209,216-219,224,226-231,233-235
54 0 180 2 23 33-34,102,115,147,154,164-166,181,188,200,210-211,219-220,228-229,231-235
55 0 180 2 23 33-36,55,114,154-156,174,187,198,207-209,224-225,227-229,233-235
56 0 180 2 23 33-34,54,104,113,132,154,175,188,209,216-221,226-227,230-233,235
57 0 180 2 23 34-35,47,100,127,132-133,176-178,196-197,208,220,224-226,230-235
58 0 180 2 23 34,37,42,100,110-111,143,164-165,185,198,206-208,216-218,228-229,231,233-235
59 0 180 3 24 33-35,39,43,81-82,111,126,164-165,184,186,211-212,219-221,223,231-235
60 0 180 3 24 33-35,215-235
61 0 180 3 24 33-35,215-235
62 0 180 3 24 33-35,215-235
63 0 180 3 24 33-35,215-235
64 0 180 3 24 33-35,215-235
65 0 180 3 24 33-35,215-235
66 0 180 3 24 33-35,215-235
67 0 180 3 24 33-35,215-235
68 0 180 3 24 33-35,211,216-235
69 0 180 3 24 33-35,215-235
70 0 180 3 24 33-35,215-235
71 0 180 3 24 33-35,215-235
72 0 180 3 24 33-35,215-235
73 0 180 3 24 33-35,215-235
74 0 180 3 24 33-35,215-235
75 0 180 3 24 33-35,215-235
76 0 180 3 24 33-35,215-235
77 0 180 3 24 33-35,215-235
78 0 180 3 24 33-35,214,216-235
79 0 180 3 24 33-35,215-235



crash_cts> p *vector_matrix
$98 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 0,
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 1861,
online_maps = 80,
maps = 0x46100,
scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}
If there is any other information I need to provide, please tell me.
Thanks.
> >
> > Thanks,
> >
> > tglx
>
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?
>
> --
> MST

2022-11-16 01:39:21

by Angus Chen

Subject: RE: IRQ affinity problem from virtio_blk



> -----Original Message-----
> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, November 16, 2022 7:24 AM
> To: Michael S. Tsirkin <[email protected]>
> Cc: Angus Chen <[email protected]>; [email protected];
> Ming Lei <[email protected]>; Jason Wang <[email protected]>
> Subject: Re: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>
> > On Tue, Nov 15 2022 at 17:44, Michael S. Tsirkin wrote:
> >> On Tue, Nov 15, 2022 at 11:19:47PM +0100, Thomas Gleixner wrote:
> >>> > We can see global_available drop from 15354 to 15273, is 81.
> >>> > And the total_allocated increase from 411 to 413. One config irq,and
> >>> > one vq irq.
> >>>
> >>> Right. That's perfectly fine. At the point where you looking at it, the
> >>> matrix allocator has given out 2 vectors as can be seen via
> >>> total_allocated.
> >>>
> >>> But then it also has another 79 vectors put aside for the other queues,
Well, that is not quite the case here; in fact, I have just one queue per virtio_blk device:

crash_cts> struct virtio_blk.num_vqs 0xffff888147b79c00
num_vqs = 1,
I think this is the key point we are discussing.
> >>
> >> What makes it put these vectors aside? pci_alloc_irq_vectors_affinity ?
> >
> > init_vq() -> virtio_find_vqs() -> vp_find_vqs() ->
> > vp_request_msix_vectors() -> pci_alloc_irq_vectors_affinity()
> >
> > init_vq() hands in a struct irq_affinity which means that
> > pci_alloc_irq_vectors_affinity() will spread out interrupts and have one
> > for config and one per queue if vp_request_msix_vectors() is invoked
> > with per_vq_vectors == true, which is what the first invocation in
> > vp_find_vqs() does.
>
> I just checked on a random VM. The PCI device as advertised to the guest
> does not expose that many vectors. One has 2 and the other 4.
>
> But as the interrupts are requested 'managed' the core ends up setting
> the vectors aside. That's a fundamental property of managed interrupts.
>
> Assume you have less queues than CPUs, which is the case with 2 vectors
> and tons of CPUs, i.e. one ends up for config and the other for the
> actual queue. So the affinity spreading code will end up having the full
> cpumask for the queue vector, which is marked managed. And managed means
> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> be migrated to a still online CPU.
>
> So we end up setting 79 vectors aside (one per CPU) in the case that the
> virtio device only provides two vectors.
>
> But that's not the end of the world as you really would need ~200 such
> devices to exhaust the vector space...
>
Thank you for your reply.
Let's look at the dmesg output for more information.
...
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues   <-- virtio182 means device index 182
Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28                        <-- the first time we get the 'no space' (-ENOSPC) error from the irq subsystem
...
At that point it is easy to see the following output:
crash_cts> p *vector_matrix
$97 = {
matrix_bits = 256,
alloc_start = 32,
alloc_end = 236,
alloc_size = 204,
global_available = 0,            <-- the irq vectors are exhausted
global_reserved = 154,
systembits_inalloc = 3,
total_allocated = 1861,
online_maps = 80,
maps = 0x46100,
scratch_map = {18446744069952503807, 18446744073709551615, 18446744073709551615, 18435229987943481343},
system_map = {1125904739729407, 0, 1, 18435221191850459136}
}

After that, all further irq requests return "no space" (-ENOSPC).

The more asymmetric the per-CPU vector usage is, the sooner we hit the 'no space' error when probing
devices whose irqs are set up with IRQD_AFFINITY_MANAGED.

> Thanks,
>
> tglx
>

2022-11-16 11:15:25

by Thomas Gleixner

Subject: RE: IRQ affinity problem from virtio_blk

On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
>> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> Any other information I need to provide,pls tell me.

A sensible use case for 180+ virtio block devices in a single guest.

Thanks,

tglx

2022-11-16 11:21:24

by Thomas Gleixner

Subject: RE: IRQ affinity problem from virtio_blk

On Wed, Nov 16 2022 at 00:46, Angus Chen wrote:
>> On Wed, Nov 16 2022 at 00:04, Thomas Gleixner wrote:
>> >>> But then it also has another 79 vectors put aside for the other queues,
> en,it not the truth,in fact ,I just has one queue for one virtio_blk.

Which does not matter. See my reply to Michael. It's ONE vector per CPU
and block device.

> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: 1/0/0 default/read/poll queues
> Nov 14 11:48:45 localhost kernel: virtio_blk virtio181: [vdpr] 20480 512-byte logical blocks (10.5 MB/10.0 MiB)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: enabling device (0000 -> 0002)
> Nov 14 11:48:46 localhost kernel: virtio-pci 0000:37:16.4: virtio_pci: leaving for legacy driver
> Nov 14 11:48:46 localhost kernel: virtio_blk virtio182: 1/0/0 default/read/poll queues---------the virtio182 means index 182.
> Nov 14 11:48:46 localhost kernel: vp_find_vqs_msix return err=-28-----------------------------the first time we get 'no space' error from irq subsystem.

That's close to 200 virtio devices and the vector space is exhausted.
Works as expected.

Interrupt vectors are a limited resource on x86 and not only on x86. Not
any different from any other resource.

Thanks,

tglx

2022-11-16 11:38:55

by Angus Chen

Subject: RE: IRQ affinity problem from virtio_blk



> -----Original Message-----
> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, November 16, 2022 6:56 PM
> To: Angus Chen <[email protected]>; Michael S. Tsirkin
> <[email protected]>
> Cc: [email protected]; Ming Lei <[email protected]>; Jason
> Wang <[email protected]>
> Subject: RE: IRQ affinity problem from virtio_blk
>
> On Wed, Nov 16 2022 at 01:02, Angus Chen wrote:
> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> > Any other information I need to provide,pls tell me.
>
> A sensible use case for 180+ virtio block devices in a single guest.
>
Our card can provide more than 512 virtio_blk devices.
Each virtio_blk device is passed through to one container (docker, for example),
so we need that many devices.
In my first patch I removed IRQD_AFFINITY_MANAGED from virtio_blk.

As you know, even with a small number of queues, like 1 or 2, each device
still occupies 80 vectors. That is rather wasteful, and it makes it easy to
exhaust the irq resources.

IRQD_AFFINITY_MANAGED itself is not the problem;
many devices all using IRQD_AFFINITY_MANAGED is the problem.

Thanks.

> Thanks,
>
> tglx

2022-11-16 11:49:31

by Thomas Gleixner

Subject: Re: IRQ affinity problem from virtio_blk

On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> I just checked on a random VM. The PCI device as advertised to the guest
>> does not expose that many vectors. One has 2 and the other 4.
>>
>> But as the interrupts are requested 'managed' the core ends up setting
>> the vectors aside. That's a fundamental property of managed interrupts.
>>
>> Assume you have less queues than CPUs, which is the case with 2 vectors
>> and tons of CPUs, i.e. one ends up for config and the other for the
>> actual queue. So the affinity spreading code will end up having the full
>> cpumask for the queue vector, which is marked managed. And managed means
>> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
>> be migrated to a still online CPU.
>>
>> So we end up setting 79 vectors aside (one per CPU) in the case that the
>> virtio device only provides two vectors.
>>
>> But that's not the end of the world as you really would need ~200 such
>> devices to exhaust the vector space...
>
> Let's say we have 20 queues - then just 10 devices will exhaust the
> vector space right?

No.

If you have 20 queues then the queues are spread out over the
CPUs. Assume 80 CPUs:

Then each queue is associated to 80/20 = 4 CPUs and the resulting
affinity mask of each queue contains exactly 4 CPUs:

q0: 0 - 3
q1: 4 - 7
...
q19: 76 - 79

So this puts exactly 80 vectors aside, one per CPU.
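
As a back-of-the-envelope model of that spread (plain user-space C, just to make the
arithmetic concrete with the 80-CPU / 20-queue numbers from above; not kernel code):

#include <stdio.h>

int main(void)
{
	const unsigned int nr_cpus = 80, nr_queues = 20;
	const unsigned int cpus_per_queue = nr_cpus / nr_queues;	/* 4 */
	unsigned int q;

	for (q = 0; q < nr_queues; q++)
		printf("q%-2u: CPUs %u-%u\n", q, q * cpus_per_queue,
		       (q + 1) * cpus_per_queue - 1);

	/* One managed vector is set aside per CPU, not per queue */
	printf("vectors set aside: %u\n", nr_cpus);
	return 0;
}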

As long as at least one CPU of a queue mask is online the queue is
enabled. If the last CPU of a queue mask goes offline then the queue is
shutdown which means the interrupt associated to the queue is shut down
too. That's all handled by the block MQ and the interrupt core. If a CPU
of a queue mask comes back online then the guaranteed vector is
allocated again.

So it does not matter how many queues per device you have it will
reserve exactly ONE interrupt per CPU.

Ergo you need 200 devices to exhaust the vector space.

Thanks,

tglx

2022-11-16 11:56:11

by Ming Lei

Subject: Re: IRQ affinity problem from virtio_blk

On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote:
> On Tue, Nov 15 2022 at 18:36, Michael S. Tsirkin wrote:
> > On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
> >> I just checked on a random VM. The PCI device as advertised to the guest
> >> does not expose that many vectors. One has 2 and the other 4.
> >>
> >> But as the interrupts are requested 'managed' the core ends up setting
> >> the vectors aside. That's a fundamental property of managed interrupts.
> >>
> >> Assume you have less queues than CPUs, which is the case with 2 vectors
> >> and tons of CPUs, i.e. one ends up for config and the other for the
> >> actual queue. So the affinity spreading code will end up having the full
> >> cpumask for the queue vector, which is marked managed. And managed means
> >> that it's guaranteed e.g. in the CPU hotplug case that the interrupt can
> >> be migrated to a still online CPU.
> >>
> >> So we end up setting 79 vectors aside (one per CPU) in the case that the
> >> virtio device only provides two vectors.
> >>
> >> But that's not the end of the world as you really would need ~200 such
> >> devices to exhaust the vector space...
> >
> > Let's say we have 20 queues - then just 10 devices will exhaust the
> > vector space right?
>
> No.
>
> If you have 20 queues then the queues are spread out over the
> CPUs. Assume 80 CPUs:
>
> Then each queue is associated to 80/20 = 4 CPUs and the resulting
> affinity mask of each queue contains exactly 4 CPUs:
>
> q0: 0 - 3
> q1: 4 - 7
> ...
> q19: 76 - 79
>
> So this puts exactly 80 vectors aside, one per CPU.
>
> As long as at least one CPU of a queue mask is online the queue is
> enabled. If the last CPU of a queue mask goes offline then the queue is
> shutdown which means the interrupt associated to the queue is shut down
> too. That's all handled by the block MQ and the interrupt core. If a CPU
> of a queue mask comes back online then the guaranteed vector is
> allocated again.
>
> So it does not matter how many queues per device you have it will
> reserve exactly ONE interrupt per CPU.
>
> Ergo you need 200 devices to exhaust the vector space.

Hi Thomas,

I am wondering why one interrupt needs to be reserved for each CPU; in
theory one queue needs only one irq, as I understand it. Would you mind
explaining the story a bit?


Thanks,
Ming


2022-11-16 13:30:15

by Thomas Gleixner

Subject: Re: IRQ affinity problem from virtio_blk

On Wed, Nov 16 2022 at 19:35, Ming Lei wrote:
> On Wed, Nov 16, 2022 at 11:43:24AM +0100, Thomas Gleixner wrote:
>> > Let's say we have 20 queues - then just 10 devices will exhaust the
>> > vector space right?
>>
>> No.
>>
>> If you have 20 queues then the queues are spread out over the
>> CPUs. Assume 80 CPUs:
>>
>> Then each queue is associated to 80/20 = 4 CPUs and the resulting
>> affinity mask of each queue contains exactly 4 CPUs:
>>
>> q0: 0 - 3
>> q1: 4 - 7
>> ...
>> q19: 76 - 79
>>
>> So this puts exactly 80 vectors aside, one per CPU.
>>
>> As long as at least one CPU of a queue mask is online the queue is
>> enabled. If the last CPU of a queue mask goes offline then the queue is
>> shutdown which means the interrupt associated to the queue is shut down
>> too. That's all handled by the block MQ and the interrupt core. If a CPU
>> of a queue mask comes back online then the guaranteed vector is
>> allocated again.
>>
>> So it does not matter how many queues per device you have it will
>> reserve exactly ONE interrupt per CPU.
>>
>> Ergo you need 200 devices to exhaust the vector space.
>
> I am wondering why one interrupt needs to be reserved for each CPU, in
> theory one queue needs one irq, I understand, so would you mind
> explaining the story a bit?

It's only one interrupt per queue. Interrupt != vector.

The guarantee of managed interrupts always was that if there are less
queues than CPUs that CPU hotunplug cannot result in vector exhaustion.

Therefore we differentiate between managed and non-managed
interrupts. Managed have a guaranteed reservation, non-managed do not.

That's been a very deliberate design decision from the very beginning.
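
In driver-visible terms the difference looks roughly like this (a hedged sketch with an
illustrative function name, not a recipe taken from this thread):

#include <linux/pci.h>

static int alloc_example(struct pci_dev *pdev, unsigned int nr_queues)
{
	struct irq_affinity affd = { .pre_vectors = 1 };
	int ret;

	/*
	 * Non-managed: vectors are handed out on demand and nothing is
	 * reserved per CPU up front.
	 */
	ret = pci_alloc_irq_vectors(pdev, 1, nr_queues, PCI_IRQ_MSIX);
	if (ret < 0)
		return ret;
	pci_free_irq_vectors(pdev);

	/*
	 * Managed: PCI_IRQ_AFFINITY spreads the vectors and marks them
	 * managed, so the matrix allocator guarantees (reserves) one
	 * vector on every CPU of each queue mask.
	 */
	ret = pci_alloc_irq_vectors_affinity(pdev, 2, nr_queues + 1,
					     PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					     &affd);
	return ret < 0 ? ret : 0;
}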

Thanks,

tglx

2022-11-16 13:44:18

by Thomas Gleixner

Subject: RE: IRQ affinity problem from virtio_blk

On Wed, Nov 16 2022 at 11:24, Angus Chen wrote:
>> >> On Wed, Nov 16, 2022 at 12:24:24AM +0100, Thomas Gleixner wrote:
>> > Any other information I need to provide,pls tell me.
>>
>> A sensible use case for 180+ virtio block devices in a single guest.
>>
> Our card can provide more than 512 virtio_blk devices .
> one virtio_blk passthrough to one container,like docker.

I'm not sure whether that's sensible, but that's how your hardware is
designed. You could have provided this information upfront instead of
random memory dumps of the irq matrix internals.

> So we need so much devices.
> In the first patch ,I del the IRQD_AFFINITY_MANAGED in virtio_blk .

There is no IRQD_AFFINITY_MANAGED in virtio_blk. That flag is internal
to the interrupt core code and you can neither delete it nor fiddle with
it from inside virtio_blk.

You can do that in your private kernel, but that's not an option for
mainline as it will break existing setups and it's fundamentally wrong.

The block-mq code has assumptions about the semantics of managed
interrupts. It happens to work for the single queue case because that
always ends up with queue affinity == cpu_possible_mask.

For anything else which assigns the queues to partitions of the CPU
space it definitely expects the semantics of managed interrupts.

> As you know, if we just use small queues number ,like 1or 2,we Still
> occupy 80 vector ,that is kind of waste,and it is easy to eahausted
> the Irq resource.

We know that by now. No point in repeating this over and over. Aside of
that it's not that easy because this is the first time within 5 years
that someone ran into this problem.

The real question is how to solve this proper without creating problems
for other scenarios. That needs involvment of the blk-mq people.

Thanks,

tglx