2022-12-22 07:54:07

by Xinghui Li

Subject: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

From: Xinghui Li <[email protected]>

Commit ee81ee84f873 ("PCI: vmd: Disable MSI-X remapping when possible")
disabled VMD MSI-X remapping to improve PCI performance. However, this
feature severely degrades performance in multi-disk configurations.

In an FIO 4K random test, we tested 1 disk on 1 CPU.

With MSI-X remapping disabled (bypass mode):
read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
io=1354GiB (1454GB), run=300001-300001msec

With MSI-X remapping enabled (remapping mode):
read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
io=1340GiB (1438GB), run=300001-300001msec

However, bypass mode can increase the per-CPU interrupt cost. We also
tested 12 disks on 6 CPUs.

With MSI-X remapping disabled (bypass mode):
read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
io=644GiB (691GB), run=300001-300001msec

With MSI-X remapping enabled (remapping mode):
read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
io=1310GiB (1406GB), run=300005-300005msec

Signed-off-by: Xinghui Li <[email protected]>
---
drivers/pci/controller/vmd.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
index e06e9f4fc50f..9f6e9324d67d 100644
--- a/drivers/pci/controller/vmd.c
+++ b/drivers/pci/controller/vmd.c
@@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
{PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
- VMD_FEAT_HAS_BUS_RESTRICTIONS |
- VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
+ VMD_FEAT_HAS_BUS_RESTRICTIONS,},
{PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
.driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
VMD_FEAT_HAS_BUS_RESTRICTIONS |
--
2.39.0


2022-12-22 10:05:52

by Jonathan Derrick

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller



On 12/22/22 12:26 AM, [email protected] wrote:
> From: Xinghui Li <[email protected]>
>
> Commit ee81ee84f873("PCI: vmd: Disable MSI-X remapping when possible")
> disable the vmd MSI-X remapping for optimizing pci performance.However,
> this feature severely negatively optimized performance in multi-disk
> situations.
>
> In FIO 4K random test, we test 1 disk in the 1 CPU
>
> when disable MSI-X remapping:
> read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> io=1354GiB (1454GB), run=300001-300001msec
>
> When not disable MSI-X remapping:
> read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> io=1340GiB (1438GB), run=300001-300001msec
>
> However, the bypass mode could increase the interrupts costs in CPU.
> We test 12 disks in the 6 CPU,
Well, bypass mode was made to improve performance where you have >4
drives, so this is pretty surprising. With bypass mode disabled, VMD will
intercept and forward interrupts, increasing costs.

I think Nirmal would want to understand if there's some other factor
going on here.

>
> When disable MSI-X remapping:
> read: IOPS=562k, BW=2197MiB/s (2304MB/s)(644GiB/300001msec)
> READ: bw=2197MiB/s (2304MB/s), 2197MiB/s-2197MiB/s (2304MB/s-2304MB/s),
> io=644GiB (691GB), run=300001-300001msec
>
> When not disable MSI-X remapping:
> read: IOPS=1144k, BW=4470MiB/s (4687MB/s)(1310GiB/300005msec)
> READ: bw=4470MiB/s (4687MB/s), 4470MiB/s-4470MiB/s (4687MB/s-4687MB/s),
> io=1310GiB (1406GB), run=300005-300005msec
>
> Signed-off-by: Xinghui Li <[email protected]>
> ---
> drivers/pci/controller/vmd.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/pci/controller/vmd.c b/drivers/pci/controller/vmd.c
> index e06e9f4fc50f..9f6e9324d67d 100644
> --- a/drivers/pci/controller/vmd.c
> +++ b/drivers/pci/controller/vmd.c
> @@ -998,8 +998,7 @@ static const struct pci_device_id vmd_ids[] = {
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP,},
> {PCI_DEVICE(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_VMD_28C0),
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW |
> - VMD_FEAT_HAS_BUS_RESTRICTIONS |
> - VMD_FEAT_CAN_BYPASS_MSI_REMAP,},
> + VMD_FEAT_HAS_BUS_RESTRICTIONS,},
> {PCI_DEVICE(PCI_VENDOR_ID_INTEL, 0x467f),
> .driver_data = VMD_FEAT_HAS_MEMBAR_SHADOW_VSCAP |
> VMD_FEAT_HAS_BUS_RESTRICTIONS |

2022-12-22 22:10:13

by Keith Busch

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On Thu, Dec 22, 2022 at 02:15:20AM -0700, Jonathan Derrick wrote:
> On 12/22/22 12:26 AM, [email protected] wrote:
> >
> > However, the bypass mode could increase the interrupts costs in CPU.
> > We test 12 disks in the 6 CPU,
>
> Well the bypass mode was made to improve performance where you have >4
> drives so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.
>
> I think Nirmal would want to to understand if there's some other factor
> going on here.

With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
context switching. Sounds like the non-bypass mode is aggregating and
spreading interrupts across the cores better, but there's probably some
cpu:drive count tipping point where performance favors the other way.

The fio jobs could also probably set their cpus_allowed differently to
get better performance in the bypass mode.
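
For reference, a minimal fio job file along those lines might look like the
sketch below; the device paths, CPU numbers, and job counts are illustrative
assumptions, not the exact configuration used in the reported tests.

[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=128
numjobs=8
runtime=300
time_based
group_reporting

; one job section per drive, each pinned to its own CPU set via cpus_allowed
[nvme0]
filename=/dev/nvme0n1
cpus_allowed=0

[nvme1]
filename=/dev/nvme1n1
cpus_allowed=1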

2022-12-23 08:21:20

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Jonathan Derrick <[email protected]> 于2022年12月22日周四 17:15写道:
>
>
>
> On 12/22/22 12:26 AM, [email protected] wrote:
> > From: Xinghui Li <[email protected]>
> >
> > Commit ee81ee84f873("PCI: vmd: Disable MSI-X remapping when possible")
> > disable the vmd MSI-X remapping for optimizing pci performance.However,
> > this feature severely negatively optimized performance in multi-disk
> > situations.
> >
> > In FIO 4K random test, we test 1 disk in the 1 CPU
> >
> > when disable MSI-X remapping:
> > read: IOPS=1183k, BW=4622MiB/s (4847MB/s)(1354GiB/300001msec)
> > READ: bw=4622MiB/s (4847MB/s), 4622MiB/s-4622MiB/s (4847MB/s-4847MB/s),
> > io=1354GiB (1454GB), run=300001-300001msec
> >
> > When not disable MSI-X remapping:
> > read: IOPS=1171k, BW=4572MiB/s (4795MB/s)(1340GiB/300001msec)
> > READ: bw=4572MiB/s (4795MB/s), 4572MiB/s-4572MiB/s (4795MB/s-4795MB/s),
> > io=1340GiB (1438GB), run=300001-300001msec
> >
> > However, the bypass mode could increase the interrupts costs in CPU.
> > We test 12 disks in the 6 CPU,
> Well the bypass mode was made to improve performance where you have >4
> drives so this is pretty surprising. With bypass mode disabled, VMD will
> intercept and forward interrupts, increasing costs.

We also found that the more drives we tested, the more severe the
performance degradation became. When we tested 8 drives on 6 CPUs,
there was roughly a 30% drop.

> I think Nirmal would want to to understand if there's some other factor
> going on here.

I also agree with this. The tested server uses no I/O scheduler (none).
We ran all tests on the same server, and the tested drives are Samsung
Gen-4 NVMe. Is there anything else you are concerned might affect the
test results?

2022-12-23 08:25:04

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Keith Busch <[email protected]> 于2022年12月23日周五 05:56写道:
>
> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
> context switching. Sounds like the non-bypass mode is aggregating and
> spreading interrupts across the cores better, but there's probably some
> cpu:drive count tipping point where performance favors the other way.

We found that tuning the interrupt aggregation can also bring the
drive performance back to normal.

> The fio jobs could also probably set their cpus_allowed differently to
> get better performance in the bypass mode.

We used cpus_allowed in FIO to pin the 12 drives to 6 different CPUs.

By the way, sorry for emailing twice; the previous one had a formatting problem.

2022-12-27 23:01:04

by Jonathan Derrick

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller



On 12/23/2022 2:02 AM, Xinghui Li wrote:
> Keith Busch <[email protected]> 于2022年12月23日周五 05:56写道:
>>
>> With 12 drives and only 6 CPUs, the bypass mode is going to get more irq
>> context switching. Sounds like the non-bypass mode is aggregating and
>> spreading interrupts across the cores better, but there's probably some
>> cpu:drive count tipping point where performance favors the other way.
>
> We found that tunning the interrupt aggregation can also bring the
> drive performance back to normal.
>
>> The fio jobs could also probably set their cpus_allowed differently to
>> get better performance in the bypass mode.
>
> We used the cpus_allowed in FIO to fix 12 dirves in 6 different CPU.
>
> By the way, sorry for emailing twice, the last one had the format problem.

Bypass mode should help in cases where drive IRQs (e.g. nproc) exceed
VMD I/O IRQs. VMD I/O IRQs for 28c0 should be min(63, nproc). You have
very few CPUs for a Skylake system with that many drives, unless you mean you
are explicitly restricting the 12 drives to only 6 CPUs. Either way, bypass mode
is effectively VMD-disabled, which points to other issues. Though I have also seen
much smaller interrupt aggregation benefits.

2022-12-28 02:48:07

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Jonathan Derrick <[email protected]> 于2022年12月28日周三 06:32写道:
>
> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
> very few cpus for a Skylake system with that many drives, unless you mean you
> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
> is effectively VMD-disabled, which points to other issues. Though I have also seen
> much smaller interrupt aggregation benefits.

Firstly, I am sorry that my wording misled you. We tested 12 drives in total,
and each drive ran on 6 CPU cores with 8 jobs.

Secondly, I tried testing the drives with VMD disabled and found the results
to be largely consistent with bypass mode. I suppose bypass mode simply
"bypasses" the VMD controller.

Lastly, we found that in bypass mode the CPU idle is 91%, while in remapping
mode the CPU idle is 78%. Also, the number of context switches in bypass mode
is much lower than in remapping mode. It seems the system is waiting for
something in bypass mode.

2023-01-09 21:30:13

by Jonathan Derrick

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

As bypass mode seems to affect performance greatly depending on the specific configuration,
it may make sense to use a module parameter to control it.

I'd vote for it being in VMD mode (non-bypass) by default.
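
A rough sketch of such a knob, assuming it lives in drivers/pci/controller/vmd.c
where VMD_FEAT_CAN_BYPASS_MSI_REMAP is defined (the parameter name and helper
are hypothetical, not part of the current driver):

#include <linux/module.h>
#include <linux/moduleparam.h>

/* Hypothetical module parameter; default keeps VMD remapping (non-bypass). */
static bool msi_remap_bypass;
module_param(msi_remap_bypass, bool, 0444);
MODULE_PARM_DESC(msi_remap_bypass,
		 "Allow MSI-X remapping bypass on capable VMD devices (default: off)");

/*
 * Filter the feature bits taken from the pci_device_id table so the bypass
 * capability is only honored when the user explicitly opted in.
 */
static unsigned long vmd_filter_features(unsigned long features)
{
	if (!msi_remap_bypass)
		features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;
	return features;
}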

On 12/27/2022 7:19 PM, Xinghui Li wrote:
> Jonathan Derrick <[email protected]> 于2022年12月28日周三 06:32写道:
>>
>> The bypass mode should help in the cases where drives irqs (eg nproc) exceed
>> VMD I/O irqs. VMD I/O irqs for 28c0 should be min(63, nproc). You have
>> very few cpus for a Skylake system with that many drives, unless you mean you
>> are explicitly restricting the 12 drives to only 6 cpus. Either way, bypass mode
>> is effectively VMD-disabled, which points to other issues. Though I have also seen
>> much smaller interrupt aggregation benefits.
>
> Firstly,I am sorry for my words misleading you. We totally tested 12 drives.
> And each drive run in 6 CPU cores with 8 jobs.
>
> Secondly, I try to test the drives with VMD disabled,I found the results to
> be largely consistent with bypass mode. I suppose the bypass mode just
> "bypass" the VMD controller.
>
> The last one,we found in bypass mode the CPU idle is 91%. But in remapping mode
> the CPU idle is 78%. And the bypass's context-switchs is much fewer
> than the remapping
> mode's. It seems the system is watiing for something in bypass mode.

2023-01-10 13:09:53

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Jonathan Derrick <[email protected]> 于2023年1月10日周二 05:00写道:
>
> As the bypass mode seems to affect performance greatly depending on the specific configuration,
> it may make sense to use a moduleparam to control it
>
We found that each PCIe port can mount four drives. If we only test 1 or 2
drives on one PCIe port, the drive performance is normal. Also, we
observed the interrupts in the different modes.
bypass:
.....
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU : 192, ACTIVE CPU : 192
disable:
......
2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
2022-12-28-12-05-56: LOC 163697 Local timer interrupts
2022-12-28-12-05-56: TLB 5465 TLB shootdowns
2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU : 192, ACTIVE CPU : 192
remapping:
2022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
......
2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU : 192, ACTIVE CPU : 192

From these results it is not hard to see that in remapping mode the
interrupts come from the VMD devices, while in the other modes the interrupts
come from the NVMe devices. Besides, we found that a port mounting 4 drives
generates far fewer total interrupts than a port mounting 1 or 2 drives.
NVMe 8 and 9 are mounted on one port; the other ports each mount 4 drives.

2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
......
2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> I'd vote for it being in VMD mode (non-bypass) by default.
I speculate that the VMD controller equalizes the interrupt load and
acts like a buffer, which improves NVMe performance. I am not sure about
my analysis, so I'd like to discuss it with the community.

2023-02-06 12:45:06

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Friendly ping~

Xinghui Li <[email protected]> 于2023年1月10日周二 20:28写道:
>
> Jonathan Derrick <[email protected]> 于2023年1月10日周二 05:00写道:
> >
> > As the bypass mode seems to affect performance greatly depending on the specific configuration,
> > it may make sense to use a moduleparam to control it
> >
> We found that each pcie port can mount four drives. If we only test 2
> or 1 dirve of one pcie port,
> the performance of the drive performance will be normal. Also, we
> observed the interruptions in different modes.
> bypass:
> .....
> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
> 192, ACTIVE CPU : 192
> disable:
> ......
> 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
> 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
> 2022-12-28-12-05-56: LOC 163697 Local timer interrupts
> 2022-12-28-12-05-56: TLB 5465 TLB shootdowns
> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
> 192, ACTIVE CPU : 192
> remapping:
> 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
> 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
> 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
> ......
> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
> 192, ACTIVE CPU : 192
>
> From the result it is not difficult to find, in remapping mode the
> interruptions come from vmd.
> While in other modes, interrupts come from nvme devices. Besides, we
> found the port mounting
> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>
> 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
> 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
> 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
> 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
> 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
> 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
> 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
> 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
> ......
> 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
> 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
> 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
> 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
> > I'd vote for it being in VMD mode (non-bypass) by default.
> I speculate that the vmd controller equalizes the interrupt load and
> acts like a buffer,
> which improves the performance of nvme. I am not sure about my
> analysis. So, I'd like
> to discuss it with the community.

2023-02-06 18:12:34

by Nirmal Patel

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On 2/6/2023 5:45 AM, Xinghui Li wrote:
> Friendly ping~
>
> Xinghui Li <[email protected]> 于2023年1月10日周二 20:28写道:
>> Jonathan Derrick <[email protected]> 于2023年1月10日周二 05:00写道:
>>> As the bypass mode seems to affect performance greatly depending on the specific configuration,
>>> it may make sense to use a moduleparam to control it
>>>
>> We found that each pcie port can mount four drives. If we only test 2
>> or 1 dirve of one pcie port,
>> the performance of the drive performance will be normal. Also, we
>> observed the interruptions in different modes.
>> bypass:
>> .....
>> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
>> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
>> 2022-12-28-11-39-14: RES 26743 Rescheduling interrupts
>> 2022-12-28-11-39-17: irqtop - IRQ : 3029, TOTAL : 2100315228, CPU :
>> 192, ACTIVE CPU : 192
>> disable:
>> ......
>> 2022-12-28-12-05-56: 1714 169797 IR-PCI-MSI 14155850-edge nvme1q74
>> 2022-12-28-12-05-56: 1701 168753 IR-PCI-MSI 14155849-edge nvme1q73
>> 2022-12-28-12-05-56: LOC 163697 Local timer interrupts
>> 2022-12-28-12-05-56: TLB 5465 TLB shootdowns
>> 2022-12-28-12-06-00: irqtop - IRQ : 3029, TOTAL : 2179022106, CPU :
>> 192, ACTIVE CPU : 192
>> remapping:
>> 022-12-28-11-25-38: 283 325568 IR-PCI-MSI 24651790-edge vmd3
>> 2022-12-28-11-25-38: 140 267899 IR-PCI-MSI 13117447-edge vmd1
>> 2022-12-28-11-25-38: 183 265978 IR-PCI-MSI 13117490-edge vmd1
>> ......
>> 2022-12-28-11-25-42: irqtop - IRQ : 2109, TOTAL : 2377172002, CPU :
>> 192, ACTIVE CPU : 192
>>
>> From the result it is not difficult to find, in remapping mode the
>> interruptions come from vmd.
>> While in other modes, interrupts come from nvme devices. Besides, we
>> found the port mounting
>> 4 dirves total interruptions is much fewer than the port mounting 2 or 1 drive.
>> NVME 8 and 9 mount in one port, other port mount 4 dirves.
>>
>> 2022-12-28-11-39-14: 2582 494635 IR-PCI-MSI 470810698-edge nvme9q74
>> 2022-12-28-11-39-14: 2579 489972 IR-PCI-MSI 470810697-edge nvme9q73
>> 2022-12-28-11-39-14: 2573 480024 IR-PCI-MSI 470810695-edge nvme9q71
>> 2022-12-28-11-39-14: 2544 312967 IR-PCI-MSI 470286401-edge nvme8q65
>> 2022-12-28-11-39-14: 2556 312229 IR-PCI-MSI 470286405-edge nvme8q69
>> 2022-12-28-11-39-14: 2547 310013 IR-PCI-MSI 470286402-edge nvme8q66
>> 2022-12-28-11-39-14: 2550 308993 IR-PCI-MSI 470286403-edge nvme8q67
>> 2022-12-28-11-39-14: 2559 308794 IR-PCI-MSI 470286406-edge nvme8q70
>> ......
>> 2022-12-28-11-39-14: 1296 185773 IR-PCI-MSI 202375243-edge nvme1q75
>> 2022-12-28-11-39-14: 1209 185646 IR-PCI-MSI 201850947-edge nvme0q67
>> 2022-12-28-11-39-14: 1831 184151 IR-PCI-MSI 203423828-edge nvme3q84
>> 2022-12-28-11-39-14: 1254 182313 IR-PCI-MSI 201850950-edge nvme0q70
>> 2022-12-28-11-39-14: 1224 181665 IR-PCI-MSI 201850948-edge nvme0q68
>> 2022-12-28-11-39-14: 1179 180115 IR-PCI-MSI 201850945-edge nvme0q65
>>> I'd vote for it being in VMD mode (non-bypass) by default.
>> I speculate that the vmd controller equalizes the interrupt load and
>> acts like a buffer,
>> which improves the performance of nvme. I am not sure about my
>> analysis. So, I'd like
>> to discuss it with the community.

I like the idea of a module parameter to allow switching between the modes
but keeping MSI-X remapping enabled (non-bypass) by default.


2023-02-06 18:28:49

by Keith Busch

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> I like the idea of module parameter to allow switching between the modes
> but keep MSI remapping enabled (non-bypass) by default.

Isn't there a more programmatic way to go about selecting the best option at
runtime? I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
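
A sketch of that heuristic as it might be applied to the feature flags at probe
time (illustrative only; the helper name is hypothetical and it is assumed to
live in vmd.c, where the bypass flag is defined):

#include <linux/cpumask.h>
#include <linux/pci.h>

/*
 * Keep the bypass capability only when there are more active CPUs than the
 * VMD endpoint has MSI-X vectors; otherwise fall back to VMD-remapped
 * (muxed) interrupts.
 */
static unsigned long vmd_select_irq_mode(struct pci_dev *dev,
					 unsigned long features)
{
	int nvec = pci_msix_vec_count(dev);

	if (nvec <= 0 || num_active_cpus() <= nvec)
		features &= ~VMD_FEAT_CAN_BYPASS_MSI_REMAP;

	return features;
}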

2023-02-07 03:17:50

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Keith Busch <[email protected]> 于2023年2月7日周二 02:28写道:
>
> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
> > I like the idea of module parameter to allow switching between the modes
> > but keep MSI remapping enabled (non-bypass) by default.
>
> Isn't there a more programatic way to go about selecting the best option at
> runtime?
Do you mean that the operating mode is automatically selected by
detecting the number of devices and CPUs instead of being set
manually?
>I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
For this situation, my speculation is that the PCIe ports are
over-mounted, and it is not just because of the CPU-to-drive ratio.
We considered designing an online node, because we were concerned that
I/O with different chunk sizes would suit different MSI-X modes.
I personally think the logic may become complicated if the choice is made
programmatically.

2023-02-07 20:32:26

by Nirmal Patel

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On 2/6/2023 8:18 PM, Xinghui Li wrote:
> Keith Busch <[email protected]> 于2023年2月7日周二 02:28写道:
>> On Mon, Feb 06, 2023 at 11:11:36AM -0700, Patel, Nirmal wrote:
>>> I like the idea of module parameter to allow switching between the modes
>>> but keep MSI remapping enabled (non-bypass) by default.
>> Isn't there a more programatic way to go about selecting the best option at
>> runtime?
> Do you mean that the operating mode is automatically selected by
> detecting the number of devices and CPUs instead of being set
> manually?
>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> For this situation, My speculation is that the PCIE nodes are
> over-mounted and not just because of the CPU to Drive ratio.
> We considered designing online nodes, because we were concerned that
> the IO of different chunk sizes would adapt to different MSI-X modes.
> I privately think that it may be logically complicated if programmatic
> judgments are made.

Also, newer CPUs have more MSI-X vectors (128), which means we can still get
better performance without bypass. It would be better if users could choose
a module parameter based on their requirements. Thanks.


2023-02-09 12:09:38

by Xinghui Li

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

Patel, Nirmal <[email protected]> 于2023年2月8日周三 04:32写道:
>
> Also newer CPUs have more MSIx (128) which means we can still have
> better performance without bypass. It would be better if user have
> can chose module parameter based on their requirements. Thanks.
>
All right, I will respin the patch as v2 with the online-node version later.

Thanks

2023-02-09 23:05:52

by Keith Busch

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> > Keith Busch <[email protected]> 于2023年2月7日周二 02:28写道:
> >> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> > For this situation, My speculation is that the PCIE nodes are
> > over-mounted and not just because of the CPU to Drive ratio.
> > We considered designing online nodes, because we were concerned that
> > the IO of different chunk sizes would adapt to different MSI-X modes.
> > I privately think that it may be logically complicated if programmatic
> > judgments are made.
>
> Also newer CPUs have more MSIx (128) which means we can still have
> better performance without bypass. It would be better if user have
> can chose module parameter based on their requirements. Thanks.

So what? More vectors just push out the threshold at which bypass becomes
relevant, which is exactly why I suggested it. There has to be an empirical
answer to when bypass beats muxing. Why do you want a user tunable if there's a
verifiable and automated better choice?

2023-02-09 23:58:06

by Nirmal Patel

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On 2/9/2023 4:05 PM, Keith Busch wrote:
> On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
>> On 2/6/2023 8:18 PM, Xinghui Li wrote:
>>> Keith Busch <[email protected]> 于2023年2月7日周二 02:28写道:
>>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
>>> For this situation, My speculation is that the PCIE nodes are
>>> over-mounted and not just because of the CPU to Drive ratio.
>>> We considered designing online nodes, because we were concerned that
>>> the IO of different chunk sizes would adapt to different MSI-X modes.
>>> I privately think that it may be logically complicated if programmatic
>>> judgments are made.
>> Also newer CPUs have more MSIx (128) which means we can still have
>> better performance without bypass. It would be better if user have
>> can chose module parameter based on their requirements. Thanks.
> So what? More vectors just pushes the threshold to when bypass becomes
> relevant, which is exactly why I suggested it. There has to be an empirical
> answer to when bypass beats muxing. Why do you want a user tunable if there's a
> verifiable and automated better choice?

The automated choice makes sense. I am not sure what the exact tipping
point is. The commit message includes only two cases: one with 1 drive on
1 CPU and a second with 12 drives on 6 CPUs. Also, performance gets worse
from 8 drives to 12 drives.
One of the previous comments also mentioned something about FIO changing
cpus_allowed; will there be an issue when the VMD driver decides to bypass
the remapping during boot-up, but the FIO job then changes cpus_allowed?


2023-02-10 00:47:47

by Keith Busch

Subject: Re: [PATCH] PCI: vmd: Do not disable MSI-X remapping in VMD 28C0 controller

On Thu, Feb 09, 2023 at 04:57:59PM -0700, Patel, Nirmal wrote:
> On 2/9/2023 4:05 PM, Keith Busch wrote:
> > On Tue, Feb 07, 2023 at 01:32:20PM -0700, Patel, Nirmal wrote:
> >> On 2/6/2023 8:18 PM, Xinghui Li wrote:
> >>> Keith Busch <[email protected]> 于2023年2月7日周二 02:28写道:
> >>>> I suspect bypass is the better choice if "num_active_cpus() > pci_msix_vec_count(vmd->dev)".
> >>> For this situation, My speculation is that the PCIE nodes are
> >>> over-mounted and not just because of the CPU to Drive ratio.
> >>> We considered designing online nodes, because we were concerned that
> >>> the IO of different chunk sizes would adapt to different MSI-X modes.
> >>> I privately think that it may be logically complicated if programmatic
> >>> judgments are made.
> >> Also newer CPUs have more MSIx (128) which means we can still have
> >> better performance without bypass. It would be better if user have
> >> can chose module parameter based on their requirements. Thanks.
> > So what? More vectors just pushes the threshold to when bypass becomes
> > relevant, which is exactly why I suggested it. There has to be an empirical
> > answer to when bypass beats muxing. Why do you want a user tunable if there's a
> > verifiable and automated better choice?
>
> Make sense about the automated choice. I am not sure what is the exact
> tipping point. The commit message includes only two cases. one 1 drive
> 1 CPU and second 12 drives 6 CPU. Also performance gets worse from 8
> drives to 12 drives.

That configuration's storage performance overwhelms the CPU with interrupt
context switching. That problem probably inverts when your active CPU count
exceeds your VMD vectors because you'll be funnelling more interrupts into
fewer CPUs, leaving other CPUs idle.

> One the previous comments also mentioned something about FIO changing
> cpus_allowed; will there be an issue when VMD driver decides to bypass
> the remapping during the boot up, but FIO job changes the cpu_allowed?

No. Bypass mode uses managed interrupts for your nvme child devices, which sets
the best possible affinity.
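
For readers unfamiliar with managed interrupts, here is a minimal illustration
of the allocation pattern an NVMe-style child driver uses; the function name
and the pre_vectors value are assumptions for the sketch, but PCI_IRQ_AFFINITY
is what makes the vectors "managed" (spread across CPUs by the core and not
movable from userspace):

#include <linux/interrupt.h>
#include <linux/pci.h>

/* Hypothetical helper: allocate one managed MSI-X vector per I/O queue. */
static int example_alloc_queue_irqs(struct pci_dev *pdev, unsigned int nr_io_queues)
{
	struct irq_affinity affd = {
		.pre_vectors = 1,	/* e.g. an admin/error vector that is not spread */
	};

	/*
	 * PCI_IRQ_AFFINITY requests managed interrupts: the core spreads them
	 * over the possible CPUs and pins each vector's affinity, so a fio
	 * cpus_allowed setting moves only the submitting threads, not the
	 * completion interrupts.
	 */
	return pci_alloc_irq_vectors_affinity(pdev, 1, nr_io_queues + 1,
					      PCI_IRQ_MSIX | PCI_IRQ_AFFINITY,
					      &affd);
}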