Depending on the number of online CPUs in the original kernel, it is
likely for CPU #0 to be offline in a kdump kernel. The associated IRQs
in the affinity mappings provided by irq_create_affinity_masks() are
thus not started by irq_startup(), as per design for managed IRQs.

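For context, this is roughly what the genirq core does here, paraphrased
from __irq_startup_managed() in kernel/irq/chip.c (a simplified sketch;
names and details vary across kernel versions):

  static int __irq_startup_managed(struct irq_desc *desc, struct cpumask *aff,
                                   bool force)
  {
          struct irq_data *d = irq_desc_get_irq_data(desc);

          if (!irqd_affinity_is_managed(d))
                  return IRQ_STARTUP_NORMAL;

          /*
           * A managed IRQ whose affinity mask contains no online CPU is
           * left shut down here; it is only started from the CPU hotplug
           * path once one of its CPUs comes online.
           */
          if (!cpumask_intersects(aff, cpu_online_mask)) {
                  irqd_set_managed_shutdown(d);
                  return IRQ_STARTUP_ABORT;
          }

          return IRQ_STARTUP_MANAGED;
  }
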
This can be a problem with multi-queue block devices driven by blk-mq:
such a non-started IRQ is very likely paired with the single queue
enforced by blk-mq during kdump (see blk_mq_alloc_tag_set()). This
causes the device to remain silent and likely hangs the guest at
some point.

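For reference, the single queue enforcement lives in blk_mq_alloc_tag_set();
roughly (paraphrased from block/blk-mq.c, exact code differs across kernel
versions):

  /*
   * A kdump kernel runs in a memory constrained environment, so blk-mq
   * falls back to a single hardware queue and a small queue depth,
   * regardless of how many queues the driver detected.
   */
  if (is_kdump_kernel()) {
          set->nr_hw_queues = 1;
          set->nr_maps = 1;
          set->queue_depth = min(64U, set->queue_depth);
  }
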
This is a regression caused by commit 9ea69a55b3b9 ("powerpc/pseries:
Pass MSI affinity to irq_create_mapping()"). Note that this only happens
with the XIVE interrupt controller, because XICS has a workaround to
bypass affinity, which is activated during kdump with the "noirqdistrib"
kernel parameter.

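As a minimal sketch of how such a boot parameter is typically wired up
(the identifier names below are illustrative assumptions, not the actual
XICS code):

  #include <linux/init.h>

  static int noirqdistrib;        /* assumed flag name, for illustration */

  /*
   * Parsed early from the kernel command line; when set, the interrupt
   * controller targets every IRQ at the boot CPU instead of spreading
   * them across the online CPUs.
   */
  static int __init noirqdistrib_setup(char *str)
  {
          noirqdistrib = 1;
          return 1;
  }
  __setup("noirqdistrib", noirqdistrib_setup);
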
The issue comes from a combination of factors:
- discrepancy between the number of queues detected by the multi-queue
  block driver, which was used to create the MSI vectors, and the single
  queue mode enforced later on by blk-mq because of kdump (i.e. keeping
  all queues fixes the issue)
- CPU #0 being offline (i.e. kdump always succeeds when CPU #0 is online)

Given that I couldn't reproduce this on x86, which seems to always have
CPU #0 online even during kdump, I'm not sure where this should be fixed.
Hence going for another approach: fine-grained affinity is for performance,
and we don't really care about that during kdump. Simply revert to the
previous working behavior of ignoring the affinity masks in this case only.

Fixes: 9ea69a55b3b9 ("powerpc/pseries: Pass MSI affinity to irq_create_mapping()")
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Greg Kurz <[email protected]>
---
arch/powerpc/platforms/pseries/msi.c | 24 ++++++++++++++++++++++--
1 file changed, 22 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index b3ac2455faad..29d04b83288d 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -458,8 +458,28 @@ static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
                         return hwirq;
                 }

-                virq = irq_create_mapping_affinity(NULL, hwirq,
-                                                   entry->affinity);
+                /*
+                 * Depending on the number of online CPUs in the original
+                 * kernel, it is likely for CPU #0 to be offline in a kdump
+                 * kernel. The associated IRQs in the affinity mappings
+                 * provided by irq_create_affinity_masks() are thus not
+                 * started by irq_startup(), as per-design for managed IRQs.
+                 * This can be a problem with multi-queue block devices driven
+                 * by blk-mq : such a non-started IRQ is very likely paired
+                 * with the single queue enforced by blk-mq during kdump (see
+                 * blk_mq_alloc_tag_set()). This causes the device to remain
+                 * silent and likely hangs the guest at some point.
+                 *
+                 * We don't really care for fine-grained affinity when doing
+                 * kdump actually : simply ignore the pre-computed affinity
+                 * masks in this case and let the default mask with all CPUs
+                 * be used when creating the IRQ mappings.
+                 */
+                if (is_kdump_kernel())
+                        virq = irq_create_mapping(NULL, hwirq);
+                else
+                        virq = irq_create_mapping_affinity(NULL, hwirq,
+                                                           entry->affinity);

                 if (!virq) {
                         pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
--
2.26.2
On 12/02/2021 17:41, Greg Kurz wrote:
> [...]

Reviewed-by: Laurent Vivier <[email protected]>
On 2/12/21 5:41 PM, Greg Kurz wrote:
> [...]

Reviewed-by: Cédric Le Goater <[email protected]>
Thanks for tracking this issue.
This layer needs a rework. Patches adding a MSI domain should be ready
in a couple of releases. Hopefully.
C.
Hi Greg,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on powerpc/next]
[also build test ERROR on linus/master v5.11-rc7 next-20210211]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Greg-Kurz/powerpc-pseries-Don-t-enforce-MSI-affinity-with-kdump/20210213-004658
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/1e5f7523fcfc57ab9437b8c7b29a974b62bde79d
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Greg-Kurz/powerpc-pseries-Don-t-enforce-MSI-affinity-with-kdump/20210213-004658
        git checkout 1e5f7523fcfc57ab9437b8c7b29a974b62bde79d
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>
All errors (new ones prefixed by >>):
arch/powerpc/platforms/pseries/msi.c: In function 'rtas_setup_msi_irqs':
>> arch/powerpc/platforms/pseries/msi.c:478:7: error: implicit declaration of function 'is_kdump_kernel' [-Werror=implicit-function-declaration]
     478 |  if (is_kdump_kernel())
         |      ^~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors
vim +/is_kdump_kernel +478 arch/powerpc/platforms/pseries/msi.c
   369
   370  static int rtas_setup_msi_irqs(struct pci_dev *pdev, int nvec_in, int type)
   371  {
   372          struct pci_dn *pdn;
   373          int hwirq, virq, i, quota, rc;
   374          struct msi_desc *entry;
   375          struct msi_msg msg;
   376          int nvec = nvec_in;
   377          int use_32bit_msi_hack = 0;
   378
   379          if (type == PCI_CAP_ID_MSIX)
   380                  rc = check_req_msix(pdev, nvec);
   381          else
   382                  rc = check_req_msi(pdev, nvec);
   383
   384          if (rc)
   385                  return rc;
   386
   387          quota = msi_quota_for_device(pdev, nvec);
   388
   389          if (quota && quota < nvec)
   390                  return quota;
   391
   392          if (type == PCI_CAP_ID_MSIX && check_msix_entries(pdev))
   393                  return -EINVAL;
   394
   395          /*
   396           * Firmware currently refuse any non power of two allocation
   397           * so we round up if the quota will allow it.
   398           */
   399          if (type == PCI_CAP_ID_MSIX) {
   400                  int m = roundup_pow_of_two(nvec);
   401                  quota = msi_quota_for_device(pdev, m);
   402
   403                  if (quota >= m)
   404                          nvec = m;
   405          }
   406
   407          pdn = pci_get_pdn(pdev);
   408
   409          /*
   410           * Try the new more explicit firmware interface, if that fails fall
   411           * back to the old interface. The old interface is known to never
   412           * return MSI-Xs.
   413           */
   414  again:
   415          if (type == PCI_CAP_ID_MSI) {
   416                  if (pdev->no_64bit_msi) {
   417                          rc = rtas_change_msi(pdn, RTAS_CHANGE_32MSI_FN, nvec);
   418                          if (rc < 0) {
   419                                  /*
   420                                   * We only want to run the 32 bit MSI hack below if
   421                                   * the max bus speed is Gen2 speed
   422                                   */
   423                                  if (pdev->bus->max_bus_speed != PCIE_SPEED_5_0GT)
   424                                          return rc;
   425
   426                                  use_32bit_msi_hack = 1;
   427                          }
   428                  } else
   429                          rc = -1;
   430
   431                  if (rc < 0)
   432                          rc = rtas_change_msi(pdn, RTAS_CHANGE_MSI_FN, nvec);
   433
   434                  if (rc < 0) {
   435                          pr_debug("rtas_msi: trying the old firmware call.\n");
   436                          rc = rtas_change_msi(pdn, RTAS_CHANGE_FN, nvec);
   437                  }
   438
   439                  if (use_32bit_msi_hack && rc > 0)
   440                          rtas_hack_32bit_msi_gen2(pdev);
   441          } else
   442                  rc = rtas_change_msi(pdn, RTAS_CHANGE_MSIX_FN, nvec);
   443
   444          if (rc != nvec) {
   445                  if (nvec != nvec_in) {
   446                          nvec = nvec_in;
   447                          goto again;
   448                  }
   449                  pr_debug("rtas_msi: rtas_change_msi() failed\n");
   450                  return rc;
   451          }
   452
   453          i = 0;
   454          for_each_pci_msi_entry(entry, pdev) {
   455                  hwirq = rtas_query_irq_number(pdn, i++);
   456                  if (hwirq < 0) {
   457                          pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
   458                          return hwirq;
   459                  }
   460
   461                  /*
   462                   * Depending on the number of online CPUs in the original
   463                   * kernel, it is likely for CPU #0 to be offline in a kdump
   464                   * kernel. The associated IRQs in the affinity mappings
   465                   * provided by irq_create_affinity_masks() are thus not
   466                   * started by irq_startup(), as per-design for managed IRQs.
   467                   * This can be a problem with multi-queue block devices driven
   468                   * by blk-mq : such a non-started IRQ is very likely paired
   469                   * with the single queue enforced by blk-mq during kdump (see
   470                   * blk_mq_alloc_tag_set()). This causes the device to remain
   471                   * silent and likely hangs the guest at some point.
   472                   *
   473                   * We don't really care for fine-grained affinity when doing
   474                   * kdump actually : simply ignore the pre-computed affinity
   475                   * masks in this case and let the default mask with all CPUs
   476                   * be used when creating the IRQ mappings.
   477                   */
 > 478                  if (is_kdump_kernel())
   479                          virq = irq_create_mapping(NULL, hwirq);
   480                  else
   481                          virq = irq_create_mapping_affinity(NULL, hwirq,
   482                                                             entry->affinity);
   483
   484                  if (!virq) {
   485                          pr_debug("rtas_msi: Failed mapping hwirq %d\n", hwirq);
   486                          return -ENOSPC;
   487                  }
   488
   489                  dev_dbg(&pdev->dev, "rtas_msi: allocated virq %d\n", virq);
   490                  irq_set_msi_desc(virq, entry);
   491
   492                  /* Read config space back so we can restore after reset */
   493                  __pci_read_msi_msg(entry, &msg);
   494                  entry->msg = msg;
   495          }
   496
   497          return 0;
   498  }
   499
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/[email protected]
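
For reference, is_kdump_kernel() is declared in include/linux/crash_dump.h,
which arch/powerpc/platforms/pseries/msi.c does not include yet, so a likely
fix for the build error above (untested sketch) is adding the missing
include near the top of the file:

  /* Make the is_kdump_kernel() declaration visible to msi.c */
  #include <linux/crash_dump.h>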