2013-09-05 13:42:40

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 0/6] AHCI: Conserve interrupts with pci_enable_msi_block_part() interface

This series is aimed to conserve on othewise wasted interrupt
resources for 10 of 16 unused MSI vectors for AHCI devices on
Intel chipsets.

Changes from v1:
- roundup_pow_of_two() and is_power_of_2() functions used
- patch 2/6 generic pci_get_msi_cap() interface introduced
- patch 4/6 MMC reg. value used to initialize multiple MSIs;
Fallback to single MSI mode is simplified
- patch 5/6 sanity check for MRSM bit added

Alexander Gordeev (6):
1/6 PCI/MSI: Introduce pci_enable_msi_block_part() interface
2/6 PCI/MSI: Factor out pci_get_msi_cap() interface
3/6 MSI/x86: Support pci_enable_msi_block_part() interface
4/6 AHCI: Conserve interrupts with pci_enable_msi_block_part() interface
5/6 AHCI: Check MRSM bit when multiple MSIs enabled
6/6 PCI/MSI: Get rid of pci_enable_msi_block_auto() interface

Documentation/PCI/MSI-HOWTO.txt | 68 ++++++++++++++++-------
arch/mips/pci/msi-octeon.c | 2 +-
arch/powerpc/kernel/msi.c | 4 +-
arch/s390/pci/pci.c | 2 +-
arch/x86/include/asm/pci.h | 8 ++-
arch/x86/include/asm/x86_init.h | 3 +-
arch/x86/kernel/apic/io_apic.c | 3 +-
drivers/ata/ahci.c | 60 ++++++++++++++------
drivers/ata/ahci.h | 1 +
drivers/iommu/irq_remapping.c | 14 +++---
drivers/pci/msi.c | 115 ++++++++++++++++++++++-----------------
include/linux/msi.h | 5 +-
include/linux/pci.h | 13 ++++-
13 files changed, 188 insertions(+), 110 deletions(-)

--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]


2013-09-05 12:50:34

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 1/6] PCI/MSI: Introduce pci_enable_msi_block_part() interface

There are PCI devices that require a particular value written
to the Multiple Message Enable (MME) register while aligned on
power of 2 boundary value of actually used MSI vectors 'nvec'
is a lesser of that MME value:

roundup_pow_of_two(nvec) < 'Multiple Message Enable'

However the existing pci_enable_msi_block() interface is not
able to configure such devices, since the value written to the
MME register is calculated from the number of requested MSIs
'nvec':

'Multiple Message Enable' = roundup_pow_of_two(nvec)

In this case the result written to the MME register may not
satisfy the aforementioned PCI devices requirement and therefore
the PCI functions will not operate in a desired mode.

This update introduces pci_enable_msi_block_part() extension to
pci_enable_msi_block() interface that accepts extra 'nvec_mme'
argument which is then written to the MME register while the
value of 'nvec' is still used to setup as many interrupts as
requested.

Signed-off-by: Alexander Gordeev <[email protected]>
---
Documentation/PCI/MSI-HOWTO.txt | 56 ++++++++++++++++++++++++----
arch/mips/pci/msi-octeon.c | 2 +-
arch/powerpc/kernel/msi.c | 4 +-
arch/s390/pci/pci.c | 2 +-
arch/x86/include/asm/pci.h | 8 +++--
arch/x86/include/asm/x86_init.h | 3 +-
arch/x86/kernel/apic/io_apic.c | 3 +-
drivers/iommu/irq_remapping.c | 2 +-
drivers/pci/msi.c | 77 ++++++++++++++++++++++++++-------------
include/linux/msi.h | 5 ++-
include/linux/pci.h | 8 ++++
11 files changed, 125 insertions(+), 45 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index a091780..32d7d15 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -127,7 +127,47 @@ on the number of vectors that can be allocated; pci_enable_msi_block()
returns as soon as it finds any constraint that doesn't allow the
call to succeed.

-4.2.3 pci_enable_msi_block_auto
+4.2.3 pci_enable_msi_block_part
+
+int pci_enable_msi_block_part(struct pci_dev *dev, int count, int alloc)
+
+This variation on the above call allows a device driver to request 'alloc'
+number of multiple MSIs while setup 'count' number of MSIs, which could be
+a lesser of 'alloc'. The MSI specification only allows interrupts to be
+allocated in powers of two, up to a maximum of 2^5 (32).
+
+In case the driver wants to allocate a maximum possible number of MSIs
+for the device it may pass a negative number as 'alloc' parameter.
+
+If this function returns 0, it has succeeded in allocating 'alloc'
+interrupts and setting up 'count' interrupts. In this case, the function
+enables MSI on this device and updates dev->irq to be the lowest of the
+new interrupts assigned to it. The other interrupts assigned to the
+device are in the range dev->irq to dev->irq + count - 1.
+
+If this function returns -ERANGE, it indicates 'count' is greater than
+'alloc' and the driver should adjust either or both parameters.
+
+If this function returns other negative number, it indicates an error
+and the driver should not attempt to request any more MSI interrupts
+for this device. If this function returns a positive number, it is
+less than 'alloc' and indicates the number of interrupts that could have
+been allocated. In neither case is the irq value updated or the device
+switched into MSI mode.
+
+The device driver must decide what action to take if
+pci_enable_msi_block_part() returns a value less than 'alloc'. For
+instance, the driver could still make use of fewer interrupts; in this
+case the driver should possibly adjust 'count' parameter and call
+pci_enable_msi_block_part() again or even call pci_enable_msi_block()
+instead. Note that it is not guaranteed to succeed, even when the
+'alloc' has been reduced to the value returned from a previous call to
+pci_enable_msi_block_part(). This is because there are multiple
+constraints on the number of vectors that can be allocated;
+pci_enable_msi_block_part() returns as soon as it finds any constraint
+that doesn't allow the call to succeed.
+
+4.2.4 pci_enable_msi_block_auto

int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *count)

@@ -153,16 +193,16 @@ succeeds, but returns a value less than the number of interrupts supported.
If the device driver does not need to know the number of interrupts
supported, it can set the pointer count to NULL.

-4.2.4 pci_disable_msi
+4.2.5 pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

-This function should be used to undo the effect of pci_enable_msi() or
-pci_enable_msi_block() or pci_enable_msi_block_auto(). Calling it restores
-dev->irq to the pin-based interrupt number and frees the previously
-allocated message signaled interrupt(s). The interrupt may subsequently be
-assigned to another device, so drivers should not cache the value of
-dev->irq.
+This function should be used to undo the effect of pci_enable_msi_block(),
+pci_enable_msi(), pci_enable_msi_block_auto() or pci_enable_msi_block_part().
+Calling it restores dev->irq to the pin-based interrupt number and frees the
+previously allocated message signaled interrupt(s). The interrupt may
+subsequently be assigned to another device, so drivers should not cache the
+value of dev->irq.

Before calling this function, a device driver must always call free_irq()
on any interrupt for which it previously called request_irq().
diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index d37be36..c9aaf8d 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -177,7 +177,7 @@ msi_irq_allocated:
return 0;
}

-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
{
struct msi_desc *entry;
int ret;
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..fc70513 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,7 +13,7 @@

#include <asm/machdep.h>

-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev* dev, int nvec, int nvec_mme, int type)
{
if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
pr_debug("msi: Platform doesn't provide MSI callbacks.\n");
@@ -32,7 +32,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
return 0;
}

-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type)
{
return ppc_md.setup_msi_irqs(dev, nvec, type);
}
diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index e2956ad..688a5db 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -538,7 +538,7 @@ static void zpci_teardown_msi(struct pci_dev *pdev)
aisb_max--;
}

-int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int nvec_mme, int type)
{
pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
if (type != PCI_CAP_ID_MSIX && type != PCI_CAP_ID_MSI)
diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index d9e9e6c..620642f 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -101,9 +101,10 @@ extern void pci_iommu_alloc(void);

#ifdef CONFIG_PCI_MSI
/* MSI arch specific hooks */
-static inline int x86_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+static inline int x86_setup_msi_irqs(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type)
{
- return x86_msi.setup_msi_irqs(dev, nvec, type);
+ return x86_msi.setup_msi_irqs(dev, nvec, nvec_mme, type);
}

static inline void x86_teardown_msi_irqs(struct pci_dev *dev)
@@ -125,7 +126,8 @@ static inline void x86_restore_msi_irqs(struct pci_dev *dev, int irq)
#define arch_restore_msi_irqs x86_restore_msi_irqs
/* implemented in arch/x86/kernel/apic/io_apic. */
struct msi_desc;
-int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int native_setup_msi_irqs(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type);
void native_teardown_msi_irq(unsigned int irq);
void native_restore_msi_irqs(struct pci_dev *dev, int irq);
int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index 828a156..04a8767 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -174,7 +174,8 @@ struct pci_dev;
struct msi_msg;

struct x86_msi_ops {
- int (*setup_msi_irqs)(struct pci_dev *dev, int nvec, int type);
+ int (*setup_msi_irqs)(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type);
void (*compose_msi_msg)(struct pci_dev *dev, unsigned int irq,
unsigned int dest, struct msi_msg *msg,
u8 hpet_id);
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 9ed796c..21f6a44 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3132,7 +3132,8 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
return 0;
}

-int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int native_setup_msi_irqs(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type)
{
unsigned int irq, irq_want;
struct msi_desc *msidesc;
diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 39f81ae..1a220a0 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -142,7 +142,7 @@ error:
}

static int irq_remapping_setup_msi_irqs(struct pci_dev *dev,
- int nvec, int type)
+ int nvec, int nvec_mme, int type)
{
if (type == PCI_CAP_ID_MSI)
return do_setup_msi_irqs(dev, nvec);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index aca7578..647e9b1 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -31,7 +31,8 @@ static int pci_msi_enable = 1;
/* Arch hooks */

#ifndef arch_msi_check_device
-int arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
+int arch_msi_check_device(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type)
{
return 0;
}
@@ -43,7 +44,8 @@ int arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
#endif

#ifdef HAVE_DEFAULT_MSI_SETUP_IRQS
-int default_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
+int default_setup_msi_irqs(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type)
{
struct msi_desc *entry;
int ret;
@@ -540,6 +542,7 @@ out_unroll:
* msi_capability_init - configure device's MSI capability structure
* @dev: pointer to the pci_dev data structure of MSI device function
* @nvec: number of interrupts to allocate
+ * @nvec_mme: number of interrupts to write to Multiple Message Enable register
*
* Setup the MSI capability structure of the device with the requested
* number of interrupts. A return value of zero indicates the successful
@@ -547,7 +550,7 @@ out_unroll:
* an error, and a positive return value indicates the number of interrupts
* which could have been allocated.
*/
-static int msi_capability_init(struct pci_dev *dev, int nvec)
+static int msi_capability_init(struct pci_dev *dev, int nvec, int nvec_mme)
{
struct msi_desc *entry;
int ret;
@@ -582,7 +585,7 @@ static int msi_capability_init(struct pci_dev *dev, int nvec)
list_add_tail(&entry->list, &dev->msi_list);

/* Configure MSI capability structure */
- ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
+ ret = arch_setup_msi_irqs(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
if (ret) {
msi_mask_irq(entry, mask, ~mask);
free_msi_irqs(dev);
@@ -700,7 +703,8 @@ static int msix_capability_init(struct pci_dev *dev,
if (ret)
return ret;

- ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
+ /* nvec_mme parameter does not make sense in case of MSI-X */
+ ret = arch_setup_msi_irqs(dev, nvec, -1, PCI_CAP_ID_MSIX);
if (ret)
goto error;

@@ -755,13 +759,15 @@ error:
* pci_msi_check_device - check whether MSI may be enabled on a device
* @dev: pointer to the pci_dev data structure of MSI device function
* @nvec: how many MSIs have been requested ?
+ * @nvec_mme: how many MSIs write to Multiple Message Enable register ?
* @type: are we checking for MSI or MSI-X ?
*
* Look at global flags, the device itself, and its parent busses
* to determine if MSI/-X are supported for the device. If MSI/-X is
* supported return 0, else return an error code.
**/
-static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
+static int pci_msi_check_device(struct pci_dev *dev,
+ int nvec, int nvec_mme, int type)
{
struct pci_bus *bus;
int ret;
@@ -789,27 +795,15 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
if (bus->bus_flags & PCI_BUS_FLAGS_NO_MSI)
return -EINVAL;

- ret = arch_msi_check_device(dev, nvec, type);
+ ret = arch_msi_check_device(dev, nvec, nvec_mme, type);
if (ret)
return ret;

return 0;
}

-/**
- * pci_enable_msi_block - configure device's MSI capability structure
- * @dev: device to configure
- * @nvec: number of interrupts to configure
- *
- * Allocate IRQs for a device with the MSI capability.
- * This function returns a negative errno if an error occurs. If it
- * is unable to allocate the number of interrupts requested, it returns
- * the number of interrupts it might be able to allocate. If it successfully
- * allocates at least the number of interrupts requested, it returns 0 and
- * updates the @dev's irq member to the lowest new interrupt number; the
- * other interrupt numbers allocated to this device are consecutive.
- */
-int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
+int pci_enable_msi_block_part(struct pci_dev *dev,
+ unsigned int nvec, int nvec_mme)
{
int status, maxvec;
u16 msgctl;
@@ -819,10 +813,17 @@ int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
- if (nvec > maxvec)
+
+ if (nvec_mme < 0)
+ nvec_mme = maxvec;
+ if (nvec_mme > maxvec)
return maxvec;
+ if (!is_power_of_2(nvec_mme))
+ return -EINVAL;
+ if (nvec > nvec_mme)
+ return -ERANGE;

- status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+ status = pci_msi_check_device(dev, nvec, nvec_mme, PCI_CAP_ID_MSI);
if (status)
return status;

@@ -835,9 +836,34 @@ int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
return -EINVAL;
}

- status = msi_capability_init(dev, nvec);
+ status = msi_capability_init(dev, nvec, nvec_mme);
return status;
}
+EXPORT_SYMBOL(pci_enable_msi_block_part);
+
+/**
+ * pci_enable_msi_block - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
+ *
+ * Allocate IRQs for a device with the MSI capability.
+ * This function returns a negative errno if an error occurs. If it
+ * is unable to allocate the number of interrupts requested, it returns
+ * the number of interrupts it might be able to allocate. If it successfully
+ * allocates at least the number of interrupts requested, it returns 0 and
+ * updates the @dev's irq member to the lowest new interrupt number; the
+ * other interrupt numbers allocated to this device are consecutive.
+ */
+int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
+{
+ /*
+ * Archtectures which do not support nvec_mme should ignore it.
+ * However, it would be surprising if an architecture write to
+ * the Multiple Message Enable register something else than nvec
+ * rounded up to the power of two.
+ */
+ return pci_enable_msi_block_part(dev, nvec, roundup_pow_of_two(nvec));
+}
EXPORT_SYMBOL(pci_enable_msi_block);

int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
@@ -941,7 +967,8 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
if (!entries || !dev->msix_cap)
return -EINVAL;

- status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+ /* nvec_mme parameter does not make sense in case of MSI-X */
+ status = pci_msi_check_device(dev, nvec, -1, PCI_CAP_ID_MSIX);
if (status)
return status;

diff --git a/include/linux/msi.h b/include/linux/msi.h
index ee66f3a..e27ad31 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -55,8 +55,9 @@ struct msi_desc {
*/
int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
void arch_teardown_msi_irq(unsigned int irq);
-int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
+int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme, int type);
void arch_teardown_msi_irqs(struct pci_dev *dev);
-int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_check_device(struct pci_dev* dev,
+ int nvec, int nvec_mme, int type);

#endif /* LINUX_MSI_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0fd1f15..6552cee 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1122,6 +1122,12 @@ struct msix_entry {


#ifndef CONFIG_PCI_MSI
+static inline int
+pci_enable_msi_block_part(struct pci_dev *dev, unsigned int nvec, int nvec_mme)
+{
+ return -1;
+}
+
static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
{
return -1;
@@ -1163,6 +1169,8 @@ static inline int pci_msi_enabled(void)
return 0;
}
#else
+int pci_enable_msi_block_part(struct pci_dev *dev,
+ unsigned int nvec, int nvec_mme);
int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec);
void pci_msi_shutdown(struct pci_dev *dev);
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 12:50:54

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Factor out the operation of retrieving the number of MSI
vectors the device requested, that is a value encoded in
the Multiple Message Capable register.

This change is a prerequisite for the forthcoming update
of the AHCI device driver.

Signed-off-by: Alexander Gordeev <[email protected]>
---
Documentation/PCI/MSI-HOWTO.txt | 12 ++++++++++++
drivers/pci/msi.c | 30 ++++++++++++++++++++----------
include/linux/pci.h | 6 ++++++
3 files changed, 38 insertions(+), 10 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 32d7d15..205eb5d 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -209,6 +209,18 @@ on any interrupt for which it previously called request_irq().
Failure to do so results in a BUG_ON(), leaving the device with
MSI enabled and thus leaking its vector.

+4.2.6 pci_get_msi_cap
+
+int pci_get_msi_cap(struct pci_dev *dev)
+
+This function could be used to retrieve the number of MSI vectors the
+device requested (via the Multiple Message Capable register). The MSI
+specification only allows the returned value to be a power of two,
+up to a maximum of 2^5 (32).
+
+If this function returns a negative number, it indicates the device is
+not capable of sending MSIs.
+
4.3 Using MSI-X

The MSI-X capability is much more flexible than the MSI capability.
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 647e9b1..c50d518 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -802,17 +802,30 @@ static int pci_msi_check_device(struct pci_dev *dev,
return 0;
}

-int pci_enable_msi_block_part(struct pci_dev *dev,
- unsigned int nvec, int nvec_mme)
+int pci_get_msi_cap(struct pci_dev *dev)
{
- int status, maxvec;
+ int ret;
u16 msgctl;

if (!dev->msi_cap)
return -EINVAL;

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
- maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+ ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+
+ return ret;
+}
+EXPORT_SYMBOL(pci_get_msi_cap);
+
+int pci_enable_msi_block_part(struct pci_dev *dev,
+ unsigned int nvec, int nvec_mme)
+{
+ int status, maxvec;
+ u16 msgctl;
+
+ maxvec = pci_get_msi_cap(dev);
+ if (maxvec < 0)
+ return maxvec;

if (nvec_mme < 0)
nvec_mme = maxvec;
@@ -869,13 +882,10 @@ EXPORT_SYMBOL(pci_enable_msi_block);
int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
{
int ret, nvec;
- u16 msgctl;
-
- if (!dev->msi_cap)
- return -EINVAL;

- pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
- ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+ ret = pci_get_msi_cap(dev);
+ if (ret < 0)
+ return ret;

if (maxvec)
*maxvec = ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 6552cee..b866048 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1122,6 +1122,11 @@ struct msix_entry {


#ifndef CONFIG_PCI_MSI
+static inline int pci_get_msi_cap(struct pci_dev *dev)
+{
+ return -1;
+}
+
static inline int
pci_enable_msi_block_part(struct pci_dev *dev, unsigned int nvec, int nvec_mme)
{
@@ -1169,6 +1174,7 @@ static inline int pci_msi_enabled(void)
return 0;
}
#else
+int pci_get_msi_cap(struct pci_dev *dev);
int pci_enable_msi_block_part(struct pci_dev *dev,
unsigned int nvec, int nvec_mme);
int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 12:51:29

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 3/6] MSI/x86: Support pci_enable_msi_block_part() interface

This change is a prerequisite for the forthcoming update
of the AHCI device driver to conserve 10/16 MSIs on Intel
chipsets. The update makes use of 'nvec_mme' parameter
for pci_enable_msi_block_part() interface.

Signed-off-by: Alexander Gordeev <[email protected]>
---
drivers/iommu/irq_remapping.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/irq_remapping.c b/drivers/iommu/irq_remapping.c
index 1a220a0..7671da3 100644
--- a/drivers/iommu/irq_remapping.c
+++ b/drivers/iommu/irq_remapping.c
@@ -49,9 +49,9 @@ static void irq_remapping_disable_io_apic(void)
disconnect_bsp_APIC(0);
}

-static int do_setup_msi_irqs(struct pci_dev *dev, int nvec)
+static int do_setup_msi_irqs(struct pci_dev *dev, int nvec, int nvec_mme)
{
- int node, ret, sub_handle, nvec_pow2, index = 0;
+ int node, ret, sub_handle, index = 0;
unsigned int irq;
struct msi_desc *msidesc;

@@ -60,18 +60,18 @@ static int do_setup_msi_irqs(struct pci_dev *dev, int nvec)
WARN_ON(msidesc->irq);
WARN_ON(msidesc->msi_attrib.multiple);
WARN_ON(msidesc->nvec_used);
+ BUG_ON(!is_power_of_2(nvec_mme));

node = dev_to_node(&dev->dev);
irq = __create_irqs(get_nr_irqs_gsi(), nvec, node);
if (irq == 0)
return -ENOSPC;

- nvec_pow2 = __roundup_pow_of_two(nvec);
msidesc->nvec_used = nvec;
- msidesc->msi_attrib.multiple = ilog2(nvec_pow2);
+ msidesc->msi_attrib.multiple = ilog2(nvec_mme);
for (sub_handle = 0; sub_handle < nvec; sub_handle++) {
if (!sub_handle) {
- index = msi_alloc_remapped_irq(dev, irq, nvec_pow2);
+ index = msi_alloc_remapped_irq(dev, irq, nvec_mme);
if (index < 0) {
ret = index;
goto error;
@@ -145,7 +145,7 @@ static int irq_remapping_setup_msi_irqs(struct pci_dev *dev,
int nvec, int nvec_mme, int type)
{
if (type == PCI_CAP_ID_MSI)
- return do_setup_msi_irqs(dev, nvec);
+ return do_setup_msi_irqs(dev, nvec, nvec_mme);
else
return do_setup_msix_irqs(dev, nvec);
}
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 12:52:00

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 4/6] AHCI: Conserve interrupts with pci_enable_msi_block_part() interface

Make use of the new pci_enable_msi_block_part() interface
and conserve on othewise wasted interrupt resources for 10
of 16 unused MSI vectors on Intel chipsets.

Signed-off-by: Alexander Gordeev <[email protected]>
---
drivers/ata/ahci.c | 46 +++++++++++++++++++++++++++-------------------
1 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index db4380d..a6bb618 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1091,26 +1091,34 @@ static inline void ahci_gtf_filter_workaround(struct ata_host *host)
{}
#endif

-int ahci_init_interrupts(struct pci_dev *pdev, struct ahci_host_priv *hpriv)
+int ahci_init_interrupts(struct pci_dev *pdev, unsigned int n_ports,
+ struct ahci_host_priv *hpriv)
{
int rc;
unsigned int maxvec;

- if (!(hpriv->flags & AHCI_HFLAG_NO_MSI)) {
- rc = pci_enable_msi_block_auto(pdev, &maxvec);
- if (rc > 0) {
- if ((rc == maxvec) || (rc == 1))
- return rc;
- /*
- * Assume that advantage of multipe MSIs is negated,
- * so fallback to single MSI mode to save resources
- */
- pci_disable_msi(pdev);
- if (!pci_enable_msi(pdev))
- return 1;
- }
- }
+ if (hpriv->flags & AHCI_HFLAG_NO_MSI)
+ goto intx;
+
+ maxvec = pci_get_msi_cap(pdev);
+ if (maxvec < 0)
+ goto intx;
+ if (maxvec < n_ports)
+ goto single_msi;
+
+ rc = pci_enable_msi_block_part(pdev, n_ports, maxvec);
+ if (rc < 0)
+ goto intx;
+ if (rc)
+ goto single_msi;
+
+ return maxvec;

+single_msi:
+ if (!pci_enable_msi(pdev))
+ return 1;
+
+intx:
pci_intx(pdev, 1);
return 0;
}
@@ -1277,10 +1285,6 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)

hpriv->mmio = pcim_iomap_table(pdev)[ahci_pci_bar];

- n_msis = ahci_init_interrupts(pdev, hpriv);
- if (n_msis > 1)
- hpriv->flags |= AHCI_HFLAG_MULTI_MSI;
-
/* save initial config */
ahci_pci_save_initial_config(pdev, hpriv);

@@ -1327,6 +1331,10 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
*/
n_ports = max(ahci_nr_ports(hpriv->cap), fls(hpriv->port_map));

+ n_msis = ahci_init_interrupts(pdev, n_ports, hpriv);
+ if (n_msis > 1)
+ hpriv->flags |= AHCI_HFLAG_MULTI_MSI;
+
host = ata_host_alloc_pinfo(&pdev->dev, ppi, n_ports);
if (!host)
return -ENOMEM;
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 12:52:25

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 5/6] AHCI: Check MRSM bit when multiple MSIs enabled

Do not trust the hardware and always check if MSI
Revert to Single Message mode was enforced. Fall
back to the single MSI mode in case it did. Not
doing so might screw up the interrupt handling.

Signed-off-by: Alexander Gordeev <[email protected]>
---
drivers/ata/ahci.c | 16 ++++++++++++++++
drivers/ata/ahci.h | 1 +
2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index a6bb618..af5d535 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -1091,6 +1091,14 @@ static inline void ahci_gtf_filter_workaround(struct ata_host *host)
{}
#endif

+static int ahci_get_mrsm(struct ahci_host_priv *hpriv)
+{
+ void __iomem *mmio = hpriv->mmio;
+ u32 ctl = readl(mmio + HOST_CTL);
+
+ return (ctl & HOST_MRSM);
+}
+
int ahci_init_interrupts(struct pci_dev *pdev, unsigned int n_ports,
struct ahci_host_priv *hpriv)
{
@@ -1111,6 +1119,14 @@ int ahci_init_interrupts(struct pci_dev *pdev, unsigned int n_ports,
goto intx;
if (rc)
goto single_msi;
+ if (maxvec == 1)
+ return 1;
+
+ if (ahci_get_mrsm(hpriv)) {
+ pci_disable_msi(pdev);
+ printk(KERN_INFO "ahci: MRSM is on, fallback to single MSI\n");
+ goto single_msi;
+ }

return maxvec;

diff --git a/drivers/ata/ahci.h b/drivers/ata/ahci.h
index 1145637..19bc846 100644
--- a/drivers/ata/ahci.h
+++ b/drivers/ata/ahci.h
@@ -91,6 +91,7 @@ enum {
/* HOST_CTL bits */
HOST_RESET = (1 << 0), /* reset controller; self-clear */
HOST_IRQ_EN = (1 << 1), /* global IRQ enable */
+ HOST_MRSM = (1 << 2), /* MSI Revert to Single Message */
HOST_AHCI_EN = (1 << 31), /* AHCI enabled */

/* HOST_CAP bits */
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 12:52:52

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH v2 6/6] PCI/MSI: Get rid of pci_enable_msi_block_auto() interface

There are no consumers of pci_enable_msi_block_auto()
interface have left. Eliminate it for now.

Signed-off-by: Alexander Gordeev <[email protected]>
---
Documentation/PCI/MSI-HOWTO.txt | 42 +++++++-------------------------------
drivers/pci/msi.c | 22 --------------------
include/linux/pci.h | 7 ------
3 files changed, 8 insertions(+), 63 deletions(-)

diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
index 205eb5d..6c64151 100644
--- a/Documentation/PCI/MSI-HOWTO.txt
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -167,49 +167,23 @@ constraints on the number of vectors that can be allocated;
pci_enable_msi_block_part() returns as soon as it finds any constraint
that doesn't allow the call to succeed.

-4.2.4 pci_enable_msi_block_auto
-
-int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *count)
-
-This variation on pci_enable_msi() call allows a device driver to request
-the maximum possible number of MSIs. The MSI specification only allows
-interrupts to be allocated in powers of two, up to a maximum of 2^5 (32).
-
-If this function returns a positive number, it indicates that it has
-succeeded and the returned value is the number of allocated interrupts. In
-this case, the function enables MSI on this device and updates dev->irq to
-be the lowest of the new interrupts assigned to it. The other interrupts
-assigned to the device are in the range dev->irq to dev->irq + returned
-value - 1.
-
-If this function returns a negative number, it indicates an error and
-the driver should not attempt to request any more MSI interrupts for
-this device.
-
-If the device driver needs to know the number of interrupts the device
-supports it can pass the pointer count where that number is stored. The
-device driver must decide what action to take if pci_enable_msi_block_auto()
-succeeds, but returns a value less than the number of interrupts supported.
-If the device driver does not need to know the number of interrupts
-supported, it can set the pointer count to NULL.
-
-4.2.5 pci_disable_msi
+4.2.4 pci_disable_msi

void pci_disable_msi(struct pci_dev *dev)

-This function should be used to undo the effect of pci_enable_msi_block(),
-pci_enable_msi(), pci_enable_msi_block_auto() or pci_enable_msi_block_part().
-Calling it restores dev->irq to the pin-based interrupt number and frees the
-previously allocated message signaled interrupt(s). The interrupt may
-subsequently be assigned to another device, so drivers should not cache the
-value of dev->irq.
+This function should be used to undo the effect of pci_enable_msi() or
+pci_enable_msi_block() or pci_enable_msi_block_part(). Calling it restores
+dev->irq to the pin-based interrupt number and frees the previously
+allocated message signaled interrupt(s). The interrupt may subsequently be
+assigned to another device, so drivers should not cache the value of
+dev->irq.

Before calling this function, a device driver must always call free_irq()
on any interrupt for which it previously called request_irq().
Failure to do so results in a BUG_ON(), leaving the device with
MSI enabled and thus leaking its vector.

-4.2.6 pci_get_msi_cap
+4.2.5 pci_get_msi_cap

int pci_get_msi_cap(struct pci_dev *dev)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index c50d518..d378600 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -879,28 +879,6 @@ int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
}
EXPORT_SYMBOL(pci_enable_msi_block);

-int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
-{
- int ret, nvec;
-
- ret = pci_get_msi_cap(dev);
- if (ret < 0)
- return ret;
-
- if (maxvec)
- *maxvec = ret;
-
- do {
- nvec = ret;
- ret = pci_enable_msi_block(dev, nvec);
- } while (ret > 0);
-
- if (ret < 0)
- return ret;
- return nvec;
-}
-EXPORT_SYMBOL(pci_enable_msi_block_auto);
-
void pci_msi_shutdown(struct pci_dev *dev)
{
struct msi_desc *desc;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index b866048..c5c37de 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1138,12 +1138,6 @@ static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
return -1;
}

-static inline int
-pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
-{
- return -1;
-}
-
static inline void pci_msi_shutdown(struct pci_dev *dev)
{ }
static inline void pci_disable_msi(struct pci_dev *dev)
@@ -1178,7 +1172,6 @@ int pci_get_msi_cap(struct pci_dev *dev);
int pci_enable_msi_block_part(struct pci_dev *dev,
unsigned int nvec, int nvec_mme);
int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
-int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec);
void pci_msi_shutdown(struct pci_dev *dev);
void pci_disable_msi(struct pci_dev *dev);
int pci_msix_table_size(struct pci_dev *dev);
--
1.7.7.6


--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 13:09:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello, Alexander.

On Thu, Sep 05, 2013 at 02:52:47PM +0200, Alexander Gordeev wrote:
> -int pci_enable_msi_block_part(struct pci_dev *dev,
> - unsigned int nvec, int nvec_mme)
> +int pci_get_msi_cap(struct pci_dev *dev)
> {
> - int status, maxvec;
> + int ret;
> u16 msgctl;
>
> if (!dev->msi_cap)
> return -EINVAL;
>
> pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
> - maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
> + ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(pci_get_msi_cap);

One curiosity - with the above factored out, is
pci_enable_msi_block_part() returning positive number still necessary?
I followed most of code paths in x86 and nothing seems to need it and
positive return seems to be just causing confusion - ie. returning 1
on multiple msi config failure from some functions, which is silly.

Thanks.

--
tejun

2013-09-05 13:10:56

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] AHCI: Conserve interrupts with pci_enable_msi_block_part() interface

Hello,

On Thu, Sep 05, 2013 at 02:53:47PM +0200, Alexander Gordeev wrote:
> Make use of the new pci_enable_msi_block_part() interface
> and conserve on othewise wasted interrupt resources for 10
> of 16 unused MSI vectors on Intel chipsets.
>
> Signed-off-by: Alexander Gordeev <[email protected]>

Acked-by: Tejun Heo <[email protected]>

One more question tho. How much does saving some interrupt vectors
matter? Are int vectors contrained resource on some configurations?

Thanks.

--
tejun

2013-09-05 13:11:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] AHCI: Check MRSM bit when multiple MSIs enabled

On Thu, Sep 05, 2013 at 02:54:17PM +0200, Alexander Gordeev wrote:
> Do not trust the hardware and always check if MSI
> Revert to Single Message mode was enforced. Fall
> back to the single MSI mode in case it did. Not
> doing so might screw up the interrupt handling.
>
> Signed-off-by: Alexander Gordeev <[email protected]>

Acked-by: Tejun Heo <[email protected]>

The patch should probably be re-sequenced to the head of the series
and cc [email protected]. Not checking MRSM is scary.

Thanks.

--
tejun

2013-09-05 14:32:41

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] AHCI: Check MRSM bit when multiple MSIs enabled

Hello.

On 09/05/2013 04:54 PM, Alexander Gordeev wrote:

> Do not trust the hardware and always check if MSI
> Revert to Single Message mode was enforced. Fall
> back to the single MSI mode in case it did. Not
> doing so might screw up the interrupt handling.

> Signed-off-by: Alexander Gordeev <[email protected]>
> ---
> drivers/ata/ahci.c | 16 ++++++++++++++++
> drivers/ata/ahci.h | 1 +
> 2 files changed, 17 insertions(+), 0 deletions(-)

> diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
> index a6bb618..af5d535 100644
> --- a/drivers/ata/ahci.c
> +++ b/drivers/ata/ahci.c
> @@ -1091,6 +1091,14 @@ static inline void ahci_gtf_filter_workaround(struct ata_host *host)
> {}
> #endif
>
> +static int ahci_get_mrsm(struct ahci_host_priv *hpriv)
> +{
> + void __iomem *mmio = hpriv->mmio;
> + u32 ctl = readl(mmio + HOST_CTL);
> +
> + return (ctl & HOST_MRSM);

Parens not needed here, in case you'll respin this once more.

WBR, Sergei

2013-09-05 15:01:10

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 05, 2013 at 09:09:02AM -0400, Tejun Heo wrote:
> Hello, Alexander.
>
> On Thu, Sep 05, 2013 at 02:52:47PM +0200, Alexander Gordeev wrote:
> One curiosity - with the above factored out, is
> pci_enable_msi_block_part() returning positive number still necessary?
> I followed most of code paths in x86 and nothing seems to need it and
> positive return seems to be just causing confusion - ie. returning 1
> on multiple msi config failure from some functions, which is silly.

You mean we could treat positive numbers returned by architecture as
failures and translate it into negative error codes?
If so, I would prefer not to do this for two reasons:
1. It will not be possible to call pci_enable_msi_block_part() in a loop.
Although there are no consumers right now I think the very possibility
is better to keep.
2. The semantics of pci_enable_msi_block_part() is very close to
pci_enable_msi_block(). I believe having a consisting interface for
these two helps readability.

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 15:04:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Thu, Sep 05, 2013 at 05:03:00PM +0200, Alexander Gordeev wrote:
> You mean we could treat positive numbers returned by architecture as
> failures and translate it into negative error codes?
> If so, I would prefer not to do this for two reasons:
> 1. It will not be possible to call pci_enable_msi_block_part() in a loop.
> Although there are no consumers right now I think the very possibility
> is better to keep.
> 2. The semantics of pci_enable_msi_block_part() is very close to
> pci_enable_msi_block(). I believe having a consisting interface for
> these two helps readability.

The thing is, do we even have cases where arch code returns positive
return to indicate possible partial allocation? If not, the whole
interface is convoluted for no good reason and we can just make
everything return 0 or -errno, which is a lot simpler. No?

Thanks.

--
tejun

2013-09-05 15:21:59

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 4/6] AHCI: Conserve interrupts with pci_enable_msi_block_part() interface

On Thu, Sep 05, 2013 at 09:10:50AM -0400, Tejun Heo wrote:
> One more question tho. How much does saving some interrupt vectors
> matter? Are int vectors contrained resource on some configurations?

Honestly, I am not aware of any configuration struggling with interrupt
vectors. However, conserving on x86 interrupt vector space is always a
good idea IMO. Also, PPC pSeries has a concept of MSI quota, so..

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 15:23:46

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 5/6] AHCI: Check MRSM bit when multiple MSIs enabled

On Thu, Sep 05, 2013 at 09:11:49AM -0400, Tejun Heo wrote:
> On Thu, Sep 05, 2013 at 02:54:17PM +0200, Alexander Gordeev wrote:
> The patch should probably be re-sequenced to the head of the series
> and cc [email protected]. Not checking MRSM is scary.

Will resend v3.

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 15:38:48

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> The thing is, do we even have cases where arch code returns positive
> return to indicate possible partial allocation? If not, the whole
> interface is convoluted for no good reason and we can just make
> everything return 0 or -errno, which is a lot simpler. No?

As I mentioned in my other note, at least PPC has a concept of MSI quota,
so exceeding it would be the very case, I believe.

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 15:44:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Thu, Sep 05, 2013 at 05:40:42PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> > The thing is, do we even have cases where arch code returns positive
> > return to indicate possible partial allocation? If not, the whole
> > interface is convoluted for no good reason and we can just make
> > everything return 0 or -errno, which is a lot simpler. No?
>
> As I mentioned in my other note, at least PPC has a concept of MSI quota,
> so exceeding it would be the very case, I believe.

Given that multiple MSI is something which isn't too popular / already
superseded and that the condition is highly unlikely, do we really
care about possible partial success? This sort of interface is
unnecessarily complex and actively harmful. It forces all users to
wonder what could possibly happen and implement all sorts of nutty
fallback logic which is highly likely to be error-prone on both the
software and hardware side. Seriously, how much testing would such
code path would get both on the driver and firmware sides?

It's an operation which isn't too likely to fail with a firm
known-to-work fallback. It's pointless and error-prone to try to
extract the last point zero zero one percent.

Thanks.

--
tejun

2013-09-05 18:52:49

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 05, 2013 at 11:44:36AM -0400, Tejun Heo wrote:
> Given that multiple MSI is something which isn't too popular / already
> superseded and that the condition is highly unlikely, do we really
> care about possible partial success? This sort of interface is
> unnecessarily complex and actively harmful. It forces all users to
> wonder what could possibly happen and implement all sorts of nutty
> fallback logic which is highly likely to be error-prone on both the
> software and hardware side. Seriously, how much testing would such
> code path would get both on the driver and firmware sides?
>
> It's an operation which isn't too likely to fail with a firm
> known-to-work fallback. It's pointless and error-prone to try to
> extract the last point zero zero one percent.

I assume reasons for having this type of interface at the moment of
taking design decision about pci_enable_msi_block() still hold true.
I do not know what those reasons were, but I think the fact multiple
MSIs are rarely used and MSI-X exists does not invalidate them now.

I did consider the other argument - since pci_enable_msi_block_part()
is explicitly provided with a value of MME the caller will not be
satisfied with any other value and hence a repeated call with a lesser
MME does not make sense for the caller. Therefore we could just fail
in case the architecture returned a positive value. Same result, but
different reasoning.

At the moment I still prefer pci_enable_msi_block_part() to be similar
to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
but I would not dare to scrap it all right away.

Bjorn?

>
> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-05 20:06:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello, Alexander.

On Thu, Sep 05, 2013 at 08:54:40PM +0200, Alexander Gordeev wrote:
> I assume reasons for having this type of interface at the moment of
> taking design decision about pci_enable_msi_block() still hold true.
> I do not know what those reasons were, but I think the fact multiple
> MSIs are rarely used and MSI-X exists does not invalidate them now.

Well, it does change the underlying assumptions to make trade-offs
against. If something is widely used, expected to continue to expand,
additional complexity to achieve better outcome is likely to be more
justifiable. Nothing exists in vacuum. That said, I'm not even sure
whether we want this sort of interface even if multiple MSI were still
the hot thing.

> I did consider the other argument - since pci_enable_msi_block_part()
> is explicitly provided with a value of MME the caller will not be
> satisfied with any other value and hence a repeated call with a lesser
> MME does not make sense for the caller. Therefore we could just fail
> in case the architecture returned a positive value. Same result, but
> different reasoning.

Just making the whole thing including arch methods to return 0/-errno
would probably be a lot cleaner.

> At the moment I still prefer pci_enable_msi_block_part() to be similar
> to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
> but I would not dare to scrap it all right away.

Yeah, of course, pci_enable_msi_block() would need to be updated to
match too. I understand this is going a bit off the original scope of
the patchset but I can't help but cringing at the interface and the
resulting "fallback" logic it ends up creating in its users. Bjorn,
what do you think?

Thanks.

--
tejun

2013-09-06 13:15:48

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 05, 2013 at 05:40:41PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 05, 2013 at 11:04:42AM -0400, Tejun Heo wrote:
> > The thing is, do we even have cases where arch code returns positive
> > return to indicate possible partial allocation? If not, the whole
> > interface is convoluted for no good reason and we can just make
> > everything return 0 or -errno, which is a lot simpler. No?
>
> As I mentioned in my other note, at least PPC has a concept of MSI quota,
> so exceeding it would be the very case, I believe.

An update here. No other arch supports multiple MSIs besides Intel (+IOMMU).
Yet, ironically, the concept of possible partial allocation was adopted from
MSI-X. I briefly checked all archs and it seems none of them throws out a
possible other number of MSIs > 1 (including PPC I mistakenly read).

--
Regards,
Alexander Gordeev
[email protected]

2013-09-06 16:02:01

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 5, 2013 at 2:06 PM, Tejun Heo <[email protected]> wrote:
> Hello, Alexander.
>
> On Thu, Sep 05, 2013 at 08:54:40PM +0200, Alexander Gordeev wrote:
>> I assume reasons for having this type of interface at the moment of
>> taking design decision about pci_enable_msi_block() still hold true.
>> I do not know what those reasons were, but I think the fact multiple
>> MSIs are rarely used and MSI-X exists does not invalidate them now.
>
> Well, it does change the underlying assumptions to make trade-offs
> against. If something is widely used, expected to continue to expand,
> additional complexity to achieve better outcome is likely to be more
> justifiable. Nothing exists in vacuum. That said, I'm not even sure
> whether we want this sort of interface even if multiple MSI were still
> the hot thing.
>
>> I did consider the other argument - since pci_enable_msi_block_part()
>> is explicitly provided with a value of MME the caller will not be
>> satisfied with any other value and hence a repeated call with a lesser
>> MME does not make sense for the caller. Therefore we could just fail
>> in case the architecture returned a positive value. Same result, but
>> different reasoning.
>
> Just making the whole thing including arch methods to return 0/-errno
> would probably be a lot cleaner.
>
>> At the moment I still prefer pci_enable_msi_block_part() to be similar
>> to pci_enable_msi_block(). I do agree the fallback logic is error-prone,
>> but I would not dare to scrap it all right away.
>
> Yeah, of course, pci_enable_msi_block() would need to be updated to
> match too. I understand this is going a bit off the original scope of
> the patchset but I can't help but cringing at the interface and the
> resulting "fallback" logic it ends up creating in its users. Bjorn,
> what do you think?

Sorry, I haven't jumped in here yet because I saw your discussion and
was hoping you guys would figure something out without my help. It
will take me a few hours to look into this and come up with anything
constructive to say.

I do remember disliking the complicated interface of
pci_enable_msi_block() (return negative errno, return positive "we
might be able to do this" values, or zero), but I'll have to do some
more research before I can say much more than that.

Bjorn

2013-09-06 16:06:29

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello, Bjorn.

On Fri, Sep 06, 2013 at 10:01:38AM -0600, Bjorn Helgaas wrote:
> Sorry, I haven't jumped in here yet because I saw your discussion and
> was hoping you guys would figure something out without my help. It
> will take me a few hours to look into this and come up with anything
> constructive to say.
>
> I do remember disliking the complicated interface of
> pci_enable_msi_block() (return negative errno, return positive "we
> might be able to do this" values, or zero), but I'll have to do some
> more research before I can say much more than that.

According to Alexander, it doesn't even seem like we have any actual
use case for the positive return numbers. I say just rip it out and
do the regular 0/-errno all the way through.

Thanks.

--
tejun

2013-09-06 23:32:16

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Fri, Sep 06, 2013 at 12:06:21PM -0400, Tejun Heo wrote:
> Hello, Bjorn.
>
> On Fri, Sep 06, 2013 at 10:01:38AM -0600, Bjorn Helgaas wrote:
> > Sorry, I haven't jumped in here yet because I saw your discussion and
> > was hoping you guys would figure something out without my help. It
> > will take me a few hours to look into this and come up with anything
> > constructive to say.
> >
> > I do remember disliking the complicated interface of
> > pci_enable_msi_block() (return negative errno, return positive "we
> > might be able to do this" values, or zero), but I'll have to do some
> > more research before I can say much more than that.
>
> According to Alexander, it doesn't even seem like we have any actual
> use case for the positive return numbers. I say just rip it out and
> do the regular 0/-errno all the way through.

I agree, that would be much simpler.

I propose that you rework it that way, and at least find out what
(if anything) would break if we do that. Or maybe we just give up
some optimization; it would be nice to quantify that, too.

Bjorn

2013-09-09 15:18:55

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> I propose that you rework it that way, and at least find out what
> (if anything) would break if we do that. Or maybe we just give up
> some optimization; it would be nice to quantify that, too.

Hi Bjorn,

The series is what it seems a direction to take.

Looks like we need PPC folks to agree on the quota check update
for pSeries (yes, they do bail out with a positive return value
from arch_msi_check_device()):

quota = msi_quota_for_device(pdev, nvec);

if (quota && quota < nvec)
return quota;

return 0;

> Bjorn

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:20:33

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 1/9] PCI/MSI/PPC: Fix wrong RTAS error code reporting

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/powerpc/platforms/pseries/msi.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index 6d2f0ab..009ec73 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -467,7 +467,7 @@ again:
list_for_each_entry(entry, &pdev->msi_list, list) {
hwirq = rtas_query_irq_number(pdn, i++);
if (hwirq < 0) {
- pr_debug("rtas_msi: error (%d) getting hwirq\n", rc);
+ pr_debug("rtas_msi: error (%d) getting hwirq\n", hwirq);
return hwirq;
}

--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:21:02

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 2/9] PCI/MSI/PPC: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/powerpc/kernel/msi.c | 2 +-
arch/powerpc/platforms/pseries/msi.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..36d70b9 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -22,7 +22,7 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)

/* PowerPC doesn't support multiple MSI yet */
if (type == PCI_CAP_ID_MSI && nvec > 1)
- return 1;
+ return -EINVAL;

if (ppc_md.msi_check_device) {
pr_debug("msi: Using platform check routine.\n");
diff --git a/arch/powerpc/platforms/pseries/msi.c b/arch/powerpc/platforms/pseries/msi.c
index 009ec73..89648c1 100644
--- a/arch/powerpc/platforms/pseries/msi.c
+++ b/arch/powerpc/platforms/pseries/msi.c
@@ -348,7 +348,7 @@ static int rtas_msi_check_device(struct pci_dev *pdev, int nvec, int type)
quota = msi_quota_for_device(pdev, nvec);

if (quota && quota < nvec)
- return quota;
+ return -ENOSPC;

return 0;
}
--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:22:15

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 3/9] PCI/MSI/x86: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/x86/kernel/apic/io_apic.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 9ed796c..4a95d9a 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -3140,7 +3140,7 @@ int native_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)

/* Multiple MSI vectors only supported with interrupt remapping */
if (type == PCI_CAP_ID_MSI && nvec > 1)
- return 1;
+ return -EINVAL;

node = dev_to_node(&dev->dev);
irq_want = nr_irqs_gsi;
--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:23:00

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 4/9] PCI/MSI/MIPS: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/mips/pci/msi-octeon.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/mips/pci/msi-octeon.c b/arch/mips/pci/msi-octeon.c
index d37be36..0ee5f4d 100644
--- a/arch/mips/pci/msi-octeon.c
+++ b/arch/mips/pci/msi-octeon.c
@@ -193,7 +193,7 @@ int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
* override arch_setup_msi_irqs()
*/
if (type == PCI_CAP_ID_MSI && nvec > 1)
- return 1;
+ return -EINVAL;

list_for_each_entry(entry, &dev->msi_list, list) {
ret = arch_setup_msi_irq(dev, entry);
--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:23:55

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 2/9] PCI/MSI/PPC: Make return values only 0/-errno when MSIs allocated[PATCH 5/9] PCI/MSI/s390: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/s390/pci/pci.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index f17a834..c79c6e4 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -427,6 +427,8 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
if (type != PCI_CAP_ID_MSIX && type != PCI_CAP_ID_MSI)
return -EINVAL;
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
msi_vecs = min(nvec, ZPCI_MSI_VEC_MAX);
msi_vecs = min_t(unsigned int, msi_vecs, CONFIG_PCI_NR_MSI);

--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:25:11

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 5/9] PCI/MSI/s390: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/s390/pci/pci.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index f17a834..c79c6e4 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -427,6 +427,8 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
if (type != PCI_CAP_ID_MSIX && type != PCI_CAP_ID_MSI)
return -EINVAL;
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
msi_vecs = min(nvec, ZPCI_MSI_VEC_MAX);
msi_vecs = min_t(unsigned int, msi_vecs, CONFIG_PCI_NR_MSI);

--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:25:49

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 6/9] PCI/MSI/s390: Remove superfluous check of MSI type

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/s390/pci/pci.c | 2 --
1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index c79c6e4..61a3c2c 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -425,8 +425,6 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
int rc;

pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
- if (type != PCI_CAP_ID_MSIX && type != PCI_CAP_ID_MSI)
- return -EINVAL;
if (type == PCI_CAP_ID_MSI && nvec > 1)
return 1;
msi_vecs = min(nvec, ZPCI_MSI_VEC_MAX);
--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:26:27

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 7/9] PCI/MSI/s390: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
arch/s390/pci/pci.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
index 61a3c2c..45a1875 100644
--- a/arch/s390/pci/pci.c
+++ b/arch/s390/pci/pci.c
@@ -426,7 +426,7 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)

pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
if (type == PCI_CAP_ID_MSI && nvec > 1)
- return 1;
+ return -EINVAL;
msi_vecs = min(nvec, ZPCI_MSI_VEC_MAX);
msi_vecs = min_t(unsigned int, msi_vecs, CONFIG_PCI_NR_MSI);

--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:27:18

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 8/9] PCI/MSI: Fix return value when populate_msi_sysfs() failed

Signed-off-by: Alexander Gordeev <[email protected]>
---
drivers/pci/msi.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index aca7578..21e6471 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -702,7 +702,7 @@ static int msix_capability_init(struct pci_dev *dev,

ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
if (ret)
- goto error;
+ goto out_avail;

/*
* Some devices require MSI-X to be enabled before we can touch the
@@ -715,10 +715,8 @@ static int msix_capability_init(struct pci_dev *dev,
msix_program_entries(dev, entries);

ret = populate_msi_sysfs(dev);
- if (ret) {
- ret = 0;
- goto error;
- }
+ if (ret)
+ goto out_free;

/* Set MSI-X enabled bits and unmask the function */
pci_intx_for_msi(dev, 0);
@@ -729,7 +727,7 @@ static int msix_capability_init(struct pci_dev *dev,

return 0;

-error:
+out_avail:
if (ret < 0) {
/*
* If we had some success, report the number of irqs
@@ -746,6 +744,7 @@ error:
ret = avail;
}

+out_free:
free_msi_irqs(dev);

return ret;
--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:28:00

by Alexander Gordeev

[permalink] [raw]
Subject: [PATCH 9/9] PCI/MSI: Make return values only 0/-errno when MSIs allocated

Signed-off-by: Alexander Gordeev <[email protected]>
---
drivers/pci/msi.c | 87 +++++++++++++++++++----------------------------------
1 files changed, 31 insertions(+), 56 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 21e6471..f5fcb74 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -702,7 +702,7 @@ static int msix_capability_init(struct pci_dev *dev,

ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSIX);
if (ret)
- goto out_avail;
+ goto error;

/*
* Some devices require MSI-X to be enabled before we can touch the
@@ -716,7 +716,7 @@ static int msix_capability_init(struct pci_dev *dev,

ret = populate_msi_sysfs(dev);
if (ret)
- goto out_free;
+ goto error;

/* Set MSI-X enabled bits and unmask the function */
pci_intx_for_msi(dev, 0);
@@ -727,24 +727,7 @@ static int msix_capability_init(struct pci_dev *dev,

return 0;

-out_avail:
- if (ret < 0) {
- /*
- * If we had some success, report the number of irqs
- * we succeeded in setting up.
- */
- struct msi_desc *entry;
- int avail = 0;
-
- list_for_each_entry(entry, &dev->msi_list, list) {
- if (entry->irq != 0)
- avail++;
- }
- if (avail != 0)
- ret = avail;
- }
-
-out_free:
+error:
free_msi_irqs(dev);

return ret;
@@ -800,17 +783,15 @@ static int pci_msi_check_device(struct pci_dev *dev, int nvec, int type)
* @dev: device to configure
* @nvec: number of interrupts to configure
*
- * Allocate IRQs for a device with the MSI capability.
- * This function returns a negative errno if an error occurs. If it
- * is unable to allocate the number of interrupts requested, it returns
- * the number of interrupts it might be able to allocate. If it successfully
- * allocates at least the number of interrupts requested, it returns 0 and
- * updates the @dev's irq member to the lowest new interrupt number; the
- * other interrupt numbers allocated to this device are consecutive.
+ * Allocate IRQs for a device with the MSI capability. This function returns
+ * a negative errno if an error occurs. If it successfully allocates at least
+ * the number of interrupts requested, it returns 0 and updates the @dev's
+ * irq member to the lowest new interrupt number; the other interrupt numbers
+ * allocated to this device are consecutive.
*/
int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
{
- int status, maxvec;
+ int ret, maxvec;
u16 msgctl;

if (!dev->msi_cap)
@@ -819,11 +800,11 @@ int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
if (nvec > maxvec)
- return maxvec;
+ return -EINVAL;

- status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
- if (status)
- return status;
+ ret = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
+ if (ret)
+ return ret;

WARN_ON(!!dev->msi_enabled);

@@ -834,33 +815,30 @@ int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
return -EINVAL;
}

- status = msi_capability_init(dev, nvec);
- return status;
+ ret = msi_capability_init(dev, nvec);
+ return ret;
}
EXPORT_SYMBOL(pci_enable_msi_block);

-int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec)
+int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec_ptr)
{
- int ret, nvec;
+ int ret, maxvec;
u16 msgctl;

if (!dev->msi_cap)
return -EINVAL;

pci_read_config_word(dev, dev->msi_cap + PCI_MSI_FLAGS, &msgctl);
- ret = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
-
- if (maxvec)
- *maxvec = ret;
+ maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);

- do {
- nvec = ret;
- ret = pci_enable_msi_block(dev, nvec);
- } while (ret > 0);
+ if (maxvec_ptr)
+ *maxvec_ptr = maxvec;

- if (ret < 0)
+ ret = pci_enable_msi_block(dev, maxvec);
+ if (ret)
return ret;
- return nvec;
+
+ return maxvec;
}
EXPORT_SYMBOL(pci_enable_msi_block_auto);

@@ -928,25 +906,22 @@ int pci_msix_table_size(struct pci_dev *dev)
* MSI-X mode enabled on its hardware device function. A return of zero
* indicates the successful configuration of MSI-X capability structure
* with new allocated MSI-X irqs. A return of < 0 indicates a failure.
- * Or a return of > 0 indicates that driver request is exceeding the number
- * of irqs or MSI-X vectors available. Driver should use the returned value to
- * re-send its request.
**/
int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
{
- int status, nr_entries;
+ int ret, nr_entries;
int i, j;

if (!entries || !dev->msix_cap)
return -EINVAL;

- status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
- if (status)
- return status;
+ ret = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSIX);
+ if (ret)
+ return ret;

nr_entries = pci_msix_table_size(dev);
if (nvec > nr_entries)
- return nr_entries;
+ return -EINVAL;

/* Check for any invalid entries */
for (i = 0; i < nvec; i++) {
@@ -965,8 +940,8 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
"(MSI IRQ already assigned)\n");
return -EINVAL;
}
- status = msix_capability_init(dev, entries, nvec);
- return status;
+ ret = msix_capability_init(dev, entries, nvec);
+ return ret;
}
EXPORT_SYMBOL(pci_enable_msix);

--
1.7.7.6

--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:36:16

by Alexander Gordeev

[permalink] [raw]
Subject: scrap this one


--
Regards,
Alexander Gordeev
[email protected]

2013-09-09 15:38:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> The series is what it seems a direction to take.

It can definitely use better descriptions explaining why this is
happening but the general direction seems good to me.

> Looks like we need PPC folks to agree on the quota check update
> for pSeries (yes, they do bail out with a positive return value
> from arch_msi_check_device()):
>
> quota = msi_quota_for_device(pdev, nvec);
>
> if (quota && quota < nvec)
> return quota;

I think it should return -ENOSPC.

Thanks.

--
tejun

2013-09-09 15:43:30

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Mon, Sep 09, 2013 at 11:37:54AM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> > The series is what it seems a direction to take.
>
> It can definitely use better descriptions explaining why this is
> happening but the general direction seems good to me.

Oh, those are not even tested. Did not want turn to drivers before
the interface/direction generally agreed.

> > Looks like we need PPC folks to agree on the quota check update
> > for pSeries (yes, they do bail out with a positive return value
> > from arch_msi_check_device()):
> >
> > quota = msi_quota_for_device(pdev, nvec);
> >
> > if (quota && quota < nvec)
> > return quota;
>
> I think it should return -ENOSPC.

Absolutely, in patch #2 :)

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-10 12:42:21

by Sergei Shtylyov

[permalink] [raw]
Subject: Re: [PATCH 5/9] PCI/MSI/s390: Make return values only 0/-errno when MSIs allocated

Hello.

On 09-09-2013 19:26, Alexander Gordeev wrote:

> Signed-off-by: Alexander Gordeev <[email protected]>
> ---
> arch/s390/pci/pci.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)

> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index f17a834..c79c6e4 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -427,6 +427,8 @@ int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
> pr_debug("%s: requesting %d MSI-X interrupts...", __func__, nvec);
> if (type != PCI_CAP_ID_MSIX && type != PCI_CAP_ID_MSI)
> return -EINVAL;
> + if (type == PCI_CAP_ID_MSI && nvec > 1)
> + return 1;

This contradicts to the patch subject.

WBR, Sergei

2013-09-10 13:07:55

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH 5/9] PCI/MSI/s390: Make return values only 0/-errno when MSIs allocated

On Tue, Sep 10, 2013 at 04:42:17PM +0400, Sergei Shtylyov wrote:
> This contradicts to the patch subject.

Sorry, Sergei et al.
I screwed this series is bit. The second 5/9 fixes that.

> WBR, Sergei
>

--
Regards,
Alexander Gordeev
[email protected]

2013-09-16 11:20:20

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> > I propose that you rework it that way, and at least find out what
> > (if anything) would break if we do that. Or maybe we just give up
> > some optimization; it would be nice to quantify that, too.
>
> Hi Bjorn,
>
> The series is what it seems a direction to take.
>
> Looks like we need PPC folks to agree on the quota check update
> for pSeries (yes, they do bail out with a positive return value
> from arch_msi_check_device()):

Hi Ben,

An initiative to simplify MSI/MSI-X allocation interface is brewing.
It seems pSeries quota thing is an obstacle. If it could be given up
(patch 2/9).

Thanks!

--
Regards,
Alexander Gordeev
[email protected]

2013-09-17 14:30:29

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Mon, Sep 16, 2013 at 12:22:11PM +0200, Alexander Gordeev wrote:
> On Mon, Sep 09, 2013 at 05:20:44PM +0200, Alexander Gordeev wrote:
> > On Fri, Sep 06, 2013 at 05:32:05PM -0600, Bjorn Helgaas wrote:
> > > I propose that you rework it that way, and at least find out what
> > > (if anything) would break if we do that. Or maybe we just give up
> > > some optimization; it would be nice to quantify that, too.
> >
> > Hi Bjorn,
> >
> > The series is what it seems a direction to take.
> >
> > Looks like we need PPC folks to agree on the quota check update
> > for pSeries (yes, they do bail out with a positive return value
> > from arch_msi_check_device()):
>
> Hi Ben,
>
> An initiative to simplify MSI/MSI-X allocation interface is brewing.
> It seems pSeries quota thing is an obstacle. If it could be given up
> (patch 2/9).

How about no?

We have a small number of MSIs available, limited by hardware &
firmware, if we don't impose a quota then the first device that probes
will get most/all of the MSIs and other devices miss out.

Anyway I don't see what problem you're trying to solve? I agree the
-ve/0/+ve return value pattern is ugly, but it's hardly the end of the
world.

cheers

2013-09-18 09:46:14

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> How about no?
>
> We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out.

Out of curiosity - how pSeries has had done it without quotas before
448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?

> Anyway I don't see what problem you're trying to solve? I agree the
> -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> world.

Well, the interface recently has been re-classified from "ugly" to
"unnecessarily complex and actively harmful" in Tejun's words ;)

Indeed, I checked most of the drivers and it is incredible how people
are creative in misusing the interface: from innocent pci_disable_msix()
calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
if pci_enable_msix() returned a positive value (apparently untested).

Roughly third of the drivers just do not care and bail out once
pci_enable_msix() has not succeeded. Not sure how many of these are
mandated by the hardware.

Another quite common pattern is a call to pci_enable_msix() to figure out
the number of MSI-Xs available and a repeated call of pci_enable_msix()
to enable those MSI-Xs, this time.

The last pattern makes most of sense to me and could be updated with a more
clear sequence - a call to (bit modified) pci_msix_table_size() followed
by a call to pci_enable_msix(). I think this pattern can effectively
supersede the currently recommended "loop" practice.

But as pSeries quota is still required I propose to introduce a new interface
pci_get_msix_limit() that combines pci_msix_table_size() and (also new)
arch_get_msix_limit(). The latter would check the quota thing in case of
pSeries and none in case of all other architectures.

The recommended practice would be:

/*
* Retrieving 'nvec' by means other than pci_msix_table_size()
*/

rc = pci_get_msix_limit(pdev);
if (rc < 0)
return rc;

/*
* nvec = min(rc, nvec);
*/

for (i = 0; i < nvec; i++)
msix_entry[i].entry = i;

rc = pci_enable_msix(pdev, msix_entry, nvec);
if (rc)
return rc;

Thoughts?

> cheers

--
Regards,
Alexander Gordeev
[email protected]

2013-09-18 14:22:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> >
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
>
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?

Hmmm... do we need to treat this any differently? If the platform
can't allocate full range of requested MSIs, just failing should be
enough regardless of why such allocation can't be met, no?

> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
>
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)

LOL. :)

> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).
>
> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Yeah, I mean, this type of interface is a trap. People have to
actively resist to avoid doing silly stuff which is a lot to ask.

> /*
> * Retrieving 'nvec' by means other than pci_msix_table_size()
> */
>
> rc = pci_get_msix_limit(pdev);
> if (rc < 0)
> return rc;
>
> /*
> * nvec = min(rc, nvec);
> */
>
> for (i = 0; i < nvec; i++)
> msix_entry[i].entry = i;
>
> rc = pci_enable_msix(pdev, msix_entry, nvec);
> if (rc)
> return rc;

I really think what we should do is

* Determine the number of MSIs the controller wants. Don't worry
about quotas or limits or anything. Just determine the number
necessary to enable enhanced interrupt handling.

* Try allocating that number of MSIs. If it fails, then just revert
to single interrupt mode. It's not the end of the world and mostly
guaranteed to work. Let's please not even try to do partial
multiple interrupts. I really don't think it's worth the risk or
complexity.

Thanks.

--
tejun

2013-09-18 16:48:53

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> >
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
>
> Hmmm... do we need to treat this any differently? If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?

That depends from what "full range of requested MSIs" is. If that is
a maximum number of MSIs the controller advertised, then no. As MSI
design essentially allows devices to operate in lower-than-maximum
modes it is responsibility of a caller to decide how many vectors to
request. So in case of pSeries I think it is completely legitimate to
request lessers to overcome the platform limitation and let all devices
work.

> I really think what we should do is
>
> * Determine the number of MSIs the controller wants. Don't worry
> about quotas or limits or anything. Just determine the number
> necessary to enable enhanced interrupt handling.

Actually, I do not see much contradiction with what I proposed. The
key words here "determine the number of MSIs the controller wants".

In general case it is not what pci_msix_table_size() returns (or at
least we should not limit ourselves to it) - there could be non-
standard means to report number of MSIs: hardcoded, version-dependant,
device-specific registers etc.

Next, if we opt to determine the number of MSIs by non-MSI standard
means then there is no reason not to call pci_get_msix_limit() (or
whatever) at this step.

The question how I see it - do we want pci_get_msix_limit() interface
as part of the MSI framework or do we want it pSeries-specific?

> * Try allocating that number of MSIs. If it fails, then just revert
> to single interrupt mode. It's not the end of the world and mostly
> guaranteed to work. Let's please not even try to do partial
> multiple interrupts. I really don't think it's worth the risk or
> complexity.

Being Captain Obvious here, but it is up to the device driver to handle
a failure. There could be no such option as single MSI mode after all :)

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-20 08:23:12

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Michael et al.

The identifiable options sounded so far were:

* Do not change anything

* Make pci_enable_msix() return 0/-errno

* Make pci_enable_msix() return 0/-errno and introduce arch_get_msix_limit()/
arch_get_msi_limit()

* Make pci_enable_msix() return 0/-errno and introduce pci_get_msix_limit()/
pci_get_msi_limit() and arch_get_msix_limit()/arch_get_msi_limit() so that:
pci_get_msix_limit() = min(arch_get_msix_limit(), pci_msix_table_size())
pci_get_msi_limit() = min(arch_get_msi_limit(), pci_get_msi_cap())

Can we have a conclusion here?

Thanks!

--
Regards,
Alexander Gordeev
[email protected]

2013-09-20 12:26:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Wed, Sep 18, 2013 at 06:50:45PM +0200, Alexander Gordeev wrote:
> Actually, I do not see much contradiction with what I proposed. The
> key words here "determine the number of MSIs the controller wants".
>
> In general case it is not what pci_msix_table_size() returns (or at
> least we should not limit ourselves to it) - there could be non-
> standard means to report number of MSIs: hardcoded, version-dependant,
> device-specific registers etc.
>
> Next, if we opt to determine the number of MSIs by non-MSI standard
> means then there is no reason not to call pci_get_msix_limit() (or
> whatever) at this step.

Yeah, that's all fine. My point is that we shouldn't try to use
"degraded" multiple MSI mode where the number of MSIs allocated is
smaller than performing full multiple MSI operation. How that number
is determined doesn't really matter but that number is a property
which is solely decided by the device driver, right? If a device
needs full multiple MSI mode, given specific configuration, it needs
>= X number of MSIs and that's the number it should request.

> Being Captain Obvious here, but it is up to the device driver to handle
> a failure. There could be no such option as single MSI mode after all :)

I don't think there actually is a mainstream device which can't
fallback to single interrupt. Anyways, the point is the same, let's
please not try to create an interface which encourages complex retry
logic in its users which are likely to involve less traveled and
tested paths in both the driver and firmware.

Thanks.

--
tejun

2013-09-20 12:27:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> * Make pci_enable_msix() return 0/-errno

My choice would be this one.

Thanks.

--
tejun

2013-09-25 18:02:30

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Fri, Sep 20, 2013 at 07:27:36AM -0500, Tejun Heo wrote:
> On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> > * Make pci_enable_msix() return 0/-errno
>
> My choice would be this one.

I agree; it sounds like you've identified several bugs related to the
current confusing interface, so fixing that seems like the first step.

I hope we can avoid adding a plethora of interfaces to address unusual
corner cases. But if we do the above and it turns out not to be enough,
we can always extend it later.

Bjorn

2013-09-25 20:56:09

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 25, 2013 at 12:02:20PM -0600, Bjorn Helgaas wrote:
> On Fri, Sep 20, 2013 at 07:27:36AM -0500, Tejun Heo wrote:
> > On Fri, Sep 20, 2013 at 10:24:59AM +0200, Alexander Gordeev wrote:
> > > * Make pci_enable_msix() return 0/-errno
> >
> > My choice would be this one.
>
> I agree; it sounds like you've identified several bugs related to the
> current confusing interface, so fixing that seems like the first step.

Yeah, I am trying to. Turns out to be a nice exercise ;)

> I hope we can avoid adding a plethora of interfaces to address unusual
> corner cases. But if we do the above and it turns out not to be enough,
> we can always extend it later.

Unfortunately, pSeries is a shows-topper here :( It seems we have to
introduce pci_get_msi{,x}_limit() interfaces to honour the quota
thing. I just hope the hardware set for pSeries is limited and we
won't need to use it for all drivers.

> Bjorn

--
Regards,
Alexander Gordeev
[email protected]

2013-09-25 21:00:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> Unfortunately, pSeries is a shows-topper here :( It seems we have to
> introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> thing. I just hope the hardware set for pSeries is limited and we
> won't need to use it for all drivers.

Can you please go into a bit of detail on that? Why does it matter?
Is it because you're worried you might cause performance regression by
forcing prevoius partial multiple allocations to single interrupt
operation?

Thanks.

--
tejun

2013-09-26 07:44:53

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 25, 2013 at 05:00:16PM -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> > Unfortunately, pSeries is a shows-topper here :( It seems we have to
> > introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> > thing. I just hope the hardware set for pSeries is limited and we
> > won't need to use it for all drivers.
>
> Can you please go into a bit of detail on that? Why does it matter?

Because otherwise we will re-introduce a problem described by Michael:
"We have a small number of MSIs available, limited by hardware &
firmware, if we don't impose a quota then the first device that probes
will get most/all of the MSIs and other devices miss out."

> Is it because you're worried you might cause performance regression by
> forcing prevoius partial multiple allocations to single interrupt
> operation?

Well, not really. I think it won't be possible to force people not to use
partial allocations anyway. Some controllers just do not care how many MSIs
they are configured with. Some drivers derive the number of MSIs desired
from the number of CPUs online - in such cases allocating more MSIs (i.e.
a number the controller advertised) could cause a performance degradation
even.

So when driver authors will know/measure/believe their hardware performs
better with partial allocations they will stand for it. What we can do is
to prevent those try-and-decrease fallbacks.

> Thanks.
>
> --
> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-26 09:20:54

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

> Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
>
> On Wed, Sep 25, 2013 at 05:00:16PM -0400, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Sep 25, 2013 at 10:58:05PM +0200, Alexander Gordeev wrote:
> > > Unfortunately, pSeries is a shows-topper here :( It seems we have to
> > > introduce pci_get_msi{,x}_limit() interfaces to honour the quota
> > > thing. I just hope the hardware set for pSeries is limited and we
> > > won't need to use it for all drivers.
> >
> > Can you please go into a bit of detail on that? Why does it matter?
>
> Because otherwise we will re-introduce a problem described by Michael:
> "We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out."

Would it be possible to do some kind of 2-stage allocation.
In the first pass the driver would pass a minimum and desired
number of MSI-X interrupts, but not actually be given any.
Interrupts could then be allocated after it is known how many
are required and how many are available.

David


2013-09-26 10:43:56

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 26, 2013 at 09:58:53AM +0100, David Laight wrote:
> Would it be possible to do some kind of 2-stage allocation.
> In the first pass the driver would pass a minimum and desired
> number of MSI-X interrupts, but not actually be given any.

Repeated calls to msi_enable_msi/msix() is what we are trying to avoid.

> Interrupts could then be allocated after it is known how many
> are required and how many are available.

Yep, that what we are heading to. So basic pattern I see would be like this:

int foo_driver_enable_msix(struct pci_dev *pdev, int nvec)
{
...

rc = pci_msix_table_size(pdev);
if (rc < 0)
return rc;

nvec = min(nvec, rc);
if (nvec < FOO_DRIVER_MINIMUM_NVEC)
goto single_msi;

for (i = 0; i < nvec; i++)
entries[i].entry = i;

rc = pci_enable_msix(pdev, entries, nvec);
if (rc)
goto single_msi;

return 0;

single_msi:
...

}

But this will break pSeries and we might end up with something like this:

int foo_driver_enable_msix(struct pci_dev *pdev, int nvec)
{
...

rc = pci_msix_table_size(pdev);
if (rc < 0)
return rc;

nvec = min(nvec, rc);
if (nvec < FOO_DRIVER_MINIMUM_NVEC)
goto single_msi;

rc = pci_get_msix_limit(pdev, nvec);
if (rc < 0)
return rc;

nvec = min(nvec, rc);
if (nvec < FOO_DRIVER_MINIMUM_NVEC)
goto single_msi;

for (i = 0; i < nvec; i++)
entries[i].entry = i;

rc = pci_enable_msix(pdev, entries, nvec);
if (rc)
goto single_msi;

return 0;

single_msi:
...

}

> David

--
Regards,
Alexander Gordeev
[email protected]

2013-09-26 11:36:32

by David Laight

[permalink] [raw]
Subject: RE: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

> On Thu, Sep 26, 2013 at 09:58:53AM +0100, David Laight wrote:
> > Would it be possible to do some kind of 2-stage allocation.
> > In the first pass the driver would pass a minimum and desired
> > number of MSI-X interrupts, but not actually be given any.
>
> Repeated calls to msi_enable_msi/msix() is what we are trying to avoid.

I was thinking that the first call would be done during driver probe
(assuming such a time exists) so that the subsystem could determine
how many interrupts all the drivers would like, so it can then
hand out a smaller number to some of the early drivers in order
to have some left to satisfy the minimum requirement of later
ones.

So all it would do is sum the requirements of all the drivers.

David


2013-09-26 12:11:28

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 26, 2013 at 12:34:36PM +0100, David Laight wrote:
> I was thinking that the first call would be done during driver probe
> (assuming such a time exists) so that the subsystem could determine
> how many interrupts all the drivers would like, so it can then
> hand out a smaller number to some of the early drivers in order
> to have some left to satisfy the minimum requirement of later
> ones.
>
> So all it would do is sum the requirements of all the drivers.

It is already implemented - please see commit 448e2ca ("powerpc/pseries:
Implement a quota system for MSIs")

All other archs do not have MSI vector space limitations.

> David

--
Regards,
Alexander Gordeev
[email protected]

2013-09-26 12:48:55

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On 13-09-18 05:48 AM, Alexander Gordeev wrote:
>
> The last pattern makes most of sense to me and could be updated with a more
> clear sequence - a call to (bit modified) pci_msix_table_size() followed
> by a call to pci_enable_msix(). I think this pattern can effectively
> supersede the currently recommended "loop" practice.

The loop is still necessary, because there's a race between those two calls,
so that pci_enable_msix() can still fail due to lack of MSIX slots.

2013-09-26 13:01:50

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 26, 2013 at 08:32:53AM -0400, Mark Lord wrote:
> On 13-09-18 05:48 AM, Alexander Gordeev wrote:
> > The last pattern makes most of sense to me and could be updated with a more
> > clear sequence - a call to (bit modified) pci_msix_table_size() followed
> > by a call to pci_enable_msix(). I think this pattern can effectively
> > supersede the currently recommended "loop" practice.
>
> The loop is still necessary, because there's a race between those two calls,
> so that pci_enable_msix() can still fail due to lack of MSIX slots.

Moreover, the existing loop pattern is racy and could fail just as easily ;)
But (1) that is something drivers should expect and (2) there is basically
nothing to race against - that is probably the reason it has not been a
problem for pSeries. So I think we should not care about this.

--
Regards,
Alexander Gordeev
[email protected]

2013-09-26 13:12:01

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Thu, Sep 26, 2013 at 09:46:46AM +0200, Alexander Gordeev wrote:
> > Can you please go into a bit of detail on that? Why does it matter?
>
> Because otherwise we will re-introduce a problem described by Michael:
> "We have a small number of MSIs available, limited by hardware &
> firmware, if we don't impose a quota then the first device that probes
> will get most/all of the MSIs and other devices miss out."

Still not following. Why wouldn't just letting the drivers request
the optimal number they want and falling back to single interrupt mode
work? ie. why can't we just have an all or nothing interface?

> > Is it because you're worried you might cause performance regression by
> > forcing prevoius partial multiple allocations to single interrupt
> > operation?
>
> Well, not really. I think it won't be possible to force people not to use
> partial allocations anyway. Some controllers just do not care how many MSIs
> they are configured with. Some drivers derive the number of MSIs desired
> from the number of CPUs online - in such cases allocating more MSIs (i.e.
> a number the controller advertised) could cause a performance degradation
> even.

Yeah, sure thing but just let the *driver* decide that number without
worrying about how many they can actually get. Ultimately, what we
want is removing this extra variable which can arbitrarily affect the
number of allocated interrupts so that we only have to worry about
either proper multiple MSI mode or single interrupt mode, not
something random inbetween. It is possible that there exists a driver
which absolutely requires partial allocation on certain archs, but
that should be a very special case and the interface should look
accordingly ugly / special. But do we actually have those?

Thanks.

--
tejun

2013-09-26 14:37:14

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 26, 2013 at 09:11:47AM -0400, Tejun Heo wrote:
> > Because otherwise we will re-introduce a problem described by Michael:
> > "We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out."
>
> Still not following. Why wouldn't just letting the drivers request
> the optimal number they want and falling back to single interrupt mode
> work? ie. why can't we just have an all or nothing interface?

I can imagine a scenario where the first device probes in, requests its
optimal number, acquires that number and exhausts MSIs in pSeries firmware.
The next few devices possibly end up with single MSI, since no MSIs left
to satisfy their optimal numbers. If one of those single-MSI'ed devices
happened to be a high-performance HBA hitting a degraded performance that
alone would force (IBM) to introduce the quotas. Now, if the same/another
device happened does not support the legacy interrupt mode and no MSI
resources have left in the platform firmware at all...

> tejun

--
Regards,
Alexander Gordeev
[email protected]

2013-09-26 14:42:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Thu, Sep 26, 2013 at 10:39 AM, Alexander Gordeev <[email protected]> wrote:
> I can imagine a scenario where the first device probes in, requests its

Well, we can imagine a lot of thing but usually have to draw the line somewhere.

> optimal number, acquires that number and exhausts MSIs in pSeries firmware.
> The next few devices possibly end up with single MSI, since no MSIs left
> to satisfy their optimal numbers. If one of those single-MSI'ed devices
> happened to be a high-performance HBA hitting a degraded performance that
> alone would force (IBM) to introduce the quotas. Now, if the same/another
> device happened does not support the legacy interrupt mode and no MSI
> resources have left in the platform firmware at all...

If that happens, that's just the platform code being dumb. Quota is
there to prevent that from happening, right? Let's please do something
simple and worry about problems if they actually exist.

Thanks.

--
tejun

2013-10-01 07:20:01

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Thu, Sep 26, 2013 at 04:39:02PM +0200, Alexander Gordeev wrote:
> On Thu, Sep 26, 2013 at 09:11:47AM -0400, Tejun Heo wrote:
> > > Because otherwise we will re-introduce a problem described by Michael:
> > > "We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out."
> >
> > Still not following. Why wouldn't just letting the drivers request
> > the optimal number they want and falling back to single interrupt mode
> > work? ie. why can't we just have an all or nothing interface?
>
> I can imagine a scenario where the first device probes in, requests its
> optimal number, acquires that number and exhausts MSIs in pSeries firmware.
> The next few devices possibly end up with single MSI, since no MSIs left
> to satisfy their optimal numbers. If one of those single-MSI'ed devices
> happened to be a high-performance HBA hitting a degraded performance that
> alone would force (IBM) to introduce the quotas.

Yes that's exactly the scenario, and I didn't imagine it, our test
people actually hit it and yelled at me.

I don't remember exactly which adapters it was, I might be able to find
the details if I looked hard, a quick search through my mail archive
didn't find it - it might have come in via irc / bugzilla etc.

cheers

2013-10-01 07:26:49

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Fri, Sep 20, 2013 at 07:26:03AM -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 18, 2013 at 06:50:45PM +0200, Alexander Gordeev wrote:
> > Actually, I do not see much contradiction with what I proposed. The
> > key words here "determine the number of MSIs the controller wants".
> >
> > In general case it is not what pci_msix_table_size() returns (or at
> > least we should not limit ourselves to it) - there could be non-
> > standard means to report number of MSIs: hardcoded, version-dependant,
> > device-specific registers etc.
> >
> > Next, if we opt to determine the number of MSIs by non-MSI standard
> > means then there is no reason not to call pci_get_msix_limit() (or
> > whatever) at this step.
>
> Yeah, that's all fine. My point is that we shouldn't try to use
> "degraded" multiple MSI mode where the number of MSIs allocated is
> smaller than performing full multiple MSI operation. How that number
> is determined doesn't really matter but that number is a property
> which is solely decided by the device driver, right? If a device
> needs full multiple MSI mode, given specific configuration, it needs
> >= X number of MSIs and that's the number it should request.

Sure, the driver is in full control. If it can ONLY work with N MSIs
then it should try for N, else fallback to 1.

But some drivers are able to work with a range of values for N, and
performance is improved vs using a single MSI.

> > Being Captain Obvious here, but it is up to the device driver to handle
> > a failure. There could be no such option as single MSI mode after all :)
>
> I don't think there actually is a mainstream device which can't
> fallback to single interrupt. Anyways, the point is the same, let's
> please not try to create an interface which encourages complex retry
> logic in its users which are likely to involve less traveled and
> tested paths in both the driver and firmware.

Why support > 1 MSI at all? It just adds complex logic and less travelled
paths in the driver and firmware.

cheers

2013-10-01 07:35:54

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> Hello,
>
> On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> > On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > > How about no?
> > >
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> >
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
>
> Hmmm... do we need to treat this any differently? If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?
>
> > > Anyway I don't see what problem you're trying to solve? I agree the
> > > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > > world.
> >
> > Well, the interface recently has been re-classified from "ugly" to
> > "unnecessarily complex and actively harmful" in Tejun's words ;)
>
> LOL. :)
>
> > Indeed, I checked most of the drivers and it is incredible how people
> > are creative in misusing the interface: from innocent pci_disable_msix()
> > calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> > if pci_enable_msix() returned a positive value (apparently untested).
> >
> > Roughly third of the drivers just do not care and bail out once
> > pci_enable_msix() has not succeeded. Not sure how many of these are
> > mandated by the hardware.
>
> Yeah, I mean, this type of interface is a trap. People have to
> actively resist to avoid doing silly stuff which is a lot to ask.

I really think you're overstating the complexity here.

Functions typically return a boolean -> nothing to see here
This function returns a tristate value -> brain explosion!


> > /*
> > * Retrieving 'nvec' by means other than pci_msix_table_size()
> > */
> >
> > rc = pci_get_msix_limit(pdev);
> > if (rc < 0)
> > return rc;
> >
> > /*
> > * nvec = min(rc, nvec);
> > */
> >
> > for (i = 0; i < nvec; i++)
> > msix_entry[i].entry = i;
> >
> > rc = pci_enable_msix(pdev, msix_entry, nvec);
> > if (rc)
> > return rc;
>
> I really think what we should do is
>
> * Determine the number of MSIs the controller wants. Don't worry
> about quotas or limits or anything. Just determine the number
> necessary to enable enhanced interrupt handling.
>
> * Try allocating that number of MSIs. If it fails, then just revert
> to single interrupt mode. It's not the end of the world and mostly
> guaranteed to work. Let's please not even try to do partial
> multiple interrupts. I really don't think it's worth the risk or
> complexity.

It will potentially break existing setups on our hardware.

Can I make that any clearer?

cheers

2013-10-01 07:51:39

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> >
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
>
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
>
> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
>
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)
>
> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).

OK, but we have the source to the drivers, we could just fix them.

We could even add:

pci_enable_msix_i_am_stupid()

Which swallows the positive return and just gives back -ve/0.

> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Sure, that's fine if those drivers do that, it's up to the drivers after
all.

> Another quite common pattern is a call to pci_enable_msix() to figure out
> the number of MSI-Xs available and a repeated call of pci_enable_msix()
> to enable those MSI-Xs, this time.

Also fine, though as the documentation suggests a loop is the best
construct rather than two explicit calls.

> The recommended practice would be:
>
> /*
> * Retrieving 'nvec' by means other than pci_msix_table_size()
> */
>
> rc = pci_get_msix_limit(pdev);
> if (rc < 0)
> return rc;
>
> /*
> * nvec = min(rc, nvec);
> */
>
> for (i = 0; i < nvec; i++)
> msix_entry[i].entry = i;
>
> rc = pci_enable_msix(pdev, msix_entry, nvec);
> if (rc)
> return rc;
>
> Thoughts?

We could probably make that work.

The disadvantage is that any restriction imposed on us above the quota
can only be reported as an error from pci_enable_msix().

The quota code, called from pci_get_msix_limit(), can only do so much to
interogate firmware about the limitations. The ultimate way to check if
firmware will give us enough MSIs is to try and allocate them. But we
can't do that from pci_get_msix_limit() because the driver is not asking
us to enable MSIs, just query them.

You'll also need to add another arch hook, for the quota check, and
we'll have to add it to our per-platform indirection as well.

All a lot of bother for no real gain IMHO.

cheers

2013-10-01 10:33:34

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> The disadvantage is that any restriction imposed on us above the quota
> can only be reported as an error from pci_enable_msix().
>
> The quota code, called from pci_get_msix_limit(), can only do so much to
> interogate firmware about the limitations. The ultimate way to check if
> firmware will give us enough MSIs is to try and allocate them. But we
> can't do that from pci_get_msix_limit() because the driver is not asking
> us to enable MSIs, just query them.

If things are this way then pci_enable_msix() already exposed to this
problem internally on pSeries.

I see that even successful quota checks in rtas_msi_check_device() and
rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
give enough MSIs. Hence, pci_enable_msix() might fail even though the
its quota checks succeeded.

Therefore, nothing will really change if we make pci_get_msix_limit() check
quota and hope the follow-up call to pci_enable_msix() succeeded.

(Of course, we could allocate-deallocate MSIs at check time, but I think it
is an overkill).

> You'll also need to add another arch hook, for the quota check, and
> we'll have to add it to our per-platform indirection as well.

Already, in a branch, hidden from Bjorn & Tejun eyes ;)

> All a lot of bother for no real gain IMHO.

Well, I do not have a strong opinion here. I leave it to the ones who have :)
But few drivers have became clearer as result of this change (and messy ones
are still messy).

> cheers

--
Regards,
Alexander Gordeev
[email protected]

2013-10-01 11:55:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

Hello,

On Tue, Oct 01, 2013 at 05:35:48PM +1000, Michael Ellerman wrote:
> > > Roughly third of the drivers just do not care and bail out once
> > > pci_enable_msix() has not succeeded. Not sure how many of these are
> > > mandated by the hardware.
> >
> > Yeah, I mean, this type of interface is a trap. People have to
> > actively resist to avoid doing silly stuff which is a lot to ask.
>
> I really think you're overstating the complexity here.
>
> Functions typically return a boolean -> nothing to see here
> This function returns a tristate value -> brain explosion!

It is an interface which forces the driver writers to write
complicated fallback code which won't usually be excercised. The same
goes for the hardware. In isolation, it doesn't look like much but
things like this are bound to lead to subtle bugs which are diffiuclt
to trigger.

> > * Determine the number of MSIs the controller wants. Don't worry
> > about quotas or limits or anything. Just determine the number
> > necessary to enable enhanced interrupt handling.
> >
> > * Try allocating that number of MSIs. If it fails, then just revert
> > to single interrupt mode. It's not the end of the world and mostly
> > guaranteed to work. Let's please not even try to do partial
> > multiple interrupts. I really don't think it's worth the risk or
> > complexity.
>
> It will potentially break existing setups on our hardware.

I think it'd be much better to have a separate interface for the
drivers which actually require it *in practice* rather than forcing
everyone to go "oh this interface supports that, I don't know if I
need it but let's implement fallback logic which I won't and have no
means of testing". Are we talking about some limited number of device
drivers here? Also, is the quota still necessary for machines in
production today?

Thanks.

--
tejun

2013-10-02 02:33:46

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Tue, Oct 01, 2013 at 07:55:03AM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Oct 01, 2013 at 05:35:48PM +1000, Michael Ellerman wrote:
> > > > Roughly third of the drivers just do not care and bail out once
> > > > pci_enable_msix() has not succeeded. Not sure how many of these are
> > > > mandated by the hardware.
> > >
> > > Yeah, I mean, this type of interface is a trap. People have to
> > > actively resist to avoid doing silly stuff which is a lot to ask.
> >
> > I really think you're overstating the complexity here.
> >
> > Functions typically return a boolean -> nothing to see here
> > This function returns a tristate value -> brain explosion!
>
> It is an interface which forces the driver writers to write
> complicated fallback code which won't usually be excercised.

It does not force anyone to do anything. That's just bull.

Code which is unwilling or unable to cope with the extra complexity
can simply do:

if (pci_enable_msix(..))
goto fail;

It's as simple as that.

> > > * Determine the number of MSIs the controller wants. Don't worry
> > > about quotas or limits or anything. Just determine the number
> > > necessary to enable enhanced interrupt handling.
> > >
> > > * Try allocating that number of MSIs. If it fails, then just revert
> > > to single interrupt mode. It's not the end of the world and mostly
> > > guaranteed to work. Let's please not even try to do partial
> > > multiple interrupts. I really don't think it's worth the risk or
> > > complexity.
> >
> > It will potentially break existing setups on our hardware.
>
> I think it'd be much better to have a separate interface for the
> drivers which actually require it *in practice* rather than forcing
> everyone to go "oh this interface supports that, I don't know if I
> need it but let's implement fallback logic which I won't and have no
> means of testing".

Sure, that's easy:

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d5f90d6..48d0252 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -988,6 +988,18 @@ int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
}
EXPORT_SYMBOL(pci_enable_msix);

+int pci_enable_msix_or_fail(struct pci_dev *dev, struct msix_entry *entries,
+ int nvec)
+{
+ int rc;
+
+ rc = pci_enable_msix(dev, entries, nvec);
+ if (rc > 0)
+ rc = -ENOSPC;
+
+ return rc;
+}
+
void pci_msix_shutdown(struct pci_dev *dev)
{
struct msi_desc *entry;


> Are we talking about some limited number of device drivers here?

I don't have a list, but yeah there are certain drivers that folks care about.

> Also, is the quota still necessary for machines in production today?

As far as I know yes. The number of MSIs is growing on modern systems, but so
is the number of cpus and devices.

cheers

2013-10-02 02:43:32

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Tue, Oct 01, 2013 at 12:35:27PM +0200, Alexander Gordeev wrote:
> On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> > The disadvantage is that any restriction imposed on us above the quota
> > can only be reported as an error from pci_enable_msix().
> >
> > The quota code, called from pci_get_msix_limit(), can only do so much to
> > interogate firmware about the limitations. The ultimate way to check if
> > firmware will give us enough MSIs is to try and allocate them. But we
> > can't do that from pci_get_msix_limit() because the driver is not asking
> > us to enable MSIs, just query them.
>
> If things are this way then pci_enable_msix() already exposed to this
> problem internally on pSeries.
>
> I see that even successful quota checks in rtas_msi_check_device() and
> rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
> give enough MSIs. Hence, pci_enable_msix() might fail even though the
> its quota checks succeeded.

Yes, but it can report that failure to the caller, which can then retry.

> Therefore, nothing will really change if we make pci_get_msix_limit() check
> quota and hope the follow-up call to pci_enable_msix() succeeded.

No that's not equivalent. Under your scheme if pci_enable_msix() fails
then the caller just bails, it will never try again with a lower number.

> (Of course, we could allocate-deallocate MSIs at check time, but I think it
> is an overkill).

It's not only overkill, it's messing with the device behind the drivers
back, which is definitely a no-no in my opinion.

> > You'll also need to add another arch hook, for the quota check, and
> > we'll have to add it to our per-platform indirection as well.
>
> Already, in a branch, hidden from Bjorn & Tejun eyes ;)
>
> > All a lot of bother for no real gain IMHO.
>
> Well, I do not have a strong opinion here. I leave it to the ones who have :)
> But few drivers have became clearer as result of this change (and messy ones
> are still messy).

Amen.

cheers

2013-10-02 02:46:57

by Mark Lord

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On 13-09-26 09:03 AM, Alexander Gordeev wrote:
> On Thu, Sep 26, 2013 at 08:32:53AM -0400, Mark Lord wrote:
>> On 13-09-18 05:48 AM, Alexander Gordeev wrote:
>>> The last pattern makes most of sense to me and could be updated with a more
>>> clear sequence - a call to (bit modified) pci_msix_table_size() followed
>>> by a call to pci_enable_msix(). I think this pattern can effectively
>>> supersede the currently recommended "loop" practice.
>>
>> The loop is still necessary, because there's a race between those two calls,
>> so that pci_enable_msix() can still fail due to lack of MSIX slots.
>
> Moreover, the existing loop pattern is racy and could fail just as easily ;)

Yes, but it then loops again to correct things.

> But (1) that is something drivers should expect and (2) there is basically
> nothing to race against - that is probably the reason it has not been a
> problem for pSeries. So I think we should not care about this.

I always care about race conditions.

2013-10-02 03:23:46

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Oct 02, 2013 at 12:33:38PM +1000, Michael Ellerman wrote:
> > It is an interface which forces the driver writers to write
> > complicated fallback code which won't usually be excercised.
>
> It does not force anyone to do anything. That's just bull.

Yeah, sure, we don't have shitty code in drivers which don't need any
of that retry logic, right? What the hell is up with the gratuituous
escalation? You really wanna go that way?

> Code which is unwilling or unable to cope with the extra complexity
> can simply do:
>
> if (pci_enable_msix(..))
> goto fail;
>
> It's as simple as that.

You apparently have no clue how people behave. You give a function
which indicates complex failure mode, driver writers *will* try to
handle that whether they actually understand the implication or not.
That's a natural and correct behavior too because any half-competent
software eng would design API so that it receives and returns
information which is relevant to its users. If there are special
cases to handle, make the damn interface for it special too so that it
doesn't confuse the common case.

Driver codes already have generally lower quality than core code, if
for nothing else, due to the sheer volume, and there are many driver
writers who aren't too privvy with various kernel subsystems. They
usually just copy whatever other similar driver is doing, and this one
is a lot worse - this thing affects hardware directly. If you expect
all the shitty implementations of ahci to handle the different
variations of multiple MSI config cases, you just don't have any
experience dealing with cheap commodity hardware.

Driver APIs should be intuitive, clear in its intentions, and don't
tempt fate with hairy configs for vast majority of cases.

> +int pci_enable_msix_or_fail(struct pci_dev *dev, struct msix_entry *entries,
> + int nvec)
> +{
> + int rc;
> +
> + rc = pci_enable_msix(dev, entries, nvec);
> + if (rc > 0)
> + rc = -ENOSPC;
> +
> + return rc;
> +}

Make the *default* case simple and give clearly special interface for
the special cases. What's so hard about that?

> > Are we talking about some limited number of device drivers here?
>
> I don't have a list, but yeah there are certain drivers that folks care about.

And here's another problem with the current interface. Because the
default interface is the unnecessrily complicated one, now we can't
tell which ones actually need the complicated treatment.

> > Also, is the quota still necessary for machines in production today?
>
> As far as I know yes. The number of MSIs is growing on modern systems, but so
> is the number of cpus and devices.

That's a bummer, but let's please make the default interface simple.
I really don't wanna see partial allocations for ahci.

--
tejun

2013-10-02 07:08:09

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Wed, Oct 02, 2013 at 12:43:24PM +1000, Michael Ellerman wrote:
> On Tue, Oct 01, 2013 at 12:35:27PM +0200, Alexander Gordeev wrote:
> > On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> > > The disadvantage is that any restriction imposed on us above the quota
> > > can only be reported as an error from pci_enable_msix().
> > >
> > > The quota code, called from pci_get_msix_limit(), can only do so much to
> > > interogate firmware about the limitations. The ultimate way to check if
> > > firmware will give us enough MSIs is to try and allocate them. But we
> > > can't do that from pci_get_msix_limit() because the driver is not asking
> > > us to enable MSIs, just query them.
> >
> > If things are this way then pci_enable_msix() already exposed to this
> > problem internally on pSeries.
> >
> > I see that even successful quota checks in rtas_msi_check_device() and
> > rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
> > give enough MSIs. Hence, pci_enable_msix() might fail even though the
> > its quota checks succeeded.
>
> Yes, but it can report that failure to the caller, which can then retry.

If a driver wants to retry after a failure it is up to the driver (but why?).
The current guidlines state:

"If this function returns a negative number, it indicates an error and
the driver should not attempt to allocate any more MSI-X interrupts for
this device."

Anyway, what number could the driver retry with after it got a negative errno?

> > Therefore, nothing will really change if we make pci_get_msix_limit() check
> > quota and hope the follow-up call to pci_enable_msix() succeeded.
>
> No that's not equivalent. Under your scheme if pci_enable_msix() fails
> then the caller just bails, it will never try again with a lower number.

Currently under the very same circumstances (the quota check within
rtas_setup_msi_irqs() returned Q vectors while the firmware has only F
vectors to allocate and Q > F) rtas_setup_msi_irqs() fails, pci_enable_msix()
fails, the caller bails and never try again with a lower number.

Am I missing something here?

> cheers

--
Regards,
Alexander Gordeev
[email protected]

2013-10-02 07:24:43

by Alexander Gordeev

[permalink] [raw]
Subject: Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface

On Tue, Oct 01, 2013 at 10:46:32PM -0400, Mark Lord wrote:
> >>> The last pattern makes most of sense to me and could be updated with a more
> >>> clear sequence - a call to (bit modified) pci_msix_table_size() followed
> >>> by a call to pci_enable_msix(). I think this pattern can effectively
> >>> supersede the currently recommended "loop" practice.
> >>
> >> The loop is still necessary, because there's a race between those two calls,
> >> so that pci_enable_msix() can still fail due to lack of MSIX slots.
> >
> > Moreover, the existing loop pattern is racy and could fail just as easily ;)
>
> Yes, but it then loops again to correct things.

No. If it failed it should exit the loop.

--
Regards,
Alexander Gordeev
[email protected]