2004-06-22 20:07:11

by long

Subject: [PATCH]2.6.7 MSI-X Update

On Tuesday, June 22, 2004 Roland Dreier wrote:
>Do you have any plans for when this should be fixed? Right now, with
>the standard kernel, if I unload and then reload my driver module,
>setting up MSI-X fails the second time through because the core has
>not cleaned up the memory region from the first time.

For the case where a device function implements both the MSI
capability structure and the MSI-X capability structure, the
MSI support in kernel 2.6.x chooses to enable the MSI-X capability
structure, because one of its key advantages over MSI is that it allows
the kernel to provide a device function with multiple messages. We've
received input from some IHVs requesting that the kernel give a device
driver the ability to choose whether to enable MSI or MSI-X to fit its
specific needs.

Also, the kernel may encounter MSI-X vector shortages when handling an
MSI-X request from a device driver. This can cause a failure to enable
MSI-X if the requested number of vectors is not available. To allow
the driver to still use MSI-X with a reduced number of vectors, the
kernel should return the maximum number of MSI-X vectors available to
the caller. In addition, the device driver requires the ability to
select which entries of the MSI-X table are enabled (ABC--, A-B-C,
A--CB, etc.).
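
The shortage handling described above can be sketched in user space as
follows. This is only an illustration of the proposed return convention
(0 on success, a positive count on shortage, a negative errno on
failure). mock_pci_enable_msix(), mock_vectors_available and
request_msix_with_retry() are hypothetical names invented for this sketch
and are not kernel symbols; the real call also takes a struct pci_dev
pointer, which is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same layout as the msix_entry struct shown in the patch's MSI-HOWTO. */
struct msix_entry {
	unsigned int vector : 16;	/* written by the (mock) kernel */
	unsigned int entry  : 16;	/* MSI-X table entry chosen by the driver */
};

/* Hypothetical stand-in for the pool of free vectors. */
static int mock_vectors_available = 3;

/*
 * Hypothetical mock of the proposed return convention:
 *   0  -> all nvec entries were given a vector
 *   >0 -> shortage; the value is the number of vectors available
 *   <0 -> failure (duplicate entries, or no vector at all)
 */
static int mock_pci_enable_msix(struct msix_entry *entries, int nvec)
{
	int i, j;

	if (nvec <= 0)
		return -EINVAL;
	/* Entries such as ABBCC (same table entry twice) are rejected. */
	for (i = 0; i < nvec; i++)
		for (j = i + 1; j < nvec; j++)
			if (entries[i].entry == entries[j].entry)
				return -EINVAL;
	if (mock_vectors_available == 0)
		return -EBUSY;
	if (nvec > mock_vectors_available)
		return mock_vectors_available;
	for (i = 0; i < nvec; i++)
		entries[i].vector = 0x31 + i;	/* arbitrary fake vector numbers */
	mock_vectors_available -= nvec;
	return 0;
}

/*
 * Ask for nvec vectors; on shortage, retry once with the count reported.
 * Returns the number of vectors granted or a negative errno.
 */
static int request_msix_with_retry(struct msix_entry *entries, int nvec)
{
	int rc;

	assert(entries != NULL);
	rc = mock_pci_enable_msix(entries, nvec);
	if (rc > 0) {
		nvec = rc;	/* keep only the first nvec entries */
		rc = mock_pci_enable_msix(entries, nvec);
	}
	if (rc == 0)
		return nvec;
	return rc < 0 ? rc : -EBUSY;
}
```

A driver that lists its entries in priority order (for example 0, 2, 4,
1, 3) ends up with the A-B-C layout when only three vectors are granted,
because the retry keeps the first three elements of the array.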

As a result, I would like to propose the following changes to the
current 2.6 MSI implementation:

1. Make the existing API pci_enable_msi(struct pci_dev *dev)
support only MSI.
2. Consolidate the existing msi_alloc_vectors() and
msi_free_vectors() into a single API called pci_enable_msix
(struct pci_dev *dev, unsigned int *data, int nvec) to
support MSI-X.
3. Provide finer granularity in handling MSI/MSI-X vectors
freed by a device driver, as well as MSI/MSI-X vector
reassignment on a new request.
4. Update MSI-HOWTO to describe items 1, 2, and 3 in more
detail.

For implementation details refer to the patch.
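
As one worked example of those details, section 5.3.2 of the updated
MSI-HOWTO in the patch gives the allocation policy as (x - y)/z. A
minimal sketch of that arithmetic, assuming integer division and assuming
the result is zero when no MSI-X devices are present or when x does not
exceed y (neither assumption is stated in the HOWTO), could look like
this. msix_vectors_per_function() is a hypothetical helper, not a kernel
function.

```c
#include <assert.h>

/*
 * Hypothetical helper evaluating the (x - y) / z policy:
 *   x = available PCI vector resources
 *   y = number of MSI capable devices (one vector reserved for each)
 *   z = number of MSI-X capable devices sharing the remainder
 */
static int msix_vectors_per_function(int x, int y, int z)
{
	assert(x >= 0 && y >= 0 && z >= 0);
	if (z == 0 || x <= y)
		return 0;	/* assumption: nothing left to hand out */
	return (x - y) / z;	/* assumption: integer division */
}
```

With 20 available vectors, 4 MSI capable devices and 4 MSI-X capable
devices, each MSI-X function would be allocated (20 - 4)/4 = 4 vectors.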

Starting on 06/26, I will not have access to email for two weeks. I'll
respond to lkml input after that.

Thanks,
Long

---------------------------------------------------------------------
diff -urN linux-2.6.7/Documentation/MSI-HOWTO.txt patch-2.6.7-msix/Documentation/MSI-HOWTO.txt
--- linux-2.6.7/Documentation/MSI-HOWTO.txt 2004-05-09 22:31:58.000000000 -0400
+++ patch-2.6.7-msix/Documentation/MSI-HOWTO.txt 2004-06-22 10:16:09.000000000 -0400
@@ -3,13 +3,14 @@
10/03/2003
Revised Feb 12, 2004 by Martine Silbermann
email: [email protected]
+ Revised May 24, 2004 by Tom L Nguyen

1. About this guide

-This guide describes the basics of Message Signaled Interrupts(MSI), the
-advantages of using MSI over traditional interrupt mechanisms, and how
-to enable your driver to use MSI or MSI-X. Also included is a Frequently
-Asked Questions.
+This guide describes the basics of Message Signaled Interrupts (MSI),
+the advantages of using MSI over traditional interrupt mechanisms,
+and how to enable your driver to use MSI or MSI-X. Also included is
+a Frequently Asked Questions section.

2. Copyright 2003 Intel Corporation

@@ -35,7 +36,7 @@
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
-enable both, but instead enable only the MSI-X capability structure.
+enable both.

The MSI capability structure contains Message Control register,
Message Address register and Message Data register. These registers
@@ -86,7 +87,7 @@
support for better interrupt performance.

Using MSI enables the device functions to support two or more
-vectors, which can be configure to target different CPU's to
+vectors, which can be configured to target different CPU's to
increase scalability.

5. Configuring a driver to use MSI/MSI-X
@@ -95,26 +96,39 @@
support this capability. The CONFIG_PCI_USE_VECTOR kernel option
must be selected to enable MSI/MSI-X support.

-5.1 Including MSI support into the kernel
+5.1 Including MSI/MSI-X support into the kernel

-To allow MSI-Capable device drivers to selectively enable MSI (using
-pci_enable_msi as described below), the VECTOR based scheme needs to
-be enabled by setting CONFIG_PCI_USE_VECTOR.
+To allow MSI/MSI-X capable device drivers to selectively enable
+MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
+below), the VECTOR based scheme needs to be enabled by setting
+CONFIG_PCI_USE_VECTOR during kernel config.

Since the target of the inbound message is the local APIC, providing
-CONFIG_PCI_USE_VECTOR is dependent on whether CONFIG_X86_LOCAL_APIC
-is enabled or not.
+CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_USE_VECTOR.

-int pci_enable_msi(struct pci_dev *)
+5.2 Configuring for MSI support
+
+Due to the non-contiguous fashion in vector assignment of the
+existing Linux kernel, this version does not support multiple
+messages regardless of whether a device function is capable of
+supporting more than one vector. Enabling MSI on a device function's
+MSI capability structure requires its device driver to call the
+function pci_enable_msi() explicitly.
+
+5.2.1 API pci_enable_msi
+
+int pci_enable_msi(struct pci_dev *dev)

With this new API, any existing device driver, which like to have
-MSI enabled on its device function, must call this explicitly. A
-successful call will initialize the MSI/MSI-X capability structure
-with ONE vector, regardless of whether the device function is
+MSI enabled on its device function, must call this API to enable MSI.
+A successful call will initialize the MSI capability structure
+with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
pre-assigned dev->irq with a new MSI vector. To avoid the conflict
of new assigned vector with existing pre-assigned vector requires
-the device driver to call this API before calling request_irq(...).
+a device driver to call this API before calling request_irq().
+
+5.2.2 MSI mode vs. legacy mode diagram

The below diagram shows the events, which switches the interrupt
mode on the MSI-capable device function between MSI mode and
@@ -126,103 +140,238 @@
| | ===============> | |
------------ free_irq ------------------------

-5.2 Configuring for MSI support
+Figure 1.0 MSI Mode vs. Legacy Mode

-Due to the non-contiguous fashion in vector assignment of the
-existing Linux kernel, this version does not support multiple
-messages regardless of the device function is capable of supporting
-more than one vector. The bus driver initializes only entry 0 of
-this capability if pci_enable_msi(...) is called successfully by
-the device driver.
+In Figure 1.0, a device operates by default in legacy mode. Legacy
+in this context means PCI pin-irq assertion or PCI-Express INTx
+emulation. A successful MSI request (using pci_enable_msi()) switches
+a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem and a newly
+assigned MSI vector will replace dev->irq.
+
+To return to its default mode, a device driver must call
+free_irq() using the allocated MSI vector. The PCI subsystem restores a
+device's dev->irq with the pre-assigned IOAPIC vector and marks the
+released MSI vector as unused. Once it is marked as unused, there is no
+guarantee that the PCI subsystem will reserve this MSI vector for the
+device. Depending on the availability of current PCI vector resources
+and the number of MSI/MSI-X requests from other drivers, this MSI vector
+may be re-assigned. If the PCI subsystem has re-assigned this MSI
+vector to another driver, a request to switch back to MSI mode
+may result in being assigned a different MSI vector or in a failure
+if no more vectors are available.

5.3 Configuring for MSI-X support

-Both the MSI capability structure and the MSI-X capability structure
-share the same above semantics; however, due to the ability of the
-system software to configure each vector of the MSI-X capability
-structure with an independent message address and message data, the
-non-contiguous fashion in vector assignment of the existing Linux
-kernel has no impact on supporting multiple messages on an MSI-X
-capable device functions. By default, as mentioned above, ONE vector
-should be always allocated to the MSI-X capability structure at
-entry 0. The bus driver does not initialize other entries of the
-MSI-X table.
-
-Note that the PCI subsystem should have full control of a MSI-X
-table that resides in Memory Space. The software device driver
-should not access this table.
-
-To request for additional vectors, the device software driver should
-call function msi_alloc_vectors(). It is recommended that the
-software driver should call this function once during the
+Due to the ability of the system software to configure each vector of
+the MSI-X capability structure with an independent message address
+and message data, the non-contiguous fashion in vector assignment of
+the existing Linux kernel has no impact on supporting multiple
+messages on MSI-X capable device functions. Enabling MSI-X on
+a device function's MSI-X capability structure requires its device
+driver to call the function pci_enable_msix() explicitly.
+
+The function pci_enable_msix(), once invoked, enables either
+all or nothing, depending on the current availability of PCI vector
+resources. If the PCI vector resources are available for the number
+of vectors requested by a device driver, this function will configure
+the MSI-X table of the MSI-X capability structure of a device with
+requested messages. For example, a device may be capable of
+supporting a maximum of 32 vectors while its software driver
+may usually request only 4 vectors. It is recommended
+that the device driver should call this function once during the
initialization phase of the device driver.

-The function msi_alloc_vectors(), once invoked, enables either
-all or nothing, depending on the current availability of vector
-resources. If no vector resources are available, the device function
-still works with ONE vector. If the vector resources are available
-for the number of vectors requested by the driver, this function
-will reconfigure the MSI-X capability structure of the device with
-additional messages, starting from entry 1. To emphasize this
-reason, for example, the device may be capable for supporting the
-maximum of 32 vectors while its software driver usually may request
-4 vectors.
-
-For each vector, after this successful call, the device driver is
-responsible to call other functions like request_irq(), enable_irq(),
-etc. to enable this vector with its corresponding interrupt service
-handler. It is the device driver's choice to have all vectors shared
-the same interrupt service handler or each vector with a unique
-interrupt service handler.
-
-In addition to the function msi_alloc_vectors(), another function
-msi_free_vectors() is provided to allow the software driver to
-release a number of vectors back to the vector resources. Once
-invoked, the PCI subsystem disables (masks) each vector released.
-These vectors are no longer valid for the hardware device and its
-software driver to use. Like free_irq, it recommends that the
-device driver should also call msi_free_vectors to release all
-additional vectors previously requested.
-
-int msi_alloc_vectors(struct pci_dev *dev, int *vector, int nvec)
-
-This API enables the software driver to request the PCI subsystem
-for additional messages. Depending on the number of vectors
-available, the PCI subsystem enables either all or nothing.
+Unlike the function pci_enable_msi(), the function pci_enable_msix()
+does not replace the pre-assigned IOAPIC dev->irq with a new MSI
+vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
+into the field vector of each element contained in the second argument.
+Note that the pre-assigned IO-APIC dev->irq is valid only if the device
+operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by
+the device driver to use dev->irq to request interrupt service
+may result in unpredictable behavior.
+
+For each MSI-X vector granted, a device driver is responsible for calling
+other functions like request_irq(), enable_irq(), etc. to enable
+this vector with its corresponding interrupt service handler. It is
+a device driver's choice to assign all vectors with the same
+interrupt service handler or each vector with a unique interrupt
+service handler.
+
+5.3.1 Handling MMIO address space of MSI-X Table
+
+The PCI 3.0 specification has implementation notes that MMIO address
+space for a device's MSI-X structure should be isolated so that the
+software system can set different pages for controlling accesses to
+the MSI-X structure. The implementation of the MSI patch requires the PCI
+subsystem, not a device driver, to maintain full control of the MSI-X
+table/MSI-X PBA and the MMIO address space of the MSI-X table/MSI-X PBA.
+A device driver is prohibited from requesting the MMIO address space
+of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will fail
+to enable MSI-X on its hardware device when it calls the function
+pci_enable_msix().
+
+5.3.2 Handling MSI-X allocation
+
+Determining the number of MSI-X vectors allocated to a function is
+dependent on the number of MSI capable devices and MSI-X capable
+devices populated in the system. The policy of allocating MSI-X
+vectors to a function is defined as follows:
+
+#of MSI-X vectors allocated to a function = (x - y)/z where
+
+x = The number of available PCI vector resources by the time
+ the device driver calls pci_enable_msix(). The PCI vector
+ resources are the sum of the number of unassigned vectors
+ (new) and the number of released vectors when any MSI/MSI-X
+ device driver switches its hardware device back to a legacy
+ mode or is hot-removed. The number of unassigned vectors
+ may exclude some vectors reserved, as defined in parameter
+ NR_HP_RESERVED_VECTORS, for the case where the system is
+ capable of supporting hot-add/hot-remove operations. Users
+ may change the value defined in NR_HP_RESERVED_VECTORS to
+ meet their specific needs.
+
+y = The number of MSI capable devices populated in the system.
+ This policy ensures that each MSI capable device has its
+ vector reserved to avoid the case where some MSI-X capable
+ drivers may attempt to claim all available vector resources.
+
+z = The number of MSI-X capable devices populated in the system.
+ This policy ensures that maximum (x - y) is distributed
+ evenly among MSI-X capable devices.
+
+Note that the PCI subsystem scans y and z during a bus enumeration.
+When the PCI subsystem completes configuring MSI/MSI-X capability
+structure of a device as requested by its device driver, y or z is
+decremented accordingly.
+
+5.3.3 Handling MSI-X shortages
+
+For the case where fewer MSI-X vectors are allocated to a function
+than requested, the function pci_enable_msix() will return the
+maximum number of MSI-X vectors available to the caller. A device
+driver may re-send its request with a number of vectors less than or
+equal to the returned value. For example, if a device driver requests
+5 vectors, but only 3 vectors are available, the pci_enable_msix()
+call will return a value of 3. A function could be
+designed for its driver to use only 3 MSI-X table entries in
+different combinations such as ABC--, A-B-C, A--CB, etc. Note that this
+patch does not support multiple entries with the same vector. Any
+attempt by a device driver to use 5 MSI-X table entries with 3 vectors
+as ABBCC, AABCC, BCCBA, etc. will result in a failure of the function
+pci_enable_msix(). Below are the reasons why supporting multiple
+entries with the same vector is an undesirable solution.
+
+ - The PCI subsystem cannot determine which entry generated
+ the message, and so which entry to mask/unmask, while handling
+ the software driver ISR. Attempting to walk through all MSI-X
+ table entries (2048 max) to mask/unmask any matching vector
+ is an undesirable solution.
+
+ - Walking through all MSI-X table entries (2048 max) to handle
+ SMP affinity of any matching vector is an undesirable solution.
+
+5.3.4 API pci_enable_msix
+
+int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)
+
+This API enables a device driver to request that the PCI subsystem
+enable MSI-X messages on its hardware device. Depending on
+the availability of PCI vector resources, the PCI subsystem enables
+either all or nothing.

Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
+
+Argument entries is a pointer of unsigned integer type. The number of
+elements is indicated in argument nvec. The content of each element
+will be mapped to the following struct defined in drivers/pci/msi.h.
+
+struct msix_entry {
+ __u32 vector : 16; /* kernel uses to write alloc vector */
+ __u32 entry : 16; /* driver uses to specify entry */
+};
+
+A device driver is responsible for initializing the field entry of
+each element with a unique entry supported by the MSI-X table. Otherwise,
+-EINVAL will be returned as a result. A successful return of zero
+indicates that the PCI subsystem has completed initializing each of the
+requested entries of the MSI-X table with message address and message data.
+Last but not least, the PCI subsystem will write the 1:1
+vector-to-entry mapping into the field vector of each element. A
+device driver is responsible for keeping track of allocated MSI-X
+vectors in its internal data structure.
+
Argument nvec is an integer indicating the number of messages
requested.
-A return of zero indicates that the number of allocated vector is
-successfully allocated. Otherwise, indicate resources not
-available.
-
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
-
-This API enables the software driver to inform the PCI subsystem
-that it is willing to release a number of vectors back to the
-MSI resource pool. Once invoked, the PCI subsystem disables each
-MSI-X entry associated with each vector stored in the argument 2.
-These vectors are no longer valid for the hardware device and
-its software driver to use.

-Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
-Argument nvec is an integer indicating the number of messages
-released.
-A return of zero indicates that the number of allocated vectors
-is successfully released. Otherwise, indicates a failure.
+A return of zero indicates that the requested number of MSI-X vectors
+was successfully allocated. A return of greater than zero indicates
+an MSI-X vector shortage. A return of less than zero indicates
+a failure. This failure may be a result of duplicate entries
+specified in the second argument, or a result of no available vector,
+or a result of failing to initialize MSI-X table entries.
+
+5.3.5 MSI-X mode vs. legacy mode diagram
+
+The diagram below shows the events that switch the interrupt
+mode on the MSI-X capable device function between MSI-X mode and
+PIN-IRQ assertion mode (legacy).
+
+ ------------ pci_enable_msix(,,n) ------------------------
+ | | <=============== | |
+ | MSI-X MODE | | PIN-IRQ ASSERTION MODE |
+ | | ===============> | |
+ ------------ (n)free_irq ------------------------
+
+Figure 2.0 MSI-X Mode vs. Legacy Mode
+
+In Figure 2.0, a device operates by default in legacy mode. A
+successful MSI-X request (using pci_enable_msix()) switches a
+device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem; however,
+unlike MSI mode, the PCI subsystem will not replace dev->irq with an
+assigned MSI-X vector because the PCI subsystem already writes the 1:1
+vector-to-entry mapping into the field vector of each element
+specified in the second argument.
+
+To return to its default mode, a device driver must call
+free_irq() on all allocated MSI-X vectors. Unlike MSI mode, the PCI
+subsystem switches a device function back to its default legacy mode
+if and only if its device driver successfully releases all allocated
+MSI-X vectors (n) through n free_irq() calls.
+
+Note that if a device still operates in MSI-X mode, its device
+driver can use request_irq/free_irq on any subset of the n vectors. When
+the PCI subsystem detects all MSI-X vectors being released by a device
+driver, it will switch a function's interrupt mode from MSI-X mode
+to legacy mode and mark all allocated MSI-X vectors as unused. Once
+they are marked as unused, there is no guarantee that the PCI subsystem
+will reserve these MSI-X vectors for a device. Depending on the
+availability of current PCI vector resources and the number of
+MSI/MSI-X requests from other drivers, these MSI-X vectors may be
+re-assigned. If the PCI subsystem has re-assigned
+these MSI-X vectors to another driver, a request to switch back to
+MSI-X mode may result in being assigned another set of MSI-X vectors
+or in a failure.
+
+5.4 Handling function implementing both MSI and MSI-X capabilities
+
+For the case where a function implements both MSI and MSI-X
+capabilities, the PCI subsystem enables a device to run either in MSI
+mode or MSI-X mode but not both. A device driver determines whether it
+wants MSI or MSI-X enabled on its hardware device. Once a device
+driver requests MSI, for example, it is prohibited from requesting
+MSI-X; in other words, a device driver is not permitted to ping-pong
+between MSI mode and MSI-X mode at run time.

-5.4 Hardware requirements for MSI support
-MSI support requires support from both system hardware and
+5.5 Hardware requirements for MSI/MSI-X support
+MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

-5.4.1 System hardware support
+5.5.1 System hardware support
Since the target of MSI address is the local APIC CPU, enabling
-MSI support in Linux kernel is dependent on whether existing
+MSI/MSI-X support in Linux kernel is dependent on whether existing
system hardware supports local APIC. Users should verify their
system whether it runs when CONFIG_X86_LOCAL_APIC=y.

@@ -231,14 +380,14 @@
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_USE_VECTOR enables the VECTOR based scheme and
the option for MSI-capable device drivers to selectively enable
-MSI (using pci_enable_msi as described below).
+MSI/MSI-X.

-Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI
-vector is allocated new during runtime and MSI support does not
-depend on BIOS support. This key independency enables MSI support
-on future IOxAPIC free platform.
+Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
+vector is allocated new during runtime and MSI/MSI-X support does not
+depend on BIOS support. This key independency enables MSI/MSI-X
+support on future IOxAPIC free platform.

-5.4.2 Device hardware support
+5.5.2 Device hardware support
The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
@@ -249,17 +398,19 @@
MSI-capable hardware is responsible for whether calling
pci_enable_msi or not. A return of zero indicates the kernel
successfully initializes the MSI/MSI-X capability structure of the
-device funtion. The device function is now running on MSI mode.
+device function. The device function is now running in MSI/MSI-X mode.

-5.5 How to tell whether MSI is enabled on device function
+5.6 How to tell whether MSI/MSI-X is enabled on device function

-At the driver level, a return of zero from pci_enable_msi(...)
-indicates to the device driver that its device function is
-initialized successfully and ready to run in MSI mode.
+At the driver level, a return of zero from the function call of
+pci_enable_msi()/pci_enable_msix() indicates to a device driver that
+its device function is initialized successfully and ready to run in
+MSI/MSI-X mode.

At the user level, users can use command 'cat /proc/interrupts'
-to display the vector allocated for the device and its interrupt
-mode, as shown below.
+to display the vector allocated for a device and its interrupt
+MSI/MSI-X mode ("PCI-MSI"/"PCI-MSI-X"). The example below shows MSI mode
+enabled on a SCSI Adaptec 39320D Ultra320.

CPU0 CPU1
0: 324639 0 IO-APIC-edge timer
diff -urN linux-2.6.7/drivers/pci/msi.c patch-2.6.7-msix/drivers/pci/msi.c
--- linux-2.6.7/drivers/pci/msi.c 2004-05-09 22:33:20.000000000 -0400
+++ patch-2.6.7-msix/drivers/pci/msi.c 2004-06-22 11:53:03.000000000 -0400
@@ -179,6 +179,18 @@

static unsigned int startup_msi_irq_w_maskbit(unsigned int vector)
{
+ struct msi_desc *entry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ entry = msi_desc[vector];
+ if (!entry || !entry->dev) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return 0;
+ }
+ entry->msi_attrib.state = 1; /* Mark it active */
+ spin_unlock_irqrestore(&msi_lock, flags);
+
unmask_MSI_irq(vector);
return 0; /* never anything pending */
}
@@ -200,7 +212,7 @@
* which implement the MSI-X Capability Structure.
*/
static struct hw_interrupt_type msix_irq_type = {
- .typename = "PCI MSI-X",
+ .typename = "PCI-MSI-X",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -216,7 +228,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_w_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -232,7 +244,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_wo_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_wo_maskbit,
.shutdown = shutdown_msi_irq_wo_maskbit,
.enable = enable_msi_irq_wo_maskbit,
@@ -265,6 +277,7 @@
msi_address->lo_address.value |= (MSI_TARGET_CPU << MSI_TARGET_CPU_SHIFT);
}

+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign);
static int assign_msi_vector(void)
{
static int new_vector_avail = 1;
@@ -278,6 +291,8 @@
spin_lock_irqsave(&msi_lock, flags);

if (!new_vector_avail) {
+ int free_vector = 0;
+
/*
* vector_irq[] = -1 indicates that this specific vector is:
* - assigned for MSI (since MSI have no associated IRQ) or
@@ -294,13 +309,34 @@
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (vector_irq[vector] != 0)
continue;
- vector_irq[vector] = -1;
- nr_released_vectors--;
- spin_unlock_irqrestore(&msi_lock, flags);
- return vector;
+ free_vector = vector;
+ if (!msi_desc[vector])
+ break;
+ else
+ continue;
}
+ if (!free_vector) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return -EBUSY;
+ }
+ vector_irq[free_vector] = -1;
+ nr_released_vectors--;
spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
+ if (msi_desc[free_vector] != NULL) {
+ struct pci_dev *dev;
+ int tail;
+
+ /* free all linked vectors before re-assign */
+ do {
+ spin_lock_irqsave(&msi_lock, flags);
+ dev = msi_desc[free_vector]->dev;
+ tail = msi_desc[free_vector]->link.tail;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ msi_free_vector(dev, tail, 1);
+ } while (free_vector != tail);
+ }
+
+ return free_vector;
}
vector = assign_irq_vector(AUTO_ASSIGN);
last_alloc_vector = vector;
@@ -333,6 +369,15 @@
printk(KERN_INFO "WARNING: MSI INIT FAILURE\n");
return status;
}
+ last_alloc_vector = assign_irq_vector(AUTO_ASSIGN);
+ if (last_alloc_vector < 0) {
+ pci_msi_enable = 0;
+ printk(KERN_INFO "WARNING: ALL VECTORS ARE BUSY\n");
+ status = -EBUSY;
+ return status;
+ }
+ vector_irq[last_alloc_vector] = 0;
+ nr_released_vectors++;
printk(KERN_INFO "MSI INIT SUCCESS\n");

return status;
@@ -431,7 +476,7 @@
}
}

-static int msi_lookup_vector(struct pci_dev *dev)
+static int msi_lookup_vector(struct pci_dev *dev, int type)
{
int vector;
unsigned long flags;
@@ -439,11 +484,11 @@
spin_lock_irqsave(&msi_lock, flags);
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (!msi_desc[vector] || msi_desc[vector]->dev != dev ||
- msi_desc[vector]->msi_attrib.entry_nr ||
+ msi_desc[vector]->msi_attrib.type != type ||
msi_desc[vector]->msi_attrib.default_vector != dev->irq)
- continue; /* not entry 0, skip */
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- /* This pre-assigned entry-0 MSI vector for this device
+ /* This pre-assigned MSI vector for this device
already exits. Override dev->irq with this vector */
dev->irq = vector;
return 0;
@@ -458,10 +503,9 @@
if (!dev)
return;

- if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0) {
- nr_reserved_vectors++;
+ if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0)
nr_msix_devices++;
- } else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
+ else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
nr_reserved_vectors++;
}

@@ -483,19 +527,8 @@
u32 control;

pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
- if (!pos)
- return -EINVAL;
-
dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
2, &control);
- if (control & PCI_MSI_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
- return 0;
- }
/* MSI Entry Initialization */
if (!(entry = alloc_msi_entry()))
return -ENOMEM;
@@ -504,11 +537,14 @@
kmem_cache_free(msi_cachep, entry);
return -EBUSY;
}
+ entry->link.head = vector;
+ entry->link.tail = vector;
entry->msi_attrib.type = PCI_CAP_ID_MSI;
+ entry->msi_attrib.state = 1; /* Mark it active */
entry->msi_attrib.entry_nr = 0;
entry->msi_attrib.maskbit = is_mask_bit_support(control);
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
+ entry->msi_attrib.default_vector = dev->irq; /* Save IOAPIC IRQ */
+ dev->irq = vector;
entry->dev = dev;
if (is_mask_bit_support(control)) {
entry->mask_base = msi_mask_bits_reg(pos,
@@ -556,135 +592,170 @@
* @dev: pointer to the pci_dev data structure of MSI-X device function
*
* Setup the MSI-X capability structure of device funtion with a
- * single MSI-X vector. A return of zero indicates the successful setup
- * of an entry zero with the new MSI-X vector or non-zero for otherwise.
- * To request for additional MSI-X vectors, the device drivers are
- * required to utilize the following supported APIs:
- * 1) msi_alloc_vectors(...) for requesting one or more MSI-X vectors
- * 2) msi_free_vectors(...) for releasing one or more MSI-X vectors
- * back to PCI subsystem before calling free_irq(...)
+ * single MSI-X vector. A return of zero indicates the successful setup of
+ * requested MSI-X entries with allocated vectors or non-zero for otherwise.
**/
-static int msix_capability_init(struct pci_dev *dev)
+static int msix_capability_init(struct pci_dev *dev,
+ struct msix_entry *entries, int nvec)
{
- struct msi_desc *entry;
+ struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
struct msg_address address;
struct msg_data data;
- int vector = 0, pos, dev_msi_cap;
+ int vector, pos, i, j, nr_entries, temp = 0;
u32 phys_addr, table_offset;
u32 control;
u8 bir;
void *base;
-
+
pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- if (!pos)
- return -EINVAL;
-
/* Request & Map MSI-X table region */
dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2,
&control);
- if (control & PCI_MSIX_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
- return 0;
- }
-
- dev_msi_cap = multi_msix_capable(control);
+ nr_entries = multi_msix_capable(control);
dev->bus->ops->read(dev->bus, dev->devfn,
msix_table_offset_reg(pos), 4, &table_offset);
bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
phys_addr = pci_resource_start (dev, bir);
phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK);
if (!request_mem_region(phys_addr,
- dev_msi_cap * PCI_MSIX_ENTRY_SIZE,
- "MSI-X iomap Failure"))
+ nr_entries * PCI_MSIX_ENTRY_SIZE,
+ "MSI-X vector table"))
return -ENOMEM;
- base = ioremap_nocache(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
- if (base == NULL)
- goto free_region;
- /* MSI Entry Initialization */
- entry = alloc_msi_entry();
- if (!entry)
- goto free_iomap;
- if ((vector = get_msi_vector(dev)) < 0)
- goto free_entry;
-
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = 0;
- entry->msi_attrib.maskbit = 1;
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
- entry->dev = dev;
- entry->mask_base = (unsigned long)base;
- /* Replace with MSI handler */
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
- MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- writel(address.lo_address.value, base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
- /* Initialize all entries from 1 up to 0 */
- for (pos = 1; pos < dev_msi_cap; pos++) {
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ base = ioremap_nocache(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ if (base == NULL) {
+ release_mem_region(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ return -ENOMEM;
+ }
+ /* MSI-X Table Initialization */
+ for (i = 0; i < nvec; i++) {
+ entry = alloc_msi_entry();
+ if (!entry)
+ break;
+ if ((vector = get_msi_vector(dev)) < 0)
+ break;
+
+ j = (entries + i)->entry;
+ (entries + i)->vector = vector;
+ entry->msi_attrib.type = PCI_CAP_ID_MSIX;
+ entry->msi_attrib.state = 1; /* Mark it active */
+ entry->msi_attrib.entry_nr = j;
+ entry->msi_attrib.maskbit = 1;
+ entry->msi_attrib.default_vector = dev->irq;
+ entry->dev = dev;
+ entry->mask_base = (unsigned long)base;
+ if (!head) {
+ entry->link.head = vector;
+ entry->link.tail = vector;
+ head = entry;
+ } else {
+ entry->link.head = temp;
+ entry->link.tail = tail->link.tail;
+ tail->link.tail = vector;
+ head->link.head = vector;
+ }
+ temp = vector;
+ tail = entry;
+ /* Replace with MSI-X handler */
+ irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
+ /* Configure MSI-X capability structure */
+ msi_address_init(&address);
+ msi_data_init(&data, vector);
+ entry->msi_attrib.current_cpu =
+ ((address.lo_address.u.dest_id >>
+ MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
+ writel(address.lo_address.value,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(address.hi_address,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(*(u32*)&data,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_DATA_OFFSET);
+ attach_msi_entry(entry, vector);
}
- attach_msi_entry(entry, vector);
- /* Set MSI enabled bits */
+ if (i != nvec) {
+ i--;
+ for (; i >= 0; i--) {
+ vector = (entries + i)->vector;
+ msi_free_vector(dev, vector, 0);
+ (entries + i)->vector = 0;
+ }
+ return -EBUSY;
+ }
+ /* Set MSI-X enabled bits */
enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
-
+
return 0;
-
-free_entry:
- kmem_cache_free(msi_cachep, entry);
-free_iomap:
- iounmap(base);
-free_region:
- release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
-
- return ((vector < 0) ? -EBUSY : -ENOMEM);
}

/**
- * pci_enable_msi - configure device's MSI(X) capability structure
- * @dev: pointer to the pci_dev data structure of MSI(X) device function
+ * pci_enable_msi - configure device's MSI capability structure
+ * @dev: pointer to the pci_dev data structure of MSI device function
*
- * Setup the MSI/MSI-X capability structure of device function with
- * a single MSI(X) vector upon its software driver call to request for
- * MSI(X) mode enabled on its hardware device function. A return of zero
- * indicates the successful setup of an entry zero with the new MSI(X)
+ * Setup the MSI capability structure of device function with
+ * a single MSI vector upon its software driver call to request for
+ * MSI mode enabled on its hardware device function. A return of zero
+ * indicates the successful setup of an entry zero with the new MSI
* vector or non-zero for otherwise.
**/
int pci_enable_msi(struct pci_dev* dev)
{
- int status = -EINVAL;
+ int pos, temp = dev->irq, status = -EINVAL;
+ u32 control;

if (!pci_msi_enable || !dev)
return status;

- if (msi_init() < 0)
- return -ENOMEM;
+ if ((status = msi_init()) < 0)
+ return status;

- if ((status = msix_capability_init(dev)) == -EINVAL)
- status = msi_capability_init(dev);
- if (!status)
- nr_reserved_vectors--;
+ if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSI)))
+ return -EINVAL;
+
+ dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
+ 2, &control);
+ if (control & PCI_MSI_FLAGS_ENABLE)
+ return 0; /* Already in MSI mode */
+
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ /* Lookup Success */
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ if (!vector_irq[dev->irq]) {
+ msi_desc[dev->irq]->msi_attrib.state = 1;
+ vector_irq[dev->irq] = -1;
+ nr_released_vectors--;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
+ return 0;
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ dev->irq = temp;
+ }
+ /* Check whether driver already requested for MSI-X vectors */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ printk(KERN_INFO "Can't enable MSI. Device already had MSI-X vectors assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+ status = msi_capability_init(dev);
+ if (!status) {
+ if (!pos)
+ nr_reserved_vectors--; /* Only MSI capable */
+ else if (nr_msix_devices > 0)
+ nr_msix_devices--; /* Both MSI and MSI-X capable,
+ but choose enabling MSI */
+ }

return status;
}

-static int msi_free_vector(struct pci_dev* dev, int vector);
static void pci_disable_msi(unsigned int vector)
{
- int head, tail, type, default_vector;
+ int type, default_vector;
struct msi_desc *entry;
struct pci_dev *dev;
unsigned long flags;
@@ -697,168 +768,235 @@
}
dev = entry->dev;
type = entry->msi_attrib.type;
- head = entry->link.head;
- tail = entry->link.tail;
+ entry->msi_attrib.state = 0; /* Mark it not active */
default_vector = entry->msi_attrib.default_vector;
spin_unlock_irqrestore(&msi_lock, flags);
-
- disable_msi_mode(dev, pci_find_capability(dev, type), type);
- /* Restore dev->irq to its default pin-assertion vector */
- dev->irq = default_vector;
- if (type == PCI_CAP_ID_MSIX && head != tail) {
- /* Bad driver, which do not call msi_free_vectors before exit.
- We must do a cleanup here */
- while (1) {
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- head = entry->link.head;
- tail = entry->link.tail;
+ switch (type) {
+ case PCI_CAP_ID_MSI:
+ spin_lock_irqsave(&msi_lock, flags);
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ case PCI_CAP_ID_MSIX:
+ spin_lock_irqsave(&msi_lock, flags);
+ while (vector != entry->link.tail) {
+ entry = msi_desc[entry->link.tail];
+ if (!entry->msi_attrib.state)
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- if (tail == head)
- break;
- if (msi_free_vector(dev, entry->link.tail))
- break;
+ /*
+ * Device still operates in MSI-X mode. Do not
+ * switch interrupt mode
+ */
+ return;
}
+ entry = msi_desc[vector];
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ while (vector != entry->link.tail) {
+ vector_irq[entry->link.tail] = 0; /* Mark it free */
+ nr_released_vectors++;
+ entry = msi_desc[entry->link.tail];
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ default:
+ return;
}
+ /* Restore dev->irq to its default pin-assertion vector */
+ dev->irq = default_vector;
+ disable_msi_mode(dev, pci_find_capability(dev, type), type);
}

-static int msi_alloc_vector(struct pci_dev* dev, int head)
+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign)
{
struct msi_desc *entry;
- struct msg_address address;
- struct msg_data data;
- int i, offset, pos, dev_msi_cap, vector;
- u32 low_address, control;
+ int head, entry_nr, type;
unsigned long base = 0L;
unsigned long flags;

spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry) {
+ entry = msi_desc[vector];
+ if (!entry || entry->dev != dev) {
spin_unlock_irqrestore(&msi_lock, flags);
return -EINVAL;
}
+ type = entry->msi_attrib.type;
+ entry_nr = entry->msi_attrib.entry_nr;
+ head = entry->link.head;
base = entry->mask_base;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
- 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- for (i = 1; i < dev_msi_cap; i++) {
- if (!(low_address = readl(base + i * PCI_MSIX_ENTRY_SIZE)))
- break;
+ msi_desc[entry->link.head]->link.tail = entry->link.tail;
+ msi_desc[entry->link.tail]->link.head = entry->link.head;
+ entry->dev = NULL;
+ if (!reassign) {
+ vector_irq[vector] = 0;
+ nr_released_vectors++;
}
- if (i >= dev_msi_cap)
- return -EINVAL;
+ msi_desc[vector] = NULL;
+ spin_unlock_irqrestore(&msi_lock, flags);

- /* MSI Entry Initialization */
- if (!(entry = alloc_msi_entry()))
- return -ENOMEM;
+ kmem_cache_free(msi_cachep, entry);

- if ((vector = get_new_vector()) < 0) {
- kmem_cache_free(msi_cachep, entry);
- return vector;
+ if (type == PCI_CAP_ID_MSIX) {
+ if (!reassign)
+ writel(1, base +
+ entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+
+ if (head == vector) {
+ /*
+ * Detect last MSI-X vector to be released.
+ * Release the MSI-X memory-mapped table.
+ */
+ int pos, nr_entries;
+ u32 phys_addr, table_offset;
+ u32 control;
+ u8 bir;
+
+ pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msi_control_reg(pos), 2, &control);
+ nr_entries = multi_msix_capable(control);
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msix_table_offset_reg(pos), 4, &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr,
+ nr_entries * PCI_MSIX_ENTRY_SIZE);
+ }
}
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = i;
- entry->msi_attrib.maskbit = 1;
- entry->dev = dev;
- entry->link.head = head;
- entry->mask_base = base;
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
- MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(address.lo_address.value, base + offset +
- PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + offset +
- PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + offset + PCI_MSIX_ENTRY_DATA_OFFSET);
- writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- attach_msi_entry(entry, vector);

- return vector;
+ return 0;
}

-static int msi_free_vector(struct pci_dev* dev, int vector)
+static int reroute_msix_table(int head, struct msix_entry *entries, int *nvec)
{
- struct msi_desc *entry;
- int entry_nr, type;
+ int vector = head, tail = 0;
+ int i = 0, j = 0, nr_entries = 0;
unsigned long base = 0L;
unsigned long flags;
-
+
spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- if (!entry || entry->dev != dev) {
+ while (head != tail) {
+ nr_entries++;
+ tail = msi_desc[vector]->link.tail;
+ if (entries->entry == msi_desc[vector]->msi_attrib.entry_nr)
+ j = vector;
+ vector = tail;
+ }
+ if (*nvec > nr_entries) {
spin_unlock_irqrestore(&msi_lock, flags);
+ *nvec = nr_entries;
return -EINVAL;
}
- type = entry->msi_attrib.type;
- entry_nr = entry->msi_attrib.entry_nr;
- base = entry->mask_base;
- if (entry->link.tail != entry->link.head) {
- msi_desc[entry->link.head]->link.tail = entry->link.tail;
- if (entry->link.tail)
- msi_desc[entry->link.tail]->link.head = entry->link.head;
+ vector = ((j > 0) ? j : head);
+ for (i = 0; i < *nvec; i++) {
+ j = msi_desc[vector]->msi_attrib.entry_nr;
+ msi_desc[vector]->msi_attrib.state = 1; /* Mark it active */
+ vector_irq[vector] = -1; /* Mark it busy */
+ nr_released_vectors--;
+ (entries + i)->vector = vector;
+ if (j != (entries + i)->entry) {
+ base = msi_desc[vector]->mask_base;
+ msi_desc[vector]->msi_attrib.entry_nr =
+ (entries + i)->entry;
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+ writel( (readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET) & 0xff00) | vector,
+ base + (entries+i)->entry*PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET);
+ }
+ vector = msi_desc[vector]->link.tail;
}
- entry->dev = NULL;
- vector_irq[vector] = 0;
- nr_released_vectors++;
- msi_desc[vector] = NULL;
spin_unlock_irqrestore(&msi_lock, flags);
-
- kmem_cache_free(msi_cachep, entry);
- if (type == PCI_CAP_ID_MSIX) {
- int offset;
-
- offset = entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- writel(0, base + offset + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- }
-
+
return 0;
}

/**
- * msi_alloc_vectors - allocate additional MSI-X vectors
+ * pci_enable_msix - configure device's MSI-X capability structure
* @dev: pointer to the pci_dev data structure of MSI-X device function
- * @vector: pointer to an array of new allocated MSI-X vectors
+ * @data: pointer to an array of MSI-X entries
* @nvec: number of MSI-X vectors requested for allocation by device driver
*
- * Allocate additional MSI-X vectors requested by device driver. A
- * return of zero indicates the successful setup of MSI-X capability
- * structure with new allocated MSI-X vectors or non-zero for otherwise.
+ * Setup the MSI-X capability structure of device function with the number
+ * of requested vectors upon its software driver call to request for
+ * MSI-X mode enabled on its hardware device function. A return of zero
+ * indicates the successful configuration of MSI-X capability structure
+ * with new allocated MSI-X vectors. A return of < 0 indicates a failure.
+ * Or a return of > 0 indicates that driver request is exceeding the number
+ * of vectors available. Driver should use the returned value to re-send
+ * its request.
**/
-int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec)
+int pci_enable_msix(struct pci_dev* dev, unsigned int *data, int nvec)
{
- struct msi_desc *entry;
- int i, head, pos, vec, free_vectors, alloc_vectors;
- int *vectors = (int *)vector;
+ struct msix_entry *entries = (struct msix_entry *)data;
+ int status, pos, nr_entries, free_vectors;
+ int i, j, temp;
u32 control;
unsigned long flags;

- if (!pci_msi_enable || !dev)
+ if (!pci_msi_enable || !dev || !data)
return -EINVAL;
-
+
+ if ((status = msi_init()) < 0)
+ return status;
+
if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)))
return -EINVAL;
-
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control);
- if (nvec > multi_msix_capable(control))
- return -EINVAL;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev || /* legal call */
- entry->msi_attrib.type != PCI_CAP_ID_MSIX || /* must be MSI-X */
- entry->link.head != entry->link.tail) { /* already multi */
- spin_unlock_irqrestore(&msi_lock, flags);
+
+ dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
+ 2, &control);
+ if (control & PCI_MSIX_FLAGS_ENABLE)
+ return -EINVAL; /* Already in MSI-X mode */
+
+ nr_entries = multi_msix_capable(control);
+ if (nvec > nr_entries)
return -EINVAL;
+
+ /* Check for any invalid entries */
+ for (i = 0; i < nvec; i++) {
+ if ((entries + i)->entry >= nr_entries)
+ return -EINVAL; /* invalid entry */
+ for (j = i + 1; j < nvec; j++) {
+ if ((entries + i)->entry == (entries + j)->entry)
+ return -EINVAL; /* duplicate entry */
+ }
+ }
+ temp = dev->irq;
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ /* Lookup Success */
+ nr_entries = nvec;
+ /* Reroute MSI-X table */
+ if (reroute_msix_table(dev->irq, entries, &nr_entries)) {
+ /* #requested > #previous-assigned */
+ dev->irq = temp;
+ return nr_entries;
+ }
+ dev->irq = temp;
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
+ return 0;
}
+ /* Check whether driver already requested for MSI vector */
+ if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ printk(KERN_INFO "Can't enable MSI-X. Device already had MSI vector assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&msi_lock, flags);
/*
* msi_lock is provided to ensure that enough vectors resources are
* available before granting.
@@ -874,71 +1012,18 @@
free_vectors /= nr_msix_devices;
spin_unlock_irqrestore(&msi_lock, flags);

- if (nvec > free_vectors)
- return -EBUSY;
+ if (nvec > free_vectors) {
+ if (free_vectors > 0)
+ return free_vectors;
+ else
+ return -EBUSY;
+ }

- alloc_vectors = 0;
- head = dev->irq;
- for (i = 0; i < nvec; i++) {
- if ((vec = msi_alloc_vector(dev, head)) < 0)
- break;
- *(vectors + i) = vec;
- head = vec;
- alloc_vectors++;
- }
- if (alloc_vectors != nvec) {
- for (i = 0; i < alloc_vectors; i++) {
- vec = *(vectors + i);
- msi_free_vector(dev, vec);
- }
- spin_lock_irqsave(&msi_lock, flags);
- msi_desc[dev->irq]->link.tail = msi_desc[dev->irq]->link.head;
- spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
- }
- if (nr_msix_devices > 0)
+ status = msix_capability_init(dev, entries, nvec);
+ if (!status && nr_msix_devices > 0)
nr_msix_devices--;
-
- return 0;
-}
-
-/**
- * msi_free_vectors - reclaim MSI-X vectors to unused state
- * @dev: pointer to the pci_dev data structure of MSI-X device function
- * @vector: pointer to an array of released MSI-X vectors
- * @nvec: number of MSI-X vectors requested for release by device driver
- *
- * Reclaim MSI-X vectors released by device driver to unused state,
- * which may be used later on. A return of zero indicates the
- * success or non-zero for otherwise. Device driver should call this
- * before calling function free_irq.
- **/
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
-{
- struct msi_desc *entry;
- int i;
- unsigned long flags;
-
- if (!pci_msi_enable)
- return -EINVAL;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev ||
- entry->msi_attrib.type != PCI_CAP_ID_MSIX ||
- entry->link.head == entry->link.tail) { /* Nothing to free */
- spin_unlock_irqrestore(&msi_lock, flags);
- return -EINVAL;
- }
- spin_unlock_irqrestore(&msi_lock, flags);
-
- for (i = 0; i < nvec; i++) {
- if (*(vector + i) == dev->irq)
- continue;/* Don't free entry 0 if mistaken by driver */
- msi_free_vector(dev, *(vector + i));
- }
-
- return 0;
+
+ return status;
}

/**
@@ -952,62 +1037,67 @@
**/
void msi_remove_pci_irq_vectors(struct pci_dev* dev)
{
- struct msi_desc *entry;
- int type, temp;
+ int state, pos, temp;
unsigned long flags;
-
+
if (!pci_msi_enable || !dev)
return;
-
- if (!pci_find_capability(dev, PCI_CAP_ID_MSI)) {
- if (!pci_find_capability(dev, PCI_CAP_ID_MSIX))
- return;
- }
- temp = dev->irq;
- if (msi_lookup_vector(dev))
- return;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev) {
+
+ temp = dev->irq; /* Save IOAPIC IRQ */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSI)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ spin_lock_irqsave(&msi_lock, flags);
+ state = msi_desc[dev->irq]->msi_attrib.state;
spin_unlock_irqrestore(&msi_lock, flags);
- return;
- }
- type = entry->msi_attrib.type;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- msi_free_vector(dev, dev->irq);
- if (type == PCI_CAP_ID_MSIX) {
- int i, pos, dev_msi_cap;
- u32 phys_addr, table_offset;
- u32 control;
- u8 bir;
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- dev->bus->ops->read(dev->bus, dev->devfn,
- msix_table_offset_reg(pos), 4, &table_offset);
- bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
- phys_addr = pci_resource_start (dev, bir);
- phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK);
- for (i = FIRST_DEVICE_VECTOR; i < NR_IRQS; i++) {
+ if (state)
+ printk("WARNING! Driver fails freeing MSI vector[%d]\n",
+ dev->irq);
+ else /* Release MSI vector assigned to this device */
+ msi_free_vector(dev, dev->irq, 0);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
+ }
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ int vector, head, tail = 0, warning = 0;
+ unsigned long base = 0L;
+
+ vector = head = dev->irq;
+ while (head != tail) {
spin_lock_irqsave(&msi_lock, flags);
- if (!msi_desc[i] || msi_desc[i]->dev != dev) {
- spin_unlock_irqrestore(&msi_lock, flags);
- continue;
- }
+ state = msi_desc[vector]->msi_attrib.state;
+ tail = msi_desc[vector]->link.tail;
+ base = msi_desc[vector]->mask_base;
spin_unlock_irqrestore(&msi_lock, flags);
- msi_free_vector(dev, i);
+ if (state) {
+ printk("WARNING! Driver fails freeing MSI-X vector[%d]\n",
+ vector);
+ warning = 1;
+ } else if (vector != head) /* Release MSI-X vector */
+ msi_free_vector(dev, vector, 0);
+ vector = tail;
+ }
+ msi_free_vector(dev, vector, 0);
+ if (warning) {
+ /* Force to release the MSI-X memory-mapped table */
+ u32 phys_addr, table_offset;
+ u32 control;
+ u8 bir;
+
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msi_control_reg(pos), 2, &control);
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msix_table_offset_reg(pos), 4, &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr, PCI_MSIX_ENTRY_SIZE *
+ multi_msix_capable(control));
}
- writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- iounmap((void*)entry->mask_base);
- release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
}
- dev->irq = temp;
- nr_reserved_vectors++;
}

EXPORT_SYMBOL(pci_enable_msi);
-EXPORT_SYMBOL(msi_alloc_vectors);
-EXPORT_SYMBOL(msi_free_vectors);
+EXPORT_SYMBOL(pci_enable_msix);
diff -urN linux-2.6.7/drivers/pci/msi.h patch-2.6.7-msix/drivers/pci/msi.h
--- linux-2.6.7/drivers/pci/msi.h 2004-05-09 22:32:53.000000000 -0400
+++ patch-2.6.7-msix/drivers/pci/msi.h 2004-06-22 10:16:09.000000000 -0400
@@ -90,6 +90,11 @@
#define MSI_LOGICAL_MODE 1
#define MSI_REDIRECTION_HINT_MODE 0

+struct msix_entry {
+ __u32 vector : 16; /* kernel uses to write allocated vector */
+ __u32 entry : 16; /* driver uses to specify entry, OS writes */
+};
+
struct msg_data {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u32 vector : 8;
@@ -140,7 +145,8 @@
struct {
__u8 type : 5; /* {0: unused, 5h:MSI, 11h:MSI-X} */
__u8 maskbit : 1; /* mask-pending bit supported ? */
- __u8 reserved: 2; /* reserved */
+ __u8 state : 1; /* {0: free, 1: busy} */
+ __u8 reserved: 1; /* reserved */
__u8 entry_nr; /* specific enabled entry */
__u8 default_vector; /* default pre-assigned vector */
__u8 current_cpu; /* current destination cpu */
diff -urN linux-2.6.7/include/linux/pci.h patch-2.6.7-msix/include/linux/pci.h
--- linux-2.6.7/include/linux/pci.h 2004-06-22 10:11:40.000000000 -0400
+++ patch-2.6.7-msix/include/linux/pci.h 2004-06-22 10:16:09.000000000 -0400
@@ -789,13 +789,14 @@
#ifndef CONFIG_PCI_USE_VECTOR
static inline void pci_scan_msi_device(struct pci_dev *dev) {}
static inline int pci_enable_msi(struct pci_dev *dev) {return -1;}
+static inline int pci_enable_msix(struct pci_dev* dev,
+ unsigned int *data, int nvec) {return -1;}
static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
#else
extern void pci_scan_msi_device(struct pci_dev *dev);
extern int pci_enable_msi(struct pci_dev *dev);
+extern int pci_enable_msix(struct pci_dev* dev, unsigned int *data, int nvec);
extern void msi_remove_pci_irq_vectors(struct pci_dev *dev);
-extern int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec);
-extern int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec);
#endif

#endif /* CONFIG_PCI */


2004-06-22 23:19:06

by Nguyen, Tom L

[permalink] [raw]
Subject: RE: [PATCH]2.6.7 MSI-X Update

On Tuesday, June 22, 2004 Eli Cohen wrote:
>I encountered the same problem and noticed that after unloading the
>driver, the msix enable bit remains set (I could verify that by doing
>cat /proc/bus/pci/<my device> | hexdump). I managed to overcome this
>problem by clearing the msix enable bit as part of the cleanup when I
>unload the driver.

According to the PCI specs, the software driver is prohibited from
making any change to the control register of the MSI/MSI-X capability
structure. I think the MSI-X update patch will fix the problem.
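
For reference, the check Eli describes (looking at the MSI-X enable bit
in a config space dump) can be sketched in user space. This is only a
minimal sketch, assuming a raw little-endian dump of the device's
configuration space such as the bytes read from /proc/bus/pci; the
function name msix_enabled_in_dump() and the macro names are
hypothetical and are not part of the patch. It walks the capability
list and tests bit 15 of the MSI-X Message Control register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CFG_STATUS_OFFSET    0x06   /* low byte of the Status register */
#define CFG_STATUS_CAP_LIST  0x10   /* capability list is implemented */
#define CFG_CAP_PTR_OFFSET   0x34   /* pointer to the first capability */
#define CAP_ID_MSIX          0x11   /* MSI-X capability ID */
#define MSIX_CTRL_ENABLE     0x8000 /* bit 15 of Message Control */

/*
 * Hypothetical helper: returns 1 if MSI-X is enabled, 0 if it is
 * disabled, and -1 if the dump contains no MSI-X capability.
 */
static int msix_enabled_in_dump(const uint8_t *cfg, size_t len)
{
    uint8_t pos;
    int guard = 48; /* bound the walk in case the list loops */

    if (len < 64 || !(cfg[CFG_STATUS_OFFSET] & CFG_STATUS_CAP_LIST))
        return -1;

    pos = cfg[CFG_CAP_PTR_OFFSET] & 0xfc;
    while (pos >= 0x40 && (size_t)pos + 4 <= len && guard-- > 0) {
        if (cfg[pos] == CAP_ID_MSIX) {
            /* Message Control is the 16-bit field at offset 2 */
            uint16_t ctrl = (uint16_t)(cfg[pos + 2] | (cfg[pos + 3] << 8));
            return (ctrl & MSIX_CTRL_ENABLE) ? 1 : 0;
        }
        pos = cfg[pos + 1] & 0xfc; /* follow the next-capability pointer */
    }
    return -1;
}
```

A result of 1 after the driver has been unloaded would match the
symptom described above.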

>One more interesting point: pci_enable_msi calls pci_request_region for
>the relevant BAR. So the driver should consider that otherwise a second
>call will fail. The question is should pci_enable_msi() call
>pci_release_region() when finished to allow the driver make an explicit
>call.

I do not think it is a good idea to give control of the MSI-X region
back to a device driver because the kernel uses this region during
run-time to handle SMP_Affinity, mask/unmask MSI interrupts while
servicing device driver's ISR, etc.
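
To illustrate why the kernel keeps the table mapped, masking a single
MSI-X entry comes down to one bit in that entry's Vector Control word.
The following is a rough user-space sketch only: it assumes the usual
MSI-X table layout (16 bytes per entry, Vector Control at byte offset
12, mask in bit 0) and operates on a plain byte array rather than on
ioremap()ed memory with writel(). The names msix_set_mask() and
msix_is_masked() are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

#define MSIX_ENTRY_SIZE         16  /* bytes per MSI-X table entry */
#define MSIX_ENTRY_VECTOR_CTRL  12  /* offset of Vector Control */
#define MSIX_CTRL_MASKBIT       0x1 /* bit 0: entry is masked */

/* Set or clear the mask bit of entry entry_nr in a simulated table. */
static void msix_set_mask(uint8_t *table, unsigned int entry_nr, int masked)
{
    /* Little-endian layout: bit 0 lives in the first byte of the word */
    uint8_t *ctrl = table + entry_nr * MSIX_ENTRY_SIZE +
                    MSIX_ENTRY_VECTOR_CTRL;

    if (masked)
        *ctrl |= MSIX_CTRL_MASKBIT;
    else
        *ctrl &= (uint8_t)~MSIX_CTRL_MASKBIT;
}

/* Return 1 if entry entry_nr is masked, 0 otherwise. */
static int msix_is_masked(const uint8_t *table, unsigned int entry_nr)
{
    return table[entry_nr * MSIX_ENTRY_SIZE + MSIX_ENTRY_VECTOR_CTRL] &
           MSIX_CTRL_MASKBIT;
}
```

The offset arithmetic mirrors the base + entry_nr * PCI_MSIX_ENTRY_SIZE
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET expression used in the patch.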

Thanks,
Long


-----Original Message-----
From: long [mailto:[email protected]]
Sent: Wednesday, June 23, 2004 12:48 AM
To: [email protected]; [email protected]; [email protected]; [email protected];
[email protected]; [email protected];
[email protected]; [email protected]
Cc: [email protected]
Subject: [PATCH]2.6.7 MSI-X Update


On Tuesday, June 22, 2004 Roland Dreier wrote:
>Do you have any plans for when this should be fixed? Right now, with
>the standard kernel, if I unload and then reload my driver module,
>setting up MSI-X fails the second time through because the core has
>not cleaned up the memory region from the first time.

For the case where a device function implements both the MSI
capability structure and the MSI-X capability structure, the
MSI support in kernel 2.6.x chooses to enable the MSI-X capability
structure, because one of its key advantages over MSI is that it
allows the kernel to provide a device function with multiple messages.
We've received input from some IHVs requesting that the kernel give a
device driver the ability to choose whether to enable MSI or MSI-X to
fit its specific needs.

Also, the kernel may encounter MSI-X vector shortages when handling an
MSI-X request from a device driver. This can cause a failure to enable
MSI-X if the requested number of vectors is not available. To allow
the driver to still use MSI-X with a reduced number of vectors, the
kernel should return the maximum number of MSI-X vectors available to
the caller. In addition, the device driver requires the ability to
select which entries of the MSI-X table are enabled (ABC--, A-B-C,
A--CB, etc.).
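
The entry patterns above can be sketched as arrays of msix_entry. This
is a user-space illustration, not kernel code: the struct mirrors the
msix_entry added by the patch, and check_msix_entries() is a
hypothetical helper that repeats the range and duplicate checks the
patch performs in pci_enable_msix() (the nvec <= 0 test is an extra
assumption). Reading "A--CB" as vectors A, B, C placed in table entries
0, 4, 3 is also an interpretation.

```c
#include <assert.h>

/* User-space mirror of the patch's struct msix_entry */
struct msix_entry {
    unsigned int vector : 16; /* written by the kernel on success */
    unsigned int entry  : 16; /* table entry chosen by the driver */
};

/*
 * Hypothetical helper: return 0 if every requested entry lies inside
 * the table and no entry is requested twice, -1 otherwise.
 */
static int check_msix_entries(const struct msix_entry *entries,
                              int nvec, int nr_entries)
{
    int i, j;

    if (!entries || nvec <= 0 || nvec > nr_entries)
        return -1;
    for (i = 0; i < nvec; i++) {
        if ((int)entries[i].entry >= nr_entries)
            return -1;              /* invalid entry */
        for (j = i + 1; j < nvec; j++)
            if (entries[i].entry == entries[j].entry)
                return -1;          /* duplicate entry */
    }
    return 0;
}
```

With a five-entry table, { {0, 0}, {0, 1}, {0, 2} } corresponds to
ABC--, { {0, 0}, {0, 2}, {0, 4} } to A-B-C, and
{ {0, 0}, {0, 4}, {0, 3} } to A--CB.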

As a result, I would like to propose the following changes to the
current 2.6 MSI implementation:

1. Make the existing API pci_enable_msi(struct pci_dev *dev)
support only MSI.
2. Consolidate the existing msi_alloc_vectors() and
msi_free_vectors() into a single API called pci_enable_msix
(struct pci_dev *dev, unsigned int *data, int nvec) to
support MSI-X.
3. Provide finer granularity in handling MSI/MSI-X vectors
freed by a device driver, as well as MSI/MSI-X reassignment on
a new request.
4. Update MSI-HOWTO to describe items 1, 2, and 3 in more
detail.

For implementation details refer to the patch.
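
As an illustration of the return convention proposed for
pci_enable_msix() (zero on success, negative on failure, positive
meaning "only this many vectors are available"), a driver-side retry
could look roughly like the sketch below. It is a user-space
simulation under stated assumptions: mock_enable_msix() is a made-up
stand-in for the kernel call that takes the number of available
vectors as an explicit argument, request_msix() is a hypothetical
driver helper, and the vector numbers are arbitrary.

```c
#include <assert.h>
#include <errno.h>

/* User-space mirror of the patch's struct msix_entry */
struct msix_entry {
    unsigned int vector : 16; /* written on success */
    unsigned int entry  : 16; /* table entry chosen by the driver */
};

/* Made-up stand-in for pci_enable_msix(); not the kernel function. */
static int mock_enable_msix(int available, struct msix_entry *entries,
                            int nvec)
{
    int i;

    if (!entries || nvec <= 0)
        return -EINVAL;
    if (nvec > available)
        return available > 0 ? available : -EBUSY;
    for (i = 0; i < nvec; i++)
        entries[i].vector = 0x31 + i; /* arbitrary vector numbers */
    return 0;
}

/*
 * Hypothetical driver helper: ask for nvec vectors and, on a positive
 * return, retry with the count reported as available. Returns the
 * number of vectors granted or a negative errno.
 */
static int request_msix(int available, struct msix_entry *entries,
                        int nvec)
{
    int rc;

    for (;;) {
        rc = mock_enable_msix(available, entries, nvec);
        if (rc == 0)
            return nvec;   /* granted as requested */
        if (rc < 0)
            return rc;     /* hard failure */
        nvec = rc;         /* shortage: retry with what is available */
    }
}
```

The loop terminates because a positive return is always smaller than
the count that was just requested.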

Starting on 06/26, I will not have access to email for two weeks. I'll
respond to lkml inputs after that.

Thanks,
Long

---------------------------------------------------------------------
diff -urN linux-2.6.7/Documentation/MSI-HOWTO.txt patch-2.6.7-msix/Documentation/MSI-HOWTO.txt
--- linux-2.6.7/Documentation/MSI-HOWTO.txt 2004-05-09 22:31:58.000000000 -0400
+++ patch-2.6.7-msix/Documentation/MSI-HOWTO.txt 2004-06-22 10:16:09.000000000 -0400
@@ -3,13 +3,14 @@
10/03/2003
Revised Feb 12, 2004 by Martine Silbermann
email: [email protected]
+ Revised May 24, 2004 by Tom L Nguyen

1. About this guide

-This guide describes the basics of Message Signaled Interrupts(MSI), the
-advantages of using MSI over traditional interrupt mechanisms, and how
-to enable your driver to use MSI or MSI-X. Also included is a Frequently
-Asked Questions.
+This guide describes the basics of Message Signaled Interrupts (MSI),
+the advantages of using MSI over traditional interrupt mechanisms,
+and how to enable your driver to use MSI or MSI-X. Also included is
+a Frequently Asked Questions.

2. Copyright 2003 Intel Corporation

@@ -35,7 +36,7 @@
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
-enable both, but instead enable only the MSI-X capability structure.
+enable both.

The MSI capability structure contains Message Control register,
Message Address register and Message Data register. These registers
@@ -86,7 +87,7 @@
support for better interrupt performance.

Using MSI enables the device functions to support two or more
-vectors, which can be configure to target different CPU's to
+vectors, which can be configured to target different CPU's to
increase scalability.

5. Configuring a driver to use MSI/MSI-X
@@ -95,26 +96,39 @@
support this capability. The CONFIG_PCI_USE_VECTOR kernel option
must be selected to enable MSI/MSI-X support.

-5.1 Including MSI support into the kernel
+5.1 Including MSI/MSI-X support into the kernel

-To allow MSI-Capable device drivers to selectively enable MSI (using
-pci_enable_msi as described below), the VECTOR based scheme needs to
-be enabled by setting CONFIG_PCI_USE_VECTOR.
+To allow MSI/MSI-X capable device drivers to selectively enable
+MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
+below), the VECTOR based scheme needs to be enabled by setting
+CONFIG_PCI_USE_VECTOR during kernel config.

Since the target of the inbound message is the local APIC, providing
-CONFIG_PCI_USE_VECTOR is dependent on whether CONFIG_X86_LOCAL_APIC
-is enabled or not.
+CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_USE_VECTOR.


-int pci_enable_msi(struct pci_dev *)
+5.2 Configuring for MSI support
+
+Due to the non-contiguous fashion in vector assignment of the
+existing Linux kernel, this version does not support multiple
+messages regardless of whether a device function is capable of
+supporting more than one vector. To enable MSI on a device function's
+MSI capability structure, a device driver must call the function
+pci_enable_msi() explicitly.
+
+5.2.1 API pci_enable_msi
+
+int pci_enable_msi(struct pci_dev *dev)

With this new API, any existing device driver, which like to have
-MSI enabled on its device function, must call this explicitly. A
-successful call will initialize the MSI/MSI-X capability structure
-with ONE vector, regardless of whether the device function is
+MSI enabled on its device function, must call this API to enable MSI.
+A successful call will initialize the MSI capability structure
+with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
pre-assigned dev->irq with a new MSI vector. To avoid the conflict
of new assigned vector with existing pre-assigned vector requires
-the device driver to call this API before calling request_irq(...).
+a device driver to call this API before calling request_irq().
+
+5.2.2 MSI mode vs. legacy mode diagram

The below diagram shows the events, which switches the interrupt
mode on the MSI-capable device function between MSI mode and
@@ -126,103 +140,238 @@
| | ===============> | |
------------ free_irq ------------------------

-5.2 Configuring for MSI support
+Figure 1.0 MSI Mode vs. Legacy Mode

-Due to the non-contiguous fashion in vector assignment of the
-existing Linux kernel, this version does not support multiple
-messages regardless of the device function is capable of supporting
-more than one vector. The bus driver initializes only entry 0 of
-this capability if pci_enable_msi(...) is called successfully by
-the device driver.
+In Figure 1.0, a device operates by default in legacy mode. Legacy
+in this context means PCI pin-irq assertion or PCI-Express INTx
+emulation. A successful MSI request (using pci_enable_msi()) switches
+a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem and a new
+assigned MSI vector will replace dev->irq.
+
+To return to its default mode, a device driver must call
+free_irq() using the allocated MSI vector. The PCI subsystem restores a
+device's dev->irq to the pre-assigned IOAPIC vector and marks the released
+MSI vector as unused. Once marked as unused, there is no
+guarantee that the PCI subsystem will reserve this MSI vector for a
+device. Depending on the availability of current PCI vector resources
+and the number of MSI/MSI-X requests from other drivers, this MSI
+vector may be re-assigned. For the case where the PCI subsystem
+re-assigned this MSI vector to another driver, a request to switch
+back to MSI mode may result in being assigned a different MSI vector
+or a failure if no more vectors are available.

5.3 Configuring for MSI-X support

-Both the MSI capability structure and the MSI-X capability structure
-share the same above semantics; however, due to the ability of the
-system software to configure each vector of the MSI-X capability
-structure with an independent message address and message data, the
-non-contiguous fashion in vector assignment of the existing Linux
-kernel has no impact on supporting multiple messages on an MSI-X
-capable device functions. By default, as mentioned above, ONE vector
-should be always allocated to the MSI-X capability structure at
-entry 0. The bus driver does not initialize other entries of the
-MSI-X table.
-
-Note that the PCI subsystem should have full control of a MSI-X
-table that resides in Memory Space. The software device driver
-should not access this table.
-
-To request for additional vectors, the device software driver should
-call function msi_alloc_vectors(). It is recommended that the
-software driver should call this function once during the
+Due to the ability of the system software to configure each vector of
+the MSI-X capability structure with an independent message address
+and message data, the non-contiguous fashion in vector assignment of
+the existing Linux kernel has no impact on supporting multiple
+messages on MSI-X capable device functions. Enabling MSI-X on
+a device function's MSI-X capability structure requires its device
+driver to call the function pci_enable_msix() explicitly.
+
+The function pci_enable_msix(), once invoked, enables either
+all or nothing, depending on the current availability of PCI vector
+resources. If the PCI vector resources are available for the number
+of vectors requested by a device driver, this function will configure
+the MSI-X table of the MSI-X capability structure of a device with
+the requested messages. For example, a device may be capable of
+supporting a maximum of 32 vectors while its software driver
+typically requests only 4 vectors. It is recommended
+that the device driver call this function once during the
initialization phase of the device driver.

-The function msi_alloc_vectors(), once invoked, enables either
-all or nothing, depending on the current availability of vector
-resources. If no vector resources are available, the device function
-still works with ONE vector. If the vector resources are available
-for the number of vectors requested by the driver, this function
-will reconfigure the MSI-X capability structure of the device with
-additional messages, starting from entry 1. To emphasize this
-reason, for example, the device may be capable for supporting the
-maximum of 32 vectors while its software driver usually may request
-4 vectors.
-
-For each vector, after this successful call, the device driver is
-responsible to call other functions like request_irq(), enable_irq(),
-etc. to enable this vector with its corresponding interrupt service
-handler. It is the device driver's choice to have all vectors shared
-the same interrupt service handler or each vector with a unique
-interrupt service handler.
-
-In addition to the function msi_alloc_vectors(), another function
-msi_free_vectors() is provided to allow the software driver to
-release a number of vectors back to the vector resources. Once
-invoked, the PCI subsystem disables (masks) each vector released.
-These vectors are no longer valid for the hardware device and its
-software driver to use. Like free_irq, it recommends that the
-device driver should also call msi_free_vectors to release all
-additional vectors previously requested.
-
-int msi_alloc_vectors(struct pci_dev *dev, int *vector, int nvec)
-
-This API enables the software driver to request the PCI subsystem
-for additional messages. Depending on the number of vectors
-available, the PCI subsystem enables either all or nothing.
+Unlike the function pci_enable_msi(), the function pci_enable_msix()
+does not replace the pre-assigned IOAPIC dev->irq with a new MSI
+vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
+into the field vector of each element contained in the second argument.
+Note that the pre-assigned IO-APIC dev->irq is valid only if the device
+operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by the
+device driver to use dev->irq to request interrupt service
+may result in unpredictable behavior.
+
+For each MSI-X vector granted, a device driver is responsible for
+calling other functions like request_irq(), enable_irq(), etc. to enable
+this vector with its corresponding interrupt service handler. It is
+a device driver's choice to assign all vectors with the same
+interrupt service handler or each vector with a unique interrupt
+service handler.
+
+5.3.1 Handling MMIO address space of MSI-X Table
+
+The PCI 3.0 specification has implementation notes that MMIO address
+space for a device's MSI-X structure should be isolated so that the
+software system can set different pages for controlling accesses to
+the MSI-X structure. The implementation of the MSI patch requires the PCI
+subsystem, not a device driver, to maintain full control of the MSI-X
+table/MSI-X PBA and MMIO address space of the MSI-X table/MSI-X PBA.
+A device driver is prohibited from requesting the MMIO address space
+of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will fail
+to enable MSI-X on its hardware device when it calls the function
+pci_enable_msix().
+
+5.3.2 Handling MSI-X allocation
+
+Determining the number of MSI-X vectors allocated to a function is
+dependent on the number of MSI capable devices and MSI-X capable
+devices populated in the system. The policy of allocating MSI-X
+vectors to a function is defined as the following:
+
+#of MSI-X vectors allocated to a function = (x - y)/z where
+
+x = The number of available PCI vector resources by the time
+ the device driver calls pci_enable_msix(). The PCI vector
+ resources are the sum of the number of unassigned vectors
+ (new) and the number of released vectors when any MSI/MSI-X
+ device driver switches its hardware device back to a legacy
+ mode or is hot-removed. The number of unassigned vectors
+ may exclude some vectors reserved, as defined in parameter
+ NR_HP_RESERVED_VECTORS, for the case where the system is
+ capable of supporting hot-add/hot-remove operations. Users
+ may change the value defined in NR_HP_RESERVED_VECTORS to
+ meet their specific needs.
+
+y = The number of MSI capable devices populated in the system.
+ This policy ensures that each MSI capable device has its
+ vector reserved to avoid the case where some MSI-X capable
+ drivers may attempt to claim all available vector resources.
+
+z = The number of MSI-X capable devices populated in the system.
+ This policy ensures that maximum (x - y) is distributed
+ evenly among MSI-X capable devices.
+
+Note that the PCI subsystem counts y and z during bus enumeration.
+When the PCI subsystem completes configuring the MSI/MSI-X capability
+structure of a device as requested by its device driver, y or z is
+decremented accordingly.
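
The allocation policy above can be sketched in plain user-space C as follows. This is an illustration only: the function name msix_vectors_per_function() and its parameter names are hypothetical and do not appear in the patch. It assumes integer division and treats the cases "no MSI-X capable devices" and "no vectors left after the MSI reservation" as an allocation of zero.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the patch): evaluates the policy
 * (x - y) / z described in section 5.3.2.
 *   avail   (x): PCI vector resources available at request time
 *   nr_msi  (y): MSI capable devices, each keeping one vector reserved
 *   nr_msix (z): MSI-X capable devices sharing the remainder evenly
 */
static int msix_vectors_per_function(int avail, int nr_msi, int nr_msix)
{
	if (nr_msix <= 0)
		return 0;	/* assumption: nothing to distribute to */
	if (avail <= nr_msi)
		return 0;	/* assumption: all vectors held back for MSI */
	return (avail - nr_msi) / nr_msix;
}
```

For instance, with 20 available vectors, 4 MSI capable devices and 4 MSI-X capable devices, each MSI-X function would be allowed (20 - 4) / 4 = 4 vectors under this reading of the policy.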
+
+5.3.3 Handling MSI-X shortages
+
+For the case where fewer MSI-X vectors are allocated to a function
+than requested, the function pci_enable_msix() will return the
+maximum number of MSI-X vectors available to the caller. A device
+driver may re-send its request with a number of vectors less than or
+equal to the returned value. For example, if a device driver requests
+5 vectors, but the number of available vectors is 3, a value of 3 will
+be returned by the pci_enable_msix() call. A function could be
+designed for its driver to use only 3 MSI-X table entries in
+different combinations such as ABC--, A-B-C, A--CB, etc. Note that this
+patch does not support multiple entries with the same vector. Any
+attempt by a device driver to use 5 MSI-X table entries with 3 vectors
+as ABBCC, AABCC, BCCBA, etc. will result in a failure of the function
+pci_enable_msix(). Below are the reasons why supporting multiple
+entries with the same vector is an undesirable solution.
+
+ - The PCI subsystem cannot determine which entry generated
+ the message, and therefore which entry to mask/unmask, while
+ handling the software driver ISR. Attempting to walk through all
+ MSI-X table entries (2048 max) to mask/unmask any matching
+ vector is an undesirable solution.
+
+ - Walking through all MSI-X table entries (2048 max) to handle
+ SMP affinity of any matching vector is an undesirable solution.
+
+5.3.4 API pci_enable_msix
+
+int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)
+
+This API enables a device driver to request the PCI subsystem
+for enabling MSI-X messages on its hardware device. Depending on
+the availability of PCI vectors resources, the PCI subsystem enables
+either all or nothing.

Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
+
+Argument entries is a pointer of unsigned integer type. The number of
+elements is indicated in argument nvec. The content of each element
+will be mapped to the following struct defined in drivers/pci/msi.h.
+
+struct msix_entry {
+ __u32 vector : 16; /* kernel uses to write alloc vector */
+ __u32 entry : 16; /* driver uses to specify entry */
+};
+
+A device driver is responsible for initializing the field entry of
+each element with a unique entry supported by the MSI-X table.
+Otherwise, -EINVAL will be returned as a result. A successful return
+of zero indicates the PCI subsystem completed initializing each of the
+requested entries of the MSI-X table with message address and message
+data. Finally, the PCI subsystem will write the 1:1
+vector-to-entry mapping into the field vector of each element. A
+device driver is responsible for keeping track of allocated MSI-X
+vectors in its internal data structure.
+
Argument nvec is an integer indicating the number of messages
requested.
-A return of zero indicates that the number of allocated vector is
-successfully allocated. Otherwise, indicate resources not
-available.
-
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
-
-This API enables the software driver to inform the PCI subsystem
-that it is willing to release a number of vectors back to the
-MSI resource pool. Once invoked, the PCI subsystem disables each
-MSI-X entry associated with each vector stored in the argument 2.
-These vectors are no longer valid for the hardware device and
-its software driver to use.

-Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
-Argument nvec is an integer indicating the number of messages
-released.
-A return of zero indicates that the number of allocated vectors
-is successfully released. Otherwise, indicates a failure.
+A return of zero indicates that the requested number of MSI-X
+vectors is successfully allocated. A return of greater than zero
+indicates an MSI-X vector shortage. A return of less than zero
+indicates a failure. This failure may be a result of duplicate
+entries specified in the second argument, of no available vector,
+or of failing to initialize MSI-X table entries.
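
The return-value convention described above (zero, positive shortage count, negative error) could be handled by a driver roughly as sketched below. This is a user-space simulation, not kernel code: sim_enable_msix() is a hypothetical stand-in for pci_enable_msix(), struct msix_entry_sketch only mirrors the layout quoted from the HOWTO, the vector numbers are arbitrary, and returning -EBUSY when no vector is left is an assumption (the text only says the value is negative).

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the msix_entry layout quoted in the HOWTO text. */
struct msix_entry_sketch {
	unsigned int vector : 16;	/* written by the (simulated) kernel */
	unsigned int entry  : 16;	/* chosen by the driver */
};

static int sim_avail = 3;	/* assumption: vectors the simulation can grant */

/* Hypothetical stand-in for pci_enable_msix(); all or nothing. */
static int sim_enable_msix(struct msix_entry_sketch *e, int nvec)
{
	int i, j;

	if (nvec <= 0)
		return -EINVAL;
	for (i = 0; i < nvec; i++)	/* duplicate entries are rejected */
		for (j = i + 1; j < nvec; j++)
			if (e[i].entry == e[j].entry)
				return -EINVAL;
	if (sim_avail == 0)
		return -EBUSY;		/* assumed errno for "no vector" */
	if (nvec > sim_avail)
		return sim_avail;	/* shortage: report what is available */
	for (i = 0; i < nvec; i++)
		e[i].vector = 0x31 + i;	/* arbitrary illustrative vectors */
	sim_avail -= nvec;
	return 0;
}

/* Driver side: ask for 'want' vectors, retry once with the reported count. */
static int request_msix(struct msix_entry_sketch *e, int want)
{
	int rc = sim_enable_msix(e, want);

	if (rc > 0) {
		int got = rc;

		rc = sim_enable_msix(e, got);
		return rc == 0 ? got : rc;
	}
	return rc == 0 ? want : rc;
}
```

A caller would fill the entry field of each element (for example 0, 2, 4 for the A-B-C layout), call request_msix(), and then use only as many elements as the returned count.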
+
+5.3.5 MSI-X mode vs. legacy mode diagram
+
+The diagram below shows the events that switch the interrupt
+mode on the MSI-X capable device function between MSI-X mode and
+PIN-IRQ assertion mode (legacy).
+
+ ------------ pci_enable_msix(,,n) ------------------------
+ | | <=============== | |
+ | MSI-X MODE | | PIN-IRQ ASSERTION MODE |
+ | | ===============> | |
+ ------------ (n)free_irq ------------------------
+
+Figure 2.0 MSI-X Mode vs. Legacy Mode
+
+In Figure 2.0, a device operates by default in legacy mode. A
+successful MSI-X request (using pci_enable_msix()) switches a
+device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem; however,
+unlike MSI mode, the PCI subsystem will not replace dev->irq with the
+assigned MSI-X vector because the PCI subsystem already writes the 1:1
+vector-to-entry mapping into the field vector of each element
+specified in the second argument.
+
+To return to its default mode, a device driver must call
+free_irq() on all allocated MSI-X vectors. Unlike MSI mode, the PCI
+subsystem switches a device function back to its default legacy mode
+if and only if its device driver successfully releases all allocated
+MSI-X vectors (n) through (n) number of free_irq calls.
+
+Note that if a device still operates in MSI-X mode, its device
+driver can call request_irq/free_irq on any vector in subset n. When
+the PCI subsystem detects all MSI-X vectors being released by a device
+driver, it will switch a function's interrupt mode from MSI-X mode
+to legacy mode and mark all allocated MSI-X vectors as unused. Once
+marked as unused, there is no guarantee that the PCI subsystem
+will reserve these MSI-X vectors for a device. Depending on the
+availability of current PCI vector resources and the number of
+MSI/MSI-X requests from other drivers, these MSI-X vectors may be
+re-assigned. For the case where the PCI subsystem re-assigned
+these MSI-X vectors to another driver, a request to switch back to
+MSI-X mode may result in being assigned another set of MSI-X vectors
+or a failure.
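
The mode transitions in Figure 2.0 can be modelled with a small counter, as in the sketch below. All names here (dev_sketch, sketch_enable_msix(), sketch_free_irq()) are hypothetical and exist only for illustration. The model is simplified: it assumes each sketch_free_irq() call releases a distinct, still-allocated vector and it does not model re-requesting a released vector while the device is still in MSI-X mode.

```c
#include <assert.h>

enum irq_mode { MODE_LEGACY, MODE_MSIX };

/* Hypothetical per-device state, for illustration only. */
struct dev_sketch {
	enum irq_mode mode;
	int active;	/* MSI-X vectors granted and not yet released */
};

/* Models a successful pci_enable_msix(,,n): legacy -> MSI-X. */
static void sketch_enable_msix(struct dev_sketch *d, int n)
{
	if (d->mode == MODE_LEGACY && n > 0) {
		d->mode = MODE_MSIX;
		d->active = n;
	}
}

/* Models one free_irq(); the device returns to legacy mode only
 * after the last of the n vectors has been released. */
static void sketch_free_irq(struct dev_sketch *d)
{
	if (d->mode != MODE_MSIX || d->active == 0)
		return;
	if (--d->active == 0)
		d->mode = MODE_LEGACY;
}
```

With n = 3, the first two sketch_free_irq() calls leave the device in MSI-X mode and only the third switches it back to legacy mode.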
+
+5.4 Handling function implementing both MSI and MSI-X capabilities
+
+For the case where a function implements both MSI and MSI-X
+capabilities, the PCI subsystem enables a device to run either in MSI
+mode or MSI-X mode but not both. A device driver determines whether it
+wants MSI or MSI-X enabled on its hardware device. Once a device
+driver requests MSI, for example, it is prohibited from requesting
+MSI-X; in other words, a device driver is not permitted to ping-pong
+between MSI mode and MSI-X mode at run-time.

-5.4 Hardware requirements for MSI support
-MSI support requires support from both system hardware and
+5.5 Hardware requirements for MSI/MSI-X support
+MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

-5.4.1 System hardware support
+5.5.1 System hardware support
Since the target of MSI address is the local APIC CPU, enabling
-MSI support in Linux kernel is dependent on whether existing
+MSI/MSI-X support in Linux kernel is dependent on whether existing
system hardware supports local APIC. Users should verify their
system whether it runs when CONFIG_X86_LOCAL_APIC=y.

@@ -231,14 +380,14 @@
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_USE_VECTOR enables the VECTOR based scheme and
the option for MSI-capable device drivers to selectively enable
-MSI (using pci_enable_msi as described below).
+MSI/MSI-X.

-Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI
-vector is allocated new during runtime and MSI support does not
-depend on BIOS support. This key independency enables MSI support
-on future IOxAPIC free platform.
+Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
+vector is allocated new during runtime and MSI/MSI-X support does not
+depend on BIOS support. This key independency enables MSI/MSI-X
+support on future IOxAPIC free platform.

-5.4.2 Device hardware support
+5.5.2 Device hardware support
The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
@@ -249,17 +398,19 @@
MSI-capable hardware is responsible for whether calling
pci_enable_msi or not. A return of zero indicates the kernel
successfully initializes the MSI/MSI-X capability structure of the
-device funtion. The device function is now running on MSI mode.
+device function. The device function is now running in MSI/MSI-X mode.

-5.5 How to tell whether MSI is enabled on device function
+5.6 How to tell whether MSI/MSI-X is enabled on device function

-At the driver level, a return of zero from pci_enable_msi(...)
-indicates to the device driver that its device function is
-initialized successfully and ready to run in MSI mode.
+At the driver level, a return of zero from the function call of
+pci_enable_msi()/pci_enable_msix() indicates to a device driver that
+its device function is initialized successfully and ready to run in
+MSI/MSI-X mode.

At the user level, users can use command 'cat /proc/interrupts'
-to display the vector allocated for the device and its interrupt
-mode, as shown below.
+to display the vector allocated for a device and its interrupt
+MSI/MSI-X mode ("PCI-MSI"/"PCI-MSI-X"). The example below shows MSI mode
+enabled on a SCSI Adaptec 39320D Ultra320.

CPU0 CPU1
0: 324639 0 IO-APIC-edge timer
diff -urN linux-2.6.7/drivers/pci/msi.c patch-2.6.7-msix/drivers/pci/msi.c
--- linux-2.6.7/drivers/pci/msi.c 2004-05-09 22:33:20.000000000 -0400
+++ patch-2.6.7-msix/drivers/pci/msi.c 2004-06-22 11:53:03.000000000 -0400
@@ -179,6 +179,18 @@

static unsigned int startup_msi_irq_w_maskbit(unsigned int vector)
{
+ struct msi_desc *entry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ entry = msi_desc[vector];
+ if (!entry || !entry->dev) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return 0;
+ }
+ entry->msi_attrib.state = 1; /* Mark it active */
+ spin_unlock_irqrestore(&msi_lock, flags);
+
unmask_MSI_irq(vector);
return 0; /* never anything pending */
}
@@ -200,7 +212,7 @@
* which implement the MSI-X Capability Structure.
*/
static struct hw_interrupt_type msix_irq_type = {
- .typename = "PCI MSI-X",
+ .typename = "PCI-MSI-X",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -216,7 +228,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_w_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -232,7 +244,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_wo_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_wo_maskbit,
.shutdown = shutdown_msi_irq_wo_maskbit,
.enable = enable_msi_irq_wo_maskbit,
@@ -265,6 +277,7 @@
msi_address->lo_address.value |= (MSI_TARGET_CPU <<
MSI_TARGET_CPU_SHIFT);
}

+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign);
static int assign_msi_vector(void)
{
static int new_vector_avail = 1;
@@ -278,6 +291,8 @@
spin_lock_irqsave(&msi_lock, flags);

if (!new_vector_avail) {
+ int free_vector = 0;
+
/*
* vector_irq[] = -1 indicates that this specific vector is:
* - assigned for MSI (since MSI have no associated IRQ) or
@@ -294,13 +309,34 @@
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (vector_irq[vector] != 0)
continue;
- vector_irq[vector] = -1;
- nr_released_vectors--;
- spin_unlock_irqrestore(&msi_lock, flags);
- return vector;
+ free_vector = vector;
+ if (!msi_desc[vector])
+ break;
+ else
+ continue;
}
+ if (!free_vector) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return -EBUSY;
+ }
+ vector_irq[free_vector] = -1;
+ nr_released_vectors--;
spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
+ if (msi_desc[free_vector] != NULL) {
+ struct pci_dev *dev;
+ int tail;
+
+ /* free all linked vectors before re-assign */
+ do {
+ spin_lock_irqsave(&msi_lock, flags);
+ dev = msi_desc[free_vector]->dev;
+ tail = msi_desc[free_vector]->link.tail;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ msi_free_vector(dev, tail, 1);
+ } while (free_vector != tail);
+ }
+
+ return free_vector;
}
vector = assign_irq_vector(AUTO_ASSIGN);
last_alloc_vector = vector;
@@ -333,6 +369,15 @@
printk(KERN_INFO "WARNING: MSI INIT FAILURE\n");
return status;
}
+ last_alloc_vector = assign_irq_vector(AUTO_ASSIGN);
+ if (last_alloc_vector < 0) {
+ pci_msi_enable = 0;
+ printk(KERN_INFO "WARNING: ALL VECTORS ARE BUSY\n");
+ status = -EBUSY;
+ return status;
+ }
+ vector_irq[last_alloc_vector] = 0;
+ nr_released_vectors++;
printk(KERN_INFO "MSI INIT SUCCESS\n");

return status;
@@ -431,7 +476,7 @@
}
}

-static int msi_lookup_vector(struct pci_dev *dev)
+static int msi_lookup_vector(struct pci_dev *dev, int type)
{
int vector;
unsigned long flags;
@@ -439,11 +484,11 @@
spin_lock_irqsave(&msi_lock, flags);
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (!msi_desc[vector] || msi_desc[vector]->dev != dev ||
- msi_desc[vector]->msi_attrib.entry_nr ||
+ msi_desc[vector]->msi_attrib.type != type ||
msi_desc[vector]->msi_attrib.default_vector != dev->irq)
- continue; /* not entry 0, skip */
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- /* This pre-assigned entry-0 MSI vector for this device
+ /* This pre-assigned MSI vector for this device
already exits. Override dev->irq with this vector */
dev->irq = vector;
return 0;
@@ -458,10 +503,9 @@
if (!dev)
return;

- if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0) {
- nr_reserved_vectors++;
+ if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0)
nr_msix_devices++;
- } else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
+ else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
nr_reserved_vectors++;
}

@@ -483,19 +527,8 @@
u32 control;

pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
- if (!pos)
- return -EINVAL;
-
dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
2, &control);
- if (control & PCI_MSI_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
- return 0;
- }
/* MSI Entry Initialization */
if (!(entry = alloc_msi_entry()))
return -ENOMEM;
@@ -504,11 +537,14 @@
kmem_cache_free(msi_cachep, entry);
return -EBUSY;
}
+ entry->link.head = vector;
+ entry->link.tail = vector;
entry->msi_attrib.type = PCI_CAP_ID_MSI;
+ entry->msi_attrib.state = 1; /* Mark it active */
entry->msi_attrib.entry_nr = 0;
entry->msi_attrib.maskbit = is_mask_bit_support(control);
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
+ entry->msi_attrib.default_vector = dev->irq; /* Save IOAPIC IRQ */
+ dev->irq = vector;
entry->dev = dev;
if (is_mask_bit_support(control)) {
entry->mask_base = msi_mask_bits_reg(pos,
@@ -556,135 +592,170 @@
* @dev: pointer to the pci_dev data structure of MSI-X device function
*
* Setup the MSI-X capability structure of device funtion with a
- * single MSI-X vector. A return of zero indicates the successful setup
- * of an entry zero with the new MSI-X vector or non-zero for otherwise.
- * To request for additional MSI-X vectors, the device drivers are
- * required to utilize the following supported APIs:
- * 1) msi_alloc_vectors(...) for requesting one or more MSI-X vectors
- * 2) msi_free_vectors(...) for releasing one or more MSI-X vectors
- * back to PCI subsystem before calling free_irq(...)
+ * single MSI-X vector. A return of zero indicates the successful setup of
+ * requested MSI-X entries with allocated vectors or non-zero for otherwise.
**/
-static int msix_capability_init(struct pci_dev *dev)
+static int msix_capability_init(struct pci_dev *dev,
+ struct msix_entry *entries, int nvec)
{
- struct msi_desc *entry;
+ struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
struct msg_address address;
struct msg_data data;
- int vector = 0, pos, dev_msi_cap;
+ int vector, pos, i, j, nr_entries, temp = 0;
u32 phys_addr, table_offset;
u32 control;
u8 bir;
void *base;
-
+
pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- if (!pos)
- return -EINVAL;
-
/* Request & Map MSI-X table region */
dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2,
&control);
- if (control & PCI_MSIX_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
- return 0;
- }
-
- dev_msi_cap = multi_msix_capable(control);
+ nr_entries = multi_msix_capable(control);
dev->bus->ops->read(dev->bus, dev->devfn,
msix_table_offset_reg(pos), 4, &table_offset);
bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
phys_addr = pci_resource_start (dev, bir);
phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK);
if (!request_mem_region(phys_addr,
- dev_msi_cap * PCI_MSIX_ENTRY_SIZE,
- "MSI-X iomap Failure"))
+ nr_entries * PCI_MSIX_ENTRY_SIZE,
+ "MSI-X vector table"))
return -ENOMEM;
- base = ioremap_nocache(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
- if (base == NULL)
- goto free_region;
- /* MSI Entry Initialization */
- entry = alloc_msi_entry();
- if (!entry)
- goto free_iomap;
- if ((vector = get_msi_vector(dev)) < 0)
- goto free_entry;
-
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = 0;
- entry->msi_attrib.maskbit = 1;
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
- entry->dev = dev;
- entry->mask_base = (unsigned long)base;
- /* Replace with MSI handler */
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
- MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- writel(address.lo_address.value, base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
- /* Initialize all entries from 1 up to 0 */
- for (pos = 1; pos < dev_msi_cap; pos++) {
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ base = ioremap_nocache(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ if (base == NULL) {
+ release_mem_region(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ return -ENOMEM;
+ }
+ /* MSI-X Table Initialization */
+ for (i = 0; i < nvec; i++) {
+ entry = alloc_msi_entry();
+ if (!entry)
+ break;
+ if ((vector = get_msi_vector(dev)) < 0)
+ break;
+
+ j = (entries + i)->entry;
+ (entries + i)->vector = vector;
+ entry->msi_attrib.type = PCI_CAP_ID_MSIX;
+ entry->msi_attrib.state = 1; /* Mark it active */
+ entry->msi_attrib.entry_nr = j;
+ entry->msi_attrib.maskbit = 1;
+ entry->msi_attrib.default_vector = dev->irq;
+ entry->dev = dev;
+ entry->mask_base = (unsigned long)base;
+ if (!head) {
+ entry->link.head = vector;
+ entry->link.tail = vector;
+ head = entry;
+ } else {
+ entry->link.head = temp;
+ entry->link.tail = tail->link.tail;
+ tail->link.tail = vector;
+ head->link.head = vector;
+ }
+ temp = vector;
+ tail = entry;
+ /* Replace with MSI-X handler */
+ irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
+ /* Configure MSI-X capability structure */
+ msi_address_init(&address);
+ msi_data_init(&data, vector);
+ entry->msi_attrib.current_cpu =
+ ((address.lo_address.u.dest_id >>
+ MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
+ writel(address.lo_address.value,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(address.hi_address,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(*(u32*)&data,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_DATA_OFFSET);
+ attach_msi_entry(entry, vector);
}
- attach_msi_entry(entry, vector);
- /* Set MSI enabled bits */
+ if (i != nvec) {
+ i--;
+ for (; i >= 0; i--) {
+ vector = (entries + i)->vector;
+ msi_free_vector(dev, vector, 0);
+ (entries + i)->vector = 0;
+ }
+ return -EBUSY;
+ }
+ /* Set MSI-X enabled bits */
enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
-
+
return 0;
-
-free_entry:
- kmem_cache_free(msi_cachep, entry);
-free_iomap:
- iounmap(base);
-free_region:
- release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
-
- return ((vector < 0) ? -EBUSY : -ENOMEM);
}

/**
- * pci_enable_msi - configure device's MSI(X) capability structure
- * @dev: pointer to the pci_dev data structure of MSI(X) device function
+ * pci_enable_msi - configure device's MSI capability structure
+ * @dev: pointer to the pci_dev data structure of MSI device function
*
- * Setup the MSI/MSI-X capability structure of device function with
- * a single MSI(X) vector upon its software driver call to request for
- * MSI(X) mode enabled on its hardware device function. A return of zero
- * indicates the successful setup of an entry zero with the new MSI(X)
+ * Setup the MSI capability structure of device function with
+ * a single MSI vector upon its software driver call to request for
+ * MSI mode enabled on its hardware device function. A return of zero
+ * indicates the successful setup of an entry zero with the new MSI
* vector or non-zero for otherwise.
**/
int pci_enable_msi(struct pci_dev* dev)
{
- int status = -EINVAL;
+ int pos, temp = dev->irq, status = -EINVAL;
+ u32 control;

if (!pci_msi_enable || !dev)
return status;

- if (msi_init() < 0)
- return -ENOMEM;
+ if ((status = msi_init()) < 0)
+ return status;

- if ((status = msix_capability_init(dev)) == -EINVAL)
- status = msi_capability_init(dev);
- if (!status)
- nr_reserved_vectors--;
+ if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSI)))
+ return -EINVAL;
+
+ dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
+ 2, &control);
+ if (control & PCI_MSI_FLAGS_ENABLE)
+ return 0; /* Already in MSI mode */
+
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ /* Lookup Success */
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ if (!vector_irq[dev->irq]) {
+ msi_desc[dev->irq]->msi_attrib.state = 1;
+ vector_irq[dev->irq] = -1;
+ nr_released_vectors--;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
+ return 0;
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ dev->irq = temp;
+ }
+ /* Check whether driver already requested for MSI-X vectors */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ printk(KERN_INFO "Can't enable MSI. Device already had MSI-X vectors assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+ status = msi_capability_init(dev);
+ if (!status) {
+ if (!pos)
+ nr_reserved_vectors--; /* Only MSI capable */
+ else if (nr_msix_devices > 0)
+ nr_msix_devices--; /* Both MSI and MSI-X capable,
+ but choose enabling MSI */
+ }

return status;
}

-static int msi_free_vector(struct pci_dev* dev, int vector);
static void pci_disable_msi(unsigned int vector)
{
- int head, tail, type, default_vector;
+ int type, default_vector;
struct msi_desc *entry;
struct pci_dev *dev;
unsigned long flags;
@@ -697,168 +768,235 @@
}
dev = entry->dev;
type = entry->msi_attrib.type;
- head = entry->link.head;
- tail = entry->link.tail;
+ entry->msi_attrib.state = 0; /* Mark it not active */

default_vector = entry->msi_attrib.default_vector;
spin_unlock_irqrestore(&msi_lock, flags);
-
- disable_msi_mode(dev, pci_find_capability(dev, type), type);
- /* Restore dev->irq to its default pin-assertion vector */
- dev->irq = default_vector;
- if (type == PCI_CAP_ID_MSIX && head != tail) {
- /* Bad driver, which do not call msi_free_vectors before
exit.
- We must do a cleanup here */
- while (1) {
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- head = entry->link.head;
- tail = entry->link.tail;
+ switch (type) {
+ case PCI_CAP_ID_MSI:
+ spin_lock_irqsave(&msi_lock, flags);
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ case PCI_CAP_ID_MSIX:
+ spin_lock_irqsave(&msi_lock, flags);
+ while (vector != entry->link.tail) {
+ entry = msi_desc[entry->link.tail];
+ if (!entry->msi_attrib.state)
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- if (tail == head)
- break;
- if (msi_free_vector(dev, entry->link.tail))
- break;
+ /*
+ * Device still operates in MSI-X mode. Do not
+ * switch interrupt mode
+ */
+ return;
}
+ entry = msi_desc[vector];
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ while (vector != entry->link.tail) {
+ vector_irq[entry->link.tail] = 0; /* Mark it free */
+ nr_released_vectors++;
+ entry = msi_desc[entry->link.tail];
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ default:
+ return;
}
+ /* Restore dev->irq to its default pin-assertion vector */
+ dev->irq = default_vector;
+ disable_msi_mode(dev, pci_find_capability(dev, type), type);
}

-static int msi_alloc_vector(struct pci_dev* dev, int head)
+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign)
{
struct msi_desc *entry;
- struct msg_address address;
- struct msg_data data;
- int i, offset, pos, dev_msi_cap, vector;
- u32 low_address, control;
+ int head, entry_nr, type;
unsigned long base = 0L;
unsigned long flags;

spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry) {
+ entry = msi_desc[vector];
+ if (!entry || entry->dev != dev) {
spin_unlock_irqrestore(&msi_lock, flags);
return -EINVAL;
}
+ type = entry->msi_attrib.type;
+ entry_nr = entry->msi_attrib.entry_nr;
+ head = entry->link.head;
base = entry->mask_base;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
- 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- for (i = 1; i < dev_msi_cap; i++) {
- if (!(low_address = readl(base + i *
PCI_MSIX_ENTRY_SIZE)))
- break;
+ msi_desc[entry->link.head]->link.tail = entry->link.tail;
+ msi_desc[entry->link.tail]->link.head = entry->link.head;
+ entry->dev = NULL;
+ if (!reassign) {
+ vector_irq[vector] = 0;
+ nr_released_vectors++;
}
- if (i >= dev_msi_cap)
- return -EINVAL;
+ msi_desc[vector] = NULL;
+ spin_unlock_irqrestore(&msi_lock, flags);

- /* MSI Entry Initialization */
- if (!(entry = alloc_msi_entry()))
- return -ENOMEM;
+ kmem_cache_free(msi_cachep, entry);

- if ((vector = get_new_vector()) < 0) {
- kmem_cache_free(msi_cachep, entry);
- return vector;
+ if (type == PCI_CAP_ID_MSIX) {
+ if (!reassign)
+ writel(1, base +
+ entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+
+ if (head == vector) {
+ /*
+ * Detect last MSI-X vector to be released.
+ * Release the MSI-X memory-mapped table.
+ */
+ int pos, nr_entries;
+ u32 phys_addr, table_offset;
+ u32 control;
+ u8 bir;
+
+ pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);

+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msi_control_reg(pos), 2, &control);
+ nr_entries = multi_msix_capable(control);
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msix_table_offset_reg(pos), 4, &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr,
+ nr_entries * PCI_MSIX_ENTRY_SIZE);
+ }
}
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = i;
- entry->msi_attrib.maskbit = 1;
- entry->dev = dev;
- entry->link.head = head;
- entry->mask_base = base;
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id
>>
- MSI_TARGET_CPU_SHIFT) &
MSI_TARGET_CPU_MASK);
- offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(address.lo_address.value, base + offset +
- PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + offset +
- PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + offset +
PCI_MSIX_ENTRY_DATA_OFFSET);
- writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- attach_msi_entry(entry, vector);

- return vector;
+ return 0;
}

-static int msi_free_vector(struct pci_dev* dev, int vector)
+static int reroute_msix_table(int head, struct msix_entry *entries, int *nvec)
{
- struct msi_desc *entry;
- int entry_nr, type;
+ int vector = head, tail = 0;
+ int i = 0, j = 0, nr_entries = 0;
unsigned long base = 0L;
unsigned long flags;
-
+
spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- if (!entry || entry->dev != dev) {
+ while (head != tail) {
+ nr_entries++;
+ tail = msi_desc[vector]->link.tail;
+ if (entries->entry == msi_desc[vector]->msi_attrib.entry_nr)
+ j = vector;
+ vector = tail;
+ }
+ if (*nvec > nr_entries) {
spin_unlock_irqrestore(&msi_lock, flags);
+ *nvec = nr_entries;
return -EINVAL;
}
- type = entry->msi_attrib.type;
- entry_nr = entry->msi_attrib.entry_nr;
- base = entry->mask_base;
- if (entry->link.tail != entry->link.head) {
- msi_desc[entry->link.head]->link.tail =
entry->link.tail;
- if (entry->link.tail)
- msi_desc[entry->link.tail]->link.head =
entry->link.head;
+ vector = ((j > 0) ? j : head);
+ for (i = 0; i < *nvec; i++) {
+ j = msi_desc[vector]->msi_attrib.entry_nr;
+ msi_desc[vector]->msi_attrib.state = 1; /* Mark it active */
+ vector_irq[vector] = -1; /* Mark it busy */
+ nr_released_vectors--;
+ (entries + i)->vector = vector;
+ if (j != (entries + i)->entry) {
+ base = msi_desc[vector]->mask_base;
+ msi_desc[vector]->msi_attrib.entry_nr =
+ (entries + i)->entry;
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+ writel( (readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET) & 0xff00) | vector,
+ base + (entries+i)->entry*PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET);
+ }
+ vector = msi_desc[vector]->link.tail;
}
- entry->dev = NULL;
- vector_irq[vector] = 0;
- nr_released_vectors++;
- msi_desc[vector] = NULL;
spin_unlock_irqrestore(&msi_lock, flags);
-
- kmem_cache_free(msi_cachep, entry);
- if (type == PCI_CAP_ID_MSIX) {
- int offset;
-
- offset = entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(1, base + offset +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- writel(0, base + offset +
PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- }
-
+
return 0;
}

/**
- * msi_alloc_vectors - allocate additional MSI-X vectors
+ * pci_enable_msix - configure device's MSI-X capability structure
* @dev: pointer to the pci_dev data structure of MSI-X device function
- * @vector: pointer to an array of new allocated MSI-X vectors
+ * @data: pointer to an array of MSI-X entries
* @nvec: number of MSI-X vectors requested for allocation by device driver
*
- * Allocate additional MSI-X vectors requested by device driver. A
- * return of zero indicates the successful setup of MSI-X capability
- * structure with new allocated MSI-X vectors or non-zero for otherwise.
+ * Setup the MSI-X capability structure of device function with the number
+ * of requested vectors upon its software driver call to request for
+ * MSI-X mode enabled on its hardware device function. A return of zero
+ * indicates the successful configuration of MSI-X capability structure
+ * with new allocated MSI-X vectors. A return of < 0 indicates a failure.
+ * Or a return of > 0 indicates that driver request is exceeding the number
+ * of vectors available. Driver should use the returned value to re-send
+ * its request.
**/
-int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec)
+int pci_enable_msix(struct pci_dev* dev, unsigned int *data, int nvec)
{
- struct msi_desc *entry;
- int i, head, pos, vec, free_vectors, alloc_vectors;
- int *vectors = (int *)vector;
+ struct msix_entry *entries = (struct msix_entry *)data;
+ int status, pos, nr_entries, free_vectors;
+ int i, j, temp;
u32 control;
unsigned long flags;

- if (!pci_msi_enable || !dev)
+ if (!pci_msi_enable || !dev || !data)
return -EINVAL;
-
+
+ if ((status = msi_init()) < 0)
+ return status;
+
if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)))
return -EINVAL;
-
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
2, &control);
- if (nvec > multi_msix_capable(control))
- return -EINVAL;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev || /* legal call */

- entry->msi_attrib.type != PCI_CAP_ID_MSIX || /* must be MSI-X
*/
- entry->link.head != entry->link.tail) { /* already multi
*/
- spin_unlock_irqrestore(&msi_lock, flags);
+
+ dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
+ 2, &control);
+ if (control & PCI_MSIX_FLAGS_ENABLE)
+ return -EINVAL; /* Already in MSI-X mode */
+
+ nr_entries = multi_msix_capable(control);
+ if (nvec > nr_entries)
return -EINVAL;
+
+ /* Check for any invalid entries */
+ for (i = 0; i < nvec; i++) {
+ if ((entries + i)->entry >= nr_entries)
+ return -EINVAL; /* invalid entry */
+ for (j = i + 1; j < nvec; j++) {
+ if ((entries + i)->entry == (entries + j)->entry)
+ return -EINVAL; /* duplicate entry */
+ }
+ }
+ temp = dev->irq;
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ /* Lookup Success */
+ nr_entries = nvec;
+ /* Reroute MSI-X table */
+ if (reroute_msix_table(dev->irq, entries, &nr_entries)) {
+ /* #requested > #previous-assigned */
+ dev->irq = temp;
+ return nr_entries;
+ }
+ dev->irq = temp;
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
+ return 0;
}
+ /* Check whether driver already requested for MSI vector */
+ if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ printk(KERN_INFO "Can't enable MSI-X. Device already had MSI vector assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&msi_lock, flags);
/*
* msi_lock is provided to ensure that enough vectors resources
are
* available before granting.
@@ -874,71 +1012,18 @@
free_vectors /= nr_msix_devices;
spin_unlock_irqrestore(&msi_lock, flags);

- if (nvec > free_vectors)
- return -EBUSY;
+ if (nvec > free_vectors) {
+ if (free_vectors > 0)
+ return free_vectors;
+ else
+ return -EBUSY;
+ }

- alloc_vectors = 0;
- head = dev->irq;
- for (i = 0; i < nvec; i++) {
- if ((vec = msi_alloc_vector(dev, head)) < 0)
- break;
- *(vectors + i) = vec;
- head = vec;
- alloc_vectors++;
- }
- if (alloc_vectors != nvec) {
- for (i = 0; i < alloc_vectors; i++) {
- vec = *(vectors + i);
- msi_free_vector(dev, vec);
- }
- spin_lock_irqsave(&msi_lock, flags);
- msi_desc[dev->irq]->link.tail =
msi_desc[dev->irq]->link.head;
- spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
- }
- if (nr_msix_devices > 0)
+ status = msix_capability_init(dev, entries, nvec);
+ if (!status && nr_msix_devices > 0)
nr_msix_devices--;
-
- return 0;
-}
-
-/**
- * msi_free_vectors - reclaim MSI-X vectors to unused state
- * @dev: pointer to the pci_dev data structure of MSI-X device function

- * @vector: pointer to an array of released MSI-X vectors
- * @nvec: number of MSI-X vectors requested for release by device
driver
- *
- * Reclaim MSI-X vectors released by device driver to unused state,
- * which may be used later on. A return of zero indicates the
- * success or non-zero for otherwise. Device driver should call this
- * before calling function free_irq.
- **/
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
-{
- struct msi_desc *entry;
- int i;
- unsigned long flags;
-
- if (!pci_msi_enable)
- return -EINVAL;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev ||
- entry->msi_attrib.type != PCI_CAP_ID_MSIX ||
- entry->link.head == entry->link.tail) { /* Nothing to
free */
- spin_unlock_irqrestore(&msi_lock, flags);
- return -EINVAL;
- }
- spin_unlock_irqrestore(&msi_lock, flags);
-
- for (i = 0; i < nvec; i++) {
- if (*(vector + i) == dev->irq)
- continue;/* Don't free entry 0 if mistaken by
driver */
- msi_free_vector(dev, *(vector + i));
- }
-
- return 0;
+
+ return status;
}

/**
@@ -952,62 +1037,67 @@
**/
void msi_remove_pci_irq_vectors(struct pci_dev* dev)
{
- struct msi_desc *entry;
- int type, temp;
+ int state, pos, temp;
unsigned long flags;
-
+
if (!pci_msi_enable || !dev)
return;
-
- if (!pci_find_capability(dev, PCI_CAP_ID_MSI)) {
- if (!pci_find_capability(dev, PCI_CAP_ID_MSIX))
- return;
- }
- temp = dev->irq;
- if (msi_lookup_vector(dev))
- return;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev) {
+
+ temp = dev->irq; /* Save IOAPIC IRQ */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSI)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ spin_lock_irqsave(&msi_lock, flags);
+ state = msi_desc[dev->irq]->msi_attrib.state;
spin_unlock_irqrestore(&msi_lock, flags);
- return;
- }
- type = entry->msi_attrib.type;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- msi_free_vector(dev, dev->irq);
- if (type == PCI_CAP_ID_MSIX) {
- int i, pos, dev_msi_cap;
- u32 phys_addr, table_offset;
- u32 control;
- u8 bir;
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn,
msi_control_reg(pos), 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- dev->bus->ops->read(dev->bus, dev->devfn,
- msix_table_offset_reg(pos), 4, &table_offset);
- bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
- phys_addr = pci_resource_start (dev, bir);
- phys_addr += (u32)(table_offset &
~PCI_MSIX_FLAGS_BIRMASK);
- for (i = FIRST_DEVICE_VECTOR; i < NR_IRQS; i++) {
+ if (state)
+ printk("WARNING! Driver fails freeing MSI vector[%d]\n",
+ dev->irq);
+ else /* Release MSI vector assigned to this device */
+ msi_free_vector(dev, dev->irq, 0);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
+ }
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ int vector, head, tail = 0, warning = 0;
+ unsigned long base = 0L;
+
+ vector = head = dev->irq;
+ while (head != tail) {
spin_lock_irqsave(&msi_lock, flags);
- if (!msi_desc[i] || msi_desc[i]->dev != dev) {
- spin_unlock_irqrestore(&msi_lock,
flags);
- continue;
- }
+ state = msi_desc[vector]->msi_attrib.state;
+ tail = msi_desc[vector]->link.tail;
+ base = msi_desc[vector]->mask_base;
spin_unlock_irqrestore(&msi_lock, flags);
- msi_free_vector(dev, i);
+ if (state) {
+ printk("WARNING! Driver fails freeing MSI-X vector[%d]\n",
+ vector);
+ warning = 1;
+ } else if (vector != head) /* Release MSI-X vector */
+ msi_free_vector(dev, vector, 0);
+ vector = tail;
+ }
+ msi_free_vector(dev, vector, 0);
+ if (warning) {
+ /* Force to release the MSI-X memory-mapped table */
+ u32 phys_addr, table_offset;
+ u32 control;
+ u8 bir;
+
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msi_control_reg(pos), 2, &control);
+ dev->bus->ops->read(dev->bus, dev->devfn,
+ msix_table_offset_reg(pos), 4, &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr, PCI_MSIX_ENTRY_SIZE *
+ multi_msix_capable(control));
}
- writel(1, entry->mask_base +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- iounmap((void*)entry->mask_base);
- release_mem_region(phys_addr, dev_msi_cap *
PCI_MSIX_ENTRY_SIZE);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
}
- dev->irq = temp;
- nr_reserved_vectors++;
}

EXPORT_SYMBOL(pci_enable_msi);
-EXPORT_SYMBOL(msi_alloc_vectors);
-EXPORT_SYMBOL(msi_free_vectors);
+EXPORT_SYMBOL(pci_enable_msix);
diff -urN linux-2.6.7/drivers/pci/msi.h patch-2.6.7-msix/drivers/pci/msi.h
--- linux-2.6.7/drivers/pci/msi.h 2004-05-09 22:32:53.000000000 -0400
+++ patch-2.6.7-msix/drivers/pci/msi.h 2004-06-22 10:16:09.000000000 -0400
@@ -90,6 +90,11 @@
#define MSI_LOGICAL_MODE 1
#define MSI_REDIRECTION_HINT_MODE 0

+struct msix_entry {
+ __u32 vector : 16; /* kernel uses to write allocated vector */
+ __u32 entry : 16; /* driver uses to specify entry, OS writes */
+};
+
struct msg_data {
#if defined(__LITTLE_ENDIAN_BITFIELD)
__u32 vector : 8;
@@ -140,7 +145,8 @@
struct {
__u8 type : 5; /* {0: unused, 5h:MSI, 11h:MSI-X} */
__u8 maskbit : 1; /* mask-pending bit supported ? */
- __u8 reserved: 2; /* reserved */
+ __u8 state : 1; /* {0: free, 1: busy} */
+ __u8 reserved: 1; /* reserved */
__u8 entry_nr; /* specific enabled entry */
__u8 default_vector; /* default pre-assigned vector */
__u8 current_cpu; /* current destination cpu */
diff -urN linux-2.6.7/include/linux/pci.h patch-2.6.7-msix/include/linux/pci.h
--- linux-2.6.7/include/linux/pci.h 2004-06-22 10:11:40.000000000 -0400
+++ patch-2.6.7-msix/include/linux/pci.h 2004-06-22 10:16:09.000000000 -0400
@@ -789,13 +789,14 @@
#ifndef CONFIG_PCI_USE_VECTOR
static inline void pci_scan_msi_device(struct pci_dev *dev) {}
static inline int pci_enable_msi(struct pci_dev *dev) {return -1;}
+static inline int pci_enable_msix(struct pci_dev* dev,
+ unsigned int *data, int nvec) {return -1;}
static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
#else
extern void pci_scan_msi_device(struct pci_dev *dev);
extern int pci_enable_msi(struct pci_dev *dev);
+extern int pci_enable_msix(struct pci_dev* dev, unsigned int *data, int nvec);
extern void msi_remove_pci_irq_vectors(struct pci_dev *dev);
-extern int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec);
-extern int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec);
#endif

#endif /* CONFIG_PCI */

2004-06-22 23:59:11

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

This looks good, a definite improvement over what's currently in the
kernel. I do have one question about the whole msi.c file (and this
applies to the code that's already in the tree, too). Why is config
space being accessed via calls like

dev->bus->ops->read(dev->bus, dev->devfn, ... )

instead of just calling

pci_read_config_word(dev, ... )

The only difference seems to be that MSI is bypassing the locking in
access.c. Is there some reason for this?

Thanks,
Roland
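
The difference Roland is asking about can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: every
mock_* name below is hypothetical and stands in for the kernel's bus ops
and for pci_read_config_word(); the lock is modeled as a plain flag rather
than the real spinlock in access.c.

```c
#include <stdint.h>

struct mock_bus;

/* Hypothetical stand-in for the bus ops table reached via dev->bus->ops. */
struct mock_bus_ops {
    int (*read)(struct mock_bus *bus, unsigned int devfn,
                int where, int size, uint32_t *val);
};

struct mock_bus {
    const struct mock_bus_ops *ops;
    uint8_t config[256];   /* config space of the one device on this bus */
    int lock_held;         /* models pci_lock, which is static to access.c */
    int unlocked_reads;    /* counts reads that bypassed the lock */
};

struct mock_dev {
    struct mock_bus *bus;
    unsigned int devfn;
};

/* Raw accessor: what a direct dev->bus->ops->read() call reaches. */
static int mock_raw_read(struct mock_bus *bus, unsigned int devfn,
                         int where, int size, uint32_t *val)
{
    uint32_t v = 0;

    (void)devfn;
    if (where < 0 || size < 1 || size > 4 ||
        where + size > (int)sizeof(bus->config))
        return -1;
    if (!bus->lock_held)
        bus->unlocked_reads++;
    /* Config space is little-endian; assemble the value byte by byte. */
    for (int i = 0; i < size; i++)
        v |= (uint32_t)bus->config[where + i] << (8 * i);
    *val = v;
    return 0;
}

static const struct mock_bus_ops mock_ops = { .read = mock_raw_read };

/* Locked wrapper: models what pci_read_config_word() adds on top. */
static int mock_read_config_word(struct mock_dev *dev, int where,
                                 uint16_t *val)
{
    uint32_t data = 0;
    int ret;

    dev->bus->lock_held = 1;
    ret = dev->bus->ops->read(dev->bus, dev->devfn, where, 2, &data);
    dev->bus->lock_held = 0;
    *val = (uint16_t)data;
    return ret;
}
```

Both paths return the same data; the only observable difference in the
model is whether the access happened with the lock held, which is the
point of the question.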

2004-06-23 00:26:59

by Jeff Garzik

Subject: Re: [PATCH]2.6.7 MSI-X Update

Roland Dreier wrote:
> This looks good, a definite improvement over what's currently in the
> kernel. I do have one question about the whole msi.c file (and this
> applies to the code that's already in the tree, too). Why is config
> space being accessed via calls like
>
> dev->bus->ops->read(dev->bus, dev->devfn, ... )
>
> instead of just calling
>
> pci_read_config_word(dev, ... )
>
> The only difference seems to be that MSI is bypassing the locking in
> access.c. Is there some reason for this?

hmmmmmmm.

Unless it's already inside the lock somehow... it definitely needs to
take the lock, one way or another.

Jeff


2004-06-23 01:18:42

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

Jeff> hmmmmmmm.

Jeff> Unless it's already inside the lock somehow... it
Jeff> definitely needs to take the lock, one way or another.

At least most (if not all) of the code paths are not inside the lock.
For example pci_enable_msi() is called from a device driver, and it
directly does

dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
2, &control);

as well as calling lots of other functions that do similar stuff.

In fact I don't see how any of the stuff in msi.c could be protected
by pci_lock, since pci_lock is static to access.c and only used inside
the pci_bus_read_config_xxx and pci_bus_write_config_xxx functions
defined there.

- Roland


2004-06-23 03:45:39

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

A couple of other comments on the patch:

> +Argument entries is a pointer of unsigned integer type. The number of
> +elements is indicated in argument nvec. The content of each element
> +will be mapped to the following struct defined in /driver/pci/msi.h.
> +
> +struct msix_entry {
> + __u32 vector : 16; /* kernel uses to write alloc vector */
> + __u32 entry : 16; /* driver uses to specify entry */
> +};
> +
> +A device driver is responsible for initializing the field entry of
> +each element with unique entry supported by MSI-X table.

I think this structure should be defined in a header in include/linux,
probably <linux/pci.h>. We could create a new <linux/msi.h> include
but I don't think it's worth it at this point. Also I don't see any
reason to use bitfields or userspace types like __u32 (since no
userspace code is going to use this include file). I would just
declare the type as

struct msix_entry {
u16 vector;
u16 entry;
};

> +int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)

Since this function takes an array of struct msix_entry in its entries
parameter, I think entries should be declared as struct msix_entry *
rather than just u32 *. That is, I would write the prototype as

int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
int nvec);

> + j = (entries + i)->entry;
> + (entries + i)->vector = vector;

Finally, this is a nitpick, but this just looks odd to me. Why not
write this as

j = entries[i].entry;
entries[i].vector = vector;

By the way, I have MSI working with my device with this patch
applied. I am getting ready to test MSI-X, which is why I have
comments about the MSI-X API now.

- Roland
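
A driver-side view of the typed API suggested above might look like the
following userspace sketch. It assumes the u16-based struct msix_entry
from this message (uint16_t stands in for u16) and the return convention
from the patch's kernel-doc: 0 on success, < 0 on failure, > 0 meaning
"retry with this many vectors". The mock_* names, the mock device fields,
and the extra nvec <= 0 check are hypothetical and are not taken from the
patch.

```c
#include <errno.h>
#include <stdint.h>

/* Layout proposed in the thread; uint16_t stands in for the kernel's u16. */
struct msix_entry {
    uint16_t vector;   /* written by the kernel: allocated vector */
    uint16_t entry;    /* written by the driver: MSI-X table entry */
};

/* Hypothetical stand-in for a device; the real struct pci_dev differs. */
struct mock_pci_dev {
    int table_size;    /* number of entries in the MSI-X table */
    int free_vectors;  /* vectors the mock kernel can still hand out */
    int next_vector;   /* next vector number to assign */
};

/* Mock of the proposed call with a typed entries array. */
static int mock_pci_enable_msix(struct mock_pci_dev *dev,
                                struct msix_entry *entries, int nvec)
{
    int i, j;

    if (!dev || !entries || nvec <= 0 || nvec > dev->table_size)
        return -EINVAL;

    /* Reject out-of-range and duplicate table entries, as the patch does. */
    for (i = 0; i < nvec; i++) {
        if (entries[i].entry >= dev->table_size)
            return -EINVAL;
        for (j = i + 1; j < nvec; j++)
            if (entries[i].entry == entries[j].entry)
                return -EINVAL;
    }

    /* Shortage: report how many vectors are available instead of failing. */
    if (nvec > dev->free_vectors)
        return dev->free_vectors > 0 ? dev->free_vectors : -EBUSY;

    for (i = 0; i < nvec; i++)
        entries[i].vector = (uint16_t)dev->next_vector++;
    dev->free_vectors -= nvec;
    return 0;
}

/*
 * Driver helper: request `want` vectors on table entries 0..want-1 and
 * retry once with the count the mock kernel reports as available.
 * Returns the number of vectors granted, or a negative errno.
 */
static int mock_driver_setup_msix(struct mock_pci_dev *dev,
                                  struct msix_entry *entries, int want)
{
    int i, ret;

    for (i = 0; i < want; i++)
        entries[i].entry = (uint16_t)i;

    ret = mock_pci_enable_msix(dev, entries, want);
    if (ret > 0) {
        want = ret;
        ret = mock_pci_enable_msix(dev, entries, want);
    }
    return ret ? ret : want;
}
```

With a typed parameter the array indexing form (entries[i].entry) falls
out naturally, and the compiler can catch a caller passing the wrong kind
of buffer, which a u32 * prototype cannot.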


2004-06-23 16:51:06

by Nguyen, Tom L

Subject: RE: [PATCH]2.6.7 MSI-X Update

On Tuesday, June 22, 2004 Roland Dreier wrote:
>I think this structure should be defined in a header in include/linux,
>probably <linux/pci.h>. We could create a new <linux/msi.h> include
>but I don't think it's worth it at this point. Also I don't see any
>reason to use bitfields or userspace types like __u32 (since no
>userspace code is going to use this include file). I would just
>declare the type as
>
>struct msix_entry {
> u16 vector;
> u16 entry;
>};
>
> > +int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)
>
>Since this function takes an array of struct msix_entry in its entries
>parameter, I think entries should be declared as struct msix_entry *
>rather than just u32 *. That is, I would write the prototype as
>
>int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries,
> int nvec);
>

Agree. Thanks for your suggestion of defining struct msix_entry in
<linux/pci.h>.

> > + j = (entries + i)->entry;
> > + (entries + i)->vector = vector;
>
>Finally, this is a nitpick, but this just looks odd to me. Why not
>write this as
>
> j = entries[i].entry;
> entries[i].vector = vector;
>

Agree. Thanks.

Thanks,
Long


2004-06-23 17:01:08

by Nguyen, Tom L

Subject: RE: [PATCH]2.6.7 MSI-X Update

On Tuesday, June 22, 2004 Roland Dreier wrote:
>This looks good, a definite improvement over what's currently in the
>kernel. I do have one question about the whole msi.c file (and this
>applies to the code that's already in the tree, too). Why is config
>space being accessed via calls like
>
> dev->bus->ops->read(dev->bus, dev->devfn, ... )
>
>instead of just calling
>
> pci_read_config_word(dev, ... )
>
>The only difference seems to be that MSI is bypassing the locking in
>access.c. Is there some reason for this?

I think that the locking in access.c is not necessary. But I agree
with you that using pci_read_config_word() would be cleaner.

Thanks,
Long


2004-06-23 17:07:48

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

OK, yet another comment on this update :)

Overall I like the idea of separating MSI and MSI-X support and
getting rid of the msi_alloc_vectors()/msi_free_vectors(). However it
seems there is a slight asymmetry in how MSI-X is handled now.

If a driver calls pci_enable_msix() (and the call succeeds), then the
device is immediately put into MSI-X mode -- that is, the enable bit
of its MSI-X capability is set. However, this bit will not be cleared
until the driver calls free_irq() for the last MSI-X vector.

This means that for a driver to clear the MSI-X enable bit, it must
first do request_irq() on all the vectors it was assigned and then
call free_irq(). It seems quite possible to me that a driver may not
use all the MSI-X vectors it is assigned, so device cleanup becomes a
problem. Also, there is no way for the driver to free its unused
MSI-X vectors.

It seems we need a pci_disable_msix() call to match the
pci_enable_msix() call. (And remove the disabling of MSI-X from the
free_irq code path)

I guess there is actually a similar problem with MSI -- if a driver
calls pci_enable_msi(), MSI will not be disabled until the driver does
request_irq/free_irq. This is not quite as serious because a driver
is unlikely not to use the single MSI vector it gets, but it is still a
problem for error cleanup paths. So maybe we need pci_disable_msi()
as well.

What do you think?

Thanks,
Roland

2004-06-23 22:07:41

by Nguyen, Tom L

Subject: RE: [PATCH]2.6.7 MSI-X Update

On Wed, June 23, 2004 Roland Dreier wrote:
> OK, yet another comment on this update :)
>
>Overall I like the idea of separating MSI and MSI-X support and
>getting rid of the msi_alloc_vectors()/msi_free_vectors(). However it
>seems there is a slight asymmetry in how MSI-X is handled now.
>
>If a driver calls pci_enable_msix() (and the call succeeds), then the
>device is immediately put into MSI-X mode -- that is, the enable bit
>of its MSI-X capability is set. However, this bit will not be cleared
>until the driver calls free_irq() for the last MSI-X vector.
>
>This means that for a driver to clear the MSI-X enable bit, it must
>first do request_irq() on all the vectors it was assigned and then
>call free_irq(). It seems quite possible to me that a driver may not
>use all the MSI-X vectors it is assigned, so device cleanup becomes a
>problem. Also, there is no way for the driver to free its unused
>MSI-X vectors.
>
>It seems we need a pci_disable_msix() call to match the
>pci_enable_msix() call. (And remove the disabling of MSI-X from the
>free_irq code path)
>
>I guess there is actually a similar problem with MSI -- if a driver
>calls pci_enable_msi(), MSI will not be disabled until the driver does
>request_irq/free_irq. This is not quite as serious because a driver
>is unlikely not to use the single MSI vector it gets, but it is still a
>problem for error cleanup paths. So maybe we need pci_disable_msi()
>as well.
>
>What do you think?

That was my initial thought as well. However, the two reasons below
convinced me that the patch should enforce the rules to ensure that the
software device driver behaves properly.

-The software driver is responsible for asking for how many vectors
it actually needs to service its hardware needs. Why does it ask for
more than what it actually uses?
-Assuming I add a pci_disable_msix() API, there are two consequences
of this approach. First, it would be a problem for the kernel to determine
whether the MSI vectors are unhooked cleanly from their corresponding
driver ISR before freeing these MSI vectors for reuse by other
devices. If there is an error in the driver, it may result in
unexpected behavior. Second, suppose a driver asks for 10 MSI
vectors the first time and then calls pci_disable_msix() to free
up 5. If the driver switches interrupt mode back and forth, the next MSI
request via pci_enable_msix() will be given 5 instead of 10.

Please tell me what you think.

Thanks,
Long

2004-06-24 01:56:09

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

Long> It was my initial thought. However, below two reasons
Long> convince me that the patch should enforce the rules to
Long> ensure that the software device driver behaves properly.

Long> -The software driver is responsible for asking how many
Long> vectors it actually needs to service its hardware needs. Why
Long> does it ask for more than what it actually uses?

I could imagine hardware where the driver does not know exactly how
many vectors it will use until it starts up. As a hypothetical
example, imagine some storage networking host adapter that supports an
interrupt vector per storage target. The driver does not know how
many vectors it will actually use until it has logged into the storage
fabric; in fact, the driver may want to keep some vectors "in reserve"
in case a new target is added to the fabric later.

I think it would be better to preserve maximum flexibility for devices
and drivers, and not mandate that every allocated MSI-X vector is
always used.

Long> -Assuming I add pci_disable_msix() API, then there are two
Long> consequences for this approach. It would be a problem for
Long> the kernel to determine whether the MSI vectors are unhooked
Long> cleanly from their corresponding driver ISR before freeing
Long> these MSI vectors for reuse purpose on other devices. If
Long> there is an error in the driver, it may result in unexpected
Long> behavior. Second, for example if a driver asks for 10 MSI
Long> vectors for the first time and decides to call
Long> pci_disable_msix() to free up 5. If the driver switches
Long> interrupt mode back and forth, the next MSI request by
Long> calling pci_enable_msix() will result in 5 given instead of 10.

It seems in the code right now you are able to tell if any MSI-X
vectors are hooked, since you wait for the last vector to be unhooked
to disable MSI-X. I would just have it be a WARN_ON() (or maybe
BUG_ON()) if a driver calls pci_disable_msix() without calling
free_irq for all its MSI-X vectors.

Right now there is an issue if a driver is unloaded without freeing
all its IRQs -- the device will be left in MSI-X mode and can not be
recovered without rebooting.

Also, drivers have a problem in their error paths right now with
freeing MSI-X resources. For example, suppose a driver successfully
requests 4 MSI-X vectors. request_irq() is a function call that can
fail, for example if the kernel can't allocate memory. What should
the driver do if its second (out of 4) request_irq() call fails?
There doesn't seem to be any way for it to proceed without leaking
MSI-X resources.

Similarly, with the API as it stands in your patch, a driver must be
very careful not to take any action that may fail in between calling
pci_enable_msix() and actually calling request_irq(), or otherwise the
only way to avoid leaking MSI-X resources is to take the very risky
step of calling request_irq() on an error path. This doesn't fit very
well with the structure of lots of device drivers, for example Intel's
very own e1000 driver, which wait until the device is actually opened
to call request_irq().

For your second point, I would have pci_disable_msix() always free all
MSI-X vectors that have been allocated... the only parameter that I
expect it would take is a struct pci_dev *.
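
A minimal sketch of that error path, assuming a pci_disable_msix(dev)
of the kind proposed here. This is userspace C meant only to show the
unwind order; every mock_* name is a hypothetical stand-in and not
kernel API, and fail_at simply simulates a request_irq() failure.

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_NVEC 4

/* Hypothetical userspace stand-in for a device's MSI-X state. */
struct mock_dev {
    bool msix_enabled;
    bool hooked[MOCK_NVEC];   /* true while a handler is attached */
    int  fail_at;             /* index whose mock_request_irq() fails, -1 for none */
};

/* Stand-in for pci_enable_msix(): grants all MOCK_NVEC vectors. */
static int mock_enable_msix(struct mock_dev *dev)
{
    dev->msix_enabled = true;
    return 0;
}

/* Stand-in for request_irq(): can fail, e.g. on memory shortage. */
static int mock_request_irq(struct mock_dev *dev, int i)
{
    if (i == dev->fail_at)
        return -1;
    dev->hooked[i] = true;
    return 0;
}

/* Stand-in for free_irq(): detaches the handler from vector i. */
static void mock_free_irq(struct mock_dev *dev, int i)
{
    dev->hooked[i] = false;
}

/* Stand-in for the proposed pci_disable_msix(dev): frees every vector. */
static void mock_disable_msix(struct mock_dev *dev)
{
    for (int i = 0; i < MOCK_NVEC; i++)
        assert(!dev->hooked[i]);  /* handlers must be gone first */
    dev->msix_enabled = false;
}

/* Driver setup: on any failure, unhook what was hooked, then disable. */
static int mock_driver_setup(struct mock_dev *dev)
{
    int i, err;

    err = mock_enable_msix(dev);
    if (err)
        return err;

    for (i = 0; i < MOCK_NVEC; i++) {
        err = mock_request_irq(dev, i);
        if (err)
            goto unwind;
    }
    return 0;

unwind:
    while (--i >= 0)
        mock_free_irq(dev, i);
    mock_disable_msix(dev);
    return err;
}
```

With fail_at set to 1, mock_driver_setup() returns an error and leaves
the mock device with MSI-X disabled and no handlers attached, so nothing
is leaked and no request_irq() is needed on the error path.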

- Roland

2004-06-24 06:35:29

by Zwane Mwaikambo

Subject: Re: [PATCH]2.6.7 MSI-X Update

On Wed, 23 Jun 2004, Roland Dreier wrote:

> I could imagine hardware where the driver does not know exactly how
> many vectors it will use until it starts up. As a hypothetical
> example, imagine some storage networking host adapter that supports an
> interrupt vector per storage target. The driver does not know how
> many vectors it will actually use until it has logged into the storage
> fabric; in fact, the driver may want to keep some vectors "in reserve"
> in case a new target is added to the fabric later.
>
> I think it would be better to preserve maximum flexibility for devices
> and drivers, and not mandate that every allocated MSI-X vector is
> always used.

The MSI subsystem should at most reserve and the driver make a request.
There may be a limit per PCI device as specified by the MSI subsystem for
some reason or other. Isn't this what we're all saying?

> It seems in the code right now you are able to tell if any MSI-X
> vectors are hooked, since you wait for the last vector to be unhooked
> to disable MSI-X. I would just have it be a WARN_ON() (or maybe
> BUG_ON()) if a driver calls pci_disable_msix() without calling
> free_irq for all its MSI-X vectors.
>
> Right now there is an issue if a driver is unloaded without freeing
> all its IRQs -- the device will be left in MSI-X mode and can not be
> recovered without rebooting.

This sounds like a case of a bad driver bug; generally the kernel would
oops when the ISR text gets unloaded. What kind of behaviour do you
expect here?

> Also, drivers have a problem in their error paths right now with
> freeing MSI-X resources. For example, suppose a driver successfully
> requests 4 MSI-X vectors. request_irq() is a function call that can
> fail, for example if the kernel can't allocate memory. What should
> the driver do if its second (out of 4) request_irq() call fails?
> There doesn't seem to be any way for it to proceed without leaking
> MSI-X resources.

I agree here, the request/free of vectors must be controllable in the
driver, this is one place where we may have to allow people to hang
themselves.

> Similarly, with the API as it stands in your patch, a driver must be
> very careful not to take any action that may fail in between calling
> pci_enable_msix() and actually calling request_irq(), or otherwise the
> only way to avoid leaking MSI-X resources is to take the very risky
> step of calling request_irq() on an error path. This doesn't fit very
> well with the structure of lots of device drivers, for example Intel's
> very own e1000 driver, which wait until the device is actually opened
> to call request_irq().

Could you elaborate further here? Won't a matched pci_disable_msix() free
the necessary resources on failure?

> For your second point, I would have pci_disable_msix() always free all
> MSI-X vectors that have been allocated... the only parameter that I
> expect it would take is a struct pci_dev *.

If the driver is doing this, then we won't have to bother about
pci_disable_msix() doing the vector free surely?

Thanks Roland,
Zwane

2004-06-24 07:28:05

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

Hi, I think you may not have read Long's patch/API carefully, since
you seem to be misunderstanding my objection. In any case...

Roland> I could imagine hardware where the driver does not know
Roland> exactly how many vectors it will use until it starts up.
Roland> As a hypothetical example, imagine some storage networking
Roland> host adapter that supports an interrupt vector per storage
Roland> target. The driver does not know how many vectors it will
Roland> actually use until it has logged into the storage fabric;
Roland> in fact, the driver may want to keep some vectors "in
Roland> reserve" in case a new target is added to the fabric
Roland> later.

Roland> I think it would be better to preserve maximum flexibility
Roland> for devices and drivers, and not mandate that every
Roland> allocated MSI-X vector is always used.

Zwane> The MSI subsystem should at most reserve and the driver
Zwane> make a request. There may be a limit per PCI device as
Zwane> specified by the MSI subsystem for some reason or
Zwane> other. Isn't this what we're all saying?

No, Long is actually saying that a driver must actually call
request_irq() on all the vectors that it is allocated. I am saying
that this requirement is too stringent, since there may be devices and
drivers that cannot predict exactly how many MSI-X vectors they will
use during driver initialization.

Roland> It seems in the code right now you are able to tell if any
Roland> MSI-X vectors are hooked, since you wait for the last
Roland> vector to be unhooked to disable MSI-X. I would just have
Roland> it be a WARN_ON() (or maybe BUG_ON()) if a driver calls
Roland> pci_disable_msix() without calling free_irq for all its
Roland> MSI-X vectors.

Roland> Right now there is an issue if a driver is unloaded
Roland> without freeing all its IRQs -- the device will be left in
Roland> MSI-X mode and can not be recovered without rebooting.

Zwane> This sounds like a case of bad driver bug generally the
Zwane> kernel would oops when the ISR text gets unloaded. What
Zwane> kind of behaviour do you expect here?

Yes, I agree, it is a bad driver bug if the driver is unloaded without
doing free_irq() on all the vectors it has done request_irq() on.
However, with Long's API, there is a problem if for example a device
driver does pci_enable_msix() and is allocated 2 vectors, then
correctly does request_irq()/free_irq() on one vector and doesn't
touch the second vector, and then is unloaded. The device will be
left with MSI-X enabled and leak its vectors.

In the proposed API, since there is no pci_disable_msix() call, the
only way the driver can free its MSI-X vector is to actually do
request_irq()/free_irq() on it.

Roland> Similarly, with the API as it stands in your patch, a
Roland> driver must be very careful not to take any action that
Roland> may fail in between calling pci_enable_msix() and actually
Roland> calling request_irq(), or otherwise the only way to avoid
Roland> leaking MSI-X resources is to take the very risky step of
Roland> calling request_irq() on an error path. This doesn't fit
Roland> very well with the structure of lots of device drivers,
Roland> for example Intel's very own e1000 driver, which wait
Roland> until the device is actually opened to call request_irq().

Zwane> Could you elaborate further here? Won't a matched
Zwane> pci_disable_msix() free the necessary resources on failure?

Yes, a matched pci_disable_msix() would be exactly what is needed.
However, look at Long's patch -- there is no such function in the API
he is proposing.

Roland> For your second point, I would have pci_disable_msix()
Roland> always free all MSI-X vectors that have been
Roland> allocated... the only parameter that I expect it would
Roland> take is a struct pci_dev *.

Zwane> If the driver is doing this, then we won't have to bother
Zwane> about pci_disable_msix() doing the vector free surely?

I think the PCI core needs to know which vectors are in use and which
are free (and ready to assign to PCI devices that request them).

I believe the correct API/semantics for a device driver are:

pci_enable_msix(dev, &entries, num_entries);
/* On success, driver now has full use of the num_entries
interrupt vectors returned through entries. MSI-X enable
bit is set in PCI header. */
/* ... */
/* driver freely does request_irq()/free_irq() on some or all
vectors in entries while running. */
/* ... */
pci_disable_msix(dev);
/* All handlers attached to MSI-X vectors must be removed with
free_irq() before pci_disable_msix() call. */
/* MSI-X enable bit is now cleared from PCI header, and all
interrupt vectors are returned to the core for possible
reallocation. */

The major change from Long's proposal is the addition of the
pci_disable_msix() function.
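
The semantics above can be modelled roughly as follows. This is a
userspace sketch under stated assumptions, not the patch's code: the
mock_* names and the fixed pool of 8 vectors are hypothetical, and
mock_disable() returns the number of still-hooked vectors only so the
model can be checked, whereas the proposed pci_disable_msix() would
more likely be void and WARN_ON() in that case.

```c
#include <assert.h>
#include <stdio.h>

#define MOCK_MAX_VEC 8

/* Hypothetical model of the PCI core's view of one device. */
struct mock_msix {
    int enabled;              /* models the MSI-X enable bit */
    int nvec;                 /* vectors currently owned by the device */
    int hooked[MOCK_MAX_VEC]; /* 1 while a handler is attached */
};

static int mock_pool = MOCK_MAX_VEC;  /* free vectors held by the core */

/* Models pci_enable_msix(): all-or-nothing allocation from the pool. */
static int mock_enable(struct mock_msix *m, int nvec)
{
    if (m->enabled || nvec <= 0 || nvec > mock_pool)
        return -1;
    mock_pool -= nvec;
    m->nvec = nvec;
    m->enabled = 1;
    return 0;
}

/*
 * Models the proposed pci_disable_msix(): counts vectors that were
 * still hooked (a driver bug), then gives every vector back to the
 * pool whether or not the driver ever used it.
 */
static int mock_disable(struct mock_msix *m)
{
    int still_hooked = 0;

    for (int i = 0; i < m->nvec; i++) {
        if (m->hooked[i]) {
            still_hooked++;
            m->hooked[i] = 0;
        }
    }
    if (still_hooked)
        fprintf(stderr, "warning: %d vector(s) still hooked\n", still_hooked);

    assert(mock_pool + m->nvec <= MOCK_MAX_VEC);  /* pool must not overflow */
    mock_pool += m->nvec;
    m->nvec = 0;
    m->enabled = 0;
    return still_hooked;
}
```

In this model a driver that is allocated 2 vectors, hooks and unhooks
only the first, and then calls mock_disable() gets both vectors back
into the pool, which is the case that leaks without a disable call.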

- Roland

2004-06-24 16:34:40

by Nguyen, Tom L

Subject: RE: [PATCH]2.6.7 MSI-X Update

On Thursday, June 24, 2004 Roland Dreier wrote:
> Roland> I could imagine hardware where the driver does not know
> Roland> exactly how many vectors it will use until it starts up.
> Roland> As a hypothetical example, imagine some storage networking
> Roland> host adapter that supports an interrupt vector per storage
> Roland> target. The driver does not know how many vectors it will
> Roland> actually use until it has logged into the storage fabric;
> Roland> in fact, the driver may want to keep some vectors "in
> Roland> reserve" in case a new target is added to the fabric
> Roland> later.
>
> Roland> I think it would be better to preserve maximum flexibility
> Roland> for devices and drivers, and not mandate that every
> Roland> allocated MSI-X vector is always used.
>
> Zwane> The MSI subsystem should at most reserve and the driver
> Zwane> make a request. There may be a limit per PCI device as
> Zwane> specified by the MSI subsystem for some reason or
> Zwane> other. Isn't this what we're all saying?
>
>No, Long is actually saying that a driver must actually call
>request_irq() on all the vectors that it is allocated. I am saying
>that this requirement is too stringent, since there may be devices and
>drivers that cannot predict exactly how many MSI-X vectors they will
>use during driver initialization.

That is what we're all saying.

> Roland> It seems in the code right now you are able to tell if any
> Roland> MSI-X vectors are hooked, since you wait for the last
> Roland> vector to be unhooked to disable MSI-X. I would just have
> Roland> it be a WARN_ON() (or maybe BUG_ON()) if a driver calls
> Roland> pci_disable_msix() without calling free_irq for all its
> Roland> MSI-X vectors.
>
> Roland> Right now there is an issue if a driver is unloaded
> Roland> without freeing all its IRQs -- the device will be left in
> Roland> MSI-X mode and can not be recovered without rebooting.
>
> Zwane> This sounds like a case of bad driver bug generally the
> Zwane> kernel would oops when the ISR text gets unloaded. What
> Zwane> kind of behaviour do you expect here?
>
>Yes, I agree, it is a bad driver bug if the driver is unloaded without
>doing free_irq() on all the vectors it has done request_irq() on.
>However, with Long's API, there is a problem if for example a device
>driver does pci_enable_msix() and is allocated 2 vectors, then
>correctly does request_irq()/free_irq() on one vector and doesn't
>touch the second vector, and then is unloaded. The device will be
>left with MSI-X enabled and leak its vectors.

It's very convincing. The addition of the pci_disable_msi() and
pci_disable_msix() functions is what is needed to handle this issue.

Thanks,
Long

2004-06-25 23:28:31

by long

Subject: RE:[PATCH]2.6.7 MSI-X Update

Hi Roland,

Attached is an update to the previous MSI-X patch. This update reflects
the lkml comments on the code as well as the suggestion of two
additional APIs (pci_disable_msi and pci_disable_msix). Please test it
out and let me know what you think.
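
For anyone testing, the shortage handling described in section 5.3.3
of the updated MSI-HOWTO could be exercised along these lines. This is
a userspace sketch only: mock_enable_msix(), mock_setup_vectors(),
mock_available and the vector numbers are hypothetical, and the mock
takes a struct msix_entry pointer directly for simplicity, where the
patch declares the second argument as u32 *.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the layout shown in the patch's MSI-HOWTO text. */
struct msix_entry {
    uint16_t vector;  /* written by the (mock) core */
    uint16_t entry;   /* chosen by the driver */
};

static int mock_available = 3;  /* hypothetical: vectors the core can spare */

/*
 * Hypothetical stand-in for pci_enable_msix(): 0 on success, a positive
 * count when fewer vectors are available than requested, negative on error.
 */
static int mock_enable_msix(struct msix_entry *entries, int nvec)
{
    if (nvec <= 0 || mock_available == 0)
        return -1;
    if (nvec > mock_available)
        return mock_available;
    for (int i = 0; i < nvec; i++)
        entries[i].vector = (uint16_t)(0x40 + i);  /* made-up vector numbers */
    mock_available -= nvec;
    assert(mock_available >= 0);
    return 0;
}

/* Ask for `want` vectors, retrying once with whatever the core offers. */
static int mock_setup_vectors(struct msix_entry *entries, int want)
{
    int ret = mock_enable_msix(entries, want);

    if (ret > 0) {            /* shortage: ret is the number available */
        want = ret;
        ret = mock_enable_msix(entries, want);
    }
    return ret == 0 ? want : ret;
}
```

A driver that asks for 5 vectors while only 3 are free would end up
with 3 in this model, and a later request made when none are left
fails with a negative value.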

Starting on 06/26, I'll not have access to email for two weeks. I'll
respond to any inputs after that.

Thanks,
Long

----------------------------------------------------------------------
diff -urN linux-2.6.7/Documentation/MSI-HOWTO.txt patch-2.6.7-fix-msix/Documentation/MSI-HOWTO.txt
--- linux-2.6.7/Documentation/MSI-HOWTO.txt 2004-05-09 22:31:58.000000000 -0400
+++ patch-2.6.7-fix-msix/Documentation/MSI-HOWTO.txt 2004-06-25 16:10:58.000000000 -0400
@@ -3,13 +3,14 @@
10/03/2003
Revised Feb 12, 2004 by Martine Silbermann
email: [email protected]
+ Revised Jun 25, 2004 by Tom L Nguyen

1. About this guide

-This guide describes the basics of Message Signaled Interrupts(MSI), the
-advantages of using MSI over traditional interrupt mechanisms, and how
-to enable your driver to use MSI or MSI-X. Also included is a Frequently
-Asked Questions.
+This guide describes the basics of Message Signaled Interrupts (MSI),
+the advantages of using MSI over traditional interrupt mechanisms,
+and how to enable your driver to use MSI or MSI-X. Also included is
+a Frequently Asked Questions section.

2. Copyright 2003 Intel Corporation

@@ -35,7 +36,7 @@
the MSI/MSI-X capability structure in its PCI capability list. The
device function may implement both the MSI capability structure and
the MSI-X capability structure; however, the bus driver should not
-enable both, but instead enable only the MSI-X capability structure.
+enable both.

The MSI capability structure contains Message Control register,
Message Address register and Message Data register. These registers
@@ -86,7 +87,7 @@
support for better interrupt performance.

Using MSI enables the device functions to support two or more
-vectors, which can be configure to target different CPU's to
+vectors, which can be configured to target different CPU's to
increase scalability.

5. Configuring a driver to use MSI/MSI-X
@@ -95,26 +96,49 @@
support this capability. The CONFIG_PCI_USE_VECTOR kernel option
must be selected to enable MSI/MSI-X support.

-5.1 Including MSI support into the kernel
+5.1 Including MSI/MSI-X support into the kernel

-To allow MSI-Capable device drivers to selectively enable MSI (using
-pci_enable_msi as described below), the VECTOR based scheme needs to
-be enabled by setting CONFIG_PCI_USE_VECTOR.
+To allow MSI/MSI-X capable device drivers to selectively enable
+MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
+below), the VECTOR based scheme needs to be enabled by setting
+CONFIG_PCI_USE_VECTOR during kernel config.

Since the target of the inbound message is the local APIC, providing
-CONFIG_PCI_USE_VECTOR is dependent on whether CONFIG_X86_LOCAL_APIC
-is enabled or not.
+CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_USE_VECTOR.

-int pci_enable_msi(struct pci_dev *)
+5.2 Configuring for MSI support
+
+Due to the non-contiguous fashion in vector assignment of the
+existing Linux kernel, this version does not support multiple
+messages, regardless of whether a device function is capable of
+supporting more than one vector. Enabling MSI on a device function's
+MSI capability structure requires a device driver to call the
+function pci_enable_msi() explicitly.
+
+5.2.1 API pci_enable_msi
+
+int pci_enable_msi(struct pci_dev *dev)

With this new API, any existing device driver, which like to have
-MSI enabled on its device function, must call this explicitly. A
-successful call will initialize the MSI/MSI-X capability structure
-with ONE vector, regardless of whether the device function is
+MSI enabled on its device function, must call this API to enable MSI.
+A successful call will initialize the MSI capability structure
+with ONE vector, regardless of whether a device function is
capable of supporting multiple messages. This vector replaces the
pre-assigned dev->irq with a new MSI vector. To avoid the conflict
of new assigned vector with existing pre-assigned vector requires
-the device driver to call this API before calling request_irq(...).
+a device driver to call this API before calling request_irq().
+
+5.2.2 API pci_disable_msi
+
+void pci_disable_msi(struct pci_dev *dev)
+
+This API is needed to handle the case where a device driver is
+unloaded without doing request_irq on its assigned MSI vector, which
+results in the device being left in MSI mode and unable to recover
+without rebooting. This API gives a device driver the ability to
+recover and resolves vector leakage.
+
+5.2.3 MSI mode vs. legacy mode diagram

The below diagram shows the events, which switches the interrupt
mode on the MSI-capable device function between MSI mode and
@@ -124,105 +148,258 @@
| | <=============== | |
| MSI MODE | | PIN-IRQ ASSERTION MODE |
| | ===============> | |
- ------------ free_irq ------------------------
+ ------------ free_irq/ ------------------------
+ pci_disable_msi

-5.2 Configuring for MSI support

-Due to the non-contiguous fashion in vector assignment of the
-existing Linux kernel, this version does not support multiple
-messages regardless of the device function is capable of supporting
-more than one vector. The bus driver initializes only entry 0 of
-this capability if pci_enable_msi(...) is called successfully by
-the device driver.
+Figure 1.0 MSI Mode vs. Legacy Mode
+
+In Figure 1.0, a device operates by default in legacy mode. Legacy
+in this context means PCI pin-irq assertion or PCI-Express INTx
+emulation. A successful MSI request (using pci_enable_msi()) switches
+a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem and a newly
+assigned MSI vector will replace dev->irq.
+
+To return to its default mode, a device driver must call free_irq()
+using the allocated MSI vector. The PCI subsystem restores the
+device's dev->irq with the pre-assigned IOAPIC vector and marks the
+released MSI vector as unused. Once it is marked as unused, there is
+no guarantee that the PCI subsystem will reserve this MSI vector for
+the device. Depending on the availability of current PCI vector
+resources and the number of MSI/MSI-X requests from other drivers,
+this MSI vector may be re-assigned. For the case where the PCI
+subsystem re-assigned this MSI vector to another driver, a request to
+switch back to MSI mode may result in being assigned a different MSI
+vector or a failure if no more vectors are available.

5.3 Configuring for MSI-X support

-Both the MSI capability structure and the MSI-X capability structure
-share the same above semantics; however, due to the ability of the
-system software to configure each vector of the MSI-X capability
-structure with an independent message address and message data, the
-non-contiguous fashion in vector assignment of the existing Linux
-kernel has no impact on supporting multiple messages on an MSI-X
-capable device functions. By default, as mentioned above, ONE vector
-should be always allocated to the MSI-X capability structure at
-entry 0. The bus driver does not initialize other entries of the
-MSI-X table.
-
-Note that the PCI subsystem should have full control of a MSI-X
-table that resides in Memory Space. The software device driver
-should not access this table.
-
-To request for additional vectors, the device software driver should
-call function msi_alloc_vectors(). It is recommended that the
-software driver should call this function once during the
+Due to the ability of the system software to configure each vector of
+the MSI-X capability structure with an independent message address
+and message data, the non-contiguous fashion in vector assignment of
+the existing Linux kernel has no impact on supporting multiple
+messages on MSI-X capable device functions. Enabling MSI-X on
+a device function's MSI-X capability structure requires its device
+driver to call the function pci_enable_msix() explicitly.
+
+The function pci_enable_msix(), once invoked, enables either
+all or nothing, depending on the current availability of PCI vector
+resources. If the PCI vector resources are available for the number
+of vectors requested by a device driver, this function will configure
+the MSI-X table of the MSI-X capability structure of a device with
+requested messages. For example, a device may be capable of
+supporting a maximum of 32 vectors while its software driver may
+usually request only 4 vectors. It is recommended that the device
+driver call this function once during the
initialization phase of the device driver.

-The function msi_alloc_vectors(), once invoked, enables either
-all or nothing, depending on the current availability of vector
-resources. If no vector resources are available, the device function
-still works with ONE vector. If the vector resources are available
-for the number of vectors requested by the driver, this function
-will reconfigure the MSI-X capability structure of the device with
-additional messages, starting from entry 1. To emphasize this
-reason, for example, the device may be capable for supporting the
-maximum of 32 vectors while its software driver usually may request
-4 vectors.
-
-For each vector, after this successful call, the device driver is
-responsible to call other functions like request_irq(), enable_irq(),
-etc. to enable this vector with its corresponding interrupt service
-handler. It is the device driver's choice to have all vectors shared
-the same interrupt service handler or each vector with a unique
-interrupt service handler.
-
-In addition to the function msi_alloc_vectors(), another function
-msi_free_vectors() is provided to allow the software driver to
-release a number of vectors back to the vector resources. Once
-invoked, the PCI subsystem disables (masks) each vector released.
-These vectors are no longer valid for the hardware device and its
-software driver to use. Like free_irq, it recommends that the
-device driver should also call msi_free_vectors to release all
-additional vectors previously requested.
-
-int msi_alloc_vectors(struct pci_dev *dev, int *vector, int nvec)
-
-This API enables the software driver to request the PCI subsystem
-for additional messages. Depending on the number of vectors
-available, the PCI subsystem enables either all or nothing.
+Unlike the function pci_enable_msi(), the function pci_enable_msix()
+does not replace the pre-assigned IOAPIC dev->irq with a new MSI
+vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
+into the field vector of each element contained in the second argument.
+Note that the pre-assigned IO-APIC dev->irq is valid only if the device
+operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt by the
+device driver to use dev->irq to request interrupt service may
+result in unpredictable behavior.
+
+For each MSI-X vector granted, a device driver is responsible for
+calling other functions like request_irq(), enable_irq(), etc. to enable
+this vector with its corresponding interrupt service handler. It is
+a device driver's choice to assign all vectors with the same
+interrupt service handler or each vector with a unique interrupt
+service handler.
+
+5.3.1 Handling MMIO address space of MSI-X Table
+
+The PCI 3.0 specification has implementation notes that MMIO address
+space for a device's MSI-X structure should be isolated so that the
+software system can set a different page for controlling accesses to
+the MSI-X structure. The implementation of the MSI patch requires the
+PCI subsystem, not a device driver, to maintain full control of the
+MSI-X table/MSI-X PBA and MMIO address space of the MSI-X table/MSI-X
+PBA. A device driver is prohibited from requesting the MMIO address
+space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem will
+fail to enable MSI-X on its hardware device when it calls the function
+pci_enable_msix().
+
+5.3.2 Handling MSI-X allocation
+
+Determining the number of MSI-X vectors allocated to a function is
+dependent on the number of MSI capable devices and MSI-X capable
+devices populated in the system. The policy of allocating MSI-X
+vectors to a function is defined as follows:
+
+# of MSI-X vectors allocated to a function = (x - y)/z where
+
+x = The number of available PCI vector resources by the time
+ the device driver calls pci_enable_msix(). The PCI vector
+ resources are the sum of the number of unassigned vectors
+ (new) and the number of released vectors when any MSI/MSI-X
+ device driver switches its hardware device back to a legacy
+ mode or is hot-removed. The number of unassigned vectors
+ may exclude some vectors reserved, as defined in parameter
+ NR_HP_RESERVED_VECTORS, for the case where the system is
+ capable of supporting hot-add/hot-remove operations. Users
+ may change the value defined in NR_HP_RESERVED_VECTORS to
+ meet their specific needs.
+
+y = The number of MSI capable devices populated in the system.
+ This policy ensures that each MSI capable device has its
+ vector reserved to avoid the case where some MSI-X capable
+ drivers may attempt to claim all available vector resources.
+
+z = The number of MSI-X capable devices populated in the system.
+ This policy ensures that maximum (x - y) is distributed
+ evenly among MSI-X capable devices.
+
+Note that the PCI subsystem scans y and z during a bus enumeration.
+When the PCI subsystem completes configuring MSI/MSI-X capability
+structure of a device as requested by its device driver, y/z is
+decremented accordingly.
+
+5.3.3 Handling MSI-X shortages
+
+For the case where fewer MSI-X vectors are allocated to a function
+than requested, the function pci_enable_msix() will return the
+maximum number of MSI-X vectors available to the caller. A device
+driver may re-send its request asking for no more vectors than the
+number indicated in the return value. For example, if a device driver
+requests 5 vectors but only 3 vectors are available, a value of 3
+will be returned by the pci_enable_msix() call. A function could be
+designed for its driver to use only 3 MSI-X table entries in
+different combinations such as ABC--, A-B-C, A--CB, etc. Note that
+this patch does not support multiple entries with the same vector. An
+attempt by a device driver to use 5 MSI-X table entries with 3 vectors
+as ABBCC, AABCC, BCCBA, etc. will cause the function
+pci_enable_msix() to fail. Below are the reasons why supporting
+multiple entries with the same vector is an undesirable solution.
+
+ - The PCI subsystem cannot determine which entry generated the
+ message, and so which entry to mask/unmask, while handling the
+ software driver ISR. Attempting to walk through all MSI-X
+ table entries (2048 max) to mask/unmask any matching vector
+ is an undesirable solution.
+
+ - Walking through all MSI-X table entries (2048 max) to handle
+ SMP affinity of any matching vector is an undesirable solution.
+
+5.3.4 API pci_enable_msix
+
+int pci_enable_msix(struct pci_dev *dev, u32 *entries, int nvec)
+
+This API enables a device driver to request the PCI subsystem
+for enabling MSI-X messages on its hardware device. Depending on
+the availability of PCI vectors resources, the PCI subsystem enables
+either all or nothing.

Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
+
+Argument entries is a pointer of unsigned integer type. The number of
+elements is indicated in argument nvec. The content of each element
+will be mapped to the following struct defined in drivers/pci/msi.h.
+
+struct msix_entry {
+ u16 vector; /* kernel uses to write alloc vector */
+ u16 entry; /* driver uses to specify entry */
+};
+
+A device driver is responsible for initializing the field entry of
+each element with a unique entry supported by the MSI-X table.
+Otherwise, -EINVAL will be returned as a result. A successful return
+of zero indicates that the PCI subsystem has finished initializing
+each of the requested entries of the MSI-X table with message address
+and message data. Last but not least, the PCI subsystem will write the
+1:1 vector-to-entry mapping into the field vector of each element. A
+device driver is responsible for keeping track of allocated MSI-X
+vectors in its internal data structure.
+
Argument nvec is an integer indicating the number of messages
requested.
-A return of zero indicates that the number of allocated vector is
-successfully allocated. Otherwise, indicate resources not
-available.
-
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
-
-This API enables the software driver to inform the PCI subsystem
-that it is willing to release a number of vectors back to the
-MSI resource pool. Once invoked, the PCI subsystem disables each
-MSI-X entry associated with each vector stored in the argument 2.
-These vectors are no longer valid for the hardware device and
-its software driver to use.

-Argument dev points to the device (pci_dev) structure.
-Argument vector is a pointer of integer type. The number of
-elements is indicated in argument nvec.
-Argument nvec is an integer indicating the number of messages
-released.
-A return of zero indicates that the number of allocated vectors
-is successfully released. Otherwise, indicates a failure.
+A return of zero indicates that the requested number of MSI-X vectors
+was successfully allocated. A return of greater than zero indicates
+an MSI-X vector shortage. A return of less than zero indicates
+a failure. This failure may be a result of duplicate entries
+specified in the second argument, or a result of no available vector,
+or a result of failing to initialize MSI-X table entries.
+
+5.3.5 API pci_disable_msix
+
+void pci_disable_msix(struct pci_dev *dev)
+
+This API is needed to handle the case where a device driver is
+unloaded without doing request_irq on all assigned MSI-X vectors,
+which results in the device being left in MSI-X mode and unable to
+recover without rebooting. For example, a device driver does
+pci_enable_msix and is allocated 2 vectors, then correctly does
+request_irq/free_irq on one vector but does not touch the second
+vector. When the device driver is unloaded, the device will be left in
+MSI-X mode. This API gives a device driver the ability to recover
+without rebooting and resolves vector leakage.
+
+5.3.6 MSI-X mode vs. legacy mode diagram
+
+The below diagram shows the events, which switches the interrupt
+mode on the MSI-X capable device function between MSI-X mode and
+PIN-IRQ assertion mode (legacy).
+
+ ------------ pci_enable_msix(,,n) ------------------------
+ | | <=============== | |
+ | MSI-X MODE | | PIN-IRQ ASSERTION MODE |
+ | | ===============> | |
+ ------------ (n)free_irq/ ------------------------
+ pci_disable_msix
+
+Figure 2.0 MSI-X Mode vs. Legacy Mode
+
+In Figure 2.0, a device operates by default in legacy mode. A
+successful MSI-X request (using pci_enable_msix()) switches a
+device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
+stored in dev->irq will be saved by the PCI subsystem; however,
+unlike MSI mode, the PCI subsystem will not replace dev->irq with
+an assigned MSI-X vector because the PCI subsystem already writes the
+1:1 vector-to-entry mapping into the field vector of each element
+specified in the second argument.
+
+To return to its default mode, a device driver must call
+free_irq() on every allocated MSI-X vector on which it has done
+request_irq. Unlike MSI mode, the PCI subsystem switches a device
+function back to its default legacy mode if and only if its device
+driver successfully releases all allocated MSI-X vectors correctly
+associated with request_irq/free_irq.
+
+Note that if a device still operates in MSI-X mode, its device
+driver can use request_irq/free_irq on any vectors in subset n. When
+the PCI subsystem detects all MSI-X vectors being released by a device
+driver, it will switch a function's interrupt mode from MSI-X mode
+to legacy mode and mark all allocated MSI-X vectors as unused. Once
+they are marked as unused, there is no guarantee that the PCI subsystem
+will reserve these MSI-X vectors for a device. Depending on the
+availability of current PCI vector resources and the number of
+MSI/MSI-X requests from other drivers, these MSI-X vectors may be
+re-assigned. For the case where the PCI subsystem re-assigned
+these MSI-X vectors to another driver, a request to switch back to
+MSI-X mode may result in being assigned another set of MSI-X vectors
+or in a failure.
+
+5.4 Handling a function implementing both MSI and MSI-X capabilities
+
+For the case where a function implements both MSI and MSI-X
+capabilities, the PCI subsystem enables a device to run either in MSI
+mode or MSI-X mode but not both. A device driver determines whether it
+wants MSI or MSI-X enabled on its hardware device. Once a device
+driver has requested MSI, for example, it is prohibited from requesting
+MSI-X; in other words, a device driver is not permitted to ping-pong
+between MSI mode and MSI-X mode at run time.

-5.4 Hardware requirements for MSI support
-MSI support requires support from both system hardware and
+5.5 Hardware requirements for MSI/MSI-X support
+MSI/MSI-X support requires support from both system hardware and
individual hardware device functions.

-5.4.1 System hardware support
+5.5.1 System hardware support
Since the target of MSI address is the local APIC CPU, enabling
-MSI support in Linux kernel is dependent on whether existing
+MSI/MSI-X support in Linux kernel is dependent on whether existing
system hardware supports local APIC. Users should verify their
system whether it runs when CONFIG_X86_LOCAL_APIC=y.

@@ -231,14 +408,14 @@
CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
CONFIG_PCI_USE_VECTOR enables the VECTOR based scheme and
the option for MSI-capable device drivers to selectively enable
-MSI (using pci_enable_msi as described below).
+MSI/MSI-X.

-Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI
-vector is allocated new during runtime and MSI support does not
-depend on BIOS support. This key independency enables MSI support
-on future IOxAPIC free platform.
+Note that the CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
+vectors are newly allocated at runtime and MSI/MSI-X support does not
+depend on BIOS support. This key independence enables MSI/MSI-X
+support on future IOxAPIC-free platforms.

-5.4.2 Device hardware support
+5.5.2 Device hardware support
The hardware device function supports MSI by indicating the
MSI/MSI-X capability structure on its PCI capability list. By
default, this capability structure will not be initialized by
@@ -249,17 +426,19 @@
MSI-capable hardware is responsible for whether calling
pci_enable_msi or not. A return of zero indicates the kernel
successfully initializes the MSI/MSI-X capability structure of the
-device funtion. The device function is now running on MSI mode.
+device function. The device function is now running in MSI/MSI-X mode.

-5.5 How to tell whether MSI is enabled on device function
+5.6 How to tell whether MSI/MSI-X is enabled on device function

-At the driver level, a return of zero from pci_enable_msi(...)
-indicates to the device driver that its device function is
-initialized successfully and ready to run in MSI mode.
+At the driver level, a return of zero from the function call of
+pci_enable_msi()/pci_enable_msix() indicates to a device driver that
+its device function is initialized successfully and ready to run in
+MSI/MSI-X mode.

At the user level, users can use command 'cat /proc/interrupts'
-to display the vector allocated for the device and its interrupt
-mode, as shown below.
+to display the vector allocated for a device and its interrupt
+MSI/MSI-X mode ("PCI-MSI"/"PCI-MSI-X"). The output below shows MSI
+mode enabled on a SCSI Adaptec 39320D Ultra320.

CPU0 CPU1
0: 324639 0 IO-APIC-edge timer
diff -urN linux-2.6.7/drivers/pci/msi.c patch-2.6.7-fix-msix/drivers/pci/msi.c
--- linux-2.6.7/drivers/pci/msi.c 2004-05-09 22:33:20.000000000 -0400
+++ patch-2.6.7-fix-msix/drivers/pci/msi.c 2004-06-25 14:43:10.000000000 -0400
@@ -67,12 +67,10 @@
unsigned int mask_bits;

pos = entry->mask_base;
- entry->dev->bus->ops->read(entry->dev->bus, entry->dev->devfn,
- pos, 4, &mask_bits);
+ pci_read_config_dword(entry->dev, pos, &mask_bits);
mask_bits &= ~(1);
mask_bits |= flag;
- entry->dev->bus->ops->write(entry->dev->bus, entry->dev->devfn,
- pos, 4, mask_bits);
+ pci_write_config_dword(entry->dev, pos, mask_bits);
break;
}
case PCI_CAP_ID_MSIX:
@@ -105,15 +103,13 @@
if (!(pos = pci_find_capability(entry->dev, PCI_CAP_ID_MSI)))
return;

- entry->dev->bus->ops->read(entry->dev->bus, entry->dev->devfn,
- msi_lower_address_reg(pos), 4,
+ pci_read_config_dword(entry->dev, msi_lower_address_reg(pos),
&address.lo_address.value);
address.lo_address.value &= MSI_ADDRESS_DEST_ID_MASK;
address.lo_address.value |= (cpu_mask_to_apicid(cpu_mask) <<
MSI_TARGET_CPU_SHIFT);
entry->msi_attrib.current_cpu = cpu_mask_to_apicid(cpu_mask);
- entry->dev->bus->ops->write(entry->dev->bus, entry->dev->devfn,
- msi_lower_address_reg(pos), 4,
+ pci_write_config_dword(entry->dev, msi_lower_address_reg(pos),
address.lo_address.value);
break;
}
@@ -158,13 +154,25 @@

static unsigned int startup_msi_irq_wo_maskbit(unsigned int vector)
{
+ struct msi_desc *entry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ entry = msi_desc[vector];
+ if (!entry || !entry->dev) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return 0;
+ }
+ entry->msi_attrib.state = 1; /* Mark it active */
+ spin_unlock_irqrestore(&msi_lock, flags);
+
return 0; /* never anything pending */
}

-static void pci_disable_msi(unsigned int vector);
+static void release_msi(unsigned int vector);
static void shutdown_msi_irq(unsigned int vector)
{
- pci_disable_msi(vector);
+ release_msi(vector);
}

#define shutdown_msi_irq_wo_maskbit shutdown_msi_irq
@@ -179,6 +187,18 @@

static unsigned int startup_msi_irq_w_maskbit(unsigned int vector)
{
+ struct msi_desc *entry;
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ entry = msi_desc[vector];
+ if (!entry || !entry->dev) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return 0;
+ }
+ entry->msi_attrib.state = 1; /* Mark it active */
+ spin_unlock_irqrestore(&msi_lock, flags);
+
unmask_MSI_irq(vector);
return 0; /* never anything pending */
}
@@ -200,7 +220,7 @@
* which implement the MSI-X Capability Structure.
*/
static struct hw_interrupt_type msix_irq_type = {
- .typename = "PCI MSI-X",
+ .typename = "PCI-MSI-X",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -216,7 +236,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_w_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_w_maskbit,
.shutdown = shutdown_msi_irq_w_maskbit,
.enable = enable_msi_irq_w_maskbit,
@@ -232,7 +252,7 @@
* Mask-and-Pending Bits.
*/
static struct hw_interrupt_type msi_irq_wo_maskbit_type = {
- .typename = "PCI MSI",
+ .typename = "PCI-MSI",
.startup = startup_msi_irq_wo_maskbit,
.shutdown = shutdown_msi_irq_wo_maskbit,
.enable = enable_msi_irq_wo_maskbit,
@@ -265,6 +285,7 @@
msi_address->lo_address.value |= (MSI_TARGET_CPU << MSI_TARGET_CPU_SHIFT);
}

+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign);
static int assign_msi_vector(void)
{
static int new_vector_avail = 1;
@@ -278,6 +299,8 @@
spin_lock_irqsave(&msi_lock, flags);

if (!new_vector_avail) {
+ int free_vector = 0;
+
/*
* vector_irq[] = -1 indicates that this specific vector is:
* - assigned for MSI (since MSI have no associated IRQ) or
@@ -294,13 +317,34 @@
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (vector_irq[vector] != 0)
continue;
- vector_irq[vector] = -1;
- nr_released_vectors--;
- spin_unlock_irqrestore(&msi_lock, flags);
- return vector;
+ free_vector = vector;
+ if (!msi_desc[vector])
+ break;
+ else
+ continue;
}
+ if (!free_vector) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return -EBUSY;
+ }
+ vector_irq[free_vector] = -1;
+ nr_released_vectors--;
spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
+ if (msi_desc[free_vector] != NULL) {
+ struct pci_dev *dev;
+ int tail;
+
+ /* free all linked vectors before re-assign */
+ do {
+ spin_lock_irqsave(&msi_lock, flags);
+ dev = msi_desc[free_vector]->dev;
+ tail = msi_desc[free_vector]->link.tail;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ msi_free_vector(dev, tail, 1);
+ } while (free_vector != tail);
+ }
+
+ return free_vector;
}
vector = assign_irq_vector(AUTO_ASSIGN);
last_alloc_vector = vector;
@@ -333,6 +377,15 @@
printk(KERN_INFO "WARNING: MSI INIT FAILURE\n");
return status;
}
+ last_alloc_vector = assign_irq_vector(AUTO_ASSIGN);
+ if (last_alloc_vector < 0) {
+ pci_msi_enable = 0;
+ printk(KERN_INFO "WARNING: ALL VECTORS ARE BUSY\n");
+ status = -EBUSY;
+ return status;
+ }
+ vector_irq[last_alloc_vector] = 0;
+ nr_released_vectors++;
printk(KERN_INFO "MSI INIT SUCCESS\n");

return status;
@@ -383,55 +436,49 @@

static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
{
- u32 control;
+ u16 control;

- dev->bus->ops->read(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, &control);
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
if (type == PCI_CAP_ID_MSI) {
/* Set enabled bits to single MSI & enable MSI_enable bit */
msi_enable(control, 1);
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, control);
+ pci_write_config_word(dev, msi_control_reg(pos), control);
} else {
msix_enable(control);
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, control);
+ pci_write_config_word(dev, msi_control_reg(pos), control);
}
if (pci_find_capability(dev, PCI_CAP_ID_EXP)) {
/* PCI Express Endpoint device detected */
- u32 cmd;
- dev->bus->ops->read(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd);
+ u16 cmd;
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
cmd |= PCI_COMMAND_INTX_DISABLE;
- dev->bus->ops->write(dev->bus, dev->devfn, PCI_COMMAND, 2, cmd);
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
}
}

static void disable_msi_mode(struct pci_dev *dev, int pos, int type)
{
- u32 control;
+ u16 control;

- dev->bus->ops->read(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, &control);
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
if (type == PCI_CAP_ID_MSI) {
/* Set enabled bits to single MSI & enable MSI_enable bit */
msi_disable(control);
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, control);
+ pci_write_config_word(dev, msi_control_reg(pos), control);
} else {
msix_disable(control);
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_control_reg(pos), 2, control);
+ pci_write_config_word(dev, msi_control_reg(pos), control);
}
if (pci_find_capability(dev, PCI_CAP_ID_EXP)) {
/* PCI Express Endpoint device detected */
- u32 cmd;
- dev->bus->ops->read(dev->bus, dev->devfn, PCI_COMMAND, 2, &cmd);
+ u16 cmd;
+ pci_read_config_word(dev, PCI_COMMAND, &cmd);
cmd &= ~PCI_COMMAND_INTX_DISABLE;
- dev->bus->ops->write(dev->bus, dev->devfn, PCI_COMMAND, 2, cmd);
+ pci_write_config_word(dev, PCI_COMMAND, cmd);
}
}

-static int msi_lookup_vector(struct pci_dev *dev)
+static int msi_lookup_vector(struct pci_dev *dev, int type)
{
int vector;
unsigned long flags;
@@ -439,11 +486,11 @@
spin_lock_irqsave(&msi_lock, flags);
for (vector = FIRST_DEVICE_VECTOR; vector < NR_IRQS; vector++) {
if (!msi_desc[vector] || msi_desc[vector]->dev != dev ||
- msi_desc[vector]->msi_attrib.entry_nr ||
+ msi_desc[vector]->msi_attrib.type != type ||
msi_desc[vector]->msi_attrib.default_vector != dev->irq)
- continue; /* not entry 0, skip */
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- /* This pre-assigned entry-0 MSI vector for this device
+ /* This pre-assigned MSI vector for this device
already exits. Override dev->irq with this vector */
dev->irq = vector;
return 0;
@@ -458,10 +505,9 @@
if (!dev)
return;

- if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0) {
- nr_reserved_vectors++;
+ if (pci_find_capability(dev, PCI_CAP_ID_MSIX) > 0)
nr_msix_devices++;
- } else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
+ else if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0)
nr_reserved_vectors++;
}

@@ -480,22 +526,10 @@
struct msg_address address;
struct msg_data data;
int pos, vector;
- u32 control;
+ u16 control;

pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
- if (!pos)
- return -EINVAL;
-
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
- 2, &control);
- if (control & PCI_MSI_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
- return 0;
- }
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
/* MSI Entry Initialization */
if (!(entry = alloc_msi_entry()))
return -ENOMEM;
@@ -504,11 +538,14 @@
kmem_cache_free(msi_cachep, entry);
return -EBUSY;
}
+ entry->link.head = vector;
+ entry->link.tail = vector;
entry->msi_attrib.type = PCI_CAP_ID_MSI;
+ entry->msi_attrib.state = 0; /* Mark it not active */
entry->msi_attrib.entry_nr = 0;
entry->msi_attrib.maskbit = is_mask_bit_support(control);
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
+ entry->msi_attrib.default_vector = dev->irq; /* Save IOAPIC IRQ */
+ dev->irq = vector;
entry->dev = dev;
if (is_mask_bit_support(control)) {
entry->mask_base = msi_mask_bits_reg(pos,
@@ -521,27 +558,27 @@
msi_data_init(&data, vector);
entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- dev->bus->ops->write(dev->bus, dev->devfn, msi_lower_address_reg(pos),
- 4, address.lo_address.value);
+ pci_write_config_dword(dev, msi_lower_address_reg(pos),
+ address.lo_address.value);
if (is_64bit_address(control)) {
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_upper_address_reg(pos), 4, address.hi_address);
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_data_reg(pos, 1), 2, *((u32*)&data));
+ pci_write_config_dword(dev,
+ msi_upper_address_reg(pos), address.hi_address);
+ pci_write_config_word(dev,
+ msi_data_reg(pos, 1), *((u32*)&data));
} else
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_data_reg(pos, 0), 2, *((u32*)&data));
+ pci_write_config_word(dev,
+ msi_data_reg(pos, 0), *((u32*)&data));
if (entry->msi_attrib.maskbit) {
unsigned int maskbits, temp;
/* All MSIs are unmasked by default, Mask them all */
- dev->bus->ops->read(dev->bus, dev->devfn,
- msi_mask_bits_reg(pos, is_64bit_address(control)), 4,
+ pci_read_config_dword(dev,
+ msi_mask_bits_reg(pos, is_64bit_address(control)),
&maskbits);
temp = (1 << multi_msi_capable(control));
temp = ((temp - 1) & ~temp);
maskbits |= temp;
- dev->bus->ops->write(dev->bus, dev->devfn,
- msi_mask_bits_reg(pos, is_64bit_address(control)), 4,
+ pci_write_config_dword(dev,
+ msi_mask_bits_reg(pos, is_64bit_address(control)),
maskbits);
}
attach_msi_entry(entry, vector);
@@ -556,135 +593,208 @@
* @dev: pointer to the pci_dev data structure of MSI-X device function
*
* Setup the MSI-X capability structure of device funtion with a
- * single MSI-X vector. A return of zero indicates the successful setup
- * of an entry zero with the new MSI-X vector or non-zero for otherwise.
- * To request for additional MSI-X vectors, the device drivers are
- * required to utilize the following supported APIs:
- * 1) msi_alloc_vectors(...) for requesting one or more MSI-X vectors
- * 2) msi_free_vectors(...) for releasing one or more MSI-X vectors
- * back to PCI subsystem before calling free_irq(...)
+ * requested number of MSI-X vectors. A return of zero indicates the
+ * successful setup of the requested MSI-X entries, non-zero otherwise.
**/
-static int msix_capability_init(struct pci_dev *dev)
+static int msix_capability_init(struct pci_dev *dev,
+ struct msix_entry *entries, int nvec)
{
- struct msi_desc *entry;
+ struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
struct msg_address address;
struct msg_data data;
- int vector = 0, pos, dev_msi_cap;
+ int vector, pos, i, j, nr_entries, temp = 0;
u32 phys_addr, table_offset;
- u32 control;
+ u16 control;
u8 bir;
void *base;
-
+
pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- if (!pos)
- return -EINVAL;
-
/* Request & Map MSI-X table region */
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2,
- &control);
- if (control & PCI_MSIX_FLAGS_ENABLE)
- return 0;
-
- if (!msi_lookup_vector(dev)) {
- /* Lookup Sucess */
- enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
- return 0;
- }
-
- dev_msi_cap = multi_msix_capable(control);
- dev->bus->ops->read(dev->bus, dev->devfn,
- msix_table_offset_reg(pos), 4, &table_offset);
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
+ nr_entries = multi_msix_capable(control);
+ pci_read_config_dword(dev, msix_table_offset_reg(pos),
+ &table_offset);
bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
phys_addr = pci_resource_start (dev, bir);
phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK);
if (!request_mem_region(phys_addr,
- dev_msi_cap * PCI_MSIX_ENTRY_SIZE,
- "MSI-X iomap Failure"))
+ nr_entries * PCI_MSIX_ENTRY_SIZE,
+ "MSI-X vector table"))
return -ENOMEM;
- base = ioremap_nocache(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
- if (base == NULL)
- goto free_region;
- /* MSI Entry Initialization */
- entry = alloc_msi_entry();
- if (!entry)
- goto free_iomap;
- if ((vector = get_msi_vector(dev)) < 0)
- goto free_entry;
-
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = 0;
- entry->msi_attrib.maskbit = 1;
- entry->msi_attrib.default_vector = dev->irq;
- dev->irq = vector; /* save default pre-assigned ioapic vector */
- entry->dev = dev;
- entry->mask_base = (unsigned long)base;
- /* Replace with MSI handler */
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
- MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- writel(address.lo_address.value, base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
- /* Initialize all entries from 1 up to 0 */
- for (pos = 1; pos < dev_msi_cap; pos++) {
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ base = ioremap_nocache(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ if (base == NULL) {
+ release_mem_region(phys_addr, nr_entries * PCI_MSIX_ENTRY_SIZE);
+ return -ENOMEM;
+ }
+ /* MSI-X Table Initialization */
+ for (i = 0; i < nvec; i++) {
+ entry = alloc_msi_entry();
+ if (!entry)
+ break;
+ if ((vector = get_msi_vector(dev)) < 0)
+ break;
+
+ j = entries[i].entry;
+ entries[i].vector = vector;
+ entry->msi_attrib.type = PCI_CAP_ID_MSIX;
+ entry->msi_attrib.state = 0; /* Mark it not active */
+ entry->msi_attrib.entry_nr = j;
+ entry->msi_attrib.maskbit = 1;
+ entry->msi_attrib.default_vector = dev->irq;
+ entry->dev = dev;
+ entry->mask_base = (unsigned long)base;
+ if (!head) {
+ entry->link.head = vector;
+ entry->link.tail = vector;
+ head = entry;
+ } else {
+ entry->link.head = temp;
+ entry->link.tail = tail->link.tail;
+ tail->link.tail = vector;
+ head->link.head = vector;
+ }
+ temp = vector;
+ tail = entry;
+ /* Replace with MSI-X handler */
+ irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
+ /* Configure MSI-X capability structure */
+ msi_address_init(&address);
+ msi_data_init(&data, vector);
+ entry->msi_attrib.current_cpu =
+ ((address.lo_address.u.dest_id >>
+ MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
+ writel(address.lo_address.value,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(address.hi_address,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(0, base + pos * PCI_MSIX_ENTRY_SIZE +
+ writel(*(u32*)&data,
+ base + j * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_DATA_OFFSET);
+ attach_msi_entry(entry, vector);
}
- attach_msi_entry(entry, vector);
- /* Set MSI enabled bits */
+ if (i != nvec) {
+ i--;
+ for (; i >= 0; i--) {
+ vector = (entries + i)->vector;
+ msi_free_vector(dev, vector, 0);
+ (entries + i)->vector = 0;
+ }
+ return -EBUSY;
+ }
+ /* Set MSI-X enabled bits */
enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
-
+
return 0;
-
-free_entry:
- kmem_cache_free(msi_cachep, entry);
-free_iomap:
- iounmap(base);
-free_region:
- release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
-
- return ((vector < 0) ? -EBUSY : -ENOMEM);
}

/**
- * pci_enable_msi - configure device's MSI(X) capability structure
- * @dev: pointer to the pci_dev data structure of MSI(X) device function
+ * pci_enable_msi - configure device's MSI capability structure
+ * @dev: pointer to the pci_dev data structure of MSI device function
*
- * Setup the MSI/MSI-X capability structure of device function with
- * a single MSI(X) vector upon its software driver call to request for
- * MSI(X) mode enabled on its hardware device function. A return of zero
- * indicates the successful setup of an entry zero with the new MSI(X)
+ * Setup the MSI capability structure of device function with
+ * a single MSI vector upon its software driver call to request for
+ * MSI mode enabled on its hardware device function. A return of zero
+ * indicates the successful setup of an entry zero with the new MSI
* vector or non-zero for otherwise.
**/
int pci_enable_msi(struct pci_dev* dev)
{
- int status = -EINVAL;
+ int pos, temp = dev->irq, status = -EINVAL;
+ u16 control;

if (!pci_msi_enable || !dev)
return status;

- if (msi_init() < 0)
- return -ENOMEM;
+ if ((status = msi_init()) < 0)
+ return status;

- if ((status = msix_capability_init(dev)) == -EINVAL)
- status = msi_capability_init(dev);
- if (!status)
- nr_reserved_vectors--;
+ if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSI)))
+ return -EINVAL;
+
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
+ if (control & PCI_MSI_FLAGS_ENABLE)
+ return 0; /* Already in MSI mode */
+
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ /* Lookup Success */
+ unsigned long flags;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ if (!vector_irq[dev->irq]) {
+ msi_desc[dev->irq]->msi_attrib.state = 0;
+ vector_irq[dev->irq] = -1;
+ nr_released_vectors--;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
+ return 0;
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ dev->irq = temp;
+ }
+ /* Check whether driver already requested for MSI-X vectors */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ printk(KERN_INFO "Can't enable MSI. Device already had MSI-X vectors assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+ status = msi_capability_init(dev);
+ if (!status) {
+ if (!pos)
+ nr_reserved_vectors--; /* Only MSI capable */
+ else if (nr_msix_devices > 0)
+ nr_msix_devices--; /* Both MSI and MSI-X capable,
+ but choose enabling MSI */
+ }

return status;
}

-static int msi_free_vector(struct pci_dev* dev, int vector);
-static void pci_disable_msi(unsigned int vector)
+void pci_disable_msi(struct pci_dev* dev)
{
- int head, tail, type, default_vector;
+ struct msi_desc *entry;
+ int pos, default_vector;
+ u16 control;
+ unsigned long flags;
+
+ if (!dev || !(pos = pci_find_capability(dev, PCI_CAP_ID_MSI)))
+ return;
+
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
+ if (!(control & PCI_MSI_FLAGS_ENABLE))
+ return;
+
+ spin_lock_irqsave(&msi_lock, flags);
+ entry = msi_desc[dev->irq];
+ if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ return;
+ }
+ if (entry->msi_attrib.state) {
+ spin_unlock_irqrestore(&msi_lock, flags);
+ printk(KERN_DEBUG "Driver[%d:%d:%d] unloaded wo doing free_irq on vector->%d\n",
+ dev->bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+ dev->irq);
+ BUG_ON(entry->msi_attrib.state > 0);
+ } else {
+ if (vector_irq[dev->irq] != 0) {
+ vector_irq[dev->irq] = 0; /* free it */
+ nr_released_vectors++;
+ }
+ default_vector = entry->msi_attrib.default_vector;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ /* Restore dev->irq to its default pin-assertion vector */
+ dev->irq = default_vector;
+ disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
+ PCI_CAP_ID_MSI);
+ }
+}
+
+static void release_msi(unsigned int vector)
+{
+ int type, default_vector;
struct msi_desc *entry;
struct pci_dev *dev;
unsigned long flags;
@@ -697,168 +807,233 @@
}
dev = entry->dev;
type = entry->msi_attrib.type;
- head = entry->link.head;
- tail = entry->link.tail;
+ entry->msi_attrib.state = 0; /* Mark it not active */
default_vector = entry->msi_attrib.default_vector;
spin_unlock_irqrestore(&msi_lock, flags);
-
- disable_msi_mode(dev, pci_find_capability(dev, type), type);
- /* Restore dev->irq to its default pin-assertion vector */
- dev->irq = default_vector;
- if (type == PCI_CAP_ID_MSIX && head != tail) {
- /* Bad driver, which do not call msi_free_vectors before exit.
- We must do a cleanup here */
- while (1) {
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- head = entry->link.head;
- tail = entry->link.tail;
+ switch (type) {
+ case PCI_CAP_ID_MSI:
+ spin_lock_irqsave(&msi_lock, flags);
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ case PCI_CAP_ID_MSIX:
+ spin_lock_irqsave(&msi_lock, flags);
+ while (vector != entry->link.tail) {
+ entry = msi_desc[entry->link.tail];
+ if (!entry->msi_attrib.state)
+ continue;
spin_unlock_irqrestore(&msi_lock, flags);
- if (tail == head)
- break;
- if (msi_free_vector(dev, entry->link.tail))
- break;
+ /*
+ * Device still operates in MSI-X mode. Do not
+ * switch interrupt mode
+ */
+ return;
}
+ entry = msi_desc[vector];
+ vector_irq[vector] = 0; /* Mark it free */
+ nr_released_vectors++;
+ while (vector != entry->link.tail) {
+ vector_irq[entry->link.tail] = 0; /* Mark it free */
+ nr_released_vectors++;
+ entry = msi_desc[entry->link.tail];
+ }
+ spin_unlock_irqrestore(&msi_lock, flags);
+ break;
+ default:
+ return;
}
+ /* Restore dev->irq to its default pin-assertion vector */
+ dev->irq = default_vector;
+ disable_msi_mode(dev, pci_find_capability(dev, type), type);
}

-static int msi_alloc_vector(struct pci_dev* dev, int head)
+static int msi_free_vector(struct pci_dev* dev, int vector, int reassign)
{
struct msi_desc *entry;
- struct msg_address address;
- struct msg_data data;
- int i, offset, pos, dev_msi_cap, vector;
- u32 low_address, control;
+ int head, entry_nr, type;
unsigned long base = 0L;
unsigned long flags;

spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry) {
+ entry = msi_desc[vector];
+ if (!entry || entry->dev != dev) {
spin_unlock_irqrestore(&msi_lock, flags);
return -EINVAL;
}
+ type = entry->msi_attrib.type;
+ entry_nr = entry->msi_attrib.entry_nr;
+ head = entry->link.head;
base = entry->mask_base;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos),
- 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- for (i = 1; i < dev_msi_cap; i++) {
- if (!(low_address = readl(base + i * PCI_MSIX_ENTRY_SIZE)))
- break;
+ msi_desc[entry->link.head]->link.tail = entry->link.tail;
+ msi_desc[entry->link.tail]->link.head = entry->link.head;
+ entry->dev = NULL;
+ if (!reassign) {
+ vector_irq[vector] = 0;
+ nr_released_vectors++;
}
- if (i >= dev_msi_cap)
- return -EINVAL;
+ msi_desc[vector] = NULL;
+ spin_unlock_irqrestore(&msi_lock, flags);

- /* MSI Entry Initialization */
- if (!(entry = alloc_msi_entry()))
- return -ENOMEM;
+ kmem_cache_free(msi_cachep, entry);

- if ((vector = get_new_vector()) < 0) {
- kmem_cache_free(msi_cachep, entry);
- return vector;
+ if (type == PCI_CAP_ID_MSIX) {
+ if (!reassign)
+ writel(1, base +
+ entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
+
+ if (head == vector) {
+ /*
+ * Detect last MSI-X vector to be released.
+ * Release the MSI-X memory-mapped table.
+ */
+ int pos, nr_entries;
+ u32 phys_addr, table_offset;
+ u16 control;
+ u8 bir;
+
+ pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
+ pci_read_config_word(dev, msi_control_reg(pos),
+ &control);
+ nr_entries = multi_msix_capable(control);
+ pci_read_config_dword(dev, msix_table_offset_reg(pos),
+ &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr,
+ nr_entries * PCI_MSIX_ENTRY_SIZE);
+ }
}
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
- entry->msi_attrib.entry_nr = i;
- entry->msi_attrib.maskbit = 1;
- entry->dev = dev;
- entry->link.head = head;
- entry->mask_base = base;
- irq_handler_init(PCI_CAP_ID_MSIX, vector, 1);
- /* Configure MSI-X capability structure */
- msi_address_init(&address);
- msi_data_init(&data, vector);
- entry->msi_attrib.current_cpu = ((address.lo_address.u.dest_id >>
- MSI_TARGET_CPU_SHIFT) & MSI_TARGET_CPU_MASK);
- offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(address.lo_address.value, base + offset +
- PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(address.hi_address, base + offset +
- PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(*(u32*)&data, base + offset + PCI_MSIX_ENTRY_DATA_OFFSET);
- writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- attach_msi_entry(entry, vector);

- return vector;
+ return 0;
}

-static int msi_free_vector(struct pci_dev* dev, int vector)
+static int reroute_msix_table(int head, struct msix_entry *entries, int *nvec)
{
- struct msi_desc *entry;
- int entry_nr, type;
+ int vector = head, tail = 0;
+ int i = 0, j = 0, nr_entries = 0;
unsigned long base = 0L;
unsigned long flags;
-
+
spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[vector];
- if (!entry || entry->dev != dev) {
+ while (head != tail) {
+ nr_entries++;
+ tail = msi_desc[vector]->link.tail;
+ if (entries[0].entry == msi_desc[vector]->msi_attrib.entry_nr)
+ j = vector;
+ vector = tail;
+ }
+ if (*nvec > nr_entries) {
spin_unlock_irqrestore(&msi_lock, flags);
+ *nvec = nr_entries;
return -EINVAL;
}
- type = entry->msi_attrib.type;
- entry_nr = entry->msi_attrib.entry_nr;
- base = entry->mask_base;
- if (entry->link.tail != entry->link.head) {
- msi_desc[entry->link.head]->link.tail = entry->link.tail;
- if (entry->link.tail)
- msi_desc[entry->link.tail]->link.head = entry->link.head;
+ vector = ((j > 0) ? j : head);
+ for (i = 0; i < *nvec; i++) {
+ j = msi_desc[vector]->msi_attrib.entry_nr;
+ msi_desc[vector]->msi_attrib.state = 0; /* Mark it not active */
+ vector_irq[vector] = -1; /* Mark it busy */
+ nr_released_vectors--;
+ entries[i].vector = vector;
+ if (j != (entries + i)->entry) {
+ base = msi_desc[vector]->mask_base;
+ msi_desc[vector]->msi_attrib.entry_nr =
+ (entries + i)->entry;
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
+ writel( readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET), base +
+ (entries + i)->entry * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+ writel( (readl(base + j * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET) & 0xff00) | vector,
+ base + (entries+i)->entry*PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_DATA_OFFSET);
+ }
+ vector = msi_desc[vector]->link.tail;
}
- entry->dev = NULL;
- vector_irq[vector] = 0;
- nr_released_vectors++;
- msi_desc[vector] = NULL;
spin_unlock_irqrestore(&msi_lock, flags);
-
- kmem_cache_free(msi_cachep, entry);
- if (type == PCI_CAP_ID_MSIX) {
- int offset;
-
- offset = entry_nr * PCI_MSIX_ENTRY_SIZE;
- writel(1, base + offset + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- writel(0, base + offset + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- }
-
+
return 0;
}

/**
- * msi_alloc_vectors - allocate additional MSI-X vectors
+ * pci_enable_msix - configure device's MSI-X capability structure
* @dev: pointer to the pci_dev data structure of MSI-X device function
- * @vector: pointer to an array of new allocated MSI-X vectors
+ * @data: pointer to an array of MSI-X entries
* @nvec: number of MSI-X vectors requested for allocation by device driver
*
- * Allocate additional MSI-X vectors requested by device driver. A
- * return of zero indicates the successful setup of MSI-X capability
- * structure with new allocated MSI-X vectors or non-zero for otherwise.
+ * Setup the MSI-X capability structure of device function with the number
+ * of requested vectors upon its software driver call to request for
+ * MSI-X mode enabled on its hardware device function. A return of zero
+ * indicates the successful configuration of MSI-X capability structure
+ * with new allocated MSI-X vectors. A return of < 0 indicates a failure.
+ * Or a return of > 0 indicates that driver request is exceeding the number
+ * of vectors available. Driver should use the returned value to re-send
+ * its request.
**/
-int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec)
+int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
{
- struct msi_desc *entry;
- int i, head, pos, vec, free_vectors, alloc_vectors;
- int *vectors = (int *)vector;
- u32 control;
+ int status, pos, nr_entries, free_vectors;
+ int i, j, temp;
+ u16 control;
unsigned long flags;

- if (!pci_msi_enable || !dev)
+ if (!pci_msi_enable || !dev || !entries)
return -EINVAL;
-
+
+ if ((status = msi_init()) < 0)
+ return status;
+
if (!(pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)))
return -EINVAL;
-
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control);
- if (nvec > multi_msix_capable(control))
- return -EINVAL;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev || /* legal call */
- entry->msi_attrib.type != PCI_CAP_ID_MSIX || /* must be MSI-X */
- entry->link.head != entry->link.tail) { /* already multi */
- spin_unlock_irqrestore(&msi_lock, flags);
+
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
+ if (control & PCI_MSIX_FLAGS_ENABLE)
+ return -EINVAL; /* Already in MSI-X mode */
+
+ nr_entries = multi_msix_capable(control);
+ if (nvec > nr_entries)
return -EINVAL;
+
+ /* Check for any invalid entries */
+ for (i = 0; i < nvec; i++) {
+ if (entries[i].entry >= nr_entries)
+ return -EINVAL; /* invalid entry */
+ for (j = i + 1; j < nvec; j++) {
+ if (entries[i].entry == entries[j].entry)
+ return -EINVAL; /* duplicate entry */
+ }
}
+ temp = dev->irq;
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ /* Lookup Success */
+ nr_entries = nvec;
+ /* Reroute MSI-X table */
+ if (reroute_msix_table(dev->irq, entries, &nr_entries)) {
+ /* #requested > #previous-assigned */
+ dev->irq = temp;
+ return nr_entries;
+ }
+ dev->irq = temp;
+ enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
+ return 0;
+ }
+ /* Check whether driver already requested for MSI vector */
+ if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ printk(KERN_INFO "Can't enable MSI-X. Device already had MSI vector assigned\n");
+ dev->irq = temp;
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&msi_lock, flags);
/*
* msi_lock is provided to ensure that enough vectors resources are
* available before granting.
@@ -874,71 +1049,66 @@
free_vectors /= nr_msix_devices;
spin_unlock_irqrestore(&msi_lock, flags);

- if (nvec > free_vectors)
- return -EBUSY;
+ if (nvec > free_vectors) {
+ if (free_vectors > 0)
+ return free_vectors;
+ else
+ return -EBUSY;
+ }

- alloc_vectors = 0;
- head = dev->irq;
- for (i = 0; i < nvec; i++) {
- if ((vec = msi_alloc_vector(dev, head)) < 0)
- break;
- *(vectors + i) = vec;
- head = vec;
- alloc_vectors++;
- }
- if (alloc_vectors != nvec) {
- for (i = 0; i < alloc_vectors; i++) {
- vec = *(vectors + i);
- msi_free_vector(dev, vec);
- }
- spin_lock_irqsave(&msi_lock, flags);
- msi_desc[dev->irq]->link.tail = msi_desc[dev->irq]->link.head;
- spin_unlock_irqrestore(&msi_lock, flags);
- return -EBUSY;
- }
- if (nr_msix_devices > 0)
+ status = msix_capability_init(dev, entries, nvec);
+ if (!status && nr_msix_devices > 0)
nr_msix_devices--;
-
- return 0;
+
+ return status;
}

-/**
- * msi_free_vectors - reclaim MSI-X vectors to unused state
- * @dev: pointer to the pci_dev data structure of MSI-X device function
- * @vector: pointer to an array of released MSI-X vectors
- * @nvec: number of MSI-X vectors requested for release by device driver
- *
- * Reclaim MSI-X vectors released by device driver to unused state,
- * which may be used later on. A return of zero indicates the
- * success or non-zero for otherwise. Device driver should call this
- * before calling function free_irq.
- **/
-int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec)
+void pci_disable_msix(struct pci_dev* dev)
{
- struct msi_desc *entry;
- int i;
- unsigned long flags;
+ int pos, temp;
+ u16 control;
+
+ if (!dev || !(pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)))
+ return;

- if (!pci_msi_enable)
- return -EINVAL;
+ pci_read_config_word(dev, msi_control_reg(pos), &control);
+ if (!(control & PCI_MSIX_FLAGS_ENABLE))
+ return;
+ temp = dev->irq;
+ if (!msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ int state, vector, head, tail = 0, warning = 0;
+ unsigned long flags;

- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev ||
- entry->msi_attrib.type != PCI_CAP_ID_MSIX ||
- entry->link.head == entry->link.tail) { /* Nothing to free */
+ vector = head = dev->irq;
+ spin_lock_irqsave(&msi_lock, flags);
+ while (head != tail) {
+ state = msi_desc[vector]->msi_attrib.state;
+ if (state)
+ warning = 1;
+ else {
+ if (vector_irq[vector] != 0) {
+ vector_irq[vector] = 0; /* free it */
+ nr_released_vectors++;
+ }
+ }
+ tail = msi_desc[vector]->link.tail;
+ vector = tail;
+ }
spin_unlock_irqrestore(&msi_lock, flags);
- return -EINVAL;
- }
- spin_unlock_irqrestore(&msi_lock, flags);
+ if (warning) {
+ dev->irq = temp;
+ printk(KERN_DEBUG "Driver[%d:%d:%d] unloaded wo doing free_irq on all vectors\n",
+ dev->bus->number, PCI_SLOT(dev->devfn),
+ PCI_FUNC(dev->devfn));
+ BUG_ON(warning > 0);
+ } else {
+ dev->irq = temp;
+ disable_msi_mode(dev,
+ pci_find_capability(dev, PCI_CAP_ID_MSIX),
+ PCI_CAP_ID_MSIX);

- for (i = 0; i < nvec; i++) {
- if (*(vector + i) == dev->irq)
- continue;/* Don't free entry 0 if mistaken by driver */
- msi_free_vector(dev, *(vector + i));
+ }
}
-
- return 0;
}

/**
@@ -952,62 +1122,73 @@
**/
void msi_remove_pci_irq_vectors(struct pci_dev* dev)
{
- struct msi_desc *entry;
- int type, temp;
+ int state, pos, temp;
unsigned long flags;
-
+
if (!pci_msi_enable || !dev)
return;
-
- if (!pci_find_capability(dev, PCI_CAP_ID_MSI)) {
- if (!pci_find_capability(dev, PCI_CAP_ID_MSIX))
- return;
- }
- temp = dev->irq;
- if (msi_lookup_vector(dev))
- return;
-
- spin_lock_irqsave(&msi_lock, flags);
- entry = msi_desc[dev->irq];
- if (!entry || entry->dev != dev) {
+
+ temp = dev->irq; /* Save IOAPIC IRQ */
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSI)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSI)) {
+ spin_lock_irqsave(&msi_lock, flags);
+ state = msi_desc[dev->irq]->msi_attrib.state;
spin_unlock_irqrestore(&msi_lock, flags);
- return;
- }
- type = entry->msi_attrib.type;
- spin_unlock_irqrestore(&msi_lock, flags);
-
- msi_free_vector(dev, dev->irq);
- if (type == PCI_CAP_ID_MSIX) {
- int i, pos, dev_msi_cap;
- u32 phys_addr, table_offset;
- u32 control;
- u8 bir;
-
- pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
- dev->bus->ops->read(dev->bus, dev->devfn, msi_control_reg(pos), 2, &control);
- dev_msi_cap = multi_msix_capable(control);
- dev->bus->ops->read(dev->bus, dev->devfn,
- msix_table_offset_reg(pos), 4, &table_offset);
- bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
- phys_addr = pci_resource_start (dev, bir);
- phys_addr += (u32)(table_offset & ~PCI_MSIX_FLAGS_BIRMASK);
- for (i = FIRST_DEVICE_VECTOR; i < NR_IRQS; i++) {
+ if (state) {
+ printk(KERN_DEBUG "Driver[%d:%d:%d] unloaded wo doing free_irq on vector->%d\n",
+ dev->bus->number, PCI_SLOT(dev->devfn),
+ PCI_FUNC(dev->devfn), dev->irq);
+ BUG_ON(state > 0);
+ } else /* Release MSI vector assigned to this device */
+ msi_free_vector(dev, dev->irq, 0);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
+ }
+ if ((pos = pci_find_capability(dev, PCI_CAP_ID_MSIX)) > 0 &&
+ !msi_lookup_vector(dev, PCI_CAP_ID_MSIX)) {
+ int vector, head, tail = 0, warning = 0;
+ unsigned long base = 0L;
+
+ vector = head = dev->irq;
+ while (head != tail) {
spin_lock_irqsave(&msi_lock, flags);
- if (!msi_desc[i] || msi_desc[i]->dev != dev) {
- spin_unlock_irqrestore(&msi_lock, flags);
- continue;
- }
+ state = msi_desc[vector]->msi_attrib.state;
+ tail = msi_desc[vector]->link.tail;
+ base = msi_desc[vector]->mask_base;
spin_unlock_irqrestore(&msi_lock, flags);
- msi_free_vector(dev, i);
+ if (state)
+ warning = 1;
+ else if (vector != head) /* Release MSI-X vector */
+ msi_free_vector(dev, vector, 0);
+ vector = tail;
+ }
+ msi_free_vector(dev, vector, 0);
+ if (warning) {
+ /* Force to release the MSI-X memory-mapped table */
+ u32 phys_addr, table_offset;
+ u16 control;
+ u8 bir;
+
+ pci_read_config_word(dev, msi_control_reg(pos),
+ &control);
+ pci_read_config_dword(dev, msix_table_offset_reg(pos),
+ &table_offset);
+ bir = (u8)(table_offset & PCI_MSIX_FLAGS_BIRMASK);
+ phys_addr = pci_resource_start (dev, bir);
+ phys_addr += (u32)(table_offset &
+ ~PCI_MSIX_FLAGS_BIRMASK);
+ iounmap((void*)base);
+ release_mem_region(phys_addr, PCI_MSIX_ENTRY_SIZE *
+ multi_msix_capable(control));
+ printk(KERN_DEBUG "Driver[%d:%d:%d] unloaded wo doing free_irq on all vectors\n",
+ dev->bus->number, PCI_SLOT(dev->devfn),
+ PCI_FUNC(dev->devfn));
+ BUG_ON(warning > 0);
}
- writel(1, entry->mask_base + PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
- iounmap((void*)entry->mask_base);
- release_mem_region(phys_addr, dev_msi_cap * PCI_MSIX_ENTRY_SIZE);
+ dev->irq = temp; /* Restore IOAPIC IRQ */
}
- dev->irq = temp;
- nr_reserved_vectors++;
}

EXPORT_SYMBOL(pci_enable_msi);
-EXPORT_SYMBOL(msi_alloc_vectors);
-EXPORT_SYMBOL(msi_free_vectors);
+EXPORT_SYMBOL(pci_disable_msi);
+EXPORT_SYMBOL(pci_enable_msix);
+EXPORT_SYMBOL(pci_disable_msix);
diff -urN linux-2.6.7/drivers/pci/msi.h patch-2.6.7-fix-msix/drivers/pci/msi.h
--- linux-2.6.7/drivers/pci/msi.h 2004-05-09 22:32:53.000000000 -0400
+++ patch-2.6.7-fix-msix/drivers/pci/msi.h 2004-06-25 14:44:08.000000000 -0400
@@ -140,7 +140,8 @@
struct {
__u8 type : 5; /* {0: unused, 5h:MSI, 11h:MSI-X} */
__u8 maskbit : 1; /* mask-pending bit supported ? */
- __u8 reserved: 2; /* reserved */
+ __u8 state : 1; /* {0: free, 1: busy} */
+ __u8 reserved: 1; /* reserved */
__u8 entry_nr; /* specific enabled entry */
__u8 default_vector; /* default pre-assigned vector */
__u8 current_cpu; /* current destination cpu */
diff -urN linux-2.6.7/include/linux/pci.h patch-2.6.7-fix-msix/include/linux/pci.h
--- linux-2.6.7/include/linux/pci.h 2004-06-22 10:11:40.000000000 -0400
+++ patch-2.6.7-fix-msix/include/linux/pci.h 2004-06-25 14:50:49.000000000 -0400
@@ -786,16 +786,27 @@
extern struct pci_dev *isa_bridge;
#endif

+struct msix_entry {
+ u16 vector; /* kernel uses to write allocated vector */
+ u16 entry; /* driver uses to specify entry, OS writes */
+};
+
#ifndef CONFIG_PCI_USE_VECTOR
static inline void pci_scan_msi_device(struct pci_dev *dev) {}
static inline int pci_enable_msi(struct pci_dev *dev) {return -1;}
+static inline void pci_disable_msi(struct pci_dev *dev) {}
+static inline int pci_enable_msix(struct pci_dev* dev,
+ struct msix_entry *entries, int nvec) {return -1;}
+static inline void pci_disable_msix(struct pci_dev *dev) {}
static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
#else
extern void pci_scan_msi_device(struct pci_dev *dev);
extern int pci_enable_msi(struct pci_dev *dev);
+extern void pci_disable_msi(struct pci_dev *dev);
+extern int pci_enable_msix(struct pci_dev* dev,
+ struct msix_entry *entries, int nvec);
+extern void pci_disable_msix(struct pci_dev *dev);
extern void msi_remove_pci_irq_vectors(struct pci_dev *dev);
-extern int msi_alloc_vectors(struct pci_dev* dev, int *vector, int nvec);
-extern int msi_free_vectors(struct pci_dev* dev, int *vector, int nvec);
#endif

#endif /* CONFIG_PCI */

2004-06-26 00:33:25

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

Thanks, I will give it a try.

- Roland

2004-06-26 01:38:43

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

I like this new MSI patch much better since it has pci_disable_msi()
and pci_disable_msix() (as well as using pci_read_config_xxx instead
of bus ops), but I still feel the API is not quite right. I don't
think the pci_disable_msi() and pci_disable_msix() functions should
only be for error paths; I think that they should always be used to
undo the effect of pci_enable_msi() or pci_enable_msix() when a driver
is unloading, and that request_irq()/free_irq() should not have any effect
on a device's MSI state.

As a concrete example, the e1000 net driver does request_irq() in its
e1000_up() function and free_irq() in its e1000_down() function.
Basically, the driver will do request_irq() when the user does
"ifconfig up" and free_irq() when the user does "ifconfig down".

If I were adding MSI support to the e1000 driver, I would really just
like to do pci_enable_msi() in the e1000_probe() function (along with
any device-specific magic to tell the e1000 chip to use MSI instead of
INTx), unconditionally do pci_disable_msi() in e1000_remove(), and be
able to do any number of request_irq()/free_irq() operations
(including no such operations) in between.
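
Roughly, the lifecycle I have in mind looks like the userspace sketch below. Everything in it is a stand-in: the pci_dev struct, the mocked pci_enable_msi()/pci_disable_msi(), the *_mock IRQ helpers and the nic_* names are assumptions for illustration, not kernel or e1000 code. The only point is that the IRQ helpers never touch the MSI state.

```c
#include <errno.h>
#include <stdbool.h>

/* Mock device: only the state needed to show the lifecycle (an assumption,
 * not the real struct pci_dev). */
struct pci_dev {
	bool msi_enabled;    /* set by probe, cleared by remove */
	bool irq_requested;  /* set by up, cleared by down */
};

/* Userspace mocks of the calls discussed above; not the kernel versions. */
static int pci_enable_msi(struct pci_dev *dev)
{
	dev->msi_enabled = true;
	return 0;
}

static void pci_disable_msi(struct pci_dev *dev)
{
	dev->msi_enabled = false;
}

/* Stand-in for request_irq(): fails if a handler is already registered. */
static int request_irq_mock(struct pci_dev *dev)
{
	if (dev->irq_requested)
		return -EBUSY;
	dev->irq_requested = true;
	return 0;
}

/* Stand-in for free_irq(): deliberately leaves msi_enabled untouched. */
static void free_irq_mock(struct pci_dev *dev)
{
	dev->irq_requested = false;
}

/* Hypothetical driver entry points, loosely named after the e1000 ones. */
static int nic_probe(struct pci_dev *dev)   { return pci_enable_msi(dev); }
static void nic_remove(struct pci_dev *dev) { pci_disable_msi(dev); }
static int nic_up(struct pci_dev *dev)      { return request_irq_mock(dev); }
static void nic_down(struct pci_dev *dev)   { free_irq_mock(dev); }
```

With this shape, probe, then any number of up/down pairs, then remove leaves MSI enabled for the whole time the driver is bound, and each enable is matched by exactly one disable.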

MSI-X support for multiple vectors makes things much harder to deal
with. If free_irq() releases the associated MSI-X vector, it seems a
driver can't call free_irq() on a vector if it ever expects to use it
again. I could easily imagine a dual-port network card with a
separate MSI-X vector for each port; with your current patch the
driver for that card could not use the standard network driver model
of request_irq() in the ->open method and free_irq() in the ->stop
method.
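
To make the dual-port case concrete, here is a userspace sketch of the behaviour I am asking for (not what the current patch does). The msix_entry layout is copied from the patch; the mock_dev struct, the mock_* functions, the card_*/port_* names and the vector numbers are all made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPORTS 2

/* Layout taken from the patch's include/linux/pci.h hunk. */
struct msix_entry {
	uint16_t vector;  /* written by the kernel: allocated vector */
	uint16_t entry;   /* chosen by the driver: MSI-X table entry */
};

/* Mock device state; an assumption for illustration, not kernel code. */
struct mock_dev {
	bool msix_enabled;
	struct msix_entry entries[NPORTS];
	bool irq_in_use[NPORTS];
};

/* Mock of pci_enable_msix(): hands out made-up vector numbers. */
static int mock_enable_msix(struct mock_dev *dev,
			    struct msix_entry *entries, int nvec)
{
	for (int i = 0; i < nvec; i++)
		entries[i].vector = (uint16_t)(200 + 8 * i);
	dev->msix_enabled = true;
	return 0;
}

static void mock_disable_msix(struct mock_dev *dev)
{
	dev->msix_enabled = false;
}

/* Probe: one MSI-X table entry per port, enabled once for the device. */
static int card_probe(struct mock_dev *dev)
{
	for (int i = 0; i < NPORTS; i++)
		dev->entries[i].entry = (uint16_t)i;
	return mock_enable_msix(dev, dev->entries, NPORTS);
}

/* ->open: stands in for request_irq() on the port's vector. */
static int port_open(struct mock_dev *dev, int port)
{
	if (!dev->msix_enabled || dev->irq_in_use[port])
		return -1;
	dev->irq_in_use[port] = true;
	return dev->entries[port].vector;
}

/* ->stop: stands in for free_irq(); the vector itself stays allocated. */
static void port_stop(struct mock_dev *dev, int port)
{
	dev->irq_in_use[port] = false;
}

/* Remove: the only place the MSI-X vectors are given back. */
static void card_remove(struct mock_dev *dev)
{
	mock_disable_msix(dev);
}
```

Here a port can be opened and stopped repeatedly and always gets the same vector back, and stopping one port has no effect on the other.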

Finally, it just seems cleaner and easier to understand if
enabling/disabling MSI is explicit and separate from registering an
ISR. I would expect many people to be confused by code like the
below, which is what one would write with your current API:

int open() {
	pci_enable_msi();
	/* continue init... */
	if (err)
		goto out;

	request_irq();
	return 0;

out:
	pci_disable_msi();
	return err;
}

void close() {
	free_irq();
	/* why no pci_disable_msi() ?? */
}

It would be much clearer and easier to audit if every pci_enable_msi()
is balanced by a pci_disable_msi(), just as every request_irq() is
balanced by a free_irq() (and every pci_enable_device() is balanced by
a pci_disable_device(), etc).

To summarize:
1) free_irq() should not have the function of disabling MSI, since
drivers probably don't want to disable MSI or free MSI-X vectors
just because they call free_irq()
2) removing this overloaded function from free_irq() will also make
driver code clearer and easier to maintain.

Thanks,
Roland

PS To throw some good news in with all the nitpicking, even with your
previous patch I have multiple MSI-X vectors working with my driver.
My complaint is just that the MSI-X API makes my driver code a little
messier than I think it should be...

# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:      70314      50016      60023      54159    IO-APIC-edge  timer
  2:          0          0          0          0          XT-PIC  cascade
  4:        263          0          0         16    IO-APIC-edge  serial
  8:          1          0          0          0    IO-APIC-edge  rtc
 14:       2748       6672       1818          0    IO-APIC-edge  ide0
 15:         35          0          0          0    IO-APIC-edge  ide1
161:       5629          0          0          0   IO-APIC-level  eth0
201:        280     219342     162219      60937       PCI-MSI-X  ib_mthca (comp)
209:          0          0          2          0       PCI-MSI-X  ib_mthca (async)
217:         79        104       2711         90       PCI-MSI-X  ib_mthca (cmd)
NMI:     233505     233299     233297     233295
LOC:     233327     233311     233059     233309
ERR:          0
MIS:          0
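
For reference, a driver asking for three vectors like the ones above through the pci_enable_msix() interface in the new patch would presumably follow the return convention in its kernel-doc: zero on success, negative on failure, and a positive count when fewer vectors are available than requested. The userspace sketch below shows one way to honor that. The mock function, the mock_available counter, the helper name and the vector numbers are assumptions for illustration, not the real kernel code.

```c
#include <errno.h>
#include <stdint.h>

/* Layout taken from the patch's include/linux/pci.h hunk. */
struct msix_entry {
	uint16_t vector;  /* written by the kernel: allocated vector */
	uint16_t entry;   /* chosen by the driver: MSI-X table entry */
};

/* Mock only: number of vectors the pretend kernel is willing to grant. */
static int mock_available = 2;

/*
 * Mock of pci_enable_msix() following the patch's documented convention:
 * 0 on success, < 0 on failure, > 0 = number of vectors available when
 * the request was too large.
 */
static int mock_pci_enable_msix(struct msix_entry *entries, int nvec)
{
	if (!entries || nvec <= 0)
		return -EINVAL;
	if (mock_available <= 0)
		return -EBUSY;
	if (nvec > mock_available)
		return mock_available;
	for (int i = 0; i < nvec; i++)
		entries[i].vector = (uint16_t)(201 + 8 * i);  /* made-up numbers */
	return 0;
}

/*
 * Hypothetical driver helper: ask for `want` vectors and, if the call
 * reports a smaller available count, retry with that count.
 * Returns the number of vectors obtained, or a negative errno.
 */
static int enable_msix_with_fallback(struct msix_entry *entries, int want)
{
	int nvec = want;

	while (nvec > 0) {
		int rc = mock_pci_enable_msix(entries, nvec);

		if (rc == 0)
			return nvec;     /* got everything we asked for */
		if (rc < 0)
			return rc;       /* hard failure */
		if (rc >= nvec)
			return -EINVAL;  /* hint did not shrink; avoid looping */
		nvec = rc;               /* retry with the reported count */
	}
	return -EINVAL;
}
```

A driver that gets back fewer vectors than it wanted would then have to share one vector between event types, which is a policy decision this sketch leaves out.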

2004-06-26 08:27:26

by Christoph Hellwig

Subject: Re: [PATCH]2.6.7 MSI-X Update

On Fri, Jun 25, 2004 at 06:38:37PM -0700, Roland Dreier wrote:
> I like this new MSI patch much better since it has pci_disable_msi()
> and pci_disable_msix() (as well as using pci_read_config_xxx instead
> of bus ops), but I still feel the API is not quite right. I don't
> think the pci_disable_msi() and pci_disable_msix() functions should
> only be for error paths; I think that they should always be used to
> undo the effect of pci_enable_msi() or pci_enable_msix() when a driver
> is unloading, and that request_irq()/free_irq() should not have any effect
> on a device's MSI state.

Agreed. Non-symmetric APIs are very bad.

> As a concrete example, the e1000 net driver does request_irq() in its
> e1000_up() function and free_irq() in its e1000_down() function.
> Basically, the driver will do request_irq() when the user does
> "ifconfig up" and free_irq() when the user does "ifconfig down".

Lots of networking drivers do that..

2004-06-26 17:30:28

by Roland Dreier

Subject: Re: [PATCH]2.6.7 MSI-X Update

Roland> As a concrete example, the e1000 net driver does
Roland> request_irq() in its e1000_up() function and free_irq() in
Roland> its e1000_down() function. Basically, the driver will do
Roland> request_irq() when the user does "ifconfig up" and
Roland> free_irq() when the user does "ifconfig down".

Christoph> Lots of networking drivers do that..

Yup, I just wanted to pick one definite example so I could point to
real function names, etc. (And also I wanted to pick a piece of
hardware that I happen to know is MSI-capable).

Thanks,
Roland