LinuxLists.cc - Multiple MSI, take 4

2008-07-11 21:16:18

by Matthew Wilcox

[permalink] [raw]

Subject: Multiple MSI, take 4

Here we go with take 4. Changes:

- Check the requested number of interrupts against the maximum number
the device claims to support. Thanks to Hidetoshi Seto for pointing
out this oversight.
- Implemented Eric's suggestion of using a single IRQ and storing the
data with it.
- As a result, don't try to support the mode in the AHCI driver where
the excess ports all share the last interrupt. It could be done, but
it would be rather messy and I don't have hardware that supports that
mode anyway.

I'm fairly comfortable with the subchannel notion we're introducing
here. It's more flexible than MSI and doesn't impose a penalty on
architectures which don't implement it. It makes some things more
complex, but it makes other things simpler, so I think it's a wash from
a cleanliness standpoint.

Git tree still at
git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git multiple-msi
(and this push specifically at multiple-msi-20080711)

Patches to follow.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-07-11 21:18:47

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 6/6] x86-64: Support for multiple MSIs

Add support for allocating an aligned block of interrupt vectors.
Allow interrupts to have up to 32 subchannels.
Implement the arch_setup_msi_irqs() and arch_teardown_msi_irqs()
interfaces.

Signed-off-by: Matthew Wilcox <[email protected]>
---
arch/x86/kernel/io_apic_64.c | 221 +++++++++++++++++++++++++++++++++++------
arch/x86/kernel/irq_64.c | 2 +-
include/asm-x86/irq_64.h | 2 +
3 files changed, 191 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kernel/io_apic_64.c b/arch/x86/kernel/io_apic_64.c
index ef1a8df..4edf988 100644
--- a/arch/x86/kernel/io_apic_64.c
+++ b/arch/x86/kernel/io_apic_64.c
@@ -61,7 +61,7 @@ struct irq_cfg {
};

/* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */
-struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
+static struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
[0] = { .domain = CPU_MASK_ALL, .vector = IRQ0_VECTOR, },
[1] = { .domain = CPU_MASK_ALL, .vector = IRQ1_VECTOR, },
[2] = { .domain = CPU_MASK_ALL, .vector = IRQ2_VECTOR, },
@@ -683,6 +683,8 @@ static int pin_2_irq(int idx, int apic, int pin)
return irq;
}

+static int current_vector = FIRST_DEVICE_VECTOR;
+
static int __assign_irq_vector(int irq, cpumask_t mask)
{
/*
@@ -696,7 +698,7 @@ static int __assign_irq_vector(int irq, cpumask_t mask)
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu;
struct irq_cfg *cfg;
@@ -769,11 +771,98 @@ static int assign_irq_vector(int irq, cpumask_t mask)
return err;
}

-static void __clear_irq_vector(int irq)
+static int __assign_irq_vector_block(int irq, int count, cpumask_t mask)
+{
+ unsigned int old_vector;
+ int i, cpu;
+ struct irq_cfg *cfg;
+
+ /*
+ * We've got to be careful not to trash gate 0x80,
+ * because int 0x80 is hm, kind of importantish. ;)
+ */
+ BUG_ON((unsigned)irq >= NR_IRQS);
+ cfg = &irq_cfg[irq];
+
+ /* Only try and allocate irqs on cpus that are present */
+ cpus_and(mask, mask, cpu_online_map);
+
+ if ((cfg->move_in_progress) || cfg->move_cleanup_count)
+ return -EBUSY;
+
+ old_vector = cfg->vector;
+ if (old_vector) {
+ cpumask_t tmp;
+ cpus_and(tmp, cfg->domain, mask);
+ if (!cpus_empty(tmp))
+ return 0;
+ }
+
+ for_each_cpu_mask(cpu, mask) {
+ cpumask_t domain, new_mask;
+ int new_cpu;
+ int vector;
+
+ domain = vector_allocation_domain(cpu);
+ cpus_and(new_mask, domain, cpu_online_map);
+
+ vector = current_vector & ~(count - 1);
+ next:
+ vector += count;
+ if (vector + count >= FIRST_SYSTEM_VECTOR) {
+ vector = FIRST_DEVICE_VECTOR & ~(count - 1);
+ if (vector < FIRST_DEVICE_VECTOR)
+ vector += count;
+ }
+ if (unlikely(vector == (current_vector & ~(count - 1))))
+ continue;
+ if ((IA32_SYSCALL_VECTOR >= vector) &&
+ (IA32_SYSCALL_VECTOR < vector + count))
+ goto next;
+ for_each_cpu_mask(new_cpu, new_mask) {
+ for (i = 0; i < count; i++) {
+ if (per_cpu(vector_irq, new_cpu)[vector + i]
+ != -1)
+ goto next;
+ }
+ }
+ /* Found one! */
+ current_vector = vector + count - 1;
+ if (old_vector) {
+ cfg->move_in_progress = 1;
+ cfg->old_domain = cfg->domain;
+ }
+ for_each_cpu_mask(new_cpu, new_mask) {
+ for (i = 0; i < count; i++) {
+ per_cpu(vector_irq, new_cpu)[vector + i] =
+ irq | (i << IRQ_SUBCHANNEL_SHIFT);
+ }
+ }
+ cfg->vector = vector;
+ cfg->domain = domain;
+ return 0;
+ }
+ return -ENOSPC;
+}
+
+/* Assumes that count is a power of two and aligns to that power of two */
+static int assign_irq_vector_block(int irq, int count, cpumask_t mask)
+{
+ int result;
+ unsigned long flags;
+
+ spin_lock_irqsave(&vector_lock, flags);
+ result = __assign_irq_vector_block(irq, count, mask);
+ spin_unlock_irqrestore(&vector_lock, flags);
+
+ return result;
+}
+
+static void __clear_irq_vectors(int irq, int count)
{
struct irq_cfg *cfg;
cpumask_t mask;
- int cpu, vector;
+ int cpu, vector, i;

BUG_ON((unsigned)irq >= NR_IRQS);
cfg = &irq_cfg[irq];
@@ -781,8 +870,10 @@ static void __clear_irq_vector(int irq)

vector = cfg->vector;
cpus_and(mask, cfg->domain, cpu_online_map);
- for_each_cpu_mask(cpu, mask)
- per_cpu(vector_irq, cpu)[vector] = -1;
+ for_each_cpu_mask(cpu, mask) {
+ for (i = 0; i < count; i++)
+ per_cpu(vector_irq, cpu)[vector + i] = -1;
+ }

cfg->vector = 0;
cpus_clear(cfg->domain);
@@ -1895,11 +1986,11 @@ device_initcall(ioapic_init_sysfs);
/*
* Dynamic irq allocate and deallocation
*/
-int create_irq(void)
+
+static int create_irq_block(int count)
{
/* Allocate an unused irq */
- int irq;
- int new;
+ int irq, rc, new;
unsigned long flags;

irq = -ENOSPC;
@@ -1909,34 +2000,49 @@ int create_irq(void)
continue;
if (irq_cfg[new].vector != 0)
continue;
- if (__assign_irq_vector(new, TARGET_CPUS) == 0)
+ if (count == 1)
+ rc = __assign_irq_vector(new, TARGET_CPUS);
+ else
+ rc = __assign_irq_vector_block(new, count, TARGET_CPUS);
+
+ if (rc == 0)
irq = new;
break;
}
spin_unlock_irqrestore(&vector_lock, flags);

- if (irq >= 0) {
+ if (irq >= 0)
dynamic_irq_init(irq);
- }
return irq;
}

-void destroy_irq(unsigned int irq)
+int create_irq(void)
+{
+ return create_irq_block(1);
+}
+
+static void destroy_irq_block(unsigned int irq, int count)
{
unsigned long flags;

dynamic_irq_cleanup(irq);

spin_lock_irqsave(&vector_lock, flags);
- __clear_irq_vector(irq);
+ __clear_irq_vectors(irq, count);
spin_unlock_irqrestore(&vector_lock, flags);
}

+void destroy_irq(unsigned int irq)
+{
+ destroy_irq_block(irq, 1);
+}
+
/*
* MSI message composition
*/
#ifdef CONFIG_PCI_MSI
-static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_msg *msg)
+static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq,
+ unsigned int count, struct msi_msg *msg)
{
struct irq_cfg *cfg = irq_cfg + irq;
int err;
@@ -1944,7 +2050,10 @@ static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_ms
cpumask_t tmp;

tmp = TARGET_CPUS;
- err = assign_irq_vector(irq, tmp);
+ if (count == 1)
+ err = assign_irq_vector(irq, tmp);
+ else
+ err = assign_irq_vector_block(irq, count, tmp);
if (!err) {
cpus_and(tmp, cfg->domain, tmp);
dest = cpu_mask_to_apicid(tmp);
@@ -1975,6 +2084,8 @@ static int msi_compose_msg(struct pci_dev *pdev, unsigned int irq, struct msi_ms
static void set_msi_irq_affinity(unsigned int irq, cpumask_t mask)
{
struct irq_cfg *cfg = irq_cfg + irq;
+ struct msi_desc *desc = get_irq_msi(irq);
+ int count = 1 << desc->msi_attrib.multiple;
struct msi_msg msg;
unsigned int dest;
cpumask_t tmp;
@@ -1983,8 +2094,13 @@ static void set_msi_irq_affinity(unsigned int irq, cpumask_t mask)
if (cpus_empty(tmp))
return;

- if (assign_irq_vector(irq, mask))
- return;
+ if (count > 1) {
+ if (assign_irq_vector_block(irq, count, mask))
+ return;
+ } else {
+ if (assign_irq_vector(irq, mask))
+ return;
+ }

cpus_and(tmp, cfg->domain, mask);
dest = cpu_mask_to_apicid(tmp);
@@ -2016,31 +2132,70 @@ static struct irq_chip msi_chip = {
.retrigger = ioapic_retrigger_irq,
};

-int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
+static int x86_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc, int count)
{
struct msi_msg msg;
- int irq, ret;
- irq = create_irq();
- if (irq < 0)
- return irq;
-
- ret = msi_compose_msg(dev, irq, &msg);
- if (ret < 0) {
- destroy_irq(irq);
- return ret;
+ int irq, ret, alloc;
+
+ /* MSI can only allocate a power-of-two */
+ alloc = roundup_pow_of_two(count);
+
+ for (;;) {
+ irq = create_irq_block(alloc);
+ if (irq >= 0) {
+ if (alloc >= count)
+ break;
+ destroy_irq_block(irq, count);
+ return count;
+ }
+ if (alloc == 1)
+ return irq;
+ alloc /= 2;
}

- set_irq_msi(irq, desc);
- write_msi_msg(irq, &msg);
+ ret = msi_compose_msg(pdev, irq, alloc, &msg);
+ if (ret)
+ return ret;

+ desc->msi_attrib.multiple = order_base_2(alloc);
+
+ set_irq_msi(irq, desc);
set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
+ write_msi_msg(irq, &msg);

return 0;
}

-void arch_teardown_msi_irq(unsigned int irq)
+int arch_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
{
- destroy_irq(irq);
+ struct msi_desc *desc;
+ int ret;
+
+ if (type == PCI_CAP_ID_MSI) {
+ desc = list_first_entry(&pdev->msi_list, struct msi_desc, list);
+ ret = x86_setup_msi_irq(pdev, desc, nvec);
+ } else {
+ list_for_each_entry(desc, &pdev->msi_list, list) {
+ ret = x86_setup_msi_irq(pdev, desc, 1);
+ if (ret)
+ break;
+ }
+ }
+
+ return ret;
+}
+
+void arch_teardown_msi_irqs(struct pci_dev *dev)
+{
+ struct msi_desc *entry;
+
+ list_for_each_entry(entry, &dev->msi_list, list) {
+ int nvec;
+ if (entry->irq == 0)
+ continue;
+ nvec = 1 << entry->msi_attrib.multiple;
+ destroy_irq_block(entry->irq, nvec);
+ }
}

#ifdef CONFIG_DMAR
@@ -2090,7 +2245,7 @@ int arch_setup_dmar_msi(unsigned int irq)
int ret;
struct msi_msg msg;

- ret = msi_compose_msg(NULL, irq, &msg);
+ ret = msi_compose_msg(NULL, irq, 1, &msg);
if (ret < 0)
return ret;
dmar_msi_write(irq, &msg);
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 3aac154..dbb5487 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -173,7 +173,7 @@ asmlinkage unsigned int do_IRQ(struct pt_regs *regs)
stack_overflow_check(regs);
#endif

- if (likely(irq < NR_IRQS))
+ if (likely((get_irq_value(irq)) < NR_IRQS))
generic_handle_irq(irq);
else {
if (!disable_apic)
diff --git a/include/asm-x86/irq_64.h b/include/asm-x86/irq_64.h
index 083d35a..5259854 100644
--- a/include/asm-x86/irq_64.h
+++ b/include/asm-x86/irq_64.h
@@ -34,6 +34,8 @@
#define NR_IRQS (NR_VECTORS + (32 * NR_CPUS))
#define NR_IRQ_VECTORS NR_IRQS

+#define IRQ_SUBCHANNEL_BITS 5
+
static inline int irq_canonicalize(int irq)
{
return ((irq == 2) ? 9 : irq);
--
1.5.5.4

2008-07-11 21:19:31

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 1/6] Allow a small amount of data to be stored with the interrupt number

The PCI MSI capability allows up to 5 bits of information to be sent
from the device to the interrupt controller. Rather than allocate
multiple Linux interrupts to handle this, we can store the information in
the top 5 bits of the interrupt number. Architectures which have not
been converted to handle multiple MSI will not store data there and so
do not need to be taught about this.

Signed-off-by: Matthew Wilcox <[email protected]>
---
include/linux/irq.h | 22 +++++++++++++++++++++-
kernel/irq/chip.c | 5 +++--
kernel/irq/handle.c | 17 +++++++++--------
3 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 552e0ec..9bc7d61 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -282,6 +282,26 @@ extern unsigned int __do_IRQ(unsigned int irq);
#endif

/*
+ * We allow interrupts to have a few subchannels. It's up to the
+ * architecture to arrange how many there are per interrupt and what they
+ * are used for. To support subchannels, architectures should define
+ * IRQ_SUBCHANNEL_BITS in asm/irq.h. In order to support multiple PCI MSI,
+ * an architecture should set it to at least 5 in order to support 32
+ * subchannels (one for each vector). This leaves 27 bits for NR_IRQS. If
+ * the architecture wants to store something other than MSI data here, it
+ * may wish to make it larger, but should take care not to run into NR_IRQS.
+ */
+
+#ifndef IRQ_SUBCHANNEL_BITS
+#define get_irq_value(irq) (irq)
+#define get_irq_subchannel(irq) 0
+#else
+#define IRQ_SUBCHANNEL_SHIFT (32 - IRQ_SUBCHANNEL_BITS)
+#define get_irq_value(irq) (irq & ((1 << IRQ_SUBCHANNEL_SHIFT) - 1))
+#define get_irq_subchannel(irq) (irq >> IRQ_SUBCHANNEL_SHIFT)
+#endif
+
+/*
* Architectures call this to let the generic IRQ layer
* handle an interrupt. If the descriptor is attached to an
* irqchip-style controller then we call the ->handle_irq() handler,
@@ -289,7 +309,7 @@ extern unsigned int __do_IRQ(unsigned int irq);
*/
static inline void generic_handle_irq(unsigned int irq)
{
- struct irq_desc *desc = irq_desc + irq;
+ struct irq_desc *desc = irq_desc + get_irq_value(irq);

#ifdef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
desc->handle_irq(irq, desc);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index 964964b..fd4f7f0 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -453,8 +453,9 @@ out:
* loop is left.
*/
void
-handle_edge_irq(unsigned int irq, struct irq_desc *desc)
+handle_edge_irq(unsigned int irq_plus_data, struct irq_desc *desc)
{
+ const unsigned int irq = get_irq_value(irq_plus_data);
const unsigned int cpu = smp_processor_id();

spin_lock(&desc->lock);
@@ -504,7 +505,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)

desc->status &= ~IRQ_PENDING;
spin_unlock(&desc->lock);
- action_ret = handle_IRQ_event(irq, action);
+ action_ret = handle_IRQ_event(irq_plus_data, action);
if (!noirqdebug)
note_interrupt(irq, desc, action_ret);
spin_lock(&desc->lock);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 5fa6198..4e463e6 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -126,7 +126,7 @@ irqreturn_t no_action(int cpl, void *dev_id)
*
* Handles the action chain of an irq event
*/
-irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
+irqreturn_t handle_IRQ_event(unsigned int irq_plus_data, struct irqaction *action)
{
irqreturn_t ret, retval = IRQ_NONE;
unsigned int status = 0;
@@ -137,7 +137,7 @@ irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
local_irq_enable_in_hardirq();

do {
- ret = action->handler(irq, action->dev_id);
+ ret = action->handler(irq_plus_data, action->dev_id);
if (ret == IRQ_HANDLED)
status |= action->flags;
retval |= ret;
@@ -145,7 +145,7 @@ irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
} while (action);

if (status & IRQF_SAMPLE_RANDOM)
- add_interrupt_randomness(irq);
+ add_interrupt_randomness(get_irq_value(irq_plus_data));
local_irq_disable();

return retval;
@@ -163,15 +163,16 @@ irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
* This is the original x86 implementation which is used for every
* interrupt type.
*/
-unsigned int __do_IRQ(unsigned int irq)
+unsigned int __do_IRQ(unsigned int irq_plus_data)
{
+ const unsigned int irq = get_irq_value(irq_plus_data);
struct irq_desc *desc = irq_desc + irq;
struct irqaction *action;
unsigned int status;

kstat_this_cpu.irqs[irq]++;
if (CHECK_IRQ_PER_CPU(desc->status)) {
- irqreturn_t action_ret;
+ irqreturn_t ret;

/*
* No locking required for CPU-local interrupts:
@@ -179,9 +180,9 @@ unsigned int __do_IRQ(unsigned int irq)
if (desc->chip->ack)
desc->chip->ack(irq);
if (likely(!(desc->status & IRQ_DISABLED))) {
- action_ret = handle_IRQ_event(irq, desc->action);
+ ret = handle_IRQ_event(irq_plus_data, desc->action);
if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, ret);
}
desc->chip->end(irq);
return 1;
@@ -233,7 +234,7 @@ unsigned int __do_IRQ(unsigned int irq)

spin_unlock(&desc->lock);

- action_ret = handle_IRQ_event(irq, action);
+ action_ret = handle_IRQ_event(irq_plus_data, action);
if (!noirqdebug)
note_interrupt(irq, desc, action_ret);

--
1.5.5.4

2008-07-11 21:19:14

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 4/6] Rewrite MSI-HOWTO

I didn't find the previous version very useful, so I rewrote it.
I added documentation for the new pci_enable_msi_block() API.
I also moved it to the PCI subdirectory of Documentation.

Signed-off-by: Matthew Wilcox <[email protected]>
---
Documentation/MSI-HOWTO.txt | 511 ---------------------------------------
Documentation/PCI/MSI-HOWTO.txt | 322 ++++++++++++++++++++++++
2 files changed, 322 insertions(+), 511 deletions(-)
delete mode 100644 Documentation/MSI-HOWTO.txt
create mode 100644 Documentation/PCI/MSI-HOWTO.txt

diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt
deleted file mode 100644
index a51f693..0000000
--- a/Documentation/MSI-HOWTO.txt
+++ /dev/null
@@ -1,511 +0,0 @@
- The MSI Driver Guide HOWTO
- Tom L Nguyen [email protected]
- 10/03/2003
- Revised Feb 12, 2004 by Martine Silbermann
- email: [email protected]
- Revised Jun 25, 2004 by Tom L Nguyen
-
-1. About this guide
-
-This guide describes the basics of Message Signaled Interrupts (MSI),
-the advantages of using MSI over traditional interrupt mechanisms,
-and how to enable your driver to use MSI or MSI-X. Also included is
-a Frequently Asked Questions (FAQ) section.
-
-1.1 Terminology
-
-PCI devices can be single-function or multi-function. In either case,
-when this text talks about enabling or disabling MSI on a "device
-function," it is referring to one specific PCI device and function and
-not to all functions on a PCI device (unless the PCI device has only
-one function).
-
-2. Copyright 2003 Intel Corporation
-
-3. What is MSI/MSI-X?
-
-Message Signaled Interrupt (MSI), as described in the PCI Local Bus
-Specification Revision 2.3 or later, is an optional feature, and a
-required feature for PCI Express devices. MSI enables a device function
-to request service by sending an Inbound Memory Write on its PCI bus to
-the FSB as a Message Signal Interrupt transaction. Because MSI is
-generated in the form of a Memory Write, all transaction conditions,
-such as a Retry, Master-Abort, Target-Abort or normal completion, are
-supported.
-
-A PCI device that supports MSI must also support pin IRQ assertion
-interrupt mechanism to provide backward compatibility for systems that
-do not support MSI. In systems which support MSI, the bus driver is
-responsible for initializing the message address and message data of
-the device function's MSI/MSI-X capability structure during device
-initial configuration.
-
-An MSI capable device function indicates MSI support by implementing
-the MSI/MSI-X capability structure in its PCI capability list. The
-device function may implement both the MSI capability structure and
-the MSI-X capability structure; however, the bus driver should not
-enable both.
-
-The MSI capability structure contains Message Control register,
-Message Address register and Message Data register. These registers
-provide the bus driver control over MSI. The Message Control register
-indicates the MSI capability supported by the device. The Message
-Address register specifies the target address and the Message Data
-register specifies the characteristics of the message. To request
-service, the device function writes the content of the Message Data
-register to the target address. The device and its software driver
-are prohibited from writing to these registers.
-
-The MSI-X capability structure is an optional extension to MSI. It
-uses an independent and separate capability structure. There are
-some key advantages to implementing the MSI-X capability structure
-over the MSI capability structure as described below.
-
- - Support a larger maximum number of vectors per function.
-
- - Provide the ability for system software to configure
- each vector with an independent message address and message
- data, specified by a table that resides in Memory Space.
-
- - MSI and MSI-X both support per-vector masking. Per-vector
- masking is an optional extension of MSI but a required
- feature for MSI-X. Per-vector masking provides the kernel the
- ability to mask/unmask a single MSI while running its
- interrupt service routine. If per-vector masking is
- not supported, then the device driver should provide the
- hardware/software synchronization to ensure that the device
- generates MSI when the driver wants it to do so.
-
-4. Why use MSI?
-
-As a benefit to the simplification of board design, MSI allows board
-designers to remove out-of-band interrupt routing. MSI is another
-step towards a legacy-free environment.
-
-Due to increasing pressure on chipset and processor packages to
-reduce pin count, the need for interrupt pins is expected to
-diminish over time. Devices, due to pin constraints, may implement
-messages to increase performance.
-
-PCI Express endpoints uses INTx emulation (in-band messages) instead
-of IRQ pin assertion. Using INTx emulation requires interrupt
-sharing among devices connected to the same node (PCI bridge) while
-MSI is unique (non-shared) and does not require BIOS configuration
-support. As a result, the PCI Express technology requires MSI
-support for better interrupt performance.
-
-Using MSI enables the device functions to support two or more
-vectors, which can be configured to target different CPUs to
-increase scalability.
-
-5. Configuring a driver to use MSI/MSI-X
-
-By default, the kernel will not enable MSI/MSI-X on all devices that
-support this capability. The CONFIG_PCI_MSI kernel option
-must be selected to enable MSI/MSI-X support.
-
-5.1 Including MSI/MSI-X support into the kernel
-
-To allow MSI/MSI-X capable device drivers to selectively enable
-MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described
-below), the VECTOR based scheme needs to be enabled by setting
-CONFIG_PCI_MSI during kernel config.
-
-Since the target of the inbound message is the local APIC, providing
-CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI.
-
-5.2 Configuring for MSI support
-
-Due to the non-contiguous fashion in vector assignment of the
-existing Linux kernel, this version does not support multiple
-messages regardless of a device function is capable of supporting
-more than one vector. To enable MSI on a device function's MSI
-capability structure requires a device driver to call the function
-pci_enable_msi() explicitly.
-
-5.2.1 API pci_enable_msi
-
-int pci_enable_msi(struct pci_dev *dev)
-
-With this new API, a device driver that wants to have MSI
-enabled on its device function must call this API to enable MSI.
-A successful call will initialize the MSI capability structure
-with ONE vector, regardless of whether a device function is
-capable of supporting multiple messages. This vector replaces the
-pre-assigned dev->irq with a new MSI vector. To avoid a conflict
-of the new assigned vector with existing pre-assigned vector requires
-a device driver to call this API before calling request_irq().
-
-5.2.2 API pci_disable_msi
-
-void pci_disable_msi(struct pci_dev *dev)
-
-This API should always be used to undo the effect of pci_enable_msi()
-when a device driver is unloading. This API restores dev->irq with
-the pre-assigned IOAPIC vector and switches a device's interrupt
-mode to PCI pin-irq assertion/INTx emulation mode.
-
-Note that a device driver should always call free_irq() on the MSI vector
-that it has done request_irq() on before calling this API. Failure to do
-so results in a BUG_ON() and a device will be left with MSI enabled and
-leaks its vector.
-
-5.2.3 MSI mode vs. legacy mode diagram
-
-The below diagram shows the events which switch the interrupt
-mode on the MSI-capable device function between MSI mode and
-PIN-IRQ assertion mode.
-
- ------------ pci_enable_msi ------------------------
- | | <=============== | |
- | MSI MODE | | PIN-IRQ ASSERTION MODE |
- | | ===============> | |
- ------------ pci_disable_msi ------------------------
-
-
-Figure 1. MSI Mode vs. Legacy Mode
-
-In Figure 1, a device operates by default in legacy mode. Legacy
-in this context means PCI pin-irq assertion or PCI-Express INTx
-emulation. A successful MSI request (using pci_enable_msi()) switches
-a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector
-stored in dev->irq will be saved by the PCI subsystem and a new
-assigned MSI vector will replace dev->irq.
-
-To return back to its default mode, a device driver should always call
-pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a
-device driver should always call free_irq() on the MSI vector it has
-done request_irq() on before calling pci_disable_msi(). Failure to do
-so results in a BUG_ON() and a device will be left with MSI enabled and
-leaks its vector. Otherwise, the PCI subsystem restores a device's
-dev->irq with a pre-assigned IOAPIC vector and marks the released
-MSI vector as unused.
-
-Once being marked as unused, there is no guarantee that the PCI
-subsystem will reserve this MSI vector for a device. Depending on
-the availability of current PCI vector resources and the number of
-MSI/MSI-X requests from other drivers, this MSI may be re-assigned.
-
-For the case where the PCI subsystem re-assigns this MSI vector to
-another driver, a request to switch back to MSI mode may result
-in being assigned a different MSI vector or a failure if no more
-vectors are available.
-
-5.3 Configuring for MSI-X support
-
-Due to the ability of the system software to configure each vector of
-the MSI-X capability structure with an independent message address
-and message data, the non-contiguous fashion in vector assignment of
-the existing Linux kernel has no impact on supporting multiple
-messages on an MSI-X capable device functions. To enable MSI-X on
-a device function's MSI-X capability structure requires its device
-driver to call the function pci_enable_msix() explicitly.
-
-The function pci_enable_msix(), once invoked, enables either
-all or nothing, depending on the current availability of PCI vector
-resources. If the PCI vector resources are available for the number
-of vectors requested by a device driver, this function will configure
-the MSI-X table of the MSI-X capability structure of a device with
-requested messages. To emphasize this reason, for example, a device
-may be capable for supporting the maximum of 32 vectors while its
-software driver usually may request 4 vectors. It is recommended
-that the device driver should call this function once during the
-initialization phase of the device driver.
-
-Unlike the function pci_enable_msi(), the function pci_enable_msix()
-does not replace the pre-assigned IOAPIC dev->irq with a new MSI
-vector because the PCI subsystem writes the 1:1 vector-to-entry mapping
-into the field vector of each element contained in a second argument.
-Note that the pre-assigned IOAPIC dev->irq is valid only if the device
-operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at
-using dev->irq by the device driver to request for interrupt service
-may result in unpredictable behavior.
-
-For each MSI-X vector granted, a device driver is responsible for calling
-other functions like request_irq(), enable_irq(), etc. to enable
-this vector with its corresponding interrupt service handler. It is
-a device driver's choice to assign all vectors with the same
-interrupt service handler or each vector with a unique interrupt
-service handler.
-
-5.3.1 Handling MMIO address space of MSI-X Table
-
-The PCI 3.0 specification has implementation notes that MMIO address
-space for a device's MSI-X structure should be isolated so that the
-software system can set different pages for controlling accesses to the
-MSI-X structure. The implementation of MSI support requires the PCI
-subsystem, not a device driver, to maintain full control of the MSI-X
-table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X
-table/MSI-X PBA. A device driver is prohibited from requesting the MMIO
-address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem
-will fail enabling MSI-X on its hardware device when it calls the function
-pci_enable_msix().
-
-5.3.2 API pci_enable_msix
-
-int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
-
-This API enables a device driver to request the PCI subsystem
-to enable MSI-X messages on its hardware device. Depending on
-the availability of PCI vectors resources, the PCI subsystem enables
-either all or none of the requested vectors.
-
-Argument 'dev' points to the device (pci_dev) structure.
-
-Argument 'entries' is a pointer to an array of msix_entry structs.
-The number of entries is indicated in argument 'nvec'.
-struct msix_entry is defined in /driver/pci/msi.h:
-
-struct msix_entry {
- u16 vector; /* kernel uses to write alloc vector */
- u16 entry; /* driver uses to specify entry */
-};
-
-A device driver is responsible for initializing the field 'entry' of
-each element with a unique entry supported by MSI-X table. Otherwise,
--EINVAL will be returned as a result. A successful return of zero
-indicates the PCI subsystem completed initializing each of the requested
-entries of the MSI-X table with message address and message data.
-Last but not least, the PCI subsystem will write the 1:1
-vector-to-entry mapping into the field 'vector' of each element. A
-device driver is responsible for keeping track of allocated MSI-X
-vectors in its internal data structure.
-
-A return of zero indicates that the number of MSI-X vectors was
-successfully allocated. A return of greater than zero indicates
-MSI-X vector shortage. Or a return of less than zero indicates
-a failure. This failure may be a result of duplicate entries
-specified in second argument, or a result of no available vector,
-or a result of failing to initialize MSI-X table entries.
-
-5.3.3 API pci_disable_msix
-
-void pci_disable_msix(struct pci_dev *dev)
-
-This API should always be used to undo the effect of pci_enable_msix()
-when a device driver is unloading. Note that a device driver should
-always call free_irq() on all MSI-X vectors it has done request_irq()
-on before calling this API. Failure to do so results in a BUG_ON() and
-a device will be left with MSI-X enabled and leaks its vectors.
-
-5.3.4 MSI-X mode vs. legacy mode diagram
-
-The below diagram shows the events which switch the interrupt
-mode on the MSI-X capable device function between MSI-X mode and
-PIN-IRQ assertion mode (legacy).
-
- ------------ pci_enable_msix(,,n) ------------------------
- | | <=============== | |
- | MSI-X MODE | | PIN-IRQ ASSERTION MODE |
- | | ===============> | |
- ------------ pci_disable_msix ------------------------
-
-Figure 2. MSI-X Mode vs. Legacy Mode
-
-In Figure 2, a device operates by default in legacy mode. A
-successful MSI-X request (using pci_enable_msix()) switches a
-device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector
-stored in dev->irq will be saved by the PCI subsystem; however,
-unlike MSI mode, the PCI subsystem will not replace dev->irq with
-assigned MSI-X vector because the PCI subsystem already writes the 1:1
-vector-to-entry mapping into the field 'vector' of each element
-specified in second argument.
-
-To return back to its default mode, a device driver should always call
-pci_disable_msix() to undo the effect of pci_enable_msix(). Note that
-a device driver should always call free_irq() on all MSI-X vectors it
-has done request_irq() on before calling pci_disable_msix(). Failure
-to do so results in a BUG_ON() and a device will be left with MSI-X
-enabled and leaks its vectors. Otherwise, the PCI subsystem switches a
-device function's interrupt mode from MSI-X mode to legacy mode and
-marks all allocated MSI-X vectors as unused.
-
-Once being marked as unused, there is no guarantee that the PCI
-subsystem will reserve these MSI-X vectors for a device. Depending on
-the availability of current PCI vector resources and the number of
-MSI/MSI-X requests from other drivers, these MSI-X vectors may be
-re-assigned.
-
-For the case where the PCI subsystem re-assigned these MSI-X vectors
-to other drivers, a request to switch back to MSI-X mode may result
-being assigned with another set of MSI-X vectors or a failure if no
-more vectors are available.
-
-5.4 Handling function implementing both MSI and MSI-X capabilities
-
-For the case where a function implements both MSI and MSI-X
-capabilities, the PCI subsystem enables a device to run either in MSI
-mode or MSI-X mode but not both. A device driver determines whether it
-wants MSI or MSI-X enabled on its hardware device. Once a device
-driver requests for MSI, for example, it is prohibited from requesting
-MSI-X; in other words, a device driver is not permitted to ping-pong
-between MSI mod MSI-X mode during a run-time.
-
-5.5 Hardware requirements for MSI/MSI-X support
-
-MSI/MSI-X support requires support from both system hardware and
-individual hardware device functions.
-
-5.5.1 Required x86 hardware support
-
-Since the target of MSI address is the local APIC CPU, enabling
-MSI/MSI-X support in the Linux kernel is dependent on whether existing
-system hardware supports local APIC. Users should verify that their
-system supports local APIC operation by testing that it runs when
-CONFIG_X86_LOCAL_APIC=y.
-
-In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set;
-however, in UP environment, users must manually set
-CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting
-CONFIG_PCI_MSI enables the VECTOR based scheme and the option for
-MSI-capable device drivers to selectively enable MSI/MSI-X.
-
-Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X
-vector is allocated new during runtime and MSI/MSI-X support does not
-depend on BIOS support. This key independency enables MSI/MSI-X
-support on future IOxAPIC free platforms.
-
-5.5.2 Device hardware support
-
-The hardware device function supports MSI by indicating the
-MSI/MSI-X capability structure on its PCI capability list. By
-default, this capability structure will not be initialized by
-the kernel to enable MSI during the system boot. In other words,
-the device function is running on its default pin assertion mode.
-Note that in many cases the hardware supporting MSI have bugs,
-which may result in system hangs. The software driver of specific
-MSI-capable hardware is responsible for deciding whether to call
-pci_enable_msi or not. A return of zero indicates the kernel
-successfully initialized the MSI/MSI-X capability structure of the
-device function. The device function is now running on MSI/MSI-X mode.
-
-5.6 How to tell whether MSI/MSI-X is enabled on device function
-
-At the driver level, a return of zero from the function call of
-pci_enable_msi()/pci_enable_msix() indicates to a device driver that
-its device function is initialized successfully and ready to run in
-MSI/MSI-X mode.
-
-At the user level, users can use the command 'cat /proc/interrupts'
-to display the vectors allocated for devices and their interrupt
-MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode is
-enabled on a SCSI Adaptec 39320D Ultra320 controller.
-
- CPU0 CPU1
- 0: 324639 0 IO-APIC-edge timer
- 1: 1186 0 IO-APIC-edge i8042
- 2: 0 0 XT-PIC cascade
- 12: 2797 0 IO-APIC-edge i8042
- 14: 6543 0 IO-APIC-edge ide0
- 15: 1 0 IO-APIC-edge ide1
-169: 0 0 IO-APIC-level uhci-hcd
-185: 0 0 IO-APIC-level uhci-hcd
-193: 138 10 PCI-MSI aic79xx
-201: 30 0 PCI-MSI aic79xx
-225: 30 0 IO-APIC-level aic7xxx
-233: 30 0 IO-APIC-level aic7xxx
-NMI: 0 0
-LOC: 324553 325068
-ERR: 0
-MIS: 0
-
-6. MSI quirks
-
-Several PCI chipsets or devices are known to not support MSI.
-The PCI stack provides 3 possible levels of MSI disabling:
-* on a single device
-* on all devices behind a specific bridge
-* globally
-
-6.1. Disabling MSI on a single device
-
-Under some circumstances it might be required to disable MSI on a
-single device. This may be achieved by either not calling pci_enable_msi()
-or all, or setting the pci_dev->no_msi flag before (most of the time
-in a quirk).
-
-6.2. Disabling MSI below a bridge
-
-The vast majority of MSI quirks are required by PCI bridges not
-being able to route MSI between busses. In this case, MSI have to be
-disabled on all devices behind this bridge. It is achieves by setting
-the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge
-subordinate bus. There is no need to set the same flag on bridges that
-are below the broken bridge. When pci_enable_msi() is called to enable
-MSI on a device, pci_msi_supported() takes care of checking the NO_MSI
-flag in all parent busses of the device.
-
-Some bridges actually support dynamic MSI support enabling/disabling
-by changing some bits in their PCI configuration space (especially
-the Hypertransport chipsets such as the nVidia nForce and Serverworks
-HT2000). It may then be required to update the NO_MSI flag on the
-corresponding devices in the sysfs hierarchy. To enable MSI support
-on device "0000:00:0e", do:
-
- echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus
-
-To disable MSI support, echo 0 instead of 1. Note that it should be
-used with caution since changing this value might break interrupts.
-
-6.3. Disabling MSI globally
-
-Some extreme cases may require to disable MSI globally on the system.
-For now, the only known case is a Serverworks PCI-X chipsets (MSI are
-not supported on several busses that are not all connected to the
-chipset in the Linux PCI hierarchy). In the vast majority of other
-cases, disabling only behind a specific bridge is enough.
-
-For debugging purpose, the user may also pass pci=nomsi on the kernel
-command-line to explicitly disable MSI globally. But, once the appro-
-priate quirks are added to the kernel, this option should not be
-required anymore.
-
-6.4. Finding why MSI cannot be enabled on a device
-
-Assuming that MSI are not enabled on a device, you should look at
-dmesg to find messages that quirks may output when disabling MSI
-on some devices, some bridges or even globally.
-Then, lspci -t gives the list of bridges above a device. Reading
-/sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI
-are enabled (1) or disabled (0). In 0 is found in a single bridge
-msi_bus file above the device, MSI cannot be enabled.
-
-7. FAQ
-
-Q1. Are there any limitations on using the MSI?
-
-A1. If the PCI device supports MSI and conforms to the
-specification and the platform supports the APIC local bus,
-then using MSI should work.
-
-Q2. Will it work on all the Pentium processors (P3, P4, Xeon,
-AMD processors)? In P3 IPI's are transmitted on the APIC local
-bus and in P4 and Xeon they are transmitted on the system
-bus. Are there any implications with this?
-
-A2. MSI support enables a PCI device sending an inbound
-memory write (0xfeexxxxx as target address) on its PCI bus
-directly to the FSB. Since the message address has a
-redirection hint bit cleared, it should work.
-
-Q3. The target address 0xfeexxxxx will be translated by the
-Host Bridge into an interrupt message. Are there any
-limitations on the chipsets such as Intel 8xx, Intel e7xxx,
-or VIA?
-
-A3. If these chipsets support an inbound memory write with
-target address set as 0xfeexxxxx, as conformed to PCI
-specification 2.3 or latest, then it should work.
-
-Q4. From the driver point of view, if the MSI is lost because
-of errors occurring during inbound memory write, then it may
-wait forever. Is there a mechanism for it to recover?
-
-A4. Since the target of the transaction is an inbound memory
-write, all transaction termination conditions (Retry,
-Master-Abort, Target-Abort, or normal completion) are
-supported. A device sending an MSI must abide by all the PCI
-rules and conditions regarding that inbound memory write. So,
-if a retry is signaled it must retry, etc... We believe that
-the recommendation for Abort is also a retry (refer to PCI
-specification 2.3 or latest).
diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt
new file mode 100644
index 0000000..eb129ee
--- /dev/null
+++ b/Documentation/PCI/MSI-HOWTO.txt
@@ -0,0 +1,322 @@
+ The MSI Driver Guide HOWTO
+ Tom L Nguyen [email protected]
+ 10/03/2003
+ Revised Feb 12, 2004 by Martine Silbermann
+ email: [email protected]
+ Revised Jun 25, 2004 by Tom L Nguyen
+ Revised Jul 9, 2008 by Matthew Wilcox <[email protected]>
+ Copyright 2003, 2008 Intel Corporation
+
+1. About this guide
+
+This guide describes the basics of Message Signaled Interrupts (MSIs),
+the advantages of using MSI over traditional interrupt mechanisms,
+and how to change your driver to use MSI or MSI-X.
+
+
+2. What are MSIs?
+
+Message Signaled Interrupt (MSI) is an optional feature for devices
+which implement the PCI Local Bus Specification Revision 2.2 and later.
+MSI enables a device to generate an interrupt by sending a normal write
+to a special address in the host chipset that is translated into a CPU
+interrupt. MSI-X (introduced in PCI 3.0) is a more flexible scheme
+than MSI. It allows for greater control over what interrupts can be
+generated and supports a greater number of interrupts.
+
+A device indicates MSI support by implementing the MSI or the MSI-X
+capability in its PCI configuration space. It may implement both the
+MSI capability structure and the MSI-X capability structure, but only
+one may be enabled.
+
+
+3. Why use MSIs?
+
+Pin-based PCI interrupts are often shared between several devices.
+To support this, the kernel must call each interrupt handler associated
+with an interrupt which leads to increased latency for the interrupt
+handlers which are registered last.
+
+When a device performs DMA to memory and raises a pin-based interrupt, it
+is possible that the interrupt may arrive before all the data has arrived
+in memory (this becomes more likely with devices behind PCI-PCI bridges).
+In order to assure that all DMA has arrived in memory, the interrupt
+handler must read a register on the device which raised the interrupt.
+PCI ordering rules require that the writes be flushed to memory before
+the value can be returned from the register. MSI avoids this problem
+as the interrupt-generating write cannot pass the DMA writes, so by the
+time the interrupt is raised, the driver knows that the DMA has completed.
+
+Using MSIs enables the device to support more interrupts, allowing
+each interrupt to be specialised to a different purpose. This allows
+infrequent conditions (such as errors) to be given their own interrupt so
+the driver does not have to check for errors during the normal interrupt
+handling path.
+
+
+4. How to use MSIs
+
+PCI devices are initialised to use pin-based interrupts. The device
+driver has to set up the device to use MSI or MSI-X. Not all machines
+support MSIs correctly, and for those machines, the APIs described below
+will simply fail and the device will continue to use pin-based interrupts.
+
+4.1 Include kernel support for MSIs
+
+To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI
+option enabled. This option is only available on some architectures,
+and it may depend on some other options also being set. For example,
+on x86, you must also enable X86_UP_APIC or SMP in order to see the
+CONFIG_PCI_MSI option.
+
+4.2 Using MSI
+
+Most of the hard work is done for the driver in the PCI layer. It simply
+has to request that the PCI layer set up the MSI capability for this
+device.
+
+4.2.1 pci_enable_msi
+
+int pci_enable_msi(struct pci_dev *dev)
+
+A successful call will allocate ONE interrupt to the device, regardless
+of how many MSIs the device supports. The device will be switched from
+pin-based interrupt mode to MSI mode. The dev->irq number is changed
+to a new number which represents the message signaled interrupt.
+This function should be called before the driver calls request_irq()
+since enabling MSIs disables the pin-based IRQ and the driver will not
+receive interrupts on the old interrupt.
+
+4.2.2 pci_enable_msi_block
+
+int pci_enable_msi_block(struct pci_dev *pdev, int count)
+
+This variation allows a device driver to request multiple MSIs. The MSI
+specification only allows interrupts to be allocated in powers of two,
+up to a maximum of 2^5 (32).
+
+If this function returns 0, it has succeeded in allocating as many
+interrupt vectors as the driver requested (it may have allocated more
+in order to satisfy the power-of-two requirement). In this case, the
+function enables MSI on this device and updates pdev->irq to be the
+the new interrupt assigned to it.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to request any more MSI interrupts for
+this device. If this function returns a positive number, it will be
+less than 'count' and indicate the number of interrupts that could have
+been allocated. In neither case will the irq value have been
+updated, nor will the device have been switched into MSI mode.
+
+The device driver must decide what action to take if
+pci_enable_msi_block() returns a value less than the number asked for.
+Some devices can make use of fewer interrupts than the maximum they
+request; in this case the driver should call pci_enable_msi_block()
+again. Note that it is not guaranteed to succeed, even when the
+'count' has been reduced to the value returned from a previous call to
+pci_enable_msi_block(). This is because there are multiple constraints
+on the number of vectors that can be allocated; pci_enable_msi_block()
+will return as soon as it finds any constraint that doesn't allow the
+call to succeed.
+
+The interrupt handler for the interrupt will be told which vector was
+assigned in the top bits of the 'irq' parameter. To extract the vector
+number, use get_irq_subchannel(irq).
+
+4.2.3 pci_disable_msi
+
+void pci_disable_msi(struct pci_dev *dev)
+
+This API should be used to undo the effect of pci_enable_msi() or
+pci_enable_msi_block(). This API restores dev->irq to the pin-based
+interrupt number and frees the previously allocated message signaled
+interrupt(s). The interrupt may subsequently be assigned to another
+device, so drivers should not cache the value of pdev->irq.
+
+A device driver must always call free_irq() on the interrupt(s)
+for which it has called request_irq() before calling this function.
+Failure to do so will result in a BUG_ON(), the device will be left with
+MSI enabled and will leak its vector.
+
+4.3 Using MSI-X
+
+The MSI-X capability is much more flexible than the MSI capability.
+It supports up to 2048 interrupts, each of which can be separately
+assigned. To support this flexibility, drivers must use an array of
+`struct msix_entry':
+
+struct msix_entry {
+ u16 vector; /* kernel uses to write alloc vector */
+ u16 entry; /* driver uses to specify entry */
+};
+
+This allows for the device to use these interrupts in a sparse fashion;
+for example it could use interrupts 3 and 1027 and allocate only a
+two-element array. The driver is expected to fill in the 'entry' value
+in each element of the array to indicate which entries it wants the kernel have
+interrupts assigned for. It is invalid to fill in two entries with the
+same number.
+
+4.3.1 pci_enable_msix
+
+int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec)
+
+Calling this function asks the PCI subsystem to allocate 'nvec' MSIs.
+The 'entries' argument is a pointer to an array of msix_entry structs
+which should be at least 'nvec' entries in size. On success, the
+function will return 0 and the device will have been switched into
+MSI-X interrupt mode. The 'vector' elements in each entry will have
+been filled in with the interrupt number.
+
+If this function returns a negative number, it indicates an error and
+the driver should not attempt to allocate any more MSI-X interrupts for
+this device. If it returns a positive number, it indicates the maximum
+number of interrupt vectors that could have been allocated.
+
+This function, in contrast with pci_enable_msi(), does not adjust
+pdev->irq. The device will not generate interrupts for this interrupt
+number once MSI-X is enabled. The device driver is responsible for
+keeping track of the interrupts assigned to the MSI-X vectors so it can
+free them again later.
+
+Device drivers should normally call this function once per device
+during the initialization phase.
+
+4.3.2 pci_disable_msix
+
+void pci_disable_msix(struct pci_dev *dev)
+
+This API should be used to undo the effect of pci_enable_msix(). It frees
+the previously allocated message signaled interrupts. The interrupts may
+subsequently be assigned to another device, so drivers should not cache
+the value of the 'vector' elements over a call to pci_disable_msix().
+
+A device driver must always call free_irq() on the interrupt(s)
+for which it has called request_irq() before calling this function.
+Failure to do so will result in a BUG_ON(), the device will be left with
+MSI enabled and will leak its vector.
+
+4.3.3 The MSI-X Table
+
+The MSI-X capability specifies a BAR and offset within that BAR for the
+MSI-X Table. This address is mapped by the PCI subsystem, and should be
+considered to be off limits to the device driver. If the driver wishes to
+mask or unmask an interrupt, it should call disable_irq() / enable_irq().
+
+4.4 Handling devices implementing both MSI and MSI-X capabilities
+
+If a device implements both MSI and MSI-X capabilities, it can
+run in either MSI mode or MSI-X mode but not both simultaneously.
+This is a requirement of the PCI spec, and it is enforced by the
+PCI layer. Calling pci_enable_msi() when MSI-X is already enabled or
+pci_enable_msix() when MSI is already enabled will result in an error.
+If a device driver wishes to switch between MSI and MSI-X at runtime,
+it must first quiesce the device, then switch it back to pin-interrupt
+mode, before calling pci_enable_msi() or pci_enable_msix() and resuming
+operation. This is not expected to be a common operation but may be
+useful for debugging or testing during development.
+
+4.5 Considerations when using MSIs
+
+4.5.1 Choosing between MSI-X and MSI
+
+If your device supports both MSI-X and MSI capabilities, you should use
+the MSI-X facilities in preference to the MSI facilities. As mentioned
+above, MSI-X supports any number of interrupts between 1 and 2048.
+In constrast, MSI is restricted to a maximum of 32 interrupts (and
+must be a power of two). In addition, the MSI interrupt vectors must
+be allocated consecutively, so the system may not be able to allocate
+as many vectors for MSI as it could for MSI-X. On some platforms, MSI
+interrupts must all be targetted at the same set of CPUs whereas MSI-X
+interrupts can all be targetted at different CPUs.
+
+4.5.2 Spinlocks
+
+Most device drivers have a per-device spinlock which is taken in the
+interrupt handler. With pin-based interrupts or a single MSI, it is not
+necessary to disable interrupts (Linux guarantees the same interrupt will
+not be re-entered). If a device uses multiple interrupts, the driver
+must disable interrupts while the lock is held. If the device sends
+a different interrupt, the driver will deadlock trying to recursively
+acquire the spinlock.
+
+There are two solutions. The first is to take the
+lock with spin_lock_irqsave() or spin_lock_irq() (see
+Documentation/DocBook/kernel-locking). The second is to specify
+IRQF_DISABLED to request_irq() so that the kernel runs the entire
+interrupt routine with interrupts disabled.
+
+If your MSI interrupt routine does not hold the lock for the whole time
+it is running, the first solution may be best. The second solution is
+normally preferred as it avoids making two transitions from interrupt
+disabled to enabled and back again.
+
+4.6 How to tell whether MSI/MSI-X is enabled on a device
+
+Using lspci -v (as root) will show some devices with "Message Signalled
+Interrupts" and others with "MSI-X". Each of these capabilities have an
+'Enable' flag which will be followed with either "+" (enabled) or "-"
+(disabled).
+
+
+5. MSI quirks
+
+Several PCI chipsets or devices are known not to support MSI.
+The PCI stack provides three possible ways to disable MSIs :
+* on a single device
+* on all devices behind a specific bridge
+* globally
+
+5.1. Disabling MSI on a single device
+
+Under some circumstances it might be required to disable MSI on a single
+device. This may be achieved by either not calling pci_enable_msi()
+for this device, or setting the pci_dev->no_msi flag before (most of
+the time in a quirk).
+
+5.2. Disabling MSI below a bridge
+
+The vast majority of MSI quirks are required by PCI bridges not
+being able to route MSI between busses. In this case, MSI have to be
+disabled on all devices behind this bridge. It is achieves by setting
+the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge
+subordinate bus. There is no need to set the same flag on bridges that
+are below the broken bridge. When pci_enable_msi() is called to enable
+MSI on a device, pci_msi_supported() takes care of checking the NO_MSI
+flag in all parent busses of the device.
+
+Some bridges actually support dynamic MSI support enabling/disabling
+by changing some bits in their PCI configuration space (especially
+the Hypertransport chipsets such as the nVidia nForce and Serverworks
+HT2000). It may then be required to update the NO_MSI flag on the
+corresponding devices in the sysfs hierarchy. To enable MSI support
+on device "0000:00:0e", do:
+
+ echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus
+
+To disable MSI support, echo 0 instead of 1. It should be
+used with caution since changing this value might break interrupts.
+
+5.3. Disabling MSI globally
+
+Some extreme cases may require to disable MSI globally on the system.
+For now, the only known case is a Serverworks PCI-X chipsets (MSI are
+not supported on several busses that are not all connected to the
+chipset in the Linux PCI hierarchy). In the vast majority of other
+cases, disabling only behind a specific bridge is enough.
+
+For debugging purpose, the user may also pass pci=nomsi on the kernel
+command-line to explicitly disable MSI globally. But, once the appro-
+priate quirks are added to the kernel, this option should not be
+required anymore.
+
+5.4. Finding why MSI cannot be enabled on a device
+
+Assuming that MSI are not enabled on a device, you should look at
+dmesg to find messages that quirks may output when disabling MSI
+on some devices, some bridges or even globally.
+Then, lspci -t gives the list of bridges above a device. Reading
+/sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI
+are enabled (1) or disabled (0). In 0 is found in a single bridge
+msi_bus file above the device, MSI cannot be enabled.
+
--
1.5.5.4

2008-07-11 21:19:47

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 5/6] AHCI: Request multiple MSIs

AHCI controllers can support up to 16 interrupts, one per port. This
saves us a readl() in the interrupt path to determine which port has
generated the interrupt.

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/ata/ahci.c | 104 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 99 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 5e6468a..a3f6331 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -102,6 +102,7 @@ enum {
/* HOST_CTL bits */
HOST_RESET = (1 << 0), /* reset controller; self-clear */
HOST_IRQ_EN = (1 << 1), /* global IRQ enable */
+ HOST_MSI_RSM = (1 << 2), /* Revert to Single Message */
HOST_AHCI_EN = (1 << 31), /* AHCI enabled */

/* HOST_CAP bits */
@@ -1771,6 +1772,19 @@ static void ahci_port_intr(struct ata_port *ap)
}
}

+static irqreturn_t ahci_msi_interrupt(int irq, void *dev_instance)
+{
+ struct ata_host *host = dev_instance;
+ int port = get_irq_subchannel(irq);
+ struct ata_port *ap = host->ports[port];
+
+ spin_lock(&ap->host->lock);
+ ahci_port_intr(ap);
+ spin_unlock(&ap->host->lock);
+
+ return IRQ_HANDLED;
+}
+
static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
{
struct ata_host *host = dev_instance;
@@ -2220,6 +2234,80 @@ static void ahci_p5wdh_workaround(struct ata_host *host)
}
}

+static int ahci_request_irq(struct pci_dev *pdev, struct ata_host *host,
+ int n_irqs)
+{
+ int rc;
+
+ if (n_irqs) {
+ rc = pci_enable_msi_block(pdev, n_irqs);
+ if (rc)
+ return rc;
+ }
+
+ if (n_irqs > 1) {
+ void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
+ u32 host_ctl = readl(mmio + HOST_CTL);
+ if (host_ctl & HOST_MSI_RSM)
+ goto try_different;
+ rc = devm_request_irq(host->dev, pdev->irq, ahci_msi_interrupt,
+ IRQF_DISABLED,
+ dev_driver_string(host->dev), host);
+ } else {
+ rc = devm_request_irq(host->dev, pdev->irq, ahci_interrupt,
+ IRQF_SHARED,
+ dev_driver_string(host->dev), host);
+ }
+
+ if (!rc)
+ return 0;
+
+ try_different:
+ pci_disable_msi(pdev);
+ pci_intx(pdev, 1);
+ return (n_irqs == 1) ? rc : 1;
+}
+
+/*
+ * We only need one interrupt per port (plus one for CCC which isn't
+ * supported yet), but some AHCI controllers refuse to use multiple MSIs
+ * unless they get the maximum number of interrupts. We don't support the
+ * mode where fewer interrupts are assigned than we have ports.
+ */
+static int ahci_setup_irqs(struct pci_dev *pdev, struct ata_host *host)
+{
+ struct ahci_host_priv *hpriv = host->private_data;
+ int n_irqs, pos, rc;
+ u16 control;
+
+ if (hpriv->flags & AHCI_HFLAG_NO_MSI)
+ goto no_msi;
+
+ rc = ahci_request_irq(pdev, host, host->n_ports);
+ if (rc == 0)
+ return 0;
+ if (rc < 0)
+ goto no_msi;
+
+ /* Find out how many it might want */
+ pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+ pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &control);
+ n_irqs = 1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1);
+
+ rc = ahci_request_irq(pdev, host, n_irqs);
+ if (rc == 0)
+ return 0;
+ if (rc < 0)
+ goto no_msi;
+
+ rc = ahci_request_irq(pdev, host, 1);
+ if (rc == 0)
+ return 0;
+
+ no_msi:
+ return ahci_request_irq(pdev, host, 0);
+}
+
static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
{
static int printed_version;
@@ -2278,9 +2366,6 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
(pdev->revision == 0xa1 || pdev->revision == 0xa2))
hpriv->flags |= AHCI_HFLAG_NO_MSI;

- if ((hpriv->flags & AHCI_HFLAG_NO_MSI) || pci_enable_msi(pdev))
- pci_intx(pdev, 1);
-
/* save initial config */
ahci_save_initial_config(pdev, hpriv);

@@ -2335,8 +2420,17 @@ static int ahci_init_one(struct pci_dev *pdev, const struct pci_device_id *ent)
ahci_print_info(host);

pci_set_master(pdev);
- return ata_host_activate(host, pdev->irq, ahci_interrupt, IRQF_SHARED,
- &ahci_sht);
+
+ rc = ata_host_start(host);
+ if (rc)
+ return rc;
+
+ rc = ahci_setup_irqs(pdev, host);
+ if (rc)
+ return rc;
+
+ rc = ata_host_register(host, &ahci_sht);
+ return rc;
}

static int __init ahci_init(void)
--
1.5.5.4

2008-07-11 21:20:36

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 3/6] PCI MSI: Add support for multiple messages

Add the new API pci_enable_msi_block() to allow drivers to
request multiple MSI and reimplement pci_enable_msi in terms of
pci_enable_msi_block. Ensure that the architecture back ends don't
have to know about multiple MSI.

Signed-off-by: Matthew Wilcox <[email protected]>
---
arch/powerpc/kernel/msi.c | 4 ++
drivers/pci/msi.c | 69 +++++++++++++++++++++++++++++++-------------
drivers/pci/msi.h | 4 --
include/linux/msi.h | 1 +
include/linux/pci.h | 6 ++-
5 files changed, 57 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index c62d101..5bc8dda 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -19,6 +19,10 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
return -ENOSYS;
}

+ /* PowerPC doesn't support multiple MSI yet */
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
+
if (ppc_md.msi_check_device) {
pr_debug("msi: Using platform check routine.\n");
return ppc_md.msi_check_device(dev, nvec, type);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 104f297..853a0f2 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -45,6 +45,13 @@ arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
struct msi_desc *entry;
int ret;

+ /*
+ * If an architecture wants to support multiple MSI, it needs to
+ * override arch_setup_msi_irqs()
+ */
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
+
list_for_each_entry(entry, &dev->msi_list, list) {
ret = arch_setup_msi_irq(dev, entry);
if (ret)
@@ -186,6 +193,12 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
} else {
struct pci_dev *dev = entry->dev;
int pos = entry->msi_attrib.pos;
+ u16 msgctl;
+
+ pci_read_config_word(dev, msi_control_reg(pos), &msgctl);
+ msgctl &= ~PCI_MSI_FLAGS_QSIZE;
+ msgctl |= entry->msi_attrib.multiple << 4;
+ pci_write_config_word(dev, msi_control_reg(pos), msgctl);

pci_write_config_dword(dev, msi_lower_address_reg(pos),
msg->address_lo);
@@ -301,13 +314,15 @@ EXPORT_SYMBOL_GPL(pci_restore_msi_state);
/**
* msi_capability_init - configure device's MSI capability structure
* @dev: pointer to the pci_dev data structure of MSI device function
+ * @nvec: number of interrupts to allocate
*
- * Setup the MSI capability structure of device function with a single
- * MSI irq, regardless of device function is capable of handling
- * multiple messages. A return of zero indicates the successful setup
- * of an entry zero with the new MSI irq or non-zero for otherwise.
- **/
-static int msi_capability_init(struct pci_dev *dev)
+ * Setup the MSI capability structure of the device with the requested
+ * number of interrupts. A return value of zero indicates the successful
+ * setup of an entry with the new MSI irq. A negative return value indicates
+ * an error, and a positive return value indicates the number of interrupts
+ * which could have been allocated.
+ */
+static int msi_capability_init(struct pci_dev *dev, int nvec)
{
struct msi_desc *entry;
int pos, ret;
@@ -351,7 +366,7 @@ static int msi_capability_init(struct pci_dev *dev)
list_add_tail(&entry->list, &dev->msi_list);

/* Configure MSI capability structure */
- ret = arch_setup_msi_irqs(dev, 1, PCI_CAP_ID_MSI);
+ ret = arch_setup_msi_irqs(dev, nvec, PCI_CAP_ID_MSI);
if (ret) {
msi_free_irqs(dev);
return ret;
@@ -503,36 +518,48 @@ static int pci_msi_check_device(struct pci_dev* dev, int nvec, int type)
}

/**
- * pci_enable_msi - configure device's MSI capability structure
- * @dev: pointer to the pci_dev data structure of MSI device function
+ * pci_enable_msi_block - configure device's MSI capability structure
+ * @dev: device to configure
+ * @nvec: number of interrupts to configure
*
- * Setup the MSI capability structure of device function with
- * a single MSI irq upon its software driver call to request for
- * MSI mode enabled on its hardware device function. A return of zero
- * indicates the successful setup of an entry zero with the new MSI
- * irq or non-zero for otherwise.
- **/
-int pci_enable_msi(struct pci_dev* dev)
+ * Allocate IRQs for a device with the MSI capability.
+ * This function returns a negative errno if an error occurs. If it
+ * is unable to allocate the number of interrupts requested, it returns
+ * the number of interrupts it might be able to allocate. If it successfully
+ * allocates at least the number of interrupt vectors requested, it returns
+ * 0 and updates the @dev's irq member to the new interrupt number.
+ */
+int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
{
- int status;
+ int status, pos, maxvec;
+ u16 msgctl;
+
+ pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
+ if (!pos)
+ return -EINVAL;
+ pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
+ maxvec = 1 << ((msgctl & PCI_MSI_FLAGS_QMASK) >> 1);
+ if (nvec > maxvec)
+ return maxvec;

- status = pci_msi_check_device(dev, 1, PCI_CAP_ID_MSI);
+ status = pci_msi_check_device(dev, nvec, PCI_CAP_ID_MSI);
if (status)
return status;

WARN_ON(!!dev->msi_enabled);

- /* Check whether driver already requested for MSI-X irqs */
+ /* Check whether driver already requested MSI-X irqs */
if (dev->msix_enabled) {
printk(KERN_INFO "PCI: %s: Can't enable MSI. "
"Device already has MSI-X enabled\n",
pci_name(dev));
return -EINVAL;
}
- status = msi_capability_init(dev);
+
+ status = msi_capability_init(dev, nvec);
return status;
}
-EXPORT_SYMBOL(pci_enable_msi);
+EXPORT_SYMBOL(pci_enable_msi_block);

void pci_msi_shutdown(struct pci_dev* dev)
{
diff --git a/drivers/pci/msi.h b/drivers/pci/msi.h
index 3898f52..b72e0bd 100644
--- a/drivers/pci/msi.h
+++ b/drivers/pci/msi.h
@@ -22,12 +22,8 @@
#define msi_disable(control) control &= ~PCI_MSI_FLAGS_ENABLE
#define multi_msi_capable(control) \
(1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1))
-#define multi_msi_enable(control, num) \
- control |= (((num >> 1) << 4) & PCI_MSI_FLAGS_QSIZE);
#define is_64bit_address(control) (!!(control & PCI_MSI_FLAGS_64BIT))
#define is_mask_bit_support(control) (!!(control & PCI_MSI_FLAGS_MASKBIT))
-#define msi_enable(control, num) multi_msi_enable(control, num); \
- control |= PCI_MSI_FLAGS_ENABLE

#define msix_table_offset_reg(base) (base + 0x04)
#define msix_pba_offset_reg(base) (base + 0x08)
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 13a8a1b..70fcebc 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -18,6 +18,7 @@ extern void write_msi_msg(unsigned int irq, struct msi_msg *msg);
struct msi_desc {
struct {
__u8 is_msix : 1;
+ __u8 multiple: 3; /* log2 number of messages */
__u8 maskbit : 1; /* mask-pending bit supported ? */
__u8 masked : 1;
__u8 is_64 : 1; /* Address size: 0=32bit 1=64bit */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d18b1dd..16df5f9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -699,7 +699,7 @@ struct msix_entry {

#ifndef CONFIG_PCI_MSI
-static inline int pci_enable_msi(struct pci_dev *dev)
+static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
{
return -1;
}
@@ -726,7 +726,7 @@ static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev)
static inline void pci_restore_msi_state(struct pci_dev *dev)
{ }
#else
-extern int pci_enable_msi(struct pci_dev *dev);
+extern int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
extern void pci_msi_shutdown(struct pci_dev *dev);
extern void pci_disable_msi(struct pci_dev *dev);
extern int pci_enable_msix(struct pci_dev *dev,
@@ -737,6 +737,8 @@ extern void msi_remove_pci_irq_vectors(struct pci_dev *dev);
extern void pci_restore_msi_state(struct pci_dev *dev);
#endif

+#define pci_enable_msi(pdev) pci_enable_msi_block(pdev, 1)
+
#ifdef CONFIG_HT_IRQ
/* The functions a driver should call */
int ht_create_irq(struct pci_dev *dev, int idx);
--
1.5.5.4

2008-07-11 21:20:09

by Matthew Wilcox

[permalink] [raw]

Subject: [PATCH 2/6] PCI MSI: Replace 'type' with 'is_msix'

By changing from a 5-bit field to a 1-bit field, we free up some bits
that can be used by a later patch. Also rearrange the fields for better
packing on 64-bit platforms (reducing the size of msi_desc from 72 bytes
to 64 bytes).

Signed-off-by: Matthew Wilcox <[email protected]>
---
drivers/pci/msi.c | 100 ++++++++++++++++----------------------------------
include/linux/msi.h | 4 +-
2 files changed, 34 insertions(+), 70 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 8c61304..104f297 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -106,20 +106,10 @@ static void msix_flush_writes(unsigned int irq)

entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
- /* nothing to do */
- break;
- case PCI_CAP_ID_MSIX:
- {
+ if (entry->msi_attrib.is_msix) {
int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
readl(entry->mask_base + offset);
- break;
- }
- default:
- BUG();
- break;
}
}

@@ -129,8 +119,12 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)

entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ if (entry->msi_attrib.is_msix) {
+ int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
+ writel(flag, entry->mask_base + offset);
+ readl(entry->mask_base + offset);
+ } else {
if (entry->msi_attrib.maskbit) {
int pos;
u32 mask_bits;
@@ -143,18 +137,6 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
} else {
msi_set_enable(entry->dev, !flag);
}
- break;
- case PCI_CAP_ID_MSIX:
- {
- int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
- PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
- writel(flag, entry->mask_base + offset);
- readl(entry->mask_base + offset);
- break;
- }
- default:
- BUG();
- break;
}
entry->msi_attrib.masked = !!flag;
}
@@ -162,9 +144,14 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
void read_msi_msg(unsigned int irq, struct msi_msg *msg)
{
struct msi_desc *entry = get_irq_msi(irq);
- switch(entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
- {
+ if (entry->msi_attrib.is_msix) {
+ void __iomem *base = entry->mask_base +
+ entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
+
+ msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
+ msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+ msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET);
+ } else {
struct pci_dev *dev = entry->dev;
int pos = entry->msi_attrib.pos;
u16 data;
@@ -180,30 +167,23 @@ void read_msi_msg(unsigned int irq, struct msi_msg *msg)
pci_read_config_word(dev, msi_data_reg(pos, 0), &data);
}
msg->data = data;
- break;
- }
- case PCI_CAP_ID_MSIX:
- {
- void __iomem *base;
- base = entry->mask_base +
- entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
-
- msg->address_lo = readl(base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- msg->address_hi = readl(base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- msg->data = readl(base + PCI_MSIX_ENTRY_DATA_OFFSET);
- break;
- }
- default:
- BUG();
}
}

void write_msi_msg(unsigned int irq, struct msi_msg *msg)
{
struct msi_desc *entry = get_irq_msi(irq);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
- {
+ if (entry->msi_attrib.is_msix) {
+ void __iomem *base;
+ base = entry->mask_base +
+ entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
+
+ writel(msg->address_lo,
+ base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
+ writel(msg->address_hi,
+ base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
+ writel(msg->data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
+ } else {
struct pci_dev *dev = entry->dev;
int pos = entry->msi_attrib.pos;

@@ -218,23 +198,6 @@ void write_msi_msg(unsigned int irq, struct msi_msg *msg)
pci_write_config_word(dev, msi_data_reg(pos, 0),
msg->data);
}
- break;
- }
- case PCI_CAP_ID_MSIX:
- {
- void __iomem *base;
- base = entry->mask_base +
- entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE;
-
- writel(msg->address_lo,
- base + PCI_MSIX_ENTRY_LOWER_ADDR_OFFSET);
- writel(msg->address_hi,
- base + PCI_MSIX_ENTRY_UPPER_ADDR_OFFSET);
- writel(msg->data, base + PCI_MSIX_ENTRY_DATA_OFFSET);
- break;
- }
- default:
- BUG();
}
entry->msg = *msg;
}
@@ -359,7 +322,7 @@ static int msi_capability_init(struct pci_dev *dev)
if (!entry)
return -ENOMEM;

- entry->msi_attrib.type = PCI_CAP_ID_MSI;
+ entry->msi_attrib.is_msix = 0;
entry->msi_attrib.is_64 = is_64bit_address(control);
entry->msi_attrib.entry_nr = 0;
entry->msi_attrib.maskbit = is_mask_bit_support(control);
@@ -446,7 +409,7 @@ static int msix_capability_init(struct pci_dev *dev,
break;

j = entries[i].entry;
- entry->msi_attrib.type = PCI_CAP_ID_MSIX;
+ entry->msi_attrib.is_msix = 1;
entry->msi_attrib.is_64 = 1;
entry->msi_attrib.entry_nr = j;
entry->msi_attrib.maskbit = 1;
@@ -589,12 +552,13 @@ void pci_msi_shutdown(struct pci_dev* dev)
u32 mask = entry->msi_attrib.maskbits_mask;
msi_set_mask_bits(dev->irq, mask, ~mask);
}
- if (!entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI)
+ if (!entry->dev || entry->msi_attrib.is_msix)
return;

/* Restore dev->irq to its default pin-assertion irq */
dev->irq = entry->msi_attrib.default_irq;
}
+
void pci_disable_msi(struct pci_dev* dev)
{
struct msi_desc *entry;
@@ -605,7 +569,7 @@ void pci_disable_msi(struct pci_dev* dev)
pci_msi_shutdown(dev);

entry = list_entry(dev->msi_list.next, struct msi_desc, list);
- if (!entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI)
+ if (!entry->dev || entry->msi_attrib.is_msix)
return;

msi_free_irqs(dev);
@@ -624,7 +588,7 @@ static int msi_free_irqs(struct pci_dev* dev)
arch_teardown_msi_irqs(dev);

list_for_each_entry_safe(entry, tmp, &dev->msi_list, list) {
- if (entry->msi_attrib.type == PCI_CAP_ID_MSIX) {
+ if (entry->msi_attrib.is_msix) {
writel(1, entry->mask_base + entry->msi_attrib.entry_nr
* PCI_MSIX_ENTRY_SIZE
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET);
diff --git a/include/linux/msi.h b/include/linux/msi.h
index 8f29392..13a8a1b 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -17,13 +17,13 @@ extern void write_msi_msg(unsigned int irq, struct msi_msg *msg);

struct msi_desc {
struct {
- __u8 type : 5; /* {0: unused, 5h:MSI, 11h:MSI-X} */
+ __u8 is_msix : 1;
__u8 maskbit : 1; /* mask-pending bit supported ? */
__u8 masked : 1;
__u8 is_64 : 1; /* Address size: 0=32bit 1=64bit */
__u8 pos; /* Location of the msi capability */
- __u32 maskbits_mask; /* mask bits mask */
__u16 entry_nr; /* specific enabled entry */
+ __u32 maskbits_mask; /* mask bits mask */
unsigned default_irq; /* default pre-assigned irq */
}msi_attrib;

--
1.5.5.4

2008-07-12 05:55:34

[permalink] [raw]

Subject: Re: [PATCH 6/6] x86-64: Support for multiple MSIs

* Matthew Wilcox <[email protected]> wrote:

> Add support for allocating an aligned block of interrupt vectors.
> Allow interrupts to have up to 32 subchannels. Implement the
> arch_setup_msi_irqs() and arch_teardown_msi_irqs() interfaces.
>
> Signed-off-by: Matthew Wilcox <[email protected]>
> ---
> arch/x86/kernel/io_apic_64.c | 221 +++++++++++++++++++++++++++++++++++------
> arch/x86/kernel/irq_64.c | 2 +-
> include/asm-x86/irq_64.h | 2 +
> 3 files changed, 191 insertions(+), 34 deletions(-)

hm, please implement this symmetrically on 64-bit and 32-bit as well.
That will need more IRQ infrastructure changes but that should be OK as
it will reduce the gap between the 32-bit and 64-bit side instead of
widening it. A good starting point would be to unify/generalize the IRQ
vector allocators. (and that would get most of what you need for
multi-vector MSI)

Ingo

2008-07-12 07:38:29

by Benjamin Herrenschmidt

[permalink] [raw]

Subject: Re: Multiple MSI, take 4

On Fri, 2008-07-11 at 15:16 -0600, Matthew Wilcox wrote:
> Here we go with take 4. Changes:
>
> - Check the requested number of interrupts against the maximum number
> the device claims to support. Thanks to Hidetoshi Seto for pointing
> out this oversight.
> - Implemented Eric's suggestion of using a single IRQ and storing the
> data with it.
> - As a result, don't try to support the mode in the AHCI driver where
> the excess ports all share the last interrupt. It could be done, but
> it would be rather messy and I don't have hardware that supports that
> mode anyway.
>
> I'm fairly comfortable with the subchannel notion we're introducing
> here. It's more flexible than MSI and doesn't impose a penalty on
> architectures which don't implement it. It makes some things more
> complex, but it makes other things simpler, so I think it's a wash from
> a cleanliness standpoint.

I prefer you initial approach. If those are effectively one interrupt,
you end up with the whole IRQ_INPROGRESS logic going bonkers trying to
prevent them from occuring at the same time and possibly losing some.

I think the masking "issue" is mostly a non-issue as I explained in
other emails.

Cheers,
Ben.

2008-07-12 16:30:36

by Matthew Wilcox

[permalink] [raw]

Subject: Re: [PATCH 6/6] x86-64: Support for multiple MSIs

On Sat, Jul 12, 2008 at 07:54:44AM +0200, Ingo Molnar wrote:
> hm, please implement this symmetrically on 64-bit and 32-bit as well.
> That will need more IRQ infrastructure changes but that should be OK as
> it will reduce the gap between the 32-bit and 64-bit side instead of
> widening it. A good starting point would be to unify/generalize the IRQ
> vector allocators. (and that would get most of what you need for
> multi-vector MSI)

OK, I'll look into unifying them on Monday. Thanks for the review.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-07-12 18:44:46

[permalink] [raw]

Subject: Re: [PATCH 6/6] x86-64: Support for multiple MSIs

On Sat, Jul 12, 2008 at 9:30 AM, Matthew Wilcox <[email protected]> wrote:
> On Sat, Jul 12, 2008 at 07:54:44AM +0200, Ingo Molnar wrote:
>> hm, please implement this symmetrically on 64-bit and 32-bit as well.
>> That will need more IRQ infrastructure changes but that should be OK as
>> it will reduce the gap between the 32-bit and 64-bit side instead of
>> widening it. A good starting point would be to unify/generalize the IRQ
>> vector allocators. (and that would get most of what you need for
>> multi-vector MSI)
>
> OK, I'll look into unifying them on Monday. Thanks for the review.
>
to make ioapic_32.c like ioapic_64.c
we should remove the irqbalance in kernel for 32bit?

YH

2009-03-03 00:16:31

by Michael Ellerman

[permalink] [raw]

Subject: Re: [PATCH 2/6] PCI MSI: Replace 'type' with 'is_msix'

On Mon, 2009-02-23 at 12:27 -0500, Matthew Wilcox wrote:
> By changing from a 5-bit field to a 1-bit field, we free up some bits
> that can be used by a later patch. Also rearrange the fields for better
> packing on 64-bit platforms (reducing the size of msi_desc from 72 bytes
> to 64 bytes).
>
> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index dceea56..b3db438 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c

> @@ -393,7 +355,7 @@ static int msi_capability_init(struct pci_dev *dev)
> if (!entry)
> return -ENOMEM;
>
> - entry->msi_attrib.type = PCI_CAP_ID_MSI;
> + entry->msi_attrib.is_msix = 0;

This isn't strictly necessary given we kzalloc'ed the entry, but no
biggie.

Looks good otherwise.

cheers

--
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

Attachments:

signature.asc (197.00 B)
This is a digitally signed message part