2008-07-31 21:20:30

by Alan Mayer

Subject: [PATCH] x86_64: Dynamically allocate arch specific system vectors



Subject: [PATCH] x86_64: Dynamically allocate arch specific system vectors

From: Alan Mayer <[email protected]>

On some systems (e.g., UV) it is necessary to use an interrupt vector
as a "system" vector, that is, one generated by system hardware rather than
an IO device. This patch allocates such vectors dynamically from the pool of
interrupt vectors below the fixed system vectors. This may include stealing
some vectors from the device interrupt vector pool; because the allocation is
dynamic, other archs don't have to pay that price. On UV, examples of the
hardware and software subsystems that need dynamically allocated vectors are
the GRU, the BAU, and XPM/XPC.
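
For reference, a driver is expected to use the new interface roughly as
follows (illustrative sketch only; the uv_example_* names are made up):

    static int uv_example_irq = -1;
    static int uv_example_vector;

    static irqreturn_t uv_example_interrupt(int irq, void *dev_id)
    {
            /* the ack_apic irq_chip EOIs the APIC after this handler runs */
            return IRQ_HANDLED;
    }

    static int uv_example_init(void)
    {
            cpumask_t mask = cpu_possible_map;
            int err;

            /* map an irq onto a dynamically allocated system vector */
            uv_example_irq = alloc_irq_system_vector(system_vector_pri_low,
                                    &mask, &uv_example_vector, "uv_example");
            if (uv_example_irq < 0)
                    return uv_example_irq;

            err = request_irq(uv_example_irq, uv_example_interrupt,
                              IRQF_NOBALANCING, "uv_example", NULL);
            if (err) {
                    free_irq_system_vector(uv_example_irq);
                    return err;
            }

            /* ... program the hardware to raise uv_example_vector ... */
            return 0;
    }

    static void uv_example_exit(void)
    {
            free_irq(uv_example_irq, NULL);         /* free the irqaction first */
            free_irq_system_vector(uv_example_irq); /* then release the irq/vector */
    }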

Signed-off-by: Alan Mayer <[email protected]>

Reviewed-by: Robin Holt <[email protected]>
Reviewed-by: Dean Nelson <[email protected]>
Reviewed-by: Cliff Wickman <[email protected]>

---
Index: temp/arch/x86/kernel/io_apic_64.c
===================================================================
--- temp.orig/arch/x86/kernel/io_apic_64.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/io_apic_64.c 2008-07-31 11:54:27.000000000 -0500
@@ -85,10 +85,6 @@

static int assign_irq_vector(int irq, cpumask_t mask);

-int first_system_vector = 0xfe;
-
-char system_vectors[NR_VECTORS] = { [0 ... NR_VECTORS-1] = SYS_VECTOR_FREE};
-
#define __apicdebuginit __init

int sis_apic_bug; /* not actually supported, dummy for compile */
@@ -783,7 +779,8 @@
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_vector = FIRST_DYNAMIC_VECTOR;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu;
struct irq_cfg *cfg;
@@ -817,10 +814,10 @@
offset = current_offset;
next:
vector += 8;
- if (vector >= first_system_vector) {
+ if (vector > last_dynamic_device_vector) {
/* If we run out of vectors on large boxen, must share them. */
offset = (offset + 1) % 8;
- vector = FIRST_DEVICE_VECTOR + offset;
+ vector = FIRST_DYNAMIC_VECTOR + offset;
}
if (unlikely(current_vector == vector))
continue;
@@ -856,6 +853,161 @@
return err;
}

+/*
+ * NOP functions
+ */
+static void noop(unsigned int irq)
+{
+}
+
+static unsigned int noop_ret(unsigned int irq)
+{
+ return 0;
+}
+
+static void ack_apic(unsigned int irq)
+{
+ ack_APIC_irq();
+}
+
+/*
+ * For dynamic allocation of system vectors where
+ * an ack_APIC_irq() is needed after handling the IRQ
+ */
+static struct irq_chip ack_apic_chip = {
+ .name = " ack_apic",
+ .startup = noop_ret,
+ .shutdown = noop,
+ .enable = noop,
+ .disable = noop,
+ .ack = noop,
+ .mask = noop,
+ .unmask = noop,
+ .eoi = ack_apic,
+ .end = noop,
+};
+
+/*
+ * Dynamically allocate an irq and a vector from the system vector range.
+ * (The irq is not to be shared)
+ *
+ * After calling this function, the caller is responsible for any needed
+ * calls to:
+ * set_irq_data(&any_driver_data);
+ * set_irq_type(irq, IRQ_TYPE...);
+ * Then make the call to request_irq() to create the irqaction:
+ * request_irq(irq, interrupt_handler, irqflags, "devname", NULL);
+ * You might consider the flag IRQF_NOBALANCING.
+ */
+int alloc_irq_system_vector(enum system_vector_pri priority, cpumask_t *mask,
+ int *assigned_vector, char *irqname)
+{
+ unsigned long flags;
+ cpumask_t new_mask;
+ int in_use = 1;
+ int cpu;
+ int irq;
+ int vector;
+ int last_vector;
+ int vector_inc;
+
+ if (priority == system_vector_pri_low) {
+ vector = FIRST_DYNAMIC_VECTOR;
+ last_vector = first_fixed_system_vector;
+ vector_inc = 1;
+ } else if (priority == system_vector_pri_high) {
+ vector = first_fixed_system_vector - 1;
+ last_vector = FIRST_DYNAMIC_VECTOR - 1;
+ vector_inc = -1;
+ } else {
+ return -EINVAL;
+ }
+
+ spin_lock_irqsave(&vector_lock, flags);
+
+ /* locate an available irq */
+ for (irq = NR_IRQS - 1; irq >= 0; irq--) {
+ if (platform_legacy_irq(irq))
+ continue;
+ if (irq_cfg[irq].vector == 0)
+ break;
+ }
+ if (irq == -1)
+ goto out_1;
+
+ cpus_and(new_mask, *mask, cpu_possible_map);
+
+ for (; vector != last_vector; vector += vector_inc) {
+
+ in_use = 0;
+ for_each_cpu_mask_nr(cpu, new_mask) {
+ if (per_cpu(vector_irq, cpu)[vector] != -1) {
+ in_use = 1;
+ break;
+ }
+ }
+ if (!in_use)
+ break;
+ }
+
+ if (in_use)
+ goto out_1;
+
+ irq_cfg[irq].vector = vector;
+ irq_cfg[irq].domain = *mask;
+ for_each_cpu_mask_nr(cpu, new_mask)
+ per_cpu(vector_irq, cpu)[vector] = irq;
+
+ enable_irq(irq);
+ dynamic_irq_init(irq);
+ set_irq_chip_and_handler_name(irq, &ack_apic_chip, handle_percpu_irq,
+ irqname);
+
+ spin_unlock_irqrestore(&vector_lock, flags);
+
+ *assigned_vector = vector;
+ return irq;
+
+out_1:
+ spin_unlock_irqrestore(&vector_lock, flags);
+ return -EBUSY;
+}
+EXPORT_SYMBOL(alloc_irq_system_vector);
+
+/*
+ * Free a dynamically allocated irq system_vector mapping.
+ *
+ * Before calling this function, the caller is responsible for calling
+ * free_irq(irq, dev_id); to free the irqaction.
+ */
+void free_irq_system_vector(int irq)
+{
+ unsigned long flags;
+ cpumask_t mask;
+ int cpu;
+
+ if ((unsigned)irq >= NR_IRQS || irq_cfg[irq].vector == 0)
+ return;
+
+#ifdef CONFIG_SMP
+ synchronize_irq(irq);
+#endif
+ dynamic_irq_cleanup(irq);
+ disable_irq(irq);
+
+ spin_lock_irqsave(&vector_lock, flags);
+
+ cpus_and(mask, irq_cfg[irq].domain, cpu_possible_map);
+ for_each_cpu_mask_nr(cpu, mask)
+ per_cpu(vector_irq, cpu)[irq_cfg[irq].vector] = -1;
+
+ irq_cfg[irq].vector = 0;
+ cpus_clear(irq_cfg[irq].domain);
+
+ spin_unlock_irqrestore(&vector_lock, flags);
+}
+EXPORT_SYMBOL(free_irq_system_vector);
+
static void __clear_irq_vector(int irq)
{
struct irq_cfg *cfg;
Index: temp/include/linux/irq.h
===================================================================
--- temp.orig/include/linux/irq.h 2008-07-31 11:53:58.000000000 -0500
+++ temp/include/linux/irq.h 2008-07-31 11:54:27.000000000 -0500
@@ -352,10 +352,21 @@
extern void set_irq_noprobe(unsigned int irq);
extern void set_irq_probe(unsigned int irq);

-/* Handle dynamic irq creation and destruction */
+/* Handle dynamic irq device vector creation and destruction */
extern int create_irq(void);
extern void destroy_irq(unsigned int irq);

+/* Handle dynamic irq system vector mapping and unmapping */
+
+enum system_vector_pri {
+ system_vector_pri_low,
+ system_vector_pri_high
+};
+
+extern int alloc_irq_system_vector(enum system_vector_pri, cpumask_t *, int *,
+ char *);
+extern void free_irq_system_vector(int);
+
/* Test to see if a driver has successfully requested an irq */
static inline int irq_has_action(unsigned int irq)
{
Index: temp/include/asm-x86/irq_vectors.h
===================================================================
--- temp.orig/include/asm-x86/irq_vectors.h 2008-07-31 11:53:58.000000000 -0500
+++ temp/include/asm-x86/irq_vectors.h 2008-07-31 11:54:27.000000000 -0500
@@ -91,14 +91,37 @@
#define LOCAL_TIMER_VECTOR 0xef

/*
+ * The first device or system vector (lowest numbered) available for dynamic
+ * allocation is defined by FIRST_DYNAMIC_VECTOR.
+ *
+ * The last device vector available for dynamic allocation is defined by
+ * last_dynamic_device_vector, which is initially set to
+ * LAST_DYNAMIC_DEVICE_VECTOR.
+ *
+ * The last system vector available for dynamic allocation is defined by
+ * first_fixed_system_vector - 1. The variable first_fixed_system_vector
+ * is initially set to FIRST_FIXED_SYSTEM_VECTOR.
+ *
+ * SGI-UV uses LAST_UV_DYNAMIC_DEVICE_VECTOR to reserve a range of
+ * vectors that falls between the first_fixed_system_vector and
+ * last_dynamic_device_vector for dynamic system vector allocations.
+ */
+#define FIRST_FIXED_SYSTEM_VECTOR 0xfe
+#define LAST_DYNAMIC_DEVICE_VECTOR FIRST_FIXED_SYSTEM_VECTOR
+#define LAST_UV_DYNAMIC_DEVICE_VECTOR 0xe0
+
+/*
* First APIC vector available to drivers: (vectors 0x30-0xee) we
* start at 0x31(0x41) to spread out vectors evenly between priority
* levels. (0x80 is the syscall vector)
+ *
+ * Device vectors are dynamically allocated as numbers in the range of
+ * FIRST_DYNAMIC_VECTOR to last_dynamic_device_vector (inclusive).
*/
#ifdef CONFIG_X86_32
-# define FIRST_DEVICE_VECTOR 0x31
+# define FIRST_DYNAMIC_VECTOR 0x31
#else
-# define FIRST_DEVICE_VECTOR (IRQ15_VECTOR + 2)
+# define FIRST_DYNAMIC_VECTOR (IRQ15_VECTOR + 2)
#endif

#define NR_VECTORS 256
Index: temp/arch/x86/kernel/io_apic_32.c
===================================================================
--- temp.orig/arch/x86/kernel/io_apic_32.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/io_apic_32.c 2008-07-31 11:54:27.000000000 -0500
@@ -1165,11 +1165,15 @@
}

/* irq_vectors is indexed by the sum of all RTEs in all I/O APICs. */
-static u8 irq_vector[NR_IRQ_VECTORS] __read_mostly = { FIRST_DEVICE_VECTOR , 0 };
+static u8 irq_vector[NR_IRQ_VECTORS] __read_mostly = {
+ FIRST_DYNAMIC_VECTOR,
+ 0
+};

static int __assign_irq_vector(int irq)
{
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset;
+ static int current_vector = FIRST_DYNAMIC_VECTOR;
+ static int current_offset = 0;
int vector, offset;

BUG_ON((unsigned)irq >= NR_IRQ_VECTORS);
@@ -1181,9 +1185,9 @@
offset = current_offset;
next:
vector += 8;
- if (vector >= first_system_vector) {
+ if (vector > last_dynamic_device_vector) {
offset = (offset + 1) % 8;
- vector = FIRST_DEVICE_VECTOR + offset;
+ vector = FIRST_DYNAMIC_VECTOR + offset;
}
if (vector == current_vector)
return -ENOSPC;
@@ -2314,7 +2318,7 @@
int i;

/* Reserve all the system vectors. */
- for (i = first_system_vector; i < NR_VECTORS; i++)
+ for (i = last_dynamic_device_vector + 1; i < NR_VECTORS; i++)
set_bit(i, used_vectors);

enable_IO_APIC();
Index: temp/include/asm-x86/desc.h
===================================================================
--- temp.orig/include/asm-x86/desc.h 2008-07-31 11:53:58.000000000 -0500
+++ temp/include/asm-x86/desc.h 2008-07-31 11:54:27.000000000 -0500
@@ -310,22 +310,25 @@
#define SYS_VECTOR_FREE 0
#define SYS_VECTOR_ALLOCED 1

-extern int first_system_vector;
-extern char system_vectors[];
-
-static inline void alloc_system_vector(int vector)
-{
- if (system_vectors[vector] == SYS_VECTOR_FREE) {
- system_vectors[vector] = SYS_VECTOR_ALLOCED;
- if (first_system_vector > vector)
- first_system_vector = vector;
+extern int last_dynamic_device_vector;
+extern int first_fixed_system_vector;
+extern char fixed_system_vectors[];
+
+static inline void alloc_fixed_system_vector(int vector)
+{
+ if (fixed_system_vectors[vector] == SYS_VECTOR_FREE) {
+ fixed_system_vectors[vector] = SYS_VECTOR_ALLOCED;
+ if (first_fixed_system_vector > vector)
+ first_fixed_system_vector = vector;
+ if (last_dynamic_device_vector >= vector)
+ last_dynamic_device_vector = vector - 1;
} else
BUG();
}

static inline void alloc_intr_gate(unsigned int n, void *addr)
{
- alloc_system_vector(n);
+ alloc_fixed_system_vector(n);
set_intr_gate(n, addr);
}

Index: temp/arch/x86/kernel/apic_32.c
===================================================================
--- temp.orig/arch/x86/kernel/apic_32.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/apic_32.c 2008-07-31 11:54:27.000000000 -0500
@@ -68,9 +68,11 @@
int local_apic_timer_c2_ok;
EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);

-int first_system_vector = 0xfe;
-
-char system_vectors[NR_VECTORS] = { [0 ... NR_VECTORS-1] = SYS_VECTOR_FREE};
+int last_dynamic_device_vector = LAST_DYNAMIC_DEVICE_VECTOR;
+int first_fixed_system_vector = FIRST_FIXED_SYSTEM_VECTOR;
+char fixed_system_vectors[NR_VECTORS] = {
+ [0 ... NR_VECTORS-1] = SYS_VECTOR_FREE
+};

/*
* Debug level, exported for io_apic.c
@@ -1361,7 +1363,7 @@
* IRQ0 must be given a fixed assignment and initialized,
* because it's used before the IO-APIC is set up.
*/
- set_intr_gate(FIRST_DEVICE_VECTOR, interrupt[0]);
+ set_intr_gate(FIRST_DYNAMIC_VECTOR, interrupt[0]);

/*
* The reschedule interrupt is a CPU-to-CPU reschedule-helper
Index: temp/arch/x86/kernel/irqinit_64.c
===================================================================
--- temp.orig/arch/x86/kernel/irqinit_64.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/irqinit_64.c 2008-07-31 11:54:27.000000000 -0500
@@ -22,6 +22,7 @@
#include <asm/desc.h>
#include <asm/apic.h>
#include <asm/i8259.h>
+#include <asm/genapic.h>

/*
* Common place to define all x86 IRQ vectors
@@ -217,6 +218,11 @@
alloc_intr_gate(SPURIOUS_APIC_VECTOR, spurious_interrupt);
alloc_intr_gate(ERROR_APIC_VECTOR, error_interrupt);

+ if (is_uv_system() &&
+ LAST_UV_DYNAMIC_DEVICE_VECTOR < last_dynamic_device_vector) {
+ last_dynamic_device_vector = LAST_UV_DYNAMIC_DEVICE_VECTOR;
+ }
+
if (!acpi_ioapic)
setup_irq(2, &irq2);
}
Index: temp/arch/x86/kernel/vmiclock_32.c
===================================================================
--- temp.orig/arch/x86/kernel/vmiclock_32.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/vmiclock_32.c 2008-07-31 11:54:28.000000000 -0500
@@ -81,7 +81,7 @@
static inline unsigned int vmi_get_timer_vector(void)
{
#ifdef CONFIG_X86_IO_APIC
- return FIRST_DEVICE_VECTOR;
+ return FIRST_DYNAMIC_VECTOR;
#else
return FIRST_EXTERNAL_VECTOR;
#endif
Index: temp/arch/x86/kernel/apic_64.c
===================================================================
--- temp.orig/arch/x86/kernel/apic_64.c 2008-07-31 11:53:58.000000000 -0500
+++ temp/arch/x86/kernel/apic_64.c 2008-07-31 11:54:28.000000000 -0500
@@ -33,6 +33,7 @@
#include <asm/smp.h>
#include <asm/mtrr.h>
#include <asm/mpspec.h>
+#include <asm/desc.h>
#include <asm/hpet.h>
#include <asm/pgalloc.h>
#include <asm/nmi.h>
@@ -58,6 +59,13 @@
int local_apic_timer_c2_ok;
EXPORT_SYMBOL_GPL(local_apic_timer_c2_ok);

+int last_dynamic_device_vector = LAST_DYNAMIC_DEVICE_VECTOR;
+int first_fixed_system_vector = FIRST_FIXED_SYSTEM_VECTOR;
+char fixed_system_vectors[NR_VECTORS] = {
+ [0 ... NR_VECTORS-1] = SYS_VECTOR_FREE
+};
+
+
/*
* Debug level, exported for io_apic.c
*/


2008-07-31 22:13:40

by Eric W. Biederman

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors

Alan Mayer <[email protected]> writes:

> Subject: [PATCH] x86_64: Dynamically allocate arch specific system vectors
>
> From: Alan Mayer <[email protected]>
>
> On some systems (e. g., UV) it is necessary to use an interrupt vector
> as a "system" vector, that is, it is generated by system hardware, not an
> IO device. This patch dynamically allocates them from the pool of interrupt
> vectors below the fixed system vectors. This may include stealing some from
> the device interrupt vector pool, so they are allocated dynamically so that
> other archs don't have to pay the price. In UV, examples of these hardware
> and software systems that need dynamically allocated vectors are the GRU,
> the BAU, and XPM/XPC.

Could you please explain in words why the vector allocator does not work
for you?

This code at first glance looks a lot like duplicating the vector allocator.

Eric

2008-08-01 15:51:22

by Cliff Wickman

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors

On Thu, Jul 31, 2008 at 03:10:23PM -0700, Eric W. Biederman wrote:
> Alan Mayer <[email protected]> writes:
>
> > Subject: [PATCH] x86_64: Dynamically allocate arch specific system vectors
> >
> > From: Alan Mayer <[email protected]>
> >
> > On some systems (e. g., UV) it is necessary to use an interrupt vector
> > as a "system" vector, that is, it is generated by system hardware, not an
> > IO device. This patch dynamically allocates them from the pool of interrupt
> > vectors below the fixed system vectors. This may include stealing some from
> > the device interrupt vector pool, so they are allocated dynamically so that
> > other archs don't have to pay the price. In UV, examples of these hardware
> > and software systems that need dynamically allocated vectors are the GRU,
> > the BAU, and XPM/XPC.
>
> Could you please explain words why the vector allocator does not work
> for you?
>
> This code at first glance looks a lot like duplicating the vector allocator.
>
> Eric

I assume you mean create_irq() and destroy_irq().
They are close to what we need.

However, for any given system use:
- we need to request a high-priority vector for some irqs, rather than
  one randomly allocated as per __assign_irq_vector().
- we want the irq/vector to be targeted to all cpus (as specified in a
  mask, which can include currently offline cpus) rather than a single cpu.

I suppose those abilities could be added to create_irq(), but we didn't
want to intrude into that interface.

A smaller consideration is simplicity of use. We want any such user to use
the generic do_IRQ() flow (not alloc_intr_gate()). But make it easy to set
up the irq/vector, irq_chip and irq_desc without getting intimate with the
details.
I suppose some other wrapper for an enhanced create_irq() could be done.

We are going to need such irq/vector pairs for a couple of UV drivers
(drivers/misc/sgi-gru/ and sgi-xp/), and would prefer it for the UV TLB
shootdown (x86/kernel/tlb_uv.c) rather than using alloc_intr_gate().


-Cliff
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824

2008-08-01 21:03:23

by Eric W. Biederman

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors

Cliff Wickman <[email protected]> writes:
>
> I assume you mean create_irq() and destroy_irq().
> They are close to what we need.

Actually that isn't quite what I was thinking.

> However for any given system use:
> - we need to request a high priority vector for some irq's, rather than
> one randomly allocated as per __assign_irq_vector().

Well I was actually thinking mostly about reusing assign_irq_vector.

The high priority vector seems to be the place where we would need
to reexamine the vector allocator.

> - we want the irq/vector to be targeted to all cpu's (as specified in a
> mask,
> and can include currently offline cpu's) rather than a single cpu.

__assign_irq_vector should be able to handle that case.

> I suppose those abilities could be added to create_irq(), but we didn't
> want to intrude into that interface.

Right. That part gets custom and there is no intersection.

> A smaller consideration is simplicity of use. We want any such user to
> use
> the generic do_IRQ() flow (not alloc_intr_gate()).

My primary concern is that you are generalizing an interface that is
designed for alloc_intr_gate() consumers, in particular for people
who wish to allocate vectors that can be triggered by software, to
do this.

If this is normal irq handling I would really prefer to use the
normal data structures that we use for allocating vectors to irqs,
because that is what you are doing.

> But make it easy to set
> up the irq/vector, irq_chip and irq_desc without getting intimate with
> the details.
> I suppose some other wrapper for an enhanced create_irq() could be done.
>
> We are going to need such irq/vector pairs for a couple of UV drivers
> (drivers/misc/sgi-gru/ and sgi-xp/). And would prefer it for the UV TLB
> shootdown (x86/kernel/tlb.uv.c) rather than using alloc_intr_gate().

I also want to ask: why do you need an irq that fires on every cpu?

That concept seems to be responsible for one of the nastiest data
structures in the kernel, where we track how many times any irq has
occurred on any cpu. Look at the percpu variable kernel_stat. It
seems to be one of the nastiest variables I know of: of size
NR_CPUS*NR_CPUS*32, assuming you have a machine with a reasonable
number of irqs per cpu.

Eric

2008-08-01 21:49:34

by Alan Mayer

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors

Okay, I think we have it now. assign_irq_vector *almost* does what we
need. One minor thing is that assign_irq_vector ANDs against
cpu_online_map. We would need cpu_possible_map, so we get the vector on
offline cpus that may come online. The other thing is that
assign_irq_vector doesn't allow the specification of interrupt
priorities. It would need to be modified to handle returning either a
high priority vector or a low priority vector. Would modifying the API
for assign_irq_vector() be the proper approach?
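
(Concretely, the mask part comes down to changing something like this in
__assign_irq_vector(), if I'm reading io_apic_64.c right:

        cpus_and(mask, mask, cpu_online_map);

into

        cpus_and(mask, mask, cpu_possible_map);

for these vectors, and similarly wherever the per-cpu vector_irq tables
are filled in.)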

The interrupts don't necessarily fire on all cpus, it's just that they
*can* fire on any cpu. For example, the GRU triggers an interrupt (it
is very IPI'ish) to a particular cpu in the event of a GRU TLB fault.
That cpu handles the fault and returns. But the fault can happen on any
cpu, so all cpus need to be registered for the same vector and irq.
This is probably splitting hairs; it is certainly no different in
principle from timer interrupts or processor TLB faults.

As far as kernel_stat is concerned, I see your point. NR_CPUS on our
machines is going to be big (4K? 8K? something like that). NR_IRQS is
also going to be big because of that. It's unfortunate, since the actual
number of interrupt sources is going to be an order of magnitude
smaller, at least.

--ajm

Eric W. Biederman wrote:
> Cliff Wickman <[email protected]> writes:
>> I assume you mean create_irq() and destroy_irq().
>> They are close to what we need.
>
> Actually that isn't quite what I was thinking.
>
>> However for any given system use:
>> - we need to request a high priority vector for some irq's, rather than
>> one randomly allocated as per __assign_irq_vector().
>
> Well I was actually thinking mostly about reusing assign_irq_vector.
>
> The high priority vector seems to be the place where we would need
> to reexamine the vector allocator.
>
>> - we want the irq/vector to be targeted to all cpu's (as specified in a
>> mask,
>> and can include currently offline cpu's) rather than a single cpu.
>
> __assign_irq_vector should be able to handle that case.
>
>> I suppose those abilities could be added to create_irq(), but we didn't
>> want to intrude into that interface.
>
> Right. That part gets custom and there is no intersection.
>
>> A smaller consideration is simplicity of use. We want any such user to
>> use
>> the generic do_IRQ() flow (not alloc_intr_gate()).
>
> My primary concern is that you are generalizing an interface that is
> designed for alloc_intr_gate consumers. In particular for people
> who wish to allocate vectors that can be triggered by software to
> do this.
>
> If this is normal irq handling I would really prefer to use the
> normal data structures that we use for allocating vectors to irqs,
> because that is what you are doing.
>
>> But make it easy to set
>> up the irq/vector, irq_chip and irq_desc without getting intimate with
>> the details.
>> I suppose some other wrapper for an enhanced create_irq() could be done.
>>
>> We are going to need such irq/vector pairs for a couple of UV drivers
>> (drivers/misc/sgi-gru/ and sgi-xp/). And would prefer it for the UV TLB
>> shootdown (x86/kernel/tlb.uv.c) rather than using alloc_intr_gate().
>
> I also want to ask why do you need an irq that fires on every cpu?
>
> That concept seems to be responsible for one of the nastiest data
> structures in the kernel. Where we track how many times any irq has
> occurred on any cpu. Look at the percpu variable kernel_stat. It
> seems to be one of the nastiest variables I know of. Of size:
> NR_CPUS*NR_CPUS*32 Assuming you have a machine with a reasonable
> number of irqs per cpu.
>
> Eric

--
Make me an angel
That flies from Montgomery.
--
Alan J. Mayer
SGI
[email protected]
WORK: 651-683-3131
HOME: 651-407-0134
--

2008-08-01 22:23:31

by Eric W. Biederman

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors

Alan Mayer <[email protected]> writes:

> Okay, I think we have it now. assign_irq_vector *almost* does what we need.
> One minor thing is that assign_irq_vector ANDs against cpu_online_map. We would
> need cpu_possible_map, so we get the vector on offline cpus that may come
> online. The other thing is that assign_irq_vector doesn't allow the
> specification of interrupt priorities. It would need to be modified to handle
> returning either a high priority vector or a low priority vector. Would
> modifying the api for assign_irq_vector be the proper approach?

I don't know if it makes sense to modify assign_irq_vector or to
have a companion function that uses the same data structures.

I think I would work on the companion function and if the code
can be made sufficiently similar merge the two functions.

> The interrupts don't necessarily fire on all cpus, it's just that they *can*
> fire on any cpu. For example, the GRU triggers an interrupt (it is very
> IPI'ish) to a particular cpu in the event of a GRU TLB fault. That cpu handles
> the fault and returns. But the fault can happen on any cpu, so all cpus need to
> be registered for the same vector and irq. This is probably splitting hairs; it
> is certainly no different in principal from timer interrupts or processor TLB
> faults.

Reasonable. As long as you don't need to read a status register to figure
out what to do that sounds reasonable. This does sound very much like
splitting hairs on a very platform specific capability.

If we can generalize the mechanism to things like per cpu timer
interrupts and such so that we reduced the total amount of code we
have to maintain I would find it a very compelling mechanism.

> As far as kernel_stat is concerned. I see you're point. NR_CPUS on our
> machines is going to be big (4K? 8K? something like that). NR_IRQS is also
> going to big because of that. It's unfortunate since the actual number of
> interrupt sources is going to be an order of magnitude smaller, at least.

The number of interrupt sources is going to be smaller only because
SGI machines have, or at least appear to have, poor I/O compared to most
of the rest of the machines in existence. NR_CPUS*16 is a fairly
reasonable estimate on most machines in existence. In the short term
it is going to get worse in the presence of MSI-X. I was talking to a
developer at Intel last week about 256 irqs for one card. I keep
having dreams about finding a way to just keep stats for a few cpus
but alas I don't think that is going to happen. Silly us.

Eric

2008-08-04 19:35:07

by Alan Mayer

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors



Eric W. Biederman wrote:
> Alan Mayer <[email protected]> writes:
>
>> Okay, I think we have it now. assign_irq_vector *almost* does what we need.
>> One minor thing is that assign_irq_vector ANDs against cpu_online_map. We would
>> need cpu_possible_map, so we get the vector on offline cpus that may come
>> online. The other thing is that assign_irq_vector doesn't allow the
>> specification of interrupt priorities. It would need to be modified to handle
>> returning either a high priority vector or a low priority vector. Would
>> modifying the api for assign_irq_vector be the proper approach?
>
> I don't know if it makes sense to modify assign_irq_vector or to
> have a companion function that uses the same data structures.
>
> I think I would work on the companion function and if the code
> can be made sufficiently similar merge the two functions.
>
Okay, if I understand you, here's what we can do. We currently have
this function that does pretty much what the combination of create_irq()
and __assign_irq_vector() does. We can accomplish the same thing that our
routine does using create_irq() and __assign_irq_vector() if we make
the following changes:

__assign_irq_vector(int irq, cpumask_t mask) ==>
__assign_irq_vector(int irq, cpumask_t mask, int priority);

priority has three values: priority_none, priority_low, priority_high.
priority_none means do everything the way it is done now.
priority_low means do everything the way it is done now, except use
cpu_possible_map rather than cpu_online_map.
priority_high means search the interrupt vectors from the top down,
rather than from the bottom up, and use cpu_possible_map rather than
cpu_online_map.

create_irq(void) ==> create_irq(int priority, cpumask_t *mask)
priority_none means do everything the way it is done now, passing in
TARGET_CPUS as the mask, but also sending the priority arg into
__assign_irq_vector().
priority_low and priority_high mean use create_irq()'s mask arg as the
mask passed to __assign_irq_vector().

We would add an additional small routine on top of create_irq() to do
any massaging of the irq_desc, etc. that we need for these system vectors.
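
In header terms, what I have in mind is roughly this (illustrative only,
names not final):

        enum irq_vector_pri {
                priority_none,          /* current behaviour, cpu_online_map */
                priority_low,           /* as now, but against cpu_possible_map */
                priority_high,          /* scan vectors top-down, cpu_possible_map */
        };

        static int __assign_irq_vector(int irq, cpumask_t mask, int priority);

        extern int create_irq(int priority, cpumask_t *mask);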

Is that what you were thinking about?

--ajm

>> The interrupts don't necessarily fire on all cpus, it's just that they *can*
>> fire on any cpu. For example, the GRU triggers an interrupt (it is very
>> IPI'ish) to a particular cpu in the event of a GRU TLB fault. That cpu handles
>> the fault and returns. But the fault can happen on any cpu, so all cpus need to
>> be registered for the same vector and irq. This is probably splitting hairs; it
>> is certainly no different in principal from timer interrupts or processor TLB
>> faults.
>
> Reasonable. As long as you don't need to read a status register to figure
> out what to do that sounds reasonable. This does sound very much like
> splitting hairs on a very platform specific capability.
>
> If we can generalize the mechanism to things like per cpu timer
> interrupts and such so that we reduced the total amount of code we
> have to maintain I would find it a very compelling mechanism.
>
>> As far as kernel_stat is concerned. I see you're point. NR_CPUS on our
>> machines is going to be big (4K? 8K? something like that). NR_IRQS is also
>> going to big because of that. It's unfortunate since the actual number of
>> interrupt sources is going to be an order of magnitude smaller, at least.
>
> The number of interrupts sources is going to be smaller only because
> SGI machines have or at least appear to have poor I/O compared to most
> of the rest of machines in existence. NR_CPUS*16 is a fairly
> reasonable estimate on most machines in existence. In the short term
> it is going to get worse in the presence of MSI-X. I was talking to a
> developer at Intel last week about 256 irqs for one card. I keep
> having dreams about finding a way to just keep stats for a few cpus
> but alas I don't think that is going to happen. Silly us.
>
> Eric
>

--
It's getting to the point
Where I'm no fun anymore.
--
Alan J. Mayer
SGI
[email protected]
WORK: 651-683-3131
HOME: 651-407-0134
--

2008-08-04 20:39:51

by Mike Travis

Subject: Re: [PATCH] x86_64: Dynamically allocate arch specific system vectors



Alan Mayer wrote:
>
>
> Eric W. Biederman wrote:
>> Alan Mayer <[email protected]> writes:
>>
>>> Okay, I think we have it now. assign_irq_vector *almost* does what
>>> we need.
>>> One minor thing is that assign_irq_vector ANDs against
>>> cpu_online_map. We would
>>> need cpu_possible_map, so we get the vector on offline cpus that may
>>> come
>>> online. The other thing is that assign_irq_vector doesn't allow the
>>> specification of interrupt priorities. It would need to be modified
>>> to handle
>>> returning either a high priority vector or a low priority vector. Would
>>> modifying the api for assign_irq_vector be the proper approach?
>>
>> I don't know if it makes sense to modify assign_irq_vector or to have
>> a companion function that uses the same data structures.
>>
>> I think I would work on the companion function and if the code
>> can be made sufficiently similar merge the two functions.
>>
> Okay, If I understand you, here's what we can do. We currently have
> this function that does pretty much what the combination of create_irq()
> and __assign_irq_vector() do. We can accomplish the same thing that our
> routine does using create_irq() and __assign_irq_vector() do if we make
> the following changes:
>
> __assign_irq_vector(int irq, cpumask_t mask) ==>
> __assign_irq_vector(int irq, cpumask_t mask, int priority);
>
> priority has three values: priority_none, priority_low, priority_high
> priority_none means do everything the way it is done now.
> priority_low means do everything the way its is done now, except use
> cpu_possible_map rather than cpu_online_map.
> priority_high means search the interrupt vectors from the top down,
> rather than from the bottom up and use cpu_possible_map rather than
> cpu_online_map.

Checking to ensure that at least one of the cpus is online seems
prudent; also, what happens when the last online cpu in the group
goes offline?
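
For instance (just as an illustration), the allocation path could refuse
a mask with no online cpus:

        if (!cpus_intersects(new_mask, cpu_online_map))
                return -EINVAL;

and the cpu-offline path would presumably need a corresponding check or
a way to migrate the vector.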

Thanks,
Mike
>
> create_irq(void) ==> create_irq(int priority, cpumask_t *mask)
> priority_none, means do everything the way it is done now, passing in
> TARGET_CPUS as the mask, but also sending the priority arg. into
> __assign_irq_vector().
> priority_low and priority_high means use create_irq()'s mask arg. as the
> mask passed to __assign_irq_vector).
>
> We would add an additional small routine on top of create_irq() to do
> any massaging of the irq_desc, etc. that we need for these system vectors.
>
> Is that what you were thinking about?
>
> --ajm
>
>>> The interrupts don't necessarily fire on all cpus, it's just that
>>> they *can*
>>> fire on any cpu. For example, the GRU triggers an interrupt (it is very
>>> IPI'ish) to a particular cpu in the event of a GRU TLB fault. That
>>> cpu handles
>>> the fault and returns. But the fault can happen on any cpu, so all
>>> cpus need to
>>> be registered for the same vector and irq. This is probably splitting
>>> hairs; it
>>> is certainly no different in principal from timer interrupts or
>>> processor TLB
>>> faults.
>>
>> Reasonable. As long as you don't need to read a status register to
>> figure
>> out what to do that sounds reasonable. This does sound very much like
>> splitting hairs on a very platform specific capability.
>>
>> If we can generalize the mechanism to things like per cpu timer
>> interrupts and such so that we reduced the total amount of code we
>> have to maintain I would find it a very compelling mechanism.
>>
>>> As far as kernel_stat is concerned. I see you're point. NR_CPUS on our
>>> machines is going to be big (4K? 8K? something like that). NR_IRQS
>>> is also
>>> going to big because of that. It's unfortunate since the actual
>>> number of
>>> interrupt sources is going to be an order of magnitude smaller, at
>>> least.
>>
>> The number of interrupts sources is going to be smaller only because
>> SGI machines have or at least appear to have poor I/O compared to most
>> of the rest of machines in existence. NR_CPUS*16 is a fairly
>> reasonable estimate on most machines in existence. In the short term
>> it is going to get worse in the presence of MSI-X. I was talking to a
>> developer at Intel last week about 256 irqs for one card. I keep
>> having dreams about finding a way to just keep stats for a few cpus
>> but alas I don't think that is going to happen. Silly us.
>>
>> Eric
>>
>