2006-12-13 14:23:23

by Arjan van de Ven

[permalink] [raw]
Subject: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

[due to a broken libata in current -git I've not been able to test this patch enough]


This patch adds an "allowed_affinity" mask to each interrupt, in addition to the
existing actual affinity mask. In addition this new mask is exported to userspace
in a similar way as the actual affinity is exported. (this is so that irqbalance
can find out about the restriction and take it into account)

The purpose for having this mask is to allow for the situation where interrupts
just can't or shouldn't go to all cpus; one example is the "per cpu" IRQ thing
that powerpc and others have. Another soon-to-come example is MSIX devices that
can generate a different MSI interrupt for each cpu; in that case the MSI needs to
be strictly constrained to it's designated cpu.

Signed-off-by: Arjan van de Ven <[email protected]>

---
include/linux/irq.h | 2 ++
kernel/irq/chip.c | 1 +
kernel/irq/handle.c | 3 ++-
kernel/irq/manage.c | 5 ++++-
kernel/irq/proc.c | 27 +++++++++++++++++++++++++++
5 files changed, 36 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/irq.h
===================================================================
--- linux-2.6.orig/include/linux/irq.h
+++ linux-2.6/include/linux/irq.h
@@ -137,6 +137,7 @@ struct irq_chip {
* @irqs_unhandled: stats field for spurious unhandled interrupts
* @lock: locking for SMP
* @affinity: IRQ affinity on SMP
+ * @allowed_affinity The allowed affinity for this IRQ
* @cpu: cpu index useful for balancing
* @pending_mask: pending rebalanced interrupts
* @dir: /proc/irq/ procfs entry
@@ -160,6 +161,7 @@ struct irq_desc {
spinlock_t lock;
#ifdef CONFIG_SMP
cpumask_t affinity;
+ cpumask_t allowed_affinity;
unsigned int cpu;
#endif
#if defined(CONFIG_GENERIC_PENDING_IRQ) || defined(CONFIG_IRQBALANCE)
Index: linux-2.6/kernel/irq/chip.c
===================================================================
--- linux-2.6.orig/kernel/irq/chip.c
+++ linux-2.6/kernel/irq/chip.c
@@ -46,6 +46,7 @@ void dynamic_irq_init(unsigned int irq)
desc->irqs_unhandled = 0;
#ifdef CONFIG_SMP
desc->affinity = CPU_MASK_ALL;
+ desc->allowed_affinity = CPU_MASK_ALL;
#endif
spin_unlock_irqrestore(&desc->lock, flags);
}
Index: linux-2.6/kernel/irq/handle.c
===================================================================
--- linux-2.6.orig/kernel/irq/handle.c
+++ linux-2.6/kernel/irq/handle.c
@@ -56,7 +56,8 @@ struct irq_desc irq_desc[NR_IRQS] __cach
.depth = 1,
.lock = __SPIN_LOCK_UNLOCKED(irq_desc->lock),
#ifdef CONFIG_SMP
- .affinity = CPU_MASK_ALL
+ .affinity = CPU_MASK_ALL,
+ .allowed_affinity = CPU_MASK_ALL
#endif
}
};
Index: linux-2.6/kernel/irq/manage.c
===================================================================
--- linux-2.6.orig/kernel/irq/manage.c
+++ linux-2.6/kernel/irq/manage.c
@@ -278,8 +278,13 @@ int setup_irq(unsigned int irq, struct i

*p = new;
#if defined(CONFIG_IRQ_PER_CPU)
- if (new->flags & IRQF_PERCPU)
+ if (new->flags & IRQF_PERCPU) {
desc->status |= IRQ_PER_CPU;
+ /* don't allow affinity to be set for per cpu interrupts */
+#ifdef CONFIG_SMP
+ desc->allowed_affinity = CPU_MASK_NONE;
+#endif
+ }
#endif
if (!shared) {
irq_chip_set_defaults(desc->chip);
Index: linux-2.6/kernel/irq/proc.c
===================================================================
--- linux-2.6.orig/kernel/irq/proc.c
+++ linux-2.6/kernel/irq/proc.c
@@ -47,6 +47,20 @@ static int irq_affinity_read_proc(char *
return len;
}

+
+static int irq_affinity_read_allowed_proc(char *page, char **start, off_t off,
+ int count, int *eof, void *data)
+{
+ int len = cpumask_scnprintf(page, count, irq_desc[(long)data].allowed_affinity);
+
+ if (count - len < 2)
+ return -EINVAL;
+ len += sprintf(page + len, "\n");
+ return len;
+}
+
+
+
int no_irq_affinity;
static int irq_affinity_write_proc(struct file *file, const char __user *buffer,
unsigned long count, void *data)
@@ -62,6 +76,9 @@ static int irq_affinity_write_proc(struc
if (err)
return err;

+ /* mask off the allowed_affinity mask to only leave legal cpus */
+ cpus_and(new_value, new_value, irq_desc[irq].allowed_affinity);
+
/*
* Do not allow disabling IRQs completely - it's a too easy
* way to make the system unusable accidentally :-) At least
@@ -141,6 +158,16 @@ void register_irq_proc(unsigned int irq)
entry->read_proc = irq_affinity_read_proc;
entry->write_proc = irq_affinity_write_proc;
}
+
+ /* create /proc/irq/<irq>/smp_affinity */
+ entry = create_proc_entry("allowed_affinity", 0400, irq_desc[irq].dir);
+
+ if (entry) {
+ entry->nlink = 1;
+ entry->data = (void *)(long)irq;
+ entry->read_proc = irq_affinity_read_allowed_proc;
+ }
+
}
#endif
}


2006-12-13 19:41:28

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Arjan van de Ven <[email protected]> writes:

> [due to a broken libata in current -git I've not been able to test this patch
> enough]
>
>
> This patch adds an "allowed_affinity" mask to each interrupt, in addition to the
> existing actual affinity mask. In addition this new mask is exported to
> userspace
> in a similar way as the actual affinity is exported. (this is so that irqbalance
> can find out about the restriction and take it into account)
>
> The purpose for having this mask is to allow for the situation where interrupts
> just can't or shouldn't go to all cpus; one example is the "per cpu" IRQ thing
> that powerpc and others have. Another soon-to-come example is MSIX devices that
> can generate a different MSI interrupt for each cpu; in that case the MSI needs
> to
> be strictly constrained to it's designated cpu.

I don't like this. If we really have constraints that limit which cpus
we can handle a irqs on we should be taking advantage of them to make
the data structure smaller, not making it bigger.

This feels like a forced fit. To fit the situations you describe
into allowed_affinity you have to throw away information. Such
as the fact it is a NUMA node or a MSI-X interrupt.

In addition the cases I can think of allowed_affinity is the wrong
name. suggested_affinity sounds like what you are trying to implement
and when it is merely a suggestion and not a hard limit it doesn't
make sense to export like this.

The restriction with MSIX is that we want the irq that describes
the flow of data and the user space process that consumes that flow of
data scheduled on the same cpu. Yes this is mostly one per cpu but it
is not strictly one per cpu so allowed cpus doesn't make sense in that
case.

The more I think about describing what we are schedule, we are
scheduling a more or less periodic realtime process don't we already
have schedulers in the kernel to schedule this kind of thing? Don't
we want to take into consideration the cache penalty of migration
before we decided to move the irq to another cpu?

Anyway if we are going to start exporting this kind of information
please let's export the whole context and export it in a way that
applications besides irqbalance can take advantage of. Otherwise
we are just committing to support a wide kernel/user space interface
for the purposes of a single application and that seems ridiculous.

Eric

2006-12-13 19:45:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs


* Eric W. Biederman <[email protected]> wrote:

> In addition the cases I can think of allowed_affinity is the wrong
> name. suggested_affinity sounds like what you are trying to implement
> and when it is merely a suggestion and not a hard limit it doesn't
> make sense to export like this.

well, there are interrupts that must be tied to a single CPU and must
never be moved away. For example per-CPU clock-events-source interrupts
are such. So allowed_affinity very much exists.

also there might be hardware that can only route a given IRQ to a subset
of CPUs. While setting set_affinity allows the irqbalance-daemon to
'probe' this mask, it's a far from optimal API.

Ingo

2006-12-13 20:02:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

> .
>
> In addition the cases I can think of allowed_affinity is the wrong
> name. suggested_affinity sounds like what you are trying to implement
> and when it is merely a suggestion and not a hard limit it doesn't
> make sense to export like this.

it really IS a hard limit.

2006-12-13 20:06:35

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Ingo Molnar <[email protected]> writes:

> * Eric W. Biederman <[email protected]> wrote:
>
>> In addition the cases I can think of allowed_affinity is the wrong
>> name. suggested_affinity sounds like what you are trying to implement
>> and when it is merely a suggestion and not a hard limit it doesn't
>> make sense to export like this.
>
> well, there are interrupts that must be tied to a single CPU and must
> never be moved away. For example per-CPU clock-events-source interrupts
> are such. So allowed_affinity very much exists.

Although in that case since it is a single cpu there is a much
more elegant implementation. We don't need a full cpumask_t to
describe it.

> also there might be hardware that can only route a given IRQ to a subset
> of CPUs. While setting set_affinity allows the irqbalance-daemon to
> 'probe' this mask, it's a far from optimal API.

I agree, I am just arguing that adding another awkward interface to
the current situation does not really make the situation better, and
it increases our support burden.

For a bunch of this it is arguable that the way to go is simply to
parse the irq type in /proc/interrupts. All of the really weird cases
will have a distinct type there. This certainly captures the MSI-X
case. There is still a question of how to handle the NUMA case but...

Eric

2006-12-13 20:16:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Arjan van de Ven <[email protected]> writes:

>> .
>>
>> In addition the cases I can think of allowed_affinity is the wrong
>> name. suggested_affinity sounds like what you are trying to implement
>> and when it is merely a suggestion and not a hard limit it doesn't
>> make sense to export like this.
>
> it really IS a hard limit.

Ok. Which generally makes it uninteresting. The only cases that I know
with a hard limit are completely unrouteable.

In addition upon reflection you don't handle PER_CPU irqs properly. As
I recall ia64 uses a different per cpu irq source to target each
individual processor. But because they are the same the share the
irq source.

I don't think allowed_affinity can even describe the case above.

Eric


2006-12-13 20:22:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs


* Eric W. Biederman <[email protected]> wrote:

> > also there might be hardware that can only route a given IRQ to a
> > subset of CPUs. While setting set_affinity allows the
> > irqbalance-daemon to 'probe' this mask, it's a far from optimal API.
>
> I agree, I am just arguing that adding another awkward interface to
> the current situation does not really make the situation better, and
> it increases our support burden.

well, please suggest a better interface then.

> For a bunch of this it is arguable that the way to go is simply to
> parse the irq type in /proc/interrupts. All of the really weird cases
> will have a distinct type there. This certainly captures the MSI-X
> case. There is still a question of how to handle the NUMA case but...

... so parsing /proc/interrupts should be that interface? That is a
historically very volatile interface. It's mostly human-parsed, and we
frequently twiddle it - genirq changed it too. In v2.6.19 we had fasteio
instead of fasteoi there.

Ingo

2006-12-13 20:23:26

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Eric W. Biederman wrote:

> There is still a question of how to handle the NUMA case but...
>

the numa case is already handled; the needed info for that is exposed
already enough... at least for irqbalance

2006-12-14 04:15:15

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Arjan van de Ven <[email protected]> writes:

> Eric W. Biederman wrote:
>
>> There is still a question of how to handle the NUMA case but...
>>
>
> the numa case is already handled; the needed info for that is exposed already
> enough... at least for irqbalance

What is the problem you are trying to solve?

If it is just interrupts irqbalanced can't help with we can do that
with a single bit.

My basic problem with understanding what this patch is trying to
solve is that I've seen some theoretical cases raised but I don't see
the real world problem.

Eric

2006-12-14 06:24:45

by Lu, Yinghai

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

On 12/13/06, Eric W. Biederman <[email protected]> wrote:
> Arjan van de Ven <[email protected]> writes:
> > the numa case is already handled; the needed info for that is exposed already
> > enough... at least for irqbalance
>
> What is the problem you are trying to solve?
>
> If it is just interrupts irqbalanced can't help with we can do that
> with a single bit.
>
> My basic problem with understanding what this patch is trying to
> solve is that I've seen some theoretical cases raised but I don't see
> the real world problem.

Good question.

irqbalance could use set_affinity from boot cpu to final cpu.

YH

2006-12-14 08:05:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Eric W. Biederman wrote:
> What is the problem you are trying to solve?

2 problems
1) irq's that irqbalance should not touch at all
2) irqs that can only go to a subset of processors.

1) is very real today
2) is partially real on some of the bigger numa stuff already.

2006-12-14 08:22:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Arjan van de Ven <[email protected]> writes:

> Eric W. Biederman wrote:
>> What is the problem you are trying to solve?
>
> 2 problems
> 1) irq's that irqbalance should not touch at all

This is easy we just need a single bit. Not 128+ bytes on the huge
machines.

> 2) irqs that can only go to a subset of processors.
>
> 1) is very real today
> 2) is partially real on some of the bigger numa stuff already.

You have said you the NUMA cases is handled in another way already?
In which case irqs that can only got to a subset of processors
shouldn't be a problem.

Eric

2006-12-14 08:23:27

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Eric W. Biederman wrote:
>> 1) is very real today
>> 2) is partially real on some of the bigger numa stuff already.
>
> You have said you the NUMA cases is handled in another way already?

the numa case of "I prefer that cpu" is handled. Not the "I cannot
work on those".

2006-12-14 08:34:45

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs

Arjan van de Ven <[email protected]> writes:

> Eric W. Biederman wrote:
>>> 1) is very real today
>>> 2) is partially real on some of the bigger numa stuff already.
>>
>> You have said you the NUMA cases is handled in another way already?
>
> the numa case of "I prefer that cpu" is handled. Not the "I cannot work on
> those".

How is the NUMA case of I prefer that cpu handled?

Eric

2006-12-14 08:40:39

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] Add allowed_affinity to the irq_desc to make it possible to have restricted irqs


> > the numa case of "I prefer that cpu" is handled. Not the "I cannot work on
> > those".
>
> How is the NUMA case of I prefer that cpu handled?

it's exported via /sys/bus/pci/devices/<device>/local_cpus
(and the irq is in the /irq directory next to local_cpus)