2010-06-13 15:32:30

by Tejun Heo

[permalink] [raw]
Subject: [PATCHSET] irq: better lost/spurious irq handling

Hello,

This is the first take of better-lost-spurious-irq-handling patchset.

IRQs can go wrong in two opposite directions. There can be too many
or too few. Currently, the former is handled by spurious IRQ
detection and polling (the "nobody cared" thing) and the latter by
irqpoll kernel parameter, which currently is broken on many
configurations due to tickless timer and missing IRQF_IRQPOLL.

Certain hardware classes are inherently prone to IRQ related problems.
ATA is one very good example. When the traditional IDE interface is
used, regardless of PATA or SATA, handling of IRQ is very fragile.
There is no reliable way to tell whether the controller is raising an
interrupt or not and the driver should expect IRQ according to HSM
transitions and hope that the host controller stays in sync and does
what it's expected to do. Occasionally but surely something goes
wrong and IRQ storm or timeout follows. Furthermore, the IRQ is
ultimately under the control of the ATA device not the ATA controller
making things a whole lot more fragile and prone to permanent and
transient IRQ problems.

Even if the controller and hardware themselves are okay, IRQ sharing
means that any device can be a victim of rogue interrupts. For
example, there was an I2C device for which the driver didn't use IRQ
but when the configuration is right (well, rather, wrong), its IRQ
line would assert and cause IRQ storm and there wasn't much the driver
could do to prevent that. There's also the BIOS and OS expecting
different things especially during suspend and resume. On my x61s
when resuming from STR something funny happens and the IRQ line for a
USB host gets stuck once in a while.

Most of these problems can be worked around much more efficiently
without adding noticeable runtime overhead or driver complexity by
using polling carefully. This patchset improves the existing spurious
IRQ handling and implements two mechanisms to work around lost
interrupts.

Emphasis was put on making it easy to use for drivers. Drivers only
need IRQF_SHARED on the interrupt handler and add some function calls
here and there. Functions which can be used in hot paths are
efficient and can be called without worrying about performance
implications by virtually any driver which deals with an actual
hardware. Except for init functions, all don't care about calling
context and won't fail catastrophically even if used incorrectly.
Also, operational parameters are predetermined and/or self regulating.

After this patchset, the following three mechanisms are in place to
deal with IRQ problems.

* IRQ expecting: Tightly coupled with controller operation. Provides
strong protection against most lost IRQ problems. Applied to
libata.

* IRQ watching: Loosely coupled with controller operation. Provides
protection against common lost IRQ problems (misrouting). Applied
to usb.

* Spurious IRQ handling: More responsive and less expensive than the
existing implementation. Tries to disengage after some period so
that transient problems don't end up having prolonged effects.

With the patchset applied, my test machine works fine with IRQ routing
messed up. By applying the mechanism to more drivers, things will
improve but, even in the current state, many systems with IRQ problems
will be able to cope with transient problems much better and install
and run the base system well enough to allow bug reporting and
debugging of persistent ones.

This patchset contains the following 12 patches.

0001-irq-cleanup-irqfixup.patch
0002-irq-make-spurious-poll-timer-per-desc.patch
0003-irq-use-desc-poll_timer-for-irqpoll.patch
0004-irq-kill-IRQF_IRQPOLL.patch
0005-irq-misc-preparations-for-further-changes.patch
0006-irq-implement-irq_schedule_poll.patch
0007-irq-improve-spurious-IRQ-handling.patch
0008-irq-implement-IRQ-watching.patch
0009-irq-implement-IRQ-expecting.patch
0010-irq-add-comment-about-overall-design-of-lost-spuriou.patch
0011-libata-use-IRQ-expecting.patch
0012-usb-use-IRQ-watching.patch

0001 is cleanup.

0002-0004 convert the existing polling mechanisms to use per-desc
timer instead of IRQF_IRQPOLL. This is more reliable and cheaper and
easier to maintain.

0005-0006 prepare for further changes.

0007-0010 implement better lost/spurious interrupt handling
mechanisms.

0011-0012 apply them to libata and usb.

This patchset is available in the following git branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq

and contains the following changes.

arch/arm/mach-aaec2000/core.c | 2
arch/arm/mach-at91/at91rm9200_time.c | 2
arch/arm/mach-at91/at91sam926x_time.c | 2
arch/arm/mach-bcmring/core.c | 2
arch/arm/mach-clps711x/time.c | 2
arch/arm/mach-cns3xxx/core.c | 2
arch/arm/mach-ebsa110/core.c | 2
arch/arm/mach-ep93xx/core.c | 2
arch/arm/mach-footbridge/dc21285-timer.c | 2
arch/arm/mach-footbridge/isa-timer.c | 2
arch/arm/mach-h720x/cpu-h7201.c | 2
arch/arm/mach-h720x/cpu-h7202.c | 2
arch/arm/mach-integrator/integrator_ap.c | 2
arch/arm/mach-ixp2000/core.c | 2
arch/arm/mach-ixp23xx/core.c | 2
arch/arm/mach-ixp4xx/common.c | 2
arch/arm/mach-lh7a40x/time.c | 2
arch/arm/mach-mmp/time.c | 2
arch/arm/mach-netx/time.c | 2
arch/arm/mach-ns9xxx/irq.c | 3
arch/arm/mach-ns9xxx/time-ns9360.c | 2
arch/arm/mach-nuc93x/time.c | 2
arch/arm/mach-omap1/time.c | 2
arch/arm/mach-omap1/timer32k.c | 2
arch/arm/mach-omap2/timer-gp.c | 2
arch/arm/mach-pnx4008/time.c | 2
arch/arm/mach-pxa/time.c | 2
arch/arm/mach-sa1100/time.c | 2
arch/arm/mach-shark/core.c | 2
arch/arm/mach-u300/timer.c | 2
arch/arm/mach-w90x900/time.c | 2
arch/arm/plat-iop/time.c | 2
arch/arm/plat-mxc/time.c | 2
arch/arm/plat-samsung/time.c | 2
arch/arm/plat-versatile/timer-sp.c | 2
arch/blackfin/kernel/time-ts.c | 6
arch/ia64/kernel/time.c | 2
arch/parisc/kernel/irq.c | 2
arch/powerpc/platforms/cell/interrupt.c | 5
arch/x86/kernel/time.c | 2
drivers/ata/libata-core.c | 15
drivers/ata/libata-eh.c | 4
drivers/ata/libata-sff.c | 37 -
drivers/clocksource/sh_cmt.c | 3
drivers/clocksource/sh_mtu2.c | 3
drivers/clocksource/sh_tmu.c | 3
drivers/usb/core/hcd.c | 1
include/linux/interrupt.h | 43 -
include/linux/irq.h | 40 +
include/linux/libata.h | 2
kernel/irq/chip.c | 20
kernel/irq/handle.c | 7
kernel/irq/internals.h | 10
kernel/irq/manage.c | 18
kernel/irq/proc.c | 5
kernel/irq/spurious.c | 978 ++++++++++++++++++++++++++-----
56 files changed, 1008 insertions(+), 269 deletions(-)

Thanks.

--
tejun


2010-06-13 15:32:17

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 05/12] irq: misc preparations for further changes

* properly indent irqaction fields.

* factor out print_irq_handler()

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/interrupt.h | 20 ++++++++++----------
kernel/irq/spurious.c | 28 ++++++++++++++++------------
2 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 61857f1..b20bd65 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -102,16 +102,16 @@ typedef irqreturn_t (*irq_handler_t)(int, void *);
* @thread_flags: flags related to @thread
*/
struct irqaction {
- irq_handler_t handler;
- unsigned long flags;
- const char *name;
- void *dev_id;
- struct irqaction *next;
- int irq;
- struct proc_dir_entry *dir;
- irq_handler_t thread_fn;
- struct task_struct *thread;
- unsigned long thread_flags;
+ irq_handler_t handler;
+ unsigned long flags;
+ const char *name;
+ void *dev_id;
+ struct irqaction *next;
+ int irq;
+ struct proc_dir_entry *dir;
+ irq_handler_t thread_fn;
+ struct task_struct *thread;
+ unsigned long thread_flags;
};

extern irqreturn_t no_action(int cpl, void *dev_id);
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index fc18a13..0bce0e3 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -28,6 +28,21 @@ enum {
int noirqdebug __read_mostly;
static int irqfixup __read_mostly = IRQFIXUP_SPURIOUS;

+static void print_irq_handlers(struct irq_desc *desc)
+{
+ struct irqaction *action;
+
+ printk(KERN_ERR "handlers:\n");
+
+ action = desc->action;
+ while (action) {
+ printk(KERN_ERR "[<%p>]", action->handler);
+ print_symbol(" (%s)", (unsigned long)action->handler);
+ printk("\n");
+ action = action->next;
+ }
+}
+
/*
* Recovery handler for misrouted interrupts.
*/
@@ -126,8 +141,6 @@ static void
__report_bad_irq(unsigned int irq, struct irq_desc *desc,
irqreturn_t action_ret)
{
- struct irqaction *action;
-
if (action_ret != IRQ_HANDLED && action_ret != IRQ_NONE) {
printk(KERN_ERR "irq event %d: bogus return value %x\n",
irq, action_ret);
@@ -136,16 +149,7 @@ __report_bad_irq(unsigned int irq, struct irq_desc *desc,
"the \"irqpoll\" option)\n", irq);
}
dump_stack();
- printk(KERN_ERR "handlers:\n");
-
- action = desc->action;
- while (action) {
- printk(KERN_ERR "[<%p>]", action->handler);
- print_symbol(" (%s)",
- (unsigned long)action->handler);
- printk("\n");
- action = action->next;
- }
+ print_irq_handlers(desc);
}

static void
--
1.6.4.2

2010-06-13 15:32:20

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 08/12] irq: implement IRQ watching

This patch implements IRQ watching, which is a simple polling
mechanism drivers can use to work around lost and/or misrouted IRQs.
IRQ watching is enabled by driver calling watch_irq(irq, dev_id).
After that, it polls the irqaction for certain amount of time (1min)
and keeps track of whether IRQ delivery is actually working. If the
irqaction is serviced by poll, it's considered to be a possible
indication of IRQ misdelivery.

The watch polling starts slowly at 1HZ and speeds up to 100HZ when it
sees a possible bad delivery. After collecting stats for a while, it
determines whether the IRQ delivery is working for the irqaction. If
so or it can't be determined, it steps out. If not working, IRQ
polling continues till the irqaction is unregistered.

This can be used by drivers which don't know when to expect the next
IRQ. Just by calling watch_irq() after irqaction is registered or a
timeout, most IRQ misrouting problems can be worked around.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/interrupt.h | 12 +++
include/linux/irq.h | 2 +
kernel/irq/handle.c | 1 +
kernel/irq/spurious.c | 209 ++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 221 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index b20bd65..bc0cdbc 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -88,6 +88,14 @@ enum {

typedef irqreturn_t (*irq_handler_t)(int, void *);

+struct irq_watch {
+ irqreturn_t last_ret;
+ unsigned int flags;
+ unsigned long started;
+ unsigned int nr_samples;
+ unsigned int nr_polled;
+};
+
/**
* struct irqaction - per interrupt action descriptor
* @handler: interrupt handler function
@@ -100,6 +108,7 @@ typedef irqreturn_t (*irq_handler_t)(int, void *);
* @thread_fn: interupt handler function for threaded interrupts
* @thread: thread pointer for threaded interrupts
* @thread_flags: flags related to @thread
+ * @watch: data for irq watching
*/
struct irqaction {
irq_handler_t handler;
@@ -112,6 +121,7 @@ struct irqaction {
irq_handler_t thread_fn;
struct task_struct *thread;
unsigned long thread_flags;
+ struct irq_watch watch;
};

extern irqreturn_t no_action(int cpl, void *dev_id);
@@ -184,6 +194,8 @@ devm_request_irq(struct device *dev, unsigned int irq, irq_handler_t handler,

extern void devm_free_irq(struct device *dev, unsigned int irq, void *dev_id);

+extern void watch_irq(unsigned int irq, void *dev_id);
+
/*
* On lockdep we dont want to enable hardirqs in hardirq
* context. Use local_irq_enable_in_hardirq() to annotate
diff --git a/include/linux/irq.h b/include/linux/irq.h
index b2f73ba..e31954f 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -72,6 +72,7 @@ typedef void (*irq_flow_handler_t)(unsigned int irq,
#define IRQ_SUSPENDED 0x04000000 /* IRQ has gone through suspend sequence */
#define IRQ_ONESHOT 0x08000000 /* IRQ is not unmasked after hardirq */
#define IRQ_NESTED_THREAD 0x10000000 /* IRQ is nested into another, no own handler thread */
+#define IRQ_CHECK_WATCHES 0x40000000 /* IRQ watch enabled */

#ifdef CONFIG_IRQ_PER_CPU
# define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU)
@@ -214,6 +215,7 @@ struct irq_desc {

struct irq_spr spr;
struct timer_list poll_timer;
+ bool poll_warned;

#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 3ae50bf..685c3b3 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -416,6 +416,7 @@ irqreturn_t handle_IRQ_event(unsigned int irq, struct irqaction *action)
}

retval |= ret;
+ action->watch.last_ret = ret;
action = action->next;
} while (action);

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 3948666..6f2ea3b 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -23,11 +23,51 @@ enum {
IRQFIXUP_POLL = 2, /* enable polling by default */

/* IRQ polling common parameters */
+ IRQ_POLL_SLOW_INTV = 3 * HZ, /* not too slow for ppl, slow enough for machine */
IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */

+ IRQ_POLL_SLOW_SLACK = HZ,
IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */

/*
+ * IRQ watch parameters.
+ *
+ * As IRQ watching has much less information about what's
+ * going on, the parameters are more conservative. It will
+ * terminate unless it can reliably determine that IRQ
+ * delivery isn't working.
+ *
+ * IRQs are watched in timed intervals which is BASE_PERIOD
+ * long by default. Polling interval starts at BASE_INTV and
+ * grows upto SLOW_INTV if no bad delivery is detected.
+ *
+ * If a period contains zero sample and no bad delivery was
+ * seen since watch started, watch terminates.
+ *
+ * If a period contains >=1 but <MIN_SAMPLES deliveries,
+ * collected samples are inherited to the next period.
+ *
+ * If it contains enough samples, the ratio between good and
+ * bad deliveries are examined, if >=BAD_PCT% are bad, the
+ * irqaction is tagged bad and watched indefinitely. if
+ * BAD_PCT% > nr_bad >= WARY_PCT%, WARY_PERIOD is used instead
+ * of BASE_PERIOD and the whole process is restarted. If
+ * <WARY_PCT% are bad, watch terminates.
+ */
+ IRQ_WAT_MIN_SAMPLES = 10,
+ IRQ_WAT_BASE_INTV = HZ / 2,
+ IRQ_WAT_BASE_PERIOD = 60 * HZ,
+ IRQ_WAT_WARY_PERIOD = 600 * HZ,
+ IRQ_WAT_WARY_PCT = 1,
+ IRQ_WAT_BAD_PCT = 10,
+
+ /* IRQ watch flags */
+ IRQ_WATCHING = (1 << 0),
+ IRQ_WAT_POLLED = (1 << 1),
+ IRQ_WAT_WARY = (1 << 2),
+ IRQ_WAT_BAD = (1 << 3),
+
+ /*
* Spurious IRQ handling parameters.
*
* As this per-IRQ spurious handling is cheaper than the
@@ -62,6 +102,16 @@ enum {
int noirqdebug __read_mostly;
static int irqfixup __read_mostly = IRQFIXUP_SPURIOUS;

+static struct irqaction *find_irq_action(struct irq_desc *desc, void *dev_id)
+{
+ struct irqaction *act;
+
+ for (act = desc->action; act; act = act->next)
+ if (act->dev_id == dev_id)
+ return act;
+ return NULL;
+}
+
static void print_irq_handlers(struct irq_desc *desc)
{
struct irqaction *action;
@@ -77,9 +127,25 @@ static void print_irq_handlers(struct irq_desc *desc)
}
}

+static void warn_irq_poll(struct irq_desc *desc, struct irqaction *act)
+{
+ if (desc->poll_warned)
+ return;
+
+ desc->poll_warned = true;
+
+ printk(KERN_WARNING "IRQ %u: %s: can't verify IRQ, will keep polling\n",
+ desc->irq, act->name);
+ printk(KERN_WARNING "IRQ %u: %s: system performance may be affected\n",
+ desc->irq, act->name);
+}
+
static unsigned long irq_poll_slack(unsigned long intv)
{
- return IRQ_POLL_SLACK;
+ if (intv >= IRQ_POLL_SLOW_INTV)
+ return IRQ_POLL_SLOW_SLACK;
+ else
+ return IRQ_POLL_SLACK;
}

/**
@@ -109,6 +175,119 @@ static void irq_schedule_poll(struct irq_desc *desc, unsigned long intv)
mod_timer(&desc->poll_timer, expires);
}

+/**
+ * irq_update_watch - IRQ handled, update watch state
+ * @desc: IRQ desc of interest
+ * @act: IRQ action of interest
+ * @via_poll: IRQ was handled via poll
+ *
+ * Called after IRQ is successfully delievered or polled. Updates
+ * watch state accordingly and determines which watch interval to use.
+ *
+ * CONTEXT:
+ * desc->lock
+ *
+ * RETURNS:
+ * Watch poll interval to use, MAX_JIFFY_OFFSET if watch polling isn't
+ * necessary.
+ */
+static unsigned long irq_update_watch(struct irq_desc *desc,
+ struct irqaction *act, bool via_poll)
+{
+ struct irq_watch *wat = &act->watch;
+ unsigned long period = wat->flags & IRQ_WAT_WARY ?
+ IRQ_WAT_WARY_PERIOD : IRQ_WAT_BASE_PERIOD;
+
+ /* if not watching or already determined to be bad, it's easy */
+ if (!(wat->flags & IRQ_WATCHING))
+ return MAX_JIFFY_OFFSET;
+ if (wat->flags & IRQ_WAT_BAD)
+ return IRQ_POLL_INTV;
+
+ /* don't expire watch period while spurious polling is in effect */
+ if (desc->spr.poll_rem) {
+ wat->started = jiffies;
+ return IRQ_POLL_INTV;
+ }
+
+ /* IRQ was handled, record whether it was a good or bad delivery */
+ if (wat->last_ret == IRQ_HANDLED) {
+ wat->nr_samples++;
+ if (via_poll) {
+ wat->nr_polled++;
+ wat->flags |= IRQ_WAT_POLLED;
+ }
+ }
+
+ /* is this watch period over? */
+ if (time_after(jiffies, wat->started + period)) {
+ unsigned int wry_thr = wat->nr_samples * IRQ_WAT_WARY_PCT / 100;
+ unsigned int bad_thr = wat->nr_samples * IRQ_WAT_BAD_PCT / 100;
+
+ if (wat->nr_samples >= IRQ_WAT_MIN_SAMPLES) {
+ /* have enough samples, determine what to do */
+ if (wat->nr_polled <= wry_thr)
+ wat->flags &= ~IRQ_WATCHING;
+ else if (wat->nr_polled <= bad_thr)
+ wat->flags |= IRQ_WAT_WARY;
+ else {
+ warn_irq_poll(desc, act);
+ wat->flags |= IRQ_WAT_BAD;
+ }
+ wat->nr_samples = 0;
+ wat->nr_polled = 0;
+ } else if (!wat->nr_samples || !(wat->flags & IRQ_WAT_POLLED)) {
+ /* not sure but let's not hold onto it */
+ wat->flags &= ~IRQ_WATCHING;
+ }
+
+ wat->started = jiffies;
+ }
+
+ if (!(wat->flags & IRQ_WATCHING))
+ return MAX_JIFFY_OFFSET;
+ if (wat->flags & IRQ_WAT_POLLED)
+ return IRQ_POLL_INTV;
+ /* every delivery upto this point has been successful, grow interval */
+ return clamp_t(unsigned long, jiffies - wat->started,
+ IRQ_WAT_BASE_INTV, IRQ_POLL_SLOW_INTV);
+}
+
+/**
+ * watch_irq - watch an irqaction
+ * @irq: IRQ the irqaction to watch belongs to
+ * @dev_id: dev_id for the irqaction to watch
+ *
+ * LOCKING:
+ * Grabs and releases desc->lock.
+ */
+void watch_irq(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct irqaction *act;
+ unsigned long flags;
+
+ if (WARN_ON_ONCE(!desc))
+ return;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ act = find_irq_action(desc, dev_id);
+ if (!WARN_ON_ONCE(!act)) {
+ struct irq_watch *wat = &act->watch;
+
+ wat->flags |= IRQ_WATCHING;
+ wat->started = jiffies;
+ wat->nr_samples = 0;
+ wat->nr_polled = 0;
+ desc->status |= IRQ_CHECK_WATCHES;
+ irq_schedule_poll(desc, IRQ_WAT_BASE_INTV);
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+EXPORT_SYMBOL_GPL(watch_irq);
+
/* start a new spurious handling period */
static void irq_spr_new_period(struct irq_spr *spr)
{
@@ -151,8 +330,9 @@ static int try_one_irq(int irq, struct irq_desc *desc)
while (action) {
/* Only shared IRQ handlers are safe to call */
if (action->flags & IRQF_SHARED) {
- if (action->handler(irq, action->dev_id) ==
- IRQ_HANDLED)
+ action->watch.last_ret =
+ action->handler(irq, action->dev_id);
+ if (action->watch.last_ret == IRQ_HANDLED)
ok = 1;
}
action = action->next;
@@ -219,6 +399,24 @@ void __note_interrupt(unsigned int irq, struct irq_desc *desc,
unsigned int cnt, abbr;
char unit = 'k';

+ /* first, take care of IRQ watches */
+ if (unlikely(desc->status & IRQ_CHECK_WATCHES)) {
+ unsigned long intv = MAX_JIFFY_OFFSET;
+ struct irqaction *act;
+
+ raw_spin_lock(&desc->lock);
+
+ for (act = desc->action; act; act = act->next)
+ intv = min(intv, irq_update_watch(desc, act, false));
+
+ if (intv < MAX_JIFFY_OFFSET)
+ irq_schedule_poll(desc, intv);
+ else
+ desc->status &= ~IRQ_CHECK_WATCHES;
+
+ raw_spin_unlock(&desc->lock);
+ }
+
/*
* Account for unhandled interrupt. We don't care whether
* spurious accounting update races with irq open/close and
@@ -313,6 +511,7 @@ void poll_irq(unsigned long arg)
struct irq_spr *spr = &desc->spr;
unsigned long intv = MAX_JIFFY_OFFSET;
bool reenable_irq = false;
+ struct irqaction *act;

raw_spin_lock_irq(&desc->lock);

@@ -331,6 +530,10 @@ void poll_irq(unsigned long arg)
if (!spr->poll_rem)
reenable_irq = desc->status & IRQ_SPURIOUS_DISABLED;

+ /* take care of watches */
+ for (act = desc->action; act; act = act->next)
+ intv = min(irq_update_watch(desc, act, true), intv);
+
/* need to poll again? */
if (intv < MAX_JIFFY_OFFSET)
irq_schedule_poll(desc, intv);
--
1.6.4.2

2010-06-13 15:32:28

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 09/12] irq: implement IRQ expecting

This patch implements IRQ expecting, which can be used when a driver
can anticipate the controller to raise an interrupt in relatively
immediate future. A driver needs to allocate an irq expect token
using init_irq_expect() to use it. expect_irq() should be called when
an operation which will be followed by an interrupt is started.
unexpect_irq() when the operation finished or timed out.

This allows IRQ subsystem closely monitor the IRQ and react quickly if
the expected IRQ doesn't happen for whatever reason. The
[un]expect_irq() functions are fairly light weight and any real driver
which accesses hardware controller should be able to use them for each
operation without adding noticeable overhead.

This is most useful for drivers which have to deal with hardware which
is inherently unreliable in dealing with interrupts.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/interrupt.h | 7 +
include/linux/irq.h | 1 +
kernel/irq/spurious.c | 276 ++++++++++++++++++++++++++++++++++++++++++++-
3 files changed, 281 insertions(+), 3 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index bc0cdbc..8bbd9dc 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -88,6 +88,8 @@ enum {

typedef irqreturn_t (*irq_handler_t)(int, void *);

+struct irq_expect;
+
struct irq_watch {
irqreturn_t last_ret;
unsigned int flags;
@@ -109,6 +111,7 @@ struct irq_watch {
* @thread: thread pointer for threaded interrupts
* @thread_flags: flags related to @thread
* @watch: data for irq watching
+ * @expects: data for irq expecting
*/
struct irqaction {
irq_handler_t handler;
@@ -122,6 +125,7 @@ struct irqaction {
struct task_struct *thread;
unsigned long thread_flags;
struct irq_watch watch;
+ struct irq_expect *expects;
};

extern irqreturn_t no_action(int cpl, void *dev_id);
@@ -194,6 +198,9 @@ devm_request_irq(struct device *dev, unsigned int irq, irq_handler_t handler,

extern void devm_free_irq(struct device *dev, unsigned int irq, void *dev_id);

+extern struct irq_expect *init_irq_expect(unsigned int irq, void *dev_id);
+extern void expect_irq(struct irq_expect *exp);
+extern void unexpect_irq(struct irq_expect *exp, bool timedout);
extern void watch_irq(unsigned int irq, void *dev_id);

/*
diff --git a/include/linux/irq.h b/include/linux/irq.h
index e31954f..98530ef 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -72,6 +72,7 @@ typedef void (*irq_flow_handler_t)(unsigned int irq,
#define IRQ_SUSPENDED 0x04000000 /* IRQ has gone through suspend sequence */
#define IRQ_ONESHOT 0x08000000 /* IRQ is not unmasked after hardirq */
#define IRQ_NESTED_THREAD 0x10000000 /* IRQ is nested into another, no own handler thread */
+#define IRQ_IN_POLLING 0x20000000 /* IRQ polling in progress */
#define IRQ_CHECK_WATCHES 0x40000000 /* IRQ watch enabled */

#ifdef CONFIG_IRQ_PER_CPU
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 6f2ea3b..2d92113 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -13,6 +13,7 @@
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
#include <linux/moduleparam.h>
+#include <linux/slab.h>

#include "internals.h"

@@ -25,9 +26,43 @@ enum {
/* IRQ polling common parameters */
IRQ_POLL_SLOW_INTV = 3 * HZ, /* not too slow for ppl, slow enough for machine */
IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
+ IRQ_POLL_QUICK_INTV = HZ / 1000, /* pretty quick but not too taxing */

IRQ_POLL_SLOW_SLACK = HZ,
IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
+ IRQ_POLL_QUICK_SLACK = HZ / 10000, /* 10% slack */
+
+ /*
+ * IRQ expect parameters.
+ *
+ * Because IRQ expecting is tightly coupled with the actual
+ * activity of the controller, we can be slightly aggressive
+ * and try to minimize the effect of lost interrupts.
+ *
+ * An irqaction must accumulate VERIFY_GOAL good deliveries,
+ * where one bad delivery (delivered by polling) costs
+ * BAD_FACTOR good ones, before reaching the verified state.
+ *
+ * QUICK_SAMPLES IRQ deliveries are examined and if
+ * >=QUICK_THRESHOLD of them are polled on the first poll, the
+ * IRQ is considered to be quick and QUICK_INTV is used
+ * instead.
+ *
+ * Keep QUICK_SAMPLES much higher than VERIFY_GOAL so that
+ * quick polling doesn't interfact with the initial
+ * verification attempt (quicker polling increases the chance
+ * of polled deliveries).
+ */
+ IRQ_EXP_BAD_FACTOR = 10,
+ IRQ_EXP_VERIFY_GOAL = 256,
+ IRQ_EXP_QUICK_SAMPLES = IRQ_EXP_VERIFY_GOAL * 4,
+ IRQ_EXP_QUICK_THRESHOLD = IRQ_EXP_QUICK_SAMPLES * 8 / 10,
+
+ /* IRQ expect flags */
+ IRQ_EXPECTING = (1 << 0), /* expecting in progress */
+ IRQ_EXP_VERIFIED = (1 << 1), /* delivery verified, use slow interval */
+ IRQ_EXP_QUICK = (1 << 2), /* quick polling enabled */
+ IRQ_EXP_WARNED = (1 << 3), /* already whined */

/*
* IRQ watch parameters.
@@ -99,6 +134,18 @@ enum {
IRQ_SPR_POLL_CNT_MAX_DEC_SHIFT = BITS_PER_BYTE * sizeof(int) / 4,
};

+struct irq_expect {
+ struct irq_expect *next;
+ struct irq_desc *desc; /* the associated IRQ desc */
+ struct irqaction *act; /* the associated IRQ action */
+
+ unsigned int flags; /* IRQ_EXP_* flags */
+ unsigned int nr_samples; /* nr of collected samples in this period */
+ unsigned int nr_quick; /* nr of polls completed after single attempt */
+ unsigned int nr_good; /* nr of good IRQ deliveries */
+ unsigned long started; /* when this period started */
+};
+
int noirqdebug __read_mostly;
static int irqfixup __read_mostly = IRQFIXUP_SPURIOUS;

@@ -144,8 +191,10 @@ static unsigned long irq_poll_slack(unsigned long intv)
{
if (intv >= IRQ_POLL_SLOW_INTV)
return IRQ_POLL_SLOW_SLACK;
- else
+ else if (intv >= IRQ_POLL_INTV)
return IRQ_POLL_SLACK;
+ else
+ return IRQ_POLL_QUICK_SLACK;
}

/**
@@ -175,6 +224,206 @@ static void irq_schedule_poll(struct irq_desc *desc, unsigned long intv)
mod_timer(&desc->poll_timer, expires);
}

+static unsigned long irq_exp_intv(struct irq_expect *exp)
+{
+ if (!(exp->flags & IRQ_EXPECTING))
+ return MAX_JIFFY_OFFSET;
+ if (exp->flags & IRQ_EXP_VERIFIED)
+ return IRQ_POLL_SLOW_INTV;
+ if (exp->flags & IRQ_EXP_QUICK)
+ return IRQ_POLL_QUICK_INTV;
+ return IRQ_POLL_INTV;
+}
+
+/**
+ * init_irq_expect - initialize IRQ expecting
+ * @irq: IRQ to expect
+ * @dev_id: dev_id of the irqaction to expect
+ *
+ * Initializes IRQ expecting and returns expect token to use. This
+ * function can be called multiple times for the same irqaction and
+ * each token can be used independently.
+ *
+ * CONTEXT:
+ * Does GFP_KERNEL allocation.
+ *
+ * RETURNS:
+ * irq_expect token to use on success, %NULL on failure.
+ */
+struct irq_expect *init_irq_expect(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ struct irqaction *act;
+ struct irq_expect *exp;
+ unsigned long flags;
+
+ if (noirqdebug || WARN_ON_ONCE(!desc))
+ return NULL;
+
+ exp = kzalloc(sizeof(*exp), GFP_KERNEL);
+ if (!exp) {
+ printk(KERN_WARNING "IRQ %u: failed to initialize IRQ expect, "
+ "allocation failed\n", irq);
+ return NULL;
+ }
+
+ exp->desc = desc;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ act = find_irq_action(desc, dev_id);
+ if (!WARN_ON_ONCE(!act)) {
+ exp->act = act;
+ exp->next = act->expects;
+ act->expects = exp;
+ } else {
+ kfree(exp);
+ exp = NULL;
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+ return exp;
+}
+EXPORT_SYMBOL_GPL(init_irq_expect);
+
+/**
+ * expect_irq - expect IRQ
+ * @exp: expect token acquired from init_irq_expect(), %NULL is allowed
+ *
+ * Tell IRQ subsystem to expect an IRQ. The IRQ might be polled until
+ * unexpect_irq() is called on @exp. If @exp is %NULL, this function
+ * becomes noop.
+ *
+ * This function is fairly cheap and drivers can call it for each
+ * interrupt driven operation without adding noticeable overhead in
+ * most cases.
+ *
+ * CONTEXT:
+ * Don't care. The caller is responsible for ensuring
+ * [un]expect_irq() calls don't overlap. Overlapping may lead to
+ * unexpected polling behaviors but won't directly cause a failure.
+ */
+void expect_irq(struct irq_expect *exp)
+{
+ struct irq_desc *desc;
+ unsigned long intv, deadline;
+ unsigned long flags;
+
+ /* @exp is NULL if noirqdebug */
+ if (unlikely(!exp))
+ return;
+
+ desc = exp->desc;
+ exp->flags |= IRQ_EXPECTING;
+
+ /*
+ * Paired with mb in poll_irq(). Either we see timer pending
+ * cleared or poll_irq() sees IRQ_EXPECTING.
+ */
+ smp_mb();
+
+ exp->started = jiffies;
+ intv = irq_exp_intv(exp);
+ deadline = exp->started + intv + irq_poll_slack(intv);
+
+ /*
+ * poll_timer is never explicitly killed unless there's no
+ * action left on the irq; also, while it's online, timer
+ * duration is only shortened, which means that if we see
+ * ->expires in the future and not later than our deadline,
+ * the timer is guaranteed to fire before it.
+ */
+ if (!timer_pending(&desc->poll_timer) ||
+ time_after_eq(jiffies, desc->poll_timer.expires) ||
+ time_before(deadline, desc->poll_timer.expires)) {
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ irq_schedule_poll(desc, intv);
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ }
+}
+EXPORT_SYMBOL_GPL(expect_irq);
+
+/**
+ * unexpect_irq - unexpect IRQ
+ * @exp: expect token acquired from init_irq_expect(), %NULL is allowed
+ * @timedout: did the IRQ timeout?
+ *
+ * Tell IRQ subsystem to stop expecting an IRQ. Set @timedout to
+ * %true if the expected IRQ never arrived. If @exp is %NULL, this
+ * function becomes noop.
+ *
+ * This function is fairly cheap and drivers can call it for each
+ * interrupt driven operation without adding noticeable overhead in
+ * most cases.
+ *
+ * CONTEXT:
+ * Don't care. The caller is responsible for ensuring
+ * [un]expect_irq() calls don't overlap. Overlapping may lead to
+ * unexpected polling behaviors but won't directly cause a failure.
+ */
+void unexpect_irq(struct irq_expect *exp, bool timedout)
+{
+ struct irq_desc *desc;
+
+ /* @exp is NULL if noirqdebug */
+ if (unlikely(!exp) || (!(exp->flags & IRQ_EXPECTING) && !timedout))
+ return;
+
+ desc = exp->desc;
+ exp->flags &= ~IRQ_EXPECTING;
+
+ /* succesful completion from IRQ? */
+ if (likely(!(desc->status & IRQ_IN_POLLING) && !timedout)) {
+ /*
+ * IRQ seems a bit more trustworthy. Allow nr_good to
+ * increase till VERIFY_GOAL + BAD_FACTOR - 1 so that
+ * single succesful delivery can recover verified
+ * state after an accidental polling hit.
+ */
+ if (unlikely(exp->nr_good <
+ IRQ_EXP_VERIFY_GOAL + IRQ_EXP_BAD_FACTOR - 1) &&
+ ++exp->nr_good >= IRQ_EXP_VERIFY_GOAL) {
+ exp->flags |= IRQ_EXP_VERIFIED;
+ exp->nr_samples = 0;
+ exp->nr_quick = 0;
+ }
+ return;
+ }
+
+ /* timedout or polled */
+ if (timedout) {
+ exp->nr_good = 0;
+ } else {
+ exp->nr_good -= min_t(unsigned int,
+ exp->nr_good, IRQ_EXP_BAD_FACTOR);
+
+ if (time_before_eq(jiffies, exp->started + IRQ_POLL_INTV))
+ exp->nr_quick++;
+
+ if (++exp->nr_samples >= IRQ_EXP_QUICK_SAMPLES) {
+ /*
+ * Use quick sampling checkpoints as warning
+ * checkpoints too.
+ */
+ if (!(exp->flags & IRQ_EXP_WARNED) &&
+ !desc->spr.poll_rem) {
+ warn_irq_poll(desc, exp->act);
+ exp->flags |= IRQ_EXP_WARNED;
+ }
+
+ exp->flags &= ~IRQ_EXP_QUICK;
+ if (exp->nr_quick >= IRQ_EXP_QUICK_THRESHOLD)
+ exp->flags |= IRQ_EXP_QUICK;
+ exp->nr_samples = 0;
+ exp->nr_quick = 0;
+ }
+ }
+
+ exp->flags &= ~IRQ_EXP_VERIFIED;
+}
+EXPORT_SYMBOL_GPL(unexpect_irq);
+
/**
* irq_update_watch - IRQ handled, update watch state
* @desc: IRQ desc of interest
@@ -512,11 +761,14 @@ void poll_irq(unsigned long arg)
unsigned long intv = MAX_JIFFY_OFFSET;
bool reenable_irq = false;
struct irqaction *act;
+ struct irq_expect *exp;

raw_spin_lock_irq(&desc->lock);

/* poll the IRQ */
+ desc->status |= IRQ_IN_POLLING;
try_one_irq(desc->irq, desc);
+ desc->status &= ~IRQ_IN_POLLING;

/* take care of spurious handling */
if (spr->poll_rem) {
@@ -530,9 +782,19 @@ void poll_irq(unsigned long arg)
if (!spr->poll_rem)
reenable_irq = desc->status & IRQ_SPURIOUS_DISABLED;

- /* take care of watches */
- for (act = desc->action; act; act = act->next)
+ /*
+ * Paired with mb in expect_irq() so that either they see
+ * timer pending cleared or irq_exp_intv() below sees
+ * IRQ_EXPECTING.
+ */
+ smp_mb();
+
+ /* take care of expects and watches */
+ for (act = desc->action; act; act = act->next) {
intv = min(irq_update_watch(desc, act, true), intv);
+ for (exp = act->expects; exp; exp = exp->next)
+ intv = min(irq_exp_intv(exp), intv);
+ }

/* need to poll again? */
if (intv < MAX_JIFFY_OFFSET)
@@ -583,6 +845,7 @@ void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
{
bool irq_enabled = false, timer_killed = false;
+ struct irq_expect *exp, *next;
unsigned long flags;
int rc;

@@ -625,6 +888,13 @@ void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
timer_killed && irq_enabled ? " and" : "",
irq_enabled ? " IRQ reenabled" : "");

+ /* free expect tokens */
+ for (exp = action->expects; exp; exp = next) {
+ next = exp->next;
+ kfree(exp);
+ }
+ action->expects = NULL;
+
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

--
1.6.4.2

2010-06-13 15:32:56

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 12/12] usb: use IRQ watching

Ask IRQ subsystem to watch HCD IRQ line after initialization. This at
least keeps USB ports which are occupied on initialization working and
eases bug reporting and debugging.

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/usb/core/hcd.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
index 12742f1..383875f 100644
--- a/drivers/usb/core/hcd.c
+++ b/drivers/usb/core/hcd.c
@@ -2270,6 +2270,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
"request interrupt %d failed\n", irqnum);
goto err_request_irq;
}
+ watch_irq(irqnum, hcd);
hcd->irq = irqnum;
dev_info(hcd->self.controller, "irq %d, %s 0x%08llx\n", irqnum,
(hcd->driver->flags & HCD_MEMORY) ?
--
1.6.4.2

2010-06-13 15:33:18

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 04/12] irq: kill IRQF_IRQPOLL

With irqpoll polling switched to desc->poll_timer, IRQF_IRQPOLL is no
longer necessary. Drop it.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/arm/mach-aaec2000/core.c | 2 +-
arch/arm/mach-at91/at91rm9200_time.c | 2 +-
arch/arm/mach-at91/at91sam926x_time.c | 2 +-
arch/arm/mach-bcmring/core.c | 2 +-
arch/arm/mach-clps711x/time.c | 2 +-
arch/arm/mach-cns3xxx/core.c | 2 +-
arch/arm/mach-ebsa110/core.c | 2 +-
arch/arm/mach-ep93xx/core.c | 2 +-
arch/arm/mach-footbridge/dc21285-timer.c | 2 +-
arch/arm/mach-footbridge/isa-timer.c | 2 +-
arch/arm/mach-h720x/cpu-h7201.c | 2 +-
arch/arm/mach-h720x/cpu-h7202.c | 2 +-
arch/arm/mach-integrator/integrator_ap.c | 2 +-
arch/arm/mach-ixp2000/core.c | 2 +-
arch/arm/mach-ixp23xx/core.c | 2 +-
arch/arm/mach-ixp4xx/common.c | 2 +-
arch/arm/mach-lh7a40x/time.c | 2 +-
arch/arm/mach-mmp/time.c | 2 +-
arch/arm/mach-netx/time.c | 2 +-
arch/arm/mach-ns9xxx/time-ns9360.c | 2 +-
arch/arm/mach-nuc93x/time.c | 2 +-
arch/arm/mach-omap1/time.c | 2 +-
arch/arm/mach-omap1/timer32k.c | 2 +-
arch/arm/mach-omap2/timer-gp.c | 2 +-
arch/arm/mach-pnx4008/time.c | 2 +-
arch/arm/mach-pxa/time.c | 2 +-
arch/arm/mach-sa1100/time.c | 2 +-
arch/arm/mach-shark/core.c | 2 +-
arch/arm/mach-u300/timer.c | 2 +-
arch/arm/mach-w90x900/time.c | 2 +-
arch/arm/plat-iop/time.c | 2 +-
arch/arm/plat-mxc/time.c | 2 +-
arch/arm/plat-samsung/time.c | 2 +-
arch/arm/plat-versatile/timer-sp.c | 2 +-
arch/blackfin/kernel/time-ts.c | 6 ++----
arch/ia64/kernel/time.c | 2 +-
arch/parisc/kernel/irq.c | 2 +-
arch/x86/kernel/time.c | 2 +-
drivers/clocksource/sh_cmt.c | 3 +--
drivers/clocksource/sh_mtu2.c | 3 +--
drivers/clocksource/sh_tmu.c | 3 +--
include/linux/interrupt.h | 4 ----
42 files changed, 42 insertions(+), 51 deletions(-)

diff --git a/arch/arm/mach-aaec2000/core.c b/arch/arm/mach-aaec2000/core.c
index 3ef6833..ac87535 100644
--- a/arch/arm/mach-aaec2000/core.c
+++ b/arch/arm/mach-aaec2000/core.c
@@ -139,7 +139,7 @@ aaec2000_timer_interrupt(int irq, void *dev_id)

static struct irqaction aaec2000_timer_irq = {
.name = "AAEC-2000 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = aaec2000_timer_interrupt,
};

diff --git a/arch/arm/mach-at91/at91rm9200_time.c b/arch/arm/mach-at91/at91rm9200_time.c
index 2500f41..a4a0049 100644
--- a/arch/arm/mach-at91/at91rm9200_time.c
+++ b/arch/arm/mach-at91/at91rm9200_time.c
@@ -87,7 +87,7 @@ static irqreturn_t at91rm9200_timer_interrupt(int irq, void *dev_id)

static struct irqaction at91rm9200_timer_irq = {
.name = "at91_tick",
- .flags = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER,
.handler = at91rm9200_timer_interrupt
};

diff --git a/arch/arm/mach-at91/at91sam926x_time.c b/arch/arm/mach-at91/at91sam926x_time.c
index 608a632..22a290b 100644
--- a/arch/arm/mach-at91/at91sam926x_time.c
+++ b/arch/arm/mach-at91/at91sam926x_time.c
@@ -123,7 +123,7 @@ static irqreturn_t at91sam926x_pit_interrupt(int irq, void *dev_id)

static struct irqaction at91sam926x_pit_irq = {
.name = "at91_tick",
- .flags = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_SHARED | IRQF_DISABLED | IRQF_TIMER,
.handler = at91sam926x_pit_interrupt
};

diff --git a/arch/arm/mach-bcmring/core.c b/arch/arm/mach-bcmring/core.c
index 72e405d..99e10ff 100644
--- a/arch/arm/mach-bcmring/core.c
+++ b/arch/arm/mach-bcmring/core.c
@@ -266,7 +266,7 @@ static irqreturn_t bcmring_timer_interrupt(int irq, void *dev_id)

static struct irqaction bcmring_timer_irq = {
.name = "bcmring Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = bcmring_timer_interrupt,
};

diff --git a/arch/arm/mach-clps711x/time.c b/arch/arm/mach-clps711x/time.c
index d581ef0..9dffaa6 100644
--- a/arch/arm/mach-clps711x/time.c
+++ b/arch/arm/mach-clps711x/time.c
@@ -56,7 +56,7 @@ p720t_timer_interrupt(int irq, void *dev_id)

static struct irqaction clps711x_timer_irq = {
.name = "CLPS711x Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = p720t_timer_interrupt,
};

diff --git a/arch/arm/mach-cns3xxx/core.c b/arch/arm/mach-cns3xxx/core.c
index 9ca4d58..9dcea9f 100644
--- a/arch/arm/mach-cns3xxx/core.c
+++ b/arch/arm/mach-cns3xxx/core.c
@@ -178,7 +178,7 @@ static irqreturn_t cns3xxx_timer_interrupt(int irq, void *dev_id)

static struct irqaction cns3xxx_timer_irq = {
.name = "timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = cns3xxx_timer_interrupt,
};

diff --git a/arch/arm/mach-ebsa110/core.c b/arch/arm/mach-ebsa110/core.c
index c7bc7fb..efb7734 100644
--- a/arch/arm/mach-ebsa110/core.c
+++ b/arch/arm/mach-ebsa110/core.c
@@ -195,7 +195,7 @@ ebsa110_timer_interrupt(int irq, void *dev_id)

static struct irqaction ebsa110_timer_irq = {
.name = "EBSA110 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = ebsa110_timer_interrupt,
};

diff --git a/arch/arm/mach-ep93xx/core.c b/arch/arm/mach-ep93xx/core.c
index 9092677..2910f6c 100644
--- a/arch/arm/mach-ep93xx/core.c
+++ b/arch/arm/mach-ep93xx/core.c
@@ -132,7 +132,7 @@ static irqreturn_t ep93xx_timer_interrupt(int irq, void *dev_id)

static struct irqaction ep93xx_timer_irq = {
.name = "ep93xx timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = ep93xx_timer_interrupt,
};

diff --git a/arch/arm/mach-footbridge/dc21285-timer.c b/arch/arm/mach-footbridge/dc21285-timer.c
index bc5e83f..42b0bd7 100644
--- a/arch/arm/mach-footbridge/dc21285-timer.c
+++ b/arch/arm/mach-footbridge/dc21285-timer.c
@@ -41,7 +41,7 @@ timer1_interrupt(int irq, void *dev_id)
static struct irqaction footbridge_timer_irq = {
.name = "Timer1 timer tick",
.handler = timer1_interrupt,
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
};

/*
diff --git a/arch/arm/mach-footbridge/isa-timer.c b/arch/arm/mach-footbridge/isa-timer.c
index f488fa2..ca1932e 100644
--- a/arch/arm/mach-footbridge/isa-timer.c
+++ b/arch/arm/mach-footbridge/isa-timer.c
@@ -71,7 +71,7 @@ isa_timer_interrupt(int irq, void *dev_id)
static struct irqaction isa_timer_irq = {
.name = "ISA timer tick",
.handler = isa_timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
};

static void __init isa_timer_init(void)
diff --git a/arch/arm/mach-h720x/cpu-h7201.c b/arch/arm/mach-h720x/cpu-h7201.c
index 24df2a3..be1db54 100644
--- a/arch/arm/mach-h720x/cpu-h7201.c
+++ b/arch/arm/mach-h720x/cpu-h7201.c
@@ -37,7 +37,7 @@ h7201_timer_interrupt(int irq, void *dev_id)

static struct irqaction h7201_timer_irq = {
.name = "h7201 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = h7201_timer_interrupt,
};

diff --git a/arch/arm/mach-h720x/cpu-h7202.c b/arch/arm/mach-h720x/cpu-h7202.c
index fd33a19..e40deea 100644
--- a/arch/arm/mach-h720x/cpu-h7202.c
+++ b/arch/arm/mach-h720x/cpu-h7202.c
@@ -166,7 +166,7 @@ static struct irq_chip h7202_timerx_chip = {

static struct irqaction h7202_timer_irq = {
.name = "h7202 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = h7202_timer_interrupt,
};

diff --git a/arch/arm/mach-integrator/integrator_ap.c b/arch/arm/mach-integrator/integrator_ap.c
index 227cf4d..60f741e 100644
--- a/arch/arm/mach-integrator/integrator_ap.c
+++ b/arch/arm/mach-integrator/integrator_ap.c
@@ -447,7 +447,7 @@ static struct clock_event_device integrator_clockevent = {

static struct irqaction integrator_timer_irq = {
.name = "timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = integrator_timer_interrupt,
.dev_id = &integrator_clockevent,
};
diff --git a/arch/arm/mach-ixp2000/core.c b/arch/arm/mach-ixp2000/core.c
index babb225..693275d 100644
--- a/arch/arm/mach-ixp2000/core.c
+++ b/arch/arm/mach-ixp2000/core.c
@@ -213,7 +213,7 @@ static int ixp2000_timer_interrupt(int irq, void *dev_id)

static struct irqaction ixp2000_timer_irq = {
.name = "IXP2000 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = ixp2000_timer_interrupt,
};

diff --git a/arch/arm/mach-ixp23xx/core.c b/arch/arm/mach-ixp23xx/core.c
index aa4c442..8967392 100644
--- a/arch/arm/mach-ixp23xx/core.c
+++ b/arch/arm/mach-ixp23xx/core.c
@@ -359,7 +359,7 @@ ixp23xx_timer_interrupt(int irq, void *dev_id)
static struct irqaction ixp23xx_timer_irq = {
.name = "IXP23xx Timer Tick",
.handler = ixp23xx_timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
};

void __init ixp23xx_init_timer(void)
diff --git a/arch/arm/mach-ixp4xx/common.c b/arch/arm/mach-ixp4xx/common.c
index 0bce097..9a574fb 100644
--- a/arch/arm/mach-ixp4xx/common.c
+++ b/arch/arm/mach-ixp4xx/common.c
@@ -278,7 +278,7 @@ static irqreturn_t ixp4xx_timer_interrupt(int irq, void *dev_id)

static struct irqaction ixp4xx_timer_irq = {
.name = "timer1",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = ixp4xx_timer_interrupt,
.dev_id = &clockevent_ixp4xx,
};
diff --git a/arch/arm/mach-lh7a40x/time.c b/arch/arm/mach-lh7a40x/time.c
index 4601e42..841fe8c 100644
--- a/arch/arm/mach-lh7a40x/time.c
+++ b/arch/arm/mach-lh7a40x/time.c
@@ -49,7 +49,7 @@ lh7a40x_timer_interrupt(int irq, void *dev_id)

static struct irqaction lh7a40x_timer_irq = {
.name = "LHA740x Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = lh7a40x_timer_interrupt,
};

diff --git a/arch/arm/mach-mmp/time.c b/arch/arm/mach-mmp/time.c
index cf75694..3d6957a 100644
--- a/arch/arm/mach-mmp/time.c
+++ b/arch/arm/mach-mmp/time.c
@@ -177,7 +177,7 @@ static void __init timer_config(void)

static struct irqaction timer_irq = {
.name = "timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = timer_interrupt,
.dev_id = &ckevt,
};
diff --git a/arch/arm/mach-netx/time.c b/arch/arm/mach-netx/time.c
index 82801db..c0cc836 100644
--- a/arch/arm/mach-netx/time.c
+++ b/arch/arm/mach-netx/time.c
@@ -100,7 +100,7 @@ netx_timer_interrupt(int irq, void *dev_id)

static struct irqaction netx_timer_irq = {
.name = "NetX Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = netx_timer_interrupt,
};

diff --git a/arch/arm/mach-ns9xxx/time-ns9360.c b/arch/arm/mach-ns9xxx/time-ns9360.c
index 7728126..e60627f 100644
--- a/arch/arm/mach-ns9xxx/time-ns9360.c
+++ b/arch/arm/mach-ns9xxx/time-ns9360.c
@@ -121,7 +121,7 @@ static irqreturn_t ns9360_clockevent_handler(int irq, void *dev_id)

static struct irqaction ns9360_clockevent_action = {
.name = "ns9360-timer" __stringify(TIMER_CLOCKEVENT),
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = ns9360_clockevent_handler,
};

diff --git a/arch/arm/mach-nuc93x/time.c b/arch/arm/mach-nuc93x/time.c
index 2f90f9d..8e0dbea 100644
--- a/arch/arm/mach-nuc93x/time.c
+++ b/arch/arm/mach-nuc93x/time.c
@@ -56,7 +56,7 @@ static irqreturn_t nuc93x_timer_interrupt(int irq, void *dev_id)

static struct irqaction nuc93x_timer_irq = {
.name = "nuc93x Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = nuc93x_timer_interrupt,
};

diff --git a/arch/arm/mach-omap1/time.c b/arch/arm/mach-omap1/time.c
index 1be6a21..c62fa79 100644
--- a/arch/arm/mach-omap1/time.c
+++ b/arch/arm/mach-omap1/time.c
@@ -157,7 +157,7 @@ static irqreturn_t omap_mpu_timer1_interrupt(int irq, void *dev_id)

static struct irqaction omap_mpu_timer1_irq = {
.name = "mpu_timer1",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = omap_mpu_timer1_interrupt,
};

diff --git a/arch/arm/mach-omap1/timer32k.c b/arch/arm/mach-omap1/timer32k.c
index 20cfbcc..8ad901b 100644
--- a/arch/arm/mach-omap1/timer32k.c
+++ b/arch/arm/mach-omap1/timer32k.c
@@ -156,7 +156,7 @@ static irqreturn_t omap_32k_timer_interrupt(int irq, void *dev_id)

static struct irqaction omap_32k_timer_irq = {
.name = "32KHz timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = omap_32k_timer_interrupt,
};

diff --git a/arch/arm/mach-omap2/timer-gp.c b/arch/arm/mach-omap2/timer-gp.c
index 74fbed8..ddf9fae 100644
--- a/arch/arm/mach-omap2/timer-gp.c
+++ b/arch/arm/mach-omap2/timer-gp.c
@@ -62,7 +62,7 @@ static irqreturn_t omap2_gp_timer_interrupt(int irq, void *dev_id)

static struct irqaction omap2_gp_timer_irq = {
.name = "gp timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = omap2_gp_timer_interrupt,
};

diff --git a/arch/arm/mach-pnx4008/time.c b/arch/arm/mach-pnx4008/time.c
index 0c8aad4..1d5b2da 100644
--- a/arch/arm/mach-pnx4008/time.c
+++ b/arch/arm/mach-pnx4008/time.c
@@ -80,7 +80,7 @@ static irqreturn_t pnx4008_timer_interrupt(int irq, void *dev_id)

static struct irqaction pnx4008_timer_irq = {
.name = "PNX4008 Tick Timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = pnx4008_timer_interrupt
};

diff --git a/arch/arm/mach-pxa/time.c b/arch/arm/mach-pxa/time.c
index 293e40a..9fa6e1a 100644
--- a/arch/arm/mach-pxa/time.c
+++ b/arch/arm/mach-pxa/time.c
@@ -133,7 +133,7 @@ static struct clocksource cksrc_pxa_oscr0 = {

static struct irqaction pxa_ost0_irq = {
.name = "ost0",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = pxa_ost0_interrupt,
.dev_id = &ckevt_pxa_osmr0,
};
diff --git a/arch/arm/mach-sa1100/time.c b/arch/arm/mach-sa1100/time.c
index 74b6e0e..7ec781d 100644
--- a/arch/arm/mach-sa1100/time.c
+++ b/arch/arm/mach-sa1100/time.c
@@ -87,7 +87,7 @@ static struct clocksource cksrc_sa1100_oscr = {

static struct irqaction sa1100_timer_irq = {
.name = "ost0",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = sa1100_ost0_interrupt,
.dev_id = &ckevt_sa1100_osmr0,
};
diff --git a/arch/arm/mach-shark/core.c b/arch/arm/mach-shark/core.c
index 358d875..d1d6ea5 100644
--- a/arch/arm/mach-shark/core.c
+++ b/arch/arm/mach-shark/core.c
@@ -130,7 +130,7 @@ shark_timer_interrupt(int irq, void *dev_id)

static struct irqaction shark_timer_irq = {
.name = "Shark Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = shark_timer_interrupt,
};

diff --git a/arch/arm/mach-u300/timer.c b/arch/arm/mach-u300/timer.c
index 26d26f5..63db057 100644
--- a/arch/arm/mach-u300/timer.c
+++ b/arch/arm/mach-u300/timer.c
@@ -326,7 +326,7 @@ static irqreturn_t u300_timer_interrupt(int irq, void *dev_id)

static struct irqaction u300_timer_irq = {
.name = "U300 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = u300_timer_interrupt,
};

diff --git a/arch/arm/mach-w90x900/time.c b/arch/arm/mach-w90x900/time.c
index b80f769..9cda835 100644
--- a/arch/arm/mach-w90x900/time.c
+++ b/arch/arm/mach-w90x900/time.c
@@ -111,7 +111,7 @@ static irqreturn_t nuc900_timer0_interrupt(int irq, void *dev_id)

static struct irqaction nuc900_timer0_irq = {
.name = "nuc900-timer0",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = nuc900_timer0_interrupt,
};

diff --git a/arch/arm/plat-iop/time.c b/arch/arm/plat-iop/time.c
index 6c8a02a..e320148 100644
--- a/arch/arm/plat-iop/time.c
+++ b/arch/arm/plat-iop/time.c
@@ -164,7 +164,7 @@ iop_timer_interrupt(int irq, void *dev_id)
static struct irqaction iop_timer_irq = {
.name = "IOP Timer Tick",
.handler = iop_timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.dev_id = &iop_clockevent,
};

diff --git a/arch/arm/plat-mxc/time.c b/arch/arm/plat-mxc/time.c
index f9a1b05..e5ca2bd 100644
--- a/arch/arm/plat-mxc/time.c
+++ b/arch/arm/plat-mxc/time.c
@@ -258,7 +258,7 @@ static irqreturn_t mxc_timer_interrupt(int irq, void *dev_id)

static struct irqaction mxc_timer_irq = {
.name = "i.MX Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = mxc_timer_interrupt,
};

diff --git a/arch/arm/plat-samsung/time.c b/arch/arm/plat-samsung/time.c
index 2231d80..133069a 100644
--- a/arch/arm/plat-samsung/time.c
+++ b/arch/arm/plat-samsung/time.c
@@ -138,7 +138,7 @@ s3c2410_timer_interrupt(int irq, void *dev_id)

static struct irqaction s3c2410_timer_irq = {
.name = "S3C2410 Timer Tick",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = s3c2410_timer_interrupt,
};

diff --git a/arch/arm/plat-versatile/timer-sp.c b/arch/arm/plat-versatile/timer-sp.c
index fb0d1c2..62066b4 100644
--- a/arch/arm/plat-versatile/timer-sp.c
+++ b/arch/arm/plat-versatile/timer-sp.c
@@ -135,7 +135,7 @@ static struct clock_event_device sp804_clockevent = {

static struct irqaction sp804_timer_irq = {
.name = "timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER,
.handler = sp804_timer_interrupt,
.dev_id = &sp804_clockevent,
};
diff --git a/arch/blackfin/kernel/time-ts.c b/arch/blackfin/kernel/time-ts.c
index 8c9a43d..6fefad4 100644
--- a/arch/blackfin/kernel/time-ts.c
+++ b/arch/blackfin/kernel/time-ts.c
@@ -213,8 +213,7 @@ irqreturn_t bfin_gptmr0_interrupt(int irq, void *dev_id)

static struct irqaction gptmr0_irq = {
.name = "Blackfin GPTimer0",
- .flags = IRQF_DISABLED | IRQF_TIMER | \
- IRQF_IRQPOLL | IRQF_PERCPU,
+ .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_PERCPU,
.handler = bfin_gptmr0_interrupt,
};

@@ -322,8 +321,7 @@ irqreturn_t bfin_coretmr_interrupt(int irq, void *dev_id)

static struct irqaction coretmr_irq = {
.name = "Blackfin CoreTimer",
- .flags = IRQF_DISABLED | IRQF_TIMER | \
- IRQF_IRQPOLL | IRQF_PERCPU,
+ .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_PERCPU,
.handler = bfin_coretmr_interrupt,
};

diff --git a/arch/ia64/kernel/time.c b/arch/ia64/kernel/time.c
index 653b3c4..8b366c4 100644
--- a/arch/ia64/kernel/time.c
+++ b/arch/ia64/kernel/time.c
@@ -411,7 +411,7 @@ static cycle_t itc_get_cycles(struct clocksource *cs)

static struct irqaction timer_irqaction = {
.handler = timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED,
.name = "timer"
};

diff --git a/arch/parisc/kernel/irq.c b/arch/parisc/kernel/irq.c
index efbcee5..24681d5 100644
--- a/arch/parisc/kernel/irq.c
+++ b/arch/parisc/kernel/irq.c
@@ -383,7 +383,7 @@ void do_cpu_irq_mask(struct pt_regs *regs)
static struct irqaction timer_action = {
.handler = timer_interrupt,
.name = "timer",
- .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_PERCPU | IRQF_IRQPOLL,
+ .flags = IRQF_DISABLED | IRQF_TIMER | IRQF_PERCPU,
};

#ifdef CONFIG_SMP
diff --git a/arch/x86/kernel/time.c b/arch/x86/kernel/time.c
index fb5cc5e..476e2fe 100644
--- a/arch/x86/kernel/time.c
+++ b/arch/x86/kernel/time.c
@@ -88,7 +88,7 @@ static irqreturn_t timer_interrupt(int irq, void *dev_id)

static struct irqaction irq0 = {
.handler = timer_interrupt,
- .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_IRQPOLL | IRQF_TIMER,
+ .flags = IRQF_DISABLED | IRQF_NOBALANCING | IRQF_TIMER,
.name = "timer"
};

diff --git a/drivers/clocksource/sh_cmt.c b/drivers/clocksource/sh_cmt.c
index f3d3898..60b681c 100644
--- a/drivers/clocksource/sh_cmt.c
+++ b/drivers/clocksource/sh_cmt.c
@@ -610,8 +610,7 @@ static int sh_cmt_setup(struct sh_cmt_priv *p, struct platform_device *pdev)
p->irqaction.name = dev_name(&p->pdev->dev);
p->irqaction.handler = sh_cmt_interrupt;
p->irqaction.dev_id = p;
- p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | \
- IRQF_IRQPOLL | IRQF_NOBALANCING;
+ p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | IRQF_NOBALANCING;

/* get hold of clock */
p->clk = clk_get(&p->pdev->dev, "cmt_fck");
diff --git a/drivers/clocksource/sh_mtu2.c b/drivers/clocksource/sh_mtu2.c
index ef7a5be..9fe3507 100644
--- a/drivers/clocksource/sh_mtu2.c
+++ b/drivers/clocksource/sh_mtu2.c
@@ -281,8 +281,7 @@ static int sh_mtu2_setup(struct sh_mtu2_priv *p, struct platform_device *pdev)
p->irqaction.handler = sh_mtu2_interrupt;
p->irqaction.dev_id = p;
p->irqaction.irq = irq;
- p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | \
- IRQF_IRQPOLL | IRQF_NOBALANCING;
+ p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | IRQF_NOBALANCING;

/* get hold of clock */
p->clk = clk_get(&p->pdev->dev, "mtu2_fck");
diff --git a/drivers/clocksource/sh_tmu.c b/drivers/clocksource/sh_tmu.c
index de71590..75967f8 100644
--- a/drivers/clocksource/sh_tmu.c
+++ b/drivers/clocksource/sh_tmu.c
@@ -387,8 +387,7 @@ static int sh_tmu_setup(struct sh_tmu_priv *p, struct platform_device *pdev)
p->irqaction.handler = sh_tmu_interrupt;
p->irqaction.dev_id = p;
p->irqaction.irq = irq;
- p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | \
- IRQF_IRQPOLL | IRQF_NOBALANCING;
+ p->irqaction.flags = IRQF_DISABLED | IRQF_TIMER | IRQF_NOBALANCING;

/* get hold of clock */
p->clk = clk_get(&p->pdev->dev, "tmu_fck");
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c233113..61857f1 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -47,9 +47,6 @@
* IRQF_TIMER - Flag to mark this interrupt as timer interrupt
* IRQF_PERCPU - Interrupt is per cpu
* IRQF_NOBALANCING - Flag to exclude this interrupt from irq balancing
- * IRQF_IRQPOLL - Interrupt is used for polling (only the interrupt that is
- * registered first in an shared interrupt is considered for
- * performance reasons)
* IRQF_ONESHOT - Interrupt is not reenabled after the hardirq handler finished.
* Used by threaded interrupts which need to keep the
* irq line disabled until the threaded handler has been run.
@@ -61,7 +58,6 @@
#define IRQF_TIMER 0x00000200
#define IRQF_PERCPU 0x00000400
#define IRQF_NOBALANCING 0x00000800
-#define IRQF_IRQPOLL 0x00001000
#define IRQF_ONESHOT 0x00002000

/*
--
1.6.4.2

2010-06-13 15:32:14

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 11/12] libata: use IRQ expecting

Thanks to its age, ATA is very susceptible to IRQ delivery problems in
both directions - lost and spurious interrupts. In traditional PATA,
the IRQ line is ultimately out of the controller and driver's control.
Even relatively new SATA isn't free from these issues. Many
controllers still emulate the traditional IDE interface which doesn't
have reliable way to indicate interrupt pending state and there also
is an issue regarding the interpretation of nIEN on both sides of the
cable.

Most of these problems can be worked around by using the new IRQ
expecting mechanism without adding noticeable overhead. In ATA,
almost all operations are initiated by the host and the controller
signals progress or completion using IRQ. IRQ expecting can easily be
added in libata core and applied to all libata drivers.

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/ata/libata-core.c | 15 +++++++++++++--
drivers/ata/libata-eh.c | 4 +++-
drivers/ata/libata-sff.c | 37 +++++++++++++++++++------------------
include/linux/libata.h | 2 ++
4 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index ddf8e48..9a0aaa0 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -4972,6 +4972,8 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
{
struct ata_port *ap = qc->ap;

+ unexpect_irq(ap->irq_expect, false);
+
/* XXX: New EH and old EH use different mechanisms to
* synchronize EH with regular execution path.
*
@@ -5087,6 +5089,9 @@ int ata_qc_complete_multiple(struct ata_port *ap, u32 qc_active)
done_mask &= ~(1 << tag);
}

+ if (ap->qc_active)
+ expect_irq(ap->irq_expect);
+
return nr_done;
}

@@ -5153,6 +5158,7 @@ void ata_qc_issue(struct ata_queued_cmd *qc)
qc->err_mask |= ap->ops->qc_issue(qc);
if (unlikely(qc->err_mask))
goto err;
+ expect_irq(ap->irq_expect);
return;

sg_err:
@@ -6185,8 +6191,13 @@ int ata_host_activate(struct ata_host *host, int irq,
if (rc)
return rc;

- for (i = 0; i < host->n_ports; i++)
- ata_port_desc(host->ports[i], "irq %d", irq);
+ for (i = 0; i < host->n_ports; i++) {
+ struct ata_port *ap = host->ports[i];
+
+ if (!ata_port_is_dummy(ap))
+ ap->irq_expect = init_irq_expect(irq, host);
+ ata_port_desc(ap, "irq %d%s", irq, ap->irq_expect ? "+" : "");
+ }

rc = ata_host_register(host, sht);
/* if failed, just free the IRQ and leave ports alone */
diff --git a/drivers/ata/libata-eh.c b/drivers/ata/libata-eh.c
index f77a673..f1ae3ec 100644
--- a/drivers/ata/libata-eh.c
+++ b/drivers/ata/libata-eh.c
@@ -619,8 +619,10 @@ void ata_scsi_error(struct Scsi_Host *host)
* handler doesn't diddle with those qcs. This must
* be done atomically w.r.t. setting QCFLAG_FAILED.
*/
- if (nr_timedout)
+ if (nr_timedout) {
+ unexpect_irq(ap->irq_expect, true);
__ata_port_freeze(ap);
+ }

spin_unlock_irqrestore(ap->lock, flags);

diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c
index efa4a18..cc96f36 100644
--- a/drivers/ata/libata-sff.c
+++ b/drivers/ata/libata-sff.c
@@ -2388,7 +2388,8 @@ int ata_pci_sff_activate_host(struct ata_host *host,
struct device *dev = host->dev;
struct pci_dev *pdev = to_pci_dev(dev);
const char *drv_name = dev_driver_string(host->dev);
- int legacy_mode = 0, rc;
+ struct ata_port *ap[2] = { host->ports[0], host->ports[1] };
+ int legacy_mode = 0, i, rc;

rc = ata_host_start(host);
if (rc)
@@ -2422,29 +2423,29 @@ int ata_pci_sff_activate_host(struct ata_host *host,
if (rc)
goto out;

- ata_port_desc(host->ports[0], "irq %d", pdev->irq);
- ata_port_desc(host->ports[1], "irq %d", pdev->irq);
+ for (i = 0; i < 2; i++) {
+ if (!ata_port_is_dummy(ap[i]))
+ ap[i]->irq_expect =
+ init_irq_expect(pdev->irq, host);
+ ata_port_desc(ap[i], "irq %d%s",
+ pdev->irq, ap[i]->irq_expect ? "+" : "");
+ }
} else if (legacy_mode) {
- if (!ata_port_is_dummy(host->ports[0])) {
- rc = devm_request_irq(dev, ATA_PRIMARY_IRQ(pdev),
- irq_handler, IRQF_SHARED,
- drv_name, host);
- if (rc)
- goto out;
+ unsigned int irqs[2] = { ATA_PRIMARY_IRQ(pdev),
+ ATA_SECONDARY_IRQ(pdev) };

- ata_port_desc(host->ports[0], "irq %d",
- ATA_PRIMARY_IRQ(pdev));
- }
+ for (i = 0; i < 2; i++) {
+ if (ata_port_is_dummy(ap[i]))
+ continue;

- if (!ata_port_is_dummy(host->ports[1])) {
- rc = devm_request_irq(dev, ATA_SECONDARY_IRQ(pdev),
- irq_handler, IRQF_SHARED,
- drv_name, host);
+ rc = devm_request_irq(dev, irqs[i], irq_handler,
+ IRQF_SHARED, drv_name, host);
if (rc)
goto out;

- ata_port_desc(host->ports[1], "irq %d",
- ATA_SECONDARY_IRQ(pdev));
+ ap[i]->irq_expect = init_irq_expect(irqs[i], host);
+ ata_port_desc(ap[i], "irq %d%s",
+ irqs[i], ap[i]->irq_expect ? "+" : "");
}
}

diff --git a/include/linux/libata.h b/include/linux/libata.h
index b85f3ff..3f5f159 100644
--- a/include/linux/libata.h
+++ b/include/linux/libata.h
@@ -751,6 +751,8 @@ struct ata_port {
struct ata_host *host;
struct device *dev;

+ struct irq_expect *irq_expect; /* for irq expecting */
+
struct delayed_work hotplug_task;
struct work_struct scsi_rescan_task;

--
1.6.4.2

2010-06-13 15:33:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 07/12] irq: improve spurious IRQ handling

Currently, once spurious polling is enabled, it's never disabled and
to avoid enaling it unnecessarily, the condition for kicking in is
very conservative. Now that spurious polling is per-IRQ, it can be
made more adaptive without adding overhead to the fast path.

This patch improves spurious handling such that the spurious IRQ
polling kicks in earlier and it disables itself after polling certain
number of times which is automatically adjusted according to whether
and when spurious IRQ happens again.

This allows the system to work around temporary IRQ glitches without
paying unnecessary long term overhead.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/irq.h | 19 ++-
kernel/irq/chip.c | 2 -
kernel/irq/internals.h | 2 +-
kernel/irq/manage.c | 4 -
kernel/irq/proc.c | 5 +-
kernel/irq/spurious.c | 292 ++++++++++++++++++++++++++++++++++--------------
6 files changed, 226 insertions(+), 98 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 50a77f9..b2f73ba 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -145,6 +145,17 @@ struct irq_chip {

struct timer_rand_state;
struct irq_2_iommu;
+
+/* spurious IRQ tracking and handling */
+struct irq_spr {
+ unsigned long last_bad; /* when was the last bad? */
+ unsigned long period_start; /* period start jiffies */
+ unsigned int nr_samples; /* nr of irqs in this period */
+ unsigned int nr_bad; /* nr of bad deliveries */
+ unsigned int poll_cnt; /* nr to poll once activated */
+ unsigned int poll_rem; /* how many polls are left? */
+};
+
/**
* struct irq_desc - interrupt descriptor
* @irq: interrupt number for this descriptor
@@ -161,15 +172,13 @@ struct irq_2_iommu;
* @status: status information
* @depth: disable-depth, for nested irq_disable() calls
* @wake_depth: enable depth, for multiple set_irq_wake() callers
- * @irq_count: stats field to detect stalled irqs
- * @last_unhandled: aging timer for unhandled count
- * @irqs_unhandled: stats field for spurious unhandled interrupts
* @lock: locking for SMP
* @affinity: IRQ affinity on SMP
* @node: node index useful for balancing
* @pending_mask: pending rebalanced interrupts
* @threads_active: number of irqaction threads currently running
* @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
+ * @spr: data for spurious IRQ handling
* @poll_timer: timer for IRQ polling
* @dir: /proc/irq/ procfs entry
* @name: flow handler name for /proc/interrupts output
@@ -191,9 +200,6 @@ struct irq_desc {

unsigned int depth; /* nested irq disables */
unsigned int wake_depth; /* nested wake enables */
- unsigned int irq_count; /* For detecting broken IRQs */
- unsigned long last_unhandled; /* Aging timer for unhandled count */
- unsigned int irqs_unhandled;
raw_spinlock_t lock;
#ifdef CONFIG_SMP
cpumask_var_t affinity;
@@ -206,6 +212,7 @@ struct irq_desc {
atomic_t threads_active;
wait_queue_head_t wait_for_threads;

+ struct irq_spr spr;
struct timer_list poll_timer;

#ifdef CONFIG_PROC_FS
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index db26ff0..45a87f5 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -40,8 +40,6 @@ static void dynamic_irq_init_x(unsigned int irq, bool keep_chip_data)
if (!keep_chip_data)
desc->chip_data = NULL;
desc->action = NULL;
- desc->irq_count = 0;
- desc->irqs_unhandled = 0;
#ifdef CONFIG_SMP
cpumask_setall(desc->affinity);
#ifdef CONFIG_GENERIC_PENDING_IRQ
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 088e5d6..1b24309 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -68,7 +68,7 @@ static inline void chip_bus_sync_unlock(unsigned int irq, struct irq_desc *desc)
static inline void print_irq_desc(unsigned int irq, struct irq_desc *desc)
{
printk("irq %d, desc: %p, depth: %d, count: %d, unhandled: %d\n",
- irq, desc, desc->depth, desc->irq_count, desc->irqs_unhandled);
+ irq, desc, desc->depth, desc->spr.nr_samples, desc->spr.nr_bad);
printk("->handle_irq(): %p, ", desc->handle_irq);
print_symbol("%s\n", (unsigned long)desc->handle_irq);
printk("->chip(): %p, ", desc->chip);
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index cf9ab65..5862bfc 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -801,10 +801,6 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
new->irq = irq;
*old_ptr = new;

- /* Reset broken irq detection when installing new handler */
- desc->irq_count = 0;
- desc->irqs_unhandled = 0;
-
raw_spin_unlock_irqrestore(&desc->lock, flags);

irq_poll_action_added(desc, new);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 09a2ee5..b072460 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -205,10 +205,11 @@ static const struct file_operations irq_node_proc_fops = {
static int irq_spurious_proc_show(struct seq_file *m, void *v)
{
struct irq_desc *desc = irq_to_desc((long) m->private);
+ struct irq_spr *spr = &desc->spr;

seq_printf(m, "count %u\n" "unhandled %u\n" "last_unhandled %u ms\n",
- desc->irq_count, desc->irqs_unhandled,
- jiffies_to_msecs(desc->last_unhandled));
+ spr->nr_samples, spr->nr_bad,
+ jiffies_to_msecs(spr->last_bad));
return 0;
}

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 0b8fd0a..3948666 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -8,6 +8,7 @@

#include <linux/jiffies.h>
#include <linux/irq.h>
+#include <linux/log2.h>
#include <linux/module.h>
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
@@ -25,6 +26,37 @@ enum {
IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */

IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
+
+ /*
+ * Spurious IRQ handling parameters.
+ *
+ * As this per-IRQ spurious handling is cheaper than the
+ * previous system wide spurious handling, it can afford to
+ * use more responsive settings but these parameters are still
+ * pretty conservative. If ever necessary, making it more
+ * responsive shouldn't cause any problem.
+ *
+ * Spurious IRQs are monitored in segments of PERIOD_SAMPLES
+ * IRQs which can stretch PERIOD_DURATION at maximum. If
+ * there are less than PERIOD_SAMPLES IRQs per
+ * PERIOD_DURATION, the period is considered good.
+ *
+ * If >=BAD_THRESHOLD IRQs are bad ones, the period is
+ * considered bad and spurious IRQ handling kicks in - the IRQ
+ * is disabled and polled. The IRQ is given another shot
+ * after certain number IRQs are handled, which is at minimum
+ * POLL_CNT_MIN, increased by 1 << POLL_CNT_INC_SHIFT times
+ * after each bad period and decreased by factor of
+ * POLL_CNT_INC_DEC_SHIFT after each good one.
+ */
+ IRQ_SPR_PERIOD_DURATION = 10 * HZ,
+ IRQ_SPR_PERIOD_SAMPLES = 10000,
+ IRQ_SPR_BAD_THRESHOLD = 9900,
+ IRQ_SPR_POLL_CNT_MIN = 10000,
+ IRQ_SPR_POLL_CNT_INF = UINT_MAX,
+ IRQ_SPR_POLL_CNT_INC_SHIFT = 3,
+ IRQ_SPR_POLL_CNT_DEC_SHIFT = 1,
+ IRQ_SPR_POLL_CNT_MAX_DEC_SHIFT = BITS_PER_BYTE * sizeof(int) / 4,
};

int noirqdebug __read_mostly;
@@ -77,8 +109,24 @@ static void irq_schedule_poll(struct irq_desc *desc, unsigned long intv)
mod_timer(&desc->poll_timer, expires);
}

+/* start a new spurious handling period */
+static void irq_spr_new_period(struct irq_spr *spr)
+{
+ spr->period_start = jiffies;
+ spr->nr_samples = 0;
+ spr->nr_bad = 0;
+}
+
+/* Reset spurious handling. After this, poll_timer will offline itself soon. */
+static void irq_spr_reset(struct irq_spr *spr)
+{
+ irq_spr_new_period(spr);
+ spr->poll_cnt = IRQ_SPR_POLL_CNT_MIN;
+ spr->poll_rem = 0;
+}
+
/*
- * Recovery handler for misrouted interrupts.
+ * Perform an actual poll.
*/
static int try_one_irq(int irq, struct irq_desc *desc)
{
@@ -161,91 +209,99 @@ static int misrouted_irq(int irq)
}

/*
- * If 99,900 of the previous 100,000 interrupts have not been handled
- * then assume that the IRQ is stuck in some manner. Drop a diagnostic
- * and try to turn the IRQ off.
- *
- * (The other 100-of-100,000 interrupts may have been a correctly
- * functioning device sharing an IRQ with the failing one)
- *
- * Called under desc->lock
+ * IRQ delivery notification function. Called after each IRQ delivery.
*/
-
-static void
-__report_bad_irq(unsigned int irq, struct irq_desc *desc,
- irqreturn_t action_ret)
-{
- if (action_ret != IRQ_HANDLED && action_ret != IRQ_NONE) {
- printk(KERN_ERR "irq event %d: bogus return value %x\n",
- irq, action_ret);
- } else {
- printk(KERN_ERR "irq %d: nobody cared (try booting with "
- "the \"irqpoll\" option)\n", irq);
- }
- dump_stack();
- print_irq_handlers(desc);
-}
-
-static void
-report_bad_irq(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret)
-{
- static int count = 100;
-
- if (count > 0) {
- count--;
- __report_bad_irq(irq, desc, action_ret);
- }
-}
-
void __note_interrupt(unsigned int irq, struct irq_desc *desc,
irqreturn_t action_ret)
{
- if (unlikely(action_ret != IRQ_HANDLED)) {
- /*
- * If we are seeing only the odd spurious IRQ caused by
- * bus asynchronicity then don't eventually trigger an error,
- * otherwise the counter becomes a doomsday timer for otherwise
- * working systems
- */
- if (time_after(jiffies, desc->last_unhandled + HZ/10))
- desc->irqs_unhandled = 1;
- else
- desc->irqs_unhandled++;
- desc->last_unhandled = jiffies;
- if (unlikely(action_ret != IRQ_NONE))
- report_bad_irq(irq, desc, action_ret);
- }
+ struct irq_spr *spr = &desc->spr;
+ unsigned long dur;
+ unsigned int cnt, abbr;
+ char unit = 'k';

- if (unlikely(irqfixup >= IRQFIXUP_MISROUTED &&
- action_ret == IRQ_NONE)) {
- int ok = misrouted_irq(irq);
- if (action_ret == IRQ_NONE)
- desc->irqs_unhandled -= ok;
+ /*
+ * Account for unhandled interrupt. We don't care whether
+ * spurious accounting update races with irq open/close and
+ * gets some values wrong. Do it w/o locking.
+ */
+ if (unlikely(action_ret != IRQ_HANDLED)) {
+ static int bogus_count = 100;
+
+ spr->last_bad = jiffies - INITIAL_JIFFIES;
+ spr->nr_bad++;
+ if (likely(action_ret == IRQ_NONE)) {
+ if (unlikely(irqfixup >= IRQFIXUP_MISROUTED &&
+ misrouted_irq(irq)))
+ spr->nr_bad--;
+ } else if (bogus_count > 0) {
+ bogus_count--;
+ printk(KERN_ERR "IRQ %u: bogus return value %x\n",
+ irq, action_ret);
+ dump_stack();
+ print_irq_handlers(desc);
+ }
}

- desc->irq_count++;
- if (likely(desc->irq_count < 100000))
+ /* did we finish this spurious period? */
+ spr->nr_samples++;
+ if (likely(spr->nr_samples < IRQ_SPR_PERIOD_SAMPLES))
return;

- desc->irq_count = 0;
- if (unlikely(desc->irqs_unhandled > 99900)) {
+ /* if so, was it a good one? */
+ dur = jiffies - spr->period_start;
+ if (likely(spr->nr_bad < IRQ_SPR_BAD_THRESHOLD ||
+ dur > IRQ_SPR_PERIOD_DURATION)) {
/*
- * The interrupt is stuck
+ * If longer than PERIOD_DURATION has passed, consider
+ * multiple good periods have happened.
*/
- __report_bad_irq(irq, desc, action_ret);
- /*
- * Now kill the IRQ
- */
- printk(KERN_EMERG "Disabling IRQ #%d\n", irq);
- desc->status |= IRQ_DISABLED | IRQ_SPURIOUS_DISABLED;
- desc->depth++;
- desc->chip->disable(irq);
+ int sft = IRQ_SPR_POLL_CNT_DEC_SHIFT *
+ (dur >> order_base_2(IRQ_SPR_PERIOD_DURATION));

- raw_spin_lock(&desc->lock);
- irq_schedule_poll(desc, IRQ_POLL_INTV);
- raw_spin_unlock(&desc->lock);
+ /* but don't kill poll_cnt at once */
+ sft = clamp(sft, 1, IRQ_SPR_POLL_CNT_MAX_DEC_SHIFT);
+
+ spr->poll_cnt >>= sft;
+ irq_spr_new_period(spr);
+ return;
}
- desc->irqs_unhandled = 0;
+
+ /*
+ * It was a bad one, start polling. This is a slow path and
+ * we're gonna be changing states which require proper
+ * synchronization, grab desc->lock.
+ */
+ raw_spin_lock(&desc->lock);
+
+ irq_spr_new_period(spr);
+
+ /* update spr_poll_cnt considering the lower and upper bounds */
+ cnt = max_t(unsigned int, spr->poll_cnt, IRQ_SPR_POLL_CNT_MIN);
+ spr->poll_cnt = cnt << IRQ_SPR_POLL_CNT_INC_SHIFT;
+ if (spr->poll_cnt < cnt) /* did it overflow? */
+ spr->poll_cnt = IRQ_SPR_POLL_CNT_INF;
+
+ /* whine, plug IRQ and kick poll timer */
+ abbr = cnt / 1000;
+ if (abbr > 1000) {
+ abbr /= 1000;
+ unit = 'm';
+ }
+ printk(KERN_ERR "IRQ %u: too many spurious IRQs, disabling and "
+ "polling for %u%c %umsec intervals.\n",
+ desc->irq, abbr, unit, jiffies_to_msecs(IRQ_POLL_INTV));
+ printk(KERN_ERR "IRQ %u: system performance may be affected\n",
+ desc->irq);
+ print_irq_handlers(desc);
+
+ desc->status |= IRQ_DISABLED | IRQ_SPURIOUS_DISABLED;
+ desc->depth++;
+ desc->chip->disable(desc->irq);
+
+ spr->poll_rem = cnt;
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
+
+ raw_spin_unlock(&desc->lock);
}

/*
@@ -254,48 +310,118 @@ void __note_interrupt(unsigned int irq, struct irq_desc *desc,
void poll_irq(unsigned long arg)
{
struct irq_desc *desc = (void *)arg;
+ struct irq_spr *spr = &desc->spr;
+ unsigned long intv = MAX_JIFFY_OFFSET;
+ bool reenable_irq = false;

raw_spin_lock_irq(&desc->lock);
+
+ /* poll the IRQ */
try_one_irq(desc->irq, desc);
- irq_schedule_poll(desc, IRQ_POLL_INTV);
+
+ /* take care of spurious handling */
+ if (spr->poll_rem) {
+ if (spr->poll_rem != IRQ_SPR_POLL_CNT_INF)
+ spr->poll_rem--;
+ if (spr->poll_rem)
+ intv = IRQ_POLL_INTV;
+ else
+ irq_spr_new_period(spr);
+ }
+ if (!spr->poll_rem)
+ reenable_irq = desc->status & IRQ_SPURIOUS_DISABLED;
+
+ /* need to poll again? */
+ if (intv < MAX_JIFFY_OFFSET)
+ irq_schedule_poll(desc, intv);
+
+ raw_spin_unlock_irq(&desc->lock);
+
+ if (!reenable_irq)
+ return;
+
+ /* need to do locking dance for chip_bus_lock() to reenable IRQ */
+ chip_bus_lock(desc->irq, desc);
+ raw_spin_lock_irq(&desc->lock);
+
+ /* make sure we haven't raced with anyone inbetween */
+ if (!spr->poll_rem && (desc->status & IRQ_SPURIOUS_DISABLED)) {
+ printk(KERN_INFO "IRQ %u: spurious polling finished, "
+ "reenabling IRQ\n", desc->irq);
+ __enable_irq(desc, desc->irq, false);
+ desc->status &= ~IRQ_SPURIOUS_DISABLED;
+ }
+
raw_spin_unlock_irq(&desc->lock);
+ chip_bus_sync_unlock(desc->irq, desc);
}

void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
{
+ struct irq_spr *spr = &desc->spr;
unsigned long flags;

raw_spin_lock_irqsave(&desc->lock, flags);

- /* if the interrupt was killed before, give it one more chance */
- if (desc->status & IRQ_SPURIOUS_DISABLED) {
- desc->status &= ~IRQ_SPURIOUS_DISABLED;
- __enable_irq(desc, desc->irq, false);
- }
-
- if ((action->flags & IRQF_SHARED) && irqfixup >= IRQFIXUP_POLL)
+ if ((action->flags & IRQF_SHARED) && irqfixup >= IRQFIXUP_POLL) {
+ if (!spr->poll_rem)
+ printk(KERN_INFO "IRQ %u: starting IRQFIXUP_POLL\n",
+ desc->irq);
+ spr->poll_rem = IRQ_SPR_POLL_CNT_INF;
irq_schedule_poll(desc, IRQ_POLL_INTV);
+ } else {
+ /* new irqaction registered, give the IRQ another chance */
+ irq_spr_reset(spr);
+ }

raw_spin_unlock_irqrestore(&desc->lock, flags);
}

void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
{
+ bool irq_enabled = false, timer_killed = false;
unsigned long flags;
+ int rc;

raw_spin_lock_irqsave(&desc->lock, flags);

+ /* give the IRQ another chance */
+ if (irqfixup < IRQFIXUP_POLL)
+ irq_spr_reset(&desc->spr);
+
/*
* Make sure the timer is offline if no irqaction is left as
* the irq_desc will be reinitialized when the next irqaction
- * is added.
+ * is added; otherwise, the timer can be left alone. It will
+ * offline itself if no longer necessary.
*/
- while (!desc->action && try_to_del_timer_sync(&desc->poll_timer) < 0) {
+ while (!desc->action) {
+ rc = try_to_del_timer_sync(&desc->poll_timer);
+ if (rc >= 0) {
+ timer_killed = rc > 0;
+ break;
+ }
raw_spin_unlock_irqrestore(&desc->lock, flags);
cpu_relax();
raw_spin_lock_irqsave(&desc->lock, flags);
}

+ /*
+ * If the timer was forcefully shut down, it might not have
+ * had the chance to reenable IRQ. Make sure it's enabled.
+ */
+ if (timer_killed && (desc->status & IRQ_SPURIOUS_DISABLED)) {
+ __enable_irq(desc, desc->irq, false);
+ desc->status &= ~IRQ_SPURIOUS_DISABLED;
+ irq_enabled = true;
+ }
+
+ if (timer_killed || irq_enabled)
+ printk(KERN_INFO "IRQ %u:%s%s%s\n", desc->irq,
+ timer_killed ? " polling stopped" : "",
+ timer_killed && irq_enabled ? " and" : "",
+ irq_enabled ? " IRQ reenabled" : "");
+
raw_spin_unlock_irqrestore(&desc->lock, flags);
}

--
1.6.4.2

2010-06-13 15:33:36

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 10/12] irq: add comment about overall design of lost/spurious IRQ handling

Give a general overview of the facility at the top of file and add
copyright notice.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/irq/spurious.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 73 insertions(+), 1 deletions(-)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 2d92113..329555f 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -2,8 +2,66 @@
* linux/kernel/irq/spurious.c
*
* Copyright (C) 1992, 1998-2004 Linus Torvalds, Ingo Molnar
+ * Copyright (C) 2010 SUSE Linux Products GmbH
+ * Copyright (C) 2010 Tejun Heo <[email protected]>
*
- * This file contains spurious interrupt handling.
+ * There are two ways interrupt handling can go wrong - too few or too
+ * many. Due to misrouting or other issues, sometimes IRQs don't
+ * reach the driver while at other times an interrupt line gets stuck
+ * and a continuous spurious interrupts are generated.
+ *
+ * This file implements workaround for both cases. Lost interrupts
+ * are handled by IRQ expecting and watching, and spurious interrupts
+ * by spurious polling. All mechanisms need IRQF_SHARED to be set on
+ * the irqaction in question.
+ *
+ * Both lost interrupt workarounds require cooperation from drivers
+ * and can be chosen depending on how much information the driver can
+ * provide.
+ *
+ * - IRQ expecting
+ *
+ * IRQ expecting is useful when the driver can tell when IRQs can be
+ * expected; in other words, when IRQs are used to signal completion
+ * of host initiated operations. This is the surest way to work
+ * around lost interrupts.
+ *
+ * When the controller is expected to raise an IRQ, the driver
+ * should call expect_irq() and, when the expected event happens or
+ * times out, unexpect_irq(). IRQ subsystem polls the interrupt
+ * inbetween.
+ *
+ * As interrupts tend to keep working if it works at the beginning,
+ * IRQ expecting implements "verified state". After certain number
+ * of successful IRQ deliveries, the irqaction becomes verified and
+ * much longer polling interval is used.
+ *
+ * - IRQ watching
+ *
+ * This can be used when the driver doesn't know when to exactly
+ * expect and unexpect IRQs. Once watch_irq() is called, the
+ * irqaction is slowly polled for certain amount of time (1min). If
+ * IRQs are missed during that time, the irqaction is marked and
+ * actively polled; otherwise, the watching is stopped.
+ *
+ * In the most basic case, drivers can call this right after
+ * registering an irqaction to verify IRQ delivery. In many cases,
+ * if IRQ works at the beginning, it keeps working, so just calling
+ * watch_irq() once can provide decent protection against misrouted
+ * IRQs. It would also be a good idea to call watch_irq() when
+ * timeouts are detected.
+ *
+ * - Spurious IRQ handling
+ *
+ * All IRQs are continuously monitored and spurious IRQ handling
+ * kicks in if there are too many spurious IRQs. The IRQ is
+ * disabled and the registered irqactions are polled. The IRQ is
+ * given another shot after certain number IRQs are handled or an
+ * irqaction is added or removed.
+ *
+ * All of the above three mechanisms can be used together. Spurious
+ * IRQ handling is enabled by default and drivers are free to expect
+ * and watch IRQs as they see fit.
*/

#include <linux/jiffies.h>
@@ -17,6 +75,20 @@

#include "internals.h"

+/*
+ * I spent quite some time thinking about each parameter but they
+ * still are just numbers pulled out of my ass. If you think your ass
+ * is prettier than mine, please go ahead and suggest better ones.
+ *
+ * Most parameters are intentionally fixed constants and not
+ * adjustable through API. The nature of IRQ delivery failures isn't
+ * usually dependent on specific drivers and the timing parameters are
+ * more about human perceivable latencies rather than any specific
+ * controller timing details, so figuring out constant values which
+ * can work for most cases shouldn't be too hard. This allows tighter
+ * control over polling behaviors, eases future changes and makes the
+ * interface easy for drivers.
+ */
enum {
/* irqfixup levels */
IRQFIXUP_SPURIOUS = 0, /* spurious storm detection */
--
1.6.4.2

2010-06-13 15:34:01

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 01/12] irq: cleanup irqfixup

Make the following cleanups to irqfixup.

* Define IRQFIXUP_{SPURIOUS|MISROUTED|POLL} and use them instead of
hard coding 0, 1 and 2.

* Add an inline note_interrupt() wrapper which checks noirqdebug and
calls __note_interrupt() instead of checking noirqdebug from each
caller.

Signed-off-by: Tejun Heo <[email protected]>
---
arch/arm/mach-ns9xxx/irq.c | 3 ---
arch/powerpc/platforms/cell/interrupt.c | 5 +----
include/linux/irq.h | 13 +++++++++++--
kernel/irq/chip.c | 18 ++++++------------
kernel/irq/handle.c | 6 ++----
kernel/irq/internals.h | 2 --
kernel/irq/spurious.c | 30 ++++++++++++++++++------------
7 files changed, 38 insertions(+), 39 deletions(-)

diff --git a/arch/arm/mach-ns9xxx/irq.c b/arch/arm/mach-ns9xxx/irq.c
index 038f24d..6be86e9 100644
--- a/arch/arm/mach-ns9xxx/irq.c
+++ b/arch/arm/mach-ns9xxx/irq.c
@@ -82,9 +82,6 @@ static void handle_prio_irq(unsigned int irq, struct irq_desc *desc)

action_ret = handle_IRQ_event(irq, action);

- /* XXX: There is no direct way to access noirqdebug, so check
- * unconditionally for spurious irqs...
- * Maybe this function should go to kernel/irq/chip.c? */
note_interrupt(irq, desc, action_ret);

raw_spin_lock(&desc->lock);
diff --git a/arch/powerpc/platforms/cell/interrupt.c b/arch/powerpc/platforms/cell/interrupt.c
index 10eb1a4..e638fb3 100644
--- a/arch/powerpc/platforms/cell/interrupt.c
+++ b/arch/powerpc/platforms/cell/interrupt.c
@@ -233,8 +233,6 @@ static int iic_host_match(struct irq_host *h, struct device_node *node)
"IBM,CBEA-Internal-Interrupt-Controller");
}

-extern int noirqdebug;
-
static void handle_iic_irq(unsigned int irq, struct irq_desc *desc)
{
raw_spin_lock(&desc->lock);
@@ -267,8 +265,7 @@ static void handle_iic_irq(unsigned int irq, struct irq_desc *desc)
desc->status &= ~IRQ_PENDING;
raw_spin_unlock(&desc->lock);
action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);
raw_spin_lock(&desc->lock);

} while ((desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING);
diff --git a/include/linux/irq.h b/include/linux/irq.h
index c03243a..ec93be4 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -324,8 +324,17 @@ static inline void generic_handle_irq(unsigned int irq)
}

/* Handling of unhandled and spurious interrupts: */
-extern void note_interrupt(unsigned int irq, struct irq_desc *desc,
- irqreturn_t action_ret);
+extern void __note_interrupt(unsigned int irq, struct irq_desc *desc,
+ irqreturn_t action_ret);
+
+static inline void note_interrupt(unsigned int irq, struct irq_desc *desc,
+ irqreturn_t action_ret)
+{
+ extern int noirqdebug;
+
+ if (!noirqdebug)
+ __note_interrupt(irq, desc, action_ret);
+}

/* Resending of interrupts :*/
void check_irq_resend(struct irq_desc *desc, unsigned int irq);
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index b7091d5..db26ff0 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -406,8 +406,7 @@ void handle_nested_irq(unsigned int irq)
raw_spin_unlock_irq(&desc->lock);

action_ret = action->thread_fn(action->irq, action->dev_id);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

raw_spin_lock_irq(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
@@ -450,8 +449,7 @@ handle_simple_irq(unsigned int irq, struct irq_desc *desc)
raw_spin_unlock(&desc->lock);

action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

raw_spin_lock(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
@@ -495,8 +493,7 @@ handle_level_irq(unsigned int irq, struct irq_desc *desc)
raw_spin_unlock(&desc->lock);

action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

raw_spin_lock(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
@@ -548,8 +545,7 @@ handle_fasteoi_irq(unsigned int irq, struct irq_desc *desc)
raw_spin_unlock(&desc->lock);

action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

raw_spin_lock(&desc->lock);
desc->status &= ~IRQ_INPROGRESS;
@@ -625,8 +621,7 @@ handle_edge_irq(unsigned int irq, struct irq_desc *desc)
desc->status &= ~IRQ_PENDING;
raw_spin_unlock(&desc->lock);
action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);
raw_spin_lock(&desc->lock);

} while ((desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING);
@@ -654,8 +649,7 @@ handle_percpu_irq(unsigned int irq, struct irq_desc *desc)
desc->chip->ack(irq);

action_ret = handle_IRQ_event(irq, desc->action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

if (desc->chip->eoi)
desc->chip->eoi(irq);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index 27e5c69..3ae50bf 100644
--- a/kernel/irq/handle.c
+++ b/kernel/irq/handle.c
@@ -461,8 +461,7 @@ unsigned int __do_IRQ(unsigned int irq)
desc->chip->ack(irq);
if (likely(!(desc->status & IRQ_DISABLED))) {
action_ret = handle_IRQ_event(irq, desc->action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);
}
desc->chip->end(irq);
return 1;
@@ -515,8 +514,7 @@ unsigned int __do_IRQ(unsigned int irq)
raw_spin_unlock(&desc->lock);

action_ret = handle_IRQ_event(irq, action);
- if (!noirqdebug)
- note_interrupt(irq, desc, action_ret);
+ note_interrupt(irq, desc, action_ret);

raw_spin_lock(&desc->lock);
if (likely(!(desc->status & IRQ_PENDING)))
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index c63f3bc..341f952 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -2,8 +2,6 @@
* IRQ subsystem internal functions and variables:
*/

-extern int noirqdebug;
-
/* Set default functions for irq_chip structures: */
extern void irq_chip_set_defaults(struct irq_chip *chip);

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 89fb90a..5da60a2 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -14,7 +14,15 @@
#include <linux/moduleparam.h>
#include <linux/timer.h>

-static int irqfixup __read_mostly;
+enum {
+ /* irqfixup levels */
+ IRQFIXUP_SPURIOUS = 0, /* spurious storm detection */
+ IRQFIXUP_MISROUTED = 1, /* misrouted IRQ fixup */
+ IRQFIXUP_POLL = 2, /* enable polling by default */
+};
+
+int noirqdebug __read_mostly;
+static int irqfixup __read_mostly = IRQFIXUP_SPURIOUS;

#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10)
static void poll_spurious_irqs(unsigned long dummy);
@@ -184,7 +192,7 @@ try_misrouted_irq(unsigned int irq, struct irq_desc *desc,
{
struct irqaction *action;

- if (!irqfixup)
+ if (irqfixup < IRQFIXUP_MISROUTED)
return 0;

/* We didn't actually handle the IRQ - see if it was misrouted? */
@@ -192,11 +200,11 @@ try_misrouted_irq(unsigned int irq, struct irq_desc *desc,
return 1;

/*
- * But for 'irqfixup == 2' we also do it for handled interrupts if
- * they are marked as IRQF_IRQPOLL (or for irq zero, which is the
- * traditional PC timer interrupt.. Legacy)
+ * But for IRQFIXUP_POLL we also do it for handled interrupts
+ * if they are marked as IRQF_IRQPOLL (or for irq zero, which
+ * is the traditional PC timer interrupt.. Legacy)
*/
- if (irqfixup < 2)
+ if (irqfixup < IRQFIXUP_POLL)
return 0;

if (!irq)
@@ -213,8 +221,8 @@ try_misrouted_irq(unsigned int irq, struct irq_desc *desc,
return action && (action->flags & IRQF_IRQPOLL);
}

-void note_interrupt(unsigned int irq, struct irq_desc *desc,
- irqreturn_t action_ret)
+void __note_interrupt(unsigned int irq, struct irq_desc *desc,
+ irqreturn_t action_ret)
{
if (unlikely(action_ret != IRQ_HANDLED)) {
/*
@@ -262,8 +270,6 @@ void note_interrupt(unsigned int irq, struct irq_desc *desc,
desc->irqs_unhandled = 0;
}

-int noirqdebug __read_mostly;
-
int noirqdebug_setup(char *str)
{
noirqdebug = 1;
@@ -278,7 +284,7 @@ MODULE_PARM_DESC(noirqdebug, "Disable irq lockup detection when true");

static int __init irqfixup_setup(char *str)
{
- irqfixup = 1;
+ irqfixup = max(irqfixup, IRQFIXUP_MISROUTED);
printk(KERN_WARNING "Misrouted IRQ fixup support enabled.\n");
printk(KERN_WARNING "This may impact system performance.\n");

@@ -290,7 +296,7 @@ module_param(irqfixup, int, 0644);

static int __init irqpoll_setup(char *str)
{
- irqfixup = 2;
+ irqfixup = IRQFIXUP_POLL;
printk(KERN_WARNING "Misrouted IRQ fixup and polling support "
"enabled\n");
printk(KERN_WARNING "This may significantly impact system "
--
1.6.4.2

2010-06-13 15:32:13

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 02/12] irq: make spurious poll timer per desc

Currently there is single timer for for spurious IRQ polling and when
it kicks in, it polls all the IRQs. Add irq_desc->poll_timer and use
it for spurious polling such that only the failed IRQ is polled. This
significantly reduces the cost of spurious polling and the polling
interval is adjusted to 10ms.

irq_poll_action_{added|removed}(), which are called from
{setup|free}_irq() respectively, are added so that poll timer
management is done inside spurious.c.

The global polling function poll_spurious_irqs() is replaced with
per-IRQ polling function poll_irq() and try_one_irq() is changed to
expect its callers to acquire desc->lock for upcoming extension of
poll_irq().

This reduces the overhead of spurious handling and eases implementing
further fine grained IRQ protection mechanisms on top.

Signed-off-by: Tejun Heo <[email protected]>
---
include/linux/irq.h | 5 +++
kernel/irq/internals.h | 6 +++
kernel/irq/manage.c | 14 +++-----
kernel/irq/spurious.c | 93 +++++++++++++++++++++++++++++------------------
4 files changed, 73 insertions(+), 45 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index ec93be4..50a77f9 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -21,6 +21,7 @@
#include <linux/irqreturn.h>
#include <linux/irqnr.h>
#include <linux/errno.h>
+#include <linux/timer.h>
#include <linux/topology.h>
#include <linux/wait.h>

@@ -169,6 +170,7 @@ struct irq_2_iommu;
* @pending_mask: pending rebalanced interrupts
* @threads_active: number of irqaction threads currently running
* @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
+ * @poll_timer: timer for IRQ polling
* @dir: /proc/irq/ procfs entry
* @name: flow handler name for /proc/interrupts output
*/
@@ -203,6 +205,9 @@ struct irq_desc {
#endif
atomic_t threads_active;
wait_queue_head_t wait_for_threads;
+
+ struct timer_list poll_timer;
+
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
#endif
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 341f952..088e5d6 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -38,6 +38,12 @@ extern int irq_select_affinity_usr(unsigned int irq);

extern void irq_set_thread_affinity(struct irq_desc *desc);

+extern void poll_irq(unsigned long arg);
+extern void irq_poll_action_added(struct irq_desc *desc,
+ struct irqaction *action);
+extern void irq_poll_action_removed(struct irq_desc *desc,
+ struct irqaction *action);
+
/* Inline functions for support of irq chips on slow busses */
static inline void chip_bus_lock(unsigned int irq, struct irq_desc *desc)
{
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 3164ba7..cf9ab65 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -752,6 +752,7 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
irq_chip_set_defaults(desc->chip);

init_waitqueue_head(&desc->wait_for_threads);
+ setup_timer(&desc->poll_timer, poll_irq, (unsigned long)desc);

/* Setup the type (level, edge polarity) if configured: */
if (new->flags & IRQF_TRIGGER_MASK) {
@@ -804,17 +805,10 @@ __setup_irq(unsigned int irq, struct irq_desc *desc, struct irqaction *new)
desc->irq_count = 0;
desc->irqs_unhandled = 0;

- /*
- * Check whether we disabled the irq via the spurious handler
- * before. Reenable it and give it another chance.
- */
- if (shared && (desc->status & IRQ_SPURIOUS_DISABLED)) {
- desc->status &= ~IRQ_SPURIOUS_DISABLED;
- __enable_irq(desc, irq, false);
- }
-
raw_spin_unlock_irqrestore(&desc->lock, flags);

+ irq_poll_action_added(desc, new);
+
/*
* Strictly no need to wake it up, but hung_task complains
* when no hard interrupt wakes the thread up.
@@ -930,6 +924,8 @@ static struct irqaction *__free_irq(unsigned int irq, void *dev_id)

raw_spin_unlock_irqrestore(&desc->lock, flags);

+ irq_poll_action_removed(desc, action);
+
unregister_handler_proc(irq, action);

/* Make sure it's not being used on another CPU: */
diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 5da60a2..545f730 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -12,22 +12,22 @@
#include <linux/kallsyms.h>
#include <linux/interrupt.h>
#include <linux/moduleparam.h>
-#include <linux/timer.h>
+
+#include "internals.h"

enum {
/* irqfixup levels */
IRQFIXUP_SPURIOUS = 0, /* spurious storm detection */
IRQFIXUP_MISROUTED = 1, /* misrouted IRQ fixup */
IRQFIXUP_POLL = 2, /* enable polling by default */
+
+ /* IRQ polling common parameters */
+ IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
};

int noirqdebug __read_mostly;
static int irqfixup __read_mostly = IRQFIXUP_SPURIOUS;

-#define POLL_SPURIOUS_IRQ_INTERVAL (HZ/10)
-static void poll_spurious_irqs(unsigned long dummy);
-static DEFINE_TIMER(poll_spurious_irq_timer, poll_spurious_irqs, 0, 0);
-
/*
* Recovery handler for misrouted interrupts.
*/
@@ -36,7 +36,6 @@ static int try_one_irq(int irq, struct irq_desc *desc)
struct irqaction *action;
int ok = 0, work = 0;

- raw_spin_lock(&desc->lock);
/* Already running on another processor */
if (desc->status & IRQ_INPROGRESS) {
/*
@@ -45,7 +44,6 @@ static int try_one_irq(int irq, struct irq_desc *desc)
*/
if (desc->action && (desc->action->flags & IRQF_SHARED))
desc->status |= IRQ_PENDING;
- raw_spin_unlock(&desc->lock);
return ok;
}
/* Honour the normal IRQ locking */
@@ -88,7 +86,6 @@ static int try_one_irq(int irq, struct irq_desc *desc)
*/
if (work && desc->chip && desc->chip->end)
desc->chip->end(irq);
- raw_spin_unlock(&desc->lock);

return ok;
}
@@ -105,39 +102,15 @@ static int misrouted_irq(int irq)
if (i == irq) /* Already tried */
continue;

+ raw_spin_lock(&desc->lock);
if (try_one_irq(i, desc))
ok = 1;
+ raw_spin_unlock(&desc->lock);
}
/* So the caller can adjust the irq error counts */
return ok;
}

-static void poll_spurious_irqs(unsigned long dummy)
-{
- struct irq_desc *desc;
- int i;
-
- for_each_irq_desc(i, desc) {
- unsigned int status;
-
- if (!i)
- continue;
-
- /* Racy but it doesn't matter */
- status = desc->status;
- barrier();
- if (!(status & IRQ_SPURIOUS_DISABLED))
- continue;
-
- local_irq_disable();
- try_one_irq(i, desc);
- local_irq_enable();
- }
-
- mod_timer(&poll_spurious_irq_timer,
- jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
-}
-
/*
* If 99,900 of the previous 100,000 interrupts have not been handled
* then assume that the IRQ is stuck in some manner. Drop a diagnostic
@@ -264,12 +237,60 @@ void __note_interrupt(unsigned int irq, struct irq_desc *desc,
desc->depth++;
desc->chip->disable(irq);

- mod_timer(&poll_spurious_irq_timer,
- jiffies + POLL_SPURIOUS_IRQ_INTERVAL);
+ mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
}
desc->irqs_unhandled = 0;
}

+/*
+ * IRQ poller. Called from desc->poll_timer.
+ */
+void poll_irq(unsigned long arg)
+{
+ struct irq_desc *desc = (void *)arg;
+
+ raw_spin_lock_irq(&desc->lock);
+ try_one_irq(desc->irq, desc);
+ raw_spin_unlock_irq(&desc->lock);
+
+ mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
+}
+
+void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ /* if the interrupt was killed before, give it one more chance */
+ if (desc->status & IRQ_SPURIOUS_DISABLED) {
+ desc->status &= ~IRQ_SPURIOUS_DISABLED;
+ __enable_irq(desc, desc->irq, false);
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
+void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
+{
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&desc->lock, flags);
+
+ /*
+ * Make sure the timer is offline if no irqaction is left as
+ * the irq_desc will be reinitialized when the next irqaction
+ * is added.
+ */
+ while (!desc->action && try_to_del_timer_sync(&desc->poll_timer) < 0) {
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+ cpu_relax();
+ raw_spin_lock_irqsave(&desc->lock, flags);
+ }
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
+}
+
int noirqdebug_setup(char *str)
{
noirqdebug = 1;
--
1.6.4.2

2010-06-13 15:34:35

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 03/12] irq: use desc->poll_timer for irqpoll

Due to tickless and missing IRQF_IRQPOLL flags, irqpoll has been
broken in many configurations for quite some time. Make irqpoll
global polling use desc->poll_timer instead. It's simpler and more
reliable.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/irq/spurious.c | 41 +++++------------------------------------
1 files changed, 5 insertions(+), 36 deletions(-)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 545f730..fc18a13 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -159,41 +159,6 @@ report_bad_irq(unsigned int irq, struct irq_desc *desc, irqreturn_t action_ret)
}
}

-static inline int
-try_misrouted_irq(unsigned int irq, struct irq_desc *desc,
- irqreturn_t action_ret)
-{
- struct irqaction *action;
-
- if (irqfixup < IRQFIXUP_MISROUTED)
- return 0;
-
- /* We didn't actually handle the IRQ - see if it was misrouted? */
- if (action_ret == IRQ_NONE)
- return 1;
-
- /*
- * But for IRQFIXUP_POLL we also do it for handled interrupts
- * if they are marked as IRQF_IRQPOLL (or for irq zero, which
- * is the traditional PC timer interrupt.. Legacy)
- */
- if (irqfixup < IRQFIXUP_POLL)
- return 0;
-
- if (!irq)
- return 1;
-
- /*
- * Since we don't get the descriptor lock, "action" can
- * change under us. We don't really care, but we don't
- * want to follow a NULL pointer. So tell the compiler to
- * just load it once by using a barrier.
- */
- action = desc->action;
- barrier();
- return action && (action->flags & IRQF_IRQPOLL);
-}
-
void __note_interrupt(unsigned int irq, struct irq_desc *desc,
irqreturn_t action_ret)
{
@@ -213,7 +178,8 @@ void __note_interrupt(unsigned int irq, struct irq_desc *desc,
report_bad_irq(irq, desc, action_ret);
}

- if (unlikely(try_misrouted_irq(irq, desc, action_ret))) {
+ if (unlikely(irqfixup >= IRQFIXUP_MISROUTED &&
+ action_ret == IRQ_NONE)) {
int ok = misrouted_irq(irq);
if (action_ret == IRQ_NONE)
desc->irqs_unhandled -= ok;
@@ -269,6 +235,9 @@ void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
}

raw_spin_unlock_irqrestore(&desc->lock, flags);
+
+ if ((action->flags & IRQF_SHARED) && irqfixup >= IRQFIXUP_POLL)
+ mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
}

void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
--
1.6.4.2

2010-06-13 15:34:34

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 06/12] irq: implement irq_schedule_poll()

Implement and use irq_schedule_poll() to schedule desc->poll_timer
instead of calling mod_timer directly. irq_schedule_poll() is called
with desc->lock held and schedules the timer iff necessary - ie. if
the timer is offline or scheduled to expire later than requested.
This will be used to share desc->poll_timer.

Signed-off-by: Tejun Heo <[email protected]>
---
kernel/irq/spurious.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 files changed, 41 insertions(+), 6 deletions(-)

diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
index 0bce0e3..0b8fd0a 100644
--- a/kernel/irq/spurious.c
+++ b/kernel/irq/spurious.c
@@ -23,6 +23,8 @@ enum {

/* IRQ polling common parameters */
IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
+
+ IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
};

int noirqdebug __read_mostly;
@@ -43,6 +45,38 @@ static void print_irq_handlers(struct irq_desc *desc)
}
}

+static unsigned long irq_poll_slack(unsigned long intv)
+{
+ return IRQ_POLL_SLACK;
+}
+
+/**
+ * irq_schedule_poll - schedule IRQ poll
+ * @desc: IRQ desc to schedule poll for
+ * @intv: poll interval
+ *
+ * Schedules @desc->poll_timer. If the timer is already scheduled,
+ * it's modified iff jiffies + @intv + slack is before the timer's
+ * expires. poll_timers aren't taken offline behind this function's
+ * back and the users of this function are guaranteed that poll_irq()
+ * will be called at or before jiffies + @intv + slack.
+ *
+ * CONTEXT:
+ * desc->lock
+ */
+static void irq_schedule_poll(struct irq_desc *desc, unsigned long intv)
+{
+ unsigned long expires = jiffies + intv;
+ int slack = irq_poll_slack(intv);
+
+ if (timer_pending(&desc->poll_timer) &&
+ time_before_eq(desc->poll_timer.expires, expires + slack))
+ return;
+
+ set_timer_slack(&desc->poll_timer, slack);
+ mod_timer(&desc->poll_timer, expires);
+}
+
/*
* Recovery handler for misrouted interrupts.
*/
@@ -207,7 +241,9 @@ void __note_interrupt(unsigned int irq, struct irq_desc *desc,
desc->depth++;
desc->chip->disable(irq);

- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
+ raw_spin_lock(&desc->lock);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
+ raw_spin_unlock(&desc->lock);
}
desc->irqs_unhandled = 0;
}
@@ -221,9 +257,8 @@ void poll_irq(unsigned long arg)

raw_spin_lock_irq(&desc->lock);
try_one_irq(desc->irq, desc);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
raw_spin_unlock_irq(&desc->lock);
-
- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
}

void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
@@ -238,10 +273,10 @@ void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
__enable_irq(desc, desc->irq, false);
}

- raw_spin_unlock_irqrestore(&desc->lock, flags);
-
if ((action->flags & IRQF_SHARED) && irqfixup >= IRQFIXUP_POLL)
- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
}

void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)
--
1.6.4.2

2010-06-14 09:22:08

by Jiri Slaby

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On 06/13/2010 05:31 PM, Tejun Heo wrote:
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
...

> @@ -25,9 +26,43 @@ enum {
> /* IRQ polling common parameters */
> IRQ_POLL_SLOW_INTV = 3 * HZ, /* not too slow for ppl, slow enough for machine */
> IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
> + IRQ_POLL_QUICK_INTV = HZ / 1000, /* pretty quick but not too taxing */
>
> IRQ_POLL_SLOW_SLACK = HZ,
> IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
> + IRQ_POLL_QUICK_SLACK = HZ / 10000, /* 10% slack */

Hi. These are zeros on most systems (assuming distros set HZ=100 and
250), what is their purpose then?

regards,
--
js

2010-06-14 09:45:04

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Hello,

On 06/14/2010 11:21 AM, Jiri Slaby wrote:
> On 06/13/2010 05:31 PM, Tejun Heo wrote:
>> --- a/kernel/irq/spurious.c
>> +++ b/kernel/irq/spurious.c
> ...
>
>> @@ -25,9 +26,43 @@ enum {
>> /* IRQ polling common parameters */
>> IRQ_POLL_SLOW_INTV = 3 * HZ, /* not too slow for ppl, slow enough for machine */
>> IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
>> + IRQ_POLL_QUICK_INTV = HZ / 1000, /* pretty quick but not too taxing */
>>
>> IRQ_POLL_SLOW_SLACK = HZ,
>> IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
>> + IRQ_POLL_QUICK_SLACK = HZ / 10000, /* 10% slack */
>
> Hi. These are zeros on most systems (assuming distros set HZ=100 and
> 250), what is their purpose then?

On every tick and no slack. :-)

--
tejun

2010-06-14 09:47:34

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On 06/14/2010 11:43 AM, Tejun Heo wrote:
> Hello,
>
> On 06/14/2010 11:21 AM, Jiri Slaby wrote:
>> On 06/13/2010 05:31 PM, Tejun Heo wrote:
>>> --- a/kernel/irq/spurious.c
>>> +++ b/kernel/irq/spurious.c
>> ...
>>
>>> @@ -25,9 +26,43 @@ enum {
>>> /* IRQ polling common parameters */
>>> IRQ_POLL_SLOW_INTV = 3 * HZ, /* not too slow for ppl, slow enough for machine */
>>> IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
>>> + IRQ_POLL_QUICK_INTV = HZ / 1000, /* pretty quick but not too taxing */
>>>
>>> IRQ_POLL_SLOW_SLACK = HZ,
>>> IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
>>> + IRQ_POLL_QUICK_SLACK = HZ / 10000, /* 10% slack */
>>
>> Hi. These are zeros on most systems (assuming distros set HZ=100 and
>> 250), what is their purpose then?
>
> On every tick and no slack. :-)

Hmmm... but yeah, it would be better to make IRQ_POLL_SLACK HZ / 250
so that we at least have one tick slack on 250HZ configs which are
pretty common these days.

Thanks.

--
tejun

2010-06-14 21:41:42

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Sun, Jun 13, 2010 at 05:31:38PM +0200, Tejun Heo wrote:
> Ask IRQ subsystem to watch HCD IRQ line after initialization. This at
> least keeps USB ports which are occupied on initialization working and
> eases bug reporting and debugging.
>
> Signed-off-by: Tejun Heo <[email protected]>
> ---
> drivers/usb/core/hcd.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/usb/core/hcd.c b/drivers/usb/core/hcd.c
> index 12742f1..383875f 100644
> --- a/drivers/usb/core/hcd.c
> +++ b/drivers/usb/core/hcd.c
> @@ -2270,6 +2270,7 @@ int usb_add_hcd(struct usb_hcd *hcd,
> "request interrupt %d failed\n", irqnum);
> goto err_request_irq;
> }
> + watch_irq(irqnum, hcd);

So if there's a routing problem, it turns into a polled interrupt? Do
we really want that? I wonder how long people will run without
realizing that there are problems with their system if their devices
still work.

Other than that minor comment, it all looks good to me.

thanks,

greg k-h

2010-06-14 21:53:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

Hello,

On 06/14/2010 11:41 PM, Greg KH wrote:
> So if there's a routing problem, it turns into a polled interrupt? Do
> we really want that?

Oh yeah, I really want that for libata. Routing is only part of the
problem and flaky IRQ is something we have to learn to cope with.

> I wonder how long people will run without realizing that there are
> problems with their system if their devices still work.

I think things would be better this way. If the drives (both cd and
hard) / input devices are not accessible, most people would simply
give up rather than reporting, and many cases are transient problems
which happen only once in the blue moon.

It would be great if some kind of automatic reporting can be used
(similar to kerneloops?). Hmm... maybe make the warnings scarier?

Thanks.

--
tejun

2010-06-14 22:17:19

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Mon, Jun 14, 2010 at 11:52:10PM +0200, Tejun Heo wrote:
> Hello,
>
> On 06/14/2010 11:41 PM, Greg KH wrote:
> > So if there's a routing problem, it turns into a polled interrupt? Do
> > we really want that?
>
> Oh yeah, I really want that for libata. Routing is only part of the
> problem and flaky IRQ is something we have to learn to cope with.
>
> > I wonder how long people will run without realizing that there are
> > problems with their system if their devices still work.
>
> I think things would be better this way. If the drives (both cd and
> hard) / input devices are not accessible, most people would simply
> give up rather than reporting, and many cases are transient problems
> which happen only once in the blue moon.
>
> It would be great if some kind of automatic reporting can be used
> (similar to kerneloops?). Hmm... maybe make the warnings scarier?

If you throw a WARN(), then kerneloops will pick it up, so you should be
fine there.

thanks,

greg k-h

2010-06-14 22:20:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

cc'ing Kay, hi.

On 06/14/2010 11:52 PM, Tejun Heo wrote:
> Hello,
>
> On 06/14/2010 11:41 PM, Greg KH wrote:
>> So if there's a routing problem, it turns into a polled interrupt? Do
>> we really want that?
>
> Oh yeah, I really want that for libata. Routing is only part of the
> problem and flaky IRQ is something we have to learn to cope with.
>
>> I wonder how long people will run without realizing that there are
>> problems with their system if their devices still work.
>
> I think things would be better this way. If the drives (both cd and
> hard) / input devices are not accessible, most people would simply
> give up rather than reporting, and many cases are transient problems
> which happen only once in the blue moon.
>
> It would be great if some kind of automatic reporting can be used
> (similar to kerneloops?). Hmm... maybe make the warnings scarier?

Hmm... maybe what we can do is generating an uevent when an IRQ is
confirmed to be bad and then let udev notify the user. That way we'll
probably have better chance of getting bug reports and users have
whiny but working system.

Thanks.

--
tejun

2010-06-15 10:36:46

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Tue, Jun 15, 2010 at 00:19, Tejun Heo <[email protected]> wrote:
> On 06/14/2010 11:52 PM, Tejun Heo wrote:
>> On 06/14/2010 11:41 PM, Greg KH wrote:
>>> So if there's a routing problem, it turns into a polled interrupt?  Do
>>> we really want that?
>>
>> Oh yeah, I really want that for libata.  Routing is only part of the
>> problem and flaky IRQ is something we have to learn to cope with.
>>
>>> I wonder how long people will run without realizing that there are
>>> problems with their system if their devices still work.
>>
>> I think things would be better this way.  If the drives (both cd and
>> hard) / input devices are not accessible, most people would simply
>> give up rather than reporting, and many cases are transient problems
>> which happen only once in the blue moon.
>>
>> It would be great if some kind of automatic reporting can be used
>> (similar to kerneloops?).  Hmm... maybe make the warnings scarier?
>
> Hmm... maybe what we can do is generating an uevent when an IRQ is
> confirmed to be bad and then let udev notify the user.  That way we'll
> probably have better chance of getting bug reports and users have
> whiny but working system.

Not really, uevents are not picked up by anything that could report an
error to userspace, they are just seen by udev. Also uevents are
usually not the proper passing method. They are not meant to ever
transport higher frequency events, or structured data. They cause to
run the entire udev rule matching machine, and update symlinks and
permissions with every event.

We will need some better error reporting facility. On Linux you don't
even get notified when the kernel mounts your filesystem read-only
because of an error. It will only end up in 'dmesg' as a pretty much
undefined bunch of words. :)

We will need some generic error reporting facility, with structured
data exported, and where userspace stuff can subscribe to.
Uevents/udev can not really properly provide such infrastructure.
Maybe that can be extended somehow, but using kobject_uevent() and
trigger the usual udev rule engine is not what we are looking for, for
sane error reporting.

Kay

2010-06-15 11:05:25

by Jean Delvare

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Tue, 15 Jun 2010 12:30:00 +0200, Kay Sievers wrote:
> On Tue, Jun 15, 2010 at 00:19, Tejun Heo <[email protected]> wrote:
> > Hmm... maybe what we can do is generating an uevent when an IRQ is
> > confirmed to be bad and then let udev notify the user.  That way we'll
> > probably have better chance of getting bug reports and users have
> > whiny but working system.
>
> Not really, uevents are not picked up by anything that could report an
> error to userspace, they are just seen by udev. Also uevents are
> usually not the proper passing method. They are not meant to ever
> transport higher frequency events, or structured data. They cause to
> run the entire udev rule matching machine, and update symlinks and
> permissions with every event.
>
> We will need some better error reporting facility. On Linux you don't
> even get notified when the kernel mounts your filesystem read-only
> because of an error. It will only end up in 'dmesg' as a pretty much
> undefined bunch of words. :)
>
> We will need some generic error reporting facility, with structured
> data exported, and where userspace stuff can subscribe to.
> Uevents/udev can not really properly provide such infrastructure.
> Maybe that can be extended somehow, but using kobject_uevent() and
> trigger the usual udev rule engine is not what we are looking for, for
> sane error reporting.

Random idea of the day (I don't know anything about it all): let the
kernel connect to D-Bus and use it somehow?

--
Jean Delvare

2010-06-15 11:22:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

Hello, Kay.

On 06/15/2010 12:30 PM, Kay Sievers wrote:
>> Hmm... maybe what we can do is generating an uevent when an IRQ is
>> confirmed to be bad and then let udev notify the user. That way we'll
>> probably have better chance of getting bug reports and users have
>> whiny but working system.
>
> Not really, uevents are not picked up by anything that could report an
> error to userspace, they are just seen by udev. Also uevents are
> usually not the proper passing method. They are not meant to ever
> transport higher frequency events, or structured data. They cause to
> run the entire udev rule matching machine, and update symlinks and
> permissions with every event.

Oh, these will be very low frequency event. At most once per
irq_expect or irqaction. e.g. on a machine with four hard drives, it
can only happen four times after boot unless the driver is unloaded
and reloaded, so from frequency standpoint it should be okay.

> We will need some better error reporting facility. On Linux you don't
> even get notified when the kernel mounts your filesystem read-only
> because of an error. It will only end up in 'dmesg' as a pretty much
> undefined bunch of words. :)

That one is a very low frequency too.

> We will need some generic error reporting facility, with structured
> data exported, and where userspace stuff can subscribe to.
> Uevents/udev can not really properly provide such infrastructure.
> Maybe that can be extended somehow, but using kobject_uevent() and
> trigger the usual udev rule engine is not what we are looking for, for
> sane error reporting.

It's true that there are higher frequency events which will be
horrible to report via kobject_uevent(). Hmmm... but the thing is
that events which should annoy the user by userland notification can't
be definition high freq. There's only so many users can recognize and
respond, so the frequency limitation of uevent might actually fit here
although it would be nice to have some kind of safety mechanism.
Still no go for uevent?

Thanks.

--
tejun

2010-06-15 13:30:31

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Tue, Jun 15, 2010 at 13:05, Jean Delvare <[email protected]> wrote:
> On Tue, 15 Jun 2010 12:30:00 +0200, Kay Sievers wrote:
>> On Tue, Jun 15, 2010 at 00:19, Tejun Heo <[email protected]> wrote:
>> > Hmm... maybe what we can do is generating an uevent when an IRQ is
>> > confirmed to be bad and then let udev notify the user.  That way we'll
>> > probably have better chance of getting bug reports and users have
>> > whiny but working system.
>>
>> Not really, uevents are not picked up by anything that could report an
>> error to userspace, they are just seen by udev. Also uevents are
>> usually not the proper passing method. They are not meant to ever
>> transport higher frequency events, or structured data. They cause to
>> run the entire udev rule matching machine, and update symlinks and
>> permissions with every event.
>>
>> We will need some better error reporting facility. On Linux you don't
>> even get notified when the kernel mounts your filesystem read-only
>> because of an error. It will only end up in 'dmesg' as a pretty much
>> undefined bunch of words. :)
>>
>> We will need some generic error reporting facility, with structured
>> data exported, and where userspace stuff can subscribe to.
>> Uevents/udev can not really properly provide such infrastructure.
>> Maybe that can be extended somehow, but using kobject_uevent() and
>> trigger the usual udev rule engine is not what we are looking for, for
>> sane error reporting.
>
> Random idea of the day (I don't know anything about it all): let the
> kernel connect to D-Bus and use it somehow?

Yeah, D-Bus is an peer-to-peer IPC mechanism/protocol. The D-Bus
daemon can filter and multiplex/distibute messages.

It's very similar to what we can do with netlink. The netlink
multicast stuff can even provide lots of the functionality the D-Bus
daemon provides.

I think we should avoid the D-Bus complexity for the very low-level
stuff. Very much like udev is not using it, but has efficient
in-kernel message filtering based on Berkeley Packet Filters, and
multiple listeners event subscription/distribution based on netlink
multicast functionality.

Not sure if netlink is the right answer here, but it's surely easier
to handle than D-Bus, and would provide a very similar functionality.

Kay

2010-06-15 13:36:58

by Kay Sievers

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Tue, Jun 15, 2010 at 13:20, Tejun Heo <[email protected]> wrote:
> On 06/15/2010 12:30 PM, Kay Sievers wrote:
>>> Hmm... maybe what we can do is generating an uevent when an IRQ is
>>> confirmed to be bad and then let udev notify the user.  That way we'll
>>> probably have better chance of getting bug reports and users have
>>> whiny but working system.
>>
>> Not really, uevents are not picked up by anything that could report an
>> error to userspace, they are just seen by udev. Also uevents are
>> usually not the proper passing method. They are not meant to ever
>> transport higher frequency events, or structured data. They cause to
>> run the entire udev rule matching machine, and update symlinks and
>> permissions with every event.
>
> Oh, these will be very low frequency event.  At most once per
> irq_expect or irqaction.  e.g. on a machine with four hard drives, it
> can only happen four times after boot unless the driver is unloaded
> and reloaded, so from frequency standpoint it should be okay.
>
>> We will need some better error reporting facility. On Linux you don't
>> even get notified when the kernel mounts your filesystem read-only
>> because of an error. It will only end up in 'dmesg' as a pretty much
>> undefined bunch of words. :)
>
> That one is a very low frequency too.
>
>> We will need some generic error reporting facility, with structured
>> data exported, and where userspace stuff can subscribe to.
>> Uevents/udev can not really properly provide such infrastructure.
>> Maybe that can be extended somehow, but using kobject_uevent() and
>> trigger the usual udev rule engine is not what we are looking for, for
>> sane error reporting.
>
> It's true that there are higher frequency events which will be
> horrible to report via kobject_uevent().  Hmmm... but the thing is
> that events which should annoy the user by userland notification can't
> be definition high freq.  There's only so many users can recognize and
> respond, so the frequency limitation of uevent might actually fit here
> although it would be nice to have some kind of safety mechanism.
> Still no go for uevent?

Yeah, I'm pretty sure that's not what we want. We want structured
data, and a generic channel to pass all sort of errors through, and a
userspace part to handle it in a sane way. Many error sources may also
not have a device path in /sys, and therefore no uevent to send.
Uevents/udev just seem so convinient because it's already there, but I
think, has too many limitations to provide the needed functionality.
Besides the fact that nothing listens to these events in userspace
today -- it's a lot more to think through and work on than passing
things through uevents, especially some generic classification and
structured data passing, which is needed, instead of the current
free-text 'dmsg' or the property-based stuff in uevents. I'm very sure
it's the wrong facility to use.

Kay

2010-06-15 13:51:57

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 02/12] irq: make spurious poll timer per desc

> @@ -169,6 +170,7 @@ struct irq_2_iommu;
> * @pending_mask: pending rebalanced interrupts
> * @threads_active: number of irqaction threads currently running
> * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
> + * @poll_timer: timer for IRQ polling

I looked over the patches and they all look good to my eyes.

The only thing that looked out of place was the above comment.
Would it make sense to move the comment one tab to the left?

> * @dir: /proc/irq/ procfs entry
> * @name: flow handler name for /proc/interrupts output
> */

2010-06-15 16:35:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 02/12] irq: make spurious poll timer per desc

On 06/15/2010 07:10 AM, Konrad Rzeszutek Wilk wrote:
>> @@ -169,6 +170,7 @@ struct irq_2_iommu;
>> * @pending_mask: pending rebalanced interrupts
>> * @threads_active: number of irqaction threads currently running
>> * @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
>> + * @poll_timer: timer for IRQ polling
>
> I looked over the patches and they all look good to my eyes.
>
> The only thing that looked out of place was the above comment.
> Would it make sense to move the comment one tab to the left?

When applied, it will show up at the right place. It's because the
first tab ends next to ':' and the leading '+' ends up pushing it to
the next tab position.

Thanks.

--
tejun

2010-06-15 17:37:23

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

Hello,

On 06/15/2010 03:36 PM, Kay Sievers wrote:
> Yeah, I'm pretty sure that's not what we want. We want structured
> data, and a generic channel to pass all sort of errors through, and a
> userspace part to handle it in a sane way. Many error sources may also
> not have a device path in /sys, and therefore no uevent to send.
> Uevents/udev just seem so convinient because it's already there, but I
> think, has too many limitations to provide the needed functionality.
> Besides the fact that nothing listens to these events in userspace
> today -- it's a lot more to think through and work on than passing
> things through uevents, especially some generic classification and
> structured data passing, which is needed, instead of the current
> free-text 'dmsg' or the property-based stuff in uevents. I'm very sure
> it's the wrong facility to use.

Yeah, well, if you say so. It would be very nice to have report this
type of critical events in somewhat formatted way so that they can be
processed automatically and presented to the user in more accessible
manner. I doubt requiring strict structure would work in the long
run. It'll likely end up being able to cover only portion of what's
originally designed and and stagnate in time. I think control
information including identification and severity + free form string
would be much more manageable.

It would really be great to have something like that. I can easily
think of several libata events the user should be notified of from the
top of my head but currently are buried in dmesg.

Thanks.

--
tejun

2010-06-15 17:41:04

by Jonathan Corbet

[permalink] [raw]
Subject: Re: [PATCH 06/12] irq: implement irq_schedule_poll()

On Sun, 13 Jun 2010 17:31:32 +0200
Tejun Heo <[email protected]> wrote:

> Implement and use irq_schedule_poll() to schedule desc->poll_timer
> instead of calling mod_timer directly. irq_schedule_poll() is called
> with desc->lock held and schedules the timer iff necessary - ie. if
> the timer is offline or scheduled to expire later than requested.
> This will be used to share desc->poll_timer.

One quick question...

> diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
> index 0bce0e3..0b8fd0a 100644
> --- a/kernel/irq/spurious.c
> +++ b/kernel/irq/spurious.c
> @@ -23,6 +23,8 @@ enum {
>
> /* IRQ polling common parameters */
> IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
> +
> + IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
> };

Why would IRQ_POLL_SLACK be anything other than zero for HZ < 1000?
It seems like you'd get no slack at all on most systems. Am I missing
something?

Thanks,

jon

2010-06-15 17:48:32

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Tue, Jun 15, 2010 at 07:36:12PM +0200, Tejun Heo wrote:
> Hello,
>
> On 06/15/2010 03:36 PM, Kay Sievers wrote:
> > Yeah, I'm pretty sure that's not what we want. We want structured
> > data, and a generic channel to pass all sort of errors through, and a
> > userspace part to handle it in a sane way. Many error sources may also
> > not have a device path in /sys, and therefore no uevent to send.
> > Uevents/udev just seem so convinient because it's already there, but I
> > think, has too many limitations to provide the needed functionality.
> > Besides the fact that nothing listens to these events in userspace
> > today -- it's a lot more to think through and work on than passing
> > things through uevents, especially some generic classification and
> > structured data passing, which is needed, instead of the current
> > free-text 'dmsg' or the property-based stuff in uevents. I'm very sure
> > it's the wrong facility to use.
>
> Yeah, well, if you say so. It would be very nice to have report this
> type of critical events in somewhat formatted way so that they can be
> processed automatically and presented to the user in more accessible
> manner. I doubt requiring strict structure would work in the long
> run. It'll likely end up being able to cover only portion of what's
> originally designed and and stagnate in time. I think control
> information including identification and severity + free form string
> would be much more manageable.
>
> It would really be great to have something like that. I can easily
> think of several libata events the user should be notified of from the
> top of my head but currently are buried in dmesg.

That is what Andi Kleen and others are working on at the moment. I
think he's posted to lkml about it, and there are going to be more talks
about it at the Plumbers conference this year.

thanks,

greg k-h

2010-06-15 17:52:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/12] irq: implement irq_schedule_poll()

On 06/15/2010 07:40 PM, Jonathan Corbet wrote:
> On Sun, 13 Jun 2010 17:31:32 +0200
> Tejun Heo <[email protected]> wrote:
>
>> Implement and use irq_schedule_poll() to schedule desc->poll_timer
>> instead of calling mod_timer directly. irq_schedule_poll() is called
>> with desc->lock held and schedules the timer iff necessary - ie. if
>> the timer is offline or scheduled to expire later than requested.
>> This will be used to share desc->poll_timer.
>
> One quick question...
>
>> diff --git a/kernel/irq/spurious.c b/kernel/irq/spurious.c
>> index 0bce0e3..0b8fd0a 100644
>> --- a/kernel/irq/spurious.c
>> +++ b/kernel/irq/spurious.c
>> @@ -23,6 +23,8 @@ enum {
>>
>> /* IRQ polling common parameters */
>> IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
>> +
>> + IRQ_POLL_SLACK = HZ / 1000, /* 10% slack */
>> };
>
> Why would IRQ_POLL_SLACK be anything other than zero for HZ < 1000?
> It seems like you'd get no slack at all on most systems. Am I missing
> something?

Jiri questioned the same thing. I'll bump it to HZ / 250 so that it
at least has one tick slack on 250HZ configuration which is pretty
common these days.

Thanks.

--
tejun

2010-06-15 17:53:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On 06/15/2010 07:47 PM, Greg KH wrote:
>> It would really be great to have something like that. I can easily
>> think of several libata events the user should be notified of from the
>> top of my head but currently are buried in dmesg.
>
> That is what Andi Kleen and others are working on at the moment. I
> think he's posted to lkml about it, and there are going to be more talks
> about it at the Plumbers conference this year.

Ah, cool. Yeah, I'll be happy to be a beta user for such feature.

thanks.

--
tejun

2010-06-17 03:49:58

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Sun, 13 Jun 2010 17:31:35 +0200
Tejun Heo <[email protected]> wrote:
> + */
> +void expect_irq(struct irq_expect *exp)


I would like to suggest an (optional) argument to this with a duration
within which to expect an interrupt....

that way in the backend we can plumb this also into the idle handler
for C state selection...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-06-17 08:19:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Hello, Arjan.

On 06/17/2010 05:48 AM, Arjan van de Ven wrote:
> On Sun, 13 Jun 2010 17:31:35 +0200
> Tejun Heo <[email protected]> wrote:
>> + */
>> +void expect_irq(struct irq_expect *exp)
>
> I would like to suggest an (optional) argument to this with a duration
> within which to expect an interrupt....
>
> that way in the backend we can plumb this also into the idle handler
> for C state selection...

Hmmm.... oh, I see. Wouldn't it be much better to use moving avg of
IRQ durations instead of letting the driver specify it? Drivers are
most likely to just hard code it and It's never gonna be accurate.

Thanks.

--
tejun

2010-06-17 11:16:41

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting


On Thu, 17 Jun 2010, Tejun Heo wrote:

> Hello, Arjan.
>
> On 06/17/2010 05:48 AM, Arjan van de Ven wrote:
> > On Sun, 13 Jun 2010 17:31:35 +0200
> > Tejun Heo <[email protected]> wrote:
> >> + */
> >> +void expect_irq(struct irq_expect *exp)
> >
> > I would like to suggest an (optional) argument to this with a duration
> > within which to expect an interrupt....
> >
> > that way in the backend we can plumb this also into the idle handler
> > for C state selection...
>
> Hmmm.... oh, I see. Wouldn't it be much better to use moving avg of
> IRQ durations instead of letting the driver specify it? Drivers are
> most likely to just hard code it and It's never gonna be accurate.

Right, but that's probably more accurate than the core code heuristics
ever will be.

Thanks,

tglx

2010-06-17 11:24:37

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On 06/17/2010 01:12 PM, Thomas Gleixner wrote:
>> Hmmm.... oh, I see. Wouldn't it be much better to use moving avg of
>> IRQ durations instead of letting the driver specify it? Drivers are
>> most likely to just hard code it and It's never gonna be accurate.
>
> Right, but that's probably more accurate than the core code heuristics
> ever will be.

Eh, not really. For ATA at least, there will be three different
classes of devices. SSDs, hard drives and optical devices and if we
get running avg w/ fairly large stability part, the numbers wouldn't
be too far off and there's no reliable way for the driver to tell
which type of device is on the other side of the cable. So, I think
running avg would work much better.

Thanks.

--
tejun

2010-06-17 11:42:24

by Alan

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Thu, 17 Jun 2010 13:23:27 +0200
Tejun Heo <[email protected]> wrote:

> On 06/17/2010 01:12 PM, Thomas Gleixner wrote:
> >> Hmmm.... oh, I see. Wouldn't it be much better to use moving avg of
> >> IRQ durations instead of letting the driver specify it? Drivers are
> >> most likely to just hard code it and It's never gonna be accurate.
> >
> > Right, but that's probably more accurate than the core code heuristics
> > ever will be.
>
> Eh, not really. For ATA at least, there will be three different
> classes of devices. SSDs, hard drives and optical devices

At least four: It may also be battery backed RAM.

Alan

2010-06-17 15:55:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On 06/17/2010 01:43 PM, Alan Cox wrote:
> On Thu, 17 Jun 2010 13:23:27 +0200
> Tejun Heo <[email protected]> wrote:
>
>> On 06/17/2010 01:12 PM, Thomas Gleixner wrote:
>>>> Hmmm.... oh, I see. Wouldn't it be much better to use moving avg of
>>>> IRQ durations instead of letting the driver specify it? Drivers are
>>>> most likely to just hard code it and It's never gonna be accurate.
>>>
>>> Right, but that's probably more accurate than the core code heuristics
>>> ever will be.
>>
>> Eh, not really. For ATA at least, there will be three different
>> classes of devices. SSDs, hard drives and optical devices
>
> At least four: It may also be battery backed RAM.

Yeah, right, there are those crazy devices too but I think they would
fall in a single tick any way. At any rate, let's say I have those
numbers, how would I feed it into c-state selection?

Thanks.

--
tejun

2010-06-17 16:02:56

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Thu, 17 Jun 2010 17:54:48 +0200
Tejun Heo <[email protected]> wrote:

> Crazy devices too but I think they would
> fall in a single tick any way.

not sure what ticks have to do with anything but ok ;)

> At any rate, let's say I have those
> numbers, how would I feed it into c-state selection?

if we have this, we need to put a bit of glue in the backend that
tracks (per cpu I suppose) the shortest expected interrupt, which
the C state code then queries.
(and in that regard, it does not matter if shortest expected is
computed via heuristic on a per irq basis, or passed in).

mapping an irq to a cpu is not a 100% science (since interrupts can
move in theory), but just assuming that the irq will happen on the same
CPU it happened last time is more than good enough.


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-06-17 16:48:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On 06/17/2010 06:02 PM, Arjan van de Ven wrote:
> On Thu, 17 Jun 2010 17:54:48 +0200
> Tejun Heo <[email protected]> wrote:
>
>> Crazy devices too but I think they would
>> fall in a single tick any way.
>
> not sure what ticks have to do with anything but ok ;)

Eh... right, I was thinking about something else. IRQ expect code
originally had a tick based duration estimator to determine poll
interval which I ripped out later for simpler stepped adjustments.
c-state would need higher frequency timing measurements than jiffies.

>> At any rate, let's say I have those
>> numbers, how would I feed it into c-state selection?
>
> if we have this, we need to put a bit of glue in the backend that
> tracks (per cpu I suppose) the shortest expected interrupt, which
> the C state code then queries.
> (and in that regard, it does not matter if shortest expected is
> computed via heuristic on a per irq basis, or passed in).
>
> mapping an irq to a cpu is not a 100% science (since interrupts can
> move in theory), but just assuming that the irq will happen on the same
> CPU it happened last time is more than good enough.

Hmmm... the thing is that there will be many cases which won't fit
irq_expect() model (why irq_watch() exists in the first place) and for
the time being libata is the only one providing that data. Would the
data still be useful to determine which c-state to use?

Thanks.

--
tejun

2010-06-18 06:27:21

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Thu, 17 Jun 2010 18:47:19 +0200
Tejun Heo <[email protected]> wrote:

>
> Hmmm... the thing is that there will be many cases which won't fit
> irq_expect() model (why irq_watch() exists in the first place) and for
> the time being libata is the only one providing that data. Would the
> data still be useful to determine which c-state to use?

yes absolutely. One of the hard cases right now that the C state code
has is that it needs to predict the future. While it has a ton of
heuristics, including some is there IO oustanding" ones, libata is a
really good case: libata will know generally that within one seek time
(5 msec on rotating rust, much less on floating electrons) there'll be
an interrupt (give or take, but this is what we can do heuristics for
on a per irq level).
So it's a good suggestion of what the future will be like, MUCH better
than any hint we have right now... all we have right now is some
history, and when the next timer is....


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-06-18 09:24:12

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Hello,

On 06/18/2010 08:26 AM, Arjan van de Ven wrote:
> On Thu, 17 Jun 2010 18:47:19 +0200
> Tejun Heo <[email protected]> wrote:
>
>>
>> Hmmm... the thing is that there will be many cases which won't fit
>> irq_expect() model (why irq_watch() exists in the first place) and for
>> the time being libata is the only one providing that data. Would the
>> data still be useful to determine which c-state to use?
>
> yes absolutely. One of the hard cases right now that the C state code
> has is that it needs to predict the future. While it has a ton of
> heuristics, including some is there IO oustanding" ones, libata is a
> really good case: libata will know generally that within one seek time
> (5 msec on rotating rust, much less on floating electrons) there'll be
> an interrupt (give or take, but this is what we can do heuristics for
> on a per irq level).
> So it's a good suggestion of what the future will be like, MUCH better
> than any hint we have right now... all we have right now is some
> history, and when the next timer is....

Cool, good to know. It shouldn't be difficult to at all to add. Once
the whole thing gets generally agreed on, I'll work on that.

Thomas, Ingo, through which tree should these patches routed through?
Shall I set up a separate branch?

Thanks.

--
tejun

2010-06-18 09:46:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Fri, 18 Jun 2010, Tejun Heo wrote:
> Hello,
>
> On 06/18/2010 08:26 AM, Arjan van de Ven wrote:
> > On Thu, 17 Jun 2010 18:47:19 +0200
> > Tejun Heo <[email protected]> wrote:
> >
> >>
> >> Hmmm... the thing is that there will be many cases which won't fit
> >> irq_expect() model (why irq_watch() exists in the first place) and for
> >> the time being libata is the only one providing that data. Would the
> >> data still be useful to determine which c-state to use?
> >
> > yes absolutely. One of the hard cases right now that the C state code
> > has is that it needs to predict the future. While it has a ton of
> > heuristics, including some is there IO oustanding" ones, libata is a
> > really good case: libata will know generally that within one seek time
> > (5 msec on rotating rust, much less on floating electrons) there'll be
> > an interrupt (give or take, but this is what we can do heuristics for
> > on a per irq level).
> > So it's a good suggestion of what the future will be like, MUCH better
> > than any hint we have right now... all we have right now is some
> > history, and when the next timer is....
>
> Cool, good to know. It shouldn't be difficult to at all to add. Once
> the whole thing gets generally agreed on, I'll work on that.
>
> Thomas, Ingo, through which tree should these patches routed through?

I'm going to pull that into tip/genirq I guess

Thanks,

tglx

2010-06-19 08:35:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Arjan van de Ven <[email protected]> writes:

> On Sun, 13 Jun 2010 17:31:35 +0200
> Tejun Heo <[email protected]> wrote:
>> + */
>> +void expect_irq(struct irq_expect *exp)
>
>
> I would like to suggest an (optional) argument to this with a duration
> within which to expect an interrupt....
>
> that way in the backend we can plumb this also into the idle handler
> for C state selection...

I'm not sure it's really that useful to power optimize
the lost interrupts polling case. It's just a last resort
fallback anyways and will be always less power efficient
because there will be unnecessary polls.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 08:44:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Hello,

On 06/19/2010 10:35 AM, Andi Kleen wrote:
>> I would like to suggest an (optional) argument to this with a duration
>> within which to expect an interrupt....
>>
>> that way in the backend we can plumb this also into the idle handler
>> for C state selection...
>
> I'm not sure it's really that useful to power optimize
> the lost interrupts polling case. It's just a last resort
> fallback anyways and will be always less power efficient
> because there will be unnecessary polls.

IIUC, it's not to help or optimize polling itself. It just gives us a
way to estimate when the next interrupt would be so that power can be
optimized for non polling cases.

Thanks.

--
tejun

2010-06-19 09:00:34

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

> IIUC, it's not to help or optimize polling itself. It just gives us a
> way to estimate when the next interrupt would be so that power can be
> optimized for non polling cases.

Shouldn't the idle governour estimate this already?

BTW I looked at something like this for networking. There was
one case where a network benchmark was impacted by deep sleep
states while processing packets. But in the end it turned
out to be mostly a broken BIOS that gave wrong
parameters to the idle governour.

-Andi

--
[email protected] -- Speaking for myself only.

2010-06-19 09:04:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

Hello,

On 06/19/2010 11:00 AM, Andi Kleen wrote:
>> IIUC, it's not to help or optimize polling itself. It just gives us a
>> way to estimate when the next interrupt would be so that power can be
>> optimized for non polling cases.
>
> Shouldn't the idle governour estimate this already?

I'm not an expert on the subject. According to Arjan,

One of the hard cases right now that the C state code has is that it
needs to predict the future. While it has a ton of heuristics,
including some is there IO oustanding" ones, libata is a really good
case: libata will know generally that within one seek time (5 msec
on rotating rust, much less on floating electrons) there'll be an
interrupt (give or take, but this is what we can do heuristics for
on a per irq level). So it's a good suggestion of what the future
will be like, MUCH better than any hint we have right now... all we
have right now is some history, and when the next timer is....

So, it seems like it would help.

Thanks.

--
tejun

2010-06-19 14:55:17

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Sat, 19 Jun 2010 11:00:31 +0200
Andi Kleen <[email protected]> wrote:

> > IIUC, it's not to help or optimize polling itself. It just gives
> > us a way to estimate when the next interrupt would be so that power
> > can be optimized for non polling cases.
>
> Shouldn't the idle governour estimate this already?

we have a set of heuristics in the idle governor to try to predict the
future. For patterns that are regular in some shape of form we can do
this.

Here we have the opportunity to get real good information.. we KNOW
now an interrupt is coming, and we can even estimate how long it'll
be... Even for the first few after a long time, before a pattern
emerges.



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-06-19 19:49:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

> Here we have the opportunity to get real good information.. we KNOW
> now an interrupt is coming, and we can even estimate how long it'll
> be... Even for the first few after a long time, before a pattern
> emerges.

Ok but I thought the driver would just fill in a single number
likely. Is that really that good information?

Alternatively each driver would need to implement some dynamic
estimation algorithm, but that would be just mostly equivalent
to having it all in the idle governour?

-Andi
--
[email protected] -- Speaking for myself only.

2010-06-19 20:07:49

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 09/12] irq: implement IRQ expecting

On Sat, 19 Jun 2010 21:49:37 +0200
Andi Kleen <[email protected]> wrote:

> > Here we have the opportunity to get real good information.. we KNOW
> > now an interrupt is coming, and we can even estimate how long it'll
> > be... Even for the first few after a long time, before a pattern
> > emerges.
>
> Ok but I thought the driver would just fill in a single number
> likely. Is that really that good information?

Tejun suggested tracking this per handler, I'm assuming that instead of
the driver passed number

>
> Alternatively each driver would need to implement some dynamic
> estimation algorithm, but that would be just mostly equivalent
> to having it all in the idle governour?

that's the whole point... let the irq layer track this stuff, and
inform the governor. the governor does not know irqs are expected..
with tejun's change, the irq layer at least will.



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-06-21 13:27:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 06/12 UPDATED] irq: implement irq_schedule_poll()

Implement and use irq_schedule_poll() to schedule desc->poll_timer
instead of calling mod_timer directly. irq_schedule_poll() is called
with desc->lock held and schedules the timer iff necessary - ie. if
the timer is offline or scheduled to expire later than requested.
This will be used to share desc->poll_timer.

* Jiri and Jonathan pointed out that IRQ_POLL_SLACK of HZ / 1000 would
yield 0 slack on the popular 250HZ config. Adjusted to HZ / 250.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Jiri Slaby <[email protected]>
Cc: Jonathan Corbet <[email protected]>
---
Minor update. IRQ_POLL_SLACK bumped up to HZ / 250.

Thanks.

kernel/irq/spurious.c | 47 +++++++++++++++++++++++++++++++++++++++++------
1 file changed, 41 insertions(+), 6 deletions(-)

Index: work/kernel/irq/spurious.c
===================================================================
--- work.orig/kernel/irq/spurious.c
+++ work/kernel/irq/spurious.c
@@ -23,6 +23,8 @@ enum {

/* IRQ polling common parameters */
IRQ_POLL_INTV = HZ / 100, /* from the good ol' 100HZ tick */
+
+ IRQ_POLL_SLACK = HZ / 250, /* 1 tick slack w/ the popular 250HZ config */
};

int noirqdebug __read_mostly;
@@ -43,6 +45,38 @@ static void print_irq_handlers(struct ir
}
}

+static unsigned long irq_poll_slack(unsigned long intv)
+{
+ return IRQ_POLL_SLACK;
+}
+
+/**
+ * irq_schedule_poll - schedule IRQ poll
+ * @desc: IRQ desc to schedule poll for
+ * @intv: poll interval
+ *
+ * Schedules @desc->poll_timer. If the timer is already scheduled,
+ * it's modified iff jiffies + @intv + slack is before the timer's
+ * expires. poll_timers aren't taken offline behind this function's
+ * back and the users of this function are guaranteed that poll_irq()
+ * will be called at or before jiffies + @intv + slack.
+ *
+ * CONTEXT:
+ * desc->lock
+ */
+static void irq_schedule_poll(struct irq_desc *desc, unsigned long intv)
+{
+ unsigned long expires = jiffies + intv;
+ int slack = irq_poll_slack(intv);
+
+ if (timer_pending(&desc->poll_timer) &&
+ time_before_eq(desc->poll_timer.expires, expires + slack))
+ return;
+
+ set_timer_slack(&desc->poll_timer, slack);
+ mod_timer(&desc->poll_timer, expires);
+}
+
/*
* Recovery handler for misrouted interrupts.
*/
@@ -207,7 +241,9 @@ void __note_interrupt(unsigned int irq,
desc->depth++;
desc->chip->disable(irq);

- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
+ raw_spin_lock(&desc->lock);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
+ raw_spin_unlock(&desc->lock);
}
desc->irqs_unhandled = 0;
}
@@ -221,9 +257,8 @@ void poll_irq(unsigned long arg)

raw_spin_lock_irq(&desc->lock);
try_one_irq(desc->irq, desc);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
raw_spin_unlock_irq(&desc->lock);
-
- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
}

void irq_poll_action_added(struct irq_desc *desc, struct irqaction *action)
@@ -238,10 +273,10 @@ void irq_poll_action_added(struct irq_de
__enable_irq(desc, desc->irq, false);
}

- raw_spin_unlock_irqrestore(&desc->lock, flags);
-
if ((action->flags & IRQF_SHARED) && irqfixup >= IRQFIXUP_POLL)
- mod_timer(&desc->poll_timer, jiffies + IRQ_POLL_INTV);
+ irq_schedule_poll(desc, IRQ_POLL_INTV);
+
+ raw_spin_unlock_irqrestore(&desc->lock, flags);
}

void irq_poll_action_removed(struct irq_desc *desc, struct irqaction *action)

2010-06-21 13:52:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On 06/13/2010 05:31 PM, Tejun Heo wrote:
> Ask IRQ subsystem to watch HCD IRQ line after initialization. This at
> least keeps USB ports which are occupied on initialization working and
> eases bug reporting and debugging.
>
> Signed-off-by: Tejun Heo <[email protected]>

Greg, if you think the change is okay, can you please ack it? Also,
would it be okay to route this through irq tree?

Thanks.

--
tejun

2010-06-21 13:52:53

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/13/2010 05:31 PM, Tejun Heo wrote:
> Thanks to its age, ATA is very susceptible to IRQ delivery problems in
> both directions - lost and spurious interrupts. In traditional PATA,
> the IRQ line is ultimately out of the controller and driver's control.
> Even relatively new SATA isn't free from these issues. Many
> controllers still emulate the traditional IDE interface which doesn't
> have reliable way to indicate interrupt pending state and there also
> is an issue regarding the interpretation of nIEN on both sides of the
> cable.
>
> Most of these problems can be worked around by using the new IRQ
> expecting mechanism without adding noticeable overhead. In ATA,
> almost all operations are initiated by the host and the controller
> signals progress or completion using IRQ. IRQ expecting can easily be
> added in libata core and applied to all libata drivers.
>
> Signed-off-by: Tejun Heo <[email protected]>

Jeff, can you please ack this change if this looks okay to you? Also,
would it okay to route this through irq tree?

Thanks.

--
tejun

2010-06-21 20:32:11

by Greg KH

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On Mon, Jun 21, 2010 at 03:51:25PM +0200, Tejun Heo wrote:
> On 06/13/2010 05:31 PM, Tejun Heo wrote:
> > Ask IRQ subsystem to watch HCD IRQ line after initialization. This at
> > least keeps USB ports which are occupied on initialization working and
> > eases bug reporting and debugging.
> >
> > Signed-off-by: Tejun Heo <[email protected]>
>
> Greg, if you think the change is okay, can you please ack it? Also,
> would it be okay to route this through irq tree?

Sorry, I thought I did earlier. Yes, feel free to add:
Acked-by: Greg Kroah-Hartman <[email protected]>
and take it through what ever tree you want to :)

thanks,

greg k-h

2010-06-22 07:33:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 12/12] usb: use IRQ watching

On 06/21/2010 10:27 PM, Greg KH wrote:
> On Mon, Jun 21, 2010 at 03:51:25PM +0200, Tejun Heo wrote:
>> On 06/13/2010 05:31 PM, Tejun Heo wrote:
>>> Ask IRQ subsystem to watch HCD IRQ line after initialization. This at
>>> least keeps USB ports which are occupied on initialization working and
>>> eases bug reporting and debugging.
>>>
>>> Signed-off-by: Tejun Heo <[email protected]>
>>
>> Greg, if you think the change is okay, can you please ack it? Also,
>> would it be okay to route this through irq tree?
>
> Sorry, I thought I did earlier. Yes, feel free to add:
> Acked-by: Greg Kroah-Hartman <[email protected]>
> and take it through what ever tree you want to :)

Great, thanks. :-)

--
tejun

2010-06-25 00:22:33

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/13/2010 11:31 AM, Tejun Heo wrote:
> Thanks to its age, ATA is very susceptible to IRQ delivery problems in
> both directions - lost and spurious interrupts. In traditional PATA,
> the IRQ line is ultimately out of the controller and driver's control.
> Even relatively new SATA isn't free from these issues. Many
> controllers still emulate the traditional IDE interface which doesn't
> have reliable way to indicate interrupt pending state and there also
> is an issue regarding the interpretation of nIEN on both sides of the
> cable.
>
> Most of these problems can be worked around by using the new IRQ
> expecting mechanism without adding noticeable overhead. In ATA,
> almost all operations are initiated by the host and the controller
> signals progress or completion using IRQ. IRQ expecting can easily be
> added in libata core and applied to all libata drivers.
>
> Signed-off-by: Tejun Heo<[email protected]>
> ---
> drivers/ata/libata-core.c | 15 +++++++++++++--
> drivers/ata/libata-eh.c | 4 +++-
> drivers/ata/libata-sff.c | 37 +++++++++++++++++++------------------
> include/linux/libata.h | 2 ++
> 4 files changed, 37 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
> index ddf8e48..9a0aaa0 100644
> --- a/drivers/ata/libata-core.c
> +++ b/drivers/ata/libata-core.c
> @@ -4972,6 +4972,8 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
> {
> struct ata_port *ap = qc->ap;
>
> + unexpect_irq(ap->irq_expect, false);
> +
> /* XXX: New EH and old EH use different mechanisms to
> * synchronize EH with regular execution path.
> *

Unconditional use of unexpect_irq() here seems incorrect for some cases,
such as sata_mv's use, where ata_qc_complete() is called multiple times
rather than a singleton ata_qc_complete_multiple() call.

Jeff





2010-06-25 07:44:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

Hello, Jeff.

On 06/25/2010 02:22 AM, Jeff Garzik wrote:
>> @@ -4972,6 +4972,8 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
>> {
>> struct ata_port *ap = qc->ap;
>>
>> + unexpect_irq(ap->irq_expect, false);
>> +
>> /* XXX: New EH and old EH use different mechanisms to
>> * synchronize EH with regular execution path.
>> *
>
> Unconditional use of unexpect_irq() here seems incorrect for some cases,
> such as sata_mv's use, where ata_qc_complete() is called multiple times
> rather than a singleton ata_qc_complete_multiple() call.

Indeed, sata_mv is calling ata_qc_complete() directly multiple times.
I still think calling unexpect_irq() from ata_qc_complete() is correct
as ata_qc_complete() is always a good indicator of completion events.
What's missing is a way for sata_mv to indicate that it has more
events to expect for, which under the current implementation only
sata_mv interrupt handler can determine. I'll see if I can convert it
to use ata_qc_complete_multiple() instead.

Thanks.

--
tejun

2010-06-25 09:48:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/25/2010 03:44 AM, Tejun Heo wrote:
> Hello, Jeff.
>
> On 06/25/2010 02:22 AM, Jeff Garzik wrote:
>>> @@ -4972,6 +4972,8 @@ void ata_qc_complete(struct ata_queued_cmd *qc)
>>> {
>>> struct ata_port *ap = qc->ap;
>>>
>>> + unexpect_irq(ap->irq_expect, false);
>>> +
>>> /* XXX: New EH and old EH use different mechanisms to
>>> * synchronize EH with regular execution path.
>>> *
>>
>> Unconditional use of unexpect_irq() here seems incorrect for some cases,
>> such as sata_mv's use, where ata_qc_complete() is called multiple times
>> rather than a singleton ata_qc_complete_multiple() call.
>
> Indeed, sata_mv is calling ata_qc_complete() directly multiple times.
> I still think calling unexpect_irq() from ata_qc_complete() is correct
> as ata_qc_complete() is always a good indicator of completion events.

My basic point is that you are implicitly changing the entire
ata_qc_complete() API, and associated underlying assumptions.

The existing assumption, since libata day #0, is that ata_qc_complete()
works entirely within the scope of a single qc -- thus enabling multiple
calls for a single controller interrupt. Your change greatly widens the
scope to an entire port.

This isn't just an issue with sata_mv, that was just the easy example I
remember off the top of my head. sata_fsl and sata_nv also make the
same assumption. And it's a reasonable assumption, IMO.

I think an unexpect_irq() call is more appropriate outside
ata_qc_complete().

Jeff


2010-06-25 09:52:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

Hello,

On 06/25/2010 11:48 AM, Jeff Garzik wrote:
> My basic point is that you are implicitly changing the entire
> ata_qc_complete() API, and associated underlying assumptions.
>
> The existing assumption, since libata day #0, is that ata_qc_complete()
> works entirely within the scope of a single qc -- thus enabling multiple
> calls for a single controller interrupt. Your change greatly widens the
> scope to an entire port.

Yeah, I'm changing that and it actually reduces code.

> This isn't just an issue with sata_mv, that was just the easy example I
> remember off the top of my head. sata_fsl and sata_nv also make the
> same assumption. And it's a reasonable assumption, IMO.

Yeah, already updating all of them.

> I think an unexpect_irq() call is more appropriate outside
> ata_qc_complete().

The choices we have here are....

1. Update completion API so that libata core layer has enough
information to decide expect/unexpect events.

2. Add expect/unexpect calls to individual drivers.

I think #1 is much better now and in the long run. The code actually
looks better too.

Thanks.

--
tejun

2010-06-25 13:04:19

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 1/2 #upstream] sata_fsl,mv,nv: prepare for NCQ command completion update

Make the following changes to prepare for NCQ command completion
update. Changes made by this patch don't cause any functional
difference.

* sata_fsl_host_intr(): rename the local variable qc_active to
done_mask as that's what it is.

* mv_process_crpb_response(): restructure if clause for easier update.

* nv_adma_interrupt(): drop unnecessary error variable.

* nv_swncq_sdbfis(): drop unnecessary nr_done and return 0 on success.
Typo fix.

* nv_swncq_dmafis(): drop unused return value and return void.

* nv_swncq_host_interrupt(): drop unnecessary return value handling.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: Saeed Bishara <[email protected]>
Cc: Mark Lord <[email protected]>
Cc: Robert Hancock <[email protected]>
---
So, something like this. I tested both flavors of sata_nv but don't
have access to sata_mv or sata_fsl, but the conversion is pretty
straight forward and failures should be pretty easy to catch.

Thanks.

drivers/ata/sata_fsl.c | 20 ++++++++++----------
drivers/ata/sata_mv.c | 47 ++++++++++++++++++++++++-----------------------
drivers/ata/sata_nv.c | 32 ++++++++++++++------------------
3 files changed, 48 insertions(+), 51 deletions(-)

Index: ata/drivers/ata/sata_fsl.c
===================================================================
--- ata.orig/drivers/ata/sata_fsl.c
+++ ata/drivers/ata/sata_fsl.c
@@ -1096,7 +1096,7 @@ static void sata_fsl_host_intr(struct at
{
struct sata_fsl_host_priv *host_priv = ap->host->private_data;
void __iomem *hcr_base = host_priv->hcr_base;
- u32 hstatus, qc_active = 0;
+ u32 hstatus, done_mask = 0;
struct ata_queued_cmd *qc;
u32 SError;

@@ -1116,28 +1116,28 @@ static void sata_fsl_host_intr(struct at
}

/* Read command completed register */
- qc_active = ioread32(hcr_base + CC);
+ done_mask = ioread32(hcr_base + CC);

VPRINTK("Status of all queues :\n");
- VPRINTK("qc_active/CC = 0x%x, CA = 0x%x, CE=0x%x,CQ=0x%x,apqa=0x%x\n",
- qc_active,
+ VPRINTK("done_mask/CC = 0x%x, CA = 0x%x, CE=0x%x,CQ=0x%x,apqa=0x%x\n",
+ done_mask,
ioread32(hcr_base + CA),
ioread32(hcr_base + CE),
ioread32(hcr_base + CQ),
ap->qc_active);

- if (qc_active & ap->qc_active) {
+ if (done_mask & ap->qc_active) {
int i;
/* clear CC bit, this will also complete the interrupt */
- iowrite32(qc_active, hcr_base + CC);
+ iowrite32(done_mask, hcr_base + CC);

DPRINTK("Status of all queues :\n");
- DPRINTK("qc_active/CC = 0x%x, CA = 0x%x, CE=0x%x\n",
- qc_active, ioread32(hcr_base + CA),
+ DPRINTK("done_mask/CC = 0x%x, CA = 0x%x, CE=0x%x\n",
+ done_mask, ioread32(hcr_base + CA),
ioread32(hcr_base + CE));

for (i = 0; i < SATA_FSL_QUEUE_DEPTH; i++) {
- if (qc_active & (1 << i)) {
+ if (done_mask & (1 << i)) {
qc = ata_qc_from_tag(ap, i);
if (qc) {
ata_qc_complete(qc);
@@ -1164,7 +1164,7 @@ static void sata_fsl_host_intr(struct at
/* Spurious Interrupt!! */
DPRINTK("spurious interrupt!!, CC = 0x%x\n",
ioread32(hcr_base + CC));
- iowrite32(qc_active, hcr_base + CC);
+ iowrite32(done_mask, hcr_base + CC);
return;
}
}
Index: ata/drivers/ata/sata_mv.c
===================================================================
--- ata.orig/drivers/ata/sata_mv.c
+++ ata/drivers/ata/sata_mv.c
@@ -2716,34 +2716,35 @@ static void mv_err_intr(struct ata_port
static void mv_process_crpb_response(struct ata_port *ap,
struct mv_crpb *response, unsigned int tag, int ncq_enabled)
{
+ u8 ata_status;
+ u16 edma_status = le16_to_cpu(response->flags);
struct ata_queued_cmd *qc = ata_qc_from_tag(ap, tag);

- if (qc) {
- u8 ata_status;
- u16 edma_status = le16_to_cpu(response->flags);
- /*
- * edma_status from a response queue entry:
- * LSB is from EDMA_ERR_IRQ_CAUSE (non-NCQ only).
- * MSB is saved ATA status from command completion.
- */
- if (!ncq_enabled) {
- u8 err_cause = edma_status & 0xff & ~EDMA_ERR_DEV;
- if (err_cause) {
- /*
- * Error will be seen/handled by mv_err_intr().
- * So do nothing at all here.
- */
- return;
- }
- }
- ata_status = edma_status >> CRPB_FLAG_STATUS_SHIFT;
- if (!ac_err_mask(ata_status))
- ata_qc_complete(qc);
- /* else: leave it for mv_err_intr() */
- } else {
+ if (unlikely(!qc)) {
ata_port_printk(ap, KERN_ERR, "%s: no qc for tag=%d\n",
__func__, tag);
+ return;
+ }
+
+ /*
+ * edma_status from a response queue entry:
+ * LSB is from EDMA_ERR_IRQ_CAUSE (non-NCQ only).
+ * MSB is saved ATA status from command completion.
+ */
+ if (!ncq_enabled) {
+ u8 err_cause = edma_status & 0xff & ~EDMA_ERR_DEV;
+ if (err_cause) {
+ /*
+ * Error will be seen/handled by
+ * mv_err_intr(). So do nothing at all here.
+ */
+ return;
+ }
}
+ ata_status = edma_status >> CRPB_FLAG_STATUS_SHIFT;
+ if (!ac_err_mask(ata_status))
+ ata_qc_complete(qc);
+ /* else: leave it for mv_err_intr() */
}

static void mv_process_crpb_entries(struct ata_port *ap, struct mv_port_priv *pp)
Index: ata/drivers/ata/sata_nv.c
===================================================================
--- ata.orig/drivers/ata/sata_nv.c
+++ ata/drivers/ata/sata_nv.c
@@ -1018,7 +1018,7 @@ static irqreturn_t nv_adma_interrupt(int
NV_ADMA_STAT_CPBERR |
NV_ADMA_STAT_CMD_COMPLETE)) {
u32 check_commands = notifier_clears[i];
- int pos, error = 0;
+ int pos, rc;

if (status & NV_ADMA_STAT_CPBERR) {
/* check all active commands */
@@ -1030,10 +1030,12 @@ static irqreturn_t nv_adma_interrupt(int
}

/* check CPBs for completed commands */
- while ((pos = ffs(check_commands)) && !error) {
+ while ((pos = ffs(check_commands))) {
pos--;
- error = nv_adma_check_cpb(ap, pos,
+ rc = nv_adma_check_cpb(ap, pos,
notifier_error & (1 << pos));
+ if (unlikely(rc))
+ check_commands = 0;
check_commands &= ~(1 << pos);
}
}
@@ -2129,7 +2131,6 @@ static int nv_swncq_sdbfis(struct ata_po
struct nv_swncq_port_priv *pp = ap->private_data;
struct ata_eh_info *ehi = &ap->link.eh_info;
u32 sactive;
- int nr_done = 0;
u32 done_mask;
int i;
u8 host_stat;
@@ -2170,22 +2171,21 @@ static int nv_swncq_sdbfis(struct ata_po
pp->dhfis_bits &= ~(1 << i);
pp->dmafis_bits &= ~(1 << i);
pp->sdbfis_bits |= (1 << i);
- nr_done++;
}
}

if (!ap->qc_active) {
DPRINTK("over\n");
nv_swncq_pp_reinit(ap);
- return nr_done;
+ return 0;
}

if (pp->qc_active & pp->dhfis_bits)
- return nr_done;
+ return 0;

if ((pp->ncq_flags & ncq_saw_backout) ||
(pp->qc_active ^ pp->dhfis_bits))
- /* if the controller cann't get a device to host register FIS,
+ /* if the controller can't get a device to host register FIS,
* The driver needs to reissue the new command.
*/
lack_dhfis = 1;
@@ -2202,7 +2202,7 @@ static int nv_swncq_sdbfis(struct ata_po
if (lack_dhfis) {
qc = ata_qc_from_tag(ap, pp->last_issue_tag);
nv_swncq_issue_atacmd(ap, qc);
- return nr_done;
+ return 0;
}

if (pp->defer_queue.defer_bits) {
@@ -2212,7 +2212,7 @@ static int nv_swncq_sdbfis(struct ata_po
nv_swncq_issue_atacmd(ap, qc);
}

- return nr_done;
+ return 0;
}

static inline u32 nv_swncq_tag(struct ata_port *ap)
@@ -2224,7 +2224,7 @@ static inline u32 nv_swncq_tag(struct at
return (tag & 0x1f);
}

-static int nv_swncq_dmafis(struct ata_port *ap)
+static void nv_swncq_dmafis(struct ata_port *ap)
{
struct ata_queued_cmd *qc;
unsigned int rw;
@@ -2239,7 +2239,7 @@ static int nv_swncq_dmafis(struct ata_po
qc = ata_qc_from_tag(ap, tag);

if (unlikely(!qc))
- return 0;
+ return;

rw = qc->tf.flags & ATA_TFLAG_WRITE;

@@ -2254,8 +2254,6 @@ static int nv_swncq_dmafis(struct ata_po
dmactl |= ATA_DMA_WR;

iowrite8(dmactl | ATA_DMA_START, ap->ioaddr.bmdma_addr + ATA_DMA_CMD);
-
- return 1;
}

static void nv_swncq_host_interrupt(struct ata_port *ap, u16 fis)
@@ -2265,7 +2263,6 @@ static void nv_swncq_host_interrupt(stru
struct ata_eh_info *ehi = &ap->link.eh_info;
u32 serror;
u8 ata_stat;
- int rc = 0;

ata_stat = ap->ops->sff_check_status(ap);
nv_swncq_irq_clear(ap, fis);
@@ -2310,8 +2307,7 @@ static void nv_swncq_host_interrupt(stru
"dhfis 0x%X dmafis 0x%X sactive 0x%X\n",
ap->print_id, pp->qc_active, pp->dhfis_bits,
pp->dmafis_bits, readl(pp->sactive_block));
- rc = nv_swncq_sdbfis(ap);
- if (rc < 0)
+ if (nv_swncq_sdbfis(ap) < 0)
goto irq_error;
}

@@ -2348,7 +2344,7 @@ static void nv_swncq_host_interrupt(stru
*/
pp->dmafis_bits |= (0x1 << nv_swncq_tag(ap));
pp->ncq_flags |= ncq_saw_dmas;
- rc = nv_swncq_dmafis(ap);
+ nv_swncq_dmafis(ap);
}

irq_exit:

2010-06-25 13:04:30

by Tejun Heo

[permalink] [raw]
Subject: [PATCH 2/2 #upstream] libata: always use ata_qc_complete_multiple() for NCQ command completions

Currently, sata_fsl, mv and nv call ata_qc_complete() multiple times
from their interrupt handlers to indicate completion of NCQ commands.
This limits the visibility the libata core layer has into how commands
are being executed and completed, which is necessary to support IRQ
expecting in generic way. libata already has an interface to complete
multiple commands at once - ata_qc_complete_multiple() which ahci and
sata_sil24 already use.

This patch updates the three drivers to use ata_qc_complete_multiple()
too and updates comments on ata_qc_complete[_multiple]() regarding
their usages with NCQ completions. This change not only provides
better visibility into command execution to the core layer but also
simplifies low level drivers.

* sata_fsl: It already builds done_mask. Conversion is straight
forward.

* sata_mv: mv_process_crpb_response() no longer checks for illegal
completions, it just returns whether the tag is completed or not.
mv_process_crpb_entries() builds done_mask from it and passes it to
ata_qc_complete_multiple() which will check for illegal completions.

* sata_nv adma: Similar to sata_mv. nv_adma_check_cpb() now just
returns the tag status and nv_adma_interrupt() builds done_mask from
it and passes it to ata_qc_complete_multiple().

* sata_nv swncq: It already builds done_mask. Drop unnecessary
illegal transition checks and call ata_qc_complete_multiple().

In the long run, it might be a good idea to make ata_qc_complete()
whine if called when multiple NCQ commands are in flight.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Ashish Kalra <[email protected]>
Cc: Saeed Bishara <[email protected]>
Cc: Mark Lord <[email protected]>
Cc: Robert Hancock <[email protected]>
---
drivers/ata/libata-core.c | 13 ++++++++--
drivers/ata/sata_fsl.c | 8 +-----
drivers/ata/sata_mv.c | 25 +++++++++-----------
drivers/ata/sata_nv.c | 57 ++++++++++------------------------------------
4 files changed, 38 insertions(+), 65 deletions(-)

Index: ata/drivers/ata/libata-core.c
===================================================================
--- ata.orig/drivers/ata/libata-core.c
+++ ata/drivers/ata/libata-core.c
@@ -4962,8 +4962,13 @@ static void ata_verify_xfer(struct ata_q
* ata_qc_complete - Complete an active ATA command
* @qc: Command to complete
*
- * Indicate to the mid and upper layers that an ATA
- * command has completed, with either an ok or not-ok status.
+ * Indicate to the mid and upper layers that an ATA command has
+ * completed, with either an ok or not-ok status.
+ *
+ * Refrain from calling this function multiple times when
+ * successfully completing multiple NCQ commands.
+ * ata_qc_complete_multiple() should be used instead, which will
+ * properly update IRQ expect state.
*
* LOCKING:
* spin_lock_irqsave(host lock)
@@ -5056,6 +5061,10 @@ void ata_qc_complete(struct ata_queued_c
* requests normally. ap->qc_active and @qc_active is compared
* and commands are completed accordingly.
*
+ * Always use this function when completing multiple NCQ commands
+ * from IRQ handlers instead of calling ata_qc_complete()
+ * multiple times to keep IRQ expect status properly in sync.
+ *
* LOCKING:
* spin_lock_irqsave(host lock)
*
Index: ata/drivers/ata/sata_fsl.c
===================================================================
--- ata.orig/drivers/ata/sata_fsl.c
+++ ata/drivers/ata/sata_fsl.c
@@ -1137,17 +1137,13 @@ static void sata_fsl_host_intr(struct at
ioread32(hcr_base + CE));

for (i = 0; i < SATA_FSL_QUEUE_DEPTH; i++) {
- if (done_mask & (1 << i)) {
- qc = ata_qc_from_tag(ap, i);
- if (qc) {
- ata_qc_complete(qc);
- }
+ if (done_mask & (1 << i))
DPRINTK
("completing ncq cmd,tag=%d,CC=0x%x,CA=0x%x\n",
i, ioread32(hcr_base + CC),
ioread32(hcr_base + CA));
- }
}
+ ata_qc_complete_multiple(ap, ap->qc_active ^ done_mask);
return;

} else if ((ap->qc_active & (1 << ATA_TAG_INTERNAL))) {
Index: ata/drivers/ata/sata_mv.c
===================================================================
--- ata.orig/drivers/ata/sata_mv.c
+++ ata/drivers/ata/sata_mv.c
@@ -2713,18 +2713,11 @@ static void mv_err_intr(struct ata_port
}
}

-static void mv_process_crpb_response(struct ata_port *ap,
+static bool mv_process_crpb_response(struct ata_port *ap,
struct mv_crpb *response, unsigned int tag, int ncq_enabled)
{
u8 ata_status;
u16 edma_status = le16_to_cpu(response->flags);
- struct ata_queued_cmd *qc = ata_qc_from_tag(ap, tag);
-
- if (unlikely(!qc)) {
- ata_port_printk(ap, KERN_ERR, "%s: no qc for tag=%d\n",
- __func__, tag);
- return;
- }

/*
* edma_status from a response queue entry:
@@ -2738,13 +2731,14 @@ static void mv_process_crpb_response(str
* Error will be seen/handled by
* mv_err_intr(). So do nothing at all here.
*/
- return;
+ return false;
}
}
ata_status = edma_status >> CRPB_FLAG_STATUS_SHIFT;
if (!ac_err_mask(ata_status))
- ata_qc_complete(qc);
+ return true;
/* else: leave it for mv_err_intr() */
+ return false;
}

static void mv_process_crpb_entries(struct ata_port *ap, struct mv_port_priv *pp)
@@ -2753,6 +2747,7 @@ static void mv_process_crpb_entries(stru
struct mv_host_priv *hpriv = ap->host->private_data;
u32 in_index;
bool work_done = false;
+ u32 done_mask = 0;
int ncq_enabled = (pp->pp_flags & MV_PP_FLAG_NCQ_EN);

/* Get the hardware queue position index */
@@ -2773,15 +2768,19 @@ static void mv_process_crpb_entries(stru
/* Gen II/IIE: get command tag from CRPB entry */
tag = le16_to_cpu(response->id) & 0x1f;
}
- mv_process_crpb_response(ap, response, tag, ncq_enabled);
+ if (mv_process_crpb_response(ap, response, tag, ncq_enabled))
+ done_mask |= 1 << tag;
work_done = true;
}

- /* Update the software queue position index in hardware */
- if (work_done)
+ if (work_done) {
+ ata_qc_complete_multiple(ap, ap->qc_active ^ done_mask);
+
+ /* Update the software queue position index in hardware */
writelfl((pp->crpb_dma & EDMA_RSP_Q_BASE_LO_MASK) |
(pp->resp_idx << EDMA_RSP_Q_PTR_SHIFT),
port_mmio + EDMA_RSP_Q_OUT_PTR);
+ }
}

static void mv_port_intr(struct ata_port *ap, u32 port_cause)
Index: ata/drivers/ata/sata_nv.c
===================================================================
--- ata.orig/drivers/ata/sata_nv.c
+++ ata/drivers/ata/sata_nv.c
@@ -873,29 +873,11 @@ static int nv_adma_check_cpb(struct ata_
ata_port_freeze(ap);
else
ata_port_abort(ap);
- return 1;
+ return -1;
}

- if (likely(flags & NV_CPB_RESP_DONE)) {
- struct ata_queued_cmd *qc = ata_qc_from_tag(ap, cpb_num);
- VPRINTK("CPB flags done, flags=0x%x\n", flags);
- if (likely(qc)) {
- DPRINTK("Completing qc from tag %d\n", cpb_num);
- ata_qc_complete(qc);
- } else {
- struct ata_eh_info *ehi = &ap->link.eh_info;
- /* Notifier bits set without a command may indicate the drive
- is misbehaving. Raise host state machine violation on this
- condition. */
- ata_port_printk(ap, KERN_ERR,
- "notifier for tag %d with no cmd?\n",
- cpb_num);
- ehi->err_mask |= AC_ERR_HSM;
- ehi->action |= ATA_EH_RESET;
- ata_port_freeze(ap);
- return 1;
- }
- }
+ if (likely(flags & NV_CPB_RESP_DONE))
+ return 1;
return 0;
}

@@ -1018,6 +1000,7 @@ static irqreturn_t nv_adma_interrupt(int
NV_ADMA_STAT_CPBERR |
NV_ADMA_STAT_CMD_COMPLETE)) {
u32 check_commands = notifier_clears[i];
+ u32 done_mask = 0;
int pos, rc;

if (status & NV_ADMA_STAT_CPBERR) {
@@ -1034,10 +1017,13 @@ static irqreturn_t nv_adma_interrupt(int
pos--;
rc = nv_adma_check_cpb(ap, pos,
notifier_error & (1 << pos));
- if (unlikely(rc))
+ if (rc > 0)
+ done_mask |= 1 << pos;
+ else if (unlikely(rc < 0))
check_commands = 0;
check_commands &= ~(1 << pos);
}
+ ata_qc_complete_multiple(ap, ap->qc_active ^ done_mask);
}
}

@@ -2132,7 +2118,6 @@ static int nv_swncq_sdbfis(struct ata_po
struct ata_eh_info *ehi = &ap->link.eh_info;
u32 sactive;
u32 done_mask;
- int i;
u8 host_stat;
u8 lack_dhfis = 0;

@@ -2152,27 +2137,11 @@ static int nv_swncq_sdbfis(struct ata_po
sactive = readl(pp->sactive_block);
done_mask = pp->qc_active ^ sactive;

- if (unlikely(done_mask & sactive)) {
- ata_ehi_clear_desc(ehi);
- ata_ehi_push_desc(ehi, "illegal SWNCQ:qc_active transition"
- "(%08x->%08x)", pp->qc_active, sactive);
- ehi->err_mask |= AC_ERR_HSM;
- ehi->action |= ATA_EH_RESET;
- return -EINVAL;
- }
- for (i = 0; i < ATA_MAX_QUEUE; i++) {
- if (!(done_mask & (1 << i)))
- continue;
-
- qc = ata_qc_from_tag(ap, i);
- if (qc) {
- ata_qc_complete(qc);
- pp->qc_active &= ~(1 << i);
- pp->dhfis_bits &= ~(1 << i);
- pp->dmafis_bits &= ~(1 << i);
- pp->sdbfis_bits |= (1 << i);
- }
- }
+ pp->qc_active &= ~done_mask;
+ pp->dhfis_bits &= ~done_mask;
+ pp->dmafis_bits &= ~done_mask;
+ pp->sdbfis_bits |= done_mask;
+ ata_qc_complete_multiple(ap, ap->qc_active ^ done_mask);

if (!ap->qc_active) {
DPRINTK("over\n");

2010-06-26 03:52:08

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/25/2010 11:45 PM, Jeff Garzik wrote:
> ata_qc_complete_multiple(), you call
> unexpect_irq()
> expect_irq()


Correction,

unexpect_irq() x N
expect_irq()

for N NCQ commands.

Controller-wide irq twiddling in the function that completes a single
ATA command is painful.

Jeff

2010-06-26 03:52:34

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/25/2010 03:44 AM, Tejun Heo wrote:
> I still think calling unexpect_irq() from ata_qc_complete() is correct
> as ata_qc_complete() is always a good indicator of completion events.

No, it's not. ata_qc_complete() is an indicator that _one_ completion
event occurred, not _all_ completion events for that port.

Converting drivers to use ata_qc_complete_multiple() completely misses
the point: ata_qc_complete_multiple() is doing exactly the same thing
as those drivers: calling ata_qc_complete() in a loop.

ata_qc_complete() is -- as its name implies -- completing a single qc.
Your patch introduces an unconditional controller-wide unexpect_irq()
call. It's a layering violation.

You can trivially trace through ata_qc_complete_multiple() after patch
#11 is applied, and see the result... for $N completion bits passed to
ata_qc_complete_multiple(), you call
unexpect_irq()
expect_irq()
in rapid succession $N times, once for each ata_qc_complete() call in
the loop of ata_qc_complete_multiple(). For something that is not
needed for modern SATA controllers.

The preferred solution would be something that only touches legacy
controllers, namely:

*) create ata_port_complete(), with the implementation

ata_qc_complete()
unexpect_irq()

*) then call ata_port_complete() in the legacy code that needs
unexpect_irq()

We don't want to burden modern SATA drivers with the overhead of
dealing with silly PATA/SATA1 legacy irq nastiness, particularly the
ugliness of calling

unexpect_irq() + expect_irq()

for a number of NCQ commands, in a tight loop!

Jeff




2010-06-26 08:32:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

Hello, Jeff.

On 06/26/2010 05:45 AM, Jeff Garzik wrote:
> No, it's not. ata_qc_complete() is an indicator that _one_ completion
> event occurred, not _all_ completion events for that port.

Well, it can indicte the start of cluster of completions, which is the
necessary information anyway. From the second call on, it's a simple
flag test and return. I doubt it will affect anything even w/ high
performance SSDs but please read on.

> Converting drivers to use ata_qc_complete_multiple() completely misses
> the point: ata_qc_complete_multiple() is doing exactly the same thing
> as those drivers: calling ata_qc_complete() in a loop.

The point of ata_qc_complete_multiple() is giving libata core layer
better visibility into how commands are being executed, which in turn
allows (continued below)

> ata_qc_complete() is -- as its name implies -- completing a single qc.
> Your patch introduces an unconditional controller-wide unexpect_irq()
> call. It's a layering violation.
>
> You can trivially trace through ata_qc_complete_multiple() after patch
> #11 is applied, and see the result... for $N completion bits passed to
> ata_qc_complete_multiple(), you call
> unexpect_irq()
> expect_irq()
> in rapid succession $N times, once for each ata_qc_complete() call in
> the loop of ata_qc_complete_multiple(). For something that is not
> needed for modern SATA controllers.

ata_qc_complete_multiple() call [un]expect_irq() only once by
introducing an internal completion function w/o irq expect handling,
say ata_qc_complete_raw() and making both ata_qc_complete() and
ata_qc_complete_multiple() simple wrapper around it w/ irq expect
handling.

> The preferred solution would be something that only touches legacy
> controllers, namely:
>
> *) create ata_port_complete(), with the implementation
>
> ata_qc_complete()
> unexpect_irq()
>
> *) then call ata_port_complete() in the legacy code that needs
> unexpect_irq()
>
> We don't want to burden modern SATA drivers with the overhead of
> dealing with silly PATA/SATA1 legacy irq nastiness, particularly the
> ugliness of calling

I think we're much better off applying it to all the drivers. IRQ
expecting is very cheap and scalable and there definitely are plenty
of IRQ delivery problems with modern controllers although their
patterns tend to be different from legacy ones. Plus, it will also be
useful for power state predictions.

> unexpect_irq() + expect_irq()
>
> for a number of NCQ commands, in a tight loop!

As I wrote above, I don't think N*unexpect_irq() really matters but
with ata_qc_complete_multiple() conversion, there will only be single
pair of expect/unexpect for each cluster of completions anyway.

Thanks.

--
tejun

2010-06-26 09:16:10

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 06/26/2010 04:31 AM, Tejun Heo wrote:
> Well, it can indicte the start of cluster of completions, which is the
> necessary information anyway. From the second call on, it's a simple
> flag test and return. I doubt it will affect anything even w/ high
> performance SSDs but please read on.

Yes, and your patch calls unexpect_irq() at the _start_ of a cluster of
completions. That is nonsensical, because it reflects the /opposite/ of
the present ATA bus state, when multiple commands are in flight.


> ata_qc_complete_multiple() call [un]expect_irq() only once by
> introducing an internal completion function w/o irq expect handling,
> say ata_qc_complete_raw() and making both ata_qc_complete() and
> ata_qc_complete_multiple() simple wrapper around it w/ irq expect
> handling.

Yes, this fixes problem, but it is better to create a wrapper path for
the legacy PATA/SATA1 that uses irq-expecting, and a fast path for
modern controllers that do not use it.


> On 06/26/2010 05:45 AM, Jeff Garzik wrote:
>> We don't want to burden modern SATA drivers with the overhead of
>> dealing with silly PATA/SATA1 legacy irq nastiness, particularly the
>> ugliness of calling
>
> I think we're much better off applying it to all the drivers. IRQ
> expecting is very cheap and scalable and there definitely are plenty
> of IRQ delivery problems with modern controllers although their
> patterns tend to be different from legacy ones. Plus, it will also be
> useful for power state predictions.

Modern SATA/SAS controllers, and their drivers, already have well
defined methods of acknowledging interrupts, even unexpected ones, in
ways that do not need this core manipulation. This is over-engineering,
punishing all modern chipsets moving forward regardless of their design,
by unconditionally requiring this behavior of all libata drivers.

Just like the rest of libata's layered driver architecture, it should be
straightforward to apply this only to SFF/BMDMA chipsets, then tackle
odd cases as needs arise.

Modern controllers acknowledge interrupts sanely, and always "expect" an
interrupt when you include interrupt events like hotplug, even if the
ATA bus itself is idle. There is no need to burden the millions of ahci
users with irq-expecting, for example.

With regards to power state predictions, it is only useful if you are
accurately reflecting the ATA bus state (idle or not) at all times. As
mentioned above, this patch clearly creates a condition where
unexpect_irq() is called when commands remain in flight, and libata is
expecting further command completions.

IOW, patch #11 says "we are not expecting irq" when we are.

At least a halfway sane approach would be to track bus-idle status, and
trigger useful code when that status changes (idle->active or
active->idle). Perhaps LED, power state, and irq-expecting could all
use such a triggering mechanism.

Jeff


2010-06-26 09:45:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

Hello, Jeff.

On 06/26/2010 11:16 AM, Jeff Garzik wrote:
> On 06/26/2010 04:31 AM, Tejun Heo wrote:
>> Well, it can indicte the start of cluster of completions, which is the
>> necessary information anyway. From the second call on, it's a simple
>> flag test and return. I doubt it will affect anything even w/ high
>> performance SSDs but please read on.
>
> Yes, and your patch calls unexpect_irq() at the _start_ of a cluster of
> completions. That is nonsensical, because it reflects the /opposite/ of
> the present ATA bus state, when multiple commands are in flight.

That's actually what we wanna know. I'll talk about it below.

>> ata_qc_complete_multiple() call [un]expect_irq() only once by
>> introducing an internal completion function w/o irq expect handling,
>> say ata_qc_complete_raw() and making both ata_qc_complete() and
>> ata_qc_complete_multiple() simple wrapper around it w/ irq expect
>> handling.
>
> Yes, this fixes problem, but it is better to create a wrapper path for
> the legacy PATA/SATA1 that uses irq-expecting, and a fast path for
> modern controllers that do not use it.
>
>> On 06/26/2010 05:45 AM, Jeff Garzik wrote:
>>> We don't want to burden modern SATA drivers with the overhead of
>>> dealing with silly PATA/SATA1 legacy irq nastiness, particularly the
>>> ugliness of calling
>>
>> I think we're much better off applying it to all the drivers. IRQ
>> expecting is very cheap and scalable and there definitely are plenty
>> of IRQ delivery problems with modern controllers although their
>> patterns tend to be different from legacy ones. Plus, it will also be
>> useful for power state predictions.
>
> Modern SATA/SAS controllers, and their drivers, already have well
> defined methods of acknowledging interrupts, even unexpected ones, in
> ways that do not need this core manipulation. This is over-engineering,
> punishing all modern chipsets moving forward regardless of their design,
> by unconditionally requiring this behavior of all libata drivers.

Unacked irqs are primarily handled by spurious IRQ handling. IRQ
expecting is more about lost interrupts and we have enough lost
interrupt cases even on new controllers w/ native interface, both
transient and non-transient.

One of the goals of this whole IRQ exception handling was to make it
dumb easy for drivers to use which also included makes things cheap
enough so that they can be called from hot paths. Both expect and
unexpect_irq() are very cheap once the IRQ delivery is verified. If
the processor is taking an interrupt in the first place, this amount
of logic shouldn't matter at all. There really isn't punishment to
avoid and IMHO not doing it for native controllers is an over
optimization. It gains almost nothing while losing meaningful
protection.

> Just like the rest of libata's layered driver architecture, it should be
> straightforward to apply this only to SFF/BMDMA chipsets, then tackle
> odd cases as needs arise.
>
> Modern controllers acknowledge interrupts sanely, and always "expect" an
> interrupt when you include interrupt events like hotplug, even if the
> ATA bus itself is idle. There is no need to burden the millions of ahci
> users with irq-expecting, for example.

I'm not saying applying it to only SFF/BMDMA is difficult, just that
it's better to apply it to all drivers in this case. IRQ expecting is
to protect against misdelivered / lost IRQs and we do have them for
ahci, sil24 or whatever too. It would of course be silly to pay
significant performance overhead for such protection but as I stated
above, it's _really_ cheap. If the driver is taking an interrupt and
accessing harddware and even if compared only against the general
complexity of generic IRQ and libata code, the cost of IRQ [un]expect
is negligible and designed precisely that way to allow use cases like
this.

> With regards to power state predictions, it is only useful if you are
> accurately reflecting the ATA bus state (idle or not) at all times. As
> mentioned above, this patch clearly creates a condition where
> unexpect_irq() is called when commands remain in flight, and libata is
> expecting further command completions.
>
> IOW, patch #11 says "we are not expecting irq" when we are.
>
> At least a halfway sane approach would be to track bus-idle status, and
> trigger useful code when that status changes (idle->active or
> active->idle). Perhaps LED, power state, and irq-expecting could all
> use such a triggering mechanism.

Continuing from the response to the first paragraph. The IRQ
expecting code isn't interested in the bus state, it's interested only
in the IRQ events and that's what it's expecting. The same applies to
power state prediction too, so please consider the following NCQ
command execution sequence.

1. issue tags 0, 1, 2, 3
2. IRQ triggers, tags 0, 2 complete
3. IRQ triggers, tags 1, 3 completes

For IRQ expecting, both 1-2 and 2-3 are segments to expect for and for
power state transition too, as it's IRQ itself which forces the cpu to
come out of sleep state. The reason why I said unexpect in
ata_qc_complete() is okay is that it can still delimit each segment as
long as we have proper irq_expect() call at the beginning of each
segment (all other unexpect calls are ignored). But, that's kind of
moot point as we can easily do single pair.

Thanks.

--
tejun

2010-07-02 14:43:00

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

Hello, Jeff.

So, something like the following. This should be applied on top of
the two previous libata patches. The amount of code in the hot path
is very small. Compared to the cpu actually taking an interrupt and
accessing hardware, it should be negligible, and this will give us
working and acceptably performing systems in the presence of most
types of IRQ problems.

Thanks.

Subject: [PATCH] libata: use IRQ expecting

Legacy ATA is very susceptible to IRQ delivery problems in both
directions - lost and spurious interrupts. In traditional PATA, the
IRQ line is ultimately out of the controller and driver's control.

Even relatively new SATA controllers share this problem as many still
emulate the traditional IDE interface which doesn't have reliable way
to indicate interrupt pending state and there also is an issue
regarding the interpretation of nIEN on both sides of the cable.

Controllers with native interface have fewer problems compared to the
ones which use SFF but they still are affected by IRQ misrouting or
broken MSI implementations.

IRQ delivery problems on ATA are particularly nasty because it
commonly hosts installation and/or booting.

Most of these problems can be worked around by using the new IRQ
expecting mechanism without adding any noticeable overhead. In ATA,
almost all operations are initiated by the host and the controller
signals progress or completion using IRQ. IRQ expecting can easily be
added in libata core and applied to all libata drivers.

Signed-off-by: Tejun Heo <[email protected]>
---
drivers/ata/libata-core.c | 55 ++++++++++++++++++++++++++++++----------------
drivers/ata/libata-eh.c | 4 ++-
drivers/ata/libata-sff.c | 37 +++++++++++++++---------------
include/linux/libata.h | 2 +
4 files changed, 60 insertions(+), 38 deletions(-)

Index: work/drivers/ata/libata-core.c
===================================================================
--- work.orig/drivers/ata/libata-core.c
+++ work/drivers/ata/libata-core.c
@@ -4958,22 +4958,7 @@ static void ata_verify_xfer(struct ata_q
dev->flags &= ~ATA_DFLAG_DUBIOUS_XFER;
}

-/**
- * ata_qc_complete - Complete an active ATA command
- * @qc: Command to complete
- *
- * Indicate to the mid and upper layers that an ATA command has
- * completed, with either an ok or not-ok status.
- *
- * Refrain from calling this function multiple times when
- * successfully completing multiple NCQ commands.
- * ata_qc_complete_multiple() should be used instead, which will
- * properly update IRQ expect state.
- *
- * LOCKING:
- * spin_lock_irqsave(host lock)
- */
-void ata_qc_complete(struct ata_queued_cmd *qc)
+static void ata_qc_complete_raw(struct ata_queued_cmd *qc)
{
struct ata_port *ap = qc->ap;

@@ -5052,6 +5037,27 @@ void ata_qc_complete(struct ata_queued_c
}

/**
+ * ata_qc_complete - Complete an active ATA command
+ * @qc: Command to complete
+ *
+ * Indicate to the mid and upper layers that an ATA command has
+ * completed, with either an ok or not-ok status.
+ *
+ * Refrain from calling this function multiple times when
+ * successfully completing multiple NCQ commands.
+ * ata_qc_complete_multiple() should be used instead, which will
+ * properly update IRQ expect state.
+ *
+ * LOCKING:
+ * spin_lock_irqsave(host lock)
+ */
+void ata_qc_complete(struct ata_queued_cmd *qc)
+{
+ unexpect_irq(qc->ap->irq_expect, false);
+ ata_qc_complete_raw(qc);
+}
+
+/**
* ata_qc_complete_multiple - Complete multiple qcs successfully
* @ap: port in question
* @qc_active: new qc_active mask
@@ -5076,6 +5082,8 @@ int ata_qc_complete_multiple(struct ata_
int nr_done = 0;
u32 done_mask;

+ unexpect_irq(ap->irq_expect, false);
+
done_mask = ap->qc_active ^ qc_active;

if (unlikely(done_mask & qc_active)) {
@@ -5090,12 +5098,15 @@ int ata_qc_complete_multiple(struct ata_

qc = ata_qc_from_tag(ap, tag);
if (qc) {
- ata_qc_complete(qc);
+ ata_qc_complete_raw(qc);
nr_done++;
}
done_mask &= ~(1 << tag);
}

+ if (ap->qc_active)
+ expect_irq(ap->irq_expect);
+
return nr_done;
}

@@ -5162,6 +5173,7 @@ void ata_qc_issue(struct ata_queued_cmd
qc->err_mask |= ap->ops->qc_issue(qc);
if (unlikely(qc->err_mask))
goto err;
+ expect_irq(ap->irq_expect);
return;

sg_err:
@@ -6194,8 +6206,13 @@ int ata_host_activate(struct ata_host *h
if (rc)
return rc;

- for (i = 0; i < host->n_ports; i++)
- ata_port_desc(host->ports[i], "irq %d", irq);
+ for (i = 0; i < host->n_ports; i++) {
+ struct ata_port *ap = host->ports[i];
+
+ if (!ata_port_is_dummy(ap))
+ ap->irq_expect = init_irq_expect(irq, host);
+ ata_port_desc(ap, "irq %d%s", irq, ap->irq_expect ? "+" : "");
+ }

rc = ata_host_register(host, sht);
/* if failed, just free the IRQ and leave ports alone */
Index: work/drivers/ata/libata-eh.c
===================================================================
--- work.orig/drivers/ata/libata-eh.c
+++ work/drivers/ata/libata-eh.c
@@ -619,8 +619,10 @@ void ata_scsi_error(struct Scsi_Host *ho
* handler doesn't diddle with those qcs. This must
* be done atomically w.r.t. setting QCFLAG_FAILED.
*/
- if (nr_timedout)
+ if (nr_timedout) {
+ unexpect_irq(ap->irq_expect, true);
__ata_port_freeze(ap);
+ }

spin_unlock_irqrestore(ap->lock, flags);

Index: work/include/linux/libata.h
===================================================================
--- work.orig/include/linux/libata.h
+++ work/include/linux/libata.h
@@ -751,6 +751,8 @@ struct ata_port {
struct ata_host *host;
struct device *dev;

+ struct irq_expect *irq_expect; /* for irq expecting */
+
struct delayed_work hotplug_task;
struct work_struct scsi_rescan_task;

Index: work/drivers/ata/libata-sff.c
===================================================================
--- work.orig/drivers/ata/libata-sff.c
+++ work/drivers/ata/libata-sff.c
@@ -2388,7 +2388,8 @@ int ata_pci_sff_activate_host(struct ata
struct device *dev = host->dev;
struct pci_dev *pdev = to_pci_dev(dev);
const char *drv_name = dev_driver_string(host->dev);
- int legacy_mode = 0, rc;
+ struct ata_port *ap[2] = { host->ports[0], host->ports[1] };
+ int legacy_mode = 0, i, rc;

rc = ata_host_start(host);
if (rc)
@@ -2422,29 +2423,29 @@ int ata_pci_sff_activate_host(struct ata
if (rc)
goto out;

- ata_port_desc(host->ports[0], "irq %d", pdev->irq);
- ata_port_desc(host->ports[1], "irq %d", pdev->irq);
+ for (i = 0; i < 2; i++) {
+ if (!ata_port_is_dummy(ap[i]))
+ ap[i]->irq_expect =
+ init_irq_expect(pdev->irq, host);
+ ata_port_desc(ap[i], "irq %d%s",
+ pdev->irq, ap[i]->irq_expect ? "+" : "");
+ }
} else if (legacy_mode) {
- if (!ata_port_is_dummy(host->ports[0])) {
- rc = devm_request_irq(dev, ATA_PRIMARY_IRQ(pdev),
- irq_handler, IRQF_SHARED,
- drv_name, host);
- if (rc)
- goto out;
+ unsigned int irqs[2] = { ATA_PRIMARY_IRQ(pdev),
+ ATA_SECONDARY_IRQ(pdev) };

- ata_port_desc(host->ports[0], "irq %d",
- ATA_PRIMARY_IRQ(pdev));
- }
+ for (i = 0; i < 2; i++) {
+ if (ata_port_is_dummy(ap[i]))
+ continue;

- if (!ata_port_is_dummy(host->ports[1])) {
- rc = devm_request_irq(dev, ATA_SECONDARY_IRQ(pdev),
- irq_handler, IRQF_SHARED,
- drv_name, host);
+ rc = devm_request_irq(dev, irqs[i], irq_handler,
+ IRQF_SHARED, drv_name, host);
if (rc)
goto out;

- ata_port_desc(host->ports[1], "irq %d",
- ATA_SECONDARY_IRQ(pdev));
+ ap[i]->irq_expect = init_irq_expect(irqs[i], host);
+ ata_port_desc(ap[i], "irq %d%s",
+ irqs[i], ap[i]->irq_expect ? "+" : "");
}
}

2010-07-02 14:54:28

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 07/02/2010 04:41 PM, Tejun Heo wrote:
> Hello, Jeff.
>
> So, something like the following. This should be applied on top of
> the two previous libata patches. The amount of code in the hot path
> is very small. Compared to the cpu actually taking an interrupt and
> accessing hardware, it should be negligible, and this will give us
> working and acceptably performing systems in the presence of most
> types of IRQ problems.

The whole tree is available in the following branch.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git libata-irq-expect

Thanks.

--
tejun

2010-07-02 15:01:11

by Tejun Heo

[permalink] [raw]
Subject: [GIT PULL] irq: better lost/spurious irq handling

Hello, Thomas.

Please pull from the following branch to receieve lost-spurious-irq
patchset.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq

The branch contains the patches to implement lost/spurious IRQ
handling (patches 0001-0010) and apply it to USB (0012). libata part
is being discussed and not included in this pull request.

Thanks.

--
tejun

2010-07-10 10:08:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 07/02/2010 04:41 PM, Tejun Heo wrote:
> Hello, Jeff.
>
> So, something like the following. This should be applied on top of
> the two previous libata patches. The amount of code in the hot path
> is very small. Compared to the cpu actually taking an interrupt and
> accessing hardware, it should be negligible, and this will give us
> working and acceptably performing systems in the presence of most
> types of IRQ problems.

Ping.

--
tejun

2010-07-14 07:58:14

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 07/10/2010 06:06 AM, Tejun Heo wrote:
> On 07/02/2010 04:41 PM, Tejun Heo wrote:
>> Hello, Jeff.
>>
>> So, something like the following. This should be applied on top of
>> the two previous libata patches. The amount of code in the hot path
>> is very small. Compared to the cpu actually taking an interrupt and
>> accessing hardware, it should be negligible, and this will give us
>> working and acceptably performing systems in the presence of most
>> types of IRQ problems.
>
> Ping.

Give me another day or two. I'm working up an alternate patch for
demonstration (still use the irq-expecting api, but differently).

Jeff


2010-07-14 09:27:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 07/14/2010 09:58 AM, Jeff Garzik wrote:
> Give me another day or two. I'm working up an alternate patch for
> demonstration (still use the irq-expecting api, but differently).

Sure thing.

Thanks.

--
tejun

2010-07-27 17:37:58

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 11/12] libata: use IRQ expecting

On 07/02/2010 10:41 AM, Tejun Heo wrote:
> Hello, Jeff.
>
> So, something like the following. This should be applied on top of
> the two previous libata patches. The amount of code in the hot path
> is very small. Compared to the cpu actually taking an interrupt and
> accessing hardware, it should be negligible, and this will give us
> working and acceptably performing systems in the presence of most
> types of IRQ problems.
>
> Thanks.
>
> Subject: [PATCH] libata: use IRQ expecting
>
> Legacy ATA is very susceptible to IRQ delivery problems in both
> directions - lost and spurious interrupts. In traditional PATA, the
> IRQ line is ultimately out of the controller and driver's control.
>
> Even relatively new SATA controllers share this problem as many still
> emulate the traditional IDE interface which doesn't have reliable way
> to indicate interrupt pending state and there also is an issue
> regarding the interpretation of nIEN on both sides of the cable.
>
> Controllers with native interface have fewer problems compared to the
> ones which use SFF but they still are affected by IRQ misrouting or
> broken MSI implementations.
>
> IRQ delivery problems on ATA are particularly nasty because it
> commonly hosts installation and/or booting.
>
> Most of these problems can be worked around by using the new IRQ
> expecting mechanism without adding any noticeable overhead. In ATA,
> almost all operations are initiated by the host and the controller
> signals progress or completion using IRQ. IRQ expecting can easily be
> added in libata core and applied to all libata drivers.
>
> Signed-off-by: Tejun Heo<[email protected]>
> ---
> drivers/ata/libata-core.c | 55 ++++++++++++++++++++++++++++++----------------
> drivers/ata/libata-eh.c | 4 ++-
> drivers/ata/libata-sff.c | 37 +++++++++++++++---------------
> include/linux/libata.h | 2 +
> 4 files changed, 60 insertions(+), 38 deletions(-)

I suppose the few cycles it costs are worth it...

Assuming this new version (from July 2) is tested,

Acked-by: Jeff Garzik <[email protected]>

2010-08-01 23:48:08

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 1/2 #upstream] sata_fsl,mv,nv: prepare for NCQ command completion update

On 06/25/2010 09:02 AM, Tejun Heo wrote:
> Make the following changes to prepare for NCQ command completion
> update. Changes made by this patch don't cause any functional
> difference.
>
> * sata_fsl_host_intr(): rename the local variable qc_active to
> done_mask as that's what it is.
>
> * mv_process_crpb_response(): restructure if clause for easier update.
>
> * nv_adma_interrupt(): drop unnecessary error variable.
>
> * nv_swncq_sdbfis(): drop unnecessary nr_done and return 0 on success.
> Typo fix.
>
> * nv_swncq_dmafis(): drop unused return value and return void.
>
> * nv_swncq_host_interrupt(): drop unnecessary return value handling.
>
> Signed-off-by: Tejun Heo<[email protected]>
> Cc: Ashish Kalra<[email protected]>
> Cc: Saeed Bishara<[email protected]>
> Cc: Mark Lord<[email protected]>
> Cc: Robert Hancock<[email protected]>
> ---
> So, something like this. I tested both flavors of sata_nv but don't
> have access to sata_mv or sata_fsl, but the conversion is pretty
> straight forward and failures should be pretty easy to catch.

applied

2010-08-02 07:18:58

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 1/2 #upstream] sata_fsl,mv,nv: prepare for NCQ command completion update

Hello, Jeff.

On 08/02/2010 01:47 AM, Jeff Garzik wrote:
>> So, something like this. I tested both flavors of sata_nv but don't
>> have access to sata_mv or sata_fsl, but the conversion is pretty
>> straight forward and failures should be pretty easy to catch.
>
> applied

Are you planning on applying the second patch too?

Thanks.

--
tejun

2010-08-04 04:23:43

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 1/2 #upstream] sata_fsl,mv,nv: prepare for NCQ command completion update

On 08/02/2010 03:18 AM, Tejun Heo wrote:
> Hello, Jeff.
>
> On 08/02/2010 01:47 AM, Jeff Garzik wrote:
>>> So, something like this. I tested both flavors of sata_nv but don't
>>> have access to sata_mv or sata_fsl, but the conversion is pretty
>>> straight forward and failures should be pretty easy to catch.
>>
>> applied
>
> Are you planning on applying the second patch too?

sata_mv is behaving weirdly in 2.6.35 + these patches, on the older 6081
chip, embedded in a Dell server box. Looking into it...

Jeff


2010-08-17 22:03:11

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 2/2 #upstream] libata: always use ata_qc_complete_multiple() for NCQ command completions

On 06/25/2010 09:03 AM, Tejun Heo wrote:
> Currently, sata_fsl, mv and nv call ata_qc_complete() multiple times
> from their interrupt handlers to indicate completion of NCQ commands.
> This limits the visibility the libata core layer has into how commands
> are being executed and completed, which is necessary to support IRQ
> expecting in generic way. libata already has an interface to complete
> multiple commands at once - ata_qc_complete_multiple() which ahci and
> sata_sil24 already use.
>
> This patch updates the three drivers to use ata_qc_complete_multiple()
> too and updates comments on ata_qc_complete[_multiple]() regarding
> their usages with NCQ completions. This change not only provides
> better visibility into command execution to the core layer but also
> simplifies low level drivers.
>
> * sata_fsl: It already builds done_mask. Conversion is straight
> forward.
>
> * sata_mv: mv_process_crpb_response() no longer checks for illegal
> completions, it just returns whether the tag is completed or not.
> mv_process_crpb_entries() builds done_mask from it and passes it to
> ata_qc_complete_multiple() which will check for illegal completions.
>
> * sata_nv adma: Similar to sata_mv. nv_adma_check_cpb() now just
> returns the tag status and nv_adma_interrupt() builds done_mask from
> it and passes it to ata_qc_complete_multiple().
>
> * sata_nv swncq: It already builds done_mask. Drop unnecessary
> illegal transition checks and call ata_qc_complete_multiple().
>
> In the long run, it might be a good idea to make ata_qc_complete()
> whine if called when multiple NCQ commands are in flight.
>
> Signed-off-by: Tejun Heo<[email protected]>
> Cc: Ashish Kalra<[email protected]>
> Cc: Saeed Bishara<[email protected]>
> Cc: Mark Lord<[email protected]>
> Cc: Robert Hancock<[email protected]>
> ---
> drivers/ata/libata-core.c | 13 ++++++++--
> drivers/ata/sata_fsl.c | 8 +-----
> drivers/ata/sata_mv.c | 25 +++++++++-----------
> drivers/ata/sata_nv.c | 57 ++++++++++------------------------------------
> 4 files changed, 38 insertions(+), 65 deletions(-)

I was getting really inconsistent results on sata_mv 6081 last week.
Just booted it up again, to try and resolve this, and it started
spitting out a bunch of PCI errors. That sounds like the root cause to
me... pushing this into #upstream.