Hello, Thomas.
With Jeff's acks added, patches to make libata use irq-expect are
committed. Please pull from the following branch to receive patches[1]
to improve lost/spurious irq handling.
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq
Thanks.
Tejun Heo (14):
irq: cleanup irqfixup
irq: make spurious poll timer per desc
irq: use desc->poll_timer for irqpoll
irq: kill IRQF_IRQPOLL
irq: misc preparations for further changes
irq: implement irq_schedule_poll()
irq: improve spurious IRQ handling
irq: implement IRQ watching
irq: implement IRQ expecting
irq: add comment about overall design of lost/spurious IRQ handling
usb: use IRQ watching
sata_fsl,mv,nv: prepare for NCQ command completion update
libata: always use ata_qc_complete_multiple() for NCQ command completion
libata: use IRQ expecting
arch/arm/mach-aaec2000/core.c | 2 +-
arch/arm/mach-at91/at91rm9200_time.c | 2 +-
arch/arm/mach-at91/at91sam926x_time.c | 2 +-
arch/arm/mach-bcmring/core.c | 2 +-
arch/arm/mach-clps711x/time.c | 2 +-
arch/arm/mach-cns3xxx/core.c | 2 +-
arch/arm/mach-ebsa110/core.c | 2 +-
arch/arm/mach-ep93xx/core.c | 2 +-
arch/arm/mach-footbridge/dc21285-timer.c | 2 +-
arch/arm/mach-footbridge/isa-timer.c | 2 +-
arch/arm/mach-h720x/cpu-h7201.c | 2 +-
arch/arm/mach-h720x/cpu-h7202.c | 2 +-
arch/arm/mach-integrator/integrator_ap.c | 2 +-
arch/arm/mach-ixp2000/core.c | 2 +-
arch/arm/mach-ixp23xx/core.c | 2 +-
arch/arm/mach-ixp4xx/common.c | 2 +-
arch/arm/mach-lh7a40x/time.c | 2 +-
arch/arm/mach-mmp/time.c | 2 +-
arch/arm/mach-netx/time.c | 2 +-
arch/arm/mach-ns9xxx/irq.c | 3 -
arch/arm/mach-ns9xxx/time-ns9360.c | 2 +-
arch/arm/mach-nuc93x/time.c | 2 +-
arch/arm/mach-omap1/time.c | 2 +-
arch/arm/mach-omap1/timer32k.c | 2 +-
arch/arm/mach-omap2/timer-gp.c | 2 +-
arch/arm/mach-pnx4008/time.c | 2 +-
arch/arm/mach-pxa/time.c | 2 +-
arch/arm/mach-sa1100/time.c | 2 +-
arch/arm/mach-shark/core.c | 2 +-
arch/arm/mach-u300/timer.c | 2 +-
arch/arm/mach-w90x900/time.c | 2 +-
arch/arm/plat-iop/time.c | 2 +-
arch/arm/plat-mxc/time.c | 2 +-
arch/arm/plat-samsung/time.c | 2 +-
arch/arm/plat-versatile/timer-sp.c | 2 +-
arch/blackfin/kernel/time-ts.c | 6 +-
arch/ia64/kernel/time.c | 2 +-
arch/parisc/kernel/irq.c | 2 +-
arch/powerpc/platforms/cell/interrupt.c | 5 +-
arch/x86/kernel/time.c | 2 +-
drivers/ata/libata-core.c | 54 ++-
drivers/ata/libata-eh.c | 4 +-
drivers/ata/libata-sff.c | 37 +-
drivers/ata/sata_fsl.c | 26 +-
drivers/ata/sata_mv.c | 58 +-
drivers/ata/sata_nv.c | 87 +--
drivers/clocksource/sh_cmt.c | 3 +-
drivers/clocksource/sh_mtu2.c | 3 +-
drivers/clocksource/sh_tmu.c | 3 +-
drivers/usb/core/hcd.c | 1 +
include/linux/interrupt.h | 43 +-
include/linux/irq.h | 40 +-
include/linux/libata.h | 2 +
kernel/irq/chip.c | 20 +-
kernel/irq/handle.c | 7 +-
kernel/irq/internals.h | 10 +-
kernel/irq/manage.c | 18 +-
kernel/irq/proc.c | 5 +-
kernel/irq/spurious.c | 978 +++++++++++++++++++++++++-----
59 files changed, 1101 insertions(+), 386 deletions(-)
--
tejun
[1] http://thread.gmane.org/gmane.linux.ide/46448
On 07/28/2010 03:42 PM, Tejun Heo wrote:
> Hello, Thomas.
>
> With Jeff's acks added, patches to make libata use irq-expect are
> committed. Please pull from the following branch to receive patches[1]
> to improve lost/spurious irq handling.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq
I'll ask Stephen to pull the above branch into linux-next until it
shows up via tip so that we don't lose any more linux-next testing
time.
Thank you.
--
tejun
On Wed, 28 Jul 2010, Tejun Heo wrote:
> On 07/28/2010 03:42 PM, Tejun Heo wrote:
> > Hello, Thomas.
> >
> > With Jeff's acks added, patches to make libata use irq-expect are
> > committed. Please pull from the following branch to receive patches[1]
> > to improve lost/spurious irq handling.
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git lost-spurious-irq
>
> I'll ask Stephen to pull the above branch into linux-next until it
> shows up via tip so that we don't lose any more linux-next testing
> time.
I'm working on it already and I'm currently twisting my brain around
the problems these patches will impose on the RT stuff, where we
cannot access timer_list timers from atomic regions :(
Thanks,
tglx
Hello,
On 07/29/2010 10:44 AM, Thomas Gleixner wrote:
>> I'll ask Stephen to pull the above branch into linux-next until it
>> shows up via tip so that we don't lose any more linux-next testing
>> time.
>
> I'm working on it already and I'm currently twisting my brain around
> the problems these patches will impose on the RT stuff, where we
> cannot access timer_list timers from atomic regions :(
Oh, I see. A dull last-ditch solution would be just disabling it on
CONFIG_PREEMPT_RT, but yeah, that would be dull. :-(
Thanks.
--
tejun
Tejun,
On Fri, 30 Jul 2010, Tejun Heo wrote:
> Hello,
>
> On 07/29/2010 10:44 AM, Thomas Gleixner wrote:
> >> I'll ask Stephen to pull the above branch into linux-next until it
> >> shows up via tip so that we don't lose any more linux-next testing
> >> time.
> >
> > I'm working on it already and I'm currently twisting my brain around
> > the problems these patches will impose on the RT stuff, where we
> > cannot access timer_list timers from atomic regions :(
>
> Oh, I see. A dull last-ditch solution would be just disabling it on
> CONFIG_PREEMPT_RT, but yeah, that would be dull. :-(
Yup, but I have some other issues with this series as well. That thing
is massively overengineered. You mangle all the various functions into
irq_poll which makes it really hard to understand what the code does
under which circumstances.
I understand most of the problems you want to solve, but I don't like
the outcome too much.
Let's look at the various parts:
1) irq polling
- Get rid of the for_each_irq() loop in the poll timer.
You can solve this by adding the interrupt to a linked list and
let the poll timer run through the list. Also why is the simple
counter-based decision not good enough? Why do we need an extra
time period for this?
- Reenable the interrupt from time to time to check whether the irq
storm subsided.
Generally a good idea, but we really do not want to wait for
another 10k unhandled interrupts flooding the machine. Ideally we
should read out the irq pending register from the irq chip to see
whether the interrupt is still asserted.
2) irq watch
I have to admit, that I have no clue what this thing exactly
does. After staring into the code for quite a while I came to the
conclusion that it's a dynamic polling machinery, which adjusts
its polling frequency depending on the number of good and bad
samples. The heuristics for this are completely non-obvious.
Aside from that, the watch mechanism adds unnecessary code to the hard
interrupt context. There is no reason why you need to update that
watch magic in the hard interrupt context. Analysing the stats can
be done in the watch timer as well. All it takes is in
handle_irq_event()
	case IRQ_HANDLED:
		action->handled++;
and in the watch timer function
	if (ret == IRQ_HANDLED)
		action->polled++;
irq_watch should use a separate global timer and a linked list. The
timer is started when an interrupt is added to the watch mechanism
and stopped when the list becomes empty.
That's a clear separation of watch and poll and simplifies the
whole business a lot.
3) irq expect
So you basically poll the interrupt until either the interrupt
happened or the device driver timeout occurred.
That's really weird. What does it solve ? Not running into the
device driver timeout routine if the hardware interrupt has been
lost?
That's overkill, isn't it ? When the hardware loses an interrupt
occasionally then why isn't the device driver timeout routine
sufficient? If it does not check whether the device has an
interrupt pending despite the fact that it has not been delivered,
then this code needs to be fixed and not worked around with weird
polling in the generic irq code.
Btw, this polling is pretty bad in terms of power savings.
- The timers are not aligned, so if there are several expects armed
you get unbatched wakeups.
- If the poll interval is smaller than the average hardware
interrupt delivery time, you add more wakeups than necessary.
- If the hardware interrupt has happened, you let the poll timer
pending, which gives us an extra wakeup for each interrupt in the
worst case.
I assume your concern is to detect and deal with flaky interrupts,
but avoiding the permanent poll when the device is not used, right?
So that can be done simpler as well. One timer and a linked list
again.
expect_irq() adds the irq to the linked list and arms the timer, if
it is not armed already. unexpect_irq() removes the irq from the
linked list and deletes the timer when the list becomes empty.
That can reuse the list_head in irq_desc which you need for irq
poll and the timer interval should be fixed without dynamic
adjustment.
Keep it simple!
Thanks
tglx
Hello, Thomas.
On 08/02/2010 04:07 PM, Thomas Gleixner wrote:
> Yup, but I have some other issues with this series as well. That thing
> is massively overengineered. You mangle all the various functions into
> irq_poll which makes it really hard to understand what the code does
> under which circumstances.
Oh, I'll explain below why I multiplexed timers this way.
> I understand most of the problems you want to solve, but I don't like
> the outcome too much.
>
> Let's look at the various parts:
Heh, fair enough. Let's talk about it.
> 1) irq polling
Do you mean spurious irq polling here?
> - Get rid of the for_each_irq() loop in the poll timer.
>
> You can solve this by adding the interrupt to a linked list and
> let the poll timer run through the list.
I suppose you're talking about using per-desc timer. I first thought
about using a single common timer too but decided against it because
timer events can be merged from timer code (depending on the interval
and slack) and I wanted to use different polling frequencies for
different cases. We can of course implement logic to put polled irqs
on a list in chronological order but that's basically redoing the work
timer code already does.
> Also why is the simple counter based decision not good enough?
> Why do we need an extra time period for this?
Because in many cases IRQ storms are transient and spurious IRQ
polling worsens performance a lot, it's worthwhile to be able to
recover from such conditions, so the extra time period is there to
trigger reenabling of the offending IRQ to see whether the storm is
over now. Please note that many IRQ storms of this type are
extremely obscure (for example, one happens during resume from STR
once in a blue moon due to a bad interaction between the legacy ATA
state machine in the controller and certain drive behavior) and some
are very difficult to avoid from the driver in any reasonable way.
> - Reenable the interrupt from time to time to check whether the irq
> storm subsided.
Or maybe you were talking about something else?
> Generally a good idea, but we really do not want to wait for
> another 10k unhandled interrupts flooding the machine. Ideally we
> should read out the irq pending register from the irq chip to see
> whether the interrupt is still asserted.
In simulated tests with hosed IRQ routing the behavior was already
quite acceptable, but yeah if peeking at interrupt state can be done
easily that would definitely be an improvement.
> 2) irq watch
>
> I have to admit, that I have no clue what this thing exactly
> does. After staring into the code for quite a while I came to the
> conclusion that it's a dynamic polling machinery, which adjusts
> its polling frequency depending on the number of good and bad
> samples. The heuristics for this are completely non-obvious.
Eh, I thought I did a pretty good job documenting each mechanism.
Obviously, not well enough. :-)
It basically allows drivers to tell the irq code that IRQs are likely
to happen. In response, IRQ code polls the desc for a while and sees
whether poll detects IRQ handling w/o actual IRQ deliveries. If that
seems to happen more than usual, the IRQ code can determine that the
IRQ is not being delivered and enable full-blown IRQ polling.
The WARY state is an intermediate state between good and bad. Dropping
it would simplify the logic a bit. It's basically there to avoid false
positives, as it would suck to enable polling for a working IRQ line.
Anyway, the basic heuristic is that it watches a certain number of IRQ
events and counts how many are good (via IRQ) and bad (via poll). If
bad goes over a certain threshold, polling is enabled.
> Aside from that, the watch mechanism adds unnecessary code to the hard
> interrupt context. There is no reason why you need to update that
> watch magic in the hard interrupt context. Analysing the stats can
> be done in the watch timer as well. All it takes is in
> handle_irq_event()
>
> case IRQ_HANDLED:
> action->handled++;
>
> and in the watch timer function
>
> if (ret == IRQ_HANDLED)
> action->polled++;
The whole thing is executed iff unlikely(desc->status &
IRQ_CHECK_WATCHES) which the processor will be predicting correctly
most of the time. Do you think it would make any noticeable
difference?
> irq_watch should use a separate global timer and a linked list. The
> timer is started when an interrupt is added to the watch mechanism
> and stopped when the list becomes empty.
>
> That's a clear separation of watch and poll and simplifies the
> whole business a lot.
The reason why poll_irq() looks complicated is because different types
of timers on the same desc shares the timer. The reason why it's
shared this way instead of across different descs of the same type is
to avoid locking overhead in maintaining the timer and linked lists of
its targets. By sharing the timer in the same desc, everything can be
protected by desc->lock but if we use timers across different descs,
we need another layer of locking.
> 3) irq expect
>
> So you basically poll the interrupt until either the interrupt
> happened or the device driver timeout occurred.
>
> That's really weird. What does it solve ? Not running into the
> device driver timeout routine if the hardware interrupt has been
> lost?
>
> That's overkill, isn't it ? When the hardware loses an interrupt
> occasionally then why isn't the device driver timeout routine
> sufficient? If it does not check whether the device has an
> interrupt pending despite the fact that it has not been delivered,
> then this code needs to be fixed and not worked around with weird
> polling in the generic irq code.
Delays caused by lost interrupt and by other failures can differ a lot
in their magnitude and, for example, libata does check IRQ delivery
failure in its timeout handler but at that point it doesn't really
matter why the timeout has occurred for recovery of that specific
command. Too much time has passed anyway.
We definitely can implement such quick polling timeouts which check
for IRQ misdeliveries in drivers so that it can detect such events and
maybe add an interface in the IRQ polling code which the driver can
call to enable polling on the IRQ. But then again the pattern of such
failures and handling of them would be very similar across different
drivers, and we already have most of the polling machinery
implemented in the IRQ code, so I think it makes sense to have a
generic implementation which
drivers can use. Moreover, given the difference in test/review
coverage and general complexity of doing it right, I think it is far
better to do it in the core code and make it easy to use for drivers.
> Btw, this polling is pretty bad in terms of power savings.
>
> - The timers are not aligned, so if there are several expects armed
> you get unbatched wakeups.
In usual operation conditions, the timer interval will quickly become
3 seconds w/ 1 sec slack. I doubt it can cause much of a problem. Right?
> - If the poll interval is smaller than the average hardware
> interrupt delivery time, you add more wakeups than necessary.
The same thing as above. If IRQ delivery is working, it will be
polling every 3 seconds w/ 1 sec slack. It wouldn't really matter.
> - If the hardware interrupt has happened, you let the poll timer
> pending, which gives us an extra wakeup for each interrupt in the
> worst case.
That's to avoid the overhead of constantly manipulating the timer for
high-frequency commands. The expect/unexpect fast paths don't do much: no
locking, no timer manipulations. No matter how busy the IRQ is, the
timer only gets activated and rearmed occasionally.
> I assume your concern is to detect and deal with flaky interrupts,
> but avoiding the permanent poll when the device is not used, right?
>
> So that can be done simpler as well. One timer and a linked list
> again.
>
> expect_irq() adds the irq to the linked list and arms the timer, if
> it is not armed already. unexpect_irq() removes the irq from the
> linked list and deletes the timer when the list becomes empty.
>
> That can reuse the list_head in irq_desc which you need for irq
> poll and the timer interval should be fixed without dynamic
> adjustment.
The main problem here is that expect/unexpect_irq() needs to be low
overhead so that drivers can easily call in w/o worrying too much
about performance overhead and I would definitely want to avoid
locking in fast paths.
> Keep it simple!
To me, the more interesting question is where to put the complexity.
In many cases, the most we can do is pushing complexity around, and if
we can make things easier for drivers by making the core IRQ code
somewhat more complex, I'll definitely go for that every time.
That's also the main theme of this IRQ spurious/missing patchset. The
implementation might be a bit complex but using it from the driver
side is almost no brainer. The drivers just need to give hints to the
IRQ polling code and the IRQ polling code will take care of everything
else, and the low overhead for irq_expect/unexpect() is important in
that aspect because drivers can only use it without worrying iff its
overhead is sufficiently low.
That said, I definitely think poll_irq() can be better factored. Its
current form is mainly due to its transition from spurious poller to
the current multiplexed form. Factoring out different parts and
possibly killing watch handling should simplify it quite a bit.
Thank you.
--
tejun
On 08/02/2010 05:28 PM, Tejun Heo wrote:
> The reason why poll_irq() looks complicated is because different types
> of timers on the same desc shares the timer. The reason why it's
> shared this way instead of across different descs of the same type is
> to avoid locking overhead in maintaining the timer and linked lists of
> its targets. By sharing the timer in the same desc, everything can be
> protected by desc->lock but if we use timers across different descs,
> we need another layer of locking.
Ooh, another reason is timer locality. If timers are shared per desc,
they have much higher chance of being on the same processor. Global
timers would be pretty bad in that respect.
--
tejun
Tejun,
On Mon, 2 Aug 2010, Tejun Heo wrote:
> Hello, Thomas.
>
> On 08/02/2010 04:07 PM, Thomas Gleixner wrote:
> > 1) irq polling
>
> Do you mean spurious irq polling here?
Yep
> > - Get rid of the for_each_irq() loop in the poll timer.
> >
> > You can solve this by adding the interrupt to a linked list and
> > let the poll timer run through the list.
>
> I suppose you're talking about using per-desc timer. I first thought
> about using a single common timer too but decided against it because
> timer events can be merged from timer code (depending on the interval
> and slack) and I wanted to use different polling frequencies for
> different cases. We can of course implement logic to put polled irqs
> on a list in chronological order but that's basically redoing the work
> timer code already does.
They might be merged depending on interval and slack, but why do we
need different frequencies at all ?
> > Also why is the simple counter based decision not good enough?
> > Why do we need an extra time period for this?
>
> Because in many cases IRQ storms are transient and spurious IRQ
> polling worsens performance a lot, it's worthwhile to be able to
> recover from such conditions, so the extra time period is there to
> trigger reenabling of the offending IRQ to see whether the storm is
> over now. Please note that many of this type of IRQ storms are
> extremely obscure (for example, happens during resume from STR once in
> a blue moon due to bad interaction between legacy ATA state machine in
> the controller and the drive's certain behavior) and some are very
> difficult to avoid from the driver in any reasonable way.
If the IRQ has been marked as spurious, then switch to polling for a
defined time period (simply count the number of poll timer runs for
that irq). After that switch back to non polling and keep a flag which
kicks the stupid thing back to poll after 10 spurious ones in a row.
> > - Reenable the interrupt from time to time to check whether the irq
> > storm subsided.
>
> Or maybe you were talking about something else?
No, that stuff is so convoluted that I did not understand it.
> > Generally a good idea, but we really do not want to wait for
> > another 10k unhandled interrupts flooding the machine. Ideally we
> > should read out the irq pending register from the irq chip to see
> > whether the interrupt is still asserted.
>
> In simulated tests with hosed IRQ routing the behavior was already
> quite acceptable, but yeah if peeking at interrupt state can be done
> easily that would definitely be an improvement.
It could be done on various irq chips, but not on all. So we need that
real retry anyway; no need to add another function to irq_chip.
> > 2) irq watch
>
> The whole thing is executed iff unlikely(desc->status &
> IRQ_CHECK_WATCHES) which the processor will be predicting correctly
> most of the time. Do you think it would make any noticeable
> difference?
It's not about the !WATCH case. I don't like the extra work for those
watched ones.
> > irq_watch should use a separate global timer and a linked list. The
> > timer is started when an interrupt is added to the watch mechanism
> > and stopped when the list becomes empty.
> >
> > That's a clear separation of watch and poll and simplifies the
> > whole business a lot.
>
> The reason why poll_irq() looks complicated is because different types
> of timers on the same desc shares the timer. The reason why it's
> shared this way instead of across different descs of the same type is
> to avoid locking overhead in maintaining the timer and linked lists of
> its targets. By sharing the timer in the same desc, everything can be
> protected by desc->lock but if we use timers across different descs,
> we need another layer of locking.
The locking overhead for a global timer + linked list is minimal.
There is only one additional lock, the one which protects the
list_head. So you take it when watch_irq() adds the watch, when the
watch is removed and in the timer routine. That's zero overhead as
lock contention is totally unlikely. There is no need to fiddle with
the timer interval. Watch the stupid thing for a while, if it behaves
remove it from the list, if not you need to keep polling anyway.
> > 3) irq expect
> >
> We definitely can implement such quick polling timeouts which check
> for IRQ misdeliveries in drivers so that it can detect such events and
> maybe add an interface in the IRQ polling code which the driver can
> call to enable polling on the IRQ. But then again the pattern of such
> failures and handling of them would be very similar across different
> drivers, and we already have most of the polling machinery
> implemented in the IRQ code, so I think it makes sense to have a
> generic implementation which
> drivers can use. Moreover, given the difference in test/review
> coverage and general complexity of doing it right, I think it is far
> better to do it in the core code and make it easy to use for drivers.
I'm not opposed to have a generic solution for this, but I want a
simple and understandable one. The "irq_poll() works for everything"
magic is definitely not in this category.
> > Btw, this polling is pretty bad in terms of power savings.
> >
> > - The timers are not aligned, so if there are several expects armed
> > you get unbatched wakeups.
>
> In usual operation conditions, the timer interval will quickly become
> 3 seconds w/ 1 sec slack. I doubt it can cause much problem. Right?
And those 4 seconds are probably larger than the device timeouts in
most cases. So what's the point of running that timer at all ?
> > I assume your concern is to detect and deal with flaky interrupts,
> > but avoiding the permanent poll when the device is not used, right?
> >
> > So that can be done simpler as well. One timer and a linked list
> > again.
> >
> > expect_irq() adds the irq to the linked list and arms the timer, if
> > it is not armed already. unexpect_irq() removes the irq from the
> > linked list and deletes the timer when the list becomes empty.
> >
> > That can reuse the list_head in irq_desc which you need for irq
> > poll and the timer interval should be fixed without dynamic
> > adjustment.
>
> The main problem here is that expect/unexpect_irq() needs to be low
> overhead so that drivers can easily call in w/o worrying too much
> about performance overhead and I would definitely want to avoid
> locking in fast paths.
In general you only want to use this expect thing when you detected
that the device has a problem. So why not use the watch mechanism for
this ?
When you detect the first timeout in the device driver, set the watch
on that action and let it figure out whether it goes back to
normal. If it goes back to normal the watch disappears. If it happens
again, repeat.
expect_irq() would be basically a NOP, but I like to keep it, as it
can be used for power management decisions.
unexpect_irq() would either add or remove the watch depending on the
timeout argument. So there is only overhead when things went bad and
when we go back to normal in all other cases it's zero.
Though we need a reference to irqaction.watch of that device, but
that's not a big deal. With that we can call
unexpect_irq(watch, false/true);
which would be an inline function:
static inline void unexpect_irq(watch, timeout)
{
	if (unlikely(watch->active ^ timeout))
		__unexpect_irq(watch, timeout);
}
Simple, right ?
> To me, the more interesting question is where to put the complexity.
> In many cases, the most we can do is pushing complexity around, and if
> we can make things easier for drivers by making the core IRQ code
> somewhat more complex, I'll definitely go for that every time.
No objections, but I doubt, that we need all the heuristics and
complex code to deal with it.
Thanks,
tglx
On Mon, 2 Aug 2010, Tejun Heo wrote:
> On 08/02/2010 05:28 PM, Tejun Heo wrote:
> > The reason why poll_irq() looks complicated is because different types
> > of timers on the same desc shares the timer. The reason why it's
> > shared this way instead of across different descs of the same type is
> > to avoid locking overhead in maintaining the timer and linked lists of
> > its targets. By sharing the timer in the same desc, everything can be
> > protected by desc->lock but if we use timers across different descs,
> > we need another layer of locking.
>
> Ooh, another reason is timer locality. If timers are shared per desc,
> they have much higher chance of being on the same processor. Global
> timers would be pretty bad in that respect.
That's irrelevant. If you need to poll an interrupt, then it does not
matter at all whether you bounce some cache lines or not.
In fact we have two cases:
1) An interrupt needs to be polled all the time. That sucks whether
the poll timer bounces a few cache lines or not.
2) Polling an irq for some time. Either it works again after a
while, so your suckage is restricted to the poll period. If not
see #1
And it's even less of an issue as the main users of this misfeature
are laptops and desktop machines, where locality is not really that
important. If an enterprise admin decides to ignore the fact that the
hardware is flaky, then he does not care about the cache line bounces
either.
Thanks,
tglx
Hello, Thomas.
On 08/02/2010 08:52 PM, Thomas Gleixner wrote:
>> Ooh, another reason is timer locality. If timers are shared per desc,
>> they have much higher chance of being on the same processor. Global
>> timers would be pretty bad in that respect.
>
> That's irrelevant. If you need to poll an interrupt, then it does not
> matter at all whether you bounce some cache lines or not.
>
> In fact we have two cases:
>
> 1) An interrupt needs to be polled all the time. That sucks whether
> the poll timer bounces a few cache lines or not.
>
> 2) Polling an irq for some time. Either it works again after a
> while, so your suckage is restricted to the poll period. If not
> see #1
Hmm... for spurious and watch the above are true and if it were the
above two it would definitely make more sense to use per-purpose
global timers. The problem is w/ expect tho. It's supposed to be
used with normal hot paths, so expect/unexpect operations better be
low overhead and local. I'll talk more about it in the other reply.
> And it's even less of an issue as the main users of this misfeature
> are laptops and desktop machines, where locality is not really that
> important. If an enterprise admin decides to ignore the fact that the
> hardware is flaky, then he does not care about the cache line bounces
> either.
These problems do happen on Intel chipset machines and are something
which can be worked around with some effort. Eh, let's talk on the
other reply.
Thanks.
--
tejun
Hello, again.
On 08/02/2010 07:10 PM, Thomas Gleixner wrote:
>> I suppose you're talking about using per-desc timer. I first thought
>> about using a single common timer too but decided against it because
>> timer events can be merged from timer code (depending on the interval
>> and slack) and I wanted to use different polling frequencies for
>> different cases. We can of course implement logic to put polled irqs
>> on a list in chronological order but that's basically redoing the work
>> timer code already does.
>
> They might be merged depending on interval and slack, but why do we
> need different frequencies at all ?
Yeah, these differences all come down to expecting. If it weren't for
expecting, one timer to rule them all would have been enough. I'll
continue below.
>> Because in many cases IRQ storms are transient and spurious IRQ
>> polling worsens performance a lot, it's worthwhile to be able to
>> recover from such conditions, so the extra time period is there to
>> trigger reenabling of the offending IRQ to see whether the storm is
>> over now. Please note that many IRQ storms of this type are
>> extremely obscure (for example, happening during resume from STR once
>> in a blue moon due to bad interaction between the legacy ATA state
>> machine in the controller and certain drive behavior) and some are
>> very difficult to avoid from the driver in any reasonable way.
>
> If the IRQ has been marked as spurious, then switch to polling for a
> defined time period (simply count the number of poll timer runs for
> that irq). After that switch back to non polling and keep a flag which
> kicks the stupid thing back to poll after 10 spurious ones in a row.
The current logic isn't that different except that it always considers
periods instead of consecutive ones and uses exponential backoff for
re-enable timing.
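The period-based detection plus exponential re-enable backoff could be
sketched roughly like this (a hypothetical userspace illustration, not
the actual patch code; the names and constants are invented):

```c
#include <assert.h>

/* Hypothetical sketch of exponential backoff for re-enable timing:
 * every time re-enabling the IRQ fails (the storm resumes), the wait
 * before the next re-enable attempt doubles, up to a fixed ceiling. */
#define REENABLE_MIN_MS  1000UL       /* first retry after one second */
#define REENABLE_MAX_MS  3600000UL    /* never wait longer than an hour */

static unsigned long reenable_backoff_next(unsigned long cur_ms)
{
        unsigned long next = cur_ms ? cur_ms * 2 : REENABLE_MIN_MS;

        return next > REENABLE_MAX_MS ? REENABLE_MAX_MS : next;
}
```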
>> Or maybe you were talking about something else?
>
> No, that stuff is so convoluted that I did not understand it.
Hmmm... there _are_ some complications but they're mainly how
different mechanisms may interact with each other (ie. watching
delayed while spurious polling is in effect kind of thing) and turning
IRQs on/off (mostly due to locking), but the spurious irq polling part
isn't exactly convoluted. Which part do you find so convoluted? I do
agree it's somewhat complex and if you have different model in mind,
it might not match your expectations. I tried to clean it up at least
in its current structure but I could have been looking at it for a bit
too long.
>>> 2) irq watch
>>
>> The whole thing is executed iff unlikely(desc->status &
>> IRQ_CHECK_WATCHES) which the processor will be predicting correctly
>> most of the time. Do you think it would make any noticeable
>> difference?
>
> It's not about the !WATCH case. I don't like the extra work for those
> watched ones.
WATCH shouldn't be used in normal paths because the polling code
doesn't have enough information to reduce overhead. It can only be
used on occasional events such as IRQ enable or a driver-detected
timeout, so the WATCH case overhead doesn't really matter. Its
primary use case would be invoking it right after requesting an IRQ as
USB does.
> The locking overhead for a global timer + linked list is minimal.
>
> There is only one additional lock, the one which protects the
> list_head. So you take it when watch_irq() adds the watch, when the
> watch is removed and in the timer routine. That's zero overhead as
> lock contention is totally unlikely. There is no need to fiddle with
> the timer interval. Watch the stupid thing for a while, if it behaves
> remove it from the list, if not you need to keep polling anyway.
For watch only, I agree that global timer would work better.
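The single-global-timer watch scheme being agreed on here might look
roughly like the following userspace sketch (all names hypothetical;
in-kernel, the list would be guarded by the one extra lock mentioned
above and the function run from a timer callback):

```c
#include <assert.h>
#include <stddef.h>

/* One global list of watched IRQs, walked by a single poll timer. */
struct irq_watch {
        struct irq_watch *next;
        int irq;
        int behaving;   /* set once the IRQ proved to be delivered reliably */
};

static struct irq_watch *watch_list;

static void watch_irq(struct irq_watch *w)
{
        /* in-kernel: take the global watch lock around this */
        w->next = watch_list;
        watch_list = w;
}

/* Timer routine: drop watches that behave, keep polling the rest.
 * Returns the number of IRQs still being watched. */
static int watch_timer_fn(void)
{
        struct irq_watch **pp = &watch_list;
        int remaining = 0;

        while (*pp) {
                if ((*pp)->behaving) {
                        *pp = (*pp)->next;      /* the watch disappears */
                } else {
                        remaining++;
                        pp = &(*pp)->next;      /* keep polling this one */
                }
        }
        return remaining;
}
```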
>>> 3) irq expect
> I'm not opposed to have a generic solution for this, but I want a
> simple and understandable one. The irq_poll() works for everything
> magic is definitely not in this category.
It's just a multiplexing timer. If the biggest issue is convolution in
irq_poll(), I can try to make it prettier for sure, but I guess that
discussion is for a later time.
>> In usual operation conditions, the timer interval will quickly become
>> 3 seconds w/ 1 sec slack. I doubt it can cause much problem. Right?
>
> And those 4 seconds are probably larger than the device timeouts in
> most cases. So what's the point of running that timer at all ?
Well, it's fairly common to have tens of secs for timeouts. Even
going into minute range isn't too uncommon.
>> The main problem here is that expect/unexpect_irq() needs to be low
>> overhead so that drivers can easily call in w/o worrying too much
>> about performance overhead and I would definitely want to avoid
>> locking in fast paths.
>
> In general you only want to use this expect thing when you detected
> that the device has a problem. So why not use the watch mechanism for
> this ?
No, the goal is to use it for normal cases, so that we can provide a
reasonable level of protection against occasional hiccups at fairly
low overhead. Otherwise, it wouldn't be different at all from IRQ
watching.
> When you detect the first timeout in the device driver, set the watch
> on that action and let it figure out whether it goes back to
> normal. If it goes back to normal the watch disappears. If it happens
> again, repeat.
Yeah, that's how irq_watch is supposed to be used. irq_expect is for
cases where drivers can provide better information about when to start
and finish expecting interrupts. For these cases, we can provide
protection against occasional IRQ hiccups too. A few secs of hiccup
would be an order of magnitude better and will actually make such
failures mostly bearable.
> Though we need a reference to irqaction.watch of that device, but
> that's not a big deal. With that we can call
>
> unexpect_irq(watch, false/true);
>
> which would be an inline function:
>
> static inline void unexpect_irq(watch, timeout)
> {
>         if (unlikely(watch->active ^ timeout))
>                 __unexpect_irq(watch, timeout);
> }
>
> Simple, right ?
Well, now we're talking about different problems. If it were not for
using it in normal cases, it would be better to just rip out expecting
and use watching.
>> To me, the more interesting question is where to put the complexity.
>> In many cases, the most we can do is pushing complexity around, and if
>> we can make things easier for drivers by making the core IRQ code
>> somewhat more complex, I'll definitely go for that every time.
>
> No objections, but I doubt, that we need all the heuristics and
> complex code to deal with it.
Cool, I'm happy as long as you agree on that. So, here's where I'm
coming from: There are two classes of ATA bugs which have been
bothering me for quite some time now. Both are pretty cumbersome to
solve from libata proper.
The first is some ATAPI devices locking up because we use a different
probing sequence than Windows, plus other media presence polling
related issues. I think we'll need in-kernel media presence detection
and with cmwq it shouldn't be too hard to implement.
The other, as you might have expected, is these frigging IRQ issues.
There are several different patterns of failure. One is misrouting or
other persistent delivery failure; irq_watch alone should be able to
solve most of this. Another common one is a stuck IRQ line. Cases
where this condition is sporadic and transient aren't too rare, hence
the updates to spurious handling. The last is occasional delivery
failures which unnecessarily lead to full-blown command timeouts.
This is what irq_expect tries to work around.
After spending some time w/ ATA, what I've learned is that shit
happens no matter what, the OS/driver had better be ready to handle it
and, in a lot of cases, the driver has enough information to work
around and survive most IRQ problems.
I don't think that IRQ expect/unexpect is pushing it too far. It sure
adds some level of complexity but I don't think it is an inherently
complex thing - it could just be that my code sucks. Anyways, there
are two things to discuss here, I guess: first, irq_expect itself,
and second, the implementation details. If you can think of something
which is simpler and achieves about the same thing, that would be
great.
Thanks.
--
tejun
Tejun,
On Mon, 2 Aug 2010, Tejun Heo wrote:
> Hello, Thomas.
>
> On 08/02/2010 08:52 PM, Thomas Gleixner wrote:
> >> Ooh, another reason is timer locality. If timers are shared per desc,
> >> they have much higher chance of being on the same processor. Global
> >> timers would be pretty bad in that respect.
> >
> > That's irrelevant. If you need to poll an interrupt, then it does not
> > matter at all whether you bounce some cache lines or not.
> >
> > In fact we have two cases:
> >
> > 1) An interrupt needs to be polled all the time. That sucks whether
> > the poll timer bounces a few cache lines or not.
> >
> > 2) Polling an irq for some time. Either it works again after a
> > while, so your suckage is restricted to the poll period. If not
> > see #1
>
> Hmm... for spurious and watch the above are true, and if those were
> the only two cases it would definitely make more sense to use
> per-purpose global timers. The problem is w/ expect tho. It's
> supposed to be used in normal hot paths, so expect/unexpect operations
> had better be low overhead and local. I'll talk more about it in the
> other reply.
No, it's not. You are just looking at it from the wrong
perspective. The expect scenario is just a different form of
watch. You worked around the problem by moving the timer to 3 seconds
+ slack if everything works as expected, but that's just sloppy. In
fact you really want to kick that thing in when things go awry and
take it away when it comes back to normal. When things go awry, then
the cache line bouncing is the least of your worries, really.
Thanks,
tglx
Tejun,
On Mon, 2 Aug 2010, Tejun Heo wrote:
> >> Because in many cases IRQ storms are transient and spurious IRQ
> >> polling worsens performance a lot, it's worthwhile to be able to
> >> recover from such conditions, so the extra time period is there to
> >> trigger reenabling of the offending IRQ to see whether the storm is
> >> over now. Please note that many IRQ storms of this type are
> >> extremely obscure (for example, happening during resume from STR once
> >> in a blue moon due to bad interaction between the legacy ATA state
> >> machine in the controller and certain drive behavior) and some are
> >> very difficult to avoid from the driver in any reasonable way.
> >
> > If the IRQ has been marked as spurious, then switch to polling for a
> > defined time period (simply count the number of poll timer runs for
> > that irq). After that switch back to non polling and keep a flag which
> > kicks the stupid thing back to poll after 10 spurious ones in a row.
>
> The current logic isn't that different except that it always considers
> periods instead of consecutive ones and uses exponential backoff for
> re-enable timing.
There is no need for that exponential backoff. Either the thing works
or it does not. Where is the fcking point? If you switch the thing to
polling mode, then it does not matter at all whether you try once a
second, once a minute or once an hour to restore the thing. It only
matters when you insist on collecting another 10k spurious ones before
deciding another time that the thing is hosed.
> >> Or maybe you were talking about something else?
> >
> > No, that stuff is so convoluted that I did not understand it.
>
> Hmmm... there _are_ some complications but they're mainly how
> different mechanisms may interact with each other (ie. watching
> delayed while spurious polling is in effect kind of thing) and turning
> IRQs on/off (mostly due to locking), but the spurious irq polling part
> isn't exactly convoluted. Which part do you find so convoluted? I do
> agree it's somewhat complex and if you have different model in mind,
> it might not match your expectations. I tried to clean it up at least
> in its current structure but I could have been looking at it for a bit
> too long.
My main concern is that you are trying to tie everything into one
function, which is always a clear sign for crap.
> >>> 2) irq watch
> >>
> >> The whole thing is executed iff unlikely(desc->status &
> >> IRQ_CHECK_WATCHES) which the processor will be predicting correctly
> >> most of the time. Do you think it would make any noticeable
> >> difference?
> >
> > It's not about the !WATCH case. I don't like the extra work for those
> > watched ones.
>
> WATCH shouldn't be used in normal paths because the polling code
> doesn't have enough information to reduce overhead. It can only be
> used on occasional events such as IRQ enable or a driver-detected
> timeout, so the WATCH case overhead doesn't really matter. Its
> primary use case would be invoking it right after requesting an IRQ as
> USB does.
That's nonsense. You add the watch in the first place to figure out
whether that IRQ is reliably delivered or not. If it is not, then you
add the whole magic to the irq disabled hotpath forever. That's not a
problem for that particular IRQ, it's a problem for the overall
system.
> >>> 3) irq expect
> > I'm not opposed to have a generic solution for this, but I want a
> > simple and understandable one. The irq_poll() works for everything
> > magic is definitely not in this category.
>
> It's just a multiplexing timer. If the biggest issue is convolution in
> irq_poll(), I can try to make it prettier for sure, but I guess that
> discussion is for a later time.
It's _NOT_. You _CANNOT_ make it prettier because your "design" is sucky
by "multiplexing the timer". The multiplexing is the main issue and
there is no way to make this less convoluted as hard as you might try.
> >> In usual operation conditions, the timer interval will quickly become
> >> 3 seconds w/ 1 sec slack. I doubt it can cause much problem. Right?
> >
> > And those 4 seconds are probably larger than the device timeouts in
> > most cases. So what's the point of running that timer at all ?
>
> Well, it's fairly common to have tens of secs for timeouts. Even
> going into minute range isn't too uncommon.
No. You just look at it from ATA or whatever, but there are tons of
drivers which have timeouts below 100ms. You _CANNOT_ generalize that
assumption. And if you want this for ATA where it possibly fits, then
you just make your whole argument of generalizing the approach moot.
> >> The main problem here is that expect/unexpect_irq() needs to be low
> >> overhead so that drivers can easily call in w/o worrying too much
> >> about performance overhead and I would definitely want to avoid
> >> locking in fast paths.
> >
> > In general you only want to use this expect thing when you detected
> > that the device has a problem. So why not use the watch mechanism for
> > this ?
>
> No, the goal is to use it for normal cases, so that we can provide a
> reasonable level of protection against occasional hiccups at fairly
> low overhead. Otherwise, it wouldn't be different at all from IRQ
> watching.
WTF ? Once you switched to IRQ_EXP_VERIFIED then you run with 3sec +
slack timeout. You only switch back when a timeout in the device
driver happens. That's at least how I understand the code. Correct me
if I'm wrong, but then you just deliver another proof of complete
non-understandable horror.
> > When you detect the first timeout in the device driver, set the watch
> > on that action and let it figure out whether it goes back to
> > normal. If it goes back to normal the watch disappears. If it happens
> > again, repeat.
>
> Yeah, that's how irq_watch is supposed to be used. irq_expect is for
> cases where drivers can provide better information about when to start
> and finish expecting interrupts. For these cases, we can provide
> protection against occasional IRQ hiccups too. A few secs of hiccup
> would be an order of magnitude better and will actually make such
> failures mostly bearable.
Err. That's complete nonsense. The device driver has a timeout for the
occasional hiccup. If it happens you want to watch out for the device
to settle back to normal.
> > Though we need a reference to irqaction.watch of that device, but
> > that's not a big deal. With that we can call
> >
> > unexpect_irq(watch, false/true);
> >
> > which would be an inline function:
> >
> > static inline void unexpect_irq(watch, timeout)
> > {
> >         if (unlikely(watch->active ^ timeout))
> >                 __unexpect_irq(watch, timeout);
> > }
> >
> > Simple, right ?
>
> Well, now we're talking about different problems. If it were not for
> using it in normal cases, it would be better to just rip out expecting
> and use watching.
Oh man. You seem to be lost in some different universe. The normal
case - and that's what it is according to your code is running the
expect thing at low frequency (>= 3 seconds) and just switches back to
quick polling when the driver code detected an error. So how is that
fcking different from my approach ?
> >> To me, the more interesting question is where to put the complexity.
> >> In many cases, the most we can do is pushing complexity around, and if
> >> we can make things easier for drivers by making the core IRQ code
> >> somewhat more complex, I'll definitely go for that every time.
> >
> > No objections, but I doubt, that we need all the heuristics and
> > complex code to deal with it.
>
> Cool, I'm happy as long as you agree on that. So, here's where I'm
> coming from: There are two classes of ATA bugs which have been
> bothering me for quite some time now. Both are pretty cumbersome to
> solve from libata proper.
>
> The first is some ATAPI devices locking up because we use a different
> probing sequence than Windows, plus other media presence polling
> related issues. I think we'll need in-kernel media presence detection
> and with cmwq it shouldn't be too hard to implement.
That's an unrelated issue.
> The other, as you might have expected, is these frigging IRQ issues.
> There are several different patterns of failure. One is misrouting or
> other persistent delivery failure; irq_watch alone should be able to
> solve most of this. Another common one is a stuck IRQ line. Cases
> where this condition is sporadic and transient aren't too rare, hence
> the updates to spurious handling. The last is occasional delivery
> failures which unnecessarily lead to full-blown command timeouts.
> This is what irq_expect tries to work around.
As I said, you can cope with transient failures with the modified
watch scheme as well.
> After spending some time w/ ATA, what I've learned is that shit
> happens no matter what, the OS/driver had better be ready to handle it
> and, in a lot of cases, the driver has enough information to work
> around and survive most IRQ problems.
>
> I don't think that IRQ expect/unexpect is pushing it too far. It sure
> adds some level of complexity but I don't think it is an inherently
> complex thing - it could just be that my code sucks. Anyways, there
> are two things to discuss here, I guess: first, irq_expect itself,
> and second, the implementation details. If you can think of something
> which is simpler and achieves about the same thing, that would be
> great.
I pointed out the simple and acceptable solution already.
Thanks,
tglx
Hello, Thomas.
On 08/03/2010 12:28 AM, Thomas Gleixner wrote:
>> The current logic isn't that different except that it always considers
>> periods instead of consecutive ones and uses exponential backoff for
>> re-enable timing.
>
> There is no need for that exponential backoff. Either the thing works
> or it does not. Where is the fcking point? If you switch the thing to
> polling mode, then it does not matter at all whether you try once a
> second, once a minute or once an hour to restore the thing. It only
> matters when you insist on collecting another 10k spurious ones before
> deciding another time that the thing is hosed.
The intention was to use the same detection code for retries as for
the initial detection. But it sure can be done that way too.
>> WATCH shouldn't be used in normal paths because the polling code
>> doesn't have enough information to reduce overhead. It can only be
>> used on occasional events such as IRQ enable or a driver-detected
>> timeout, so the WATCH case overhead doesn't really matter. Its
>> primary use case would be invoking it right after requesting an IRQ as
>> USB does.
>
> That's nonsense. You add the watch in the first place to figure out
> whether that IRQ is reliably delivered or not. If it is not, then you
> add the whole magic to the irq disabled hotpath forever. That's not a
> problem for that particular IRQ, it's a problem for the overall
> system.
For the overall system? The thing is enabled/disabled per IRQ or are
you talking about something else?
>>>>> 3) irq expect
>>> I'm not opposed to have a generic solution for this, but I want a
>>> simple and understandable one. The irq_poll() works for everything
>>> magic is definitely not in this category.
>>
>> It's just a multiplexing timer. If the biggest issue is convolution in
>> irq_poll(), I can try to make it prettier for sure, but I guess that
>> discussion is for a later time.
>
> It's _NOT_. You _CANNOT_ make it prettier because your "design" is sucky
> by "multiplexing the timer". The multiplexing is the main issue and
> there is no way to make this less convoluted as hard as you might try.
Yeah, there's a certain level of complexity w/ the multiplexing
timer. It would be nice to be able to do away with that.
>> Well, it's fairly common to have tens of secs for timeouts. Even
>> going into minute range isn't too uncommon.
>
> No. You just look at it from ATA or whatever, but there are tons of
> drivers which have timeouts below 100ms. You _CANNOT_ generalize that
> assumption. And if you want this for ATA where it possibly fits, then
> you just make your whole argument of generalizing the approach moot.
For most IO devices, having long timeouts is pretty common.
irq_expect doesn't make sense for cases where the usual timeouts are
pretty low. It's for cases where the device timeout is much longer.
It probably is most useful for ATA where long timeouts and flaky IRQs
are commonplace but it's generally useful for IO initiators.
>> No, the goal is to use it for normal cases, so that we can provide a
>> reasonable level of protection against occasional hiccups at fairly
>> low overhead. Otherwise, it wouldn't be different at all from IRQ
>> watching.
>
> WTF ? Once you switched to IRQ_EXP_VERIFIED then you run with 3sec +
> slack timeout. You only switch back when a timeout in the device
> driver happens. That's at least how I understand the code. Correct
> me if I'm wrong,
No, it also switches back on when the 3sec timeout poll completes
successfully - the IRQ_IN_POLLING test in unexpect_irq().
> but then you just deliver another proof of complete
> non-understandable horror.
I understand the concern about, and partially agree with, the ugliness
of multiplexing the timer but, come on, you misunderstanding that test
can't be all my fault. There is even a comment saying /* successful
completion from IRQ? */ on top of that test.
>> Yeah, that's how irq_watch is supposed to be used. irq_expect is for
>> cases where drivers can provide better information about when to start
>> and finish expecting interrupts. For these cases, we can provide
>> protection against occasional IRQ hiccups too. A few secs of hiccup
>> would be an order of magnitude better and will actually make such
>> failures mostly bearable.
>
> Err. That's complete nonsense. The device driver has a timeout for the
> occasional hiccup. If it happens you want to watch out for the device
> to settle back to normal.
In all or most IO drivers, the timeout only kicks in after tens of
seconds, basically to clean things up and retry. irq_expect is a
generic mechanism to detect and handle IRQ delivery hiccups, which are
a specific subclass of those timeouts.
>> Well, now we're talking about different problems. If it were not for
>> using it in normal cases, it would be better to just rip out expecting
>> and use watching.
>
> Oh man. You seem to be lost in some different universe. The normal
> case - and that's what it is according to your code is running the
> expect thing at low frequency (>= 3 seconds) and just switches back to
> quick polling when the driver code detected an error. So how is that
> fcking different from my approach ?
Gees, enough with fucks already. As I wrote above, shorter interval
polling kicks in when an IRQ event is delivered through polling. Why
would there be irq_expect at all otherwise? It would be basically
identical to irq_watch.
I agree that timer multiplexing is a rather ugly thing and it would be
great to remove it. You're right that it doesn't make a whole lot of
difference whether the timer is global or local if it's low frequency,
and in fast paths expect/unexpect would be able to just test its list
entry and set/clear the currently-expecting status without messing
with the global timer or lock. Then, we can have a single low freq
timer for expect/unexpect and another for actual polling. How does
that sound to you?
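The fast-path half of this idea might reduce to something like the
following sketch (hypothetical names; the slow global timer that scans
still-expecting descriptors is not shown):

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical per-descriptor expect state: the driver fast path only
 * flips a local flag and never touches a global timer or lock. */
struct irq_expect_state {
        atomic_int expecting;   /* nonzero between expect and unexpect */
};

static inline void expect_irq(struct irq_expect_state *st)
{
        atomic_store_explicit(&st->expecting, 1, memory_order_relaxed);
}

static inline void unexpect_irq(struct irq_expect_state *st)
{
        atomic_store_explicit(&st->expecting, 0, memory_order_relaxed);
}

/* Called from the slow global timer: does this descriptor need a poll? */
static inline int irq_needs_poll(struct irq_expect_state *st)
{
        return atomic_load_explicit(&st->expecting, memory_order_relaxed);
}
```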
Thanks.
--
tejun
On Mon, 2 Aug 2010, Tejun Heo wrote:
> > And it's even less of an issue as the main users of this misfeature
> > are laptops and desktop machines, where locality is not really that
> > important. If an enterprise admin decides to ignore the fact that the
> > hardware is flaky, then he does not care about the cache line bounces
> > either.
>
> These problems do happen on intel chipset machines and is something
> which can be worked around with some effort. Eh, let's talk on the
> other reply.
So you're saying that the ATA problem is restricted to Intel chipsets?
Do we know the root cause ?
Thanks,
tglx
Hello,
On 08/03/2010 12:06 PM, Thomas Gleixner wrote:
>> These problems do happen on intel chipset machines and is something
>> which can be worked around with some effort. Eh, let's talk on the
>> other reply.
>
> So you're saying that the ATA problem is restricted to Intel chipsets?
> Do we know the root cause ?
The original sentence is missing an 'also'. Several of these have
been root-caused and worked around. For SATA, one of the notable
problems was misinterpretation of nIEN (the interrupt blocking bit on
the ATA device side) on both the device and host sides. In the
traditional IDE API, the interrupt bit is primarily under the control
of the device, so it's pretty difficult to guarantee reliable
operation from the driver side only. For PATA, the device actually
has full control of the interrupt line, so it's much worse.
Thanks.
--
tejun
Tejun,
On Tue, 3 Aug 2010, Tejun Heo wrote:
> I agree that timer multiplexing is a rather ugly thing and it would be
> great to remove it. You're right that it doesn't make a whole lot of
> difference whether the timer is global or local if it's low frequency,
> and in fast paths expect/unexpect would be able to just test its list
> entry and set/clear the currently-expecting status without messing
> with the global timer or lock. Then, we can have a single low freq
> timer for expect/unexpect and another for actual polling.
And the third one for watch, right ? That would give us separate timer
functions which each serve a particular purpose.
When you go for it, can you please simplify all the heuristics?
spurious poll:
One fixed poll interval is enough. The retry logic can be made
simple, just set it back to interrupt delivery once per minute and
limit the storm to 10.
watch:
Get rid of the interrupt context work and do all the work in the
timer. Use a fixed interval and either keep it forever or remove it.
expect:
Use a slow fixed interval and just do the expect/unexpect fast
marking. A nice thing would be to have a counter of expect calls so
we can switch off the timer when there is no activity within 10
seconds.
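The expect-call counter suggested here could be sketched like so
(hypothetical names; the slow timer snapshots the counter and disarms
itself after a few idle runs, roughly 10 seconds at a 3s+slack period):

```c
#include <assert.h>

/* Hypothetical activity-counter shutoff for the slow expect timer. */
struct exp_timer {
        unsigned long nr_expects;   /* bumped by every expect_irq() call */
        unsigned long last_seen;    /* counter snapshot from the last run */
        int idle_runs;              /* consecutive runs with no activity  */
        int armed;                  /* whether the timer stays scheduled  */
};

#define EXP_IDLE_LIMIT 3   /* ~3 runs x 3s+slack ~= 10s without activity */

static void exp_timer_fn(struct exp_timer *t)
{
        if (t->nr_expects != t->last_seen) {
                t->last_seen = t->nr_expects;
                t->idle_runs = 0;               /* still in use, keep going */
        } else if (++t->idle_runs >= EXP_IDLE_LIMIT) {
                t->armed = 0;                   /* idle: switch the timer off */
        }
}
```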
> How does that sound to you?
Way better. :)
Thanks,
tglx