2008-10-05 21:51:58

by Elias Oltmanns

Subject: ath5k: kernel timing screwed - due to unserialised register access?

Hi there,

On my system, I observe some odd symptoms which I have troubled Thomas
with before. After some more investigation, I have come to the
conclusion that ath5k is at the bottom of this. Since I don't really
understand the connection, I thought that Thomas might be able to throw
some light on the matter, even though I still think that ath5k will
have to be fixed. The behaviour I'm seeing is this: sometimes,
timers fire prematurely, i.e. a timer x started with

mod_timer(&x, jiffies + HZ/50);

fires after less than 10 or even 1 msec rather than 20 msec. Trying to
get to the bottom of this, it struck me that these glitches only occur
when ath5k is loaded and an interface is brought up (ifconfig wlan0 up
is quite sufficient). Some more digging revealed that the occurrences of
such ``fast forward events'' coincided with the expiry of the
recalibration timer started for the interface. The same behaviour can be
observed on kernels 2.6.25.16, 2.6.26.5, 2.6.27-rc8-git8 and
next-20080919.
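
For reference, mod_timer() takes an absolute expiry time in jiffies, so a 20 msec timeout is armed relative to the current jiffies value. The userspace sketch below only illustrates that arithmetic. The helper names (timeout_ticks, ticks_to_ms, expiry) are made up for this illustration and are not kernel API, and the tick rates used are example values rather than the configuration of the affected system.

```c
#include <assert.h>

/* Relative timeout in ticks: hz / 50 is one fiftieth of a second. */
static unsigned long timeout_ticks(unsigned long hz)
{
    return hz / 50;
}

/* Convert a tick count back to milliseconds for a given tick rate. */
static unsigned long ticks_to_ms(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return ticks * 1000UL / hz;
}

/* Absolute expiry as a timer API expects it: current tick count plus
 * the relative timeout. "now" stands in for the kernel's jiffies. */
static unsigned long expiry(unsigned long now, unsigned long hz)
{
    return now + timeout_ticks(hz);
}
```

With a tick rate of 250, timeout_ticks() yields 5 ticks, which ticks_to_ms() maps back to 20 msec. A timer that fires after 1 msec has therefore expired several ticks too early, not merely one tick off.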

Looking through the code, I tried to find an obvious suspect, but
nothing struck me as out of the ordinary, except for one thing: There
doesn't seem to be anything in place that ensures serialisation of calls
to the functions involved in calibration or, indeed, accesses to the
card's registers. There definitely are parts of the calibration sequence
that don't just get called from the calibration timer callback, so I
think something has to be done about that.

My first question is, can simultaneous unserialised accesses to
registers possibly disturb softirqs in the way I see it happen, or do I
have to look for something else?

What about the locking issue? In fact, I wonder whether this really is
the only place in the driver where we face the problem of potential
concurrent access to the same registers, including bit manipulations
that require a read-modify-write sequence.
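
To make the concern concrete, here is a minimal userspace sketch of a serialised read-modify-write on a simulated register. It is not ath5k code: struct fake_dev and the reg_* helpers are hypothetical, and a pthread mutex stands in for whatever lock a driver would actually use (in the kernel that would more likely be a spinlock).

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Simulated device: one 32-bit register plus the lock that guards it. */
struct fake_dev {
    pthread_mutex_t lock;
    uint32_t reg;
};

static void fake_dev_init(struct fake_dev *dev)
{
    pthread_mutex_init(&dev->lock, NULL);
    dev->reg = 0;
}

/* Set bits: the read, the modification and the write-back all happen
 * under the lock, so a concurrent caller cannot slip in between them. */
static void reg_set_bits(struct fake_dev *dev, uint32_t bits)
{
    pthread_mutex_lock(&dev->lock);
    uint32_t val = dev->reg;    /* read */
    val |= bits;                /* modify */
    dev->reg = val;             /* write */
    pthread_mutex_unlock(&dev->lock);
}

/* Clear bits, following the same locked read-modify-write pattern. */
static void reg_clear_bits(struct fake_dev *dev, uint32_t bits)
{
    pthread_mutex_lock(&dev->lock);
    uint32_t val = dev->reg;
    val &= ~bits;
    dev->reg = val;
    pthread_mutex_unlock(&dev->lock);
}
```

Without the lock, two contexts that each read the old value and write back their own modification can silently undo each other's change. That is the kind of lost update the question above is about.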

Regards,

Elias


2008-10-11 09:55:52

by Thomas Gleixner

Subject: Re: ath5k: kernel timing screwed - due to unserialised register access?

On Fri, 10 Oct 2008, Christoph Hellwig wrote:

> On Fri, Oct 10, 2008 at 02:59:28PM +0200, Elias Oltmanns wrote:
> > That was my first thought when I discovered this. However, from what I
> > read on the web, I somehow got the impression that [um]delay() was
> > alright as opposed to msleep(). What exactly is the difference then?
>
> Yes, only msleep() sleeps, mdelay spins.

Oops, right.
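
The distinction can be sketched in userspace as follows. This is only an analogy under stated assumptions: spin_ms() and sleep_ms() are hypothetical helpers, with a busy loop standing in for mdelay() and nanosleep() standing in for msleep(). The kernel functions themselves are not used here.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <time.h>

/* Monotonic time in nanoseconds. */
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Analogue of mdelay(): burn CPU cycles until the deadline has passed.
 * The caller never gives up the CPU voluntarily. */
static void spin_ms(unsigned int ms)
{
    long long deadline = now_ns() + (long long)ms * 1000000LL;
    while (now_ns() < deadline)
        ; /* busy-wait */
}

/* Analogue of msleep(): hand the CPU back to the scheduler and get
 * woken up later. Retries with the remaining time if interrupted. */
static void sleep_ms(unsigned int ms)
{
    struct timespec req = {
        .tv_sec = ms / 1000,
        .tv_nsec = (long)(ms % 1000) * 1000000L,
    };
    while (nanosleep(&req, &req) == -1 && errno == EINTR)
        ; /* interrupted by a signal, sleep for the remainder */
}
```

Both helpers wait at least the requested time. The difference is that the sleeping variant requires a context that is allowed to schedule, which an atomic context such as a timer callback is not.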

2008-10-06 19:47:20

by Thomas Gleixner

Subject: Re: ath5k: kernel timing screwed - due to unserialised register access?

On Mon, 6 Oct 2008, Elias Oltmanns wrote:
> Make sure that event1 is the right device. chktimer usually reports
> several premature timer expiries in less than a minute.

Which is not surprising. You measure the delta of the reads. Your
process can be scheduled away and when it comes back two events can be
available, so they are read right after each other and trigger your
check.

Your measuring method is wrong. You really want to measure the delta
of the timer events in the kernel via ktime_get(), not the delta of
something else in userspace.
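
The effect described above can be sketched with made-up numbers. In the illustration below, struct sample, min_delta() and the demo values are all hypothetical. fired_ns represents a timestamp taken when the timer fires (as ktime_get() would provide inside the kernel) and read_ns the moment a userspace reader happens to pick the event up.

```c
#include <assert.h>
#include <stddef.h>

struct sample {
    long long fired_ns; /* when the timer event was generated */
    long long read_ns;  /* when the userspace reader saw it */
};

/* Smallest gap between consecutive samples, using either the read
 * timestamps or the fired timestamps. Returns -1 for fewer than two. */
static long long min_delta(const struct sample *s, size_t n, int use_read)
{
    long long min = -1;
    for (size_t i = 1; i < n; i++) {
        long long d = use_read ? s[i].read_ns - s[i - 1].read_ns
                               : s[i].fired_ns - s[i - 1].fired_ns;
        if (min < 0 || d < min)
            min = d;
    }
    return min;
}

/* Invented data: the timer fires every 20 msec, but the reader is
 * scheduled away and picks up the last two events back to back. */
static const struct sample demo[] = {
    {        0LL,   100000LL },
    { 20000000LL, 20100000LL },
    { 40000000LL, 60200000LL },
    { 60000000LL, 60250000LL },
};
```

On this data, the smallest gap between read timestamps is 50 usec, although every timer interval was exactly 20 msec. A check based on read times would flag a premature expiry that never happened.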

Thanks,

tglx


2008-10-05 21:59:15

by Thomas Gleixner

Subject: Re: ath5k: kernel timing screwed - due to unserialised register access?

On Sun, 5 Oct 2008, Elias Oltmanns wrote:
> Hi there,
>
> On my system, I observe some odd symptoms which I have troubled Thomas
> with before. After some more investigation, I have come to the
> conclusion that ath5k is at the bottom of this. Since I don't really
> understand the connection, I thought that Thomas might be able to throw
> some light on the matter, even though I still think that ath5k will
> have to be fixed. The behaviour I'm seeing is this: sometimes,
> timers fire prematurely, i.e. a timer x started with
>
> mod_timer(&x, jiffies + HZ/50);
>
> fires after less than 10 or even 1 msec rather than 20 msec. Trying to
> get to the bottom of this, it struck me that these glitches only occur
> when ath5k is loaded and an interface is brought up (ifconfig wlan0 up
> is quite sufficient). Some more digging revealed that the occurrences of
> such ``fast forward events'' coincided with the expiry of the
> recalibration timer started for the interface. The same behaviour can be
> observed on kernels 2.6.25.16, 2.6.26.5, 2.6.27-rc8-git8 and
> next-20080919.

We had an intermittent problem with jiffies-based timers between
2.6.27-rc6 and -rc8-latest. This is fixed in current mainline. It only
happened when CONFIG_NO_HZ=n and CONFIG_HIGH_RES_TIMERS=n, or when both
options were disabled on the kernel command line.

I have no idea why this should happen on any other kernel versions.

Can you please point me to the code in question (file, line number)?

Thanks,

tglx