2007-11-07 17:48:54

by Marin Mitov

[permalink] [raw]
Subject: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

Hi all,

I have written a linux device driver for a frame grabber I use in my
every day experimental work.

In my device driver I have to write to a MMIO register, wait for a while
(using udelay(65)) for data being written to an internal register (i2c?)
and test a flag (in another MMIO register) if the operation has completed.
(The hardware guarantees that the operation has completed in less than
65 usec). If the flag is not reset I write a message via printk.
After switching to the kernel-2.6.23 (compiled as PREEMPTIBLE SMP i686)
(AMD dual core) I see this message in dmesg output sometime.

Testing with rdtscll() before and after udelay(65) shows the expected
delay of 65 usec (after dividing by CPU frequency) when all is OK, but
gives a big value (in the tenths msec range) when the error message
shows itself in dmesg.

Bracketing udelay(65) by:

local_irq_disable();
udelay(65);
local_irq_enable();

as well as by

preempt_disable();
udelay(65);
preempt_enable();

leads to message disappearing.

I believe the hardware is working correctly, so if the flag is not reset
I think udelay(65) returns prematurely (the flag clears some time latter)
And it does not matter if I use udelay(65) or udelay(100).

What could be the reason for such a behavior?
Is this a bug in udelay() due to preemption?
(udelay() being preempted and migrated to another processor)

All my previous kernels used were SMP (but not PREEMPTIBLE)

My kernel is compiled with:
CONFIG_PREEMPT=y
CONFIG_IRQBALANCE=y
CONFIG_HPET_TIMER=y

And I have this line in dmesg:
Time: acpi_pm clocksource has been installed.
Switched to high resolution mode on CPU 0
Switched to high resolution mode on CPU 1

The south bridge is: VIA VT8237 (Asus A8V Delux)

Thank you in advance for your help in understanding where
the problem is coming from.

Best regards.

Marin Mitov


2007-11-07 20:31:42

by Andrew Morton

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

> On Wed, 7 Nov 2007 19:21:52 +0200 Marin Mitov <[email protected]> wrote:
> Hi all,
>
> I have written a linux device driver for a frame grabber I use in my
> every day experimental work.
>
> In my device driver I have to write to a MMIO register, wait for a while
> (using udelay(65)) for data being written to an internal register (i2c?)
> and test a flag (in another MMIO register) if the operation has completed.
> (The hardware guarantees that the operation has completed in less than
> 65 usec). If the flag is not reset I write a message via printk.
> After switching to the kernel-2.6.23 (compiled as PREEMPTIBLE SMP i686)
> (AMD dual core) I see this message in dmesg output sometime.
>
> Testing with rdtscll() before and after udelay(65) shows the expected
> delay of 65 usec (after dividing by CPU frequency) when all is OK, but
> gives a big value (in the tenths msec range) when the error message
> shows itself in dmesg.
>
> Bracketing udelay(65) by:
>
> local_irq_disable();
> udelay(65);
> local_irq_enable();
>
> as well as by
>
> preempt_disable();
> udelay(65);
> preempt_enable();
>
> leads to message disappearing.
>
> I believe the hardware is working correctly, so if the flag is not reset
> I think udelay(65) returns prematurely (the flag clears some time latter)
> And it does not matter if I use udelay(65) or udelay(100).
>
> What could be the reason for such a behavior?
> Is this a bug in udelay() due to preemption?
> (udelay() being preempted and migrated to another processor)
>
> All my previous kernels used were SMP (but not PREEMPTIBLE)
>
> My kernel is compiled with:
> CONFIG_PREEMPT=y
> CONFIG_IRQBALANCE=y
> CONFIG_HPET_TIMER=y
>
> And I have this line in dmesg:
> Time: acpi_pm clocksource has been installed.
> Switched to high resolution mode on CPU 0
> Switched to high resolution mode on CPU 1
>
> The south bridge is: VIA VT8237 (Asus A8V Delux)
>
> Thank you in advance for your help in understanding where
> the problem is coming from.
>

Ow. Yes, from my reading delay_tsc() can return early (or after
heat-death-of-the-universe) if the TSCs are offset and if preemption
migrates the calling task between CPUs.

I suppose a lameo fix would be to disable preemption in delay_tsc().

2007-11-07 23:17:38

by Andi Kleen

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?


> Ow. Yes, from my reading delay_tsc() can return early (or after
> heat-death-of-the-universe) if the TSCs are offset and if preemption
> migrates the calling task between CPUs.
>
> I suppose a lameo fix would be to disable preemption in delay_tsc().

Yes. Can't think of any better reasonable fix.

-Andi

2007-11-08 00:22:12

by Matt Mackall

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Wed, Nov 07, 2007 at 12:30:45PM -0800, Andrew Morton wrote:
> Ow. Yes, from my reading delay_tsc() can return early (or after
> heat-death-of-the-universe) if the TSCs are offset and if preemption
> migrates the calling task between CPUs.
>
> I suppose a lameo fix would be to disable preemption in delay_tsc().

preempt_disable is lousy documentation here. This and other cases
(lots of per_cpu users, IIRC) actually want a migrate_disable() which
is a proper subset. We can simply implement migrate_disable() as
preempt_disable() for now and come back later and implement a proper
migrate_disable() that still allows preemption (and thus avoids the
latency).

--
Mathematics is the supreme nostalgia of our time.

2007-11-08 00:34:47

by Andi Kleen

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Thursday 08 November 2007 01:20, Matt Mackall wrote:
> On Wed, Nov 07, 2007 at 12:30:45PM -0800, Andrew Morton wrote:
> > Ow. Yes, from my reading delay_tsc() can return early (or after
> > heat-death-of-the-universe) if the TSCs are offset and if preemption
> > migrates the calling task between CPUs.
> >
> > I suppose a lameo fix would be to disable preemption in delay_tsc().
>
> preempt_disable is lousy documentation here. This and other cases
> (lots of per_cpu users, IIRC) actually want a migrate_disable() which
> is a proper subset. We can simply implement migrate_disable() as
> preempt_disable() for now and come back later and implement a proper
> migrate_disable() that still allows preemption (and thus avoids the
> latency).

We could actually do this right now. migrate_disable() can be just changing
the cpu affinity of the current thread to current cpu and then restoring it
afterwards. That should even work from interrupt context.

get_cpu() etc. could be changed to use this then too.

-Andi

2007-11-08 01:06:31

by Matt Mackall

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Thu, Nov 08, 2007 at 01:31:00AM +0100, Andi Kleen wrote:
> On Thursday 08 November 2007 01:20, Matt Mackall wrote:
> > On Wed, Nov 07, 2007 at 12:30:45PM -0800, Andrew Morton wrote:
> > > Ow. Yes, from my reading delay_tsc() can return early (or after
> > > heat-death-of-the-universe) if the TSCs are offset and if preemption
> > > migrates the calling task between CPUs.
> > >
> > > I suppose a lameo fix would be to disable preemption in delay_tsc().
> >
> > preempt_disable is lousy documentation here. This and other cases
> > (lots of per_cpu users, IIRC) actually want a migrate_disable() which
> > is a proper subset. We can simply implement migrate_disable() as
> > preempt_disable() for now and come back later and implement a proper
> > migrate_disable() that still allows preemption (and thus avoids the
> > latency).
>
> We could actually do this right now. migrate_disable() can be just changing
> the cpu affinity of the current thread to current cpu and then restoring it
> afterwards. That should even work from interrupt context.

Yes, that's one way. But we need somewhere to stash the old flags.
Expanding the task struct sucks. Jamming another bit in the preempt
count sucks.

But I think we'd be best off stashing a single bit somewhere and
checking it at migrate time (relatively infrequent) rather than
copying and zeroing out a potentially enormous affinity mask every
time we disable migration (often, and in fast paths). Perhaps adding
TASK_PINNED to the task state flags would do it?

> get_cpu() etc. could be changed to use this then too.

Some users of get_cpu might be relying on it to avoid actual
preemption. In other words, we should have introduced a
migrate_disable() when we first discovered the preempt/per_cpu
conflict.

--
Mathematics is the supreme nostalgia of our time.

2007-11-08 01:25:26

by Andi Kleen

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?


> But I think we'd be best off stashing a single bit somewhere and
> checking it at migrate time (relatively infrequent) rather than
> copying and zeroing out a potentially enormous affinity mask every
> time we disable migration (often, and in fast paths). Perhaps adding
> TASK_PINNED to the task state flags would do it?

It would need to be a count to be able to nest it.

> > get_cpu() etc. could be changed to use this then too.
>
> Some users of get_cpu might be relying on it to avoid actual
> preemption. In other words, we should have introduced a
> migrate_disable() when we first discovered the preempt/per_cpu
> conflict.

Ok perhaps it would make sense to migrate it step by step :-
define a replacement for get_cpu and migrate over as users are getting
audited and eventually deprecate old one.

-Andi

2007-11-08 02:47:55

by Matt Mackall

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Thu, Nov 08, 2007 at 02:20:58AM +0100, Andi Kleen wrote:
>
> > But I think we'd be best off stashing a single bit somewhere and
> > checking it at migrate time (relatively infrequent) rather than
> > copying and zeroing out a potentially enormous affinity mask every
> > time we disable migration (often, and in fast paths). Perhaps adding
> > TASK_PINNED to the task state flags would do it?
>
> It would need to be a count to be able to nest it.

Ahh, right. Suppose that means fattening the task struct until someone
comes up with something more clever.

--
Mathematics is the supreme nostalgia of our time.

2007-11-08 09:11:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Wed, 2007-11-07 at 18:20 -0600, Matt Mackall wrote:

> This and other cases
> (lots of per_cpu users, IIRC) actually want a migrate_disable() which
> is a proper subset.

The disadvantage of migrate_disable() is that it complicates the
load-balancer but more importantly, that it does bring a form of
latencies with it that are hard to measure. Using preempt_disable() for
these current per-cpu users basically forces them to keep it short.

Which is a GOOD (tm) thing.

If we go overboard with this migrate_disable() stuff we can end up with
a very hard to analyse system that sporadically does weird stuff.

So, please, don't start that again.

Also see:
http://lkml.org/lkml/2007/7/23/338

2007-11-08 11:24:14

by Marin Mitov

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

Hi all,

Thanks to all of you answering to my post.

On 7.11.2007, Andrew Morton wrote:
> Ow. Yes, from my reading delay_tsc() can return early (or after
> heat-death-of-the-universe) if the TSCs are offset and if preemption
> migrates the calling task between CPUs.
>
> I suppose a lameo fix would be to disable preemption in delay_tsc().

I have seen the problem in delay_tsc(), but I was pusled by these lines
in my dmesg:

> Time: acpi_pm clocksource has been installed.
> Switched to high resolution mode on CPU 0
> Switched to high resolution mode on CPU 1

I thought (sort of) acpi_pm (but not tsc) is used in udelay().

The same delay_tsc() is used in both arches: i386 & x86_64
(and as I see from the proposed patches in -mm, also for
the new x86 arch). Should the patch be applied for all of them?

Quite similar function ia64_itc_udelay() is used in IA64, but one
find a coment before it:
/*
* Generic udelay assumes that if preemption is allowed and the thread
* migrates to another CPU, that the ITC values are synchronized across
* all CPUs.
*/

Are they really synchronized or similar patch: preempt_disable/enable()
should be applied?

What about all other arches?

Thanks.

Marin Mitov

2007-11-08 11:47:57

by Avi Kivity

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

Andi Kleen wrote:
> On Thursday 08 November 2007 01:20, Matt Mackall wrote:
>
>> On Wed, Nov 07, 2007 at 12:30:45PM -0800, Andrew Morton wrote:
>>
>>> Ow. Yes, from my reading delay_tsc() can return early (or after
>>> heat-death-of-the-universe) if the TSCs are offset and if preemption
>>> migrates the calling task between CPUs.
>>>
>>> I suppose a lameo fix would be to disable preemption in delay_tsc().
>>>
>> preempt_disable is lousy documentation here. This and other cases
>> (lots of per_cpu users, IIRC) actually want a migrate_disable() which
>> is a proper subset. We can simply implement migrate_disable() as
>> preempt_disable() for now and come back later and implement a proper
>> migrate_disable() that still allows preemption (and thus avoids the
>> latency).
>>
>
> We could actually do this right now. migrate_disable() can be just changing
> the cpu affinity of the current thread to current cpu and then restoring it
> afterwards. That should even work from interrupt context.
>
> get_cpu() etc. could be changed to use this then too.
>
>

What if some other thread calls sched_setaffinity() on the
migrate_disable()d cpu? we'd need to detect this to avoid
migrate_enable() stomping on sched_setaffinity()'s work.


--
error compiling committee.c: too many arguments to function

2007-11-08 15:11:46

by Matt Mackall

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Thu, Nov 08, 2007 at 01:46:48PM +0200, Avi Kivity wrote:
> Andi Kleen wrote:
> >On Thursday 08 November 2007 01:20, Matt Mackall wrote:
> >
> >>On Wed, Nov 07, 2007 at 12:30:45PM -0800, Andrew Morton wrote:
> >>
> >>>Ow. Yes, from my reading delay_tsc() can return early (or after
> >>>heat-death-of-the-universe) if the TSCs are offset and if preemption
> >>>migrates the calling task between CPUs.
> >>>
> >>>I suppose a lameo fix would be to disable preemption in delay_tsc().
> >>>
> >>preempt_disable is lousy documentation here. This and other cases
> >>(lots of per_cpu users, IIRC) actually want a migrate_disable() which
> >>is a proper subset. We can simply implement migrate_disable() as
> >>preempt_disable() for now and come back later and implement a proper
> >>migrate_disable() that still allows preemption (and thus avoids the
> >>latency).
> >>
> >
> >We could actually do this right now. migrate_disable() can be just changing
> >the cpu affinity of the current thread to current cpu and then restoring
> >it afterwards. That should even work from interrupt context.
> >
> >get_cpu() etc. could be changed to use this then too.
> >
> >
>
> What if some other thread calls sched_setaffinity() on the
> migrate_disable()d cpu? we'd need to detect this to avoid
> migrate_enable() stomping on sched_setaffinity()'s work.

Yep, there are a bunch of problems with actually touching the affinity
mask.

--
Mathematics is the supreme nostalgia of our time.

2007-11-08 15:45:53

by Matt Mackall

[permalink] [raw]
Subject: Re: is minimum udelay() not respected in preemptible SMP kernel-2.6.23?

On Thu, Nov 08, 2007 at 10:11:21AM +0100, Peter Zijlstra wrote:
> On Wed, 2007-11-07 at 18:20 -0600, Matt Mackall wrote:
>
> > This and other cases
> > (lots of per_cpu users, IIRC) actually want a migrate_disable() which
> > is a proper subset.
>
> The disadvantage of migrate_disable() is that it complicates the
> load-balancer

Hmm, I'm surprised it's more than a one-liner. Something like

if (cannot_migrate(task))
continue;

But I'm certainly not the expert here. Perhaps this one-liner introduces the
"unplannable O(N) overhead" Ingo mentions in the email you referenced.

> but more importantly, that it does bring a form of latencies with it
> that are hard to measure. Using preempt_disable() for these current
> per-cpu users basically forces them to keep it short.

Ok, so maybe we've got implementation issues to tackle. But still: a)
preempt_disable is inherently misdocumentation (preemption is not the
real problem) and b) preemption is a bigger hammer than needed. At the
very least, we should introduce migrate_disable as an alias for
preempt_disable so we can document intent.

At any rate, in the current context, we're talking about actually
disabling preempt in a *delay loop*. Let's find a fix for that.

> Also see:
> http://lkml.org/lkml/2007/7/23/338

Heh, I'd completely forgotten about this thread and thought I was
having an original idea! Must have never seen Ingo's followup.

Ingo's complaint about it being "a per-task BKL" is interesting. I've
occassionally wondered if it makes sense to make some of these
primitives take a "documentation parameter" - a pointer to an object
that does nothing more than indicate the object of interest.

--
Mathematics is the supreme nostalgia of our time.