2007-08-13 18:19:18

by Michal Piotrowski

[permalink] [raw]
Subject: Re: [4/4] 2.6.23-rc3: known regressions

Hi all,

Here is a list of some known regressions in 2.6.23-rc3.

Feel free to add new regressions/remove fixed etc.
http://kernelnewbies.org/known_regressions

List of Aces

Name Regressions fixed since 21-Jun-2007
Adrian Bunk 9
Andi Kleen 5
Linus Torvalds 5
Andrew Morton 4
Al Viro 3
Cornelia Huck 3
Jens Axboe 3
Tejun Heo 3



FS

Subject : [NFSD OOPS] 2.6.23-rc1-git10
References : http://lkml.org/lkml/2007/8/2/462
Last known good : ?
Submitter : Andrew Clayton <[email protected]>
Caused-By : ?
Handled-By : ?
Status : unknown



USB

Subject : EHCI Regression in 2.6.23-rc2
References : http://lkml.org/lkml/2007/8/10/81
Last known good : ?
Submitter : Daniel Exner <[email protected]>
Caused-By : [email protected] <[email protected]>
commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
Handled-By : ?
Status : unknown

Subject : 2.6.23-rc1: USB hard disk broken
References : http://lkml.org/lkml/2007/7/25/62
Last known good : ?
Submitter : Tino Keitel <[email protected]>
Caused-By : ?
Handled-By : Oliver Neukum <[email protected]>
Status : unknown



Virtualization

Subject : WARNING: at arch/x86_64/kernel/smp.c:379 smp_call_function_single()
References : http://lkml.org/lkml/2007/8/10/117
Last known good : ?
Submitter : Paolo Ornati <[email protected]>
Caused-By : Avi Kivity <[email protected]>
commit cec9ad279b66793bee0b5009b7ca311060061efd
Handled-By : Avi Kivity <[email protected]>
Status : problem is being debugged



Alpha

Subject : -Werror compilation problem - make[1]: *** [arch/alpha/kernel/sys_titan.o] Error 1
References : http://lkml.org/lkml/2007/8/6/137
Last known good : ?
Submitter : Oliver Falk <[email protected]>
Caused-By : ?
Handled-By : ?
Status : unknown



Regards,
Michal

--
LOG
http://www.stardust.webpages.pl/log/


2007-08-21 01:42:06

by David Brownell

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

On Monday 13 August 2007, Michal Piotrowski wrote:
> Subject ? ? ? ? : EHCI Regression in 2.6.23-rc2
> References ? ? ?: http://lkml.org/lkml/2007/8/10/81
> Last known good : ?
> Submitter ? ? ? : Daniel Exner <[email protected]>
> Caused-By ? ? ? : [email protected] <[email protected]>
> ? ? ? ? ? ? ? ? ? commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
> Handled-By ? ? ?: ?
> Status ? ? ? ? ?: unknown

Fixed I believe by Stuart's patch:

http://marc.info/?l=linux-usb-devel&m=118765934722610&w=2

2007-08-21 02:03:06

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, David Brownell wrote:

> On Monday 13 August 2007, Michal Piotrowski wrote:
> > Subject         : EHCI Regression in 2.6.23-rc2
> > References      : http://lkml.org/lkml/2007/8/10/81
> > Last known good : ?
> > Submitter       : Daniel Exner <[email protected]>
> > Caused-By       : [email protected] <[email protected]>
> >                   commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
> > Handled-By      : ?
> > Status          : unknown
>
> Fixed I believe by Stuart's patch:
>
> http://marc.info/?l=linux-usb-devel&m=118765934722610&w=2

Quite frankly, I'd personally prefer to just revert commit
196705c9bbc03540429b0f7cf9ee35c2f928a534 entirely instead.

The whole dependency on cpufreq seems totally bogus. Would it not be a lot
more natural to handle the *result* of the problem (ie the MMF errors by
broken EHCI controllers?) rather than add totally insane workarounds for
this case to try to hide them in the first place?

There can be *other* delays in reading memory that have nothing to do with
cpu frequency shifting, and everything to do with exteme situations on the
bus. If the stupid EHCI controller has some tight latency issues, that's a
generic problem.

That commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 just exemplifies what
is wrong with USB, but it does so by adding incredibly ugly code. I'd
rather not add even *more* ugly code - especially not for a case where we
then seem to blame the wrong party (ie a VIA controller that didn't need
the ugly code in the first place).

Serverworks/Broadcom makes totally crap chips (not just in USB) and then
doesn't even document their buggy crap hardware. But that is NOT a reason
for then making the kernel have buggy crap software in it.

So really - is there any reason why we just don't say "Broadcom chips
suck, and get MMF errors under normal circumstances because they are
crap". And from *that*, the obvious solution would seem to not be to
penalize everybody else, but to just say that "We will try to recover from
MMF errors gracefully by retrying the transaction". Hmm?

Linus

2007-08-21 04:03:32

by David Brownell

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

On Monday 20 August 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, David Brownell wrote:
>
> > On Monday 13 August 2007, Michal Piotrowski wrote:
> > > Subject         : EHCI Regression in 2.6.23-rc2
> > > References      : http://lkml.org/lkml/2007/8/10/81
> > > Last known good : ?
> > > Submitter       : Daniel Exner <[email protected]>
> > > Caused-By       : [email protected] <[email protected]>
> > >                   commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
> > > Handled-By      : ?
> > > Status          : unknown
> >
> > Fixed I believe by Stuart's patch:
> >
> > http://marc.info/?l=linux-usb-devel&m=118765934722610&w=2
>
> Quite frankly, I'd personally prefer to just revert commit
> 196705c9bbc03540429b0f7cf9ee35c2f928a534 entirely instead.
>
> The whole dependency on cpufreq seems totally bogus. Would it not be a lot
> more natural to handle the *result* of the problem (ie the MMF errors by
> broken EHCI controllers?) rather than add totally insane workarounds for
> this case to try to hide them in the first place?

MMF basically means the "Transaction Translating" (TT) hub had data
for the host, but the host didn't collect it in time ... so that some
data was lost.

Unfortunately, that's the type of fault that's especially hard to
recover from. Plus, very few of the upper layer drivers have even
a minor clue about fault recovery strategies. And I don't trust
the current hcd/usbcore code that tries to clean up after MMF.

On the plus side, MMF errors have been vanishingly rare until this
cpufreq interaction came up ... which of course implies the downside
that those "handle the result" code paths are all but untested.


> There can be *other* delays in reading memory that have nothing to do with
> cpu frequency shifting, and everything to do with exteme situations on the
> bus. If the stupid EHCI controller has some tight latency issues, that's a
> generic problem.

There could be such problems, yes. But in practice, I don't know
that we've ever seen them. (There's a first time for everthing,
yes. I *just* fetched a webpage where an image got overwritten
about

> That commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 just exemplifies what
> is wrong with USB, but it does so by adding incredibly ugly code. I'd
> rather not add even *more* ugly code - especially not for a case where we
> then seem to blame the wrong party (ie a VIA controller that didn't need
> the ugly code in the first place).
>
> Serverworks/Broadcom makes totally crap chips (not just in USB) and then
> doesn't even document their buggy crap hardware. But that is NOT a reason
> for then making the kernel have buggy crap software in it.
>
> So really - is there any reason why we just don't say "Broadcom chips
> suck, and get MMF errors under normal circumstances because they are
> crap". And from *that*, the obvious solution would seem to not be to
> penalize everybody else, but to just say that "We will try to recover from
> MMF errors gracefully by retrying the transaction". Hmm?
>
> Linus
>


2007-08-21 04:16:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, David Brownell wrote:
>
> MMF basically means the "Transaction Translating" (TT) hub had data
> for the host, but the host didn't collect it in time ... so that some
> data was lost.
>
> Unfortunately, that's the type of fault that's especially hard to
> recover from.

Fair enough. However, it still seems particularly idiotic to
- penalize everybody
- mix up two totally unrelated areas (cpufreq and USB) for a bug that is
extremely rare and could be handled differently.

For example, if it really ends up being practically impossible to recover
from split transaction errors, I would still suggest reverting that horrid
commit, and then just black-listing the known-broken EHCI controllers and
simply not *do* any split transactions on them. That way there's no
complexity.

As far as I know, split transactions aren't required anyway, they are just
a performance optimization.

Basically, we not only know that the commit has caused problems, it's
fundamentally ugly, fragile, and not very maintainable, and the whole
reason for doing it is pretty dubious.

Why not just admit that certain hardware is broken (and the vendor isn't
worth even bothering to be polite with, since they try to screw us every
chance they get) and cannot reliably do split transactions? Problem
solved, no real downside, and nobody will even *notice*.

Linus

2007-08-21 04:27:37

by David Brownell

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

[ GRR sorry for the premature "SEND" ... mouspads-r-evil ]

On Monday 20 August 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, David Brownell wrote:
>
> > On Monday 13 August 2007, Michal Piotrowski wrote:
> > > Subject         : EHCI Regression in 2.6.23-rc2
> > > References      : http://lkml.org/lkml/2007/8/10/81
> > > Last known good : ?
> > > Submitter       : Daniel Exner <[email protected]>
> > > Caused-By       : [email protected] <[email protected]>
> > >                   commit 196705c9bbc03540429b0f7cf9ee35c2f928a534
> > > Handled-By      : ?
> > > Status          : unknown
> >
> > Fixed I believe by Stuart's patch:
> >
> > http://marc.info/?l=linux-usb-devel&m=118765934722610&w=2
>
> Quite frankly, I'd personally prefer to just revert commit
> 196705c9bbc03540429b0f7cf9ee35c2f928a534 entirely instead.
>
> The whole dependency on cpufreq seems totally bogus. Would it not be a lot
> more natural to handle the *result* of the problem (ie the MMF errors by
> broken EHCI controllers?) rather than add totally insane workarounds for
> this case to try to hide them in the first place?

MMF basically means the "Transaction Translating" (TT) Hub had data
for the host, but the host didn't collect it in time ... so that some
data was lost. In this context, it's only for periodic transfers,
meaning interrupt (for HID keyboards, mice, etc) or isochronous (mostly
audio or video, but sometimes ATM).

Unfortunately, that's the type of fault that's especially hard to
recover from. Plus, very few of the upper layer drivers have even
a minor clue about fault recovery strategies during I/O ... it's not
supposed to happen for interrupt transfers, one hopes the drivers
will at least die gracefully. With ISO, faults are expected (since
it's "best effort" delivery, time-priority).

On the plus side, MMF errors have been vanishingly rare until this
cpufreq interaction came up ... which of course implies the downside
that those "handle the result" code paths are all but untested.


> There can be *other* delays in reading memory that have nothing to do with
> cpu frequency shifting, and everything to do with exteme situations on the
> bus. If the stupid EHCI controller has some tight latency issues, that's a
> generic problem.

There could be such problems, yes. But in practice, I don't know
that we've ever seen them.

(There's a first time for everthing, yes. I *just* fetched a webpage
where an image got overwritten about half way through fetching it.
Top half was today's, bottom half was tomorrow's, update 12 midnight EST.
Strangest looking JPG ever! ;)


> That commit 196705c9bbc03540429b0f7cf9ee35c2f928a534 just exemplifies what
> is wrong with USB, but it does so by adding incredibly ugly code. I'd
> rather not add even *more* ugly code - especially not for a case where we
> then seem to blame the wrong party (ie a VIA controller that didn't need
> the ugly code in the first place).
>
> Serverworks/Broadcom makes totally crap chips (not just in USB) and then
> doesn't even document their buggy crap hardware.

That's pretty much how I feel about VIA's USB stuff:
buggy crap that I actively steer people away from.

And that's why it doesn't seem odd to me to add even
more workarounds for VIA-only bugs.


> But that is NOT a reason
> for then making the kernel have buggy crap software in it.

I don't think we always have the option not to cope with
broken hardware. We may have some options about *how* we
cope with it though ...


> So really - is there any reason why we just don't say "Broadcom chips
> suck, and get MMF errors under normal circumstances because they are
> crap". And from *that*, the obvious solution would seem to not be to
> penalize everybody else, but to just say that "We will try to recover from
> MMF errors gracefully by retrying the transaction". Hmm?

Well, see above about why retrying wouldn't work well. Data lost, and
not recoverable ... although if the events are only USB keyboards/mice,
then the user might be able to recover. (Stuart?)

Alternatively, if Broadcom then just veto cpufreq changes whenever
there are USB interrupt transfers active. (We *can* veto changes
in notifiers, yes?)

- Dave



>
> Linus
>


2007-08-21 04:48:56

by David Brownell

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

On Monday 20 August 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, David Brownell wrote:
> >
> > MMF basically means the "Transaction Translating" (TT) hub had data
> > for the host, but the host didn't collect it in time ... so that some
> > data was lost.
> >
> > Unfortunately, that's the type of fault that's especially hard to
> > recover from.
>
> Fair enough. However, it still seems particularly idiotic to
> - penalize everybody
> - mix up two totally unrelated areas (cpufreq and USB) for a bug that is
> extremely rare and could be handled differently.

Yes on #1, but on #2 ... frequency transitions are a common place
for systems to want to hiccup. Maybe less so on PCs, but it's
hard to say that re-clocking an I/O or memory bus shouldn't affect
the peripherals using it for "realtime" (deadlines) I/O !!


My more complete response suggested maybe just vetoing cpufreq
transitions if the Broadcom chipset (or maybe it's just specific
boards?) finds itself in the awkward configuration ... penalizing
only the people we know could have trouble.


> For example, if it really ends up being practically impossible to recover
> from split transaction errors, I would still suggest reverting that horrid
> commit, and then just black-listing the known-broken EHCI controllers and
> simply not *do* any split transactions on them. That way there's no
> complexity.
>
> As far as I know, split transactions aren't required anyway, they are just
> a performance optimization.

Nope. Linus, this is at least the second or third time you've
been wrong -- sorry. But I wish you were right, since they're
such a PITA to cope with. ;)

Split transactions are how the full and low speed devices bridge
to high speed busses. Think of the TT hub as a speed converter,
buffering data and then retransmitting it at the other (slower or
faster) speed. Some systems don't even have a full/low speed host
adapter ... they just have a high speed root hub and rely on some
external TT hubs (maybe on a mainboard) to handle the rest.


> Basically, we not only know that the commit has caused problems, it's
> fundamentally ugly, fragile, and not very maintainable, and the whole
> reason for doing it is pretty dubious.
>
> Why not just admit that certain hardware is broken (and the vendor isn't
> worth even bothering to be polite with, since they try to screw us every
> chance they get) and cannot reliably do split transactions? Problem
> solved, no real downside, and nobody will even *notice*.

Well, I suggested an alternate fix that I hope Stuart will look at.
I think it achieves your goals (only impacting Broadcom systems).

- Dave




>
> Linus
>


2007-08-21 05:31:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, David Brownell wrote:

> On Monday 20 August 2007, Linus Torvalds wrote:
> >
> > Fair enough. However, it still seems particularly idiotic to
> > - penalize everybody
> > - mix up two totally unrelated areas (cpufreq and USB) for a bug that is
> > extremely rare and could be handled differently.
>
> Yes on #1, but on #2 ... frequency transitions are a common place
> for systems to want to hiccup.

I disagree - it's extremely rare. We've probably had more *software*
problems with cpufreq than we've had real hardware problems (ie all the
locking for cpufreq has been pretty painful).

It's also something that probably depends a lot on the particular CPU.
Some CPU's have very short PLL relocking, and continue running while the
voltage is changed. Others seem to stop for much longer times. This is not
somethign that the USB layer should stick its fingers in - because quite
frankly, it really doesn't have a clue.

> Maybe less so on PCs, but it's hard to say that re-clocking an I/O or
> memory bus shouldn't affect the peripherals using it for "realtime"
> (deadlines) I/O !!

Normally the memory bus isn't reclocked (it's *possible* to do, but it's
complex and can be quite fragile). I think the issue was just that the CPU
itself was reclocked, and had long latencies for probing the cache. Not
unlike a sleep state: there can be long DMA latencies just from the CPU
being in S1!

So adding this special case for CPU frequency is not at all unlike adding
a special case saying that the CPU cannot go into "halt" state, because
the DMA latency is too high.

Have we done that? Yes. We actually had a "no-hlt" kernel command line
flag that literally disabled halting the CPU, because it apparently caused
problems for some floppy disk setups (and yes, the main reasonable
explanation was some bad DMA interaction, we never figured it out).

So it might be much better if we instead re-introduced that kind of "DMA
latency requirement", and letting different subsystems react to that as
they may.

It really can affect more than just cpufreq - I would not be in the
*least* surprised if C3 latencies and other things can cause these things
too! But even within cpufreq, it's quite likely to hit certain situations
more than others.

(Of course, if C3 latencies are high on a MB that has known DMA latency
issues, you'd hope that the BIOS has simply disabled C3 entirely in the
ACPI tables)

> transitions if the Broadcom chipset (or maybe it's just specific
> boards?) finds itself in the awkward configuration ... penalizing
> only the people we know could have trouble.

Yes, that would be more acceptable, I think.

It is also quite likely that this is not a generic cpufreq issue, but one
that happens with just a certain class of CPU's - ie some particular CPU
that is just slower than others at re-clocking.

Just disabling things blindly on cpufreq events, when what it actually
wants to do is say "I need low DMA latency", and then the cpu-freq layer
(which can know about these things) may decide internally that it knows
that a particular setup is not able to have low-latency DMA durign
frequency relocking.

On other - saner - CPU's, the frequency relock may take a fraction of the
time, and the CPU is running perfectly the whole time - and it would _not_
be affected.

> > As far as I know, split transactions aren't required anyway, they are just
> > a performance optimization.
>
> Nope. Linus, this is at least the second or third time you've
> been wrong -- sorry. But I wish you were right, since they're
> such a PITA to cope with. ;)
>
> Split transactions are how the full and low speed devices bridge
> to high speed busses. Think of the TT hub as a speed converter,
> buffering data and then retransmitting it at the other (slower or
> faster) speed. Some systems don't even have a full/low speed host
> adapter ... they just have a high speed root hub and rely on some
> external TT hubs (maybe on a mainboard) to handle the rest.

Ok. But in the meantime, I really think we should just revert the code
that causes a known regression.

Because, quite frankly, you may not like VIA, but in the bigger picture,
VIA has been a hell of a lot better than Broadcom.

Linus

2007-08-21 05:57:13

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions


> Have we done that? Yes. We actually had a "no-hlt" kernel command line
> flag that literally disabled halting the CPU, because it apparently caused
> problems for some floppy disk setups (and yes, the main reasonable
> explanation was some bad DMA interaction, we never figured it out).
>
> So it might be much better if we instead re-introduced that kind of "DMA
> latency requirement", and letting different subsystems react to that as
> they may.


wait.... we HAVE that infrastructure .. see kernel/latency.c ...


> It really can affect more than just cpufreq - I would not be in the
> *least* surprised if C3 latencies and other things can cause these things
> too! But even within cpufreq, it's quite likely to hit certain situations
> more than others.

and kernel/latency.c was designed EXACTLY for that reason. All the USB
layer has to do is to announce it's latency requirement like this:

/*
* Some broadcom chips are buggy and can't take more than 5 usec as DMA
* latency; inform the rest of kernel of this.
*/
if (weird_broadcom_chip())
set_acceptable_latency("ehci", 5);


and the C-state code will honor it. CPUFREQ doesn't honor it yet but
that's easy to add.. (this assumes the ACPI BIOS informs us correctly
about the cpu behavior, but that's the best we can do obviously unless
you want a table inside the kernel keyed off vendor/model/stepping)




--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-08-21 06:03:52

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, Linus Torvalds wrote:
>
> Ok. But in the meantime, I really think we should just revert the code
> that causes a known regression.

Side note: one reason I'm interested in this is that my mac mini (now used
by the kids) has had a very flaky USB mouse lately. Is it related? I have
no idea, and probably not, but as a result I'm very interested in any USB
regressions. There's *something* rotten with that mouse, and while it
could be the mouse itself going bad, I think it started happening only
after updating that machine to 2.6.23-rc1.

Linus

2007-08-21 06:09:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions


On Mon, 2007-08-20 at 22:51 -0700, Arjan van de Ven wrote:
> and the C-state code will honor it. CPUFREQ doesn't honor it yet but
> that's easy to add..

untested patch to add this to cpufreq; this is probably a good idea in
general even if using the latency framework doesn't end up being used
for fixing this regression...


--- linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c.org 2007-08-20 22:58:32.000000000 -0700
+++ linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c 2007-08-20 23:02:21.000000000 -0700
@@ -1604,6 +1604,12 @@ static int __cpufreq_set_policy(struct c
if (ret)
goto error_out;

+
+ if (system_latency_constraint() < policy->cpuinfo.transition_latency) {
+ ret = -EINVAL;
+ goto error_out;
+ }
+
/* notification of the new policy */
blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
CPUFREQ_NOTIFY, policy);


2007-08-21 06:26:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, Arjan van de Ven wrote:
> >
> > So it might be much better if we instead re-introduced that kind of "DMA
> > latency requirement", and letting different subsystems react to that as
> > they may.
>
> wait.... we HAVE that infrastructure .. see kernel/latency.c ...

Heh. Just shows how wellknown that interface is - it seems like it's only
used by the ipw2100 driver and "pcm_native".

But yes, that looks like the right thing.

> and the C-state code will honor it. CPUFREQ doesn't honor it yet but
> that's easy to add.. (this assumes the ACPI BIOS informs us correctly
> about the cpu behavior, but that's the best we can do obviously unless
> you want a table inside the kernel keyed off vendor/model/stepping)

Do we actually have the latency information for these things? Especially
since I assume a number of people use the specialized direct-hw-access
cpufreq drivers..

I realize that we *have* "transition_latency" at the cpufreq layer, and it
is supposed to be in ns, but I wonder how likely it is to bear any
relationship to reality, considering that I don't think it's really used
for anything.. (yeah, it affects the heuristics, but I don't think it has
any _hard_ meaning, so I'd worry that it's not necessarily something that
people have tried to make accurate).

But I dunno.

Linus

2007-08-21 06:27:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, Arjan van de Ven wrote:
>
> untested patch to add this to cpufreq; this is probably a good idea in
> general even if using the latency framework doesn't end up being used
> for fixing this regression...
>
>
> --- linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c.org 2007-08-20 22:58:32.000000000 -0700
> +++ linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c 2007-08-20 23:02:21.000000000 -0700
> @@ -1604,6 +1604,12 @@ static int __cpufreq_set_policy(struct c
> if (ret)
> goto error_out;
>
> +
> + if (system_latency_constraint() < policy->cpuinfo.transition_latency) {

That looks broken. "system_latency_constraint()" is in us, but
transition_latency is in ns, afaik.

But adding a "/ 1000" to turn the ns into us, and it migth even work.

Linus

2007-08-21 06:30:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions


On Mon, 2007-08-20 at 23:25 -0700, Linus Torvalds wrote:

> Do we actually have the latency information for these things? Especially
> since I assume a number of people use the specialized direct-hw-access
> cpufreq drivers..
>
> I realize that we *have* "transition_latency" at the cpufreq layer, and it
> is supposed to be in ns, but I wonder how likely it is to bear any
> relationship to reality, considering that I don't think it's really used
> for anything.. (yeah, it affects the heuristics, but I don't think it has
> any _hard_ meaning, so I'd worry that it's not necessarily something that
> people have tried to make accurate).

trusting the bios to be accurate for all machines is generally a ...
well... it's like trusting politicians in election week.

But it's sort of the best we got; at the same time, what are the odds
that the time is more than an order of magnitude off? if the latency of
the cpu is so large that the requirement ehci puts in is orders of
magnitude more strict, a bit inaccurate data from the bios doesn't
matter all that much. And worst case we make a table with quirks somehow
(probably on cpu vendor/model I suppose)

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-08-21 06:33:31

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions


On Mon, 2007-08-20 at 23:26 -0700, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, Arjan van de Ven wrote:
> >
> > untested patch to add this to cpufreq; this is probably a good idea in
> > general even if using the latency framework doesn't end up being used
> > for fixing this regression...
> >
> >
> > --- linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c.org 2007-08-20 22:58:32.000000000 -0700
> > +++ linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c 2007-08-20 23:02:21.000000000 -0700
> > @@ -1604,6 +1604,12 @@ static int __cpufreq_set_policy(struct c
> > if (ret)
> > goto error_out;
> >
> > +
> > + if (system_latency_constraint() < policy->cpuinfo.transition_latency) {
>
> That looks broken. "system_latency_constraint()" is in us, but
> transition_latency is in ns, afaik.
>
> But adding a "/ 1000" to turn the ns into us, and it migth even work.


eh woops yes indeed.
Shows me for not testing; I'll do that tomorrow when I'm more awake
--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2007-08-21 06:37:07

by David Brownell

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

On Monday 20 August 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, Linus Torvalds wrote:
> >
> > Ok. But in the meantime, I really think we should just revert the code
> > that causes a known regression.
>
> Side note: one reason I'm interested in this is that my mac mini (now used
> by the kids) has had a very flaky USB mouse lately. Is it related? I have
> no idea, and probably not, but as a result I'm very interested in any USB
> regressions. There's *something* rotten with that mouse, and while it
> could be the mouse itself going bad, I think it started happening only
> after updating that machine to 2.6.23-rc1.

Try disabling USB_SUSPEND ... the rather aggressive powersave
mechanism (autosuspend defaulting to always ON) has made lots
of trouble. I think that default will change...

- Dave

2007-08-21 06:46:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, Arjan van de Ven wrote:

>
> On Mon, 2007-08-20 at 23:26 -0700, Linus Torvalds wrote:
> >
> > On Mon, 20 Aug 2007, Arjan van de Ven wrote:
> > >
> > > untested patch to add this to cpufreq; this is probably a good idea in
> > > general even if using the latency framework doesn't end up being used
> > > for fixing this regression...
> > >
> > >
> > > --- linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c.org 2007-08-20 22:58:32.000000000 -0700
> > > +++ linux-2.6.23-rc2/drivers/cpufreq/cpufreq.c 2007-08-20 23:02:21.000000000 -0700
> > > @@ -1604,6 +1604,12 @@ static int __cpufreq_set_policy(struct c
> > > if (ret)
> > > goto error_out;
> > >
> > > +
> > > + if (system_latency_constraint() < policy->cpuinfo.transition_latency) {
> >
> > That looks broken. "system_latency_constraint()" is in us, but
> > transition_latency is in ns, afaik.
> >
> > But adding a "/ 1000" to turn the ns into us, and it migth even work.
>
>
> eh woops yes indeed.
> Shows me for not testing; I'll do that tomorrow when I'm more awake

Side note: I think we migth want to also have some way of telling the user
*why* we're not doing frequency changes. Maybe as simple as a rate-limited
printk() or something.

Otherwise, we'll easily be in a situation where some poor sod ends up
running constantly at lowest frequency, and no way of even seeing why.
Which sounds like a debugging nightmare.

If the kernel spits out the occasional warning about the latency
violation, at least we get notified about there being potential problems.

Linus

2007-08-21 06:53:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, David Brownell wrote:
>
> Try disabling USB_SUSPEND ... the rather aggressive powersave
> mechanism (autosuspend defaulting to always ON) has made lots
> of trouble. I think that default will change...

It's already disabled - so that's not it.

Linus

2007-08-21 07:25:22

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Mon, 20 Aug 2007, Linus Torvalds wrote:
>
> On Mon, 20 Aug 2007, David Brownell wrote:
> >
> > Try disabling USB_SUSPEND ... the rather aggressive powersave
> > mechanism (autosuspend defaulting to always ON) has made lots
> > of trouble. I think that default will change...
>
> It's already disabled - so that's not it.

Side note: after reverting 196705c9bb I can't get the mouse to skip any
more on that mac mini. But since the bad behaviour wasn't 100% reliable to
begin with, that's not really a guarantee of anything. Two out of three
kids are off on camp this week, so that machine probably won't be getting
a lot of testing ;/

That's with an all-intel chipset, no VIA or Broadcom anywhere.

I do find it interesting that VIA apparently ignores the INACTIVE bit: it
implies that Windows probably doesn't use it, which in turn implies that
nobody has ever tested it in any real life situation. So I would not be
shocked to hear that others have problems too.

But as mentioned, the mouse behaviour was flaky (but very noticeable and
irritating when it did happen), so it's hard to be sure my quick testing
really was convincing. Which is why it's even harder to bisect ;(.

Linus

2007-08-22 03:35:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Tue, 21 Aug 2007, Linus Torvalds wrote:
>
> Side note: after reverting 196705c9bb I can't get the mouse to skip any
> more on that mac mini. But since the bad behaviour wasn't 100% reliable to
> begin with, that's not really a guarantee of anything. Two out of three
> kids are off on camp this week, so that machine probably won't be getting
> a lot of testing ;/

Well, my one remaining child said today that "I got so much time on
webkinz today - yesterday the mouse locked up after five minutes".

Apparently it hadn't had the mouse lock up at all today.

So I really do believe that that 196705c9bb commit caused problems on
intel-only USB machines too ("ondemand" cpufreq governor, switching
between 1.0-1.66 Ghz using acpi-cpufreq: totally bog-standard in all
respects, in other words).

Linus

2007-08-22 14:42:39

by Stuart Hayes

[permalink] [raw]
Subject: RE: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions

Linus Torvalds wrote:
> On Tue, 21 Aug 2007, Linus Torvalds wrote:
>>
>> Side note: after reverting 196705c9bb I can't get the mouse to skip
>> any more on that mac mini. But since the bad behaviour wasn't 100%
>> reliable to begin with, that's not really a guarantee of anything.
>> Two out of three kids are off on camp this week, so that machine
>> probably won't be getting a lot of testing ;/
>
> Well, my one remaining child said today that "I got so much time on
> webkinz today - yesterday the mouse locked up after five minutes".
>
> Apparently it hadn't had the mouse lock up at all today.
>
> So I really do believe that that 196705c9bb commit caused problems on
> intel-only USB machines too ("ondemand" cpufreq governor, switching
> between 1.0-1.66 Ghz using acpi-cpufreq: totally bog-standard in all
> respects, in other words).
>
> Linus

If you were running 2.6.26-rc3, that's quite possibly because you didn't
have the follow-up patch that fixed my original patch... it wasn't in
2.6.26-rc3
(http://www.mail-archive.com/[email protected]/msg56
523.html). It fixed a bug with my patch that wasn't necessary with
Broadcom, but was with nVidia (and Intel, I believe).

That could definitely cause mouse lock-ups. Sorry, that should have
occurred to me yesterday when you mentioned the problem your kids were
seeing, but it didn't for some reason.

Stuart

2007-08-22 18:42:26

by Linus Torvalds

[permalink] [raw]
Subject: RE: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions



On Wed, 22 Aug 2007, [email protected] wrote:
>
> If you were running 2.6.26-rc3, that's quite possibly because you didn't
> have the follow-up patch that fixed my original patch... it wasn't in
> 2.6.26-rc3

Well, I was running "current git", and it's never been there. So not just
a -rc3 issue.

> That could definitely cause mouse lock-ups. Sorry, that should have
> occurred to me yesterday when you mentioned the problem your kids were
> seeing, but it didn't for some reason.

Btw, could it have caused the USB stack to be *really* confused? Some of
those mouse lockups ended up also locking the machine hard (ie no ping, no
nothing), and I'm a bit worried that there was something else going on
too..

That said, if you can actually re-create the MMF problems, could you
please try the patch that Arjan suggested? Ie add a

/*
* Some broadcom chips are buggy and can't take more than 5 usec as DMA
* latency; inform the rest of kernel of this.
*/
if (weird_broadcom_chip())
set_acceptable_latency("ehci", 5);

to the USB driver, and then add something like

static inline int cpufreq_acceptable_latency(struct cpufreq_policy *policy)
{
unsigned long latency;

/* Policy latency in usec */
latency = policy->cpuinfo.transition_latency / 1000;

if (latency > system_latency_constraint())
return -EINVAL;

return 0;
}

adn then add calls to this from both the "__cpufreq_set_policy()" function
and the "__cpufreq_driver_target()" one too..

That should disable cpufreq with that broken chip, which is perhaps a big
draconian, but it's certainly better than having the USB layer know about
cpufreq internals directly.

In the longer run, I think we can move the "system_latency_constraint()"
checking from the policy registration into each CPU frequency driver, so
that it could be more dynamically decide about "can we do it right _now_"
rather than globally saying "we can't do it with this hardware".

Linus

2007-08-22 20:41:53

by Stuart Hayes

[permalink] [raw]
Subject: RE: [linux-usb-devel] [4/4] 2.6.23-rc3: known regressions


>> That could definitely cause mouse lock-ups. Sorry, that should have
>> occurred to me yesterday when you mentioned the problem your kids
>> were seeing, but it didn't for some reason.
>
> Btw, could it have caused the USB stack to be *really* confused? Some
> of those mouse lockups ended up also locking the machine hard (ie no
> ping, no nothing), and I'm a bit worried that there was something
> else going on too..
>

Unfortunately, yes, that sounds exactly like what happened with the
nVidia controller. The problem was with my patch, but was fixed with
the later patch. I doubt there was anything else going on.

> That said, if you can actually re-create the MMF problems, could you
> please try the patch that Arjan suggested? Ie add a
>
> /*
> * Some broadcom chips are buggy and can't take more than 5 usec
as
> DMA
> * latency; inform the rest of kernel of this.
> */
> if (weird_broadcom_chip())
> set_acceptable_latency("ehci", 5);
>
> to the USB driver, and then add something like
>
> static inline int cpufreq_acceptable_latency(struct
cpufreq_policy
> *policy) {
> unsigned long latency;
>
> /* Policy latency in usec */
> latency = policy->cpuinfo.transition_latency / 1000;
>
> if (latency > system_latency_constraint())
> return -EINVAL;
>
> return 0;
> }
>
> adn then add calls to this from both the "__cpufreq_set_policy()"
> function and the "__cpufreq_driver_target()" one too..
>
> That should disable cpufreq with that broken chip, which is perhaps a
> big draconian, but it's certainly better than having the USB layer
> know about cpufreq internals directly.
>
> In the longer run, I think we can move the
> "system_latency_constraint()"
> checking from the policy registration into each CPU frequency driver,
> so that it could be more dynamically decide about "can we do it right
> _now_"
> rather than globally saying "we can't do it with this hardware".
>
> Linus

I will work on that, thank you for the help.

Stuart

2007-08-22 23:35:42

by Junio C Hamano

[permalink] [raw]
Subject: Re: [4/4] 2.6.23-rc3: known regressions

Linus Torvalds <[email protected]> writes:

> Well, my one remaining child said today that "I got so much time on
> webkinz today - yesterday the mouse locked up after five minutes".
>
> Apparently it hadn't had the mouse lock up at all today.
>
> So I really do believe that that 196705c9bb commit caused problems on
> intel-only USB machines too ("ondemand" cpufreq governor, switching
> between 1.0-1.66 Ghz using acpi-cpufreq: totally bog-standard in all
> respects, in other words).
>
> Linus

Sorry for being way offtopic, but the above message reminded me
of commit 869659a6 from git.git repository, also this message:

http://www.gelato.unsw.edu.au/archives/git/0607/24208.html

By the way, Linus, please let me know if you get this message
via vger but not via the direct path to you. I seem to have
been getting bounces for mails to you and andrew from my ISP in
the past few weeks.