2016-10-21 15:29:31

by Michael Thayer

[permalink] [raw]
Subject: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

Hello Alan (LKML on CC),

Contacting you about this on Thomas Gleixner's (also on CC) suggestion.
The short summary is that when Linux 4.8.0 (also tested with a few later
kernels) is run on a VirtualBox virtual machine with USB enabled, OHCI
fails with the log messages "frame counter not updated; disabled" and
"HC died; cleaning up". This seems to be due to the 250 ms interval
watchdog running with far too short intervals, which we think is a
consequence of the timer wheel code rework. I will refer you to a bug
filed in Launchpad<1> for a longer description.

Hope this is of interest to you.

Regards,

Michael

<1> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1634737
--
Michael Thayer | VirtualBox engineer
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | D-71384 Weinstadt

ORACLE Deutschland B.V. & Co. KG
Hauptverwaltung: Riesstraße 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande Handelsregister
der Handelskammer Midden-Nederland, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Jan Schultheiss, Val Maher


2016-10-21 16:54:12

by Alan Stern

[permalink] [raw]
Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

On Fri, 21 Oct 2016, Michael Thayer wrote:

> Hello Alan (LKML on CC),
>
> Contacting you about this on Thomas Gleixner's (also on CC) suggestion.
> The short summary is that when Linux 4.8.0 (also tested with a few later
> kernels) is run on a VirtualBox virtual machine with USB enabled, OHCI
> fails with the log messages "frame counter not updated; disabled" and
> "HC died; cleaning up". This seems to be due to the 250 ms interval
> watchdog running with far too short intervals, which we think is a
> consequence of the timer wheel code rework. I will refer you to a bug
> filed in Launchpad<1> for a longer description.
>
> Hope this is of interest to you.
>
> Regards,
>
> Michael
>
> <1> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1634737

That bug description says the watchdog timer routine can be called
twice in a 4-ms period, even though it requests a 250-ms delay. Is
this really true? If it is, it sounds like a real bug in the timer
core.

Bryan Paluch reported a similar problem and said that increasing the
timeout value to 275 ms fixed it:

http://marc.info/?l=linux-usb&m=147670889009451&w=2

Does that patch also fix the "frame counter not updating" problem?

Alan Stern

2016-10-21 19:33:40

by Michal Necasek

[permalink] [raw]
Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework


Alan,

I'll get back to you on whether increasing the timeout helps, it'll take a bit of testing. Does it actually sound plausible that some controllers could not get things done in 250ms but could in 275ms?

I will note that according to the table in <1>, a 250ms timeout with a HZ=250 kernel (what Ubuntu uses) falls into the 4ms granularity bucket, but a 275ms timeout goes into the 32ms granularity bucket. That could change things.

And as a bit of background... the 250ms timeout should not be a problem in a virtualized environment under normal conditions. A ~10ms timeout is. What makes things harder is that the watchdog routine reads the frame counter from the HCCA, which has to be updated asynchronously. Reading from a HC register would actually be more expensive but much more reliable in this case. But again, 250ms should be plenty...

Another factor is that the OHCI watchdog just kills the driver the first time there's a problem, there's no recovery attempt. So it's very noticeable when this happens.


Regards,
Michal


<1> http://lxr.free-electrons.com/source/kernel/time/timer.c

----- Original Message -----
From: [email protected]
To: [email protected]
Cc: [email protected], [email protected], [email protected], [email protected]
Sent: Friday, October 21, 2016 6:54:13 PM GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

On Fri, 21 Oct 2016, Michael Thayer wrote:

> Hello Alan (LKML on CC),
>
> Contacting you about this on Thomas Gleixner's (also on CC) suggestion.
> The short summary is that when Linux 4.8.0 (also tested with a few later
> kernels) is run on a VirtualBox virtual machine with USB enabled, OHCI
> fails with the log messages "frame counter not updated; disabled" and
> "HC died; cleaning up". This seems to be due to the 250 ms interval
> watchdog running with far too short intervals, which we think is a
> consequence of the timer wheel code rework. I will refer you to a bug
> filed in Launchpad<1> for a longer description.
>
> Hope this is of interest to you.
>
> Regards,
>
> Michael
>
> <1> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1634737

That bug description says the watchdog timer routine can be called
twice in a 4-ms period, even though it requests a 250-ms delay. Is
this really true? If it is, it sounds like a real bug in the timer
core.

Bryan Paluch reported a similar problem and said that increasing the
timeout value to 275 ms fixed it:

http://marc.info/?l=linux-usb&m=147670889009451&w=2

Does that patch also fix the "frame counter not updating" problem?

Alan Stern

2016-10-21 19:47:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

On Fri, 21 Oct 2016, Michal Necasek wrote:

> I'll get back to you on whether increasing the timeout helps, it'll take
> a bit of testing. Does it actually sound plausible that some controllers
> could not get things done in 250ms but could in 275ms?

There is a bug in the new timer wheel code, which causes a timer to fire
way too early. That USB timeout is just exposing the issue.

We do not yet understand the issue completely, but we are working on it.

Thanks,

tglx



2016-10-21 20:57:19

by Alan Stern

[permalink] [raw]
Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework

On Fri, 21 Oct 2016, Michal Necasek wrote:

> Alan,
>
> I'll get back to you on whether increasing the timeout helps, it'll
> take a bit of testing. Does it actually sound plausible that some
> controllers could not get things done in 250ms but could in 275ms?

Not very, I admit.

> I will note that according to the table in <1>, a 250ms timeout with
> a HZ=250 kernel (what Ubuntu uses) falls into the 4ms granularity
> bucket, but a 275ms timeout goes into the 32ms granularity bucket.
> That could change things.

Thomas has noted that there is definitely a bug in the new timer code.

> And as a bit of background... the 250ms timeout should not be a
> problem in a virtualized environment under normal conditions. A ~10ms
> timeout is. What makes things harder is that the watchdog routine
> reads the frame counter from the HCCA, which has to be updated
> asynchronously. Reading from a HC register would actually be more
> expensive but much more reliable in this case. But again, 250ms
> should be plenty...
>
> Another factor is that the OHCI watchdog just kills the driver the
> first time there's a problem, there's no recovery attempt. So it's
> very noticeable when this happens.

I should think so! The whole point of the watchdog is to catch bad
hardware when it starts misbehaving. If that happens, we need to
cancel everything and kill the driver so that it doesn't take large
parts of the system down with it -- which is what used to happen.

Alan Stern

> Regards,
> Michal
>
>
> <1> http://lxr.free-electrons.com/source/kernel/time/timer.c


> ----- Original Message -----
> From: [email protected]
> To: [email protected]
> Cc: [email protected], [email protected], [email protected], [email protected]
> Sent: Friday, October 21, 2016 6:54:13 PM GMT +01:00 Amsterdam / Berlin / Bern / Rome / Stockholm / Vienna
> Subject: Re: PROBLEM: OHCI watchdog timeouts inside VirtualBox, probably due to timer wheel rework
>
> On Fri, 21 Oct 2016, Michael Thayer wrote:
>
> > Hello Alan (LKML on CC),
> >
> > Contacting you about this on Thomas Gleixner's (also on CC) suggestion.
> > The short summary is that when Linux 4.8.0 (also tested with a few later
> > kernels) is run on a VirtualBox virtual machine with USB enabled, OHCI
> > fails with the log messages "frame counter not updated; disabled" and
> > "HC died; cleaning up". This seems to be due to the 250 ms interval
> > watchdog running with far too short intervals, which we think is a
> > consequence of the timer wheel code rework. I will refer you to a bug
> > filed in Launchpad<1> for a longer description.
> >
> > Hope this is of interest to you.
> >
> > Regards,
> >
> > Michael
> >
> > <1> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1634737
>
> That bug description says the watchdog timer routine can be called
> twice in a 4-ms period, even though it requests a 250-ms delay. Is
> this really true? If it is, it sounds like a real bug in the timer
> core.
>
> Bryan Paluch reported a similar problem and said that increasing the
> timeout value to 275 ms fixed it:
>
> http://marc.info/?l=linux-usb&m=147670889009451&w=2
>
> Does that patch also fix the "frame counter not updating" problem?
>
> Alan Stern