2017-04-03 15:39:26

by Frederic Weisbecker

Subject: Re: v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

On Thu, Feb 23, 2017 at 07:40:13PM +0100, Pavel Machek wrote:
> On Thu 2017-02-23 17:28:26, Frederic Weisbecker wrote:
> > On Tue, Feb 14, 2017 at 08:27:43PM +0100, Pavel Machek wrote:
> > > On Tue 2017-02-14 18:59:56, Pavel Machek wrote:
> > > > Hi!
> > > >
> > > > > > > > Hmm. I moved keyboard between USB ports, and now 4.10-rc6 no longer
> > > > > > > > boots. v4.6 works ok. Let me try with keyboard unplugged... no, I
> > > > > > > > could not get it to work. I believe v4.9 and some v4.10-rc's worked,
> > > > > > > > but I'll have to double check.
> > > > > > >
> > > > > > > But all the kernel versions worked when the keyboard was plugged into
> > > > > > > its original USB port?
> > > > > >
> > > > > > > Aha. So it looks like the difference is probably not in "where is the
> > > > > > > keyboard plugged in" but in "reboot" vs. "cold boot". I did not do a
> > > > > > > cold boot in quite a while :-(.
> > > > > >
> > > > > > Booting to grub, then hitting ctrl-alt-del is enough to make it work. Ouch.
> > > > > >
> > > > > > It happens with current Linus' tree.
> > > > >
> > > > > v4.10-rc6-feb3 : broken
> > > > > v4.9 : ok
> > > > > (v4.6 : ok)
> > > >
> > > > Hmm. It hangs during PCI fixups, and it hangs in v4.10-rc8, too.
> > > >
> > > > With debug patch below, I get
> > > >
> > > > ...1d.7: PCI fixup... pass 2
> > > > ...1d.7: PCI fixup... pass 3
> > > > ...1d.7: PCI fixup... pass 3 done
> > > >
> > > > ...followed by hang. So yes, it looks USB related.
> > > >
> > > > (Sometimes it hangs with some kind of backtrace involving secondary CPU
> > > > startup; unfortunately the useful info is off screen at that point).
> > >
> > > Forgot to say, 1d.7 is EHCI controller.
> > >
> > > 00:1d.7 USB controller: Intel Corporation NM10/ICH7 Family USB2 EHCI
> > > Controller (rev 01)
> >
> > Ok, I should have access soon to an EeePC 1015CX (which seems to have this controller).
> > I hope I'll be able to reproduce the issue there. If not, I'm sorry but I'll have to
> > burden you again :-)
>
> Go through more mails. It is only reproducible after cold boot. .. so
> I doubt it will be easy to reproduce on another machine.
>
> Now... I do have a serial port, and I even might have a serial cable
> somewhere, but.... Given how sensitive it is, it is probably going to
> go away with console on ttyS...

I also tried on an EeePC (which has ICH7/NM10 as well), with your config.
I even plugged in a USB keyboard, but even then I was unable to
reproduce it :-(


2017-04-03 18:20:54

by Pavel Machek

Subject: Re: v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

> I also tried on an EeePC (which has ICH7/NM10 as well), with your config.
> I even plugged in a USB keyboard, but even then I was unable to
> reproduce it :-(

Ok, give me some time. I'm no longer using the affected machine, so no
promises.

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-04-12 15:08:45

by Frederic Weisbecker

Subject: Re: v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

On Mon, Apr 03, 2017 at 08:20:50PM +0200, Pavel Machek wrote:
> Ok, give me some time. I'm no longer using the affected machine, so no
> promises.

Actually, someone recently reported an issue to me that is very similar to yours. It's probably
the same one, and I have a potential fix.

The scenario is a bit tricky again, and still theoretical. If you're interested in the gory details:
a tick that is scheduled at jiffies = N + 1, in order to expire a timer_list timer, fires a
tiny bit too early (i.e. a few microseconds in advance). So it doesn't update jiffies on IRQ entry
and still sees jiffies = N. The timer_list timer doesn't expire yet, and on IRQ exit we reschedule
the tick for the same expiry time. But we see that ts->next_tick already holds that value, so
we don't reprogram it, leaving the clockevent unprogrammed.

So in case you have the time and opportunity to test the fix, you'll need to:

1) Revert back to the offending change:
git revert 558e8e27e73f53f8a512485be538b07115fe5f3c

2) Apply a delta fix:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index a3b8154..ae66515 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1071,8 +1071,10 @@ static void tick_nohz_handler(struct clock_event_device *dev)
 	tick_sched_handle(ts, regs);

 	/* No need to reprogram if we are running tickless */
-	if (unlikely(ts->tick_stopped))
+	if (unlikely(ts->tick_stopped)) {
+		ts->next_tick = 0;
 		return;
+	}

 	hrtimer_forward(&ts->sched_timer, now, tick_period);
 	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);
@@ -1172,8 +1174,10 @@ static enum hrtimer_restart tick_sched_timer(struct hrtimer *timer)
 		tick_sched_handle(ts, regs);

 	/* No need to reprogram if we are in idle or full dynticks mode */
-	if (unlikely(ts->tick_stopped))
+	if (unlikely(ts->tick_stopped)) {
+		ts->next_tick = 0;
 		return HRTIMER_NORESTART;
+	}

 	hrtimer_forward(timer, now, tick_period);



Thanks!



2017-04-15 21:34:54

by Pavel Machek

Subject: Re: v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

On Wed 2017-04-12 17:08:35, Frederic Weisbecker wrote:
> Actually, someone recently reported an issue to me that is very similar to yours. It's probably
> the same one, and I have a potential fix.

Got the machine back to work -- I guess it will be useful for distcc.

And yes, you seem to have the right fix :-).

Tested-by: Pavel Machek <[email protected]>

Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



2017-04-20 14:52:13

by Frederic Weisbecker

Subject: Re: v4.10-rc8 (-rc6) boot regression on Intel desktop, does not boot after cold boots, boots after reboot

On Sat, Apr 15, 2017 at 11:34:47PM +0200, Pavel Machek wrote:
> Got the machine back to work -- I guess it will be useful for distcc.
>
> And yes, you seem to have the right fix :-).
>
> Tested-by: Pavel Machek <[email protected]>

Thanks a lot! I'm posting the fix.