Moving the thread to LKML, as suggested by Thomas...
>
>> ---------- Forwarded message ---------
>> From: Woody Suwalski <[email protected]>
>> Date: Thu, Aug 1, 2019 at 3:45 PM
>> Subject: Intermittent suspend on 5.3 / 5.2
>> To: Rafael J. Wysocki <[email protected]>
>>
>>
>> Hi Rafał,
>> I know that you are investigating some issues between these 2 kernels,
>> however I see probably an unrelated problem with suspend on 5.3 and
>> 5.2.4. I think it has creeped in to 5.1.21 as well, but not sure (it is
>> intermittent). So far 4.20.17 works OK, and I think 5.2.0 works OK.
>> The problem I see is on both 32 and 64 bit VMs, in VMware workstation
>> 15. The VM is trying to suspend when no activity. It leaves out a black
>> box with cursor in top-left position. Upon wakeup from VMware it goes to
>> vmware pre-bios screen, and then expands the black box to the run-size
>> and switches to X.
>> The problem with new kernels is that (I think) the suspend fails - the
>> black box with cursor is there, but seems bigger, and of course is not
>> wake'able (have to reset). In kern.log suspend seems be running OK, and
>> then new dmesg lines kick in, and no obvious culprit.
>> So looking for a free advice .
>> a. You already know what it is
>> b. You may have suggestions as to which upstream patch could be to blame
>> c. I should boot with some debug params (console_off=0, or some other?)
>> and get some real info?
>>
>> BTW. For suspend to work I had to override mem_sleep to [shallow], or
>> maybe later to [s2idle] (the actual VMs are at work, referring from
>> memory...)
>>
>> If you have any ideas, all are welcomed
>> Thanks, Woody
On 8/6/2019 3:18 PM, Woody Suwalski wrote:
> Rafal, the patch (in 5.3-rc3)
>
> Fixes: f850a48a0799 ("ACPI: PM: Allow transitions to D0 to occur in
> special cases")
>
> does not fix the issue - it must be something else...
Sorry for the late response.
There are known issues in 5.3-rc related to power management which
should be fixed in -rc4. Please try that one when it is out.
Cheers!
Thomas Gleixner wrote:
> Woody,
>
> On Fri, 9 Aug 2019, Woody Suwalski wrote:
>
> For future things like this, please CC LKML. There is nothing secrit here
> and CC'ing the mailing list allows other people to find this and spare
> themself the whole bisection pain. Asided of that private mail does not
> scale. On the list other people can look at it and give input eventually.
>
>> After bisecting I have found the potential culprit:
>> dfe0cf8b x86/ioapic: Implement irq_get irqchip_state() callback
>>
>> I am repeating the bisection from start to re-confirm.
>>
>> Reverse-patch on 5.3-rc3 (64bit) is fixing the problem for me.
>> What is unclear - just adding the patch to 5.2.1 does not seem to
>> break it. So there is some more magic involved.
> Of course it does not do anything because 5.2.1 is not having
>
> f4999a2a3a48 ("genirq: Add optional hardware synchronization for shutdown")
>
>> Thomas, any suggestions?
> What that means is that there is an interrupt shutdown which hits the
> condition where an interrupt _IS_ marked in the IOAPIC as delivered to a
> CPU, but not serviced yet.
>
> Now the question is why it is not serviced. suspend_device_irqs() is
> calling into synchronize_irq(), which is probably the place where that
> it hangs. But that's called with CPUs online and interrupts enabled.
>
>> The reproduce methodology: use VMware player 15, either 32 or 64 bit build.
>> reboot and run "systemctl suspend". The first suspend works OK. The
>> second usually locks on kernels 5.2.2 and up. Maybe try 4 times to
>> confirm good (it is intermittent).
> -ENOVMWAREPLAYER and I'm traveling so I don't have a machine handy to
> install it. So if you can't debug it deeper down, I'm not going to have a
> chance to look at it before the end of next week.
>
> That said, can we please move this to LKML?
>
> Thanks,
>
> tglx
>
>
I can add some printk's into synchronize_irq(), however no idea if they
will be survive in the kmsg log after a next power-reset. I can wait for
a week :-)
Thanks, Woody
Thomas, Rafael,
I have added a timeout counter in __synchronize_hardirq().
At the bottom I have converted while(inprogress); to while(inprogress
&& timeout++ < 100);
That is bypassing the suspend lockup problem. On both 32-bit and
64-bit VMs the countdown is triggered by sync of irq9.
Which probably means that there is some issue in ACPI handler and
synchronize_hardirq() is stuck on it?
I will try to repeat with 5.3-rc4 tomorrow....
Thanks, Woody
On Sat, Aug 10, 2019 at 7:24 AM Woody Suwalski <[email protected]> wrote:
>
> Moving the thread to LKML, as suggested by Thomas...
> >
> >> ---------- Forwarded message ---------
> >> From: Woody Suwalski <[email protected]>
> >> Date: Thu, Aug 1, 2019 at 3:45 PM
> >> Subject: Intermittent suspend on 5.3 / 5.2
> >> To: Rafael J. Wysocki <[email protected]>
> >>
> >>
> >> Hi Rafał,
> >> I know that you are investigating some issues between these 2 kernels,
> >> however I see probably an unrelated problem with suspend on 5.3 and
> >> 5.2.4. I think it has creeped in to 5.1.21 as well, but not sure (it is
> >> intermittent). So far 4.20.17 works OK, and I think 5.2.0 works OK.
> >> The problem I see is on both 32 and 64 bit VMs, in VMware workstation
> >> 15. The VM is trying to suspend when no activity. It leaves out a black
> >> box with cursor in top-left position. Upon wakeup from VMware it goes to
> >> vmware pre-bios screen, and then expands the black box to the run-size
> >> and switches to X.
> >> The problem with new kernels is that (I think) the suspend fails - the
> >> black box with cursor is there, but seems bigger, and of course is not
> >> wake'able (have to reset). In kern.log suspend seems be running OK, and
> >> then new dmesg lines kick in, and no obvious culprit.
> >> So looking for a free advice .
> >> a. You already know what it is
> >> b. You may have suggestions as to which upstream patch could be to blame
> >> c. I should boot with some debug params (console_off=0, or some other?)
> >> and get some real info?
> >>
> >> BTW. For suspend to work I had to override mem_sleep to [shallow], or
> >> maybe later to [s2idle] (the actual VMs are at work, referring from
> >> memory...)
> >>
> >> If you have any ideas, all are welcomed
> >> Thanks, Woody
>
>
>
> On 8/6/2019 3:18 PM, Woody Suwalski wrote:
> > Rafal, the patch (in 5.3-rc3)
> >
> > Fixes: f850a48a0799 ("ACPI: PM: Allow transitions to D0 to occur in
> > special cases")
> >
> > does not fix the issue - it must be something else...
>
> Sorry for the late response.
>
> There are known issues in 5.3-rc related to power management which
> should be fixed in -rc4. Please try that one when it is out.
>
> Cheers!
>
>
>
> Thomas Gleixner wrote:
> > Woody,
> >
> > On Fri, 9 Aug 2019, Woody Suwalski wrote:
> >
> > For future things like this, please CC LKML. There is nothing secrit here
> > and CC'ing the mailing list allows other people to find this and spare
> > themself the whole bisection pain. Asided of that private mail does not
> > scale. On the list other people can look at it and give input eventually.
> >
> >> After bisecting I have found the potential culprit:
> >> dfe0cf8b x86/ioapic: Implement irq_get irqchip_state() callback
> >>
> >> I am repeating the bisection from start to re-confirm.
> >>
> >> Reverse-patch on 5.3-rc3 (64bit) is fixing the problem for me.
> >> What is unclear - just adding the patch to 5.2.1 does not seem to
> >> break it. So there is some more magic involved.
> > Of course it does not do anything because 5.2.1 is not having
> >
> > f4999a2a3a48 ("genirq: Add optional hardware synchronization for shutdown")
> >
> >> Thomas, any suggestions?
> > What that means is that there is an interrupt shutdown which hits the
> > condition where an interrupt _IS_ marked in the IOAPIC as delivered to a
> > CPU, but not serviced yet.
> >
> > Now the question is why it is not serviced. suspend_device_irqs() is
> > calling into synchronize_irq(), which is probably the place where that
> > it hangs. But that's called with CPUs online and interrupts enabled.
> >
> >> The reproduce methodology: use VMware player 15, either 32 or 64 bit build.
> >> reboot and run "systemctl suspend". The first suspend works OK. The
> >> second usually locks on kernels 5.2.2 and up. Maybe try 4 times to
> >> confirm good (it is intermittent).
> > -ENOVMWAREPLAYER and I'm traveling so I don't have a machine handy to
> > install it. So if you can't debug it deeper down, I'm not going to have a
> > chance to look at it before the end of next week.
> >
> > That said, can we please move this to LKML?
> >
> > Thanks,
> >
> > tglx
> >
> >
> I can add some printk's into synchronize_irq(), however no idea if they
> will be survive in the kmsg log after a next power-reset. I can wait for
> a week :-)
>
> Thanks, Woody
>
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
A: No.
Q: Should I include quotations after my reply?
http://daringfireball.net/2007/07/on_top
Woody,
On Mon, 12 Aug 2019, Woody Suwalski wrote:
> I have added a timeout counter in __synchronize_hardirq().
> At the bottom I have converted while(inprogress); to while(inprogress
> && timeout++ < 100);
>
> That is bypassing the suspend lockup problem. On both 32-bit and
> 64-bit VMs the countdown is triggered by sync of irq9.
So ACPI triggered an interrupt, which got already forwarded to a CPU, but
it is not handled. That's more than strange.
> Which probably means that there is some issue in ACPI handler and
> synchronize_hardirq() is stuck on it?
The ACPI handler is not the culprit. This is either an emulation bug or
something really strange. Can you please use a WARN_ON() if the loop is
exited via the timeout so we can see in which context this happens?
Thanks,
tglx
On Mon, Aug 12, 2019 at 1:24 PM Thomas Gleixner <[email protected]> wrote:
> On Mon, 12 Aug 2019, Woody Suwalski wrote:
>
> > I have added a timeout counter in __synchronize_hardirq().
> > At the bottom I have converted while(inprogress); to while(inprogress
> > && timeout++ < 100);
> >
> > That is bypassing the suspend lockup problem. On both 32-bit and
> > 64-bit VMs the countdown is triggered by sync of irq9.
>
> So ACPI triggered an interrupt, which got already forwarded to a CPU, but
> it is not handled. That's more than strange.
>
> > Which probably means that there is some issue in ACPI handler and
> > synchronize_hardirq() is stuck on it?
>
> The ACPI handler is not the culprit. This is either an emulation bug or
> something really strange. Can you please use a WARN_ON() if the loop is
> exited via the timeout so we can see in which context this happens?
>
Thomas, Rafael
A. Learning the Wonderfull World of Gmail Web Interface. Maybe w/o top
posting this time....
B. On 5.3-rc4 problem is gone. I guess it is overall good sign.
C. To recreate problem I went back to 5.2.4. The WARN_ON trace shows
(in reverse):
entry_SYSCALL_64_after_hwframe
do_syscall_64
ksys_write
vfs_write
kernfs_fop_write
state_store
pm_suspend.cold.3
suspend_devices_and_enter
dpm_suspend_noirq
suspend_device_irqs
?ktime_get
?synchronize
synchronize_irq
__synchronize_hardirq.cold.9
Comm: systemd-sleep
Would that help?
Thanks, Woody
Woody,
On Tue, 13 Aug 2019, Woody Suwalski wrote:
> On Mon, Aug 12, 2019 at 1:24 PM Thomas Gleixner <[email protected]> wrote:
> > The ACPI handler is not the culprit. This is either an emulation bug or
> > something really strange. Can you please use a WARN_ON() if the loop is
> > exited via the timeout so we can see in which context this happens?
> >
>
> B. On 5.3-rc4 problem is gone. I guess it is overall good sign.
Now the interesting question is what changed between 5.3-rc3 and
5.3-rc4. Could you please try to bisect that?
> C. To recreate problem I went back to 5.2.4. The WARN_ON trace shows
> (in reverse):
Next time you can spare yourself the work to reverse the stack trace. We
are all used to read it the other way round :)
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> ksys_write
> vfs_write
> kernfs_fop_write
> state_store
> pm_suspend.cold.3
> suspend_devices_and_enter
> dpm_suspend_noirq
> suspend_device_irqs
> ?ktime_get
> ?synchronize
> synchronize_irq
> __synchronize_hardirq.cold.9
dpm_suspend_noirq() is called with all CPUs online and interrupts
enabled. In that case an interrupt pending in IRR does not make any sense
at all. Confused.
Thanks,
tglx
On Thu, Aug 15, 2019 at 2:37 AM Thomas Gleixner <[email protected]> wrote:
>
> Woody,
>
> On Tue, 13 Aug 2019, Woody Suwalski wrote:
> > On Mon, Aug 12, 2019 at 1:24 PM Thomas Gleixner <[email protected]> wrote:
> > > The ACPI handler is not the culprit. This is either an emulation bug or
> > > something really strange. Can you please use a WARN_ON() if the loop is
> > > exited via the timeout so we can see in which context this happens?
> > >
> >
> > B. On 5.3-rc4 problem is gone. I guess it is overall good sign.
>
> Now the interesting question is what changed between 5.3-rc3 and
> 5.3-rc4. Could you please try to bisect that?
>
Apparently I can not, and frustrated'ingly do not understand it.
Tried twice, and every time I get it broken to the end of bisection -
so the fixed-in-5.3-rc4 theory falls apart. Yet if I build cleanly
5.3-rc4 or -rc5, it works OK.
Then on a 32 bit system - I first tried with a scaled-down kernel
(just with the drivers needed in the VM). That one is never working,
even in rc5. Yet the "full" kernel works OK. So now there is a config
issue variation on top of other problem?
>
> dpm_suspend_noirq() is called with all CPUs online and interrupts
> enabled. In that case an interrupt pending in IRR does not make any sense
> at all. Confused.
>
For now I use a timeout counter patch - and it is showing 100% irq9
jammed and needing rescue. And I am even more confused...
Thanks, Woody
On Tue, 20 Aug 2019, Woody Suwalski wrote:
> On Thu, Aug 15, 2019 at 2:37 AM Thomas Gleixner <[email protected]> wrote:
> > On Tue, 13 Aug 2019, Woody Suwalski wrote:
> > > On Mon, Aug 12, 2019 at 1:24 PM Thomas Gleixner <[email protected]> wrote:
> > > > The ACPI handler is not the culprit. This is either an emulation bug or
> > > > something really strange. Can you please use a WARN_ON() if the loop is
> > > > exited via the timeout so we can see in which context this happens?
> > > >
> > >
> > > B. On 5.3-rc4 problem is gone. I guess it is overall good sign.
> >
> > Now the interesting question is what changed between 5.3-rc3 and
> > 5.3-rc4. Could you please try to bisect that?
> >
>
> Apparently I can not, and frustrated'ingly do not understand it.
> Tried twice, and every time I get it broken to the end of bisection -
> so the fixed-in-5.3-rc4 theory falls apart. Yet if I build cleanly
> 5.3-rc4 or -rc5, it works OK.
> Then on a 32 bit system - I first tried with a scaled-down kernel
> (just with the drivers needed in the VM). That one is never working,
> even in rc5. Yet the "full" kernel works OK. So now there is a config
> issue variation on top of other problem?
Looks like and it would be good to know which knob it is.
Can you send me the two configs please?
> > dpm_suspend_noirq() is called with all CPUs online and interrupts
> > enabled. In that case an interrupt pending in IRR does not make any sense
> > at all. Confused.
> >
> For now I use a timeout counter patch - and it is showing 100% irq9
> jammed and needing rescue. And I am even more confused...
You're not alone, if that gives you a bit of comfort :)
Thanks,
tglx
I have tried to "bisect" the config changes, and builds working/not
working between
rc3-rc4-rc5, and come out with the same frustrating result, that
building a "clean" kernel is not producing the same behavoir as
incremental building while bisecting. For some reason even after
getting to the same config step-by-step is not making the kernel work,
similar with actual bisecting.
So for now I simply use my patch to do the timeout.
Thinking of it - should I submit a patch like that to you for
consideration? It may be usefull for other users with the suspend
problems...
Thanks, Woody
On Wed, Aug 21, 2019 at 4:15 PM Thomas Gleixner <[email protected]> wrote:
>
> On Tue, 20 Aug 2019, Woody Suwalski wrote:
> > On Thu, Aug 15, 2019 at 2:37 AM Thomas Gleixner <[email protected]> wrote:
> > > On Tue, 13 Aug 2019, Woody Suwalski wrote:
> > > > On Mon, Aug 12, 2019 at 1:24 PM Thomas Gleixner <[email protected]> wrote:
> > > > > The ACPI handler is not the culprit. This is either an emulation bug or
> > > > > something really strange. Can you please use a WARN_ON() if the loop is
> > > > > exited via the timeout so we can see in which context this happens?
> > > > >
> > > >
> > > > B. On 5.3-rc4 problem is gone. I guess it is overall good sign.
> > >
> > > Now the interesting question is what changed between 5.3-rc3 and
> > > 5.3-rc4. Could you please try to bisect that?
> > >
> >
> > Apparently I can not, and frustrated'ingly do not understand it.
> > Tried twice, and every time I get it broken to the end of bisection -
> > so the fixed-in-5.3-rc4 theory falls apart. Yet if I build cleanly
> > 5.3-rc4 or -rc5, it works OK.
> > Then on a 32 bit system - I first tried with a scaled-down kernel
> > (just with the drivers needed in the VM). That one is never working,
> > even in rc5. Yet the "full" kernel works OK. So now there is a config
> > issue variation on top of other problem?
>
> Looks like and it would be good to know which knob it is.
>
> Can you send me the two configs please?
>
> > > dpm_suspend_noirq() is called with all CPUs online and interrupts
> > > enabled. In that case an interrupt pending in IRR does not make any sense
> > > at all. Confused.
> > >
> > For now I use a timeout counter patch - and it is showing 100% irq9
> > jammed and needing rescue. And I am even more confused...
>
> You're not alone, if that gives you a bit of comfort :)
>
> Thanks,
>
> tglx
Woody,
On Wed, 28 Aug 2019, Woody Suwalski wrote:
> I have tried to "bisect" the config changes, and builds working/not
> working between
> rc3-rc4-rc5, and come out with the same frustrating result, that
> building a "clean" kernel is not producing the same behavoir as
> incremental building while bisecting.
So what you say is that:
make clean; make menuconfig (change some option); make
and
make menuconfig (change some option); make
produces different results?
That needs to be fixed first. If you can't trust your build system then you
cannot trust any result it produces.
What's you actual build procedure?
Thanks,
tglx
On Wed, Aug 28, 2019 at 10:50 AM Thomas Gleixner <[email protected]> wrote:
>
> Woody,
>
> On Wed, 28 Aug 2019, Woody Suwalski wrote:
>
> > I have tried to "bisect" the config changes, and builds working/not
> > working between
> > rc3-rc4-rc5, and come out with the same frustrating result, that
> > building a "clean" kernel is not producing the same behavoir as
> > incremental building while bisecting.
>
> So what you say is that:
>
> make clean; make menuconfig (change some option); make
>
> and
>
> make menuconfig (change some option); make
>
> produces different results?
>
> That needs to be fixed first. If you can't trust your build system then you
> cannot trust any result it produces.
>
> What's you actual build procedure?
>
The build procedure: a "clean" one - I have a script/make to build a
.deb by untar the src, patch, copy .config, run make oldconfig, run
make, package.
The bisect and config-change procedures were simpler - I was running
"git bisect bad" and make (followed by make modules_install, copy
bzImage to boot, rebuild initramfs) and reboot. For config changes -
drop new config, hand-merged in steps toward the presumed "good" one,
and run make, and install and reboot...
So I was not running explicitly make oldconfig every time, however I
believe that config has been updated to match other options
selected/unselected by make itself (so I have assumed that make
oldconfig has been automagically run sometime during the build).
But for the bisect procedure I did not run "make clean" at every step,
again - in my former bisections it was not needed, and actually saves
a lot of compilation time toward the end of bisection...
As such I could not directly answer your question - however yes -
building "cleanly" from source seems to produce different results than
doing it incrementaly...
Thanks, Woody