2017-12-23 01:35:46

by Dexuan Cui

Subject: RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

> From: Alexandru Chirvasitu [mailto:[email protected]]
> Sent: Friday, December 22, 2017 14:29
>
> The output of that precise command run just now on a freshly-compiled
> copy of that commit is attached.
>
> On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
> > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > Sent: Friday, December 22, 2017 06:21
> > >
> > > In the absence of logs, the best I can do at the moment is attach a
> > > picture of the screen I am presented with on the apic=debug boot
> > > attempt.
> > > Alex
> >
> > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > IMO we should find which line of code causes the panic. I suppose
> > "objdump -D kernel/irq/matrix.o" can help to do that.
> >
> > Thanks,
> > -- Dexuan

The BUG_ON panic happens at line 147:
BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));

I'm sure Thomas and Dou know it better than me.

137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
138                               bool replace)
139 {
140         struct cpumap *cm = this_cpu_ptr(m->maps);
141
142         BUG_ON(bit > m->matrix_bits);
143         BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
144
145         set_bit(bit, m->system_map);
146         if (replace) {
147                 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
148                 cm->allocated--;
149                 m->total_allocated--;
150         }
151         if (bit >= m->alloc_start && bit < m->alloc_end)
152                 m->systembits_inalloc++;
153
154         trace_irq_matrix_assign_system(bit, m);
155 }

-- Dexuan
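
To illustrate why line 147 can fire: test_and_clear_bit() returns the
previous value of the bit, so the BUG_ON() triggers whenever the bit being
replaced was not set in cm->alloc_map. What follows is only a userspace
sketch of that check, not kernel code. The model_* names, the 256-bit map
size and the non-atomic bit helpers are assumptions made for illustration.

```c
#include <limits.h>
#include <stdbool.h>

#define MODEL_BITS      256 /* assumed map size, for illustration only */
#define MODEL_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_WORDS     ((MODEL_BITS + MODEL_WORD_BITS - 1) / MODEL_WORD_BITS)

/* Hypothetical stand-in for the per-CPU map; not the kernel's struct cpumap. */
struct model_cpumap {
	unsigned int allocated;
	unsigned long alloc_map[MODEL_WORDS];
};

/* Non-atomic userspace analogue of set_bit(). */
void model_set_bit(unsigned int bit, unsigned long *map)
{
	map[bit / MODEL_WORD_BITS] |= 1UL << (bit % MODEL_WORD_BITS);
}

/* Non-atomic analogue of test_and_clear_bit(): returns the old bit value. */
bool model_test_and_clear_bit(unsigned int bit, unsigned long *map)
{
	unsigned long mask = 1UL << (bit % MODEL_WORD_BITS);
	unsigned long *word = &map[bit / MODEL_WORD_BITS];
	bool was_set = (*word & mask) != 0;

	*word &= ~mask;
	return was_set;
}

/*
 * Mirrors the "replace" branch at lines 146-150. Returns false in the
 * case where the kernel would hit the BUG_ON(), i.e. when the bit was
 * not marked as allocated on this CPU.
 */
bool model_replace(struct model_cpumap *cm, unsigned int bit)
{
	if (!model_test_and_clear_bit(bit, cm->alloc_map))
		return false; /* kernel: BUG_ON() fires here */
	cm->allocated--;
	return true;
}
```

Calling model_replace() twice on the same bit succeeds the first time and
fails the second time; the second call corresponds to the situation in
which the kernel would panic.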



2017-12-23 04:50:15

by AC

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

I was just now trying to track down my other issue, whereby somewhere
along the tree kexec stops working properly. In the process of doing
that I realized I had initially made one change to the original 4.9
config beyond oldconfig: I'd turned off WX debugging.

I've now compiled a bunch of versions with WX debugging back on, and
new behavior arises. I am attaching the journalctl log of a booted
4.13 kernel (from Linus' tree, commit 569dbb8).

It boots and logs me in, but returns a call trace I wasn't seeing
without the WX debugging. I am sending it over in case it provides any
information.

The trace bears the 23:24:09 timestamp.

On Sat, Dec 23, 2017 at 01:35:12AM +0000, Dexuan Cui wrote:
> > From: Alexandru Chirvasitu [mailto:[email protected]]
> > Sent: Friday, December 22, 2017 14:29
> >
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> >
> > On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
>
> The BUG_ON panic happens at line 147:
> BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
>
> I'm sure Thomas and Dou know it better than me.
>
> 137 void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
> 138 bool replace)
> 139 {
> 140 struct cpumap *cm = this_cpu_ptr(m->maps);
> 141
> 142 BUG_ON(bit > m->matrix_bits);
> 143 BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
> 144
> 145 set_bit(bit, m->system_map);
> 146 if (replace) {
> 147 BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> 148 cm->allocated--;
> 149 m->total_allocated--;
> 150 }
> 151 if (bit >= m->alloc_start && bit < m->alloc_end)
> 152 m->systembits_inalloc++;
> 153
> 154 trace_irq_matrix_assign_system(bit, m);
> 155 }
>
> -- Dexuan
>


Attachments:
(No filename) (2.41 kB)
journal-4.13-dec22-23_48 (94.34 kB)

2017-12-23 13:33:07

by Thomas Gleixner

Subject: RE: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

On Sat, 23 Dec 2017, Dexuan Cui wrote:

> > From: Alexandru Chirvasitu [mailto:[email protected]]
> > Sent: Friday, December 22, 2017 14:29
> >
> > The output of that precise command run just now on a freshly-compiled
> > copy of that commit is attached.
> >
> > On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
> > > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > > Sent: Friday, December 22, 2017 06:21
> > > >
> > > > In the absence of logs, the best I can do at the moment is attach a
> > > > picture of the screen I am presented with on the apic=debug boot
> > > > attempt.
> > > > Alex
> > >
> > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > IMO we should find which line of code causes the panic. I suppose
> > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > >
> > > Thanks,
> > > -- Dexuan
>
> The BUG_ON panic happens at line 147:
> BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
>
> I'm sure Thomas and Dou know it better than me.

I'll have a look after the holidays.

Thanks,

tglx

2017-12-23 19:59:45

by AC

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

On Sat, Dec 23, 2017 at 02:32:52PM +0100, Thomas Gleixner wrote:
> On Sat, 23 Dec 2017, Dexuan Cui wrote:
>
> > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > Sent: Friday, December 22, 2017 14:29
> > >
> > > The output of that precise command run just now on a freshly-compiled
> > > copy of that commit is attached.
> > >
> > > On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
> > > > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > > > Sent: Friday, December 22, 2017 06:21
> > > > >
> > > > > In the absence of logs, the best I can do at the moment is attach a
> > > > > picture of the screen I am presented with on the apic=debug boot
> > > > > attempt.
> > > > > Alex
> > > >
> > > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > > IMO we should find which line of code causes the panic. I suppose
> > > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > > >
> > > > Thanks,
> > > > -- Dexuan
> >
> > The BUG_ON panic happens at line 147:
> > BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> >
> > I'm sure Thomas and Dou know it better than me.
>
> I'll have a look after the holidays.
>

Thanks for that!

A quick follow-up on my inability to make kexec / kdump work in order
to perhaps produce better logs: I've done another bisect for that with
this result:

# first bad commit: [e802a51ede91350438c051da2f238f5e8c918ead] x86/idt: Consolidate IDT invalidation

I am quite certain this is the one for that issue. Its only parent is

# good: [8f55868f9e42fea56021b17421914b9e4fda4960] x86/idt: Remove unused set_trap_gate()

(i.e. one of the "good" commits I hit upon during the bisect).

On the Core 2 Duo machine I've been referring to, e802a51 and later
commits simply return me to a regular BIOS boot when I either issue
kexec -e on a loaded crash kernel or trigger a crash with echo c >
/proc/sysrq-trigger.


Alex
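
For readers unfamiliar with how a bisect arrives at a "first bad commit":
on a linear history it amounts to a binary search between a known-good and
a known-bad point. The sketch below assumes such a linear history in which
every commit after the first bad one is also bad. The function names and
the predicate are hypothetical, and git's real bisect also has to handle
merges, which this does not model.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical predicate: true if the commit at this index shows the bug. */
typedef bool (*is_bad_fn)(size_t index, void *ctx);

/*
 * Return the index of the first bad commit in (good, bad], assuming the
 * commit at "good" is good, the commit at "bad" is bad, and badness is
 * monotonic in between. If "tests" is non-NULL, it is incremented once
 * per commit that has to be built and booted.
 */
size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad,
			void *ctx, unsigned int *tests)
{
	while (bad - good > 1) {
		size_t mid = good + (bad - good) / 2;

		if (tests)
			(*tests)++;
		if (is_bad(mid, ctx))
			bad = mid;	/* first bad commit is at or before mid */
		else
			good = mid;	/* first bad commit is after mid */
	}
	return bad;
}

/* Example predicate: ctx points to the index of the first bad commit. */
bool bad_from(size_t index, void *ctx)
{
	return index >= *(size_t *)ctx;
}
```

When the loop ends, "good" and "bad" are adjacent, which matches the result
above: the first bad commit together with its parent as the last good one.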

2017-12-24 03:29:52

by Dou Liyang

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Hi Thomas,

At 12/23/2017 09:32 PM, Thomas Gleixner wrote:
[...]
>>
>> The BUG_ON panic happens at line 147:
>> BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
>>
>> I'm sure Thomas and Dou know it better than me.
>
> I'll have a look after the holidays.
>

Merry Christmas! :-)

I am trying to look into it.

Thanks,
dou


2017-12-27 08:14:33

by Dou Liyang

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Hi Alexandru,

At 12/24/2017 04:01 AM, Alexandru Chirvasitu wrote:
> On Sat, Dec 23, 2017 at 02:32:52PM +0100, Thomas Gleixner wrote:
>> On Sat, 23 Dec 2017, Dexuan Cui wrote:
>>
>>>> From: Alexandru Chirvasitu [mailto:[email protected]]
>>>> Sent: Friday, December 22, 2017 14:29
>>>>
>>>> The output of that precise command run just now on a freshly-compiled
>>>> copy of that commit is attached.
>>>>
>>>> On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
>>>>>> From: Alexandru Chirvasitu [mailto:[email protected]]
>>>>>> Sent: Friday, December 22, 2017 06:21
>>>>>>
>>>>>> In the absence of logs, the best I can do at the moment is attach a
>>>>>> picture of the screen I am presented with on the apic=debug boot
>>>>>> attempt.
>>>>>> Alex
>>>>>
>>>>> The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
>>>>> IMO we should find which line of code causes the panic. I suppose
>>>>> "objdump -D kernel/irq/matrix.o" can help to do that.
>>>>>
>>>>> Thanks,
>>>>> -- Dexuan
>>>
>>> The BUG_ON panic happens at line 147:
>>> BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
>>>

There are 2 bugs on your laptop:

1. Hard lockups on both CPUs after login
2. Panic with "apic=debug"

For the second bug, please try the following patch (it needs Thomas's
confirmation :) ) on Linux 4.15-rc5. I think it can fix the panic.

If the second bug is fixed, let's go back to the first bug:

Is Linus' current head, 4.15-rc5, bad as well?

If yes, please boot with "apic=debug" and send the dmesg log.

Thanks,
dou.

------------------------8<-------------------------------------------

irq/matrix: Remove the overused BUG_ON() in irq_matrix_assign_system()

Currently, x86 marks the preallocated legacy interrupts when initializing
IRQs (native_init_IRQ()), but clears them if they are not activated in
vector_configure_legacy().

So, in irq_matrix_assign_system(), replacing a legacy vector that may not
be allocated in a cpumap->alloc_map[] with a system vector triggers the
BUG_ON().

Remove the BUG_ON().

Signed-off-by: Dou Liyang <[email protected]>
---
kernel/irq/matrix.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0ba0dd8863a7..876cbeab9ca2 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -143,11 +143,12 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
 	BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));

 	set_bit(bit, m->system_map);
-	if (replace) {
-		BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
+
+	if (replace && test_and_clear_bit(bit, cm->alloc_map)){
 		cm->allocated--;
 		m->total_allocated--;
 	}
+
 	if (bit >= m->alloc_start && bit < m->alloc_end)
 		m->systembits_inalloc++;

--
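
As a rough illustration of what the patch changes: with the BUG_ON() gone,
the per-CPU and global counters are only decremented when the bit really
was set in the allocation map, so replacing a legacy vector that had
already been cleared no longer panics and does not unbalance the
accounting. The following is a userspace sketch under stated assumptions,
not the kernel implementation: it models a single CPU, uses non-atomic bit
helpers, and all sketch_* names and the 256-bit size are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_BITS      256 /* assumed size, for illustration only */
#define SKETCH_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_WORDS \
	((SKETCH_BITS + SKETCH_WORD_BITS - 1) / SKETCH_WORD_BITS)

/* Hypothetical single-CPU model of the matrix plus its one cpumap. */
struct sketch_matrix {
	unsigned int alloc_start;
	unsigned int alloc_end;
	unsigned int total_allocated;
	unsigned int systembits_inalloc;
	unsigned int cpu_allocated;
	unsigned long system_map[SKETCH_WORDS];
	unsigned long cpu_alloc_map[SKETCH_WORDS];
};

void sketch_set_bit(unsigned int bit, unsigned long *map)
{
	map[bit / SKETCH_WORD_BITS] |= 1UL << (bit % SKETCH_WORD_BITS);
}

bool sketch_test_bit(unsigned int bit, const unsigned long *map)
{
	return (map[bit / SKETCH_WORD_BITS] >> (bit % SKETCH_WORD_BITS)) & 1UL;
}

/* Clears the bit and returns whether it was set beforehand. */
bool sketch_test_and_clear_bit(unsigned int bit, unsigned long *map)
{
	bool was_set = sketch_test_bit(bit, map);

	map[bit / SKETCH_WORD_BITS] &= ~(1UL << (bit % SKETCH_WORD_BITS));
	return was_set;
}

/* Models the patched body: decrement only if the bit was allocated. */
void sketch_assign_system(struct sketch_matrix *m, unsigned int bit,
			  bool replace)
{
	sketch_set_bit(bit, m->system_map);

	if (replace && sketch_test_and_clear_bit(bit, m->cpu_alloc_map)) {
		m->cpu_allocated--;
		m->total_allocated--;
	}

	if (bit >= m->alloc_start && bit < m->alloc_end)
		m->systembits_inalloc++;
}
```

In this model, replacing a bit that is set in cpu_alloc_map decrements both
counters, while replacing a bit that is not set leaves them untouched; in
both cases the bit ends up in system_map.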


2017-12-27 16:16:33

by AC

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

As per instructions, I did the following:

(1)

Checked out

464e1d5 Linux 4.15-rc5

(after getting my copy up to date, fetching, pulling, etc.) and
compiled it as-is. Config attached (the one labeled 'np' for 'no
patch').

Result:

Boot with no extra parameters locks up after login, as before;

apic=debug does not panic, but locks up after login, as before;

noapic logs me in fine, but disables my wired connection (this is also
behaviour I noted previously). It sees the Ethernet card and brings it up,
but dhclient will not connect.

(2)

Applied the patch you sent below to 464e1d5; again config attached,
labeled 'p' for 'patch'. I applied it manually because git apply was
giving me errors and I didn't want to hold us back while I debug (or
rather I should say 'learn to use git apply'; first time doing it).

In any case though, the changes were as you indicated. The diff I get
with 'git show' was

---------------------------------------------------------------
irq/matrix: Remove the overused BUGON() in irq_matrix_assign_system()

diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 0ba0dd8..9292d79 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -143,8 +143,8 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
 	BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));

 	set_bit(bit, m->system_map);
-	if (replace) {
-		BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
+
+	if (replace && test_and_clear_bit(bit, cm->alloc_map)){
 		cm->allocated--;
 		m->total_allocated--;
 	}
---------------------------------------------------------------

I deleted / inserted those lines myself, but I do believe they
precisely match what you sent. Compiled and installed the modified kernel.

Result:

*Exactly* as above, on all three attempts (no parameters, 'apic=debug'
and 'noapic').

---

So the patch doesn't seem to have an effect, and the panicking is no
longer happening in 4.15-rc5 anyway (with 'apic=debug' and no patch).

Perhaps if I go back to the original bad commit in that bisect I did
and apply the patch to *that*... I'll try, but cannot at this precise
moment. I'll get back in a bit.


On Wed, Dec 27, 2017 at 04:14:23PM +0800, Dou Liyang wrote:
> Hi Alexandru,
>
> At 12/24/2017 04:01 AM, Alexandru Chirvasitu wrote:
> > On Sat, Dec 23, 2017 at 02:32:52PM +0100, Thomas Gleixner wrote:
> > > On Sat, 23 Dec 2017, Dexuan Cui wrote:
> > >
> > > > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > > > Sent: Friday, December 22, 2017 14:29
> > > > >
> > > > > The output of that precise command run just now on a freshly-compiled
> > > > > copy of that commit is attached.
> > > > >
> > > > > On Fri, Dec 22, 2017 at 09:31:28PM +0000, Dexuan Cui wrote:
> > > > > > > From: Alexandru Chirvasitu [mailto:[email protected]]
> > > > > > > Sent: Friday, December 22, 2017 06:21
> > > > > > >
> > > > > > > In the absence of logs, the best I can do at the moment is attach a
> > > > > > > > picture of the screen I am presented with on the apic=debug boot
> > > > > > > > attempt.
> > > > > > > Alex
> > > > > >
> > > > > > The panic happens in irq_matrix_assign_system+0x4e/0xd0 in your picture.
> > > > > > IMO we should find which line of code causes the panic. I suppose
> > > > > > "objdump -D kernel/irq/matrix.o" can help to do that.
> > > > > >
> > > > > > Thanks,
> > > > > > -- Dexuan
> > > >
> > > > The BUG_ON panic happens at line 147:
> > > > BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> > > >
>
> There are 2 bugs on your laptop:
>
> 1. Hard lockups on both CPUs after login
> 2. Panic with "apic=debug"
>
> For the second bug, please try the following patch (it needs Thomas's
> confirmation :) ) on Linux 4.15-rc5. I think it can fix the panic.
>
> If the second bug is fixed, let's go back to the first bug:
>
> Is Linus' current head, 4.15-rc5, bad as well?
>
> If yes, please boot with "apic=debug" and send the dmesg log.
>
> Thanks,
> dou.
>
> ------------------------8<-------------------------------------------
>
> irq/matrix: Remove the overused BUG_ON() in irq_matrix_assign_system()
>
> Currently, x86 marks the preallocated legacy interrupts when initializing
> IRQs (native_init_IRQ()), but clears them if they are not activated in
> vector_configure_legacy().
>
> So, in irq_matrix_assign_system(), replacing a legacy vector that may not
> be allocated in a cpumap->alloc_map[] with a system vector triggers the
> BUG_ON().
>
> Remove the BUG_ON().
>
> Signed-off-by: Dou Liyang <[email protected]>
> ---
> kernel/irq/matrix.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
> index 0ba0dd8863a7..876cbeab9ca2 100644
> --- a/kernel/irq/matrix.c
> +++ b/kernel/irq/matrix.c
> @@ -143,11 +143,12 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit,
> BUG_ON(m->online_maps > 1 || (m->online_maps && !replace));
>
> set_bit(bit, m->system_map);
> - if (replace) {
> - BUG_ON(!test_and_clear_bit(bit, cm->alloc_map));
> +
> + if (replace && test_and_clear_bit(bit, cm->alloc_map)){
> cm->allocated--;
> m->total_allocated--;
> }
> +
> if (bit >= m->alloc_start && bit < m->alloc_end)
> m->systembits_inalloc++;
>
> --
>
>


Attachments:
(No filename) (5.22 kB)
config-np (189.95 kB)
config-p (189.95 kB)

2017-12-28 02:06:42

by Dou Liyang

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Hi Alexandru,

Thanks for testing !
At 12/28/2017 12:18 AM, Alexandru Chirvasitu wrote:
> As per instructions, I did the following:
>
> (1)
>
> Checked out
>
> 464e1d5 Linux 4.15-rc5
>
> (after getting my copy up to date, fetching, pulling, etc.) and
> compiled it as-is. Config attached (the one labeled 'np' for 'no
> patch').
>
> Result:
>
> Boot with no extra parameters locks up after login, as before;
>
> apic=debug does not panic, but locks up after login, as before;
>
I would also like to see the log from an "apic=debug" boot, collected with
the "journalctl" command, even though the logs don't contain the lockup
trace.

Thanks,
dou.
>



2017-12-28 02:51:22

by AC

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

On Thu, Dec 28, 2017 at 10:06:25AM +0800, Dou Liyang wrote:
> Hi Alexandru,
>
> Thanks for testing !
> At 12/28/2017 12:18 AM, Alexandru Chirvasitu wrote:
> > As per instructions, I did the following:
> >
> > (1)
> >
> > Checked out
> >
> > 464e1d5 Linux 4.15-rc5
> >
> > (after getting my copy up to date, fetching, pulling, etc.) and
> > compiled it as-is. Config attached (the one labeled 'np' for 'no
> > patch').
> >
> > Result:
> >
> > Boot with no extra parameters locks up after login, as before;
> >
> > apic=debug does not panic, but locks up after login, as before;
> >
> I would also like to see the log from an "apic=debug" boot, collected with
> the "journalctl" command, even though the logs don't contain the lockup
> trace.

Ah, of course. Attached is the output of `journalctl --boot=-1` after
booting, getting locked up, and then rebooting a good kernel.

Slightly different version of 4.15-rc5; this one has both patches
applied, yours and Linus' for kexec, but the latter shouldn't make a
difference.

---

You'll see another trace in there that's been bugging me, about W+X
checking. I'm not qualified to judge how related they are, but during
these past few days I've compiled and tested many kernels, and many of
them have exhibited the W+X thing but *not* the lockups.

I hope to trace that one back to the original commit with another
bisect one of these days, but they do seem to be different issues.

>
> Thanks,
> dou.
> >
>
>
>


Attachments:
(No filename) (1.40 kB)
4.15-rc5-apicdebug-journal (72.07 kB)

2017-12-28 10:23:53

by Dou Liyang

Subject: Re: PROBLEM: 4.15.0-rc3 APIC causes lockups on Core 2 Duo laptop

Hi Alexandru,

At 12/28/2017 10:51 AM, Alexandru Chirvasitu wrote:
> Ah, of course. Attached is the output of `journalctl --boot=-1` after
> booting, getting locked up, and then rebooting a good kernel.
>
For the Hard lockups on both CPUs after login:

Please try the patch in the attachment by

git am ./0001-x86-vector-Replace-the-raw_spin_lock-with-raw_spin_l.patch

or

patch -p1 < ./0001-x86-vector-Replace-the-raw_spin_lock-with-raw_spin_l.patch

> Slightly different version of 4.15-rc5; this one has both patches
> applied, yours and Linus' for kexec, but the latter shouldn't make a
> difference.
>
> ---
>
> You'll see another trace in there that's been bugging me, about W+X
> checking. I'm not qualified to judge how related they are, but during
> these past few days I've compiled and tested many kernels, and many of
> them have exhibited the W+X thing but *not* the lockups.
>

Yes, I saw it, but I am not familiar with it and have no idea what causes it.

Thanks,
dou.

---------------------8<--------------------------------------------



Attachments:
0001-x86-vector-Replace-the-raw_spin_lock-with-raw_spin_l.patch (1.35 kB)