2017-09-18 10:08:51

by Abdul Haleem

Subject: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

Hi,

Dynamic CPU remove operation resulted in Kernel Panic on today's
next-20170915 kernel.

Machine Type: Power 7 PowerVM LPAR
Kernel : 4.13.0-next-20170915
config : attached
test: DLPAR CPU remove


dmesg logs:
----------
cpu 37 (hwid 37) Ready to die...
cpu 38 (hwid 38) Ready to die...
cpu 39 (hwid 39)
******* RTAS CReady to die...
ALL BUFFER CORRUPTION *******
[ 673.435910] Bad kernel stack pointer eec51c8 365: rtas32_callat
480010897c601_buff_ptr=
b78
0000 001E 0000 0001 0000 0002 0000 0027 [...............']
[ 673.435938] Oops: Bad kernel stack pointer, sig: 6 [#1]
0000 0000 0000 0000 0000 0005 0000 0001 [................]
[ 673.435942] BE SMP NR_CPUS=20 C000 0000 0048 NUMA pSeries
01 AF3C 0000 000
0 0001 1248 [.......<.......H]
C000 0000 0032 25D0 0000 0000 0000 0000 [.....2%.........]
0001 0000 0000 0000 0000 0004 0000 0100 [................]
C000 0000 0150 0AA0 C000 0013 FFFF A210 [.....P..........]
0000 0800 .... .... .... .... .... .... [.....P..........]
[ 673.435976] Dumping ftrace buffer:
366: rtas64_map_buff_ptr=
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 0000 0000 FFFF FFFF FFFF D8F1 [................]
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 0000 0000 0000 0000 0000 0000 [................]
0000 0000 .... .... .... .... .... .... [................]
(ftrace buffer empty)
Kernel panic - not syncing: Alas, I survived.

Modules linked in: xt_CHECKSUM(E) iptable_mangle(E) ipt_MASQUERADE(E)
nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E)
nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E)
ipt_REJECT(E) nf_reject_ipv4(E) tun(E) bridge(E) stp(E) llc(E) kvm_pr(E)
kvm(E) rpadlpar_io(E) rpaphp(E) ebtable_filter(E) ebtables(E)
ip6table_filter(E) ip6_tables(E) dccp_diag(E) dccp(E) tcp_diag(E)
udp_diag(E) inet_diag(E) unix_diag(E) af_packet_diag(E) netlink_diag(E)
iptable_filter(E) sg(E) nfsd(E) auth_rpcgss(E) nfs_acl(E) lockd(E)
sunrpc(E) grace(E) binfmt_misc(E) ip_tables(E) ext4(E) mbcache(E)
jbd2(E) sd_mod(E) ibmvscsi(E) ibmveth(E) scsi_transport_srp(E)
Dumping ftrace buffer:
(ftrace buffer empty)
CPU: 0 PID: 8633 Comm: drmgr Tainted: G E 4.13.0-next-20170915-autotest #1
task: c0000000fd49c200 task.stack: c0000000fb824000
NIP: 480010897c601b78 LR: 480010897c601b78 CTR: 0000000000000000
REGS: c00000000ee6fd40 TRAP: 0400 Tainted: G E (4.13.0-next-20170915-autotest)
MSR: 8000000042801000 <SF,VEC,VSX,ME> CR: 22000000 XER: 00000020
CFAR: 000000000ee97a20 SOFTE: -1152921504565094016
GPR00: 480010897c601b78 000000000eec51c8 000000000eea3680 000000000fc7b5c0
GPR04: 0000000000000000 00000000000000e0 000000000000b9fc 000000000000001e
GPR08: 000000000fa3b000 000000000fc7b5c0 000000000fa378f0 0000000000000000
GPR12: 0000000001500a90 c00000000e930000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 c000000000c8c7a0 0000000010000024
GPR20: c000000000e44f74 c0000013ffff0670 0000000000000000 c000000000e44f74
GPR24: c000000000e44f70 c0000000fb8276d0 000000000000001e 0000000000000002
GPR28: 0000000000000001 0000000000000001 0000000000000002 900b00004bfe8545
NIP [480010897c601b78] 0x480010897c601b78
LR [480010897c601b78] 0x480010897c601b78
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace d504e921bec4201a ]---
--

Regards

Abdul Haleem
IBM Linux Technology Centre



Attachments:
p7-vm-config (137.77 kB)

2017-09-18 12:44:37

by Rob Herring

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

On Mon, Sep 18, 2017 at 5:08 AM, Abdul Haleem
<[email protected]> wrote:
> Hi,
>
> Dynamic CPU remove operation resulted in Kernel Panic on today's
> next-20170915 kernel.
>
> Machine Type: Power 7 PowerVM LPAR
> Kernel : 4.13.0-next-20170915

I assume this is not something new to 9/15 -next nor only in -next
because you also reported that 4.13.0 broke. Can you provide some
details on what version worked? 4.12?

Rob

2017-09-19 13:38:18

by Abdul Haleem

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

On Mon, 2017-09-18 at 07:44 -0500, Rob Herring wrote:
> On Mon, Sep 18, 2017 at 5:08 AM, Abdul Haleem
> <[email protected]> wrote:
> > Hi,
> >
> > Dynamic CPU remove operation resulted in Kernel Panic on today's
> > next-20170915 kernel.
> >
> > Machine Type: Power 7 PowerVM LPAR
> > Kernel : 4.13.0-next-20170915
>
> I assume this is not something new to 9/15 -next nor only in -next
> because you also reported that 4.13.0 broke. Can you provide some
> details on what version worked? 4.12?

[linux-next][DLPAR CPU][Oops] Bad kernel stack pointer
[mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

Neither of the above issues is reproducible with 4.12.0 (mainline); both
appear with 4.13.0 and with -next.

--
Regards

Abdul Haleem
IBM Linux Technology Centre



2017-09-20 11:42:25

by Michael Ellerman

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

Abdul Haleem <[email protected]> writes:

> Hi,
>
> Dynamic CPU remove operation resulted in Kernel Panic on today's
> next-20170915 kernel.
>
> Machine Type: Power 7 PowerVM LPAR
> Kernel : 4.13.0-next-20170915
> config : attached
> test: DLPAR CPU remove
>
>
> dmesg logs:
> ----------
> cpu 37 (hwid 37) Ready to die...
> cpu 38 (hwid 38) Ready to die...
> cpu 39 (hwid 39)
> ******* RTAS CReady to die...
> ALL BUFFER CORRUPTION *******

Cool. Does that come from RTAS itself? I have never seen that happen
before.

Is this easily reproducible?

cheers

2017-09-22 09:57:16

by Abdul Haleem

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

On Wed, 2017-09-20 at 21:42 +1000, Michael Ellerman wrote:
> Abdul Haleem <[email protected]> writes:
>
> > Hi,
> >
> > Dynamic CPU remove operation resulted in Kernel Panic on today's
> > next-20170915 kernel.
> >
> > Machine Type: Power 7 PowerVM LPAR
> > Kernel : 4.13.0-next-20170915
> > config : attached
> > test: DLPAR CPU remove
> >
> >
> > dmesg logs:
> > ----------
> > cpu 37 (hwid 37) Ready to die...
> > cpu 38 (hwid 38) Ready to die...
> > cpu 39 (hwid 39)
> > ******* RTAS CReady to die...
> > ALL BUFFER CORRUPTION *******
>
> Cool. Does that come from RTAS itself? I have never seen that happen
> before.

Not sure; the var logs do not have any messages captured. This is the
first time we have hit this type of issue.
>
> Is this easily reproducible?

I am unable to reproduce it again. I will keep an eye on our CI runs for
a few more runs.

--
Regards

Abdul Haleem
IBM Linux Technology Centre



2017-09-22 12:26:16

by Michael Ellerman

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

Abdul Haleem <[email protected]> writes:

> On Wed, 2017-09-20 at 21:42 +1000, Michael Ellerman wrote:
>> Abdul Haleem <[email protected]> writes:
>>
>> > Hi,
>> >
>> > Dynamic CPU remove operation resulted in Kernel Panic on today's
>> > next-20170915 kernel.
>> >
>> > Machine Type: Power 7 PowerVM LPAR
>> > Kernel : 4.13.0-next-20170915
>> > config : attached
>> > test: DLPAR CPU remove
>> >
>> >
>> > dmesg logs:
>> > ----------
>> > cpu 37 (hwid 37) Ready to die...
>> > cpu 38 (hwid 38) Ready to die...
>> > cpu 39 (hwid 39)
>> > ******* RTAS CReady to die...
>> > ALL BUFFER CORRUPTION *******
>>
>> Cool. Does that come from RTAS itself? I have never seen that happen
>> before.
>
> Not sure, the var logs does not have any messages captured. This is
> first time we hit this type of issue.

Yeah it is from RTAS:

# lsprop /proc/device-tree/rtas/linux,rtas-base
/proc/device-tree/rtas/linux,rtas-base
1eca0000 (516554752)
# lsprop /proc/device-tree/rtas/rtas-size
/proc/device-tree/rtas/rtas-size
01360000 (20316160)

# dd if=/dev/mem bs=4096 skip=126112 count=4960 of=rtas.bin
# strings rtas.bin | grep "RTAS CALL BUFFER"
******* RTAS CALL BUFFER CORRUPTION *******


So we were doing an RTAS call and RTAS itself detected that the call
buffer was corrupted. I'm not sure how it detects that, but something is
definitely screwed up.
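
For reference, the skip= and count= values passed to dd above are the
RTAS base and size from the device tree divided by the 4096-byte block
size. The Python sketch below only reproduces that arithmetic; the
helper name dd_args is hypothetical and is not part of any tool
mentioned in this thread.

```python
PAGE_SIZE = 4096  # matches bs=4096 in the dd command above


def dd_args(base, size, block_size=PAGE_SIZE):
    """Return (skip, count) in blocks for a physical region.

    dd_args is a hypothetical helper for illustration only.
    base and size are byte values, e.g. from linux,rtas-base and rtas-size.
    """
    if base % block_size or size % block_size:
        raise ValueError("region is not aligned to the block size")
    return base // block_size, size // block_size


# Values reported by lsprop in the message above.
skip, count = dd_args(0x1ECA0000, 0x01360000)
print(f"dd if=/dev/mem bs={PAGE_SIZE} skip={skip} count={count} of=rtas.bin")
```

On another machine the two property values would differ, so skip and
count would need to be recomputed before dumping the region.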

>> Is this easily reproducible?
>
> I am unable to reproduce it again. I will keep an eye on our CI runs for
> few more runs.

OK thanks.

cheers

2017-09-22 12:38:37

by Abdul Haleem

Subject: Re: [linux-next][DLPAR CPU][Oops] Bad kernel stack pointer

On Fri, 2017-09-22 at 15:27 +0530, Abdul Haleem wrote:
> On Wed, 2017-09-20 at 21:42 +1000, Michael Ellerman wrote:
> > Abdul Haleem <[email protected]> writes:
> >
> > > Hi,
> > >
> > > Dynamic CPU remove operation resulted in Kernel Panic on today's
> > > next-20170915 kernel.
> > >
> > > Machine Type: Power 7 PowerVM LPAR
> > > Kernel : 4.13.0-next-20170915
> > > config : attached
> > > test: DLPAR CPU remove
> > >
> > >
> > > dmesg logs:
> > > ----------
> > > cpu 37 (hwid 37) Ready to die...
> > > cpu 38 (hwid 38) Ready to die...
> > > cpu 39 (hwid 39)
> > > ******* RTAS CReady to die...
> > > ALL BUFFER CORRUPTION *******
> >
> > Cool. Does that come from RTAS itself? I have never seen that happen
> > before.
>
> Not sure, the var logs does not have any messages captured. This is
> first time we hit this type of issue.
> >
> > Is this easily reproducible?
>
> I am unable to reproduce it again. I will keep an eye on our CI runs for
> few more runs.
>

I was able to reproduce it again. The trace looks similar, except that it
does not have the RTAS 'ALL BUFFER CORRUPTION' message.

cpu 36 (hwid 36) Ready to die...
cpu 37 (hwid 37) Ready to die...
cpu 38 (hwid 38) Ready to die...
Bad kernel stack pointer fc7b120 at ee9fdc4
Bad kernel stack pointer fc7b220 at ee9da0c
Oops: Bad kernel stack pointer, sig: 6 [#1]
BE SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: loop xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_pr kvm rpadlpar_io rpaphp ebtable_filter ebtables ip6table_filter ip6_tables dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag iptable_filter netlink_diag sg nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables ext4 mbcache jbd2 sd_mod ibmvscsi scsi_transport_srp ibmveth
CPU: 38 PID: 0 Comm: swapper/38 Not tainted 4.14.0-rc1-next-20170922 #2
task: c0000013f82ea300 task.stack: c0000013f8344000
NIP: 000000000ee9fdc4 LR: 000000000eea0f10 CTR: 000000000ee9fc64
REGS: c00000000eca7d40 TRAP: 0300 Not tainted (4.14.0-rc1-next-20170922)
MSR: 8000000000001000 <SF,ME> CR: 88000004 XER: 00000018
CFAR: 000000000ee9fd5c DAR: 003cf6eaa9e7225f DSISR: 42000000 SOFTE: -9223372036812787662
GPR00: 0000000000000038 000000000fc7b120 000000000ef68b00 000000000ef69000
GPR04: 000000000ef35ea8 000000000fc7b3a0 0000000000000800 0000000000000030
GPR08: 000000000f0f0110 0000000000000008 003cf6eaa9e7223f 0000000000000030
GPR12: 0000000000000000 c00000000e948f00 c0000013f8347f90 000000000eee8040
GPR16: 0000000000000000 c0000000013cfde8 c000000000e43a80 c000000000e43a80
GPR20: 0000000000000000 c000000000e43880 0000000000000098 0000000000000026
GPR24: 0000000000000026 c000000000e44f70 c000000000e44f74 0000000000000002
GPR28: c000000000e44f74 0000000000000001 0000000000000130 000000000fc7b120
NIP [000000000ee9fdc4] 0xee9fdc4
LR [000000000eea0f10] 0xeea0f10
Call Trace:
Instruction dump:
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX XXXXXXXX
---[ end trace 59dc6eb8faf1d63f ]---
Unable to handle kernel paging request for unaligned access at address 0xc000000000e658be
Faulting instruction address: 0xc0000000009f1460
Unable to handle kernel paging request for data at address 0xa08cc8b63900000c
Faulting instruction address: 0xc00000000017c2e4
Unable to handle kernel paging request for unaligned access at address 0xc000000000e624ae
Faulting instruction address: 0xc00000000010cea8
Unable to handle kernel paging request for data at address 0x4d455f54494d45f3
Faulting instruction address: 0xc000000000133b04
Unable to handle kernel paging request for unaligned access at address 0xc000000000e658be
Faulting instruction address: 0xc0000000009f16a4
Unable to handle kernel paging request for unaligned access at address 0xc000000000e6633e
Faulting instruction address: 0xc00000000059414c

Please let me know if you need more logs.
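
As a reading aid for the two register dumps in this thread, the MSR
values can be decoded by hand. The sketch below is an illustration only:
decode_msr and translation_off are hypothetical helpers, and the bit
table is a partial list based on the MSR_*_LG definitions in
arch/powerpc/include/asm/reg.h. In both dumps the IR and DR bits are
clear, which is consistent with address translation being off when the
exception was taken; it does not by itself identify the cause.

```python
# Partial table of MSR bit positions (bit 0 is the least significant bit),
# based on the MSR_*_LG definitions in arch/powerpc/include/asm/reg.h.
MSR_BITS = {
    "SF": 63, "HV": 60, "VEC": 25, "VSX": 23, "EE": 15, "PR": 14,
    "FP": 13, "ME": 12, "IR": 5, "DR": 4, "RI": 1, "LE": 0,
}


def decode_msr(msr):
    """Return the names of the known bits set in msr (hypothetical helper)."""
    return [name for name, bit in MSR_BITS.items() if msr & (1 << bit)]


def translation_off(msr):
    """True when both instruction (IR) and data (DR) relocation are off."""
    return not msr & (1 << MSR_BITS["IR"]) and not msr & (1 << MSR_BITS["DR"])


# MSR values copied from the two oops reports in this thread.
for msr in (0x8000000042801000, 0x8000000000001000):
    print(f"{msr:016x}", decode_msr(msr), "translation off:", translation_off(msr))
```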

--
Regards

Abdul Haleem
IBM Linux Technology Centre



Attachments:
dlparlogs.txt (4.24 kB)