2017-09-15 11:04:53

by Abdul Haleem

[permalink] [raw]
Subject: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

Hi,

Mainline kernel panics during DLPAR CPU add/remove operation.

Machine Type: Power8 PowerVM LPAR
kernel 4.13.0
gcc 6.3.1
config: attached


trace messaged:
-------------
drmgr: -c cpu -d 5 -w 30 -r
cpu 8 (hwid 8) Ready to die...
cpu 9 (hwid 9) Ready to die...
cpu 10 (hwid 10) Ready to die...
cpu 11 (hwid 11) Ready to die...
cpu 12 (hwid 12) Ready to die...
cpu 13 (hwid 13) Ready to die...
cpu 14 (hwid 14) Ready to die...
cpu 15 (hwid 15) Ready to die...
OF: ERROR: Bad of_node_put() on /cpus
CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
Call Trace:
[c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
(unreliable)
[c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
[c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
[c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
[c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
[c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
[c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
[c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
[c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
[c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
[c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
[c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
[c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
[c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
[c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
[c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
[c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
[c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c

[....]

OF: ERROR: Bad of_node_put() on /cpus
CPU: 2 PID: 0 Comm: swapper/2 Not tainted 4.13.0-autotest #1
Call Trace:
[c0000002719dfcc0] [c00000000278a2a4] dump_stack+0x15c/0x1f8
(unreliable)
[c0000002719dfd00] [c0000000025371a4] of_node_release+0x1a4/0x1c0
[c0000002719dfd90] [c0000000027948c8] kobject_put+0x1a8/0x310
[c0000002719dfe10] [c00000000253519c] of_node_put+0x4c/0x80
[c0000002719dfe30] [c000000002529f2c] of_get_next_parent+0xac/0x120
[c0000002719dfe70] [c0000000016170cc] of_get_ibm_chip_id+0x8c/0x140
[c0000002719dfec0] [c00000000161720c] cpu_to_chip_id+0x8c/0xd0
[c0000002719dfef0] [c00000000164d478] start_secondary+0x258/0x8d0
[c0000002719dff90] [c0000000015eb180] start_secondary_resume+0x10/0x14
Unable to handle kernel paging request for data at address
0xc000306574617473
Faulting instruction address: 0xc000000001a89664
Oops: Kernel access of bad area, sig: 11 [#1]
LE SMP NR_CPUS=2048 NUMA pSeries
Dumping ftrace buffer:
(ftrace buffer empty)
Modules linked in: rcutorture torture rpadlpar_io rpaphp bridge stp llc
xt_tcpudp ipt_REJECT nf_reject_ipv4 xt_conntrack nfnetlink
iptable_mangle iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4
nf_nat nf_conntrack iptable_filter vmx_crypto pseries_rng rng_core nfsd
binfmt_misc ip_tables x_tables autofs4
CPU: 2 PID: 17 Comm: cpuhp/2 Not tainted 4.13.0-autotest #1
task: c000000271a62600 task.stack: c000000271a6c000
NIP: c000000001a89664 LR: c000000001a89638 CTR: 0000000000000000
REGS: c000000271a6f520 TRAP: 0300 Not tainted (4.13.0-autotest)
MSR: 800000000280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28000828
XER: 00000000
CFAR: c0000000015e87d8 DAR: c000306574617473 DSISR: 40000000 SOFTE: 1
GPR00: c0000000019d16b8 c000000271a6f7a0 c000000003561e00
0000000000000001
GPR04: 00000000014000c0 000000027ce71000 0000000000000fe1
c0000000035c6810
GPR08: 000000002e091d1e 000000000000e292 0000000000000000
c000000271a6f7ff
GPR12: 0000000000000000 c00000000fdc0900 c000000001792718
c00000026c0012c0
GPR16: 0000000000000000 c000000002b93350 c000000003466670
c00000000373f938
GPR20: 00000000014080c0 c000000003a30300 c000000002b036d0
c00000027fb953d8
GPR24: c000306574617473 c00000027e007c00 c0000000019d17c8
00000000014000c0
GPR28: 0000000000000007 c00000027e03e438 c00000027e007c00
c0000000037ab3f0
NIP [c000000001a89664] __kmalloc_track_caller+0x5d4/0x680
LR [c000000001a89638] __kmalloc_track_caller+0x5a8/0x680
Call Trace:
[c000000271a6f7a0] [c000000271a6f7d0] 0xc000000271a6f7d0 (unreliable)
[c000000271a6f800] [c0000000019d16b8] kstrdup+0x98/0x110
[c000000271a6f840] [c0000000019d17c8] kstrdup_const+0x98/0xc0
[c000000271a6f860] [c000000001bf3090] __kernfs_new_node+0x80/0x310
[c000000271a6f8d0] [c000000001bf571c] kernfs_new_node+0x5c/0xe0
[c000000271a6f900] [c000000001bf643c] kernfs_create_dir_ns+0x5c/0x170
[c000000271a6f950] [c000000001bfc5fc] sysfs_create_dir_ns+0x9c/0x190
[c000000271a6f990] [c000000002795d60] kobject_add_internal+0x1c0/0x6d0
[c000000271a6fa20] [c000000002796594] kobject_add_varg+0x74/0xf0
[c000000271a6faa0] [c00000000279668c] kobject_init_and_add+0x7c/0xb0
[c000000271a6fae0] [c0000000024f85bc] cpuidle_add_device_sysfs
+0x15c/0x370
[c000000271a6fbd0] [c0000000024f4a54] cpuidle_enable_device+0x114/0x230
[c000000271a6fc20] [c0000000024f97bc] pseries_cpuidle_cpu_online
+0xbc/0x100
[c000000271a6fc50] [c000000001743f08] cpuhp_invoke_callback+0x288/0xb50
[c000000271a6fcd0] [c000000001744b50] cpuhp_up_callbacks+0xb0/0x250
[c000000271a6fd20] [c000000001745fc4] cpuhp_thread_fun+0x214/0x280
[c000000271a6fd60] [c00000000179b9f4] smpboot_thread_fn+0x354/0x480
[c000000271a6fdc0] [c00000000179295c] kthread+0x24c/0x2e0
[c000000271a6fe30] [c0000000015ebc60] ret_from_kernel_thread+0x5c/0x7c
Instruction dump:
4bb73f35 60000000 e95f66a0 e93f66a8 2fb80000 394a0001 39290001 f95f66a0
f93f66a8 419e003c e95e0022 e93f66b0 <7d58502a> 39290001 f93f66b0
2faa0000
---[ end trace 9dc6afde15547583 ]---

Kernel panic - not syncing: Fatal exception

Regard's
Abdul Haleem
IBM Linux Technology Center


Attachments:
alpine-4k-pagesize (85.89 kB)

2017-09-15 12:53:12

by Rob Herring

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

On Fri, Sep 15, 2017 at 6:04 AM, abdul <[email protected]> wrote:
> Hi,
>
> Mainline kernel panics during DLPAR CPU add/remove operation.
>
> Machine Type: Power8 PowerVM LPAR
> kernel 4.13.0

Did 4.12 work or when was it last working? I'm not seeing anything
recent in the DT code that looks suspicious.

Rob

2017-09-19 13:36:27

by Abdul Haleem

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

On Fri, 2017-09-15 at 07:52 -0500, Rob Herring wrote:
> On Fri, Sep 15, 2017 at 6:04 AM, abdul <[email protected]> wrote:
> > Hi,
> >
> > Mainline kernel panics during DLPAR CPU add/remove operation.
> >
> > Machine Type: Power8 PowerVM LPAR
> > kernel 4.13.0
>
> Did 4.12 work or when was it last working? I'm not seeing anything
> recent in the DT code that looks suspicious.

The issue was not seen with 4.12.0

--
Regard's

Abdul Haleem
IBM Linux Technology Centre



2017-09-20 11:39:13

by Michael Ellerman

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

Rob Herring <[email protected]> writes:

> On Fri, Sep 15, 2017 at 6:04 AM, abdul <[email protected]> wrote:
>> Hi,
>>
>> Mainline kernel panics during DLPAR CPU add/remove operation.
>>
>> Machine Type: Power8 PowerVM LPAR
>> kernel 4.13.0
>
> Did 4.12 work or when was it last working? I'm not seeing anything
> recent in the DT code that looks suspicious.

I'm pretty sure it's:

int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
{
int rc;

dn->parent = parent;

rc = of_attach_node(dn);
if (rc) {
printk(KERN_ERR "Failed to add device node %pOF\n", dn);
return rc;
}

of_node_put(dn->parent);
HERE ^^^^^^^^^^

return 0;
}


Prior to 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node
dependency on full path"), we re-looked up the parent, and got another
reference on it. That meant the put before the return there was correct.
But now it's not because the caller has a reference to parent but it's
not ours to drop.

Testing a fix, will report back.

cheers

2017-09-20 17:23:09

by Tyrel Datwyler

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

On 09/20/2017 04:39 AM, Michael Ellerman wrote:
> Rob Herring <[email protected]> writes:
>
>> On Fri, Sep 15, 2017 at 6:04 AM, abdul <[email protected]> wrote:
>>> Hi,
>>>
>>> Mainline kernel panics during DLPAR CPU add/remove operation.
>>>
>>> Machine Type: Power8 PowerVM LPAR
>>> kernel 4.13.0
>>
>> Did 4.12 work or when was it last working? I'm not seeing anything
>> recent in the DT code that looks suspicious.
>
> I'm pretty sure it's:
>
> int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
> {
> int rc;
>
> dn->parent = parent;
>
> rc = of_attach_node(dn);
> if (rc) {
> printk(KERN_ERR "Failed to add device node %pOF\n", dn);
> return rc;
> }
>
> of_node_put(dn->parent);
> HERE ^^^^^^^^^^
>
> return 0;
> }
>
>
> Prior to 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node
> dependency on full path"), we re-looked up the parent, and got another
> reference on it. That meant the put before the return there was correct.
> But now it's not because the caller has a reference to parent but it's
> not ours to drop.
>
> Testing a fix, will report back.

So, that patch slipped past me. Not only is the parent reference not ours to drop, but
when I went and looked at dlpar_cpu_add() I also noticed that of_node_put() was done on
the parent prior to the call to dlpar_attach_node(). With the addition of "parent" to the
dlpar_attach_node() parameter list dlpar_cpu_add() needs to be fixed up to hold the
"parent" reference until after dlpar_attach_node() returns.

-Tyrel

>
> cheers
>

2017-09-20 21:03:06

by Tyrel Datwyler

[permalink] [raw]
Subject: [PATCH 2/2] powerpc/pseries: fix parent_dn reference leak in add_dt_node()

A reference to the parent device node is held by add_dt_node() for the
node to be added. If the call to dlpar_configure_connector() fails
add_dt_node() returns ENOENT and that reference is not freed.

Add a call to of_node_put(parent_dn) prior to bailing out after a failed
dlpar_configure_connector() call.

Cc: [email protected] # v3.12+
Fixes: 8d5ff320766f ("powerpc/pseries: Make dlpar_configure_connector parent node aware")
Signed-off-by: Tyrel Datwyler <[email protected]>
---
arch/powerpc/platforms/pseries/mobility.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/mobility.c b/arch/powerpc/platforms/pseries/mobility.c
index 210ce63..f7042ad 100644
--- a/arch/powerpc/platforms/pseries/mobility.c
+++ b/arch/powerpc/platforms/pseries/mobility.c
@@ -226,8 +226,10 @@ static int add_dt_node(__be32 parent_phandle, __be32 drc_index)
return -ENOENT;

dn = dlpar_configure_connector(drc_index, parent_dn);
- if (!dn)
+ if (!dn) {
+ of_node_put(parent_dn);
return -ENOENT;
+ }

rc = dlpar_attach_node(dn, parent_dn);
if (rc)
--
1.8.3.1

2017-09-20 21:03:08

by Tyrel Datwyler

[permalink] [raw]
Subject: [PATCH 1/2] powerpc/pseries: fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR

Commit 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on
full path") reworked dlpar_attach_node() to no longer look up the parent
node "/cpus", but instead to have the parent node passed by the caller in the
function parameter list. As a result dlpar_attach_node() is no longer
responsible for freeing the reference to the parent node. However,
commit 215ee763f8cb failed to remove the of_node_put(parent) call in
dlpar_attach_node(), or to take into account that the reference to the
parent in the caller dlpar_cpu_add() needs to be held until after
dlpar_attach_node() returns. As a result doing repeated cpu add/remove dlpar
operations will eventually result in the following error:

OF: ERROR: Bad of_node_put() on /cpus
CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
Call Trace:
[c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
(unreliable)
[c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
[c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
[c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
[c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
[c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
[c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
[c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
[c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
[c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
[c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
[c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
[c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
[c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
[c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
[c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
[c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
[c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c

Fix the issue by removing the of_node_put(parent) call from
dlpar_attach_node(), and ensuring that the reference to the parent node
is properly held and released by the caller dlpar_cpu_add().

Cc: [email protected] # v4.13+
Fixes: 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
Signed-off-by: Tyrel Datwyler <[email protected]>
Reported-by: Abdul Haleem <[email protected]>
---
arch/powerpc/platforms/pseries/dlpar.c | 1 -
arch/powerpc/platforms/pseries/hotplug-cpu.c | 4 +++-
2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/dlpar.c b/arch/powerpc/platforms/pseries/dlpar.c
index 783f363..e45b5f1 100644
--- a/arch/powerpc/platforms/pseries/dlpar.c
+++ b/arch/powerpc/platforms/pseries/dlpar.c
@@ -266,7 +266,6 @@ int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
return rc;
}

- of_node_put(dn->parent);
return 0;
}

diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
index fc0d8f9..473d817 100644
--- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
+++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
@@ -462,15 +462,17 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
}

dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
- of_node_put(parent);
if (!dn) {
pr_warn("Failed call to configure-connector, drc index: %x\n",
drc_index);
dlpar_release_drc(drc_index);
+ of_node_put(parent);
return -EINVAL;
}

rc = dlpar_attach_node(dn, parent);
+ of_node_put(parent);
+
if (rc) {
saved_rc = rc;
pr_warn("Failed to attach node %s, rc: %d, drc index: %x\n",
--
1.8.3.1

2017-09-21 09:54:55

by Michael Ellerman

[permalink] [raw]
Subject: Re: [PATCH 1/2] powerpc/pseries: fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR

Hi Tyrel,

Thanks for jumping on this.

Tyrel Datwyler <[email protected]> writes:
> Commit 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on
> full path") reworked dlpar_attach_node() to no longer look up the parent
> node "/cpus", but instead to have the parent node passed by the caller in the
> function parameter list. As a result dlpar_attach_node() is no longer
> responsible for freeing the reference to the parent node. However,
> commit 215ee763f8cb failed to remove the of_node_put(parent) call in
> dlpar_attach_node(), or to take into account that the reference to the
> parent in the caller dlpar_cpu_add() needs to be held until after
> dlpar_attach_node() returns. As a result doing repeated cpu add/remove dlpar
> operations will eventually result in the following error:
>
> OF: ERROR: Bad of_node_put() on /cpus
> CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
> Call Trace:
> [c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
> (unreliable)
> [c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
> [c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
> [c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
> [c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
> [c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
> [c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
> [c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
> [c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
> [c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
> [c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
> [c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
> [c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
> [c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
> [c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
> [c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
> [c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
> [c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c

I usually omit all the addresses in the change log, they don't add much
for future readers. I fixed it up myself.

> Fix the issue by removing the of_node_put(parent) call from
> dlpar_attach_node(), and ensuring that the reference to the parent node
> is properly held and released by the caller dlpar_cpu_add().
>
> Fixes: 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
> Cc: [email protected] # v4.13+

It doesn't need to go to stable, v4.13 doesn't have that commit, it's
only in v4.14-rc1.

I dropped the stable tag.

> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> index fc0d8f9..473d817 100644
> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
> @@ -462,15 +462,17 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
> }
>
> dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
> - of_node_put(parent);
> if (!dn) {
> pr_warn("Failed call to configure-connector, drc index: %x\n",
> drc_index);
> dlpar_release_drc(drc_index);
> + of_node_put(parent);
> return -EINVAL;
> }
>
> rc = dlpar_attach_node(dn, parent);

I added a comment here that I had in my version:

+ /* Regardless we are done with parent now */

> + of_node_put(parent);
> +
> if (rc) {
> saved_rc = rc;
> pr_warn("Failed to attach node %s, rc: %d, drc index: %x\n",

cheers

2017-09-21 09:57:14

by Michael Ellerman

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

Tyrel Datwyler <[email protected]> writes:
> On 09/20/2017 04:39 AM, Michael Ellerman wrote:
>> Rob Herring <[email protected]> writes:
>>> On Fri, Sep 15, 2017 at 6:04 AM, abdul <[email protected]> wrote:
>>>>
>>>> Mainline kernel panics during DLPAR CPU add/remove operation.
>>>>
>>>> Machine Type: Power8 PowerVM LPAR
>>>> kernel 4.13.0
>>>
>>> Did 4.12 work or when was it last working? I'm not seeing anything
>>> recent in the DT code that looks suspicious.
>>
>> I'm pretty sure it's:
>>
>> int dlpar_attach_node(struct device_node *dn, struct device_node *parent)
>> {
>> int rc;
>>
>> dn->parent = parent;
>>
>> rc = of_attach_node(dn);
>> if (rc) {
>> printk(KERN_ERR "Failed to add device node %pOF\n", dn);
>> return rc;
>> }
>>
>> of_node_put(dn->parent);
>> HERE ^^^^^^^^^^
>>
>> return 0;
>> }
>>
>>
>> Prior to 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node
>> dependency on full path"), we re-looked up the parent, and got another
>> reference on it. That meant the put before the return there was correct.
>> But now it's not because the caller has a reference to parent but it's
>> not ours to drop.
>>
>> Testing a fix, will report back.
>
> So, that patch slipped past me. Not only is the parent reference not ours to drop, but
> when I went and looked at dlpar_cpu_add() I also noticed that of_node_put() was done on
> the parent prior to the call to dlpar_attach_node(). With the addition of "parent" to the
> dlpar_attach_node() parameter list dlpar_cpu_add() needs to be fixed up to hold the
> "parent" reference until after dlpar_attach_node() returns.

Yep. I wrote the same patch :)

Rob asked me to test it, which I did, but /cpus starts out with an
elevated ref count, so you have to do ~30 (on my system) DLPAR removes
to hit the bug, which I didn't do.

I've updated my test script to do roughly $(nproc) x 10 DLPAR removes,
which is hopefully sufficient to catch these bugs in future.

cheers

2017-09-21 18:41:46

by Tyrel Datwyler

[permalink] [raw]
Subject: Re: [PATCH 1/2] powerpc/pseries: fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR

On 09/21/2017 02:54 AM, Michael Ellerman wrote:
> Hi Tyrel,
>
> Thanks for jumping on this.
>
> Tyrel Datwyler <[email protected]> writes:
>> Commit 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on
>> full path") reworked dlpar_attach_node() to no longer look up the parent
>> node "/cpus", but instead to have the parent node passed by the caller in the
>> function parameter list. As a result dlpar_attach_node() is no longer
>> responsible for freeing the reference to the parent node. However,
>> commit 215ee763f8cb failed to remove the of_node_put(parent) call in
>> dlpar_attach_node(), or to take into account that the reference to the
>> parent in the caller dlpar_cpu_add() needs to be held until after
>> dlpar_attach_node() returns. As a result doing repeated cpu add/remove dlpar
>> operations will eventually result in the following error:
>>
>> OF: ERROR: Bad of_node_put() on /cpus
>> CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
>> Call Trace:
>> [c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
>> (unreliable)
>> [c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
>> [c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
>> [c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
>> [c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
>> [c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
>> [c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
>> [c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
>> [c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
>> [c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
>> [c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
>> [c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
>> [c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
>> [c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
>> [c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
>> [c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
>> [c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
>> [c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c
>
> I usually omit all the addresses in the change log, they don't add much
> for future readers. I fixed it up myself.

Ok, good point. I'll be sure to try and remember for future patches.

>
>> Fix the issue by removing the of_node_put(parent) call from
>> dlpar_attach_node(), and ensuring that the reference to the parent node
>> is properly held and released by the caller dlpar_cpu_add().
>>
>> Fixes: 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
>> Cc: [email protected] # v4.13+
>
> It doesn't need to go to stable, v4.13 doesn't have that commit, it's
> only in v4.14-rc1.

My bad, Abdul mentioned 4.13.0 in his email, but sure enough you are correct.

[root@ltcalpine2-lp2 linux]# git tag --contains 215ee763f8cb
v4.14-rc1

Thanks for catching that.

>
> I dropped the stable tag.
>
>> diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> index fc0d8f9..473d817 100644
>> --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c
>> @@ -462,15 +462,17 @@ static ssize_t dlpar_cpu_add(u32 drc_index)
>> }
>>
>> dn = dlpar_configure_connector(cpu_to_be32(drc_index), parent);
>> - of_node_put(parent);
>> if (!dn) {
>> pr_warn("Failed call to configure-connector, drc index: %x\n",
>> drc_index);
>> dlpar_release_drc(drc_index);
>> + of_node_put(parent);
>> return -EINVAL;
>> }
>>
>> rc = dlpar_attach_node(dn, parent);
>
> I added a comment here that I had in my version:
Sure np.

-Tyrel

>
> + /* Regardless we are done with parent now */
>
>> + of_node_put(parent);
>> +
>> if (rc) {
>> saved_rc = rc;
>> pr_warn("Failed to attach node %s, rc: %d, drc index: %x\n",
>
> cheers
>

2017-09-21 18:48:30

by Tyrel Datwyler

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

On 09/21/2017 02:57 AM, Michael Ellerman wrote:
> Tyrel Datwyler <[email protected]> writes:
>> On 09/20/2017 04:39 AM, Michael Ellerman wrote:
>>> Rob Herring <[email protected]> writes:

<snip>

>>>
>>> Testing a fix, will report back.
>>
>> So, that patch slipped past me. Not only is the parent reference not ours to drop, but
>> when I went and looked at dlpar_cpu_add() I also noticed that of_node_put() was done on
>> the parent prior to the call to dlpar_attach_node(). With the addition of "parent" to the
>> dlpar_attach_node() parameter list dlpar_cpu_add() needs to be fixed up to hold the
>> "parent" reference until after dlpar_attach_node() returns.
>
> Yep. I wrote the same patch :)
>
> Rob asked me to test it, which I did, but /cpus starts out with an
> elevated ref count, so you have to do ~30 (on my system) DLPAR removes
> to hit the bug, which I didn't do.

Yeah, there are a lot of things that grab references to /cpus. So, I had a good idea that
I needed to loop a few times adding and removing multiple cpus to trigger the issue. Its
also obvious when using those OF trace points I wrote a while back that refcount for /cpus
is dropping off uncharacteristically in response to symmetrical adds/removes of cpus. I
saw your note about getting that patchset resubmitted. I'll try and get that queued back
up soon.

-Tyrel

>
> I've updated my test script to do roughly $(nproc) x 10 DLPAR removes,
> which is hopefully sufficient to catch these bugs in future.
>
> cheers
>

2017-09-22 01:03:36

by Michael Ellerman

[permalink] [raw]
Subject: Re: [1/2] powerpc/pseries: fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR

On Wed, 2017-09-20 at 21:02:51 UTC, Tyrel Datwyler wrote:
> Commit 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on
> full path") reworked dlpar_attach_node() to no longer look up the parent
> node "/cpus", but instead to have the parent node passed by the caller in the
> function parameter list. As a result dlpar_attach_node() is no longer
> responsible for freeing the reference to the parent node. However,
> commit 215ee763f8cb failed to remove the of_node_put(parent) call in
> dlpar_attach_node(), or to take into account that the reference to the
> parent in the caller dlpar_cpu_add() needs to be held until after
> dlpar_attach_node() returns. As a result doing repeated cpu add/remove dlpar
> operations will eventually result in the following error:
>
> OF: ERROR: Bad of_node_put() on /cpus
> CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
> Call Trace:
> [c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
> (unreliable)
> [c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
> [c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
> [c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
> [c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
> [c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
> [c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
> [c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
> [c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
> [c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
> [c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
> [c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
> [c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
> [c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
> [c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
> [c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
> [c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
> [c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c
>
> Fix the issue by removing the of_node_put(parent) call from
> dlpar_attach_node(), and ensuring that the reference to the parent node
> is properly held and released by the caller dlpar_cpu_add().
>
> Cc: [email protected] # v4.13+
> Fixes: 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
> Signed-off-by: Tyrel Datwyler <[email protected]>
> Reported-by: Abdul Haleem <[email protected]>

Series applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/087ff6a5ae3052bb2835e191094b79

cheers

2017-09-22 10:06:06

by Abdul Haleem

[permalink] [raw]
Subject: Re: [1/2] powerpc/pseries: fix "OF: ERROR: Bad of_node_put() on /cpus" during DLPAR

On Fri, 2017-09-22 at 11:03 +1000, Michael Ellerman wrote:
> On Wed, 2017-09-20 at 21:02:51 UTC, Tyrel Datwyler wrote:
> > Commit 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on
> > full path") reworked dlpar_attach_node() to no longer look up the parent
> > node "/cpus", but instead to have the parent node passed by the caller in the
> > function parameter list. As a result dlpar_attach_node() is no longer
> > responsible for freeing the reference to the parent node. However,
> > commit 215ee763f8cb failed to remove the of_node_put(parent) call in
> > dlpar_attach_node(), or to take into account that the reference to the
> > parent in the caller dlpar_cpu_add() needs to be held until after
> > dlpar_attach_node() returns. As a result doing repeated cpu add/remove dlpar
> > operations will eventually result in the following error:
> >
> > OF: ERROR: Bad of_node_put() on /cpus
> > CPU: 0 PID: 10896 Comm: drmgr Not tainted 4.13.0-autotest #1
> > Call Trace:
> > [c00000026ecdf810] [c00000000278a2a4] dump_stack+0x15c/0x1f8
> > (unreliable)
> > [c00000026ecdf850] [c0000000025371a4] of_node_release+0x1a4/0x1c0
> > [c00000026ecdf8e0] [c0000000027948c8] kobject_put+0x1a8/0x310
> > [c00000026ecdf960] [c000000002794bdc] kobject_del+0xbc/0xf0
> > [c00000026ecdf990] [c000000002535ff4] __of_detach_node_sysfs+0x144/0x210
> > [c00000026ecdf9d0] [c000000002536f70] of_detach_node+0xf0/0x180
> > [c00000026ecdfa40] [c0000000016ed494] dlpar_detach_node+0xc4/0x120
> > [c00000026ecdfa80] [c0000000016f47d0] dlpar_cpu_remove+0x280/0x560
> > [c00000026ecdfb60] [c0000000016f4d9c] dlpar_cpu_release+0xbc/0x1b0
> > [c00000026ecdfbb0] [c00000000161279c] arch_cpu_release+0x6c/0xb0
> > [c00000026ecdfbe0] [c00000000218ebf0] cpu_release_store+0xa0/0x100
> > [c00000026ecdfc20] [c000000002178388] dev_attr_store+0x68/0xa0
> > [c00000026ecdfc50] [c000000001bfaae8] sysfs_kf_write+0xa8/0xf0
> > [c00000026ecdfc80] [c000000001bf8a3c] kernfs_fop_write+0x2cc/0x400
> > [c00000026ecdfce0] [c000000001ad33fc] __vfs_write+0x5c/0x340
> > [c00000026ecdfd80] [c000000001ad89e8] vfs_write+0x1a8/0x3d0
> > [c00000026ecdfdd0] [c000000001ad9178] SyS_write+0xa8/0x1a0
> > [c00000026ecdfe30] [c0000000015eb8e0] system_call+0x58/0x6c
> >
> > Fix the issue by removing the of_node_put(parent) call from
> > dlpar_attach_node(), and ensuring that the reference to the parent node
> > is properly held and released by the caller dlpar_cpu_add().
> >
> > Cc: [email protected] # v4.13+
> > Fixes: 215ee763f8cb ("powerpc: pseries: remove dlpar_attach_node dependency on full path")
> > Signed-off-by: Tyrel Datwyler <[email protected]>
> > Reported-by: Abdul Haleem <[email protected]>
>
> Series applied to powerpc fixes, thanks.
>
> https://git.kernel.org/powerpc/c/087ff6a5ae3052bb2835e191094b79

The patch fixes the problem, No warnings seen for 100 iterations of
DLPAR CPU add/remove operation.

Thanks Tyrel and Michael.

Tested-by: Abdul Haleem <[email protected]>

--
Regard's

Abdul Haleem
IBM Linux Technology Centre



2017-09-22 11:59:28

by Michael Ellerman

[permalink] [raw]
Subject: Re: [mainline][DLPAR][Oops] OF: ERROR: Bad of_node_put() on /cpus

Tyrel Datwyler <[email protected]> writes:

> On 09/21/2017 02:57 AM, Michael Ellerman wrote:
>> Tyrel Datwyler <[email protected]> writes:
>>> On 09/20/2017 04:39 AM, Michael Ellerman wrote:
>>>> Rob Herring <[email protected]> writes:
>
> <snip>
>
>>>>
>>>> Testing a fix, will report back.
>>>
>>> So, that patch slipped past me. Not only is the parent reference not ours to drop, but
>>> when I went and looked at dlpar_cpu_add() I also noticed that of_node_put() was done on
>>> the parent prior to the call to dlpar_attach_node(). With the addition of "parent" to the
>>> dlpar_attach_node() parameter list dlpar_cpu_add() needs to be fixed up to hold the
>>> "parent" reference until after dlpar_attach_node() returns.
>>
>> Yep. I wrote the same patch :)
>>
>> Rob asked me to test it, which I did, but /cpus starts out with an
>> elevated ref count, so you have to do ~30 (on my system) DLPAR removes
>> to hit the bug, which I didn't do.
>
> Yeah, there are a lot of things that grab references to /cpus. So, I had a good idea that
> I needed to loop a few times adding and removing multiple cpus to trigger the issue. Its
> also obvious when using those OF trace points I wrote a while back that refcount for /cpus
> is dropping off uncharacteristically in response to symmetrical adds/removes of cpus. I
> saw your note about getting that patchset resubmitted. I'll try and get that queued back
> up soon.

Thanks, it'd be great to get it in. I applied it from the list and used
it for testing this.

cheers