On 05/25/2017 10:30 AM, Michael Bringmann wrote:
> I will try that patch shortly. I also updated my patch to be conditional
> on whether the pool's cpumask attribute was empty. You should have received
> V2 of that patch by now.
Let's try this again.
The hotplug problem goes away with the changes that you provided earlier, and
shown in the patch below. I kept this change to get_unbound_pool() as a
just-in-case measure, to help explain the crash in the event that it occurs again:
	if (!cpumask_weight(pool->attrs->cpumask))
		cpumask_copy(pool->attrs->cpumask, cpumask_of(smp_processor_id()));
I could also insert
	BUG_ON(!cpumask_weight(pool->attrs->cpumask));
at that place, but I would really prefer not to crash the system if there is a workaround.
> On 05/25/2017 10:07 AM, Tejun Heo wrote:
>> On Thu, May 25, 2017 at 11:03:53AM -0400, Tejun Heo wrote:
>>> wq_update_unbound_numa() should have never called into
>>> alloc_unbound_pwq() w/ empty node cpu mask. It should have fallen
>>> back to the dfl_pwq. It looks like I just messed up the logic there
>>> from the initial commit of the feature. Can you please see whether
>>> the following fixes the problem?
>>
>> Can you please try the following instead. On the second thought, I
>> don't think the current logic is wrong. If this fixes the issue,
>> somehow your setup is having a situation where online cpumask for a
>> node is a proper superset of possible cpumask for the node.
>>
>> Thanks.
>>
>> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
>> index c74bf39ef764..4da5ff649ff8 100644
>> --- a/kernel/workqueue.c
>> +++ b/kernel/workqueue.c
>> @@ -3559,13 +3559,13 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
>>   * stable.
>>   *
>>   * Return: %true if the resulting @cpumask is different from @attrs->cpumask,
>> - * %false if equal.
>> + * %false if equal. On %false return, the content of @cpumask is undefined.
>>   */
>>  static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
>>  				 int cpu_going_down, cpumask_t *cpumask)
>>  {
>>  	if (!wq_numa_enabled || attrs->no_numa)
>> -		goto use_dfl;
>> +		return false;
>>
>>  	/* does @node have any online CPUs @attrs wants? */
>>  	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
>> @@ -3573,15 +3573,13 @@ static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
>>  		cpumask_clear_cpu(cpu_going_down, cpumask);
>>
>>  	if (cpumask_empty(cpumask))
>> -		goto use_dfl;
>> +		return false;
>>
>>  	/* yeap, return possible CPUs in @node that @attrs wants */
>>  	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>> -	return !cpumask_equal(cpumask, attrs->cpumask);
>>
>> -use_dfl:
>> -	cpumask_copy(cpumask, attrs->cpumask);
>> -	return false;
>> +	return !cpumask_empty(cpumask) &&
>> +		!cpumask_equal(cpumask, attrs->cpumask);
>>  }
>>
>>  /* install @pwq into @wq's numa_pwq_tbl[] for @node and return the old pwq */
>>
>>
> Can you please post the messages with the debug patch from the prev
> thread? In fact, let's please continue on that thread. I'm having a
> hard time following what's going wrong with the code.
Are these the failure logs that you requested?
Red Hat Enterprise Linux Server 7.3 (Maipo)
Kernel 4.12.0-rc1.wi91275_debug_03.ppc64le+ on an ppc64le
ltcalpine2-lp20 login: root
Password:
Last login: Wed May 24 18:45:40 from oc1554177480.austin.ibm.com
[root@ltcalpine2-lp20 ~]# numactl -H
available: 2 nodes (0,6)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 6 size: 19858 MB
node 6 free: 16920 MB
node distances:
node 0 6
0: 10 40
6: 40 10
[root@ltcalpine2-lp20 ~]# numactl -H
available: 2 nodes (0,6)
node 0 cpus:
node 0 size: 0 MB
node 0 free: 0 MB
node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
node 6 size: 19858 MB
node 6 free: 16362 MB
node distances:
node 0 6
0: 10 40
6: 40 10
[root@ltcalpine2-lp20 ~]# [ 321.310943] workqueue:get_unbound_pool has empty cpumask for pool attrs
[ 321.310961] ------------[ cut here ]------------
[ 321.310997] WARNING: CPU: 184 PID: 13201 at kernel/workqueue.c:3375 alloc_unbound_pwq+0x5c0/0x5e0
[ 321.311005] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 321.311097] CPU: 184 PID: 13201 Comm: cpuhp/184 Not tainted 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
[ 321.311106] task: c000000408961080 task.stack: c000000406394000
[ 321.311113] NIP: c000000000116c80 LR: c000000000116c7c CTR: 0000000000000000
[ 321.311121] REGS: c0000004063977b0 TRAP: 0700 Not tainted (4.12.0-rc1.wi91275_debug_03.ppc64le+)
[ 321.311128] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
[ 321.311150] CR: 28000082 XER: 00000000
[ 321.311159] CFAR: c000000000a2dc80 SOFTE: 1
[ 321.311159] GPR00: c000000000116c7c c000000406397a30 c0000000013ae900 000000000000003b
[ 321.311159] GPR04: c000000408961a38 0000000000000006 00000000a49e41e5 ffffffffa4a5a483
[ 321.311159] GPR08: 00000000000062cc 0000000000000000 0000000000000000 c000000408961a38
[ 321.311159] GPR12: 0000000000000000 c00000000fb38c00 c00000000011e858 c00000040a902ac0
[ 321.311159] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 321.311159] GPR20: c000000406394000 0000000000000002 c000000406394000 0000000000000000
[ 321.311159] GPR24: c000000405075400 c000000404fc0000 0000000000000110 c0000000015a4c88
[ 321.311159] GPR28: 0000000000000000 c0000004fe256000 c0000004fe256008 c0000004fe052800
[ 321.311290] NIP [c000000000116c80] alloc_unbound_pwq+0x5c0/0x5e0
[ 321.311298] LR [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0
[ 321.311305] Call Trace:
[ 321.311310] [c000000406397a30] [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0 (unreliable)
[ 321.311323] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
[ 321.311334] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
[ 321.311345] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
[ 321.311355] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
[ 321.311365] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
[ 321.311376] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
[ 321.311386] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
[ 321.311397] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
[ 321.311406] Instruction dump:
[ 321.311413] 3d42fff0 892ac565 2f890000 40fefd98 39200001 3c62ff89 3c82ff6c 3863d590
[ 321.311437] 38847cb0 992ac565 48916fc9 60000000 <0fe00000> 4bfffd70 60000000 60420000
[ 321.311462] ---[ end trace 9f7c1cd616b26de8 ]---
[ 321.318347] Unable to handle kernel paging request for unaligned access at address 0xc0000003c5577ebf
[ 321.318448] Faulting instruction address: 0xc00000000055ec8c
[ 321.318457] Oops: Kernel access of bad area, sig: 7 [#1]
[ 321.318462] SMP NR_CPUS=2048
[ 321.318463] NUMA
[ 321.318468] pSeries
[ 321.318473] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
[ 321.318524] CPU: 184 PID: 13201 Comm: cpuhp/184 Tainted: G W 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
[ 321.318532] task: c000000408961080 task.stack: c000000406394000
[ 321.318537] NIP: c00000000055ec8c LR: c0000000001312d4 CTR: c000000000145d50
[ 321.318544] REGS: c000000406397690 TRAP: 0600 Tainted: G W (4.12.0-rc1.wi91275_debug_03.ppc64le+)
[ 321.318551] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE>
[ 321.318563] CR: 28000024 XER: 00000000
[ 321.318571] CFAR: c0000000001312d0 DAR: c0000003c5577ebf DSISR: 00000000 SOFTE: 0
[ 321.318571] GPR00: c000000000131298 c000000406397910 c0000000013ae900 c0000004b6d22820
[ 321.318571] GPR04: c0000004b6d22820 c0000003c5577ebf 0000000000000000 00000004f1230000
[ 321.318571] GPR08: 0000000d8ddb1ea7 0000000000000000 0000000000000008 c000000408961aa8
[ 321.318571] GPR12: c000000000145d50 c00000000fb38c00 c00000000011e858 c00000040a902ac0
[ 321.318571] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 321.318571] GPR20: c000000406394000 0000000000000002 0000000000004000 c000000000fb7700
[ 321.318571] GPR24: c0000000013f5d00 c0000000013f9d48 0000000000000000 c0000004b6d230e8
[ 321.318571] GPR28: 0000000000000004 00000003c45bfc57 0000000000000800 c0000004b6d22800
[ 321.318664] NIP [c00000000055ec8c] llist_add_batch+0xc/0x40
[ 321.318670] LR [c0000000001312d4] try_to_wake_up+0x524/0x850
[ 321.318675] Call Trace:
[ 321.318679] [c000000406397910] [c000000000131298] try_to_wake_up+0x4e8/0x850 (unreliable)
[ 321.318689] [c000000406397990] [c000000000111bf8] create_worker+0x148/0x220
[ 321.318696] [c000000406397a30] [c000000000116ae8] alloc_unbound_pwq+0x428/0x5e0
[ 321.318705] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
[ 321.318713] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
[ 321.318721] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
[ 321.318729] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
[ 321.318737] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
[ 321.318745] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
[ 321.318754] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
[ 321.318762] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
[ 321.318769] Instruction dump:
[ 321.318775] 60420000 38600000 4e800020 60000000 60420000 7c832378 4e800020 60000000
[ 321.318790] 60000000 e9250000 f9240000 7c0004ac <7d4028a8> 7c2a4800 40c20010 7c6029ad
[ 321.318808] ---[ end trace 9f7c1cd616b26de9 ]---
[ 321.322303]
[ 323.322505] Kernel panic - not syncing: Fatal exception
[ 323.429027] Rebooting in 10 seconds..
Regards,
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
Hello,
On Tue, Jun 06, 2017 at 11:18:36AM -0500, Michael Bringmann wrote:
> On 05/25/2017 10:30 AM, Michael Bringmann wrote:
> > I will try that patch shortly. I also updated my patch to be conditional
> > on whether the pool's cpumask attribute was empty. You should have received
> > V2 of that patch by now.
>
> Let's try this again.
>
> The hotplug problem goes away with the changes that you provided earlier, and
So, that means we're ending up in situations where NUMA online is a
proper superset of NUMA possible.
> shown in the patch below. I kept this change to get_unbound_pool() as a
> just-in-case measure, to help explain the crash in the event that it occurs again:
>
> 	if (!cpumask_weight(pool->attrs->cpumask))
> 		cpumask_copy(pool->attrs->cpumask, cpumask_of(smp_processor_id()));
>
> I could also insert
>
> 	BUG_ON(!cpumask_weight(pool->attrs->cpumask));
>
> at that place, but I would really prefer not to crash the system if there is a workaround.
I'm not sure because it doesn't make any logical sense and it's not
right in terms of correctness. The above would be able to enable CPUs
which are explicitly excluded from a workqueue. The only fallback
which makes sense is falling back to the default pwq.
> > Can you please post the messages with the debug patch from the prev
> > thread? In fact, let's please continue on that thread. I'm having a
> > hard time following what's going wrong with the code.
>
> Are these the failure logs that you requested?
>
>
> Red Hat Enterprise Linux Server 7.3 (Maipo)
> Kernel 4.12.0-rc1.wi91275_debug_03.ppc64le+ on an ppc64le
>
> ltcalpine2-lp20 login: root
> Password:
> Last login: Wed May 24 18:45:40 from oc1554177480.austin.ibm.com
> [root@ltcalpine2-lp20 ~]# numactl -H
> available: 2 nodes (0,6)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 6 size: 19858 MB
> node 6 free: 16920 MB
> node distances:
> node 0 6
> 0: 10 40
> 6: 40 10
> [root@ltcalpine2-lp20 ~]# numactl -H
> available: 2 nodes (0,6)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
> node 6 size: 19858 MB
> node 6 free: 16362 MB
> node distances:
> node 0 6
> 0: 10 40
> 6: 40 10
> [root@ltcalpine2-lp20 ~]# [ 321.310943] workqueue:get_unbound_pool has empty cpumask for pool attrs
> [ 321.310961] ------------[ cut here ]------------
> [ 321.310997] WARNING: CPU: 184 PID: 13201 at kernel/workqueue.c:3375 alloc_unbound_pwq+0x5c0/0x5e0
> [ 321.311005] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
> [ 321.311097] CPU: 184 PID: 13201 Comm: cpuhp/184 Not tainted 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
> [ 321.311106] task: c000000408961080 task.stack: c000000406394000
> [ 321.311113] NIP: c000000000116c80 LR: c000000000116c7c CTR: 0000000000000000
> [ 321.311121] REGS: c0000004063977b0 TRAP: 0700 Not tainted (4.12.0-rc1.wi91275_debug_03.ppc64le+)
> [ 321.311128] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
> [ 321.311150] CR: 28000082 XER: 00000000
> [ 321.311159] CFAR: c000000000a2dc80 SOFTE: 1
> [ 321.311159] GPR00: c000000000116c7c c000000406397a30 c0000000013ae900 000000000000003b
> [ 321.311159] GPR04: c000000408961a38 0000000000000006 00000000a49e41e5 ffffffffa4a5a483
> [ 321.311159] GPR08: 00000000000062cc 0000000000000000 0000000000000000 c000000408961a38
> [ 321.311159] GPR12: 0000000000000000 c00000000fb38c00 c00000000011e858 c00000040a902ac0
> [ 321.311159] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> [ 321.311159] GPR20: c000000406394000 0000000000000002 c000000406394000 0000000000000000
> [ 321.311159] GPR24: c000000405075400 c000000404fc0000 0000000000000110 c0000000015a4c88
> [ 321.311159] GPR28: 0000000000000000 c0000004fe256000 c0000004fe256008 c0000004fe052800
> [ 321.311290] NIP [c000000000116c80] alloc_unbound_pwq+0x5c0/0x5e0
> [ 321.311298] LR [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0
> [ 321.311305] Call Trace:
> [ 321.311310] [c000000406397a30] [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0 (unreliable)
> [ 321.311323] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
> [ 321.311334] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
> [ 321.311345] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
> [ 321.311355] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
> [ 321.311365] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
> [ 321.311376] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
> [ 321.311386] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
> [ 321.311397] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
> [ 321.311406] Instruction dump:
> [ 321.311413] 3d42fff0 892ac565 2f890000 40fefd98 39200001 3c62ff89 3c82ff6c 3863d590
> [ 321.311437] 38847cb0 992ac565 48916fc9 60000000 <0fe00000> 4bfffd70 60000000 60420000
The only way offlining can lead to this failure is when wq numa
possible cpu mask is a proper subset of the matching online mask. Can
you please print out the numa online cpu and wq_numa_possible_cpumask
masks and verify that online stays within the possible for each node?
If not, the ppc arch init code needs to be updated so that cpu <->
node binding is established for all possible cpus on boot. Note that
this isn't a requirement coming solely from wq. All node affine (thus
percpu) allocations depend on that.
Thanks.
--
tejun
Hello:
On 06/06/2017 01:09 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 06, 2017 at 11:18:36AM -0500, Michael Bringmann wrote:
>> On 05/25/2017 10:30 AM, Michael Bringmann wrote:
>>> I will try that patch shortly. I also updated my patch to be conditional
>>> on whether the pool's cpumask attribute was empty. You should have received
>>> V2 of that patch by now.
>>
>> Let's try this again.
>>
>> The hotplug problem goes away with the changes that you provided earlier, and
>
> So, that means we're ending up in situations where NUMA online is a
> proper superset of NUMA possible.
>
>> shown in the patch below. I kept this change to get_unbound_pool() as a
>> just-in-case measure, to help explain the crash in the event that it occurs again:
>>
>> 	if (!cpumask_weight(pool->attrs->cpumask))
>> 		cpumask_copy(pool->attrs->cpumask, cpumask_of(smp_processor_id()));
>>
>> I could also insert
>>
>> 	BUG_ON(!cpumask_weight(pool->attrs->cpumask));
>>
>> at that place, but I would really prefer not to crash the system if there is a workaround.
>
> I'm not sure because it doesn't make any logical sense and it's not
> right in terms of correctness. The above would be able to enable CPUs
> which are explicitly excluded from a workqueue. The only fallback
> which makes sense is falling back to the default pwq.
What would that look like? Are you sure that would always be valid?
In a system that is hot-adding and hot-removing CPUs?
>>> Can you please post the messages with the debug patch from the prev
>>> thread? In fact, let's please continue on that thread. I'm having a
>>> hard time following what's going wrong with the code.
>>
>> Are these the failure logs that you requested?
>>
>>
>> Red Hat Enterprise Linux Server 7.3 (Maipo)
>> Kernel 4.12.0-rc1.wi91275_debug_03.ppc64le+ on an ppc64le
>>
>> ltcalpine2-lp20 login: root
>> Password:
>> Last login: Wed May 24 18:45:40 from oc1554177480.austin.ibm.com
>> [root@ltcalpine2-lp20 ~]# numactl -H
>> available: 2 nodes (0,6)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
>> node 6 size: 19858 MB
>> node 6 free: 16920 MB
>> node distances:
>> node 0 6
>> 0: 10 40
>> 6: 40 10
>> [root@ltcalpine2-lp20 ~]# numactl -H
>> available: 2 nodes (0,6)
>> node 0 cpus:
>> node 0 size: 0 MB
>> node 0 free: 0 MB
>> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
>> node 6 size: 19858 MB
>> node 6 free: 16362 MB
>> node distances:
>> node 0 6
>> 0: 10 40
>> 6: 40 10
>> [root@ltcalpine2-lp20 ~]# [ 321.310943] workqueue:get_unbound_pool has empty cpumask for pool attrs
>> [ 321.310961] ------------[ cut here ]------------
>> [ 321.310997] WARNING: CPU: 184 PID: 13201 at kernel/workqueue.c:3375 alloc_unbound_pwq+0x5c0/0x5e0
>> [ 321.311005] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
>> [ 321.311097] CPU: 184 PID: 13201 Comm: cpuhp/184 Not tainted 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
>> [ 321.311106] task: c000000408961080 task.stack: c000000406394000
>> [ 321.311113] NIP: c000000000116c80 LR: c000000000116c7c CTR: 0000000000000000
>> [ 321.311121] REGS: c0000004063977b0 TRAP: 0700 Not tainted (4.12.0-rc1.wi91275_debug_03.ppc64le+)
>> [ 321.311128] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
>> [ 321.311150] CR: 28000082 XER: 00000000
>> [ 321.311159] CFAR: c000000000a2dc80 SOFTE: 1
>> [ 321.311159] GPR00: c000000000116c7c c000000406397a30 c0000000013ae900 000000000000003b
>> [ 321.311159] GPR04: c000000408961a38 0000000000000006 00000000a49e41e5 ffffffffa4a5a483
>> [ 321.311159] GPR08: 00000000000062cc 0000000000000000 0000000000000000 c000000408961a38
>> [ 321.311159] GPR12: 0000000000000000 c00000000fb38c00 c00000000011e858 c00000040a902ac0
>> [ 321.311159] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
>> [ 321.311159] GPR20: c000000406394000 0000000000000002 c000000406394000 0000000000000000
>> [ 321.311159] GPR24: c000000405075400 c000000404fc0000 0000000000000110 c0000000015a4c88
>> [ 321.311159] GPR28: 0000000000000000 c0000004fe256000 c0000004fe256008 c0000004fe052800
>> [ 321.311290] NIP [c000000000116c80] alloc_unbound_pwq+0x5c0/0x5e0
>> [ 321.311298] LR [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0
>> [ 321.311305] Call Trace:
>> [ 321.311310] [c000000406397a30] [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0 (unreliable)
>> [ 321.311323] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
>> [ 321.311334] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
>> [ 321.311345] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
>> [ 321.311355] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
>> [ 321.311365] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
>> [ 321.311376] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
>> [ 321.311386] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
>> [ 321.311397] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
>> [ 321.311406] Instruction dump:
>> [ 321.311413] 3d42fff0 892ac565 2f890000 40fefd98 39200001 3c62ff89 3c82ff6c 3863d590
>> [ 321.311437] 38847cb0 992ac565 48916fc9 60000000 <0fe00000> 4bfffd70 60000000 60420000
>
> The only way offlining can lead to this failure is when wq numa
> possible cpu mask is a proper subset of the matching online mask. Can
> you please print out the numa online cpu and wq_numa_possible_cpumask
> masks and verify that online stays within the possible for each node?
> If not, the ppc arch init code needs to be updated so that cpu <->
> node binding is established for all possible cpus on boot. Note that
> this isn't a requirement coming solely from wq. All node affine (thus
> percpu) allocations depend on that.
The ppc arch init code already records all nodes used by the CPUs visible in
the device-tree at boot time into the possible and online node bindings. The
problem here occurs when we hot-add new CPUs to the powerpc system -- they may
require nodes that are mentioned by the VPHN hcall, but which were not used
at boot time.
I will run a test that dumps these masks later this week to try to provide
the information that you are interested in.
Right now we are having a discussion on another thread as to how to properly
set the possible node mask at boot given that there is no mechanism to hot-add
nodes to the system. The latest idea appears to be adding another property
or two to define the maximum number of nodes that should be added to the
possible / online node masks to allow for dynamic growth after boot.
>
> Thanks.
>
Thanks.
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
Hello,
On Mon, Jun 12, 2017 at 09:47:31AM -0500, Michael Bringmann wrote:
> > I'm not sure because it doesn't make any logical sense and it's not
> > right in terms of correctness. The above would be able to enable CPUs
> > which are explicitly excluded from a workqueue. The only fallback
> > which makes sense is falling back to the default pwq.
>
> What would that look like? Are you sure that would always be valid?
> In a system that is hot-adding and hot-removing CPUs?
The reason why we're ending up with empty masks is because
wq_calc_node_cpumask() is assuming that the possible node cpumask is
always a superset of online (as it should). We can trigger a fat
warning there if that isn't so and just return false from that
function.
> > The only way offlining can lead to this failure is when wq numa
> > possible cpu mask is a proper subset of the matching online mask. Can
> > you please print out the numa online cpu and wq_numa_possible_cpumask
> > masks and verify that online stays within the possible for each node?
> > If not, the ppc arch init code needs to be updated so that cpu <->
> > node binding is established for all possible cpus on boot. Note that
> > this isn't a requirement coming solely from wq. All node affine (thus
> > percpu) allocations depend on that.
>
> The ppc arch init code already records all nodes used by the CPUs visible in
> the device-tree at boot time into the possible and online node bindings. The
> problem here occurs when we hot-add new CPUs to the powerpc system -- they may
> require nodes that are mentioned by the VPHN hcall, but which were not used
> at boot time.
We need all the possible (so, for cpus which aren't online yet too)
CPU -> node mappings to be established on boot. This isn't just a
requirement from workqueue. We don't have any synchronization
regarding cpu <-> numa mapping in memory allocation paths either.
> I will run a test that dumps these masks later this week to try to provide
> the information that you are interested in.
>
> Right now we are having a discussion on another thread as to how to properly
> set the possible node mask at boot given that there is no mechanism to hot-add
> nodes to the system. The latest idea appears to be adding another property
> or two to define the maximum number of nodes that should be added to the
> possible / online node masks to allow for dynamic growth after boot.
I have no idea about the specifics of ppc but at least the code base
we have currently expects all possible cpus and nodes and their
mappings to be established on boot.
Thanks.
--
tejun
On 06/12/2017 11:14 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 12, 2017 at 09:47:31AM -0500, Michael Bringmann wrote:
>>> I'm not sure because it doesn't make any logical sense and it's not
>>> right in terms of correctness. The above would be able to enable CPUs
>>> which are explicitly excluded from a workqueue. The only fallback
>>> which makes sense is falling back to the default pwq.
>>
>> What would that look like? Are you sure that would always be valid?
>> In a system that is hot-adding and hot-removing CPUs?
>
> The reason why we're ending up with empty masks is because
> wq_calc_node_cpumask() is assuming that the possible node cpumask is
> always a superset of online (as it should). We can trigger a fat
> warning there if that isn't so and just return false from that
> function.
What would that look like? I should be able to test it on top of the
other changes / corrections.
>>> The only way offlining can lead to this failure is when wq numa
>>> possible cpu mask is a proper subset of the matching online mask. Can
>>> you please print out the numa online cpu and wq_numa_possible_cpumask
>>> masks and verify that online stays within the possible for each node?
>>> If not, the ppc arch init code needs to be updated so that cpu <->
>>> node binding is established for all possible cpus on boot. Note that
>>> this isn't a requirement coming solely from wq. All node affine (thus
>>> percpu) allocations depend on that.
>>
>> The ppc arch init code already records all nodes used by the CPUs visible in
>> the device-tree at boot time into the possible and online node bindings. The
>> problem here occurs when we hot-add new CPUs to the powerpc system -- they may
>> require nodes that are mentioned by the VPHN hcall, but which were not used
>> at boot time.
>
> We need all the possible (so, for cpus which aren't online yet too)
> CPU -> node mappings to be established on boot. This isn't just a
> requirement from workqueue. We don't have any synchronization
> regarding cpu <-> numa mapping in memory allocation paths either.
>
>> I will run a test that dumps these masks later this week to try to provide
>> the information that you are interested in.
>>
>> Right now we are having a discussion on another thread as to how to properly
>> set the possible node mask at boot given that there is no mechanism to hot-add
>> nodes to the system. The latest idea appears to be adding another property
>> or two to define the maximum number of nodes that should be added to the
>> possible / online node masks to allow for dynamic growth after boot.
>
> I have no idea about the specifics of ppc but at least the code base
> we have currently expects all possible cpus and nodes and their
> mappings to be established on boot.
Hopefully, the new properties will fix the holes in the current implementation
with regard to hot-add.
>
> Thanks.
>
--
Michael W. Bringmann
Linux Technology Center
IBM Corporation
Tie-Line 363-5196
External: (512) 286-5196
Cell: (512) 466-0650
Hello,
On Mon, Jun 12, 2017 at 12:10:49PM -0500, Michael Bringmann wrote:
> > The reason why we're ending up with empty masks is because
> > wq_calc_node_cpumask() is assuming that the possible node cpumask is
> > always a superset of online (as it should). We can trigger a fat
> > warning there if that isn't so and just return false from that
> > function.
>
> What would that look like? I should be able to test it on top of the
> other changes / corrections.
So, the function looks like the following now.
static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
				 int cpu_going_down, cpumask_t *cpumask)
{
	if (!wq_numa_enabled || attrs->no_numa)
		goto use_dfl;

	/* does @node have any online CPUs @attrs wants? */
A:	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
	if (cpu_going_down >= 0)
		cpumask_clear_cpu(cpu_going_down, cpumask);

B:	if (cpumask_empty(cpumask))
		goto use_dfl;

	/* yeap, return possible CPUs in @node that @attrs wants */
C:	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
	return !cpumask_equal(cpumask, attrs->cpumask);

use_dfl:
	cpumask_copy(cpumask, attrs->cpumask);
	return false;
}
A is calculating the target cpumask to use using the online map. B
falls back to dfl mask if the intersection is empty. C calculates the
eventual mask to use from the intersection of possible mask and what's
requested. The assumption is that because possible is a superset of
online, C's result can't be smaller than A.
So, what we can do is calculate the possible intersection,
compare it against the online intersection, and if the latter is
bigger, trigger a big fat warning and return false there.
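
To make the A/B/C reasoning above concrete, here is a minimal userspace sketch. It is not the kernel code: it assumes CPUs are modelled as bits of a 64-bit word, and `mask_t`, `calc_node_mask()` and its parameters are names made up for this illustration. `node_online` stands in for cpumask_of_node(node) and `node_possible` for wq_numa_possible_cpumask[node].

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, not the kernel cpumask API. */
typedef uint64_t mask_t;

/*
 * Mirrors steps A, B and C of the function quoted above.
 * cpu_going_down < 0 means "no CPU is going down".
 */
static bool calc_node_mask(mask_t wanted, mask_t node_online,
			   mask_t node_possible, int cpu_going_down,
			   mask_t *out)
{
	assert(cpu_going_down < 64);

	/* A: online CPUs of the node that the attrs want */
	mask_t online_hit = node_online & wanted;
	if (cpu_going_down >= 0)
		online_hit &= ~((mask_t)1 << cpu_going_down);

	/* B: nothing usable online on this node, use the default mask */
	if (online_hit == 0) {
		*out = wanted;
		return false;
	}

	/* C: possible CPUs of the node that the attrs want */
	*out = wanted & node_possible;
	return *out != wanted;
}
```

With wanted = 0xFF, node_online = 0x30 and node_possible = 0, step B passes but step C leaves an empty mask while the function still returns true. That is the case where the per-node possible mask is not a superset of the online mask.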
> > I have no idea about the specifics of ppc but at least the code base
> > we have currently expect all possible cpus and nodes and their
> > mappings to be established on boot.
>
> Hopefully, the new properties will fix the holes in the current implementation
> with regard to hot-add.
Yeah, that's the only proper fix here.
Thanks.
--
tejun
Hello:
On 06/12/2017 12:32 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 12, 2017 at 12:10:49PM -0500, Michael Bringmann wrote:
>>> The reason why we're ending up with empty masks is because
>>> wq_calc_node_cpumask() is assuming that the possible node cpumask is
>>> always a superset of online (as it should). We can trigger a fat
>>> warning there if that isn't so and just return false from that
>>> function.
>>
>> What would that look like? I should be able to test it on top of the
>> other changes / corrections.
>
> So, the function looks like the following now.
>
> static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
> int cpu_going_down, cpumask_t *cpumask)
> {
> if (!wq_numa_enabled || attrs->no_numa)
> goto use_dfl;
>
> /* does @node have any online CPUs @attrs wants? */
> A: cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
> if (cpu_going_down >= 0)
> cpumask_clear_cpu(cpu_going_down, cpumask);
>
> B: if (cpumask_empty(cpumask))
> goto use_dfl;
>
> /* yeap, return possible CPUs in @node that @attrs wants */
> C: cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> return !cpumask_equal(cpumask, attrs->cpumask);
>
> use_dfl:
> cpumask_copy(cpumask, attrs->cpumask);
> return false;
> }
>
> A is calculating the target cpumask to use using the online map. B
> falls back to dfl mask if the intersection is empty. C calculates the
> eventual mask to use from the intersection of possible mask and what's
> requested. The assumption is that because possible is a superset of
> online, C's result can't be smaller than A.
>
> So, what we can do is calculate the possible intersection,
> compare it against the online intersection, and if the latter is
> bigger, trigger a big fat warning and return false there.
So the revisions to the function would look like the following, correct?

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index c74bf39..5d7674e 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3564,19 +3564,28 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
 static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
 				 int cpu_going_down, cpumask_t *cpumask)
 {
+	cpumask_t onl_targ_cm;
+
 	if (!wq_numa_enabled || attrs->no_numa)
 		goto use_dfl;
 
 	/* does @node have any online CPUs @attrs wants? */
-	cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
+	cpumask_and(&onl_targ_cm, cpumask_of_node(node), attrs->cpumask);
 	if (cpu_going_down >= 0)
-		cpumask_clear_cpu(cpu_going_down, cpumask);
+		cpumask_clear_cpu(cpu_going_down, &onl_targ_cm);
 
-	if (cpumask_empty(cpumask))
+	if (cpumask_empty(&onl_targ_cm))
 		goto use_dfl;
 
 	/* yeap, return possible CPUs in @node that @attrs wants */
 	cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
+
+	if (cpumask_weight(&onl_targ_cm) > cpumask_weight(cpumask)) {
+		printk(KERN_INFO "WARNING: WQ cpumask: onl intersect > "
+				"possible intersect\n");
+		return false;
+	}
+
 	return !cpumask_equal(cpumask, attrs->cpumask);
 
 use_dfl:
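
The check in the diff above compares the number of bits in the two intersections. As a rough userspace sketch of what that comparison does and does not catch, the following models masks as 64-bit words. `mask_t` and all three helper names are invented for this example, and the subset test is only shown for contrast; it is not something proposed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, not the kernel cpumask API. */
typedef uint64_t mask_t;

/* Count set bits, playing the role of cpumask_weight(). */
static unsigned int mask_weight(mask_t m)
{
	unsigned int n = 0;

	for (; m; m &= m - 1)	/* clear the lowest set bit each round */
		n++;
	return n;
}

/* The comparison used in the diff: more online bits than possible bits. */
static bool online_outweighs_possible(mask_t online_hit, mask_t possible_hit)
{
	return mask_weight(online_hit) > mask_weight(possible_hit);
}

/* For contrast: some online CPU is missing from the possible set. */
static bool online_not_subset(mask_t online_hit, mask_t possible_hit)
{
	return (online_hit & ~possible_hit) != 0;
}
```

With the hypothetical values online_hit = 0x30 and possible_hit = 0x0F, the weight comparison does not fire (2 bits against 4) even though neither online CPU is in the possible set. With possible_hit = 0, both tests fire.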
>
>>> I have no idea about the specifics of ppc but at least the code base
>>> we have currently expect all possible cpus and nodes and their
>>> mappings to be established on boot.
>>
>> Hopefully, the new properties will fix the holes in the current implementation
>> with regard to hot-add.
>
> Yeah, that's the only proper fix here.
I incorporated other changes to try to fill in the possible map more accurately,
and with only the above modification to workqueue.c, I ran a hot-add CPU test
to add 8 VPs to a powerpc Shared CPU configuration. It produced many warning
messages, far more than I was expecting, but it did not crash.
I have attached a compressed copy of the log file.
>
> Thanks.
>
Thanks.
--
Michael W. Bringmann
Hello,
On Tue, Jun 13, 2017 at 03:04:30PM -0500, Michael Bringmann wrote:
> @@ -3564,19 +3564,28 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
> static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
> int cpu_going_down, cpumask_t *cpumask)
> {
> + cpumask_t onl_targ_cm;
> +
> if (!wq_numa_enabled || attrs->no_numa)
> goto use_dfl;
>
> /* does @node have any online CPUs @attrs wants? */
> - cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
> + cpumask_and(&onl_targ_cm, cpumask_of_node(node), attrs->cpumask);
> if (cpu_going_down >= 0)
> - cpumask_clear_cpu(cpu_going_down, cpumask);
> + cpumask_clear_cpu(cpu_going_down, &onl_targ_cm);
>
> - if (cpumask_empty(cpumask))
> + if (cpumask_empty(&onl_targ_cm))
> goto use_dfl;
>
> /* yeap, return possible CPUs in @node that @attrs wants */
> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
> +
> + if (cpumask_weight(&onl_targ_cm) > cpumask_weight(cpumask)) {
> + printk(KERN_INFO "WARNING: WQ cpumask: onl intersect > "
> + "possible intersect\n");
> + return false;
> + }
Yeah, alternatively, we can just add right before returning,

	if (WARN_ON(cpumask_empty(cpumask)))
		return false;

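
A userspace sketch of where such a guard would sit is shown below. It assumes the same simplified bitmask model as before. `mask_t`, `warn_on()` and `calc_node_mask_guarded()` are hypothetical names, the cpu_going_down handling is left out for brevity, and `warn_on()` only imitates the way the kernel's WARN_ON() evaluates to its condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model: one bit per CPU, not the kernel cpumask API. */
typedef uint64_t mask_t;

/* Stand-in for WARN_ON(): report the problem and yield the condition. */
static bool warn_on(bool cond, const char *what)
{
	if (cond)
		fprintf(stderr, "WARNING: %s\n", what);
	return cond;
}

/*
 * Returns true if *out differs from wanted. On a false return the
 * caller is assumed to fall back to the default mask and ignore *out.
 */
static bool calc_node_mask_guarded(mask_t wanted, mask_t node_online,
				   mask_t node_possible, mask_t *out)
{
	assert(out != NULL);

	/* no wanted CPU is online on this node: use the default */
	if ((node_online & wanted) == 0) {
		*out = wanted;
		return false;
	}

	*out = wanted & node_possible;

	/* guard: never hand an empty mask back as a valid result */
	if (warn_on(*out == 0, "possible intersection is empty"))
		return false;

	return *out != wanted;
}
```

Called with wanted = 0xFF, node_online = 0x30 and node_possible = 0, this version warns and returns false instead of reporting an empty mask as the node's mask.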
Thanks.
--
tejun
I will try that patch tomorrow. My only concern about that is the use of WARN_ON().
As I may have mentioned in my note of 6/27, I saw about 600 instances of the warning
message just during boot of the PowerPC kernel. I doubt that we want to see that on
an ongoing basis.
Michael
On 06/13/2017 03:10 PM, Tejun Heo wrote:
> Hello,
>
> On Tue, Jun 13, 2017 at 03:04:30PM -0500, Michael Bringmann wrote:
>> @@ -3564,19 +3564,28 @@ static struct pool_workqueue *alloc_unbound_pwq(struct workqueue_struct *wq,
>> static bool wq_calc_node_cpumask(const struct workqueue_attrs *attrs, int node,
>> int cpu_going_down, cpumask_t *cpumask)
>> {
>> + cpumask_t onl_targ_cm;
>> +
>> if (!wq_numa_enabled || attrs->no_numa)
>> goto use_dfl;
>>
>> /* does @node have any online CPUs @attrs wants? */
>> - cpumask_and(cpumask, cpumask_of_node(node), attrs->cpumask);
>> + cpumask_and(&onl_targ_cm, cpumask_of_node(node), attrs->cpumask);
>> if (cpu_going_down >= 0)
>> - cpumask_clear_cpu(cpu_going_down, cpumask);
>> + cpumask_clear_cpu(cpu_going_down, &onl_targ_cm);
>>
>> - if (cpumask_empty(cpumask))
>> + if (cpumask_empty(&onl_targ_cm))
>> goto use_dfl;
>>
>> /* yeap, return possible CPUs in @node that @attrs wants */
>> cpumask_and(cpumask, attrs->cpumask, wq_numa_possible_cpumask[node]);
>> +
>> + if (cpumask_weight(&onl_targ_cm) > cpumask_weight(cpumask)) {
>> + printk(KERN_INFO "WARNING: WQ cpumask: onl intersect > "
>> + "possible intersect\n");
>> + return false;
>> + }
>
> Yeah, alternatively, we can just add right before returning,
>
> if (WARN_ON(cpumask_empty(cpumask)))
> return false;
>
> Thanks.
>
--
Michael W. Bringmann
On Wed, Jun 28, 2017 at 04:15:09PM -0500, Michael Bringmann wrote:
> I will try that patch tomorrow. My only concern about that is the use of WARN_ON().
> As I may have mentioned in my note of 6/27, I saw about 600 instances of the warning
> message just during boot of the PowerPC kernel. I doubt that we want to see that on
> an ongoing basis.
Yeah, we can either do pr_err_once() or WARN_ON_ONCE(), so that we
don't print that out constantly.
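
The effect of a once-only warning can be sketched in plain C as follows. This is an illustration, not the kernel macros: `warn_once()`, `count_fallbacks()` and the `messages_printed` counter are hypothetical, and the flag is passed in explicitly here, whereas WARN_ON_ONCE() and pr_err_once() hide it inside the macro.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical counter, kept only so the sketch can be checked. */
static int messages_printed;

/*
 * Report msg the first time cond is true for the given flag, stay
 * silent afterwards, and always return cond so callers can branch on it.
 */
static bool warn_once(bool cond, bool *warned, const char *msg)
{
	assert(warned != NULL);

	if (cond && !*warned) {
		*warned = true;
		messages_printed++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Run n checks that all trip; returns how many took the fallback path. */
static int count_fallbacks(int n)
{
	static bool warned;	/* one flag for this call site */
	int fallbacks = 0;

	for (int i = 0; i < n; i++)
		if (warn_once(true, &warned, "empty node cpumask"))
			fallbacks++;
	return fallbacks;
}
```

Running count_fallbacks(600) takes the fallback path 600 times but prints a single message, which is the behaviour wanted for the roughly 600 boot-time hits mentioned above.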
Thanks.
--
tejun
Hello, Tejun:
Do you need anything else from me regarding this patch?
Or are you good to commit it upstream?
Thanks.
Michael
On 06/28/2017 04:24 PM, Tejun Heo wrote:
> On Wed, Jun 28, 2017 at 04:15:09PM -0500, Michael Bringmann wrote:
>> I will try that patch tomorrow. My only concern about that is the use of WARN_ON().
>> As I may have mentioned in my note of 6/27, I saw about 600 instances of the warning
>> message just during boot of the PowerPC kernel. I doubt that we want to see that on
>> an ongoing basis.
>
> Yeah, we can either do pr_err_once() or WARN_ON_ONCE(), so that we
> don't print that out constantly.
>
> Thanks.
>
--
Michael W. Bringmann
On Wed, Jul 26, 2017 at 10:25:08AM -0500, Michael Bringmann wrote:
> Hello, Tejun:
> Do you need anything else from me regarding this patch?
> Or are you good to commit it upstream?
> Thanks.
Hmmm... you were planning to try it and we wanted to convert it to
WARN_ONCE?
Thanks.
--
tejun
Sorry, I did try it. I must have forgotten to notify you of its success.
I will post an updated patch next, using pr_warn_once(). It prints a message
during system initialization, by the way.
Thanks.
On 07/26/2017 02:16 PM, Tejun Heo wrote:
> On Wed, Jul 26, 2017 at 10:25:08AM -0500, Michael Bringmann wrote:
>> Hello, Tejun:
>> Do you need anything else from me regarding this patch?
>> Or are you good to commit it upstream?
>> Thanks.
>
> Hmmm... you were planning to try it and we wanted to convert it to
> WARN_ONCE?
>
> Thanks.
>
--
Michael W. Bringmann