Date: Tue, 6 Jun 2017 14:09:13 -0400
From: Tejun Heo <tj@kernel.org>
To: Michael Bringmann <mwb@linux.vnet.ibm.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>, linux-kernel@vger.kernel.org,
        Nathan Fontenot <nfont@linux.vnet.ibm.com>
Subject: Re: [PATCH] workqueue: Ensure that cpumask set for pools created
 after boot
Message-ID: <20170606180913.GA32062@htj.duckdns.org>
References: <20170516155527.GB6389@htj.duckdns.org>
 <b981e817-8f96-31cf-421b-5f38b8f23628@linux.vnet.ibm.com>
 <20170523194952.GF13222@htj.duckdns.org>
 <d757731e-8770-05a0-9d39-9b296bdccd52@linux.vnet.ibm.com>
 <20170523201029.GH13222@htj.duckdns.org>
 <cc40ae7f-8904-4e3b-0205-06c114a056b8@linux.vnet.ibm.com>
 <20170525150353.GE23493@htj.duckdns.org>
 <20170525150752.GF23493@htj.duckdns.org>
 <f0723547-1383-efa2-a691-eb91ff0620c4@linux.vnet.ibm.com>
 <d4eaef3a-7371-a9e6-d371-2b2dd5703df8@linux.vnet.ibm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <d4eaef3a-7371-a9e6-d371-2b2dd5703df8@linux.vnet.ibm.com>
User-Agent: Mutt/1.8.2 (2017-04-18)
Sender: linux-kernel-owner@vger.kernel.org

Hello,

On Tue, Jun 06, 2017 at 11:18:36AM -0500, Michael Bringmann wrote:
> On 05/25/2017 10:30 AM, Michael Bringmann wrote:
> > I will try that patch shortly.  I also updated my patch to be conditional
> > on whether the pool's cpumask attribute was empty.  You should have received
> > V2 of that patch by now.
> 
> Let's try this again.
> 
> The hotplug problem goes away with the changes that you provided earlier, and

So, that means we're ending up in situations where the NUMA online
mask is a proper superset of the NUMA possible mask.

> shown in the patch below.  I kept this change to get_unbound_pool() as a
> just-in-case measure to explain the crash in the event that it occurs again:
> 
>     if (!cpumask_weight(pool->attrs->cpumask))
>         cpumask_copy(pool->attrs->cpumask, cpumask_of(smp_processor_id()));
> 
> I could also insert 
> 
>     BUG_ON(!cpumask_weight(pool->attrs->cpumask));
> 
> at that place, but I really prefer not to crash the system if there is a workaround.

I don't think so.  It doesn't make logical sense and it isn't
correct: the above could enable CPUs which are explicitly excluded
from a workqueue.  The only fallback which makes sense is falling
back to the default pwq.
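
To illustrate the point, here is a minimal userspace sketch, not the
kernel code.  It assumes a cpumask can be modeled as a 64-bit word, and
mask_t and pick_node_mask() are hypothetical names made up for the
example.  The fallback hands back the workqueue's own allowed mask, so a
CPU outside that mask can never be enabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpumask: one bit per CPU, at most 64 CPUs. */
typedef uint64_t mask_t;

/*
 * Pick the CPU mask for a per-node pool.
 *
 * Returns true and stores the node-specific mask in *out when the node
 * has at least one CPU that the workqueue allows.  Returns false and
 * stores the full allowed mask in *out otherwise, which tells the
 * caller to use the default pwq instead of a per-node one.
 */
static bool pick_node_mask(mask_t allowed, mask_t node_cpus, mask_t *out)
{
	mask_t m = allowed & node_cpus;

	assert(out != NULL);

	if (m == 0) {
		/* Nothing usable on this node: fall back, never widen. */
		*out = allowed;
		return false;
	}

	*out = m;
	return true;
}
```

Either way the resulting mask is a subset of allowed, which copying
cpumask_of(smp_processor_id()) into the pool attrs does not guarantee.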

> > Can you please post the messages with the debug patch from the prev
> > thread?  In fact, let's please continue on that thread.  I'm having a
> > hard time following what's going wrong with the code.
> 
> Are these the failure logs that you requested?
> 
> 
> Red Hat Enterprise Linux Server 7.3 (Maipo)
> Kernel 4.12.0-rc1.wi91275_debug_03.ppc64le+ on an ppc64le
> 
> ltcalpine2-lp20 login: root
> Password: 
> Last login: Wed May 24 18:45:40 from oc1554177480.austin.ibm.com
> [root@ltcalpine2-lp20 ~]# numactl -H
> available: 2 nodes (0,6)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 6 size: 19858 MB
> node 6 free: 16920 MB
> node distances:
> node   0   6 
>   0:  10  40 
>   6:  40  10 
> [root@ltcalpine2-lp20 ~]# numactl -H
> available: 2 nodes (0,6)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 6 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191
> node 6 size: 19858 MB
> node 6 free: 16362 MB
> node distances:
> node   0   6 
>   0:  10  40 
>   6:  40  10 
> [root@ltcalpine2-lp20 ~]# [  321.310943] workqueue:get_unbound_pool has empty cpumask for pool attrs
> [  321.310961] ------------[ cut here ]------------
> [  321.310997] WARNING: CPU: 184 PID: 13201 at kernel/workqueue.c:3375 alloc_unbound_pwq+0x5c0/0x5e0
> [  321.311005] Modules linked in: rpadlpar_io rpaphp dccp_diag dccp tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag sg pseries_rng ghash_generic gf128mul xts vmx_crypto binfmt_misc ip_tables xfs libcrc32c sd_mod ibmvscsi ibmveth scsi_transport_srp dm_mirror dm_region_hash dm_log dm_mod
> [  321.311097] CPU: 184 PID: 13201 Comm: cpuhp/184 Not tainted 4.12.0-rc1.wi91275_debug_03.ppc64le+ #8
> [  321.311106] task: c000000408961080 task.stack: c000000406394000
> [  321.311113] NIP: c000000000116c80 LR: c000000000116c7c CTR: 0000000000000000
> [  321.311121] REGS: c0000004063977b0 TRAP: 0700   Not tainted  (4.12.0-rc1.wi91275_debug_03.ppc64le+)
> [  321.311128] MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE>
> [  321.311150]   CR: 28000082  XER: 00000000
> [  321.311159] CFAR: c000000000a2dc80 SOFTE: 1 
> [  321.311159] GPR00: c000000000116c7c c000000406397a30 c0000000013ae900 000000000000003b 
> [  321.311159] GPR04: c000000408961a38 0000000000000006 00000000a49e41e5 ffffffffa4a5a483 
> [  321.311159] GPR08: 00000000000062cc 0000000000000000 0000000000000000 c000000408961a38 
> [  321.311159] GPR12: 0000000000000000 c00000000fb38c00 c00000000011e858 c00000040a902ac0 
> [  321.311159] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 
> [  321.311159] GPR20: c000000406394000 0000000000000002 c000000406394000 0000000000000000 
> [  321.311159] GPR24: c000000405075400 c000000404fc0000 0000000000000110 c0000000015a4c88 
> [  321.311159] GPR28: 0000000000000000 c0000004fe256000 c0000004fe256008 c0000004fe052800 
> [  321.311290] NIP [c000000000116c80] alloc_unbound_pwq+0x5c0/0x5e0
> [  321.311298] LR [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0
> [  321.311305] Call Trace:
> [  321.311310] [c000000406397a30] [c000000000116c7c] alloc_unbound_pwq+0x5bc/0x5e0 (unreliable)
> [  321.311323] [c000000406397ad0] [c000000000116e30] wq_update_unbound_numa+0x190/0x270
> [  321.311334] [c000000406397b60] [c000000000118eb0] workqueue_offline_cpu+0xe0/0x130
> [  321.311345] [c000000406397bf0] [c0000000000e9f20] cpuhp_invoke_callback+0x240/0xcd0
> [  321.311355] [c000000406397cb0] [c0000000000eab28] cpuhp_down_callbacks+0x78/0xf0
> [  321.311365] [c000000406397d00] [c0000000000eae6c] cpuhp_thread_fun+0x18c/0x1a0
> [  321.311376] [c000000406397d30] [c0000000001251cc] smpboot_thread_fn+0x2fc/0x3b0
> [  321.311386] [c000000406397dc0] [c00000000011e9c0] kthread+0x170/0x1b0
> [  321.311397] [c000000406397e30] [c00000000000b4f4] ret_from_kernel_thread+0x5c/0x68
> [  321.311406] Instruction dump:
> [  321.311413] 3d42fff0 892ac565 2f890000 40fefd98 39200001 3c62ff89 3c82ff6c 3863d590 
> [  321.311437] 38847cb0 992ac565 48916fc9 60000000 <0fe00000> 4bfffd70 60000000 60420000 

The only way offlining can lead to this failure is when the wq NUMA
possible cpumask is a proper subset of the matching online mask.  Can
you please print out the NUMA online cpumask and
wq_numa_possible_cpumask for each node and verify that online stays
within possible?  If not, the ppc arch init code needs to be updated
so that the cpu <-> node binding is established for all possible cpus
on boot.  Note that this isn't a requirement coming solely from wq.
All node affine (thus percpu) allocations depend on that.
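
The check being asked for boils down to a per-node subset test.  A
rough userspace sketch, again assuming 64-bit words in place of real
cpumasks and using a hypothetical helper name, first_bad_node():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the first node whose online CPUs are not all
 * contained in its possible CPUs, or -1 if every node is consistent.
 * online[n] and possible[n] hold one bit per CPU for node n.
 */
static int first_bad_node(const uint64_t *online, const uint64_t *possible,
			  size_t nr_nodes)
{
	assert(online != NULL && possible != NULL);

	for (size_t n = 0; n < nr_nodes; n++) {
		/* Any bit set in online but clear in possible is a violation. */
		if (online[n] & ~possible[n])
			return (int)n;
	}

	return -1;
}
```

A node that gained CPUs through hot-add without its possible mask having
covered them at boot would show up here as a bad node.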

Thanks.

-- 
tejun