2023-06-21 15:35:46

by Chuck Lever III

Subject: contention on pwq->pool->lock under heavy NFS workload

Hi Tejun-

lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
heavily contended lock on my test NFS client. The issue appears to be that the
three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
all get placed in the same worker_pool, so they have to fight over one pool lock.

I notice that ib_comp_wq is allocated with the same flags, but I don't see
significant contention there, and a trace_printk in __queue_work shows that
work items queued on that WQ seem to alternate between at least two different
worker_pools.
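
For reference, the trace_printk instrumentation was nothing elaborate;
roughly a helper like the following, called from __queue_work() once the
pool_workqueue has been selected (untested sketch, hacked into
kernel/workqueue.c; field names follow the structs there):

static void trace_pool_choice(struct workqueue_struct *wq,
                              struct pool_workqueue *pwq)
{
        /* Log which worker_pool will service this work item and
         * which CPU queued it. */
        trace_printk("wq=%s pool_id=%d pool_cpu=%d queued_on_cpu=%d\n",
                     wq->name, pwq->pool->id, pwq->pool->cpu,
                     raw_smp_processor_id());
}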

Is there a preferred way to ensure the NFS WQs get spread a little more fairly
amongst the worker_pools?

Thanks in advance.

--
Chuck Lever




2023-06-21 21:36:23

by Tejun Heo

Subject: Re: contention on pwq->pool->lock under heavy NFS workload

Hello,

On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
> lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
> heavily contended lock on my test NFS client. The issue appears to be that the
> three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
> all get placed in the same worker_pool, so they have to fight over one pool lock.
>
> I notice that ib_comp_wq is allocated with the same flags, but I don't see
> significant contention there, and a trace_printk in __queue_work shows that
> work items queued on that WQ seem to alternate between at least two different
> worker_pools.
>
> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
> amongst the worker_pools?

Can you share the output of lstopo on the test machine?

The following branch has pending workqueue changes which make unbound
workqueues finer-grained by default and a lot more flexible in how they're
segmented.

git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2

Can you please test with the branch? If the default doesn't improve the
situation, you can set WQ_SYSFS on the affected workqueues and change their
scoping by writing to /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope.
Please take a look at

https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350

for more details.
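
As an example (untested sketch, not the actual sunrpc code; flags other
than WQ_SYSFS are illustrative), a workqueue opts into the sysfs interface
at allocation time:

#include <linux/workqueue.h>

static struct workqueue_struct *alloc_sysfs_visible_wq(const char *name)
{
        /* WQ_SYSFS exposes the workqueue's attributes under
         * /sys/devices/virtual/workqueue/<name>/, including the
         * affinity_scope knob on the branch above. */
        return alloc_workqueue("%s", WQ_UNBOUND | WQ_MEM_RECLAIM | WQ_SYSFS,
                               0, name);
}

Writing "cpu", "smt", "cache", "numa", or "system" to that affinity_scope
file then changes how the workqueue's pwqs are segmented.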

Thanks.

--
tejun

2023-06-22 14:42:51

by Chuck Lever III

Subject: Re: contention on pwq->pool->lock under heavy NFS workload



> On Jun 21, 2023, at 5:28 PM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
>> lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
>> heavily contended lock on my test NFS client. The issue appears to be that the
>> three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
>> all get placed in the same worker_pool, so they have to fight over one pool lock.
>>
>> I notice that ib_comp_wq is allocated with the same flags, but I don't see
>> significant contention there, and a trace_printk in __queue_work shows that
>> work items queued on that WQ seem to alternate between at least two different
>> worker_pools.
>>
>> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
>> amongst the worker_pools?
>
> Can you share the output of lstopo on the test machine?

Machine (P#0 total=32480548KB DMIProductName="Super Server" DMIProductVersion=0123456789 DMIBoardVendor=Supermicro DMIBoardName=X12SPL-F DMIBoardVersion=2.00 DMIBoardAssetTag="Base Board Asset Tag" DMIChassisVendor=Supermicro DMIChassisType=17 DMIChassisVersion=0123456789 DMIChassisAssetTag="Chassis Asset Tag" DMIBIOSVendor="American Megatrends International, LLC." DMIBIOSVersion=1.1a DMIBIOSDate=08/05/2021 DMISysVendor=Supermicro Backend=Linux LinuxCgroup=/ OSName=Linux OSRelease=6.4.0-rc7-00005-ga0c30c01f971 OSVersion="#8 SMP PREEMPT Wed Jun 21 11:29:02 EDT 2023" HostName=morisot.XXXXXXXXXXX.net Architecture=x86_64 hwlocVersion=2.5.0 ProcessName=lstopo)
Package L#0 (P#0 total=32480548KB CPUVendor=GenuineIntel CPUFamilyNumber=6 CPUModelNumber=106 CPUModel="Intel(R) Xeon(R) Silver 4310 CPU @ 2.10GHz" CPUStepping=6)
NUMANode L#0 (P#0 local=32480548KB total=32480548KB)
L3Cache L#0 (size=18432KB linesize=64 ways=12 Inclusive=0)
L2Cache L#0 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#0 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#0 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#0 (P#0)
PU L#0 (P#0)
L2Cache L#1 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#1 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#1 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#1 (P#1)
PU L#1 (P#1)
L2Cache L#2 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#2 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#2 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#2 (P#2)
PU L#2 (P#2)
L2Cache L#3 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#3 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#3 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#3 (P#3)
PU L#3 (P#3)
L2Cache L#4 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#4 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#4 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#4 (P#4)
PU L#4 (P#4)
L2Cache L#5 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#5 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#5 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#5 (P#5)
PU L#5 (P#5)
L2Cache L#6 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#6 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#6 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#6 (P#6)
PU L#6 (P#6)
L2Cache L#7 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#7 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#7 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#7 (P#7)
PU L#7 (P#7)
L2Cache L#8 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#8 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#8 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#8 (P#8)
PU L#8 (P#8)
L2Cache L#9 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#9 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#9 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#9 (P#9)
PU L#9 (P#9)
L2Cache L#10 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#10 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#10 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#10 (P#10)
PU L#10 (P#10)
L2Cache L#11 (size=1280KB linesize=64 ways=20 Inclusive=0)
L1dCache L#11 (size=48KB linesize=64 ways=12 Inclusive=0)
L1iCache L#11 (size=32KB linesize=64 ways=8 Inclusive=0)
Core L#11 (P#11)
PU L#11 (P#11)
HostBridge L#0 (buses=0000:[00-07])
PCI L#0 (busid=0000:00:11.5 id=8086:a1d2 class=0106(SATA))
Block(Removable Media Device) L#0 (Size=1048575 SectorSize=512 LinuxDeviceID=11:0 Model=ASUS_DRW-24F1ST_b Revision=1.00 SerialNumber=E5D0CL034213) "sr0"
PCI L#1 (busid=0000:00:17.0 id=8086:a182 class=0106(SATA))
PCIBridge L#1 (busid=0000:00:1c.0 id=8086:a190 class=0604(PCIBridge) link=0.25GB/s buses=0000:[01-01])
PCI L#2 (busid=0000:01:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
Network L#1 (Address=3c:ec:ef:7a:0b:fa) "eno1"
PCIBridge L#2 (busid=0000:00:1c.1 id=8086:a191 class=0604(PCIBridge) link=0.25GB/s buses=0000:[02-02])
PCI L#3 (busid=0000:02:00.0 id=8086:1533 class=0200(Ethernet) link=0.25GB/s)
Network L#2 (Address=3c:ec:ef:7a:0b:fb) "eno2"
PCIBridge L#3 (busid=0000:00:1c.5 id=8086:a195 class=0604(PCIBridge) link=0.62GB/s buses=0000:[05-06])
PCIBridge L#4 (busid=0000:05:00.0 id=1a03:1150 class=0604(PCIBridge) link=0.62GB/s buses=0000:[06-06])
PCI L#4 (busid=0000:06:00.0 id=1a03:2000 class=0300(VGA))
PCIBridge L#5 (busid=0000:00:1d.0 id=8086:a198 class=0604(PCIBridge) link=3.94GB/s buses=0000:[07-07])
PCI L#5 (busid=0000:07:00.0 id=c0a9:540a class=0108(NVMExp) link=3.94GB/s)
Block(Disk) L#3 (Size=244198584 SectorSize=512 LinuxDeviceID=259:0 Model=CT250P2SSD8 Revision=P2CR012 SerialNumber=2116E597CC4F) "nvme0n1"
HostBridge L#6 (buses=0000:[17-18])
PCIBridge L#7 (busid=0000:17:02.0 id=8086:347a class=0604(PCIBridge) link=15.75GB/s buses=0000:[18-18])
PCI L#6 (busid=0000:18:00.0 id=15b3:1017 class=0200(Ethernet) link=15.75GB/s PCISlot=6)
Network L#4 (Address=ec:0d:9a:92:b2:46 Port=1) "ens6np0"
OpenFabrics L#5 (NodeGUID=ec0d:9a03:0092:b246 SysImageGUID=ec0d:9a03:0092:b246 Port1State=4 Port1LID=0x0 Port1LMC=0 Port1GID0=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID1=fe80:0000:0000:0000:ee0d:9aff:fe92:b246 Port1GID2=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID3=0000:0000:0000:0000:0000:ffff:c0a8:6443 Port1GID4=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID5=0000:0000:0000:0000:0000:ffff:c0a8:6743 Port1GID6=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID7=fe80:0000:0000:0000:4cd6:043b:b8d6:ecd2 Port1GID8=fe80:0000:0000:0000:88dd:0692:352e:0cec Port1GID9=fe80:0000:0000:0000:88dd:0692:352e:0cec) "rocep24s0"
HostBridge L#8 (buses=0000:[50-51])
PCIBridge L#9 (busid=0000:50:04.0 id=8086:347c class=0604(PCIBridge) link=15.75GB/s buses=0000:[51-51])
PCI L#7 (busid=0000:51:00.0 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
Network L#6 (Address=00:00:05:f4:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0a Port=1) "ibs4f0"
OpenFabrics L#7 (NodeGUID=b8ce:f603:0037:7a0a SysImageGUID=b8ce:f603:0037:7a0a Port1State=4 Port1LID=0xc Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0a) "ibp81s0f0"
PCI L#8 (busid=0000:51:00.1 id=15b3:101b class=0207(InfiniBand) link=15.75GB/s PCISlot=4)
Network L#8 (Address=00:00:03:d3:fe:80:00:00:00:00:00:00:b8:ce:f6:03:00:37:7a:0b Port=1) "ibs4f1"
OpenFabrics L#9 (NodeGUID=b8ce:f603:0037:7a0b SysImageGUID=b8ce:f603:0037:7a0a Port1State=1 Port1LID=0xffff Port1LMC=0 Port1GID0=fe80:0000:0000:0000:b8ce:f603:0037:7a0b) "ibp81s0f1"
depth 0: 1 Machine (type #0)
depth 1: 1 Package (type #1)
depth 2: 1 L3Cache (type #6)
depth 3: 12 L2Cache (type #5)
depth 4: 12 L1dCache (type #4)
depth 5: 12 L1iCache (type #9)
depth 6: 12 Core (type #2)
depth 7: 12 PU (type #3)
Special depth -3: 1 NUMANode (type #13)
Special depth -4: 10 Bridge (type #14)
Special depth -5: 9 PCIDev (type #15)
Special depth -6: 10 OSDev (type #16)
Memory attribute #2 name `Bandwidth' flags 5
NUMANode L#0 = 1790 from cpuset 0x00000fff (Machine L#0)
Memory attribute #3 name `Latency' flags 6
NUMANode L#0 = 7600 from cpuset 0x00000fff (Machine L#0)
CPU kind #0 efficiency 0 cpuset 0x00000fff
FrequencyMaxMHz = 3300


> The following branch has pending workqueue changes which make unbound
> workqueues finer-grained by default and a lot more flexible in how they're
> segmented.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
>
> Can you please test with the branch? If the default doesn't improve the
> situation, you can set WQ_SYSFS on the affected workqueues and change their
> scoping by writing to /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope.
> Please take a look at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350
>
> for more details.

I will give this a try.


--
Chuck Lever



2023-06-22 16:11:33

by Chuck Lever III

Subject: Re: contention on pwq->pool->lock under heavy NFS workload



> On Jun 21, 2023, at 5:28 PM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Wed, Jun 21, 2023 at 03:26:22PM +0000, Chuck Lever III wrote:
>> lock_stat reports that the pool->lock at kernel/workqueue.c:1483 is the most
>> heavily contended lock on my test NFS client. The issue appears to be that the
>> three NFS-related workqueues, rpciod_workqueue, xprtiod_workqueue, and nfsiod,
>> all get placed in the same worker_pool, so they have to fight over one pool lock.
>>
>> I notice that ib_comp_wq is allocated with the same flags, but I don't see
>> significant contention there, and a trace_printk in __queue_work shows that
>> work items queued on that WQ seem to alternate between at least two different
>> worker_pools.
>>
>> Is there a preferred way to ensure the NFS WQs get spread a little more fairly
>> amongst the worker_pools?
>
> Can you share the output of lstopo on the test machine?
>
> The following branch has pending workqueue changes which make unbound
> workqueues finer-grained by default and a lot more flexible in how they're
> segmented.
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git affinity-scopes-v2
>
> Can you please test with the branch? If the default doesn't improve the
> situation, you can set WQ_SYSFS on the affected workqueues and change their
> scoping by writing to /sys/devices/virtual/workqueue/WQ_NAME/affinity_scope.
> Please take a look at
>
> https://git.kernel.org/pub/scm/linux/kernel/git/tj/wq.git/tree/Documentation/core-api/workqueue.rst?h=affinity-scopes-v2#n350
>
> for more details.

The good news:

On stock 6.4-rc7:

fio 8k [r=108k,w=46.9k IOPS]

On the affinity-scopes-v2 branch (with no other tuning):

fio 8k [r=130k,w=55.9k IOPS]


The bad news:

pool->lock is still the hottest lock on the system during the test.


I'll try some of the alternate scope settings this afternoon.


--
Chuck Lever



2023-06-22 19:52:00

by Chuck Lever III

Subject: Re: contention on pwq->pool->lock under heavy NFS workload



> On Jun 22, 2023, at 3:23 PM, Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Thu, Jun 22, 2023 at 03:45:18PM +0000, Chuck Lever III wrote:
>> The good news:
>>
>> On stock 6.4-rc7:
>>
>> fio 8k [r=108k,w=46.9k IOPS]
>>
>> On the affinity-scopes-v2 branch (with no other tuning):
>>
>> fio 8k [r=130k,w=55.9k IOPS]
>
> Ah, okay, that's probably coming from per-cpu pwq. Didn't expect that to
> make that much difference but that's nice.

"cpu" and "smt" work equally well on this system.

"cache", "numa", and "system" work equally poorly.

I have HT disabled, and there's only one NUMA node, so
the difference here is plausible.


>> The bad news:
>>
>> pool->lock is still the hottest lock on the system during the test.
>>
>> I'll try some of the alternate scope settings this afternoon.
>
> Yeah, in your system, there's still gonna be one pool shared across all
> CPUs. SMT or CPU may behave better but it might make sense to add a way to
> further segment the scope so that e.g. one can split a cache domain N-ways.

If there could be more than one pool to choose from, then these
WQs would not be hitting the same lock. Alternatively, finding a
lockless way to queue the work on a pool would be a huge win.


--
Chuck Lever



2023-06-22 19:52:45

by Tejun Heo

Subject: Re: contention on pwq->pool->lock under heavy NFS workload

Hello,

On Thu, Jun 22, 2023 at 03:45:18PM +0000, Chuck Lever III wrote:
> The good news:
>
> On stock 6.4-rc7:
>
> fio 8k [r=108k,w=46.9k IOPS]
>
> On the affinity-scopes-v2 branch (with no other tuning):
>
> fio 8k [r=130k,w=55.9k IOPS]

Ah, okay, that's probably coming from per-cpu pwq. Didn't expect that to
make that much difference but that's nice.

> The bad news:
>
> pool->lock is still the hottest lock on the system during the test.
>
> I'll try some of the alternate scope settings this afternoon.

Yeah, in your system, there's still gonna be one pool shared across all
CPUs. SMT or CPU may behave better but it might make sense to add a way to
further segment the scope so that e.g. one can split a cache domain N-ways.

Thanks.

--
tejun

2023-06-23 14:47:00

by Chuck Lever III

Subject: Re: contention on pwq->pool->lock under heavy NFS workload



> On Jun 22, 2023, at 3:39 PM, Chuck Lever III <[email protected]> wrote:
>
>
>
>> On Jun 22, 2023, at 3:23 PM, Tejun Heo <[email protected]> wrote:
>>
>> Hello,
>>
>> On Thu, Jun 22, 2023 at 03:45:18PM +0000, Chuck Lever III wrote:
>>> The good news:
>>>
>>> On stock 6.4-rc7:
>>>
>>> fio 8k [r=108k,w=46.9k IOPS]
>>>
>>> On the affinity-scopes-v2 branch (with no other tuning):
>>>
>>> fio 8k [r=130k,w=55.9k IOPS]
>>
>> Ah, okay, that's probably coming from per-cpu pwq. Didn't expect that to
>> make that much difference but that's nice.
>
> "cpu" and "smt" work equally well on this system.
>
> "cache", "numa", and "system" work equally poorly.
>
> I have HT disabled, and there's only one NUMA node, so
> the difference here is plausible.
>
>
>>> The bad news:
>>>
>>> pool->lock is still the hottest lock on the system during the test.
>>>
>>> I'll try some of the alternate scope settings this afternoon.
>>
>> Yeah, in your system, there's still gonna be one pool shared across all
>> CPUs. SMT or CPU may behave better but it might make sense to add a way to
>> further segment the scope so that e.g. one can split a cache domain N-ways.
>
> If there could be more than one pool to choose from, then these
> WQs would not be hitting the same lock. Alternatively, finding a
> lockless way to queue the work on a pool would be a huge win.

Following up with a few more tests.

I'm using NFS/RDMA for my test because I can drive more IOPS with it.

I've found that setting the nfsiod and rpciod workqueues to "cpu"
scope provides the best benefit for this workload. Changing the
xprtiod workqueue to "cpu" had no discernible effect.

This tracks with the number of queue_work calls for each of these
WQs. 59% of queue_work calls during the test are for the rpciod
WQ, 21% are for nfsiod, and 2% are for xprtiod.


The same test with TCP (using IP-over-IB on the same physical network)
shows no improvement on any test. That suggests there is a bottleneck
somewhere else, when using TCP, that limits its throughput.


--
Chuck Lever



2023-06-24 01:45:37

by Tejun Heo

Subject: Re: contention on pwq->pool->lock under heavy NFS workload

Hey,

On Fri, Jun 23, 2023 at 02:37:17PM +0000, Chuck Lever III wrote:
> I'm using NFS/RDMA for my test because I can drive more IOPS with it.
>
> I've found that setting the nfsiod and rpciod workqueues to "cpu"
> scope provides the best benefit for this workload. Changing the
> xprtiod workqueue to "cpu" had no discernible effect.
>
> This tracks with the number of queue_work calls for each of these
> WQs. 59% of queue_work calls during the test are for the rpciod
> WQ, 21% are for nfsiod, and 2% are for xprtiod.
>
> The same test with TCP (using IP-over-IB on the same physical network)
> shows no improvement on any test. That suggests there is a bottleneck
> somewhere else, when using TCP, that limits its throughput.

Yeah, you can make the necessary workqueues default to CPU or SMT scope
using apply_workqueue_attrs(). The interface is a bit cumbersome and we
probably wanna add convenience helpers to switch e.g. affinity scopes but
it's still just several lines of code.
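
Roughly something like this from built-in kernel code (untested sketch;
WQ_AFFN_CPU is assumed to be the scope enum name on the branch, and taking
the CPU hotplug read lock around the call mirrors what existing in-tree
callers of apply_workqueue_attrs() do):

#include <linux/cpu.h>
#include <linux/workqueue.h>

static int wq_set_cpu_scope(struct workqueue_struct *wq)
{
        struct workqueue_attrs *attrs;
        int ret;

        attrs = alloc_workqueue_attrs();
        if (!attrs)
                return -ENOMEM;

        /* request per-CPU affinity scope for this unbound workqueue */
        attrs->affn_scope = WQ_AFFN_CPU;

        cpus_read_lock();
        ret = apply_workqueue_attrs(wq, attrs);
        cpus_read_unlock();

        free_workqueue_attrs(attrs);
        return ret;
}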

Thanks.

--
tejun

2023-06-25 16:12:46

by Chuck Lever III

Subject: Re: contention on pwq->pool->lock under heavy NFS workload

Hi Tejun-


> On Jun 23, 2023, at 9:44 PM, Tejun Heo <[email protected]> wrote:
>
> Hey,
>
> On Fri, Jun 23, 2023 at 02:37:17PM +0000, Chuck Lever III wrote:
>> I'm using NFS/RDMA for my test because I can drive more IOPS with it.
>>
>> I've found that setting the nfsiod and rpciod workqueues to "cpu"
>> scope provides the best benefit for this workload. Changing the
>> xprtiod workqueue to "cpu" had no discernible effect.
>>
>> This tracks with the number of queue_work calls for each of these
>> WQs. 59% of queue_work calls during the test are for the rpciod
>> WQ, 21% are for nfsiod, and 2% are for xprtiod.
>>
>> The same test with TCP (using IP-over-IB on the same physical network)
>> shows no improvement on any test. That suggests there is a bottleneck
>> somewhere else, when using TCP, that limits its throughput.
>
> Yeah, you can make the necessary workqueues default to CPU or SMT scope
> using apply_workqueue_attrs(). The interface is a bit cumbersome and we
> probably wanna add convenience helpers to switch e.g. affinity scopes but
> it's still just several lines of code.

static ssize_t wq_affn_scope_store(struct device *dev,
                                   struct device_attribute *attr,
                                   const char *buf, size_t count)
{
        struct workqueue_struct *wq = dev_to_wq(dev);
        struct workqueue_attrs *attrs;
        int affn, ret = -ENOMEM;

        affn = parse_affn_scope(buf);
        if (affn < 0)
                return affn;

        apply_wqattrs_lock();                   <<< takes &wq_pool_mutex
        attrs = wq_sysfs_prep_attrs(wq);        <<< copies the wq_attrs
        if (attrs) {
                attrs->affn_scope = affn;
                ret = apply_workqueue_attrs_locked(wq, attrs);
        }
        apply_wqattrs_unlock();
        free_workqueue_attrs(attrs);
        return ret ?: count;
}

Both wq_pool_mutex and copy_workqueue_attrs() are static, so having
only apply_workqueue_attrs() is not yet enough to carry this off
in workqueue consumers such as sunrpc.ko.

It looks like padata_setup_cpumasks(), for example, holds the CPU read
lock but doesn't take wq_pool_mutex. apply_wqattrs_prepare() has a
"lockdep_assert_held(&wq_pool_mutex);".

I can wait for a v3 of this series so you can construct the public
API the way you prefer.


--
Chuck Lever



2023-06-26 20:25:19

by Tejun Heo

Subject: Re: contention on pwq->pool->lock under heavy NFS workload

On Sun, Jun 25, 2023 at 04:01:38PM +0000, Chuck Lever III wrote:
> Both wq_pool_mutex and copy_workqueue_attrs() are static, so having
> only apply_workqueue_attrs() is not yet enough to carry this off
> in workqueue consumers such as sunrpc.ko.
>
> It looks like padata_setup_cpumasks(), for example, holds the CPU read
> lock but doesn't take wq_pool_mutex. apply_wqattrs_prepare() has a
> "lockdep_assert_held(&wq_pool_mutex);".
>
> I can wait for a v3 of this series so you can construct the public
> API the way you prefer.

Ah, indeed. That API isn't really useable right now. It's gonna be a while
before the affinity scope patches are applied. I'll fix up the apply
interface afterwards.

Thanks.

--
tejun