2017-06-29 13:25:37

by Olga Kornievskaia

Subject: [RFC] fix parallelism for rpc tasks

Hi folks,

On a multi-core machine, is it expected that we can have parallel RPCs
handled by each of the per-core workqueue?

In testing a read workload, observing via "top" command that a single
"kworker" thread is running servicing the requests (no parallelism).
It's more prominent while doing these operations over krb5p mount.

What has been suggested by Bruce is to try this and in my testing I
see then the read workload spread among all the kworker threads.

Signed-off-by: Olga Kornievskaia <[email protected]>

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 0cc8383..f80e688 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1095,7 +1095,7 @@ static int rpciod_start(void)
* Create the rpciod thread and wait for it to start.
*/
dprintk("RPC: creating workqueue rpciod\n");
- wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
+ wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
if (!wq)
goto out_failed;
rpciod_workqueue = wq;
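
For context, a minimal sketch of how work reaches this queue, assuming the
usual SUNRPC pattern of running each async RPC task as a work item via
queue_work(); the demo_* names below are illustrative, not the real sched.c
symbols. On a bound (default) workqueue, items queued from a CPU execute in
that CPU's worker pool, which is consistent with the single-kworker behaviour
observed above; WQ_UNBOUND moves execution to a shared pool of unbound workers.

#include <linux/workqueue.h>

/* Illustrative only: the demo_* names are not the actual sched.c symbols. */
static struct workqueue_struct *demo_rpciod_wq;

struct demo_task {
	struct work_struct work;	/* plays the role of rpc_task->u.tk_work */
};

static void demo_execute(struct work_struct *work)
{
	/* On a bound queue this runs in the worker pool of the CPU that
	 * called queue_work(); with WQ_UNBOUND it may run on any CPU. */
}

static int demo_start(bool unbound)
{
	unsigned int flags = WQ_MEM_RECLAIM | (unbound ? WQ_UNBOUND : 0);

	demo_rpciod_wq = alloc_workqueue("rpciod-demo", flags, 0);
	return demo_rpciod_wq ? 0 : -ENOMEM;
}

static void demo_submit(struct demo_task *t)
{
	INIT_WORK(&t->work, demo_execute);
	queue_work(demo_rpciod_wq, &t->work);
}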


2017-07-03 14:58:48

by Trond Myklebust

Subject: Re: [RFC] fix parallelism for rpc tasks

On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
> Hi folks,
> 
> On a multi-core machine, is it expected that we can have parallel
> RPCs
> handled by each of the per-core workqueue?
> 
> In testing a read workload, observing via "top" command that a single
> "kworker" thread is running servicing the requests (no parallelism).
> It's more prominent while doing these operations over krb5p mount.
> 
> What has been suggested by Bruce is to try this and in my testing I
> see then the read workload spread among all the kworker threads.
> 
> Signed-off-by: Olga Kornievskaia <[email protected]>
> 
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 0cc8383..f80e688 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>  * Create the rpciod thread and wait for it to start.
>  */
>  dprintk("RPC:       creating workqueue rpciod\n");
> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>  if (!wq)
>  goto out_failed;
>  rpciod_workqueue = wq;
> 

WQ_UNBOUND turns off concurrency management on the thread pool (See
Documentation/core-api/workqueue.rst. It also means we contend for work
item queuing/dequeuing locks, since the threads which run the work
items are not bound to a CPU.

IOW: This is not a slam-dunk obvious gain.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-07-05 14:44:34

by Olga Kornievskaia

Subject: Re: [RFC] fix parallelism for rpc tasks

On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
<[email protected]> wrote:
> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>> Hi folks,
>>
>> On a multi-core machine, is it expected that we can have parallel
>> RPCs
>> handled by each of the per-core workqueue?
>>
>> In testing a read workload, observing via "top" command that a single
>> "kworker" thread is running servicing the requests (no parallelism).
>> It's more prominent while doing these operations over krb5p mount.
>>
>> What has been suggested by Bruce is to try this and in my testing I
>> see then the read workload spread among all the kworker threads.
>>
>> Signed-off-by: Olga Kornievskaia <[email protected]>
>>
>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> index 0cc8383..f80e688 100644
>> --- a/net/sunrpc/sched.c
>> +++ b/net/sunrpc/sched.c
>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>> * Create the rpciod thread and wait for it to start.
>> */
>> dprintk("RPC: creating workqueue rpciod\n");
>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>> if (!wq)
>> goto out_failed;
>> rpciod_workqueue = wq;
>>
>
> WQ_UNBOUND turns off concurrency management on the thread pool (See
> Documentation/core-api/workqueue.rst. It also means we contend for work
> item queuing/dequeuing locks, since the threads which run the work
> items are not bound to a CPU.
>
> IOW: This is not a slam-dunk obvious gain.

I agree but I think it's worth consideration. I'm waiting to get
(real) performance numbers of improvement (instead of my VM setup) to
help my case. However, it was reported 90% degradation for the read
performance over krb5p when 1CPU is executing all ops.

Is there a different way to make sure that on a multi-processor
machine we can take advantage of all available CPUs? Simple kernel
threads instead of a work queue?

Can/should we have an WQ_UNBOUND work queue for secure mounts and
another queue for other mounts?

While I wouldn't call krb5 load long running, Documentation says that
an example for WQ_UNBOUND is for CPU intensive workloads. And also in
general "work items are not expected to hog a CPU and consume many
cycles". How "many" is too "many". How many operations are crypto
operations?

2017-07-05 15:11:37

by Chuck Lever III

Subject: Re: [RFC] fix parallelism for rpc tasks


> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]> wrote:
>
> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
> <[email protected]> wrote:
>> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>> Hi folks,
>>>
>>> On a multi-core machine, is it expected that we can have parallel
>>> RPCs
>>> handled by each of the per-core workqueue?
>>>
>>> In testing a read workload, observing via "top" command that a single
>>> "kworker" thread is running servicing the requests (no parallelism).
>>> It's more prominent while doing these operations over krb5p mount.
>>>
>>> What has been suggested by Bruce is to try this and in my testing I
>>> see then the read workload spread among all the kworker threads.
>>>
>>> Signed-off-by: Olga Kornievskaia <[email protected]>
>>>
>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>> index 0cc8383..f80e688 100644
>>> --- a/net/sunrpc/sched.c
>>> +++ b/net/sunrpc/sched.c
>>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>> * Create the rpciod thread and wait for it to start.
>>> */
>>> dprintk("RPC: creating workqueue rpciod\n");
>>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 0);
>>> if (!wq)
>>> goto out_failed;
>>> rpciod_workqueue = wq;
>>>
>>
>> WQ_UNBOUND turns off concurrency management on the thread pool (See
>> Documentation/core-api/workqueue.rst. It also means we contend for work
>> item queuing/dequeuing locks, since the threads which run the work
>> items are not bound to a CPU.
>>
>> IOW: This is not a slam-dunk obvious gain.
>
> I agree but I think it's worth consideration. I'm waiting to get
> (real) performance numbers of improvement (instead of my VM setup) to
> help my case. However, it was reported 90% degradation for the read
> performance over krb5p when 1CPU is executing all ops.
>
> Is there a different way to make sure that on a multi-processor
> machine we can take advantage of all available CPUs? Simple kernel
> threads instead of a work queue?

There is a trade-off between spreading the work, and ensuring it
is executed on a CPU close to the I/O and application. IMO UNBOUND
is a good way to do that. UNBOUND will attempt to schedule the
work on the preferred CPU, but allow it to be migrated if that
CPU is busy.

The advantage of this is that when the client workload is CPU
intensive (say, a software build), RPC client work can be scheduled
and run more quickly, which reduces latency.
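
A concrete illustration of the two targeting modes described above (a sketch,
not code from the NFS client; CPU 3 is an arbitrary example): queue_work()
requests WORK_CPU_UNBOUND, meaning "prefer the submitting CPU", while
queue_work_on() names an explicit CPU.

#include <linux/workqueue.h>

/* Sketch: default targeting vs. explicit CPU targeting. */
static void submit_examples(struct workqueue_struct *wq,
			    struct work_struct *a, struct work_struct *b)
{
	/* Equivalent to queue_work_on(WORK_CPU_UNBOUND, wq, a): prefer the
	 * local CPU; how strictly that holds depends on whether wq was
	 * created with WQ_UNBOUND. */
	queue_work(wq, a);

	/* Explicitly target CPU 3's worker pool. */
	queue_work_on(3, wq, b);
}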


> Can/should we have an WQ_UNBOUND work queue for secure mounts and
> another queue for other mounts?
>
> While I wouldn't call krb5 load long running, Documentation says that
> an example for WQ_UNBOUND is for CPU intensive workloads. And also in
> general "work items are not expected to hog a CPU and consume many
> cycles". How "many" is too "many". How many operations are crypto
> operations?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Chuck Lever




2017-07-05 15:46:39

by Trond Myklebust

Subject: Re: [RFC] fix parallelism for rpc tasks

On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
> > On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
> > wrote:
> > 
> > On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
> > <[email protected]> wrote:
> > > On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
> > > > Hi folks,
> > > > 
> > > > On a multi-core machine, is it expected that we can have
> > > > parallel
> > > > RPCs
> > > > handled by each of the per-core workqueue?
> > > > 
> > > > In testing a read workload, observing via "top" command that a
> > > > single
> > > > "kworker" thread is running servicing the requests (no
> > > > parallelism).
> > > > It's more prominent while doing these operations over krb5p
> > > > mount.
> > > > 
> > > > What has been suggested by Bruce is to try this and in my
> > > > testing I
> > > > see then the read workload spread among all the kworker
> > > > threads.
> > > > 
> > > > Signed-off-by: Olga Kornievskaia <[email protected]>
> > > > 
> > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > > index 0cc8383..f80e688 100644
> > > > --- a/net/sunrpc/sched.c
> > > > +++ b/net/sunrpc/sched.c
> > > > @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
> > > >  * Create the rpciod thread and wait for it to start.
> > > >  */
> > > >  dprintk("RPC:       creating workqueue rpciod\n");
> > > > - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
> > > > + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND,
> > > > 0);
> > > >  if (!wq)
> > > >  goto out_failed;
> > > >  rpciod_workqueue = wq;
> > > > 
> > > 
> > > WQ_UNBOUND turns off concurrency management on the thread pool
> > > (See
> > > Documentation/core-api/workqueue.rst. It also means we contend
> > > for work
> > > item queuing/dequeuing locks, since the threads which run the
> > > work
> > > items are not bound to a CPU.
> > > 
> > > IOW: This is not a slam-dunk obvious gain.
> > 
> > I agree but I think it's worth consideration. I'm waiting to get
> > (real) performance numbers of improvement (instead of my VM setup)
> > to
> > help my case. However, it was reported 90% degradation for the read
> > performance over krb5p when 1CPU is executing all ops.
> > 
> > Is there a different way to make sure that on a multi-processor
> > machine we can take advantage of all available CPUs? Simple kernel
> > threads instead of a work queue?
> 
> There is a trade-off between spreading the work, and ensuring it
> is executed on a CPU close to the I/O and application. IMO UNBOUND
> is a good way to do that. UNBOUND will attempt to schedule the
> work on the preferred CPU, but allow it to be migrated if that
> CPU is busy.
> 
> The advantage of this is that when the client workload is CPU
> intensive (say, a software build), RPC client work can be scheduled
> and run more quickly, which reduces latency.
> 

That should no longer be a huge issue, since queue_work() will now
default to the WORK_CPU_UNBOUND flag, which prefers the local CPU, but
will schedule elsewhere if the local CPU is congested.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-07-05 16:09:21

by Olga Kornievskaia

Subject: Re: [RFC] fix parallelism for rpc tasks

On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
<[email protected]> wrote:
> On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>> > On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
>> > wrote:
>> >
>> > On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>> > <[email protected]> wrote:
>> > > On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>> > > > Hi folks,
>> > > >
>> > > > On a multi-core machine, is it expected that we can have
>> > > > parallel
>> > > > RPCs
>> > > > handled by each of the per-core workqueue?
>> > > >
>> > > > In testing a read workload, observing via "top" command that a
>> > > > single
>> > > > "kworker" thread is running servicing the requests (no
>> > > > parallelism).
>> > > > It's more prominent while doing these operations over krb5p
>> > > > mount.
>> > > >
>> > > > What has been suggested by Bruce is to try this and in my
>> > > > testing I
>> > > > see then the read workload spread among all the kworker
>> > > > threads.
>> > > >
>> > > > Signed-off-by: Olga Kornievskaia <[email protected]>
>> > > >
>> > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> > > > index 0cc8383..f80e688 100644
>> > > > --- a/net/sunrpc/sched.c
>> > > > +++ b/net/sunrpc/sched.c
>> > > > @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>> > > > * Create the rpciod thread and wait for it to start.
>> > > > */
>> > > > dprintk("RPC: creating workqueue rpciod\n");
>> > > > - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>> > > > + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND,
>> > > > 0);
>> > > > if (!wq)
>> > > > goto out_failed;
>> > > > rpciod_workqueue = wq;
>> > > >
>> > >
>> > > WQ_UNBOUND turns off concurrency management on the thread pool
>> > > (See
>> > > Documentation/core-api/workqueue.rst. It also means we contend
>> > > for work
>> > > item queuing/dequeuing locks, since the threads which run the
>> > > work
>> > > items are not bound to a CPU.
>> > >
>> > > IOW: This is not a slam-dunk obvious gain.
>> >
>> > I agree but I think it's worth consideration. I'm waiting to get
>> > (real) performance numbers of improvement (instead of my VM setup)
>> > to
>> > help my case. However, it was reported 90% degradation for the read
>> > performance over krb5p when 1CPU is executing all ops.
>> >
>> > Is there a different way to make sure that on a multi-processor
>> > machine we can take advantage of all available CPUs? Simple kernel
>> > threads instead of a work queue?
>>
>> There is a trade-off between spreading the work, and ensuring it
>> is executed on a CPU close to the I/O and application. IMO UNBOUND
>> is a good way to do that. UNBOUND will attempt to schedule the
>> work on the preferred CPU, but allow it to be migrated if that
>> CPU is busy.
>>
>> The advantage of this is that when the client workload is CPU
>> intensive (say, a software build), RPC client work can be scheduled
>> and run more quickly, which reduces latency.
>>
>
> That should no longer be a huge issue, since queue_work() will now
> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU, but
> will schedule elsewhere if the local CPU is congested.

I don't believe NFS use workqueue_congested() to somehow schedule the
work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
there is any intention of balancing the CPU load.

>
> --
> Trond Myklebust
> Linux NFS client maintainer, PrimaryData
> [email protected]

2017-07-05 16:15:02

by Trond Myklebust

Subject: Re: [RFC] fix parallelism for rpc tasks

On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
> <[email protected]> wrote:
> > On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
> > > > On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
> > > > wrote:
> > > > 
> > > > On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
> > > > <[email protected]> wrote:
> > > > > On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
> > > > > > Hi folks,
> > > > > > 
> > > > > > On a multi-core machine, is it expected that we can have
> > > > > > parallel
> > > > > > RPCs
> > > > > > handled by each of the per-core workqueue?
> > > > > > 
> > > > > > In testing a read workload, observing via "top" command
> > > > > > that a
> > > > > > single
> > > > > > "kworker" thread is running servicing the requests (no
> > > > > > parallelism).
> > > > > > It's more prominent while doing these operations over krb5p
> > > > > > mount.
> > > > > > 
> > > > > > What has been suggested by Bruce is to try this and in my
> > > > > > testing I
> > > > > > see then the read workload spread among all the kworker
> > > > > > threads.
> > > > > > 
> > > > > > Signed-off-by: Olga Kornievskaia <[email protected]>
> > > > > > 
> > > > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > > > > index 0cc8383..f80e688 100644
> > > > > > --- a/net/sunrpc/sched.c
> > > > > > +++ b/net/sunrpc/sched.c
> > > > > > @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
> > > > > >  * Create the rpciod thread and wait for it to start.
> > > > > >  */
> > > > > >  dprintk("RPC:       creating workqueue rpciod\n");
> > > > > > - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
> > > > > > + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM |
> > > > > > WQ_UNBOUND,
> > > > > > 0);
> > > > > >  if (!wq)
> > > > > >  goto out_failed;
> > > > > >  rpciod_workqueue = wq;
> > > > > > 
> > > > > 
> > > > > WQ_UNBOUND turns off concurrency management on the thread
> > > > > pool
> > > > > (See
> > > > > Documentation/core-api/workqueue.rst. It also means we
> > > > > contend
> > > > > for work
> > > > > item queuing/dequeuing locks, since the threads which run the
> > > > > work
> > > > > items are not bound to a CPU.
> > > > > 
> > > > > IOW: This is not a slam-dunk obvious gain.
> > > > 
> > > > I agree but I think it's worth consideration. I'm waiting to
> > > > get
> > > > (real) performance numbers of improvement (instead of my VM
> > > > setup)
> > > > to
> > > > help my case. However, it was reported 90% degradation for the
> > > > read
> > > > performance over krb5p when 1CPU is executing all ops.
> > > > 
> > > > Is there a different way to make sure that on a multi-processor
> > > > machine we can take advantage of all available CPUs? Simple
> > > > kernel
> > > > threads instead of a work queue?
> > > 
> > > There is a trade-off between spreading the work, and ensuring it
> > > is executed on a CPU close to the I/O and application. IMO
> > > UNBOUND
> > > is a good way to do that. UNBOUND will attempt to schedule the
> > > work on the preferred CPU, but allow it to be migrated if that
> > > CPU is busy.
> > > 
> > > The advantage of this is that when the client workload is CPU
> > > intensive (say, a software build), RPC client work can be
> > > scheduled
> > > and run more quickly, which reduces latency.
> > > 
> > 
> > That should no longer be a huge issue, since queue_work() will now
> > default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
> > but
> > will schedule elsewhere if the local CPU is congested.
> 
> I don't believe NFS use workqueue_congested() to somehow schedule the
> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
> there is any intention of balancing the CPU load.
> 

I shouldn't have to test the queue when scheduling with
WORK_CPU_UNBOUND.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]


2017-07-05 17:33:11

by Olga Kornievskaia

Subject: Re: [RFC] fix parallelism for rpc tasks

On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
<[email protected]> wrote:
> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>> <[email protected]> wrote:
>> > On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>> > > > On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
>> > > > wrote:
>> > > >
>> > > > On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>> > > > <[email protected]> wrote:
>> > > > > On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>> > > > > > Hi folks,
>> > > > > >
>> > > > > > On a multi-core machine, is it expected that we can have
>> > > > > > parallel
>> > > > > > RPCs
>> > > > > > handled by each of the per-core workqueue?
>> > > > > >
>> > > > > > In testing a read workload, observing via "top" command
>> > > > > > that a
>> > > > > > single
>> > > > > > "kworker" thread is running servicing the requests (no
>> > > > > > parallelism).
>> > > > > > It's more prominent while doing these operations over krb5p
>> > > > > > mount.
>> > > > > >
>> > > > > > What has been suggested by Bruce is to try this and in my
>> > > > > > testing I
>> > > > > > see then the read workload spread among all the kworker
>> > > > > > threads.
>> > > > > >
>> > > > > > Signed-off-by: Olga Kornievskaia <[email protected]>
>> > > > > >
>> > > > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>> > > > > > index 0cc8383..f80e688 100644
>> > > > > > --- a/net/sunrpc/sched.c
>> > > > > > +++ b/net/sunrpc/sched.c
>> > > > > > @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>> > > > > > * Create the rpciod thread and wait for it to start.
>> > > > > > */
>> > > > > > dprintk("RPC: creating workqueue rpciod\n");
>> > > > > > - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>> > > > > > + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM |
>> > > > > > WQ_UNBOUND,
>> > > > > > 0);
>> > > > > > if (!wq)
>> > > > > > goto out_failed;
>> > > > > > rpciod_workqueue = wq;
>> > > > > >
>> > > > >
>> > > > > WQ_UNBOUND turns off concurrency management on the thread
>> > > > > pool
>> > > > > (See
>> > > > > Documentation/core-api/workqueue.rst. It also means we
>> > > > > contend
>> > > > > for work
>> > > > > item queuing/dequeuing locks, since the threads which run the
>> > > > > work
>> > > > > items are not bound to a CPU.
>> > > > >
>> > > > > IOW: This is not a slam-dunk obvious gain.
>> > > >
>> > > > I agree but I think it's worth consideration. I'm waiting to
>> > > > get
>> > > > (real) performance numbers of improvement (instead of my VM
>> > > > setup)
>> > > > to
>> > > > help my case. However, it was reported 90% degradation for the
>> > > > read
>> > > > performance over krb5p when 1CPU is executing all ops.
>> > > >
>> > > > Is there a different way to make sure that on a multi-processor
>> > > > machine we can take advantage of all available CPUs? Simple
>> > > > kernel
>> > > > threads instead of a work queue?
>> > >
>> > > There is a trade-off between spreading the work, and ensuring it
>> > > is executed on a CPU close to the I/O and application. IMO
>> > > UNBOUND
>> > > is a good way to do that. UNBOUND will attempt to schedule the
>> > > work on the preferred CPU, but allow it to be migrated if that
>> > > CPU is busy.
>> > >
>> > > The advantage of this is that when the client workload is CPU
>> > > intensive (say, a software build), RPC client work can be
>> > > scheduled
>> > > and run more quickly, which reduces latency.
>> > >
>> >
>> > That should no longer be a huge issue, since queue_work() will now
>> > default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>> > but
>> > will schedule elsewhere if the local CPU is congested.
>>
>> I don't believe NFS use workqueue_congested() to somehow schedule the
>> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
>> there is any intention of balancing the CPU load.
>>
>
> I shouldn't have to test the queue when scheduling with
> WORK_CPU_UNBOUND.
>

Comments in the code says that "if CPU dies" it'll be re-scheduled on
another. I think the code requires to mark the queue UNBOUND to really
be scheduled on a different CPU. Just my reading of the code and it
matches what is seen with the krb5 workload.
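
For reference, a simplified sketch of the dispatch step in
kernel/workqueue.c's __queue_work() for kernels of this vintage; it is
paraphrased rather than quoted, and per_cpu_pool(), node_pool() and
insert_into() are placeholder names, not real kernel APIs. queue_work()
itself is a wrapper for queue_work_on(WORK_CPU_UNBOUND, wq, work), so on a
queue created without WQ_UNBOUND the item is still placed in the per-CPU pool
of the submitting CPU; passing WORK_CPU_UNBOUND at queueing time does not by
itself spread the load across CPUs.

/* Paraphrased sketch of __queue_work() dispatch; helper names are
 * placeholders. */
static void sketch_queue_work(int req_cpu, struct workqueue_struct *wq,
			      struct work_struct *work)
{
	int cpu = req_cpu;

	if (cpu == WORK_CPU_UNBOUND)
		cpu = raw_smp_processor_id();	/* prefer the local CPU */

	if (!(wq->flags & WQ_UNBOUND))
		insert_into(per_cpu_pool(wq, cpu), work);	/* bound: stays on 'cpu' */
	else
		insert_into(node_pool(wq, cpu_to_node(cpu)), work);	/* unbound pool */
}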

2017-07-19 17:59:48

by Olga Kornievskaia

Subject: Re: [RFC] fix parallelism for rpc tasks

On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <[email protected]> wrote:
> On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
> <[email protected]> wrote:
>> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>>> <[email protected]> wrote:
>>> > On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>>> > > > On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
>>> > > > wrote:
>>> > > >
>>> > > > On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>>> > > > <[email protected]> wrote:
>>> > > > > On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>> > > > > > Hi folks,
>>> > > > > >
>>> > > > > > On a multi-core machine, is it expected that we can have
>>> > > > > > parallel
>>> > > > > > RPCs
>>> > > > > > handled by each of the per-core workqueue?
>>> > > > > >
>>> > > > > > In testing a read workload, observing via "top" command
>>> > > > > > that a
>>> > > > > > single
>>> > > > > > "kworker" thread is running servicing the requests (no
>>> > > > > > parallelism).
>>> > > > > > It's more prominent while doing these operations over krb5p
>>> > > > > > mount.
>>> > > > > >
>>> > > > > > What has been suggested by Bruce is to try this and in my
>>> > > > > > testing I
>>> > > > > > see then the read workload spread among all the kworker
>>> > > > > > threads.
>>> > > > > >
>>> > > > > > Signed-off-by: Olga Kornievskaia <[email protected]>
>>> > > > > >
>>> > > > > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>> > > > > > index 0cc8383..f80e688 100644
>>> > > > > > --- a/net/sunrpc/sched.c
>>> > > > > > +++ b/net/sunrpc/sched.c
>>> > > > > > @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>> > > > > > * Create the rpciod thread and wait for it to start.
>>> > > > > > */
>>> > > > > > dprintk("RPC: creating workqueue rpciod\n");
>>> > > > > > - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>> > > > > > + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM |
>>> > > > > > WQ_UNBOUND,
>>> > > > > > 0);
>>> > > > > > if (!wq)
>>> > > > > > goto out_failed;
>>> > > > > > rpciod_workqueue = wq;
>>> > > > > >
>>> > > > >
>>> > > > > WQ_UNBOUND turns off concurrency management on the thread
>>> > > > > pool
>>> > > > > (See
>>> > > > > Documentation/core-api/workqueue.rst. It also means we
>>> > > > > contend
>>> > > > > for work
>>> > > > > item queuing/dequeuing locks, since the threads which run the
>>> > > > > work
>>> > > > > items are not bound to a CPU.
>>> > > > >
>>> > > > > IOW: This is not a slam-dunk obvious gain.
>>> > > >
>>> > > > I agree but I think it's worth consideration. I'm waiting to
>>> > > > get
>>> > > > (real) performance numbers of improvement (instead of my VM
>>> > > > setup)
>>> > > > to
>>> > > > help my case. However, it was reported 90% degradation for the
>>> > > > read
>>> > > > performance over krb5p when 1CPU is executing all ops.
>>> > > >
>>> > > > Is there a different way to make sure that on a multi-processor
>>> > > > machine we can take advantage of all available CPUs? Simple
>>> > > > kernel
>>> > > > threads instead of a work queue?
>>> > >
>>> > > There is a trade-off between spreading the work, and ensuring it
>>> > > is executed on a CPU close to the I/O and application. IMO
>>> > > UNBOUND
>>> > > is a good way to do that. UNBOUND will attempt to schedule the
>>> > > work on the preferred CPU, but allow it to be migrated if that
>>> > > CPU is busy.
>>> > >
>>> > > The advantage of this is that when the client workload is CPU
>>> > > intensive (say, a software build), RPC client work can be
>>> > > scheduled
>>> > > and run more quickly, which reduces latency.
>>> > >
>>> >
>>> > That should no longer be a huge issue, since queue_work() will now
>>> > default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>>> > but
>>> > will schedule elsewhere if the local CPU is congested.
>>>
>>> I don't believe NFS use workqueue_congested() to somehow schedule the
>>> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
>>> there is any intention of balancing the CPU load.
>>>
>>
>> I shouldn't have to test the queue when scheduling with
>> WORK_CPU_UNBOUND.
>>
>
> Comments in the code says that "if CPU dies" it'll be re-scheduled on
> another. I think the code requires to mark the queue UNBOUND to really
> be scheduled on a different CPU. Just my reading of the code and it
> matches what is seen with the krb5 workload.

Trond, what's the path forward here? What about a run-time
configuration that starts rpciod with the UNBOUND options instead?
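
One possible shape for such a knob, sketched purely as an illustration (the
rpciod_unbound module parameter is invented here, not an existing option):
select the workqueue flags when rpciod is created. A per-mount split (one
queue for krb5 mounts and one for the rest, as floated earlier in the thread)
would need a second workqueue and a way to pick it per client, so a single
global switch is the smaller change.

#include <linux/module.h>
#include <linux/workqueue.h>

/* Hypothetical sketch only; 'rpciod_unbound' is an invented parameter. */
static struct workqueue_struct *rpciod_workqueue;

static bool rpciod_unbound;
module_param(rpciod_unbound, bool, 0444);
MODULE_PARM_DESC(rpciod_unbound, "Create rpciod as a WQ_UNBOUND workqueue");

static int rpciod_start(void)
{
	unsigned int flags = WQ_MEM_RECLAIM;

	if (rpciod_unbound)
		flags |= WQ_UNBOUND;
	rpciod_workqueue = alloc_workqueue("rpciod", flags, 0);
	return rpciod_workqueue != NULL;
}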

2018-02-17 18:55:19

by Chuck Lever III

Subject: Re: [RFC] fix parallelism for rpc tasks



> On Feb 14, 2018, at 6:13 PM, Mora, Jorge <[email protected]> wrote:
>
> Hello,
>
> The patch gives some performance improvement on Kerberos read.
> The following results show performance comparisons between unpatched
> and patched systems. The html files included as attachments show the
> results as line charts.
>
> - Best read performance improvement when testing with a single dd transfer.
> The patched system gives 70% better performance than the unpatched system.
> (first set of results)
>
> - The patched system gives 18% better performance than the unpatched system
> when testing with multiple dd transfers.
> (second set of results)
>
> - The write test shows there is no performance hit by the patch.
> (third set of results)
>
> - When testing on a different client having less RAM and fewer number of CPU cores,
> there is no performance degradation for Kerberos in the unpatched system.
> In this case, the patch does not provide any performance improvement.
> (fourth set of results)
>
> ================================================================================
> Test environment:
>
> NFS client: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS servers: CPU: 16 cores, RAM: 32GB (E5620 @ 2.40GHz)
> NFS mount: NFSv3 with sec=(sys or krb5p)
>
> For tests with a single dd transfer there is of course one NFS server used
> and one file being read -- only one transfer was needed to fill up the
> network connection.
>
> For tests with multiple dd transfers, three different NFS server were used
> and four different files were used per NFS server for a total of 12 different
> files being read (12 different transfers in parallel).
>
> The patch was applied on top of 4.14.0-rc3 kernel and the NFS servers were
> running RHEL 7.4.
>
> The fourth set of results below show an unpatched system with no Kerberos
> degradation (same kernel 4.14.0-rc3) but in contrast with the main client
> used for testing this client has only 4 CPU cores and 8GB of RAM.
> I believe that even though this system has less CPU cores and less RAM,
> the CPU is faster (E31220 @ 3.10GHz vs E5620 @ 2.40GHz) so it is able
> to handle the Kerberos load better and fill up the network connection
> with a single thread than the main client with more CPU cores and more
> memory.

Jorge, thanks for publishing these results.

Can you do a "numactl -H" on your clients and post the output? I suspect
the throughput improvement on the big client is because WQ_UNBOUND
behaves differently on NUMA systems. (Even so, I agree that the proposed
change is valuable).


> ================================================================================
>
> Kerberos Read Performance: 170.15% (patched system over unpatched system)
>
> Client CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores: 16
> RAM: 32 GB
> NFS version: 3
> Mount points: 1
> dd's per mount: 1
> Total dd's: 1
> Data transferred: 7.81 GB (per run)
> Number of runs: 10
>
> Kerberos Read Performance (unpatched system vs patched system)
>     Transfer rate (unpatched system) avg: 65.88 MB/s, var: 20.28, stddev: 4.50
>     Transfer rate (patched system)   avg: 112.10 MB/s, var: 0.00, stddev: 0.01
>     Performance (patched over unpatched): 170.15%
>
> Unpatched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 111.96 MB/s, var: 0.02, stddev: 0.13
>     Transfer rate (sec=krb5p) avg: 65.88 MB/s, var: 20.28, stddev: 4.50
>     Performance (krb5p over sys): 58.84%
>
> Patched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 111.94 MB/s, var: 0.02, stddev: 0.14
>     Transfer rate (sec=krb5p) avg: 112.10 MB/s, var: 0.00, stddev: 0.01
>     Performance (krb5p over sys): 100.14%
>
> ================================================================================
>
> Kerberos Read Performance: 118.02% (patched system over unpatched system)
>
> Client CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores: 16
> RAM: 32 GB
> NFS version: 3
> Mount points: 3
> dd's per mount: 4
> Total dd's: 12
> Data transferred: 93.75 GB (per run)
> Number of runs: 10
>
> Kerberos Read Performance (unpatched system vs patched system)
>     Transfer rate (unpatched system) avg: 94.99 MB/s, var: 68.96, stddev: 8.30
>     Transfer rate (patched system)   avg: 112.11 MB/s, var: 0.00, stddev: 0.03
>     Performance (patched over unpatched): 118.02%
>
> Unpatched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 112.21 MB/s, var: 0.00, stddev: 0.00
>     Transfer rate (sec=krb5p) avg: 94.99 MB/s, var: 68.96, stddev: 8.30
>     Performance (krb5p over sys): 84.66%
>
> Patched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 112.20 MB/s, var: 0.00, stddev: 0.00
>     Transfer rate (sec=krb5p) avg: 112.11 MB/s, var: 0.00, stddev: 0.03
>     Performance (krb5p over sys): 99.92%
>
> ================================================================================
>
> Kerberos Write Performance: 101.55% (patched system over unpatched system)
>
> Client CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
> CPU cores: 16
> RAM: 32 GB
> NFS version: 3
> Mount points: 3
> dd's per mount: 4
> Total dd's: 12
> Data transferred: 93.75 GB (per run)
> Number of runs: 10
>
> Kerberos Write Performance (unpatched system vs patched system)
>     Transfer rate (unpatched system) avg: 103.70 MB/s, var: 110.51, stddev: 10.51
>     Transfer rate (patched system)   avg: 105.31 MB/s, var: 35.04, stddev: 5.92
>     Performance (patched over unpatched): 101.55%
>
> Unpatched System Write Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 109.87 MB/s, var: 10.27, stddev: 3.20
>     Transfer rate (sec=krb5p) avg: 103.70 MB/s, var: 110.51, stddev: 10.51
>     Performance (krb5p over sys): 94.39%
>
> Patched System Write Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 111.03 MB/s, var: 0.58, stddev: 0.76
>     Transfer rate (sec=krb5p) avg: 105.31 MB/s, var: 35.04, stddev: 5.92
>     Performance (krb5p over sys): 94.85%
>
> ================================================================================
>
> Kerberos Read Performance: 99.99% (patched system over unpatched system)
>
> Client CPU: Intel(R) Xeon(R) CPU E31220 @ 3.10GHz
> CPU cores: 4
> RAM: 8 GB
> NFS version: 3
> Mount points: 1
> dd's per mount: 1
> Total dd's: 1
> Data transferred: 7.81 GB (per run)
> Number of runs: 10
>
> Kerberos Read Performance (unpatched system vs patched system)
>     Transfer rate (unpatched system) avg: 112.02 MB/s, var: 0.04, stddev: 0.21
>     Transfer rate (patched system)   avg: 112.01 MB/s, var: 0.06, stddev: 0.25
>     Performance (patched over unpatched): 99.99%
>
> Unpatched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 111.86 MB/s, var: 0.06, stddev: 0.24
>     Transfer rate (sec=krb5p) avg: 112.02 MB/s, var: 0.04, stddev: 0.21
>     Performance (krb5p over sys): 100.14%
>
> Patched System Read Performance (sys vs krb5p)
>     Transfer rate (sec=sys)   avg: 111.76 MB/s, var: 0.12, stddev: 0.34
>     Transfer rate (sec=krb5p) avg: 112.01 MB/s, var: 0.06, stddev: 0.25
>     Performance (krb5p over sys): 100.22%
>
>
> --Jorge
>
> ________________________________________
> From: [email protected] <[email protected]> on behalf of Olga Kornievskaia <[email protected]>
> Sent: Wednesday, July 19, 2017 11:59 AM
> To: Trond Myklebust
> Cc: [email protected]; [email protected]
> Subject: Re: [RFC] fix parallelism for rpc tasks
>
> On Wed, Jul 5, 2017 at 1:33 PM, Olga Kornievskaia <[email protected]> wrote:
>> On Wed, Jul 5, 2017 at 12:14 PM, Trond Myklebust
>> <[email protected]> wrote:
>>> On Wed, 2017-07-05 at 12:09 -0400, Olga Kornievskaia wrote:
>>>> On Wed, Jul 5, 2017 at 11:46 AM, Trond Myklebust
>>>> <[email protected]> wrote:
>>>>> On Wed, 2017-07-05 at 11:11 -0400, Chuck Lever wrote:
>>>>>>> On Jul 5, 2017, at 10:44 AM, Olga Kornievskaia <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Mon, Jul 3, 2017 at 10:58 AM, Trond Myklebust
>>>>>>> <[email protected]> wrote:
>>>>>>>> On Thu, 2017-06-29 at 09:25 -0400, Olga Kornievskaia wrote:
>>>>>>>>> Hi folks,
>>>>>>>>>
>>>>>>>>> On a multi-core machine, is it expected that we can have
>>>>>>>>> parallel
>>>>>>>>> RPCs
>>>>>>>>> handled by each of the per-core workqueue?
>>>>>>>>>
>>>>>>>>> In testing a read workload, observing via "top" command
>>>>>>>>> that a
>>>>>>>>> single
>>>>>>>>> "kworker" thread is running servicing the requests (no
>>>>>>>>> parallelism).
>>>>>>>>> It's more prominent while doing these operations over krb5p
>>>>>>>>> mount.
>>>>>>>>>
>>>>>>>>> What has been suggested by Bruce is to try this and in my
>>>>>>>>> testing I
>>>>>>>>> see then the read workload spread among all the kworker
>>>>>>>>> threads.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Olga Kornievskaia <[email protected]>
>>>>>>>>>
>>>>>>>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
>>>>>>>>> index 0cc8383..f80e688 100644
>>>>>>>>> --- a/net/sunrpc/sched.c
>>>>>>>>> +++ b/net/sunrpc/sched.c
>>>>>>>>> @@ -1095,7 +1095,7 @@ static int rpciod_start(void)
>>>>>>>>> * Create the rpciod thread and wait for it to start.
>>>>>>>>> */
>>>>>>>>> dprintk("RPC: creating workqueue rpciod\n");
>>>>>>>>> - wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 0);
>>>>>>>>> + wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM |
>>>>>>>>> WQ_UNBOUND,
>>>>>>>>> 0);
>>>>>>>>> if (!wq)
>>>>>>>>> goto out_failed;
>>>>>>>>> rpciod_workqueue = wq;
>>>>>>>>>
>>>>>>>>
>>>>>>>> WQ_UNBOUND turns off concurrency management on the thread
>>>>>>>> pool
>>>>>>>> (See
>>>>>>>> Documentation/core-api/workqueue.rst. It also means we
>>>>>>>> contend
>>>>>>>> for work
>>>>>>>> item queuing/dequeuing locks, since the threads which run the
>>>>>>>> work
>>>>>>>> items are not bound to a CPU.
>>>>>>>>
>>>>>>>> IOW: This is not a slam-dunk obvious gain.
>>>>>>>
>>>>>>> I agree but I think it's worth consideration. I'm waiting to
>>>>>>> get
>>>>>>> (real) performance numbers of improvement (instead of my VM
>>>>>>> setup)
>>>>>>> to
>>>>>>> help my case. However, it was reported 90% degradation for the
>>>>>>> read
>>>>>>> performance over krb5p when 1CPU is executing all ops.
>>>>>>>
>>>>>>> Is there a different way to make sure that on a multi-processor
>>>>>>> machine we can take advantage of all available CPUs? Simple
>>>>>>> kernel
>>>>>>> threads instead of a work queue?
>>>>>>
>>>>>> There is a trade-off between spreading the work, and ensuring it
>>>>>> is executed on a CPU close to the I/O and application. IMO
>>>>>> UNBOUND
>>>>>> is a good way to do that. UNBOUND will attempt to schedule the
>>>>>> work on the preferred CPU, but allow it to be migrated if that
>>>>>> CPU is busy.
>>>>>>
>>>>>> The advantage of this is that when the client workload is CPU
>>>>>> intensive (say, a software build), RPC client work can be
>>>>>> scheduled
>>>>>> and run more quickly, which reduces latency.
>>>>>>
>>>>>
>>>>> That should no longer be a huge issue, since queue_work() will now
>>>>> default to the WORK_CPU_UNBOUND flag, which prefers the local CPU,
>>>>> but
>>>>> will schedule elsewhere if the local CPU is congested.
>>>>
>>>> I don't believe NFS use workqueue_congested() to somehow schedule the
>>>> work elsewhere. Unless the queue is marked UNBOUNDED I don't believe
>>>> there is any intention of balancing the CPU load.
>>>>
>>>
>>> I shouldn't have to test the queue when scheduling with
>>> WORK_CPU_UNBOUND.
>>>
>>
>> Comments in the code says that "if CPU dies" it'll be re-scheduled on
>> another. I think the code requires to mark the queue UNBOUND to really
>> be scheduled on a different CPU. Just my reading of the code and it
>> matches what is seen with the krb5 workload.
>
> Trond, what's the path forward here? What about a run-time
> configuration that starts rpciod with the UNBOUND options instead?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> <dd_read_single.html><dd_read_mult.html><dd_write_mult.html><dd_read_single1.html>

--
Chuck Lever