LinuxLists.cc - [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

2015-01-25 00:18:50

Subject: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

Increase the concurrency level for rpciod threads to allow for allocations
etc that happen in the RPCSEC_GSS layer. Also note that the NFSv4 byte range
locks may now need to allocate memory from inside rpciod.

Add the WQ_HIGHPRI flag to improve latency guarantees while we're at it.

Signed-off-by: Trond Myklebust <[email protected]>
---
 net/sunrpc/sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d20f2329eea3..4f65ec28d2b4 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -1069,7 +1069,8 @@ static int rpciod_start(void)
 	 * Create the rpciod thread and wait for it to start.
 	 */
 	dprintk("RPC: creating workqueue rpciod\n");
-	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);
+	/* Note: highpri because network receive is latency sensitive */
+	wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_HIGHPRI, 0);
 	rpciod_workqueue = wq;
 	return rpciod_workqueue != NULL;
 }
--
2.1.0
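
[Editor's note] The patch changes the `max_active` argument from 1 to 0. Passing 0 to alloc_workqueue() requests the default limit, not a limit of zero. The user-space sketch below models that resolution for a bound (per-CPU) queue. The constants mirror WQ_MAX_ACTIVE and WQ_DFL_ACTIVE from include/linux/workqueue.h as I understand them for kernels of this era; `model_max_active` is a hypothetical name, and this is not the kernel's implementation.

```c
#include <assert.h>

/* Values mirrored from include/linux/workqueue.h (assumed, not verified
 * against any particular tree). */
enum {
	MODEL_WQ_MAX_ACTIVE = 512,
	MODEL_WQ_DFL_ACTIVE = MODEL_WQ_MAX_ACTIVE / 2,
};

/*
 * Hypothetical helper: resolve the max_active argument of a bound
 * workqueue. A request of 0 selects the default; any other value is
 * clamped to the range [1, MODEL_WQ_MAX_ACTIVE].
 */
int model_max_active(int requested)
{
	int resolved = requested ? requested : MODEL_WQ_DFL_ACTIVE;

	if (resolved < 1)
		resolved = 1;
	if (resolved > MODEL_WQ_MAX_ACTIVE)
		resolved = MODEL_WQ_MAX_ACTIVE;
	assert(resolved >= 1 && resolved <= MODEL_WQ_MAX_ACTIVE);
	return resolved;
}
```

Under this model the old argument allowed one in-flight rpciod work item per CPU, while the new argument resolves to the default of 256.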

2015-01-25 00:18:51

by Trond Myklebust

Subject: [PATCH 2/2] SUNRPC: Allow waiting on memory allocation

We should be safe now, as long as we don't do GFP_IO or higher allocations

Signed-off-by: Trond Myklebust <[email protected]>
---
 net/sunrpc/sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index 4f65ec28d2b4..b91fd9c597b4 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -844,10 +844,10 @@ static void rpc_async_schedule(struct work_struct *work)
 void *rpc_malloc(struct rpc_task *task, size_t size)
 {
 	struct rpc_buffer *buf;
-	gfp_t gfp = GFP_NOWAIT | __GFP_NOWARN;
+	gfp_t gfp = GFP_NOIO | __GFP_NOWARN;
 
 	if (RPC_IS_SWAPPER(task))
-		gfp |= __GFP_MEMALLOC;
+		gfp = __GFP_MEMALLOC | GFP_NOWAIT | __GFP_NOWARN;
 
 	size += sizeof(struct rpc_buffer);
 	if (size <= RPC_BUFFER_MAXSIZE)
--
2.1.0
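
[Editor's note] The effect of the hunk above can be easier to see as two small functions. The following user-space sketch models the flag choice in rpc_malloc() before and after the patch. The bit values are illustrative only (the real definitions live in include/linux/gfp.h), and all `model_*` / `MODEL_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's. */
#define MODEL_GFP_WAIT     0x1u  /* allocation may sleep */
#define MODEL_GFP_NOWARN   0x2u  /* suppress allocation-failure warnings */
#define MODEL_GFP_MEMALLOC 0x4u  /* may use emergency reserves */

#define MODEL_GFP_NOWAIT   0x0u            /* never sleeps */
#define MODEL_GFP_NOIO     MODEL_GFP_WAIT  /* may sleep, must not start I/O */

/* Flag choice before the patch: nobody is allowed to sleep. */
unsigned int model_gfp_before(bool swapper)
{
	unsigned int gfp = MODEL_GFP_NOWAIT | MODEL_GFP_NOWARN;

	if (swapper)
		gfp |= MODEL_GFP_MEMALLOC;
	return gfp;
}

/* Flag choice after the patch: only the swapper path stays non-blocking. */
unsigned int model_gfp_after(bool swapper)
{
	unsigned int gfp = MODEL_GFP_NOIO | MODEL_GFP_NOWARN;

	if (swapper)
		gfp = MODEL_GFP_MEMALLOC | MODEL_GFP_NOWAIT | MODEL_GFP_NOWARN;
	/* Invariant in this model: a swapper task must never sleep here. */
	assert(!(swapper && (gfp & MODEL_GFP_WAIT)));
	return gfp;
}
```

In this model the swapper case is unchanged by the patch, and the ordinary case gains only the ability to sleep.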

2015-01-25 20:11:06

by Chuck Lever III

Subject: Re: [PATCH 2/2] SUNRPC: Allow waiting on memory allocation

On Jan 24, 2015, at 7:18 PM, Trond Myklebust <[email protected]> wrote:

> We should be safe now, as long as we don't do GFP_IO or higher allocations

Should the GFP flags in xprt_rdma_allocate() reflect this change
as well? The non-swap case uses GFP_NOFS currently, and the
swap case does not include GFP_MEMALLOC. These choices might be
out of date.

If so I can submit a patch on top of my existing for-3.20 series
that changes xprt_rdma_allocate() to use the same flags as
rpc_malloc().

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com

2015-01-25 21:29:41

by Trond Myklebust

Subject: Re: [PATCH 2/2] SUNRPC: Allow waiting on memory allocation

On Sun, Jan 25, 2015 at 3:11 PM, Chuck Lever <[email protected]> wrote:
>
>
> On Jan 24, 2015, at 7:18 PM, Trond Myklebust <[email protected]> wrote:
>
> > We should be safe now, as long as we don't do GFP_IO or higher allocations
>
> Should the GFP flags in xprt_rdma_allocate() reflect this change
> as well? The non-swap case uses GFP_NOFS currently, and the

GFP_NOFS or GFP_NOWAIT? If the former, then that would be yet further
justification for PATCH 1/2.

> swap case does not include GFP_MEMALLOC. These choices might be
> out of date.
>
> If so I can submit a patch on top of my existing for-3.20 series
> that changes xprt_rdma_allocate() to use the same flags as
> rpc_malloc().

Yes. I think GFP_NOIO is the more conservative (and correct) approach
here, rather than GFP_NOFS. In particular it means that we won't
trigger any new swap-over-nfs activity.
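
[Editor's note] The GFP_NOIO versus GFP_NOFS distinction discussed here comes down to which permissions each mask grants. The sketch below is a user-space model with illustrative bit values, following my understanding of how these masks were composed in kernels of this era (GFP_NOWAIT grants nothing, GFP_NOIO adds waiting, GFP_NOFS adds I/O, GFP_KERNEL adds filesystem callbacks). The `MODEL_*` names and both helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative permission bits; the real ones are in include/linux/gfp.h. */
#define MODEL_WAIT 0x1u  /* may sleep and enter direct reclaim */
#define MODEL_IO   0x2u  /* reclaim may start physical I/O (e.g. swap-out) */
#define MODEL_FS   0x4u  /* reclaim may call back into filesystem code */

#define MODEL_GFP_NOWAIT 0x0u
#define MODEL_GFP_NOIO   (MODEL_WAIT)
#define MODEL_GFP_NOFS   (MODEL_WAIT | MODEL_IO)
#define MODEL_GFP_KERNEL (MODEL_WAIT | MODEL_IO | MODEL_FS)

/* True if every permission granted by 'a' is also granted by 'b'. */
bool model_gfp_no_laxer_than(unsigned int a, unsigned int b)
{
	return (a & ~b) == 0;
}

/* True if reclaim on behalf of this allocation may start I/O. */
bool model_may_start_io(unsigned int gfp)
{
	assert((gfp & ~MODEL_GFP_KERNEL) == 0); /* only model bits allowed */
	return (gfp & MODEL_IO) != 0;
}
```

In this model GFP_NOIO is strictly more conservative than GFP_NOFS: it cannot start I/O, which is why it cannot trigger new swap-over-NFS activity.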

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2015-01-26 23:33:53

by Shirley Ma

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

Hello Trond,

The WQ_UNBOUND workqueue flag is also needed. A customer hit a problem where an RT thread caused rpciod starvation. It is easy to reproduce by running a CPU-intensive workload, with a lower nice value than the rpciod workqueue, on the CPU where the network interrupt is received.

I've also run iozone and fio tests with the WQ_UNBOUND|WQ_SYSFS flags set, over NFS/RDMA and NFS/IPoIB. The results are better than with a bound workqueue.

Thanks,
Shirley

2015-01-27 00:34:21

by Trond Myklebust

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On Mon, Jan 26, 2015 at 6:33 PM, Shirley Ma <[email protected]> wrote:
> Hello Trond,
>
> The WQ_UNBOUND workqueue flag is also needed. A customer hit a problem where an RT thread caused rpciod starvation. It is easy to reproduce by running a CPU-intensive workload, with a lower nice value than the rpciod workqueue, on the CPU where the network interrupt is received.
>
> I've also run iozone and fio tests with the WQ_UNBOUND|WQ_SYSFS flags set, over NFS/RDMA and NFS/IPoIB. The results are better than with a bound workqueue.

It certainly does not seem appropriate to use WQ_SYSFS on a queue that
is used for swap, and Documentation/kernel-per-CPU-kthreads.txt makes
an extra strong argument against enabling it on the grounds that it is
not easily reversible.

As for unbound queues: they will almost by definition defeat all the
packet steering and balancing that is done in the networking layer in
the name of multi-process scalability (see
Documentation/networking/scaling.txt). While RDMA systems may or may
not care about that, ordinary networked systems probably do.
Don't most RDMA drivers allow you to balance those interrupts, at
least on the high end systems?

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2015-01-27 02:30:09

by Shirley Ma

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On 01/26/2015 04:34 PM, Trond Myklebust wrote:
> It certainly does not seem appropriate to use WQ_SYSFS on a queue that
> is used for swap, and Documentation/kernel-per-CPU-kthreads.txt makes
> an extra strong argument against enabling it on the grounds that it is
> not easily reversible.

If UNBOUND is enabled, I thought being able to customize the workqueue would help.

> As for unbound queues: they will almost by definition defeat all the
> packet steering and balancing that is done in the networking layer in
> the name of multi-process scalability (see
> Documentation/networking/scaling.txt). While RDMA systems may or may
> not care about that, ordinary networked systems probably do.
> Don't most RDMA drivers allow you to balance those interrupts, at
> least on the high end systems?

The problem is that IRQ balancing is not aware of which CPU is busy on the system, so NIC interrupts can be directed to the busy CPU while other CPUs are much more lightly loaded. In that case packet steering and balancing in the networking layer bring no benefit. The network workload can be starved in this situation regardless of whether it is RDMA or Ethernet. The workaround is to mask out the busy CPU in irqbalance, or to manually set the IRQ SMP affinity to avoid any interrupts on that CPU.

An UNBOUND workqueue will choose the same CPU to run the work if that CPU is not busy; the work is scheduled to another CPU on the same NUMA node only when that CPU is busy. So it doesn't defeat packet steering and balancing.
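
[Editor's note] The manual IRQ affinity workaround mentioned above amounts to writing a hexadecimal CPU bitmask to /proc/irq/<irq>/smp_affinity. The sketch below only computes and formats such a mask with one CPU excluded. Both function names are hypothetical, the IRQ number is system-specific and left out, and the sketch is limited to 32 CPUs so that a single hex word suffices.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/*
 * Hypothetical helper: bitmask of CPUs 0..ncpus-1 with 'busy_cpu' cleared.
 * Requires at least two CPUs so the result is never an empty mask.
 */
uint32_t model_affinity_excluding(unsigned int ncpus, unsigned int busy_cpu)
{
	uint32_t all;

	assert(ncpus >= 2 && ncpus <= 32 && busy_cpu < ncpus);
	all = (ncpus == 32) ? UINT32_MAX : ((UINT32_C(1) << ncpus) - 1);
	return all & ~(UINT32_C(1) << busy_cpu);
}

/* Format the mask as the hex string expected by smp_affinity. */
int model_format_affinity(char *buf, size_t len, uint32_t mask)
{
	return snprintf(buf, len, "%" PRIx32, mask);
}
```

For an 8-CPU machine with CPU 2 busy, the mask is 0xfb; an administrator would write that string to the smp_affinity file of the NIC's IRQ as root.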

2015-01-27 03:17:47

by Trond Myklebust

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On Mon, Jan 26, 2015 at 9:30 PM, Shirley Ma <[email protected]> wrote:
> The problem is that IRQ balancing is not aware of which CPU is busy on the system, so NIC interrupts can be directed to the busy CPU while other CPUs are much more lightly loaded. In that case packet steering and balancing in the networking layer bring no benefit.

Sure it does. Your argument revolves around a corner case.

> The network workload can be starved in this situation regardless of whether it is RDMA or Ethernet. The workaround is to mask out the busy CPU in irqbalance, or to manually set the IRQ SMP affinity to avoid any interrupts on that CPU.

Then schedule it to system_unbound_wq instead. Why do you need to change rpciod?

> An UNBOUND workqueue will choose the same CPU to run the work if that CPU is not busy; the work is scheduled to another CPU on the same NUMA node only when that CPU is busy. So it doesn't defeat packet steering and balancing.

Yes it does. The whole point of packet steering is to maximise
locality in order to avoid contention for resources, locks etc for
networking data that is being consumed by just one (or a few)
processes. By spraying jobs across multiple threads, you are
reintroducing that contention. Furthermore, you are randomising the
processing order for _all_ rpciod tasks.

Please read Documentation/workqueue.txt, where it states clearly that:

Unbound wq sacrifices locality but is useful for
the following cases.

* Wide fluctuation in the concurrency level requirement is
expected and using bound wq may end up creating large number
of mostly unused workers across different CPUs as the issuer
hops through different CPUs.

* Long running CPU intensive workloads which can be better
managed by the system scheduler.

neither of which conditions are the common case for rpciod.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2015-01-27 06:20:13

by Shirley Ma

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On 01/26/2015 07:17 PM, Trond Myklebust wrote:
> Yes it does. The whole point of packet steering is to maximise
> locality in order to avoid contention for resources, locks etc for
> networking data that is being consumed by just one (or a few)
> processes. By spraying jobs across multiple threads, you are
> reintroducing that contention. Furthermore, you are randomising the
> processing order for _all_ rpciod tasks.
>
> Please read Documentation/workqueue.txt, where it states clearly that:
>
> Unbound wq sacrifices locality but is useful for
> the following cases.
>
> * Wide fluctuation in the concurrency level requirement is
> expected and using bound wq may end up creating large number
> of mostly unused workers across different CPUs as the issuer
> hops through different CPUs.
>
> * Long running CPU intensive workloads which can be better
> managed by the system scheduler.
>
> neither of which conditions are the common case for rpciod.

My point is that locality exists for the sake of performance; if locality doesn't gain much performance, then unbound is the better choice. From the data I've collected for the rpciod workqueue, bound vs. unbound: when the local CPU is busy, bound performs much worse than unbound (where the work is scheduled to another, remote CPU); when the local CPU is not busy, bound and unbound performance (both latency and bandwidth) seem similar. So unbound seems the better choice for rpciod.

2015-01-27 06:34:14

by Trond Myklebust

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On Tue, Jan 27, 2015 at 1:20 AM, Shirley Ma <[email protected]> wrote:
> My point is that locality exists for the sake of performance; if locality doesn't gain much performance, then unbound is the better choice. From the data I've collected for the rpciod workqueue, bound vs. unbound: when the local CPU is busy, bound performs much worse than unbound (where the work is scheduled to another, remote CPU); when the local CPU is not busy, bound and unbound performance (both latency and bandwidth) seem similar. So unbound seems the better choice for rpciod.
>

You have supplied 1 data point on a platform (RDMA) that has no users.
I see no reason to make a change.

--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]

2015-01-27 15:09:31

by Shirley Ma

Subject: Re: [PATCH 1/2] SUNRPC: Adjust rpciod workqueue parameters

On 01/26/2015 10:34 PM, Trond Myklebust wrote:
> You have supplied 1 data point on a platform (RDMA) that has no users.
> I see no reason to make a change.

I've also tested TCP (IPoIB). I am going to test GbitE once I get the setup. Based on the data I've collected about the rpciod workqueue, I am expecting similar results: for rpciod, unbound is better than bound. There is no rush to make a decision now; the relative performance of unbound and bound workqueues depends on several factors: latency, CPU workload, scheduling, and so on. My observation so far is that the guideline above is not sufficient for deciding between the unbound and bound flags. It should also take latency, CPU load, and other factors into account.

Thanks
Shirley