Hi Jason,
On 2017/10/18 22:03, Jason Baron wrote:
>
>
> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>
>>> The ep_poll_safewake() function is used to wake up potentially nested
>>> epoll
>>> file descriptors. The function uses ep_call_nested() to prevent entering
>>> the same wake up queue more than once, and to prevent excessively deep
>>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>>> This
>>> saves extra function calls, and avoids taking a global lock during the
>>> ep_call_nested() calls.
>>
>> This makes sense.
>>
>>>
>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>> case, since ep_call_nested() keeps track of the nesting level, and
>>> this is
>>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>>> well, however it's not clear how to simply pass the nesting level through
>>> multiple wake_up() levels without more surgery. In any case, I don't
>>> think
>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>>> also
>>> apparently fixes a workload at Google that Salman Qazi reported by
>>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>>
>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>> in mind.
>>>
>
> I'm not sure the details of the workload - perhaps Salman can elaborate
> further about it.
>
> It would seem that the safewake would potentially be the most contended
> in general in the nested case, because generally you have a few epoll
> fds attached to lots of sources doing wakeups. So those sources are all
> going to conflict on the safewake lock. The readywalk is used when
> performing a 'nested' poll and in general this is likely going to be
> called on a few epoll fds. That said, we should remove it too. I will
> post a patch to remove it.
>
> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
> wakeup paths and so I would expect this to be less common, but I
> wouldn't doubt there are workloads impacted by it. We can potentially, I
> think, remove this one too - and the global 'epmutex'. I posted some
> ideas a while ago on it:
>
> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>
> We can work through these ideas or others...
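(For reference, a rough sketch of the simplification being discussed, not
necessarily the exact patch: with the loop and depth checks already done at
EPOLL_CTL_ADD time, the non-lockdep build can wake the nested waitqueue
directly, and ep_call_nested() is kept only where CONFIG_DEBUG_LOCK_ALLOC
needs the nesting level for spin_lock_irqsave_nested(). The helpers named
below are the existing ones in fs/eventpoll.c.)

#ifdef CONFIG_DEBUG_LOCK_ALLOC
/* Keep the nesting-level tracking that lockdep's nested locking needs. */
static void ep_poll_safewake(wait_queue_head_t *wq)
{
	int this_cpu = get_cpu();

	ep_call_nested(&poll_safewake_ncalls, EP_MAX_NESTS,
		       ep_poll_wakeup_proc, NULL, wq, (void *) (long) this_cpu);
	put_cpu();
}
#else
/* Loops and deep nesting were rejected at EPOLL_CTL_ADD, so wake directly. */
static void ep_poll_safewake(wait_queue_head_t *wq)
{
	wake_up_poll(wq, POLLIN);
}
#endif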
I have tested these patches on a simple non-nested epoll workload and found a
performance regression (-0.5%, which is better than the -1.0% from my patch set)
under the normal nginx conf and a performance improvement (+3%, versus +4% for my
patch set) under the one-fd-per-request conf. The reason for the performance
improvement is clear; however, I haven't figured out the reason for the performance
regression (maybe it is cache related?).
It seems that your patch set works fine under both normal and nested epoll workloads,
so I don't plan to continue my patch set, which only works for the normal epoll workload.
I have one question about your patch set: the newly added fields in
struct file. I tried to move these fields into struct eventpoll, so that only epoll fds
would be added to a disjoint set and the target fd would be linked to a disjoint set
dynamically, but that method didn't work out because there are multiple target fds and
the order in which these target fds are added to a disjoint set can lead to deadlock.
So do you have other ideas for reducing the size of these fields?
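(For context, a minimal sketch of the disjoint-set bookkeeping being discussed,
i.e. a union-find over epoll files so that loop checks only serialize within one
connected component instead of under a global mutex; the type and helper names
here are hypothetical and not taken from either patch set:)

struct ep_set {
	struct ep_set *parent;	/* set representative when parent == self */
	int rank;
};

static struct ep_set *ep_set_find(struct ep_set *s)
{
	while (s->parent != s) {
		s->parent = s->parent->parent;	/* path halving */
		s = s->parent;
	}
	return s;
}

static void ep_set_union(struct ep_set *a, struct ep_set *b)
{
	a = ep_set_find(a);
	b = ep_set_find(b);
	if (a == b)
		return;
	if (a->rank < b->rank)
		swap(a, b);	/* kernel's swap() macro */
	b->parent = a;		/* attach the shallower tree under the deeper one */
	if (a->rank == b->rank)
		a->rank++;
}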
Thanks,
Tao
The following lines are the performance results:
baseline: result for v4.14
ep_cc: result for v4.14 + ep_cc patch set
my: result for v4.14 + my patch set
The first 15 columns are the results from wrk (in requests/sec);
the last two columns are their average and standard deviation.
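(For reference, a small sketch of how the last two columns can be derived from
the 15 wrk samples; this assumes the sample standard deviation, i.e. dividing by
n - 1, which appears to match the reported figures:)

#include <math.h>
#include <stdio.h>

/* Print the mean and sample standard deviation of n throughput samples. */
static void summarize(const double *samples, int n)
{
	double sum = 0.0, var = 0.0, mean;
	int i;

	for (i = 0; i < n; i++)
		sum += samples[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		var += (samples[i] - mean) * (samples[i] - mean);

	printf("%.4f %.6f\n", mean, sqrt(var / (n - 1)));
}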
* normal scenario: nginx + wrk
baseline
376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
378271.4447 1210.673592
ep_cc
373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
375878.312 1983.190539
my
373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
374286.1273 2059.92093
* one-fd-per-request scenario: nginx + wrk
baseline
124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
114891.004 5168.30288
ep_cc
124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
118571.01 2437.050182
my
125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
119558.7653 4338.18401
> Thanks,
>
> -Jason
>
>
>>> Signed-off-by: Jason Baron <[email protected]>
>>> Cc: Alexander Viro <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Salman Qazi <[email protected]>
>>
>> Acked-by: Davidlohr Bueso <[email protected]>
>>
>>> ---
>>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>> 1 file changed, 18 insertions(+), 29 deletions(-)
>>
>> Yay for getting rid of some of the callback hell.
>>
>> Thanks,
>> Davidlohr
>
> .
>
On 01/18/2018 06:00 AM, Hou Tao wrote:
> Hi Jason,
>
> On 2017/10/18 22:03, Jason Baron wrote:
>>
>>
>> On 10/17/2017 11:37 AM, Davidlohr Bueso wrote:
>>> On Fri, 13 Oct 2017, Jason Baron wrote:
>>>
>>>> The ep_poll_safewake() function is used to wake up potentially nested
>>>> epoll
>>>> file descriptors. The function uses ep_call_nested() to prevent entering
>>>> the same wake up queue more than once, and to prevent excessively deep
>>>> wakeup paths (deeper than EP_MAX_NESTS). However, this is not necessary
>>>> since we are already preventing these conditions during EPOLL_CTL_ADD.
>>>> This
>>>> saves extra function calls, and avoids taking a global lock during the
>>>> ep_call_nested() calls.
>>>
>>> This makes sense.
>>>
>>>>
>>>> I have, however, left ep_call_nested() for the CONFIG_DEBUG_LOCK_ALLOC
>>>> case, since ep_call_nested() keeps track of the nesting level, and
>>>> this is
>>>> required by the call to spin_lock_irqsave_nested(). It would be nice to
>>>> remove the ep_call_nested() calls for the CONFIG_DEBUG_LOCK_ALLOC case as
>>>> well, however it's not clear how to simply pass the nesting level through
>>>> multiple wake_up() levels without more surgery. In any case, I don't
>>>> think
>>>> CONFIG_DEBUG_LOCK_ALLOC is generally used for production. This patch
>>>> also
>>>> apparently fixes a workload at Google that Salman Qazi reported by
>>>> completely removing the poll_safewake_ncalls->lock from wakeup paths.
>>>
>>> I'm a bit curious about the workload (which uses lots of EPOLL_CTL_ADDs) as
>>> I was tackling the nested epoll scaling issue with loop and readywalk lists
>>> in mind.
>>>>
>>
>> I'm not sure the details of the workload - perhaps Salman can elaborate
>> further about it.
>>
>> It would seem that the safewake would potentially be the most contended
>> in general in the nested case, because generally you have a few epoll
>> fds attached to lots of sources doing wakeups. So those sources are all
>> going to conflict on the safewake lock. The readywalk is used when
>> performing a 'nested' poll and in general this is likely going to be
>> called on a few epoll fds. That said, we should remove it too. I will
>> post a patch to remove it.
>>
>> The loop lock is used during EPOLL_CTL_ADD to check for loops and deep
>> wakeup paths and so I would expect this to be less common, but I
>> wouldn't doubt there are workloads impacted by it. We can potentially, I
>> think, remove this one too - and the global 'epmutex'. I posted some
>> ideas a while ago on it:
>>
>> http://lkml.iu.edu/hypermail//linux/kernel/1501.1/05905.html
>>
>> We can work through these ideas or others...
>
> I have tested these patches on a simple non-nested epoll workload and found a
> performance regression (-0.5%, which is better than the -1.0% from my patch set)
> under the normal nginx conf and a performance improvement (+3%, versus +4% for my
> patch set) under the one-fd-per-request conf. The reason for the performance
> improvement is clear; however, I haven't figured out the reason for the performance
> regression (maybe it is cache related?).
>
> It seems that your patch set works fine under both normal and nested epoll workloads,
> so I don't plan to continue my patch set, which only works for the normal epoll workload.
> I have one question about your patch set: the newly added fields in
> struct file. I tried to move these fields into struct eventpoll, so that only epoll fds
> would be added to a disjoint set and the target fd would be linked to a disjoint set
> dynamically, but that method didn't work out because there are multiple target fds and
> the order in which these target fds are added to a disjoint set can lead to deadlock.
> So do you have other ideas for reducing the size of these fields?
>
> Thanks,
> Tao
Hi,
I agree that increasing the size of struct file is a no-go. One idea is to
simply have a single pointer in 'struct file' for epoll usage, which
could be dynamically allocated when a file is added to an epoll set.
That would reduce the size of things from where we currently are.
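A minimal sketch of that shape, with hypothetical names (this is just an
illustration of the idea, not an actual patch): struct file grows by a single
pointer, and the epoll-specific state is allocated lazily on the first
EPOLL_CTL_ADD that targets the file.

struct file_ep_state {
	/*
	 * whatever per-file epoll bookkeeping ends up being needed,
	 * e.g. the loop-check / disjoint-set fields
	 */
	struct list_head ep_links;
};

struct file {
	/* ... existing fields ... */
	struct file_ep_state *f_ep_state;	/* NULL until first EPOLL_CTL_ADD */
};

static int ep_attach_file_state(struct file *file)
{
	struct file_ep_state *st;

	if (file->f_ep_state)
		return 0;

	st = kzalloc(sizeof(*st), GFP_KERNEL);
	if (!st)
		return -ENOMEM;
	INIT_LIST_HEAD(&st->ep_links);
	/*
	 * a real version would need a cmpxchg() or lock here to avoid
	 * racing with a concurrent EPOLL_CTL_ADD on the same file
	 */
	file->f_ep_state = st;
	return 0;
}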
I'm also not sure that the target fds (or non-epoll fds) need to be
added to the disjoint sets. The sets exist to serialize loop checks,
in which the target fds cannot take part.
If there is interest in resurrecting this patchset, I can pick it back
up. I hadn't been pushing it forward because I wasn't convinced it
mattered on real workloads...
Thanks,
-Jason
>
> The following lines are the performance results:
>
> baseline: result for v4.14
> ep_cc: result for v4.14 + ep_cc patch set
> my: result for v4.14 + my patch set
>
> The first 15 columns are the results from wrk (in requests/sec);
> the last two columns are their average and standard deviation.
>
> * normal scenario: nginx + wrk
>
> baseline
> 376704.74 377223.88 378096.67 379484.45 379199.04 379526.21 378008.82 \
> 379959.98 377634.47 379127.27 377622.92 379442.12 378994.44 376000.08 377046.58 \
> 378271.4447 1210.673592
>
> ep_cc
> 373376.08 376897.65 373493.65 376154.44 377374.36 379080.88 375124.44 \
> 376404.16 376539.28 375141.84 379075.32 377214.39 374524.52 375605.87 372167.8 \
> 375878.312 1983.190539
>
> my
> 373067.96 372369 375317.61 375564.66 373841.88 373699.88 376802.91 369988.97 \
> 376080.45 371580.64 374265.34 376370.34 377033.75 372786.91 375521.61 \
> 374286.1273 2059.92093
>
>
> * one-fd-per-request scenario: nginx + wrk
> baseline
> 124509.77 106542.46 118074.46 111434.02 117628.9 108260.56 119037.91 \
> 114413.75 108706.82 116277.99 111588.4 118216.56 121267.63 110950.4 116455.43 \
> 114891.004 5168.30288
>
> ep_cc
> 124209.45 118297.8 120258.1 122785.95 118744.64 118471.76 117621.07 \
> 117525.14 114797.14 115348.47 117154.02 117208.37 118618.92 118578.8 118945.52 \
> 118571.01 2437.050182
>
> my
> 125090.99 121408.92 116444.09 120055.4 119399.24 114938.04 122889.56 \
> 114363.49 120934.15 119052.03 118060.47 129976.45 121108.08 114849.93 114810.64 \
> 119558.7653 4338.18401
>
>> Thanks,
>>
>> -Jason
>>
>>
>>>> Signed-off-by: Jason Baron <[email protected]>
>>>> Cc: Alexander Viro <[email protected]>
>>>> Cc: Andrew Morton <[email protected]>
>>>> Cc: Salman Qazi <[email protected]>
>>>
>>> Acked-by: Davidlohr Bueso <[email protected]>
>>>
>>>> ---
>>>> fs/eventpoll.c | 47 ++++++++++++++++++-----------------------------
>>>> 1 file changed, 18 insertions(+), 29 deletions(-)
>>>
>>> Yay for getting rid of some of the callback hell.
>>>
>>> Thanks,
>>> Davidlohr
>>
>> .
>>
>