Date: Wed, 7 Sep 2022 17:39:47 +0800
From: "Sun, Jiebin" <jiebin.sun@intel.com>
Subject: Re: [PATCH] ipc/msg.c: mitigate the lock contention with percpu counter
To: Tim Chen, Shakeel Butt
Cc: Andrew Morton, vasily.averin@linux.dev, Dennis Zhou, Tejun Heo,
    Christoph Lameter, "Eric W. Biederman", Alexey Gladkov, Manfred Spraul,
    alexander.mikhalitsyn@virtuozzo.com, Linux MM, LKML, "Chen, Tim C",
    Feng Tang, Huang Ying, tianyou.li@intel.com, wangyang.guo@intel.com,
    jiebin.sun@intel.com
References: <20220902152243.479592-1-jiebin.sun@intel.com>
    <048517e7f95aa8460cd47a169f3dfbd8e9b70d5c.camel@linux.intel.com>
In-Reply-To: <048517e7f95aa8460cd47a169f3dfbd8e9b70d5c.camel@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 9/7/2022 2:44 AM, Tim Chen wrote:
> On Fri, 2022-09-02 at 09:27 -0700, Shakeel Butt wrote:
>> On Fri, Sep 2, 2022 at 12:04 AM Jiebin Sun wrote:
>>> The msg_bytes and msg_hdrs atomic counters are frequently
>>> updated when the IPC msg queue is in heavy use, causing heavy
>>> cache bounce and
>>> overhead. Changing them to percpu_counters
>>> greatly improves the performance. Since there is only one unique
>>> ipc namespace, the additional memory cost is minimal. Reading
>>> of the count is done in the msgctl call, which is infrequent. So
>>> the need to sum up the counts from each CPU is infrequent.
>>>
>>> Apply the patch and test with pts/stress-ng-1.4.0
>>> -- system v message passing (160 threads).
>>>
>>> Score gain: 3.38x
>>>
>>> CPU: ICX 8380 x 2 sockets
>>> Core number: 40 x 2 physical cores
>>> Benchmark: pts/stress-ng-1.4.0
>>> -- system v message passing (160 threads)
>>>
>>> Signed-off-by: Jiebin Sun
>> [...]
>>> +void percpu_counter_add_local(struct percpu_counter *fbc, s64 amount)
>>> +{
>>> +	this_cpu_add(*fbc->counters, amount);
>>> +}
>>> +EXPORT_SYMBOL(percpu_counter_add_local);
>> Why not percpu_counter_add()? This may drift fbc->count by more than
>> batch*nr_cpus. I am assuming that is not an issue for you, as you
>> always do an expensive sum in the slow path. As Andrew asked, this
>> should be a separate patch.
> In the IPC case, the read is always done as an accurate read using
> percpu_counter_sum(), gathering all the per-CPU counts, and
> never with percpu_counter_read(), which only reads the global count.
> So Jiebin was not worried about accuracy.
>
> However, the counter is s64 and the local per-cpu counter is s32.
> So the counter size has shrunk if we only keep the count in the local
> per-cpu counter, which can overflow a lot sooner, and that is not okay.
>
> Jiebin, can you try percpu_counter_add_batch with a large
> batch size? That should achieve what you want without needing
> to create a percpu_counter_add_local() function, and it also avoids
> the overflow problem.
>
> Tim
>
I have sent out patch v4, which uses percpu_counter_add_batch. With a
tuned large batch size (1024), the performance gain is 3.17x (patch v4)
vs 3.38x (patch v3) previously in stress-ng -- message passing.
It still gives a significant performance improvement, and a good balance
between the performance gain and the overflow issue.

Jiebin