From: "Huang, Ying"
To: Eric Dumazet
Cc: "Huang, Ying", Peter Zijlstra, Ingo Molnar, Michael Ellerman, Borislav Petkov, Thomas Gleixner, Juergen Gross, Aaron Lu
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data
Date: Thu, 03 Aug 2017 16:35:21 +0800
Message-ID: <87d18d122e.fsf@yhuang-dev.intel.com>
In-Reply-To: <1501669138.25002.20.camel@edumazet-glaptop3.roam.corp.google.com>
References: <20170802085220.4315-1-ying.huang@intel.com> <20170802085220.4315-4-ying.huang@intel.com> <1501669138.25002.20.camel@edumazet-glaptop3.roam.corp.google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Eric Dumazet writes:

> On Wed, 2017-08-02 at 16:52 +0800, Huang, Ying wrote:
>> From: Huang Ying
>>
>> struct call_single_data is used in IPI to transfer information between
>> CPUs.  Its size is bigger than sizeof(unsigned long) and less than the
>> cache line size.  Currently it is allocated with no alignment
>> requirement, which makes it possible for an allocated call_single_data
>> to cross 2 cache lines, doubling the number of cache lines that need
>> to be transferred among CPUs.  This is resolved by aligning the
>> allocated call_single_data with the cache line size.
>>
>> To test the effect of the patch, we use the vm-scalability multiple
>> thread swap test case (swap-w-seq-mt).  The test creates multiple
>> threads, and each thread eats memory until all RAM and part of swap
>> is used, so that a huge number of IPIs are triggered when unmapping
>> memory.  In the test, the throughput of memory writing improves ~5%
>> compared with misaligned call_single_data because of faster IPIs.
>>
>> Signed-off-by: "Huang, Ying"
>> Cc: Peter Zijlstra
>> Cc: Ingo Molnar
>> Cc: Michael Ellerman
>> Cc: Borislav Petkov
>> Cc: Thomas Gleixner
>> Cc: Juergen Gross
>> Cc: Aaron Lu
>> ---
>>  kernel/smp.c | 6 ++++--
>>  1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 3061483cb3ad..81d9ae08eb6e 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
>>  		free_cpumask_var(cfd->cpumask);
>>  		return -ENOMEM;
>>  	}
>> -	cfd->csd = alloc_percpu(struct call_single_data);
>> +	cfd->csd = alloc_percpu_aligned(struct call_single_data);
>
> I do not believe allocating 64 bytes (per cpu) for this structure is
> needed.  That would be an increase of cache lines.
>
> What we can do instead is to force an alignment on 4 * sizeof(void *)
> (32 bytes on 64-bit, 16 bytes on 32-bit arches).
>
> Maybe something like this:
>
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -19,7 +19,7 @@ struct call_single_data {
>  	smp_call_func_t func;
>  	void *info;
>  	unsigned int flags;
> -};
> +} __attribute__((aligned(4 * sizeof(void *))));
>
>  /* total number of cpus in this system (may exceed NR_CPUS) */
>  extern unsigned int total_cpus;
OK.  And if sizeof(struct call_single_data) changes, we need to change
the alignment accordingly too.  So I have added some BUILD_BUG_ON()s
for that.

Best Regards,
Huang, Ying

------------------>8------------------
From 2c400e9b1793f1c1d33bc278f5bc066e32ca4fee Mon Sep 17 00:00:00 2001
From: Huang Ying
Date: Thu, 27 Jul 2017 16:43:20 +0800
Subject: [PATCH -v2] IPI: Avoid to use 2 cache lines for one call_single_data

struct call_single_data is used in IPI to transfer information between
CPUs.  Its size is bigger than sizeof(unsigned long) and less than the
cache line size.  Currently it is allocated with no alignment
requirement, which makes it possible for an allocated call_single_data
to cross 2 cache lines, doubling the number of cache lines that need
to be transferred among CPUs.  This is resolved by aligning the
allocated call_single_data with 4 * sizeof(void *).  If the size of
struct call_single_data is changed in the future, the alignment should
be changed accordingly.  It should be no less than
sizeof(struct call_single_data) and a power of 2.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt).  The test creates multiple
threads, and each thread eats memory until all RAM and part of swap is
used, so that a huge number of IPIs are triggered when unmapping
memory.  In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPIs.
[Align with 4 * sizeof(void*) instead of cache line size]
Suggested-by: Eric Dumazet
Signed-off-by: "Huang, Ying"
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Michael Ellerman
Cc: Borislav Petkov
Cc: Thomas Gleixner
Cc: Juergen Gross
Cc: Aaron Lu
---
 include/linux/smp.h | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..4d3b372d50b0 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -13,13 +13,22 @@
 #include
 #include

+#define CSD_ALIGNMENT	(4 * sizeof(void *))
+
 typedef void (*smp_call_func_t)(void *info);
 struct call_single_data {
 	struct llist_node llist;
 	smp_call_func_t func;
 	void *info;
 	unsigned int flags;
-};
+} __aligned(CSD_ALIGNMENT);
+
+/* To avoid allocating a csd across 2 cache lines */
+static inline void check_alignment_of_csd(void)
+{
+	BUILD_BUG_ON((CSD_ALIGNMENT & (CSD_ALIGNMENT - 1)) != 0);
+	BUILD_BUG_ON(sizeof(struct call_single_data) > CSD_ALIGNMENT);
+}

 /* total number of cpus in this system (may exceed NR_CPUS) */
 extern unsigned int total_cpus;
--
2.13.2