2017-08-02 08:52:32

by Huang, Ying

Subject: [PATCH 0/3] IPI: Avoid to use 2 cache lines for one call_single_data

From: Huang Ying <[email protected]>

struct call_single_data is used in IPI to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than the
cache line size. Currently it is allocated without any alignment
requirement, so an allocated call_single_data may cross 2 cache lines.
That doubles the number of cache lines that need to be transferred
among CPUs. This is resolved by aligning the allocated call_single_data
to the cache line size.

To allocate cache line size aligned percpu memory dynamically,
alloc_percpu_aligned() is introduced. It is used in the iova driver too.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
is used, so that a huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPI.
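
To make the straddling case concrete, here is a minimal userspace sketch (not part of the series). It assumes a 64-byte cache line, which is common but not universal, and the helper name lines_touched() is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size for illustration; real hardware varies. */
#define LINE_SIZE 64u

/*
 * Number of cache lines touched by an object of `size` bytes that
 * starts at address `addr`. `size` must be greater than zero.
 */
static inline unsigned int lines_touched(uintptr_t addr, size_t size)
{
	uintptr_t first = addr / LINE_SIZE;
	uintptr_t last = (addr + size - 1) / LINE_SIZE;

	return (unsigned int)(last - first + 1);
}
```

A 32-byte object placed 48 bytes into a line spills into the next line, while the same object placed on a 32-byte boundary never does.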

Best Regards,
Huang, Ying


2017-08-02 08:52:34

by Huang, Ying

Subject: [PATCH 1/3] percpu: Add alloc_percpu_aligned()

From: Huang Ying <[email protected]>

Add alloc_percpu_aligned() to dynamically allocate percpu memory that
is aligned to the cache line size. We can statically allocate such
percpu memory with DEFINE_PER_CPU_ALIGNED(), but there is no
corresponding API for dynamic allocation.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/percpu.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 491b3f5a5f8a..8b80a965d64a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -129,5 +129,8 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
#define alloc_percpu(type) \
(typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
__alignof__(type))
+#define alloc_percpu_aligned(type) \
+ ((typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
+ max_t(unsigned int, cache_line_size(), __alignof__(type))))

#endif /* __LINUX_PERCPU_H */
--
2.13.2

2017-08-02 08:52:38

by Huang, Ying

Subject: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

From: Huang Ying <[email protected]>

struct call_single_data is used in IPI to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than the
cache line size. Currently it is allocated without any alignment
requirement, so an allocated call_single_data may cross 2 cache lines.
That doubles the number of cache lines that need to be transferred
among CPUs. This is resolved by aligning the allocated call_single_data
to the cache line size.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
is used, so that a huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPI.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Aaron Lu <[email protected]>
---
kernel/smp.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index 3061483cb3ad..81d9ae08eb6e 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->csd = alloc_percpu(struct call_single_data);
+ cfd->csd = alloc_percpu_aligned(struct call_single_data);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -269,7 +269,9 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
{
struct call_single_data *csd;
- struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
+ struct call_single_data csd_stack ____cacheline_aligned = {
+ .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS
+ };
int this_cpu;
int err;

--
2.13.2

2017-08-02 08:53:28

by Huang, Ying

Subject: [PATCH 2/3] iova: Use alloc_percpu_aligned()

From: Huang Ying <[email protected]>

Use the newly introduced alloc_percpu_aligned(), which allocates
cache line size aligned percpu memory dynamically.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: [email protected]
---
drivers/iommu/iova.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c
index 246f14c83944..83196f8e8fd5 100644
--- a/drivers/iommu/iova.c
+++ b/drivers/iommu/iova.c
@@ -698,7 +698,7 @@ static void init_iova_rcaches(struct iova_domain *iovad)
rcache = &iovad->rcaches[i];
spin_lock_init(&rcache->lock);
rcache->depot_size = 0;
- rcache->cpu_rcaches = __alloc_percpu(sizeof(*cpu_rcache), cache_line_size());
+ rcache->cpu_rcaches = alloc_percpu_aligned(*cpu_rcache);
if (WARN_ON(!rcache->cpu_rcaches))
continue;
for_each_possible_cpu(cpu) {
--
2.13.2

2017-08-02 10:19:03

by Eric Dumazet

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Wed, 2017-08-02 at 16:52 +0800, Huang, Ying wrote:
> From: Huang Ying <[email protected]>
>
> struct call_single_data is used in IPI to transfer information between
> CPUs. Its size is bigger than sizeof(unsigned long) and less than
> cache line size. Now, it is allocated with no any alignment
> requirement. This makes it possible for allocated call_single_data to
> cross 2 cache lines. So that double the number of the cache lines
> that need to be transferred among CPUs. This is resolved by aligning
> the allocated call_single_data with cache line size.
>
> To test the effect of the patch, we use the vm-scalability multiple
> thread swap test case (swap-w-seq-mt). The test will create multiple
> threads and each thread will eat memory until all RAM and part of swap
> is used, so that huge number of IPI will be triggered when unmapping
> memory. In the test, the throughput of memory writing improves ~5%
> compared with misaligned call_single_data because of faster IPI.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Michael Ellerman <[email protected]>
> Cc: Borislav Petkov <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Juergen Gross <[email protected]>
> Cc: Aaron Lu <[email protected]>
> ---
> kernel/smp.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index 3061483cb3ad..81d9ae08eb6e 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
> free_cpumask_var(cfd->cpumask);
> return -ENOMEM;
> }
> - cfd->csd = alloc_percpu(struct call_single_data);
> + cfd->csd = alloc_percpu_aligned(struct call_single_data);

I do not believe allocating 64 bytes (per cpu) for this structure is
needed. That would be an increase of cache lines.

What we can do instead is to force an alignment on 4*sizeof(void *).
(32 bytes on 64bit, 16 bytes on 32bit arches)

Maybe something like this :

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -19,7 +19,7 @@ struct call_single_data {
smp_call_func_t func;
void *info;
unsigned int flags;
-};
+} __attribute__((aligned(4 * sizeof(void *))));

/* total number of cpus in this system (may exceed NR_CPUS) */
extern unsigned int total_cpus;
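
For readers who want to try the suggested attribute outside the kernel, a minimal userspace sketch follows. struct csd_sketch is a hypothetical stand-in with the same field shapes as call_single_data (a single pointer replaces struct llist_node); it is not the kernel definition, and the size check assumes an ILP32 or LP64 ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*call_func_t)(void *info);

/* Hypothetical stand-in for struct call_single_data. */
struct csd_sketch {
	void *next;		/* stands in for struct llist_node */
	call_func_t func;
	void *info;
	unsigned int flags;
} __attribute__((aligned(4 * sizeof(void *))));

/* The attribute pads the size up to a multiple of the alignment. */
static_assert(sizeof(struct csd_sketch) == 4 * sizeof(void *),
	      "size equals alignment on ILP32 and LP64");
static_assert(_Alignof(struct csd_sketch) == 4 * sizeof(void *),
	      "alignment comes from the attribute");

/* Two adjacent objects, to check that neither crosses a size-aligned block. */
static struct csd_sketch csd_pair[2];

/* True if *p lies entirely inside one naturally aligned block of its own size. */
static inline int csd_within_one_block(const struct csd_sketch *p)
{
	uintptr_t a = (uintptr_t)p;

	return a / sizeof(*p) == (a + sizeof(*p) - 1) / sizeof(*p);
}
```

Since the block size (32 or 16 bytes) divides a 64-byte cache line, staying inside one block implies staying inside one cache line, without spending a full line per object.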


2017-08-02 10:53:34

by Peter Zijlstra

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Wed, Aug 02, 2017 at 03:18:58AM -0700, Eric Dumazet wrote:
> What we can do instead is to force an alignment on 4*sizeof(void *).
> (32 bytes on 64bit, 16 bytes on 32bit arches)
>
> Maybe something like this :
>
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -19,7 +19,7 @@ struct call_single_data {
> smp_call_func_t func;
> void *info;
> unsigned int flags;
> -};
> +} __attribute__((aligned(4 * sizeof(void *))));

Agreed.

Subject: Re: [PATCH 1/3] percpu: Add alloc_percpu_aligned()

On Wed, 2 Aug 2017, Huang, Ying wrote:

> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -129,5 +129,8 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
> #define alloc_percpu(type) \
> (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
> __alignof__(type))
> +#define alloc_percpu_aligned(type) \
> + ((typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
> + max_t(unsigned int, cache_line_size(), __alignof__(type))))
>
> #endif /* __LINUX_PERCPU_H */

This is not needed since alloc_percpu() already uses __alignof__(type).

If you add an attribute to the definition of "type" that requires
cacheline alignment (e.g. __cacheline_aligned) then alloc_percpu() will
align the allocation as you desire.
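
The point above can be sketched in userspace. This is only an analogue: alloc_typed() is a made-up macro that mimics how alloc_percpu() derives size and alignment from the type, built on C11 aligned_alloc(), and the 64-byte line size is an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_LINE_SIZE 64	/* assumed cache line size */

/* Userspace analogue of alloc_percpu(type): size and alignment come from the type. */
#define alloc_typed(type) \
	((type *)aligned_alloc(_Alignof(type), sizeof(type)))

/* No attribute: the allocator only sees the natural alignment. */
struct plain_counter {
	unsigned long hits;
};

/* Attribute on the type: every alignof-driven allocator picks it up. */
struct aligned_counter {
	unsigned long hits;
} __attribute__((aligned(SKETCH_LINE_SIZE)));

/* Allocates one aligned_counter and reports whether it starts on a line boundary. */
static inline int aligned_counter_on_line_boundary(void)
{
	struct aligned_counter *p = alloc_typed(struct aligned_counter);
	int ok = p && ((uintptr_t)p % SKETCH_LINE_SIZE) == 0;

	free(p);
	return ok;
}
```

No second allocation macro is needed here; the alignment request lives with the type definition.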


Subject: Re: [PATCH 0/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Wed, 2 Aug 2017, Huang, Ying wrote:

> To allocate cache line size aligned percpu memory dynamically,
> alloc_percpu_aligned() is introduced and used in iova drivers too.

alloc_percpu() already aligns objects as specified when they are declared.

Moreover the function is improperly named since it aligns
to a cache line. If you want this then you would use

alloc_percpu_cacheline_aligned()

But then the alignment can already be requested by adding
__cacheline_aligned to the per cpu definition.


2017-08-03 00:33:25

by Huang, Ying

Subject: Re: [PATCH 1/3] percpu: Add alloc_percpu_aligned()

Christopher Lameter <[email protected]> writes:

> On Wed, 2 Aug 2017, Huang, Ying wrote:
>
>> --- a/include/linux/percpu.h
>> +++ b/include/linux/percpu.h
>> @@ -129,5 +129,8 @@ extern phys_addr_t per_cpu_ptr_to_phys(void *addr);
>> #define alloc_percpu(type) \
>> (typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
>> __alignof__(type))
>> +#define alloc_percpu_aligned(type) \
>> + ((typeof(type) __percpu *)__alloc_percpu(sizeof(type), \
>> + max_t(unsigned int, cache_line_size(), __alignof__(type))))
>>
>> #endif /* __LINUX_PERCPU_H */
>
> This is not needed since alloc_percpu() already uses __alignof__(type).
>
> If you add an attribute to the definition of "type" that requires
> cacheline alignment (e.g. __cacheline_aligned) then alloc_percpu() will
> align the allocation as you desire.

OK.

Best Regards,
Huang, Ying

2017-08-03 08:35:46

by Huang, Ying

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

Eric Dumazet <[email protected]> writes:

> On Wed, 2017-08-02 at 16:52 +0800, Huang, Ying wrote:
>> From: Huang Ying <[email protected]>
>>
>> struct call_single_data is used in IPI to transfer information between
>> CPUs. Its size is bigger than sizeof(unsigned long) and less than
>> cache line size. Now, it is allocated with no any alignment
>> requirement. This makes it possible for allocated call_single_data to
>> cross 2 cache lines. So that double the number of the cache lines
>> that need to be transferred among CPUs. This is resolved by aligning
>> the allocated call_single_data with cache line size.
>>
>> To test the effect of the patch, we use the vm-scalability multiple
>> thread swap test case (swap-w-seq-mt). The test will create multiple
>> threads and each thread will eat memory until all RAM and part of swap
>> is used, so that huge number of IPI will be triggered when unmapping
>> memory. In the test, the throughput of memory writing improves ~5%
>> compared with misaligned call_single_data because of faster IPI.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Ingo Molnar <[email protected]>
>> Cc: Michael Ellerman <[email protected]>
>> Cc: Borislav Petkov <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Juergen Gross <[email protected]>
>> Cc: Aaron Lu <[email protected]>
>> ---
>> kernel/smp.c | 6 ++++--
>> 1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/smp.c b/kernel/smp.c
>> index 3061483cb3ad..81d9ae08eb6e 100644
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
>> free_cpumask_var(cfd->cpumask);
>> return -ENOMEM;
>> }
>> - cfd->csd = alloc_percpu(struct call_single_data);
>> + cfd->csd = alloc_percpu_aligned(struct call_single_data);
>
> I do not believe allocating 64 bytes (per cpu) for this structure is
> needed. That would be an increase of cache lines.
>
> What we can do instead is to force an alignment on 4*sizeof(void *).
> (32 bytes on 64bit, 16 bytes on 32bit arches)
>
> Maybe something like this :
>
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe54918c051292eb5ba3427df09f31c2f..f7072bf173c5456e38e958d6af85a4793bced96e 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -19,7 +19,7 @@ struct call_single_data {
> smp_call_func_t func;
> void *info;
> unsigned int flags;
> -};
> +} __attribute__((aligned(4 * sizeof(void *))));
>
> /* total number of cpus in this system (may exceed NR_CPUS) */
> extern unsigned int total_cpus;

OK. And if sizeof(struct call_single_data) changes, we need to
change the alignment accordingly too. So I added some BUILD_BUG_ON()
checks for that.

Best Regards,
Huang, Ying

------------------>8------------------
From 2c400e9b1793f1c1d33bc278f5bc066e32ca4fee Mon Sep 17 00:00:00 2001
From: Huang Ying <[email protected]>
Date: Thu, 27 Jul 2017 16:43:20 +0800
Subject: [PATCH -v2] IPI: Avoid to use 2 cache lines for one call_single_data

struct call_single_data is used in IPI to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than the
cache line size. Currently it is allocated without any alignment
requirement, so an allocated call_single_data may cross 2 cache lines.
That doubles the number of cache lines that need to be transferred
among CPUs.

This is resolved by aligning the allocated call_single_data to 4 *
sizeof(void *). If the size of struct call_single_data is changed in
the future, the alignment should be changed accordingly. It should be
no less than sizeof(struct call_single_data) and a power of 2.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
is used, so that a huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPI.

[Align with 4 * sizeof(void*) instead of cache line size]
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Aaron Lu <[email protected]>
---
include/linux/smp.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..4d3b372d50b0 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -13,13 +13,22 @@
#include <linux/init.h>
#include <linux/llist.h>

+#define CSD_ALIGNMENT (4 * sizeof(void *))
+
typedef void (*smp_call_func_t)(void *info);
struct call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
-};
+} __aligned(CSD_ALIGNMENT);
+
+/* To avoid allocate csd across 2 cache lines */
+static inline void check_alignment_of_csd(void)
+{
+ BUILD_BUG_ON((CSD_ALIGNMENT & (CSD_ALIGNMENT - 1)) != 0);
+ BUILD_BUG_ON(sizeof(struct call_single_data) > CSD_ALIGNMENT);
+}

/* total number of cpus in this system (may exceed NR_CPUS) */
extern unsigned int total_cpus;
--
2.13.2
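
The two build-time checks in this version can be tried in userspace with C11 static_assert in place of BUILD_BUG_ON(). This is a sketch under assumptions: the SKETCH_ names and struct sketch_csd are hypothetical, and a single pointer stands in for struct llist_node.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CSD_ALIGNMENT (4 * sizeof(void *))

typedef void (*sketch_call_func_t)(void *info);

/* Hypothetical stand-in for struct call_single_data. */
struct sketch_csd {
	void *next;		/* stands in for struct llist_node */
	sketch_call_func_t func;
	void *info;
	unsigned int flags;
} __attribute__((aligned(SKETCH_CSD_ALIGNMENT)));

/* A power of two has exactly one bit set, so x & (x - 1) is zero. */
#define SKETCH_IS_POW2(x) ((x) != 0 && (((x) & ((x) - 1)) == 0))

/* Same two conditions as the BUILD_BUG_ON() pair, checked at file scope. */
static_assert(SKETCH_IS_POW2(SKETCH_CSD_ALIGNMENT),
	      "alignment must be a power of 2");
static_assert(sizeof(struct sketch_csd) <= SKETCH_CSD_ALIGNMENT,
	      "struct must fit inside one alignment unit");
```

Growing the struct past four pointers would trip the second assertion at compile time, which is the reminder to raise the alignment.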

2017-08-03 08:58:05

by Peter Zijlstra

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Thu, Aug 03, 2017 at 04:35:21PM +0800, Huang, Ying wrote:
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe549..4d3b372d50b0 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -13,13 +13,22 @@
> #include <linux/init.h>
> #include <linux/llist.h>
>
> +#define CSD_ALIGNMENT (4 * sizeof(void *))
> +
> typedef void (*smp_call_func_t)(void *info);
> struct call_single_data {
> struct llist_node llist;
> smp_call_func_t func;
> void *info;
> unsigned int flags;
> -};
> +} __aligned(CSD_ALIGNMENT);
> +
> +/* To avoid allocate csd across 2 cache lines */
> +static inline void check_alignment_of_csd(void)
> +{
> + BUILD_BUG_ON((CSD_ALIGNMENT & (CSD_ALIGNMENT - 1)) != 0);
> + BUILD_BUG_ON(sizeof(struct call_single_data) > CSD_ALIGNMENT);
> +}
>
> /* total number of cpus in this system (may exceed NR_CPUS) */
> extern unsigned int total_cpus;

Bah, C sucks.. a much larger but possibly nicer patch

---
diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 770d4d1516cb..bd8ba5472bca 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST

static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
-static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
+static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);

void tick_broadcast(const struct cpumask *mask)
{
atomic_t *count;
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for_each_cpu(cpu, mask) {
@@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)

static int __init tick_broadcast_init(void)
{
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for (cpu = 0; cpu < NR_CPUS; cpu++) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 87b7df4851bf..07125e7941f4 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
static int raise_blk_irq(int cpu, struct request *rq)
{
if (cpu_online(cpu)) {
- struct call_single_data *data = &rq->csd;
+ call_single_data_t *data = &rq->csd;

data->func = trigger_softirq;
data->info = rq;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 85c24cace973..81142ce781da 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -13,7 +13,7 @@
struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
- struct call_single_data csd;
+ call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 71e586d7df71..e54be79b2084 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -119,7 +119,7 @@ struct cpuidle_coupled {

#define CPUIDLE_COUPLED_NOT_IDLE (-1)

-static DEFINE_PER_CPU(struct call_single_data, cpuidle_coupled_poke_cb);
+static DEFINE_PER_CPU(call_single_data_t, cpuidle_coupled_poke_cb);

/*
* The cpuidle_coupled_poke_pending mask is used to avoid calling
@@ -339,7 +339,7 @@ static void cpuidle_coupled_handle_poke(void *info)
*/
static void cpuidle_coupled_poke(int cpu)
{
- struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
+ call_single_data_t *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);

if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
smp_call_function_single_async(cpu, csd);
@@ -651,7 +651,7 @@ int cpuidle_coupled_register_device(struct cpuidle_device *dev)
{
int cpu;
struct cpuidle_device *other_dev;
- struct call_single_data *csd;
+ call_single_data_t *csd;
struct cpuidle_coupled *coupled;

if (cpumask_empty(&dev->coupled_cpus))
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 51583ae4b1eb..120b6e537b28 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2468,7 +2468,7 @@ static void liquidio_napi_drv_callback(void *arg)
if (OCTEON_CN23XX_PF(oct) || droq->cpu_id == this_cpu) {
napi_schedule_irqoff(&droq->napi);
} else {
- struct call_single_data *csd = &droq->csd;
+ call_single_data_t *csd = &droq->csd;

csd->func = napi_schedule_wrapper;
csd->info = &droq->napi;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
index 6efd139b894d..f91bc84d1719 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
@@ -328,7 +328,7 @@ struct octeon_droq {

u32 cpu_id;

- struct call_single_data csd;
+ call_single_data_t csd;
};

#define OCT_DROQ_SIZE (sizeof(struct octeon_droq))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..006fa09a641e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -134,7 +134,7 @@ typedef __u32 __bitwise req_flags_t;
struct request {
struct list_head queuelist;
union {
- struct call_single_data csd;
+ call_single_data_t csd;
u64 fifo_time;
};

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 779b23595596..6557f320b66e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2774,7 +2774,7 @@ struct softnet_data {
unsigned int input_queue_head ____cacheline_aligned_in_smp;

/* Elements below can be accessed between CPUs for RPS/RFS */
- struct call_single_data csd ____cacheline_aligned_in_smp;
+ call_single_data_t csd ____cacheline_aligned_in_smp;
struct softnet_data *rps_ipi_next;
unsigned int cpu;
unsigned int input_queue_tail;
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..8d817cb80a38 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,13 +14,16 @@
#include <linux/llist.h>

typedef void (*smp_call_func_t)(void *info);
-struct call_single_data {
+struct __call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
};

+typedef struct __call_single_data call_single_data_t
+ __aligned(sizeof(struct __call_single_data));
+
/* total number of cpus in this system (may exceed NR_CPUS) */
extern unsigned int total_cpus;

@@ -48,7 +51,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd);
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);

#ifdef CONFIG_SMP

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2f9c9f7c0ee..39f9cc47eb1c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -769,7 +769,7 @@ struct rq {
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
int hrtick_csd_pending;
- struct call_single_data hrtick_csd;
+ call_single_data_t hrtick_csd;
#endif
struct hrtimer hrtick_timer;
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index 3061483cb3ad..1fb84b6db429 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -28,7 +28,7 @@ enum {
};

struct call_function_data {
- struct call_single_data __percpu *csd;
+ call_single_data_t __percpu *csd;
cpumask_var_t cpumask;
cpumask_var_t cpumask_ipi;
};
@@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->csd = alloc_percpu(struct call_single_data);
+ cfd->csd = alloc_percpu(call_single_data_t);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -103,12 +103,12 @@ void __init call_function_init(void)
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
-static __always_inline void csd_lock_wait(struct call_single_data *csd)
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
{
smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
}

-static __always_inline void csd_lock(struct call_single_data *csd)
+static __always_inline void csd_lock(call_single_data_t *csd)
{
csd_lock_wait(csd);
csd->flags |= CSD_FLAG_LOCK;
@@ -121,7 +121,7 @@ static __always_inline void csd_lock(struct call_single_data *csd)
smp_wmb();
}

-static __always_inline void csd_unlock(struct call_single_data *csd)
+static __always_inline void csd_unlock(call_single_data_t *csd)
{
WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

@@ -131,14 +131,14 @@ static __always_inline void csd_unlock(struct call_single_data *csd)
smp_store_release(&csd->flags, 0);
}

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
+static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);

/*
* Insert a previously allocated call_single_data element
* for execution on the given CPU. data must already have
* ->func, ->info, and ->flags set.
*/
-static int generic_exec_single(int cpu, struct call_single_data *csd,
+static int generic_exec_single(int cpu, call_single_data_t *csd,
smp_call_func_t func, void *info)
{
if (cpu == smp_processor_id()) {
@@ -210,7 +210,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
{
struct llist_head *head;
struct llist_node *entry;
- struct call_single_data *csd, *csd_next;
+ call_single_data_t *csd, *csd_next;
static bool warned;

WARN_ON(!irqs_disabled());
@@ -268,8 +268,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
{
- struct call_single_data *csd;
- struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
+ call_single_data_t *csd;
+ call_single_data_t csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
int this_cpu;
int err;

@@ -321,7 +321,7 @@ EXPORT_SYMBOL(smp_call_function_single);
* NOTE: Be careful, there is unfortunately no current debugging facility to
* validate the correctness of this serialization.
*/
-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
int err = 0;

@@ -444,7 +444,7 @@ void smp_call_function_many(const struct cpumask *mask,

cpumask_clear(cfd->cpumask_ipi);
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
+ call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

csd_lock(csd);
if (wait)
@@ -460,7 +460,7 @@ void smp_call_function_many(const struct cpumask *mask,

if (wait) {
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd;
+ call_single_data_t *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
csd_lock_wait(csd);
diff --git a/kernel/up.c b/kernel/up.c
index ee81ac9af4ca..42c46bf3e0a5 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -23,7 +23,7 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
}
EXPORT_SYMBOL(smp_call_function_single);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
unsigned long flags;


2017-08-04 01:28:20

by Huang, Ying

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

Peter Zijlstra <[email protected]> writes:
[snip]
> diff --git a/include/linux/smp.h b/include/linux/smp.h
> index 68123c1fe549..8d817cb80a38 100644
> --- a/include/linux/smp.h
> +++ b/include/linux/smp.h
> @@ -14,13 +14,16 @@
> #include <linux/llist.h>
>
> typedef void (*smp_call_func_t)(void *info);
> -struct call_single_data {
> +struct __call_single_data {
> struct llist_node llist;
> smp_call_func_t func;
> void *info;
> unsigned int flags;
> };
>
> +typedef struct __call_single_data call_single_data_t
> + __aligned(sizeof(struct __call_single_data));
> +

Another requirement of the alignment is that it should be a power of
2. For example, if someone adds a field to the struct so that the size
becomes 40 on x86_64, the alignment should be 64 instead of 40.

Best Regards,
Huang, Ying

> /* total number of cpus in this system (may exceed NR_CPUS) */
> extern unsigned int total_cpus;
>
[snip]

2017-08-04 02:05:59

by Huang, Ying

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

"Huang, Ying" <[email protected]> writes:

> Peter Zijlstra <[email protected]> writes:
> [snip]
>> diff --git a/include/linux/smp.h b/include/linux/smp.h
>> index 68123c1fe549..8d817cb80a38 100644
>> --- a/include/linux/smp.h
>> +++ b/include/linux/smp.h
>> @@ -14,13 +14,16 @@
>> #include <linux/llist.h>
>>
>> typedef void (*smp_call_func_t)(void *info);
>> -struct call_single_data {
>> +struct __call_single_data {
>> struct llist_node llist;
>> smp_call_func_t func;
>> void *info;
>> unsigned int flags;
>> };
>>
>> +typedef struct __call_single_data call_single_data_t
>> + __aligned(sizeof(struct __call_single_data));
>> +
>
> Another requirement of the alignment is that it should be the power of
> 2. Otherwise, for example, if someone adds a field to struct, so that
> the size becomes 40 on x86_64. The alignment should be 64 instead of
> 40.

Thanks to Aaron, who reminded me that there is a roundup_pow_of_two().
So the typedef could be,

typedef struct __call_single_data call_single_data_t
__aligned(roundup_pow_of_two(sizeof(struct __call_single_data)));
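
To see the 40 -> 64 case in isolation, here is a userspace sketch. SK_ROUNDUP_POW2() is a made-up constant-expression macro standing in for the kernel's roundup_pow_of_two(), valid for values up to 2^31, and struct sk_grown_csd is a hypothetical five-field version of the struct.

```c
#include <assert.h>
#include <stddef.h>

/* Smear the highest set bit of a 32-bit value into all lower bits. */
#define SK_FILL1(x)  ((x) | ((x) >> 1))
#define SK_FILL2(x)  (SK_FILL1(x) | (SK_FILL1(x) >> 2))
#define SK_FILL4(x)  (SK_FILL2(x) | (SK_FILL2(x) >> 4))
#define SK_FILL8(x)  (SK_FILL4(x) | (SK_FILL4(x) >> 8))
#define SK_FILL16(x) (SK_FILL8(x) | (SK_FILL8(x) >> 16))

/* Round n (1 <= n <= 2^31) up to a power of two, as a constant expression. */
#define SK_ROUNDUP_POW2(n) (SK_FILL16((n) - 1) + 1)

typedef void (*sk_func_t)(void *info);

/* Hypothetical grown struct: 40 bytes on LP64, 20 bytes on ILP32. */
struct sk_grown_csd {
	void *next;		/* stands in for struct llist_node */
	sk_func_t func;
	void *info;
	unsigned int flags;
	void *extra_field;
};

/* Alignment is the size rounded up, so it stays a valid power of two. */
typedef struct sk_grown_csd sk_grown_csd_t
	__attribute__((aligned(SK_ROUNDUP_POW2(sizeof(struct sk_grown_csd)))));
```

With the rounding in place, adding a field changes the alignment automatically instead of producing a compile error.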

Best Regards,
Huang, Ying

> Best Regards,
> Huang, Ying
>
>> /* total number of cpus in this system (may exceed NR_CPUS) */
>> extern unsigned int total_cpus;
>>
> [snip]

2017-08-04 09:20:49

by Peter Zijlstra

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Fri, Aug 04, 2017 at 09:28:17AM +0800, Huang, Ying wrote:
> Peter Zijlstra <[email protected]> writes:
> [snip]
> > diff --git a/include/linux/smp.h b/include/linux/smp.h
> > index 68123c1fe549..8d817cb80a38 100644
> > --- a/include/linux/smp.h
> > +++ b/include/linux/smp.h
> > @@ -14,13 +14,16 @@
> > #include <linux/llist.h>
> >
> > typedef void (*smp_call_func_t)(void *info);
> > -struct call_single_data {
> > +struct __call_single_data {
> > struct llist_node llist;
> > smp_call_func_t func;
> > void *info;
> > unsigned int flags;
> > };
> >
> > +typedef struct __call_single_data call_single_data_t
> > + __aligned(sizeof(struct __call_single_data));
> > +
>
> Another requirement of the alignment is that it should be the power of
> 2. Otherwise, for example, if someone adds a field to struct, so that
> the size becomes 40 on x86_64. The alignment should be 64 instead of
> 40.

Yes I know. This generates a compiler error if sizeof() isn't a
power of 2. That's similar to the BUILD_BUG_ON() you added.

2017-08-04 09:28:17

by Peter Zijlstra

Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Fri, Aug 04, 2017 at 10:05:55AM +0800, Huang, Ying wrote:
> "Huang, Ying" <[email protected]> writes:
> > Peter Zijlstra <[email protected]> writes:

> >> +struct __call_single_data {
> >> struct llist_node llist;
> >> smp_call_func_t func;
> >> void *info;
> >> unsigned int flags;
> >> };
> >>
> >> +typedef struct __call_single_data call_single_data_t
> >> + __aligned(sizeof(struct __call_single_data));
> >> +
> >
> > Another requirement of the alignment is that it should be the power of
> > 2. Otherwise, for example, if someone adds a field to struct, so that
> > the size becomes 40 on x86_64. The alignment should be 64 instead of
> > 40.
>
> Thanks Aaron, he reminded me that there is a roundup_pow_of_two(). So
> the typedef could be,
>
> typedef struct __call_single_data call_single_data_t
> __aligned(roundup_pow_of_two(sizeof(struct __call_single_data)));
>

Yes, that would take away the requirement to play padding games with the
struct. Then again, maybe it's a good thing to have to be explicit about
it.

If you see:

struct __call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
int flags;
void *extra_field;

unsigned long __padding[3]; /* make align work */
};

that makes it very clear what is going on. In any case, we can delay
this part because the current structure is a power-of-2 for both ILP32
and LP64. So only the person growing this will have to deal with it ;-)

2017-08-05 00:47:08

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

Peter Zijlstra <[email protected]> writes:

> On Fri, Aug 04, 2017 at 10:05:55AM +0800, Huang, Ying wrote:
>> "Huang, Ying" <[email protected]> writes:
>> > Peter Zijlstra <[email protected]> writes:
>
>> >> +struct __call_single_data {
>> >> struct llist_node llist;
>> >> smp_call_func_t func;
>> >> void *info;
>> >> unsigned int flags;
>> >> };
>> >>
>> >> +typedef struct __call_single_data call_single_data_t
>> >> + __aligned(sizeof(struct __call_single_data));
>> >> +
>> >
>> > Another requirement of the alignment is that it should be the power of
>> > 2. Otherwise, for example, if someone adds a field to struct, so that
>> > the size becomes 40 on x86_64. The alignment should be 64 instead of
>> > 40.
>>
>> Thanks Aaron, he reminded me that there is a roundup_pow_of_two(). So
>> the typedef could be,
>>
>> typedef struct __call_single_data call_single_data_t
>> __aligned(roundup_pow_of_two(sizeof(struct __call_single_data)));
>>
>
> Yes, that would take away the requirement to play padding games with the
> struct. Then again, maybe it's a good thing to have to be explicit about
> it.
>
> If you see:
>
> struct __call_single_data {
> struct llist_node llist;
> smp_call_func_t func;
> void *info;
> int flags;
> void *extra_field;
>
> unsigned long __padding[3]; /* make align work */
> };
>
> that makes it very clear what is going on. In any case, we can delay
> this part because the current structure is a power-of-2 for both ILP32
> and LP64. So only the person growing this will have to deal with it ;-)

Yes. That looks good. Will you prepare the final patch, or are you
hoping I'll do that?

Best Regards,
Huang, Ying

2017-08-07 08:28:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
> Yes. That looks good. Will you prepare the final patch, or are you
> hoping I'll do that?

I was hoping you'd do it ;-)

2017-08-08 04:30:27

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

Peter Zijlstra <[email protected]> writes:

> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
>> Yes. That looks good. Will you prepare the final patch, or are you
>> hoping I'll do that?
>
> I was hoping you'd do it ;-)

Thanks! Here is the updated patch

Best Regards,
Huang, Ying

---------->8----------
From 957735e9ff3922368286540dab852986fc7b23b5 Mon Sep 17 00:00:00 2001
From: Huang Ying <[email protected]>
Date: Mon, 7 Aug 2017 16:55:33 +0800
Subject: [PATCH -v3] IPI: Avoid to use 2 cache lines for one
call_single_data

struct call_single_data is used in IPI to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than the
cache line size. Currently, it is allocated with no explicit alignment
requirement. This makes it possible for an allocated call_single_data
to cross 2 cache lines, which doubles the number of cache lines that
need to be transferred among CPUs.

This is resolved by requiring call_single_data to be aligned to its own
size. Currently the size of call_single_data is a power of 2. If we add
new fields to call_single_data, we may need to add padding to make sure
the size of the new definition is a power of 2 as well. Fortunately,
this is enforced by gcc, which will report an error for an alignment
requirement that is not a power of 2.

To set the alignment requirement of call_single_data to its own size,
a struct definition and a typedef are used.

To test the effect of the patch, we use the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
is used, so that a huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data because of faster IPIs.

[Add call_single_data_t and align with size of call_single_data]
Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Aaron Lu <[email protected]>
---
arch/mips/kernel/smp.c | 6 ++--
block/blk-softirq.c | 2 +-
drivers/block/null_blk.c | 2 +-
drivers/cpuidle/coupled.c | 10 +++----
drivers/net/ethernet/cavium/liquidio/lio_main.c | 2 +-
drivers/net/ethernet/cavium/liquidio/octeon_droq.h | 2 +-
include/linux/blkdev.h | 2 +-
include/linux/netdevice.h | 2 +-
include/linux/smp.h | 8 ++++--
kernel/sched/sched.h | 2 +-
kernel/smp.c | 32 ++++++++++++----------
kernel/up.c | 2 +-
12 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 770d4d1516cb..bd8ba5472bca 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST

static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
-static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
+static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);

void tick_broadcast(const struct cpumask *mask)
{
atomic_t *count;
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for_each_cpu(cpu, mask) {
@@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)

static int __init tick_broadcast_init(void)
{
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for (cpu = 0; cpu < NR_CPUS; cpu++) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 87b7df4851bf..07125e7941f4 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
static int raise_blk_irq(int cpu, struct request *rq)
{
if (cpu_online(cpu)) {
- struct call_single_data *data = &rq->csd;
+ call_single_data_t *data = &rq->csd;

data->func = trigger_softirq;
data->info = rq;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 85c24cace973..81142ce781da 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -13,7 +13,7 @@
struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
- struct call_single_data csd;
+ call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 71e586d7df71..147f38ea0fcd 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -119,13 +119,13 @@ struct cpuidle_coupled {

#define CPUIDLE_COUPLED_NOT_IDLE (-1)

-static DEFINE_PER_CPU(struct call_single_data, cpuidle_coupled_poke_cb);
+static DEFINE_PER_CPU(call_single_data_t, cpuidle_coupled_poke_cb);

/*
* The cpuidle_coupled_poke_pending mask is used to avoid calling
- * __smp_call_function_single with the per cpu call_single_data struct already
+ * __smp_call_function_single with the per cpu call_single_data_t struct already
* in use. This prevents a deadlock where two cpus are waiting for each others
- * call_single_data struct to be available
+ * call_single_data_t struct to be available
*/
static cpumask_t cpuidle_coupled_poke_pending;

@@ -339,7 +339,7 @@ static void cpuidle_coupled_handle_poke(void *info)
*/
static void cpuidle_coupled_poke(int cpu)
{
- struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
+ call_single_data_t *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);

if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
smp_call_function_single_async(cpu, csd);
@@ -651,7 +651,7 @@ int cpuidle_coupled_register_device(struct cpuidle_device *dev)
{
int cpu;
struct cpuidle_device *other_dev;
- struct call_single_data *csd;
+ call_single_data_t *csd;
struct cpuidle_coupled *coupled;

if (cpumask_empty(&dev->coupled_cpus))
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 51583ae4b1eb..120b6e537b28 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2468,7 +2468,7 @@ static void liquidio_napi_drv_callback(void *arg)
if (OCTEON_CN23XX_PF(oct) || droq->cpu_id == this_cpu) {
napi_schedule_irqoff(&droq->napi);
} else {
- struct call_single_data *csd = &droq->csd;
+ call_single_data_t *csd = &droq->csd;

csd->func = napi_schedule_wrapper;
csd->info = &droq->napi;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
index 6efd139b894d..f91bc84d1719 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
@@ -328,7 +328,7 @@ struct octeon_droq {

u32 cpu_id;

- struct call_single_data csd;
+ call_single_data_t csd;
};

#define OCT_DROQ_SIZE (sizeof(struct octeon_droq))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0cb27d3..006fa09a641e 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -134,7 +134,7 @@ typedef __u32 __bitwise req_flags_t;
struct request {
struct list_head queuelist;
union {
- struct call_single_data csd;
+ call_single_data_t csd;
u64 fifo_time;
};

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 779b23595596..6557f320b66e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2774,7 +2774,7 @@ struct softnet_data {
unsigned int input_queue_head ____cacheline_aligned_in_smp;

/* Elements below can be accessed between CPUs for RPS/RFS */
- struct call_single_data csd ____cacheline_aligned_in_smp;
+ call_single_data_t csd ____cacheline_aligned_in_smp;
struct softnet_data *rps_ipi_next;
unsigned int cpu;
unsigned int input_queue_tail;
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1fe549..98b1fe027fc9 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,13 +14,17 @@
#include <linux/llist.h>

typedef void (*smp_call_func_t)(void *info);
-struct call_single_data {
+struct __call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
};

+/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
+typedef struct __call_single_data call_single_data_t
+ __aligned(sizeof(struct __call_single_data));
+
/* total number of cpus in this system (may exceed NR_CPUS) */
extern unsigned int total_cpus;

@@ -48,7 +52,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd);
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);

#ifdef CONFIG_SMP

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3086d1..f29a7d2b57e1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -769,7 +769,7 @@ struct rq {
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
int hrtick_csd_pending;
- struct call_single_data hrtick_csd;
+ call_single_data_t hrtick_csd;
#endif
struct hrtimer hrtick_timer;
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index 3061483cb3ad..81cfca9b4cc3 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -28,7 +28,7 @@ enum {
};

struct call_function_data {
- struct call_single_data __percpu *csd;
+ call_single_data_t __percpu *csd;
cpumask_var_t cpumask;
cpumask_var_t cpumask_ipi;
};
@@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->csd = alloc_percpu(struct call_single_data);
+ cfd->csd = alloc_percpu(call_single_data_t);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -103,12 +103,12 @@ void __init call_function_init(void)
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
-static __always_inline void csd_lock_wait(struct call_single_data *csd)
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
{
smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
}

-static __always_inline void csd_lock(struct call_single_data *csd)
+static __always_inline void csd_lock(call_single_data_t *csd)
{
csd_lock_wait(csd);
csd->flags |= CSD_FLAG_LOCK;
@@ -116,12 +116,12 @@ static __always_inline void csd_lock(struct call_single_data *csd)
/*
* prevent CPU from reordering the above assignment
* to ->flags with any subsequent assignments to other
- * fields of the specified call_single_data structure:
+ * fields of the specified call_single_data_t structure:
*/
smp_wmb();
}

-static __always_inline void csd_unlock(struct call_single_data *csd)
+static __always_inline void csd_unlock(call_single_data_t *csd)
{
WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

@@ -131,14 +131,14 @@ static __always_inline void csd_unlock(struct call_single_data *csd)
smp_store_release(&csd->flags, 0);
}

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
+static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);

/*
- * Insert a previously allocated call_single_data element
+ * Insert a previously allocated call_single_data_t element
* for execution on the given CPU. data must already have
* ->func, ->info, and ->flags set.
*/
-static int generic_exec_single(int cpu, struct call_single_data *csd,
+static int generic_exec_single(int cpu, call_single_data_t *csd,
smp_call_func_t func, void *info)
{
if (cpu == smp_processor_id()) {
@@ -210,7 +210,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
{
struct llist_head *head;
struct llist_node *entry;
- struct call_single_data *csd, *csd_next;
+ call_single_data_t *csd, *csd_next;
static bool warned;

WARN_ON(!irqs_disabled());
@@ -268,8 +268,10 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
{
- struct call_single_data *csd;
- struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
+ call_single_data_t *csd;
+ call_single_data_t csd_stack = {
+ .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS,
+ };
int this_cpu;
int err;

@@ -321,7 +323,7 @@ EXPORT_SYMBOL(smp_call_function_single);
* NOTE: Be careful, there is unfortunately no current debugging facility to
* validate the correctness of this serialization.
*/
-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
int err = 0;

@@ -444,7 +446,7 @@ void smp_call_function_many(const struct cpumask *mask,

cpumask_clear(cfd->cpumask_ipi);
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
+ call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

csd_lock(csd);
if (wait)
@@ -460,7 +462,7 @@ void smp_call_function_many(const struct cpumask *mask,

if (wait) {
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd;
+ call_single_data_t *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
csd_lock_wait(csd);
diff --git a/kernel/up.c b/kernel/up.c
index ee81ac9af4ca..42c46bf3e0a5 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -23,7 +23,7 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
}
EXPORT_SYMBOL(smp_call_function_single);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
unsigned long flags;

--
2.11.0

2017-08-14 05:44:28

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

Hi, Peter,

"Huang, Ying" <[email protected]> writes:

> Peter Zijlstra <[email protected]> writes:
>
>> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
>>> Yes. That looks good. Will you prepare the final patch, or are you
>>> hoping I'll do that?
>>
>> I was hoping you'd do it ;-)
>
> Thanks! Here is the updated patch
>
> Best Regards,
> Huang, Ying
>
> [snip]

What do you think about this version?

Best Regards,
Huang, Ying

> [snip]

2017-08-28 05:19:25

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

"Huang, Ying" <[email protected]> writes:

> Hi, Peter,
>
> "Huang, Ying" <[email protected]> writes:
>
>> Peter Zijlstra <[email protected]> writes:
>>
>>> On Sat, Aug 05, 2017 at 08:47:02AM +0800, Huang, Ying wrote:
>>>> Yes. That looks good. Will you prepare the final patch, or are you
>>>> hoping I'll do that?
>>>
>>> I was hoping you'd do it ;-)
>>
>> Thanks! Here is the updated patch
>>
>> Best Regards,
>> Huang, Ying
>>
>> [snip]
>
> What do you think about this version?
>

Ping.

Best Regards,
Huang, Ying

2017-08-28 08:49:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 3/3] IPI: Avoid to use 2 cache lines for one call_single_data

On Mon, Aug 28, 2017 at 01:19:21PM +0800, Huang, Ying wrote:
> > What do you think about this version?
> >
>
> Ping.

Thanks, yes that got lost in the inbox :-(

I'll queue it, thanks !

Subject: [tip:locking/core] smp: Avoid using two cache lines for struct call_single_data

Commit-ID: 966a967116e699762dbf4af7f9e0d1955c25aa37
Gitweb: http://git.kernel.org/tip/966a967116e699762dbf4af7f9e0d1955c25aa37
Author: Ying Huang <[email protected]>
AuthorDate: Tue, 8 Aug 2017 12:30:00 +0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 29 Aug 2017 15:14:38 +0200

smp: Avoid using two cache lines for struct call_single_data

struct call_single_data is used in IPIs to transfer information between
CPUs. Its size is bigger than sizeof(unsigned long) and less than
cache line size. Currently it is not allocated with any explicit alignment
requirements. This makes it possible for allocated call_single_data to
cross two cache lines, which results in double the number of the cache lines
that need to be transferred among CPUs.
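
As a rough illustration of the problem described above, the following
userspace sketch counts how many cache lines an object touches for a given
start address. The helper name cache_lines_spanned() and the 64-byte line
size are assumptions made for this example only; neither is part of the
patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; 64 bytes is common but not universal. */
#define CACHE_LINE_SIZE 64u

/*
 * Hypothetical helper: number of cache lines touched by an object of
 * 'size' bytes that starts at address 'addr'.
 */
static size_t cache_lines_spanned(uintptr_t addr, size_t size, size_t line)
{
	assert(size > 0);
	/* The line size must be a non-zero power of 2. */
	assert(line != 0 && (line & (line - 1)) == 0);

	uintptr_t first = addr / line;
	uintptr_t last = (addr + size - 1) / line;

	return (size_t)(last - first + 1);
}
```

With a 32-byte object, a start offset of 48 gives two lines, while any
offset that is a multiple of 32 gives one. That is the reason aligning the
object to its own (power of 2) size is sufficient here.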

This can be fixed by requiring call_single_data to be aligned to the
size of call_single_data. Currently the size of call_single_data is a
power of 2. If we add new fields to call_single_data, we may need to
add padding to make sure the size of the new definition is a power of 2
as well.

Fortunately, this is enforced by GCC, which will report bad sizes.

To set the alignment requirement of call_single_data to the size of
call_single_data, a struct definition and a typedef are used.
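
The struct-plus-typedef pattern can be tried outside the kernel with a
sketch like the one below. All names ending in _sketch are stand-ins
invented for this example, the one-pointer llist node is an assumption
about the layout, and the GCC/Clang __attribute__((aligned())) spelling is
used in place of the kernel's __aligned() macro.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's struct llist_node (assumed: one pointer). */
struct llist_node_sketch {
	struct llist_node_sketch *next;
};

typedef void (*smp_call_func_sketch_t)(void *info);

/* Plain struct: alignment is whatever its members require. */
struct call_single_data_sketch {
	struct llist_node_sketch llist;
	smp_call_func_sketch_t func;
	void *info;
	unsigned int flags;
};

/* Typedef that raises the alignment to the size of the struct. */
typedef struct call_single_data_sketch csd_sketch_t
	__attribute__((aligned(sizeof(struct call_single_data_sketch))));

/* The compiler only accepts power of 2 alignments; make that explicit. */
static_assert((sizeof(struct call_single_data_sketch) &
	       (sizeof(struct call_single_data_sketch) - 1)) == 0,
	      "struct size must be a power of 2");
static_assert(alignof(csd_sketch_t) ==
	      sizeof(struct call_single_data_sketch),
	      "typedef must be aligned to the struct size");

/* Returns non-zero if the object starts on a multiple of its own size. */
static int csd_is_self_aligned(const csd_sketch_t *csd)
{
	return ((uintptr_t)csd % sizeof(*csd)) == 0;
}
```

Because the alignment is attached to the typedef and not to the struct
itself, only users that switch to the typedef pick up the stricter
alignment, which is why the diff converts every user.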

To test the effect of the patch, I used the vm-scalability multiple
thread swap test case (swap-w-seq-mt). The test creates multiple
threads, and each thread eats memory until all RAM and part of swap
is used, so that a huge number of IPIs are triggered when unmapping
memory. In the test, the throughput of memory writing improves ~5%
compared with misaligned call_single_data, because of faster IPIs.

Suggested-by: Peter Zijlstra <[email protected]>
Signed-off-by: Huang, Ying <[email protected]>
[ Add call_single_data_t and align with size of call_single_data. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Aaron Lu <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Juergen Gross <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Michael Ellerman <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/mips/kernel/smp.c | 6 ++--
block/blk-softirq.c | 2 +-
drivers/block/null_blk.c | 2 +-
drivers/cpuidle/coupled.c | 10 +++----
drivers/net/ethernet/cavium/liquidio/lio_main.c | 2 +-
drivers/net/ethernet/cavium/liquidio/octeon_droq.h | 2 +-
include/linux/blkdev.h | 2 +-
include/linux/netdevice.h | 2 +-
include/linux/smp.h | 8 ++++--
kernel/sched/sched.h | 2 +-
kernel/smp.c | 32 ++++++++++++----------
kernel/up.c | 2 +-
12 files changed, 39 insertions(+), 33 deletions(-)

diff --git a/arch/mips/kernel/smp.c b/arch/mips/kernel/smp.c
index 6bace76..c7cbddf 100644
--- a/arch/mips/kernel/smp.c
+++ b/arch/mips/kernel/smp.c
@@ -648,12 +648,12 @@ EXPORT_SYMBOL(flush_tlb_one);
#ifdef CONFIG_GENERIC_CLOCKEVENTS_BROADCAST

static DEFINE_PER_CPU(atomic_t, tick_broadcast_count);
-static DEFINE_PER_CPU(struct call_single_data, tick_broadcast_csd);
+static DEFINE_PER_CPU(call_single_data_t, tick_broadcast_csd);

void tick_broadcast(const struct cpumask *mask)
{
atomic_t *count;
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for_each_cpu(cpu, mask) {
@@ -674,7 +674,7 @@ static void tick_broadcast_callee(void *info)

static int __init tick_broadcast_init(void)
{
- struct call_single_data *csd;
+ call_single_data_t *csd;
int cpu;

for (cpu = 0; cpu < NR_CPUS; cpu++) {
diff --git a/block/blk-softirq.c b/block/blk-softirq.c
index 87b7df4..07125e7 100644
--- a/block/blk-softirq.c
+++ b/block/blk-softirq.c
@@ -60,7 +60,7 @@ static void trigger_softirq(void *data)
static int raise_blk_irq(int cpu, struct request *rq)
{
if (cpu_online(cpu)) {
- struct call_single_data *data = &rq->csd;
+ call_single_data_t *data = &rq->csd;

data->func = trigger_softirq;
data->info = rq;
diff --git a/drivers/block/null_blk.c b/drivers/block/null_blk.c
index 85c24ca..81142ce 100644
--- a/drivers/block/null_blk.c
+++ b/drivers/block/null_blk.c
@@ -13,7 +13,7 @@
struct nullb_cmd {
struct list_head list;
struct llist_node ll_list;
- struct call_single_data csd;
+ call_single_data_t csd;
struct request *rq;
struct bio *bio;
unsigned int tag;
diff --git a/drivers/cpuidle/coupled.c b/drivers/cpuidle/coupled.c
index 71e586d..147f38e 100644
--- a/drivers/cpuidle/coupled.c
+++ b/drivers/cpuidle/coupled.c
@@ -119,13 +119,13 @@ struct cpuidle_coupled {

#define CPUIDLE_COUPLED_NOT_IDLE (-1)

-static DEFINE_PER_CPU(struct call_single_data, cpuidle_coupled_poke_cb);
+static DEFINE_PER_CPU(call_single_data_t, cpuidle_coupled_poke_cb);

/*
* The cpuidle_coupled_poke_pending mask is used to avoid calling
- * __smp_call_function_single with the per cpu call_single_data struct already
+ * __smp_call_function_single with the per cpu call_single_data_t struct already
* in use. This prevents a deadlock where two cpus are waiting for each others
- * call_single_data struct to be available
+ * call_single_data_t struct to be available
*/
static cpumask_t cpuidle_coupled_poke_pending;

@@ -339,7 +339,7 @@ static void cpuidle_coupled_handle_poke(void *info)
*/
static void cpuidle_coupled_poke(int cpu)
{
- struct call_single_data *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);
+ call_single_data_t *csd = &per_cpu(cpuidle_coupled_poke_cb, cpu);

if (!cpumask_test_and_set_cpu(cpu, &cpuidle_coupled_poke_pending))
smp_call_function_single_async(cpu, csd);
@@ -651,7 +651,7 @@ int cpuidle_coupled_register_device(struct cpuidle_device *dev)
{
int cpu;
struct cpuidle_device *other_dev;
- struct call_single_data *csd;
+ call_single_data_t *csd;
struct cpuidle_coupled *coupled;

if (cpumask_empty(&dev->coupled_cpus))
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 51583ae..120b6e5 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2468,7 +2468,7 @@ static void liquidio_napi_drv_callback(void *arg)
if (OCTEON_CN23XX_PF(oct) || droq->cpu_id == this_cpu) {
napi_schedule_irqoff(&droq->napi);
} else {
- struct call_single_data *csd = &droq->csd;
+ call_single_data_t *csd = &droq->csd;

csd->func = napi_schedule_wrapper;
csd->info = &droq->napi;
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
index 6efd139..f91bc84 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_droq.h
@@ -328,7 +328,7 @@ struct octeon_droq {

u32 cpu_id;

- struct call_single_data csd;
+ call_single_data_t csd;
};

#define OCT_DROQ_SIZE (sizeof(struct octeon_droq))
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 25f6a0c..006fa09 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -134,7 +134,7 @@ typedef __u32 __bitwise req_flags_t;
struct request {
struct list_head queuelist;
union {
- struct call_single_data csd;
+ call_single_data_t csd;
u64 fifo_time;
};

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 779b235..6557f32 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2774,7 +2774,7 @@ struct softnet_data {
unsigned int input_queue_head ____cacheline_aligned_in_smp;

/* Elements below can be accessed between CPUs for RPS/RFS */
- struct call_single_data csd ____cacheline_aligned_in_smp;
+ call_single_data_t csd ____cacheline_aligned_in_smp;
struct softnet_data *rps_ipi_next;
unsigned int cpu;
unsigned int input_queue_tail;
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 68123c1..98b1fe0 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -14,13 +14,17 @@
#include <linux/llist.h>

typedef void (*smp_call_func_t)(void *info);
-struct call_single_data {
+struct __call_single_data {
struct llist_node llist;
smp_call_func_t func;
void *info;
unsigned int flags;
};

+/* Use __aligned() to avoid to use 2 cache lines for 1 csd */
+typedef struct __call_single_data call_single_data_t
+ __aligned(sizeof(struct __call_single_data));
+
/* total number of cpus in this system (may exceed NR_CPUS) */
extern unsigned int total_cpus;

@@ -48,7 +52,7 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
smp_call_func_t func, void *info, bool wait,
gfp_t gfp_flags);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd);
+int smp_call_function_single_async(int cpu, call_single_data_t *csd);

#ifdef CONFIG_SMP

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index eeef1a3..f29a7d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -769,7 +769,7 @@ struct rq {
#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
int hrtick_csd_pending;
- struct call_single_data hrtick_csd;
+ call_single_data_t hrtick_csd;
#endif
struct hrtimer hrtick_timer;
#endif
diff --git a/kernel/smp.c b/kernel/smp.c
index 3061483..81cfca9 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -28,7 +28,7 @@ enum {
};

struct call_function_data {
- struct call_single_data __percpu *csd;
+ call_single_data_t __percpu *csd;
cpumask_var_t cpumask;
cpumask_var_t cpumask_ipi;
};
@@ -51,7 +51,7 @@ int smpcfd_prepare_cpu(unsigned int cpu)
free_cpumask_var(cfd->cpumask);
return -ENOMEM;
}
- cfd->csd = alloc_percpu(struct call_single_data);
+ cfd->csd = alloc_percpu(call_single_data_t);
if (!cfd->csd) {
free_cpumask_var(cfd->cpumask);
free_cpumask_var(cfd->cpumask_ipi);
@@ -103,12 +103,12 @@ void __init call_function_init(void)
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
*/
-static __always_inline void csd_lock_wait(struct call_single_data *csd)
+static __always_inline void csd_lock_wait(call_single_data_t *csd)
{
smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
}

-static __always_inline void csd_lock(struct call_single_data *csd)
+static __always_inline void csd_lock(call_single_data_t *csd)
{
csd_lock_wait(csd);
csd->flags |= CSD_FLAG_LOCK;
@@ -116,12 +116,12 @@ static __always_inline void csd_lock(struct call_single_data *csd)
/*
* prevent CPU from reordering the above assignment
* to ->flags with any subsequent assignments to other
- * fields of the specified call_single_data structure:
+ * fields of the specified call_single_data_t structure:
*/
smp_wmb();
}

-static __always_inline void csd_unlock(struct call_single_data *csd)
+static __always_inline void csd_unlock(call_single_data_t *csd)
{
WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

@@ -131,14 +131,14 @@ static __always_inline void csd_unlock(struct call_single_data *csd)
smp_store_release(&csd->flags, 0);
}

-static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
+static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);

/*
- * Insert a previously allocated call_single_data element
+ * Insert a previously allocated call_single_data_t element
* for execution on the given CPU. data must already have
* ->func, ->info, and ->flags set.
*/
-static int generic_exec_single(int cpu, struct call_single_data *csd,
+static int generic_exec_single(int cpu, call_single_data_t *csd,
smp_call_func_t func, void *info)
{
if (cpu == smp_processor_id()) {
@@ -210,7 +210,7 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
{
struct llist_head *head;
struct llist_node *entry;
- struct call_single_data *csd, *csd_next;
+ call_single_data_t *csd, *csd_next;
static bool warned;

WARN_ON(!irqs_disabled());
@@ -268,8 +268,10 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
{
- struct call_single_data *csd;
- struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
+ call_single_data_t *csd;
+ call_single_data_t csd_stack = {
+ .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS,
+ };
int this_cpu;
int err;

@@ -321,7 +323,7 @@ EXPORT_SYMBOL(smp_call_function_single);
* NOTE: Be careful, there is unfortunately no current debugging facility to
* validate the correctness of this serialization.
*/
-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
int err = 0;

@@ -444,7 +446,7 @@ void smp_call_function_many(const struct cpumask *mask,

cpumask_clear(cfd->cpumask_ipi);
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);
+ call_single_data_t *csd = per_cpu_ptr(cfd->csd, cpu);

csd_lock(csd);
if (wait)
@@ -460,7 +462,7 @@ void smp_call_function_many(const struct cpumask *mask,

if (wait) {
for_each_cpu(cpu, cfd->cpumask) {
- struct call_single_data *csd;
+ call_single_data_t *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
csd_lock_wait(csd);
diff --git a/kernel/up.c b/kernel/up.c
index ee81ac9..42c46bf 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -23,7 +23,7 @@ int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
}
EXPORT_SYMBOL(smp_call_function_single);

-int smp_call_function_single_async(int cpu, struct call_single_data *csd)
+int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
unsigned long flags;