2017-03-28 23:13:01

by Keith Busch

Subject: [PATCH] irq/affinity: Assign all CPUs a vector

The number of vectors to assign needs to be adjusted for each node such
that it doesn't exceed the number of CPUs in that node. This patch
recalculates the vector assignment per-node so that we don't try to
assign more vectors than there are CPUs. When that previously happened,
the cpus_per_vec was calculated to be 0, so many vectors had no CPUs
assigned. Descriptor allocation then fails on those empty masks,
leading to a suboptimal spread.

Not only does this patch get the intended spread, it also fixes
other subsystems that depend on every CPU being assigned to something:
blk_mq_map_swqueue() dereferences a NULL pointer while mapping s/w
queues when CPUs are unassigned, so making sure all CPUs are assigned
fixes that.

Signed-off-by: Keith Busch <[email protected]>
---
kernel/irq/affinity.c | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/kernel/irq/affinity.c b/kernel/irq/affinity.c
index 4544b11..dc52911 100644
--- a/kernel/irq/affinity.c
+++ b/kernel/irq/affinity.c
@@ -59,7 +59,7 @@ static int get_nodes_in_cpumask(const struct cpumask *mask, nodemask_t *nodemsk)
struct cpumask *
irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
{
- int n, nodes, vecs_per_node, cpus_per_vec, extra_vecs, curvec;
+ int n, nodes, cpus_per_vec, extra_vecs, curvec;
int affv = nvecs - affd->pre_vectors - affd->post_vectors;
int last_affv = affv + affd->pre_vectors;
nodemask_t nodemsk = NODE_MASK_NONE;
@@ -94,19 +94,21 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
goto done;
}

- /* Spread the vectors per node */
- vecs_per_node = affv / nodes;
- /* Account for rounding errors */
- extra_vecs = affv - (nodes * vecs_per_node);
-
for_each_node_mask(n, nodemsk) {
- int ncpus, v, vecs_to_assign = vecs_per_node;
+ int ncpus, v, vecs_to_assign, vecs_per_node;
+
+ /* Spread the vectors per node */
+ vecs_per_node = (affv - curvec) / nodes;

/* Get the cpus on this node which are in the mask */
cpumask_and(nmsk, cpu_online_mask, cpumask_of_node(n));

/* Calculate the number of cpus per vector */
ncpus = cpumask_weight(nmsk);
+ vecs_to_assign = min(vecs_per_node, ncpus);
+
+ /* Account for rounding errors */
+ extra_vecs = ncpus - vecs_to_assign;

for (v = 0; curvec < last_affv && v < vecs_to_assign;
curvec++, v++) {
@@ -115,14 +117,14 @@ irq_create_affinity_masks(int nvecs, const struct irq_affinity *affd)
/* Account for extra vectors to compensate rounding errors */
if (extra_vecs) {
cpus_per_vec++;
- if (!--extra_vecs)
- vecs_per_node++;
+ --extra_vecs;
}
irq_spread_init_one(masks + curvec, nmsk, cpus_per_vec);
}

if (curvec >= last_affv)
break;
+ --nodes;
}

done:
--
2.7.2


2017-03-29 17:16:02

by Sagi Grimberg

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector


> The number of vectors to assign needs to be adjusted for each node such
> that it doesn't exceed the number of CPUs in that node. This patch
> recalculates the vector assignment per-node so that we don't try to
> assign more vectors than there are CPUs. When that previously happened,
> the cpus_per_vec was calculated to be 0, so many vectors had no CPUs
> assigned. Descriptor allocation then fails on those empty masks,
> leading to a suboptimal spread.

Can you give a specific (numeric) example where this happens? I'm having
a little trouble following the logical change here.

2017-03-29 17:46:29

by Keith Busch

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector

On Wed, Mar 29, 2017 at 08:15:50PM +0300, Sagi Grimberg wrote:
>
> > The number of vectors to assign needs to be adjusted for each node such
> > that it doesn't exceed the number of CPUs in that node. This patch
> > recalculates the vector assignment per-node so that we don't try to
> > assign more vectors than there are CPUs. When that previously happened,
> > the cpus_per_vec was calculated to be 0, so many vectors had no CPUs
> assigned. Descriptor allocation then fails on those empty masks,
> leading to a suboptimal spread.
>
> Can you give a specific (numeric) example where this happens? I'm having
> a little trouble following the logical change here.

Sure, I have a 2-socket server with 16 threads per socket. I take one
CPU offline in socket 2, so I have 16 threads on socket 1 and 15 on
socket 2: 31 threads in total, so 31 vectors are requested.

Currently, vecs_per_node is calculated in the first iteration as 31 / 2, so 15.

ncpus of socket 1 is 16. cpus_per_vec = 16 / 15, so 1 CPU per vector
with one extra.

When iterating the second socket, though, vecs_per_node has been
incremented from 15 to 16 (to account for the "extra" from before).
However, ncpus is only 15, so that iteration calculates:

cpus_per_vec = 15 / 16

Since that truncates to zero, the remaining 16 vectors get no CPUs
assigned, and the CPUs on the second socket are left without a vector.
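
For reference, here is a small stand-alone user-space sketch (explicitly
not the kernel code) that replays just the division/remainder arithmetic
above for 31 vectors across two nodes with 16 and 15 online CPUs; the
cpumask handling is left out and the numbers come straight from this
example:

#include <stdio.h>

#define MIN(a, b)	((a) < (b) ? (a) : (b))

/* Pre-patch logic: vecs_per_node and extra_vecs computed once up front. */
static void old_spread(int affv, int nnodes, const int *node_cpus)
{
	int vecs_per_node = affv / nnodes;		/* 31 / 2 = 15 */
	int extra_vecs = affv - nnodes * vecs_per_node;	/* 1 */
	int curvec = 0, n;

	for (n = 0; n < nnodes && curvec < affv; n++) {
		int ncpus = node_cpus[n];
		int vecs_to_assign = vecs_per_node;
		int v;

		for (v = 0; curvec < affv && v < vecs_to_assign; v++, curvec++) {
			int cpus_per_vec = ncpus / vecs_to_assign;

			if (extra_vecs) {
				cpus_per_vec++;
				if (!--extra_vecs)
					vecs_per_node++; /* node 2 now wants 16 vectors */
			}
			printf("old: node %d, vector %2d -> %d CPU(s)\n",
			       n, curvec, cpus_per_vec);
		}
	}
}

/* Patched logic: recalculate per node and cap by that node's CPU count. */
static void new_spread(int affv, int nnodes, const int *node_cpus)
{
	int curvec = 0, nodes = nnodes, n;

	for (n = 0; n < nnodes && curvec < affv; n++, nodes--) {
		int ncpus = node_cpus[n];
		int vecs_per_node = (affv - curvec) / nodes;
		int vecs_to_assign = MIN(vecs_per_node, ncpus);
		int extra_vecs = ncpus - vecs_to_assign;
		int v;

		for (v = 0; curvec < affv && v < vecs_to_assign; v++, curvec++) {
			int cpus_per_vec = ncpus / vecs_to_assign;

			if (extra_vecs) {
				cpus_per_vec++;
				--extra_vecs;
			}
			printf("new: node %d, vector %2d -> %d CPU(s)\n",
			       n, curvec, cpus_per_vec);
		}
	}
}

int main(void)
{
	const int node_cpus[] = { 16, 15 };	/* socket 1, socket 2 */

	old_spread(31, 2, node_cpus);	/* node 2's 16 vectors all get 0 CPUs */
	new_spread(31, 2, node_cpus);	/* every CPU ends up assigned a vector */
	return 0;
}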

2017-03-29 17:50:26

by Sagi Grimberg

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector


> Sure, I have a 2-socket server with 16 threads per socket. I take one
> CPU offline in socket 2, so I have 16 threads on socket 1 and 15 on
> socket 2: 31 threads in total, so 31 vectors are requested.
>
> Currently, vecs_per_node is calculated in the first iteration as 31 / 2, so 15.
>
> ncpus of socket 1 is 16. cpus_per_vec = 16 / 15, so 1 CPU per vector
> with one extra.
>
> When iterating the second socket, though, vecs_per_node has been
> incremented from 15 to 16 (to account for the "extra" from before).
> However, ncpus is only 15, so that iteration calculates:
>
> cpus_per_vec = 15 / 16
>
> Since that truncates to zero, the remaining 16 vectors get no CPUs
> assigned, and the CPUs on the second socket are left without a vector.

Thanks for the clarification, makes sense...

Reviewed-by: Sagi Grimberg <[email protected]>

2017-03-30 08:22:32

by Christoph Hellwig

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector

Looks fine,

Reviewed-by: Christoph Hellwig <[email protected]>

2017-03-30 17:04:31

by Keith Busch

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector

Hi Thomas,

I received an email delivery failure on the original patch, so I'm not
sure if you got it. I'm reattaching here with the two reviews just in
case. Please let me know if you've any concerns with the proposal.

Thanks,
Keith


Attachments:
(No filename) (235.00 B)
0001-irq-affinity-Assign-all-CPUs-a-vector.patch (3.02 kB)

2017-03-31 10:59:18

by Thomas Gleixner

Subject: Re: [PATCH] irq/affinity: Assign all CPUs a vector

On Tue, 28 Mar 2017, Keith Busch wrote:

> The number of vectors to assign needs to be adjusted for each node such
> that it doesn't exceed the number of CPUs in that node. This patch
> recalculates the vector assignment per-node so that we don't try to
> assign more vectors than there are CPUs. When that previously happened,
> the cpus_per_vec was calculated to be 0, so many vectors had no CPUs
> assigned. Descriptor allocation then fails on those empty masks,
> leading to a suboptimal spread.

To be honest: This changelog sucks. I really have a hard time figuring out
what's wrong. Can you please structure and rephrase it so it's
understandable for people who did not debug the issue five days ago? That
includes yourself when you have to look at that patch 3 months from now.

A proper structure would be: Context - Problem - Solution. e.g.

irq_create_affinity_masks() spreads the interrupt vectors of a multi
queue device across CPUs.

The algorithm fails to do X, which causes problem Y

Make it do frotz so the blas are properly assigned.

> Not only does this patch get the intended spread, it also fixes
> other subsystems that depend on every CPU being assigned to something:
> blk_mq_map_swqueue() dereferences a NULL pointer while mapping s/w
> queues when CPUs are unassigned, so making sure all CPUs are assigned
> fixes that.

That's hardly a justification for that change. If blk_mq_map_swqueue()
lacks sanity checks, then blk_mq_map_swqueue() is broken and needs to be
fixed independently of this.

Thanks,

tglx

2017-04-12 01:35:33

by kernel test robot

Subject: [lkp-robot] [irq/affinity] 13c024422c: fsmark.files_per_sec -4.3% regression


Greeting,

FYI, we noticed a -4.3% regression of fsmark.files_per_sec due to commit:


commit: 13c024422cbb6dcc513667be9a2613b0f0de781a ("irq/affinity: Assign all CPUs a vector")
url: https://github.com/0day-ci/linux/commits/Keith-Busch/irq-affinity-Assign-all-CPUs-a-vector/20170401-035036


in testcase: fsmark
on test machine: 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 128G memory
with following parameters:

iterations: 8
disk: 1SSD
nr_threads: 4
fs: btrfs
filesize: 9B
test_size: 16G
sync_method: fsyncBeforeClose
nr_directories: 16d
nr_files_per_directory: 256fpd
cpufreq_governor: performance

test-description: fsmark is a file system benchmark to test synchronous write workloads, for example, mail server workloads.
test-url: https://sourceforge.net/projects/fsmark/


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

testcase/path_params/tbox_group/run: fsmark/8-1SSD-4-btrfs-9B-16G-fsyncBeforeClose-16d-256fpd-performance/lkp-hsw-ep4

45e5202213ae6541 13c024422cbb6dcc513667be9a
---------------- --------------------------
%stddev change %stddev
\ | \
13888 -4% 13293 fsmark.files_per_sec
302 4% 315 fsmark.time.elapsed_time
302 4% 315 fsmark.time.elapsed_time.max
2.452e+08 3% 2.528e+08 fsmark.time.file_system_outputs
20265209 19926430 fsmark.time.voluntary_context_switches
206 -4% 198 fsmark.time.percent_of_cpu_this_job_got
707811 -65% 246894 fsmark.time.involuntary_context_switches
383249 ? 13% -36% 246054 interrupts.CAL:Function_call_interrupts
149418 -5% 141286 vmstat.system.cs
190 -7% 178 turbostat.Avg_MHz
6.25 -5% 5.93 turbostat.%Busy
8994 ? 3% 199% 26910 ? 13% perf-stat.cpu-migrations
45475990 44940538 perf-stat.context-switches
0.42 12% 0.48 perf-stat.dTLB-load-miss-rate%
0.09 10% 0.09 perf-stat.dTLB-store-miss-rate%
2.162e+08 7% 2.324e+08 perf-stat.dTLB-store-misses
674521 4% 703995 perf-stat.page-faults
674521 4% 703995 perf-stat.minor-faults
1.46 4% 1.52 perf-stat.branch-miss-rate%
1.445e+09 3% 1.494e+09 perf-stat.iTLB-loads



fsmark.time.elapsed_time

400 ++--------------------------------------------------------------------+
| O |
350 ++ |
300 O+O.O.O O.O.O.*.O..O.O.O O O.O.O.O O O.O.O.O.O.O.O.*.. .*.*.*.*. .*.*.*
| *.* * : * : * * |
250 ++ : : : : |
| : : : : |
200 ++ : : : : |
| : : : : |
150 ++ : : : : |
100 ++ : : : : |
| : : : : |
50 ++ : : |
| : : |
0 ++-----------------------*---------*----------------------------------+


fsmark.time.elapsed_time.max

400 ++--------------------------------------------------------------------+
| O |
350 ++ |
300 O+O.O.O O.O.O.*.O..O.O.O O O.O.O.O O O.O.O.O.O.O.O.*.. .*.*.*.*. .*.*.*
| *.* * : * : * * |
250 ++ : : : : |
| : : : : |
200 ++ : : : : |
| : : : : |
150 ++ : : : : |
100 ++ : : : : |
| : : : : |
50 ++ : : |
| : : |
0 ++-----------------------*---------*----------------------------------+


fsmark.time.involuntary_context_switches

800000 ++-----------------------------------------------------------------+
*.*.*.*.*. .*. .* .*.* .*. .*. .*. .*
700000 ++ * *.*.*.* : *.* : *.* *.*.*.*.*.*.*.* * *.* |
| : : : : |
600000 ++ : : : : |
500000 ++ : : : : |
| : : : : |
400000 ++ : : : : |
| O : : : : |
300000 ++ : : :: |
200000 O+O O O O O O O O O O:O:O O O O:OO O O O O O O |
| :: : |
100000 ++ : : |
| : : |
0 ++----------------------*---------*--------------------------------+


fsmark.files_per_sec

16000 ++------------------------------------------------------------------+
| |
14000 O+*.O.O.O.O.O.*.O.O.O.O O O.O.O.O O O.O.O.O.O.O.O.*.*.*.*.*.*.*.*.*.*
12000 ++O : : : : |
| O : : : : |
10000 ++ : : : : |
| : : : : |
8000 ++ : : : : |
| : : : : |
6000 ++ : : : : |
4000 ++ : : : : |
| :: :: |
2000 ++ : : |
| : : |
0 ++----------------------*---------*---------------------------------+

[*] bisect-good sample
[O] bisect-bad sample


Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Xiaolong


Attachments:
(No filename) (7.93 kB)
config-4.11.0-rc1-00292-g13c0244 (154.27 kB)
job-script (7.05 kB)
job.yaml (4.71 kB)
reproduce (372.00 B)

2017-04-12 14:54:13

by Keith Busch

Subject: Re: [lkp-robot] [irq/affinity] 13c024422c: fsmark.files_per_sec -4.3% regression

On Wed, Apr 12, 2017 at 09:33:28AM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -4.3% regression of fsmark.files_per_sec due to commit:
>
>
> commit: 13c024422cbb6dcc513667be9a2613b0f0de781a ("irq/affinity: Assign all CPUs a vector")
> url: https://github.com/0day-ci/linux/commits/Keith-Busch/irq-affinity-Assign-all-CPUs-a-vector/20170401-035036
>
>
> in testcase: fsmark
> on test machine: 72 threads Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz with 128G memory
> with following parameters:
>
> iterations: 8
> disk: 1SSD
> nr_threads: 4
> fs: btrfs
> filesize: 9B
> test_size: 16G
> sync_method: fsyncBeforeClose
> nr_directories: 16d
> nr_files_per_directory: 256fpd
> cpufreq_governor: performance
>
> test-description: fsmark is a file system benchmark to test synchronous write workloads, for example, mail server workloads.
> test-url: https://sourceforge.net/projects/fsmark/
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->

This wasn't supposed to change anything if all the nodes have the same
number of CPUs. I've reached out to the 0-day team to get a little more
information on the before/after SMP affinity settings to see how this
algorithm messed up the spread on this system.