2015-08-18 15:55:44

by Dario Faggioli

[permalink] [raw]
Subject: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

Hey everyone,

So, as a followup of what we were discussing in this thread:

[Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html

I started looking in more detail at scheduling domains in the Linux
kernel. Now, that thread was about CPUID and vNUMA, and their weird way
of interacting, while what I'm proposing here is completely
independent of them both.

In fact, no matter whether vNUMA is supported and enabled, and no matter
whether CPUID is reporting accurate, random, meaningful or completely
misleading information, I think that we should do something about how
scheduling domains are built.

Fact is, unless we use 1:1 and immutable (across the entire guest
lifetime) pinning, scheduling domains should not be constructed, in
Linux, by looking at *any* topology information, because that just does
not make any sense when vcpus move around.

Let me state this again (hoping to make myself as clear as possible): no
matter how good a shape we put CPUID support in, no matter how
beautifully and consistently that will interact with vNUMA, licensing
requirements and whatever else: it will always be possible for
vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
on two different NUMA nodes at time t2. Hence, the Linux scheduler
should really not skew its load balancing logic toward either of those
two situations, as neither of them can be considered correct (since
nothing is!).

For now, this only covers the PV case. The HVM case shouldn't be any
different, but I haven't looked at how to make the same thing happen
there as well.

OVERALL DESCRIPTION
===================
What this RFC patch does is, in the Xen PV case, configure scheduling
domains in such a way that there is only one of them, spanning all the
vCPUs of the guest.
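
For reference, the per-CPU hierarchy being flattened is the one Linux
builds from its default topology table. The snippet below is only a
rough sketch of that table (paraphrased from default_topology[] in
kernel/sched/core.c of kernels in this range, modulo config options;
NUMA levels get appended at runtime by sched_init_numa()), not part of
the patch:

/*
 * Rough sketch of the kernel's default_topology[]: SMT siblings,
 * then cores sharing a cache, then the whole package. The patch
 * below replaces all of this with a single "VCPU" level.
 */
static struct sched_domain_topology_level default_topology[] = {
#ifdef CONFIG_SCHED_SMT
	{ cpu_smt_mask, cpu_smt_flags, SD_INIT_NAME(SMT) },
#endif
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};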

Note that the patch deals directly with scheduling domains, and there is
no need to alter the masks that will then be used for building and
reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
the main difference between it and the patch proposed by Juergen here:
http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html

This means that when, in the future, we fix CPUID handling and make it
comply with whatever logic or requirements we want, that won't have any
unexpected side effects on scheduling domains.

Information about how the scheduling domains are being constructed
during boot is available in `dmesg', if the kernel is booted with the
'sched_debug' parameter. It is also possible to look
at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.

With the patch applied, only one scheduling domain is created, called
the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
tell that from the fact that every cpu* folder
in /proc/sys/kernel/sched_domain/ only has one subdirectory
('domain0'), containing all the tweaks and the tunables for our
scheduling domain.
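
(Purely as an illustration, here is a minimal userspace sketch of that
check. It assumes CONFIG_SCHED_DEBUG and sysctl support are enabled,
which is also what makes /proc/sys/kernel/sched_domain and the
per-domain 'name' file exist in the first place; it is not part of the
patch.)

/*
 * List the scheduling domains of CPU0, as exposed under
 * /proc/sys/kernel/sched_domain. With the patch applied, the
 * expected output is a single line: "domain0: VCPU".
 */
#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
	const char *base = "/proc/sys/kernel/sched_domain/cpu0";
	char path[256], name[64];
	struct dirent *de;
	DIR *dir = opendir(base);

	if (!dir) {
		perror(base);
		return 1;
	}
	while ((de = readdir(dir)) != NULL) {
		FILE *f;

		/* Only look at the domain* subdirectories. */
		if (strncmp(de->d_name, "domain", 6))
			continue;
		snprintf(path, sizeof(path), "%s/%s/name", base, de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(name, sizeof(name), f))
			printf("%s: %s", de->d_name, name); /* name ends with '\n' */
		fclose(f);
	}
	closedir(dir);
	return 0;
}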

EVALUATION
==========
I've tested this with UnixBench, and by looking at Xen build time, on
hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only, for
now, but I plan to re-run them in DomUs soon (Juergen may be doing
something similar to this in DomU already, AFAUI).

I've run the benchmarks with and without the patch applied ('patched'
and 'vanilla', respectively, in the tables below), and with different
numbers of build jobs (in the case of the Xen build) or of parallel
copies of the benchmark (in the case of UnixBench).

What I get from the numbers is that the patch almost always brings
benefits, in some cases even huge ones. There are a couple of cases
where we regress, but always only slightly so, especially when compared
to the magnitude of some of the improvements that we get.

Bear also in mind that these results are gathered from Dom0, and without
any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
we move things to DomU and do overcommit at the Xen scheduler level, I
expect even better results.

RESULTS
=======
To have a quick idea of how a benchmark went, look at the '%
improvement' row of each table.

I'll put these results online, in a googledoc spreadsheet or something
like that, to make them easier to read, as soon as possible.

*** Intel(R) Xeon(R) E5620 @ 2.40GHz
*** pCPUs 16 DOM0 vCPUS 16
*** RAM 12285 MB DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs -j1 -j6 -j8 -j16** -j24
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
---------------------------------------------------------------------------------------------------------------------------------------
153.72 152.41 35.33 34.93 30.7 30.33 26.79 25.97 26.88 26.21
153.81 152.76 35.37 34.99 30.81 30.36 26.83 26.08 27 26.24
153.93 152.79 35.37 35.25 30.92 30.39 26.83 26.13 27.01 26.28
153.94 152.94 35.39 35.28 31.05 30.43 26.9 26.14 27.01 26.44
153.98 153.06 35.45 35.31 31.17 30.5 26.95 26.18 27.02 26.55
154.01 153.23 35.5 35.35 31.2 30.59 26.98 26.2 27.05 26.61
154.04 153.34 35.56 35.42 31.45 30.76 27.12 26.21 27.06 26.78
154.16 153.5 37.79 35.58 31.68 30.83 27.16 26.23 27.16 26.78
154.18 153.71 37.98 35.61 33.73 30.9 27.49 26.32 27.16 26.8
154.9 154.67 38.03 37.64 34.69 31.69 29.82 26.38 27.2 28.63
---------------------------------------------------------------------------------------------------------------------------------------
Avg. 154.067 153.241 36.177 35.536 31.74 30.678 27.287 26.184 27.055 26.732
---------------------------------------------------------------------------------------------------------------------------------------
Std. Dev. 0.325 0.631 1.215 0.771 1.352 0.410 0.914 0.116 0.095 0.704
---------------------------------------------------------------------------------------------------------------------------------------
% improvement 0.536 1.772 3.346 4.042 1.194
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies 1 parallel 6 parallel 8 parallel 16 parallel** 24 parallel
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables 2302.2 2302.1 13157.8 12262.4 15691.5 15860.1 18927.7 19078.5 18654.3 18855.6
Double-Precision Whetstone 620.2 620.2 3481.2 3566.9 4669.2 4551.5 7610.1 7614.3 11558.9 11561.3
Execl Throughput 184.3 186.7 884.6 905.3 1168.4 1213.6 2134.6 2210.2 2250.9 2265
File Copy 1024 bufsize 2000 maxblocks 780.8 783.3 1243.7 1255.5 1250.6 1215.7 1080.9 1094.2 1069.8 1062.5
File Copy 256 bufsize 500 maxblocks 479.8 482.8 781.8 803.6 806.4 781 682.9 707.7 698.2 694.6
File Copy 4096 bufsize 8000 maxblocks 1617.6 1593.5 2739.7 2943.4 2818.3 2957.8 2389.6 2412.6 2371.6 2423.8
Pipe Throughput 363.9 361.6 2068.6 2065.6 2622 2633.5 4053.3 4085.9 4064.7 4076.7
Pipe-based Context Switching 70.6 207.2 369.1 1126.8 623.9 1431.3 1970.4 2082.9 1963.8 2077
Process Creation 103.1 135 503 677.6 618.7 855.4 1138 1113.7 1195.6 1199
Shell Scripts (1 concurrent) 723.2 765.3 4406.4 4334.4 5045.4 5002.5 5861.9 5844.2 5958.8 5916.1
Shell Scripts (8 concurrent) 2243.7 2715.3 5694.7 5663.6 5694.7 5657.8 5637.1 5600.5 5582.9 5543.6
System Call Overhead 330 330.1 1669.2 1672.4 2028.6 1996.6 2920.5 2947.1 2923.9 2952.5
System Benchmarks Index Score 496.8 567.5 1861.9 2106 2220.3 2441.3 2972.5 3007.9 3103.4 3125.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score) 14.231 13.110 9.954 1.191 0.706
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs 24 DOM0 vCPUS 16
*** RAM 36851 MB DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs -j1 -j8 -j12 -j24** -j32
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
---------------------------------------------------------------------------------------------------------------------------------------
119.49 119.47 23.37 23.29 20.12 19.85 17.99 17.9 17.82 17.8
119.59 119.64 23.52 23.31 20.16 19.99 18.19 18.05 18.23 17.89
119.59 119.65 23.53 23.35 20.19 20.08 18.26 18.09 18.35 17.91
119.72 119.75 23.63 23.41 20.2 20.14 18.54 18.1 18.4 17.95
119.95 119.86 23.68 23.42 20.24 20.19 18.57 18.15 18.44 18.03
119.97 119.9 23.72 23.51 20.38 20.31 18.61 18.21 18.49 18.03
119.97 119.91 25.03 23.53 20.38 20.42 18.75 18.28 18.51 18.08
120.01 119.98 25.05 23.93 20.39 21.69 19.99 18.49 18.52 18.6
120.24 119.99 25.12 24.19 21.67 21.76 20.08 19.74 19.73 19.62
120.66 121.22 25.16 25.36 21.94 21.85 20.26 20.3 19.92 19.81
---------------------------------------------------------------------------------------------------------------------------------------
Avg. 119.919 119.937 24.181 23.73 20.567 20.628 18.924 18.531 18.641 18.372
---------------------------------------------------------------------------------------------------------------------------------------
Std. Dev. 0.351 0.481 0.789 0.642 0.663 0.802 0.851 0.811 0.658 0.741
---------------------------------------------------------------------------------------------------------------------------------------
% improvement -0.015 1.865 -0.297 2.077 1.443
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies 1 parallel 8 parallel 12 parallel 24 parallel** 32 parallel
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables 2650.1 2664.6 18967.8 19060.4 27534.1 27046.8 30077.9 30110.6 30542.1 30358.7
Double-Precision Whetstone 713.7 713.5 5463.6 5455.1 7863.9 7923.8 12725.1 12727.8 17474.3 17463.3
Execl Throughput 280.9 283.8 1724.4 1866.5 2029.5 2367.6 2370 2521.3 2453 2506.8
File Copy 1024 bufsize 2000 maxblocks 891.1 894.2 1423 1457.7 1385.6 1482.2 1226.1 1224.2 1235.9 1265.5
File Copy 256 bufsize 500 maxblocks 546.9 555.4 949 972.1 882.8 878.6 821.9 817.7 784.7 810.8
File Copy 4096 bufsize 8000 maxblocks 1743.4 1722.8 3406.5 3438.9 3314.3 3265.9 2801.9 2788.3 2695.2 2781.5
Pipe Throughput 426.8 423.4 3207.9 3234 4635.1 4708.9 7326 7335.3 7327.2 7319.7
Pipe-based Context Switching 110.2 223.5 680.8 1602.2 998.6 2324.6 3122.1 3252.7 3128.6 3337.2
Process Creation 130.7 224.4 1001.3 1043.6 1209 1248.2 1337.9 1380.4 1338.6 1280.1
Shell Scripts (1 concurrent) 1140.5 1257.5 5462.8 6146.4 6435.3 7206.1 7425.2 7636.2 7566.1 7636.6
Shell Scripts (8 concurrent) 3492 3586.7 7144.9 7307 7258 7320.2 7295.1 7296.7 7248.6 7252.2
System Call Overhead 387.7 387.5 2398.4 2367 2793.8 2752.7 3735.7 3694.2 3752.1 3709.4
System Benchmarks Index Score 634.8 712.6 2725.8 3005.7 3232.4 3569.7 3981.3 4028.8 4085.2 4126.3
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score) 12.256 10.269 10.435 1.193 1.006
====================================================================================================================================================

*** Intel(R) Xeon(R) X5650 @ 2.67GHz
*** pCPUs 48 DOM0 vCPUS 16
*** RAM 393138 MB DOM0 Memory 9955 MB
*** NUMA nodes 2
=======================================================================================================================================
MAKE XEN (lower == better)
=======================================================================================================================================
# of build jobs -j1 -j20 -j24 -j48** -j62
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
---------------------------------------------------------------------------------------------------------------------------------------
267.78 233.25 36.53 35.53 35.98 34.99 33.46 32.13 33.57 32.54
268.42 233.92 36.82 35.56 36.12 35.2 34.24 32.24 33.64 32.56
268.85 234.39 36.92 35.75 36.15 35.35 34.48 32.86 33.67 32.74
268.98 235.11 36.96 36.01 36.25 35.46 34.73 32.89 33.97 32.83
269.03 236.48 37.04 36.16 36.45 35.63 34.77 32.97 34.12 33.01
269.54 237.05 40.33 36.59 36.57 36.15 34.97 33.09 34.18 33.52
269.99 238.24 40.45 36.78 36.58 36.22 34.99 33.69 34.28 33.63
270.11 238.48 41.13 39.98 40.22 36.24 38 33.92 34.35 33.87
270.96 239.07 41.66 40.81 40.59 36.35 38.99 34.19 34.49 37.24
271.84 240.89 42.07 41.24 40.63 40.06 39.07 36.04 34.69 37.59
---------------------------------------------------------------------------------------------------------------------------------------
Avg. 269.55 236.688 38.991 37.441 37.554 36.165 35.77 33.402 34.096 33.953
---------------------------------------------------------------------------------------------------------------------------------------
Std. Dev. 1.213 2.503 2.312 2.288 2.031 1.452 2.079 1.142 0.379 1.882
---------------------------------------------------------------------------------------------------------------------------------------
% improvement 12.191 3.975 3.699 6.620 0.419
========================================================================================================================================
====================================================================================================================================================
UNIXBENCH
====================================================================================================================================================
# parallel copies 1 parallel 20 parallel 24 parallel 48 parallel** 62 parallel
vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
----------------------------------------------------------------------------------------------------------------------------------------------------
Dhrystone 2 using register variables 2037.6 2037.5 39615.4 38990.5 43976.8 44660.8 51238 51117.4 51672.5 52332.5
Double-Precision Whetstone 525.1 521.6 10389.7 10429.3 12236.5 12188.8 20897.1 20921.9 26957.5 27035.7
Execl Throughput 112.1 113.6 799 786.5 715.1 702.3 758.2 744 756.3 765.6
File Copy 1024 bufsize 2000 maxblocks 605.5 622 671.6 630.4 624.3 605.8 599 581.2 447.4 433.7
File Copy 256 bufsize 500 maxblocks 384 382.7 447.2 429.1 464.5 404.3 416.1 428.5 313.8 305.6
File Copy 4096 bufsize 8000 maxblocks 883.7 1100.5 1326 1307 1343.2 1305.9 1260.4 1245.3 1001.4 920.1
Pipe Throughput 283.7 282.8 5636.6 5634.2 6551 6571 10390 10437.4 10459 10498.9
Pipe-based Context Switching 41.5 143.7 518.5 1899.1 737.5 2068.8 2877.1 3093.2 2949.3 3184.1
Process Creation 58.5 78.4 370.7 389.4 338 355.8 380.1 375.5 383.8 369.6
Shell Scripts (1 concurrent) 443.7 475.5 1901.9 1945 1765.1 1789.6 2417 2354.4 2395.3 2362.2
Shell Scripts (8 concurrent) 1283.1 1319.1 2265.4 2209.8 2263.3 2209 2202.7 2216.1 2190.4 2206.5
System Call Overhead 254.1 254.3 891.6 881.6 971.1 958.3 1446.8 1409.5 1461.7 1429.2
System Benchmarks Index Score 340.8 398.6 1690.6 1866.3 1770.6 1902 2303.5 2300.8 2208.3 2189.8
----------------------------------------------------------------------------------------------------------------------------------------------------
% increase (of the Index Score) 16.960 10.393 7.421 -0.117 -0.838
====================================================================================================================================================

OVERHEAD EVALUATION
===================

For the Xen build case only, I quickly checked some scheduling-related
metrics with `perf stat'. I only did this on the biggest box for now, as
that is where we see the largest improvement (in the "-j1" case) and a
couple of slight regressions (although those happen in UnixBench).

We see that using only one, "flat", scheduling domain always means
fewer migrations, while it seems to increase the number of context
switches.

===============================================================================================================================================================
"-j1" "-j24" "-j48" "-j62"
---------------------------------------------------------------------------------------------------------------------------------------------------------------
cpu-migrations context-switches cpu-migrations context-switches cpu-migrations context-switches cpu-migrations context-switches
---------------------------------------------------------------------------------------------------------------------------------------------------------------
vanilla 21,242(0.074 K/s) 46,196(0.160 K/s) 22,992(0.066 K/s) 48,684(0.140 K/s) 24,516(0.064 K/s) 63,391(0.166 K/s) 23,164(0.062 K/s) 68,239(0.182 K/s)
patched 19,522(0.077 K/s) 50,871(0.201 K/s) 20,593(0.059 K/s) 57,688(0.167 K/s) 21,137(0.056 K/s) 63,822(0.169 K/s) 20,830(0.055 K/s) 69,783(0.185 K/s)
===============================================================================================================================================================

REQUEST FOR COMMENTS
====================
Basically, the kind of feedback I'd be really glad to hear is:
- what you guys think of the approach,
- whether you think, looking at this preliminary set of numbers, that
this is something worth continuing to investigate,
- if yes, what other workloads and benchmarks it would make sense to
throw at it.

Thanks and Regards,
Dario

--
<<This happens because I choose it to happen!>> (Raistlin Majere)
-----------------------------------------------------------------
Dario Faggioli, Ph.D, http://about.me/dario.faggioli
Senior Software Engineer, Citrix Systems R&D Ltd., Cambridge (UK)
---
commit 3240f68a08511c3db616cfc2a653e6761e23ff7f
Author: Dario Faggioli <[email protected]>
Date: Tue Aug 18 08:41:38 2015 -0700

xen: if on Xen, "flatten" the scheduling domain hierarchy

With this patch applied, only one scheduling domain is
created (called the 'VCPU' domain) spanning all the
guest's vCPUs.

This is because, since vCPUs are moving around on pCPUs,
there is no point in building a full hierarchy based on
*any* topology information, which will just never be
accurate. Having only one "flat" domain is really the
only thing that looks sensible.

Signed-off-by: Dario Faggioli <[email protected]>

diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
index 8648438..34f39f1 100644
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -55,6 +55,21 @@ static irqreturn_t xen_call_function_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_call_function_single_interrupt(int irq, void *dev_id);
 static irqreturn_t xen_irq_work_interrupt(int irq, void *dev_id);
 
+const struct cpumask *xen_pcpu_sched_domain_mask(int cpu)
+{
+	return cpu_online_mask;
+}
+
+static struct sched_domain_topology_level xen_sched_domain_topology[] = {
+	{ xen_pcpu_sched_domain_mask, SD_INIT_NAME(VCPU) },
+	{ NULL, },
+};
+
+static void xen_set_sched_topology(void)
+{
+	set_sched_topology(xen_sched_domain_topology);
+}
+
 /*
  * Reschedule call back.
  */
@@ -335,6 +350,8 @@ static void __init xen_smp_prepare_cpus(unsigned int max_cpus)
 	}
 	set_cpu_sibling_map(0);
 
+	xen_set_sched_topology();
+
 	if (xen_smp_intr_init(0))
 		BUG();



Attachments:
topology.patch (1.60 kB)
signature.asc (181.00 B)

2015-08-18 16:53:36

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On August 18, 2015 8:55:32 AM PDT, Dario Faggioli <[email protected]> wrote:
>Hey everyone,
>
>So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
>I started looking in more details at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while this thing I'm proposing here is completely
>independent from them both.
>
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter
>whether CPUID is reporting accurate, random, meaningful or completely
>misleading information, I think that we should do something about how
>scheduling domains are build.
>
>Fact is, unless we use 1:1, and immutable (across all the guest
>lifetime) pinning, scheduling domains should not be constructed, in
>Linux, by looking at *any* topology information, because that just does
>not make any sense, when vcpus move around.
>
>Let me state this again (hoping to make myself as clear as possible):
>no
>matter in how much good shape we put CPUID support, no matter how
>beautifully and consistently that will interact with both vNUMA,
>licensing requirements and whatever else. It will be always possible
>for
>vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>on two different NUMA nodes at time t2. Hence, the Linux scheduler
>should really not skew his load balancing logic toward any of those two
>situations, as neither of them could be considered correct (since
>nothing is!).

What about Windows guests?

>
>For now, this only covers the PV case. HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen in
>there as well.
>
>OVERALL DESCRIPTION
>===================
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>pCPUs of the guest.

Wow. That is a pretty simple patch!!

>
>Note that the patch deals directly with scheduling domains, and there
>is
>no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>is
>the main difference between it and the patch proposed by Juergen here:
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
>This means that when, in future, we will fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have
>any
>unexpected side effects on scheduling domains.
>
>Information about how the scheduling domains are being constructed
>during boot are available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look
>at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder
>in /proc/sys/kernel/sched_domain/ only have one subdirectory
>('domain0'), with all the tweaks and the tunables for our scheduling
>domain.
>
...
>
>REQUEST FOR COMMENTS
>====================
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys thing of the approach,
> - whether you think, looking at this preliminary set of numbers, that
> this is something worth continuing investigating,
> - if yes, what other workloads and benchmark it would make sense to
> throw at it.
>

The thing that I was worried about is that we would be modifying the generic code, but your changes are all in Xen code!

Woot!

In terms of workloads, I am CCing Herbert, who I hope can provide advice on this.

Herbert, the full email is here:
http://lists.xen.org/archives/html/xen-devel/2015-08/msg01691.html


>Thanks and Regards,
>Dario

2015-08-20 18:19:09

by Jürgen Groß

[permalink] [raw]
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On 08/18/2015 05:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
>
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter in how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
...
> REQUEST FOR COMMENTS
> ====================
> Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys thing of the approach,

Yesterday at the end of the developer meeting we (Andrew, Elena and
myself) discussed this topic again.

Regarding a possible future scenario with credit2 eventually supporting
gang scheduling on hyperthreads (which is desirable for security
reasons [side channel attacks] and for fairness), my patch seems better
suited for that direction than yours. Correct me if I'm wrong, but I
think scheduling domains won't enable the guest kernel's scheduler to
migrate threads more easily between hyperthreads as opposed to other
vcpus, while my approach can easily be extended to do so.

> - whether you think, looking at this preliminary set of numbers, that
> this is something worth continuing investigating,

I believe that, as both approaches lead to the same topology information
being used by the scheduler (all vcpus are regarded as equal), your
numbers should apply to my patch as well. Would you mind verifying this?

I still believe making the guest scheduler's decisions independent of
cpuid values is the way to go, as this will enable us to support more
scenarios (e.g. cpuid based licensing). For HVM guests and old PV guests,
mangling the cpuid should still be done, though.

> - if yes, what other workloads and benchmark it would make sense to
> throw at it.

As you already mentioned, an overcommitted host should be looked at as
well.


Thanks for doing the measurements,


Juergen

2015-08-27 10:24:36

by George Dunlap

[permalink] [raw]
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On 08/18/2015 04:55 PM, Dario Faggioli wrote:
> Hey everyone,
>
> So, as a followup of what we were discussing in this thread:
>
> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
> I started looking in more details at scheduling domains in the Linux
> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
> of interacting, while this thing I'm proposing here is completely
> independent from them both.
>
> In fact, no matter whether vNUMA is supported and enabled, and no matter
> whether CPUID is reporting accurate, random, meaningful or completely
> misleading information, I think that we should do something about how
> scheduling domains are build.
>
> Fact is, unless we use 1:1, and immutable (across all the guest
> lifetime) pinning, scheduling domains should not be constructed, in
> Linux, by looking at *any* topology information, because that just does
> not make any sense, when vcpus move around.
>
> Let me state this again (hoping to make myself as clear as possible): no
> matter in how much good shape we put CPUID support, no matter how
> beautifully and consistently that will interact with both vNUMA,
> licensing requirements and whatever else. It will be always possible for
> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
> on two different NUMA nodes at time t2. Hence, the Linux scheduler
> should really not skew his load balancing logic toward any of those two
> situations, as neither of them could be considered correct (since
> nothing is!).
>
> For now, this only covers the PV case. HVM case shouldn't be any
> different, but I haven't looked at how to make the same thing happen in
> there as well.
>
> OVERALL DESCRIPTION
> ===================
> What this RFC patch does is, in the Xen PV case, configure scheduling
> domains in such a way that there is only one of them, spanning all the
> pCPUs of the guest.
>
> Note that the patch deals directly with scheduling domains, and there is
> no need to alter the masks that will then be used for building and
> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
> the main difference between it and the patch proposed by Juergen here:
> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
> This means that when, in future, we will fix CPUID handling and make it
> comply with whatever logic or requirements we want, that won't have any
> unexpected side effects on scheduling domains.
>
> Information about how the scheduling domains are being constructed
> during boot are available in `dmesg', if the kernel is booted with the
> 'sched_debug' parameter. It is also possible to look
> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
> With the patch applied, only one scheduling domain is created, called
> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
> tell that from the fact that every cpu* folder
> in /proc/sys/kernel/sched_domain/ only have one subdirectory
> ('domain0'), with all the tweaks and the tunables for our scheduling
> domain.
>
> EVALUATION
> ==========
> I've tested this with UnixBench, and by looking at Xen build time, on a
> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
> now, but I plan to re-run them in DomUs soon (Juergen may be doing
> something similar to this in DomU already, AFAUI).
>
> I've run the benchmarks with and without the patch applied ('patched'
> and 'vanilla', respectively, in the tables below), and with different
> number of build jobs (in case of the Xen build) or of parallel copy of
> the benchmarks (in the case of UnixBench).
>
> What I get from the numbers is that the patch almost always brings
> benefits, in some cases even huge ones. There are a couple of cases
> where we regress, but always only slightly so, especially if comparing
> that to the magnitude of some of the improvement that we get.
>
> Bear also in mind that these results are gathered from Dom0, and without
> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
> we move things in DomU and do overcommit at the Xen scheduler level, I
> am expecting even better results.
>
> RESULTS
> =======
> To have a quick idea of how a benchmark went, look at the '%
> improvement' row of each table.
>
> I'll put these results online, in a googledoc spreadsheet or something
> like that, to make them easier to read, as soon as possible.
>
> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
> *** pCPUs 16 DOM0 vCPUS 16
> *** RAM 12285 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs -j1 -j6 -j8 -j16** -j24
> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
> ---------------------------------------------------------------------------------------------------------------------------------------
> 153.72 152.41 35.33 34.93 30.7 30.33 26.79 25.97 26.88 26.21
> 153.81 152.76 35.37 34.99 30.81 30.36 26.83 26.08 27 26.24
> 153.93 152.79 35.37 35.25 30.92 30.39 26.83 26.13 27.01 26.28
> 153.94 152.94 35.39 35.28 31.05 30.43 26.9 26.14 27.01 26.44
> 153.98 153.06 35.45 35.31 31.17 30.5 26.95 26.18 27.02 26.55
> 154.01 153.23 35.5 35.35 31.2 30.59 26.98 26.2 27.05 26.61
> 154.04 153.34 35.56 35.42 31.45 30.76 27.12 26.21 27.06 26.78
> 154.16 153.5 37.79 35.58 31.68 30.83 27.16 26.23 27.16 26.78
> 154.18 153.71 37.98 35.61 33.73 30.9 27.49 26.32 27.16 26.8
> 154.9 154.67 38.03 37.64 34.69 31.69 29.82 26.38 27.2 28.63
> ---------------------------------------------------------------------------------------------------------------------------------------
> Avg. 154.067 153.241 36.177 35.536 31.74 30.678 27.287 26.184 27.055 26.732
> ---------------------------------------------------------------------------------------------------------------------------------------
> Std. Dev. 0.325 0.631 1.215 0.771 1.352 0.410 0.914 0.116 0.095 0.704
> ---------------------------------------------------------------------------------------------------------------------------------------
> % improvement 0.536 1.772 3.346 4.042 1.194
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies 1 parallel 6 parrallel 8 parallel 16 parallel** 24 parallel
> vanilla/patched vanilla patched vanilla pached vanilla patched vanilla patched vanilla patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables 2302.2 2302.1 13157.8 12262.4 15691.5 15860.1 18927.7 19078.5 18654.3 18855.6
> Double-Precision Whetstone 620.2 620.2 3481.2 3566.9 4669.2 4551.5 7610.1 7614.3 11558.9 11561.3
> Execl Throughput 184.3 186.7 884.6 905.3 1168.4 1213.6 2134.6 2210.2 2250.9 2265
> File Copy 1024 bufsize 2000 maxblocks 780.8 783.3 1243.7 1255.5 1250.6 1215.7 1080.9 1094.2 1069.8 1062.5
> File Copy 256 bufsize 500 maxblocks 479.8 482.8 781.8 803.6 806.4 781 682.9 707.7 698.2 694.6
> File Copy 4096 bufsize 8000 maxblocks 1617.6 1593.5 2739.7 2943.4 2818.3 2957.8 2389.6 2412.6 2371.6 2423.8
> Pipe Throughput 363.9 361.6 2068.6 2065.6 2622 2633.5 4053.3 4085.9 4064.7 4076.7
> Pipe-based Context Switching 70.6 207.2 369.1 1126.8 623.9 1431.3 1970.4 2082.9 1963.8 2077
> Process Creation 103.1 135 503 677.6 618.7 855.4 1138 1113.7 1195.6 1199
> Shell Scripts (1 concurrent) 723.2 765.3 4406.4 4334.4 5045.4 5002.5 5861.9 5844.2 5958.8 5916.1
> Shell Scripts (8 concurrent) 2243.7 2715.3 5694.7 5663.6 5694.7 5657.8 5637.1 5600.5 5582.9 5543.6
> System Call Overhead 330 330.1 1669.2 1672.4 2028.6 1996.6 2920.5 2947.1 2923.9 2952.5
> System Benchmarks Index Score 496.8 567.5 1861.9 2106 2220.3 2441.3 2972.5 3007.9 3103.4 3125.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score) 14.231 13.110 9.954 1.191 0.706
> ====================================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs 24 DOM0 vCPUS 16
> *** RAM 36851 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs -j1 -j8 -j12 -j24** -j32
> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
> ---------------------------------------------------------------------------------------------------------------------------------------
> 119.49 119.47 23.37 23.29 20.12 19.85 17.99 17.9 17.82 17.8
> 119.59 119.64 23.52 23.31 20.16 19.99 18.19 18.05 18.23 17.89
> 119.59 119.65 23.53 23.35 20.19 20.08 18.26 18.09 18.35 17.91
> 119.72 119.75 23.63 23.41 20.2 20.14 18.54 18.1 18.4 17.95
> 119.95 119.86 23.68 23.42 20.24 20.19 18.57 18.15 18.44 18.03
> 119.97 119.9 23.72 23.51 20.38 20.31 18.61 18.21 18.49 18.03
> 119.97 119.91 25.03 23.53 20.38 20.42 18.75 18.28 18.51 18.08
> 120.01 119.98 25.05 23.93 20.39 21.69 19.99 18.49 18.52 18.6
> 120.24 119.99 25.12 24.19 21.67 21.76 20.08 19.74 19.73 19.62
> 120.66 121.22 25.16 25.36 21.94 21.85 20.26 20.3 19.92 19.81
> ---------------------------------------------------------------------------------------------------------------------------------------
> Avg. 119.919 119.937 24.181 23.73 20.567 20.628 18.924 18.531 18.641 18.372
> ---------------------------------------------------------------------------------------------------------------------------------------
> Std. Dev. 0.351 0.481 0.789 0.642 0.663 0.802 0.851 0.811 0.658 0.741
> ---------------------------------------------------------------------------------------------------------------------------------------
> % improvement -0.015 1.865 -0.297 2.077 1.443
> ========================================================================================================================================
> ====================================================================================================================================================
> UNIXBENCH
> ====================================================================================================================================================
> # parallel copies 1 parallel 8 parrallel 12 parallel 24 parallel** 32 parallel
> vanilla/patched vanilla patched vanilla pached vanilla patched vanilla patched vanilla patched
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> Dhrystone 2 using register variables 2650.1 2664.6 18967.8 19060.4 27534.1 27046.8 30077.9 30110.6 30542.1 30358.7
> Double-Precision Whetstone 713.7 713.5 5463.6 5455.1 7863.9 7923.8 12725.1 12727.8 17474.3 17463.3
> Execl Throughput 280.9 283.8 1724.4 1866.5 2029.5 2367.6 2370 2521.3 2453 2506.8
> File Copy 1024 bufsize 2000 maxblocks 891.1 894.2 1423 1457.7 1385.6 1482.2 1226.1 1224.2 1235.9 1265.5
> File Copy 256 bufsize 500 maxblocks 546.9 555.4 949 972.1 882.8 878.6 821.9 817.7 784.7 810.8
> File Copy 4096 bufsize 8000 maxblocks 1743.4 1722.8 3406.5 3438.9 3314.3 3265.9 2801.9 2788.3 2695.2 2781.5
> Pipe Throughput 426.8 423.4 3207.9 3234 4635.1 4708.9 7326 7335.3 7327.2 7319.7
> Pipe-based Context Switching 110.2 223.5 680.8 1602.2 998.6 2324.6 3122.1 3252.7 3128.6 3337.2
> Process Creation 130.7 224.4 1001.3 1043.6 1209 1248.2 1337.9 1380.4 1338.6 1280.1
> Shell Scripts (1 concurrent) 1140.5 1257.5 5462.8 6146.4 6435.3 7206.1 7425.2 7636.2 7566.1 7636.6
> Shell Scripts (8 concurrent) 3492 3586.7 7144.9 7307 7258 7320.2 7295.1 7296.7 7248.6 7252.2
> System Call Overhead 387.7 387.5 2398.4 2367 2793.8 2752.7 3735.7 3694.2 3752.1 3709.4
> System Benchmarks Index Score 634.8 712.6 2725.8 3005.7 3232.4 3569.7 3981.3 4028.8 4085.2 4126.3
> ----------------------------------------------------------------------------------------------------------------------------------------------------
> % increase (of the Index Score) 12.256 10.269 10.435 1.193 1.006
> ====================================================================================================================================================
>
> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
> *** pCPUs 48 DOM0 vCPUS 16
> *** RAM 393138 MB DOM0 Memory 9955 MB
> *** NUMA nodes 2
> =======================================================================================================================================
> MAKE XEN (lower == better)
> =======================================================================================================================================
> # of build jobs -j1 -j20 -j24 -j48** -j62
> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
> ---------------------------------------------------------------------------------------------------------------------------------------
> 267.78 233.25 36.53 35.53 35.98 34.99 33.46 32.13 33.57 32.54
> 268.42 233.92 36.82 35.56 36.12 35.2 34.24 32.24 33.64 32.56
> 268.85 234.39 36.92 35.75 36.15 35.35 34.48 32.86 33.67 32.74
> 268.98 235.11 36.96 36.01 36.25 35.46 34.73 32.89 33.97 32.83
> 269.03 236.48 37.04 36.16 36.45 35.63 34.77 32.97 34.12 33.01
> 269.54 237.05 40.33 36.59 36.57 36.15 34.97 33.09 34.18 33.52
> 269.99 238.24 40.45 36.78 36.58 36.22 34.99 33.69 34.28 33.63
> 270.11 238.48 41.13 39.98 40.22 36.24 38 33.92 34.35 33.87
> 270.96 239.07 41.66 40.81 40.59 36.35 38.99 34.19 34.49 37.24
> 271.84 240.89 42.07 41.24 40.63 40.06 39.07 36.04 34.69 37.59
> ---------------------------------------------------------------------------------------------------------------------------------------
> Avg. 269.55 236.688 38.991 37.441 37.554 36.165 35.77 33.402 34.096 33.953
> ---------------------------------------------------------------------------------------------------------------------------------------
> Std. Dev. 1.213 2.503 2.312 2.288 2.031 1.452 2.079 1.142 0.379 1.882
> ---------------------------------------------------------------------------------------------------------------------------------------
> % improvement 12.191 3.975 3.699 6.620 0.419
> ========================================================================================================================================

I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
tests, you change the -j number (apparently) based on the number of
pcpus available to Xen. Wouldn't it make more sense to stick with
1/6/8/16/24? That would allow us to have actually comparable numbers.

But in any case, it seems to me that the numbers do show a uniform
improvement and no regressions -- I think this approach looks really
good, particularly as it is so small and well-contained.

-George

2015-08-27 17:05:41

by George Dunlap

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy

On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap
<[email protected]> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> Hey everyone,
>>
>> So, as a followup of what we were discussing in this thread:
>>
>> [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>
>> I started looking in more details at scheduling domains in the Linux
>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>> of interacting, while this thing I'm proposing here is completely
>> independent from them both.
>>
>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>> whether CPUID is reporting accurate, random, meaningful or completely
>> misleading information, I think that we should do something about how
>> scheduling domains are build.
>>
>> Fact is, unless we use 1:1, and immutable (across all the guest
>> lifetime) pinning, scheduling domains should not be constructed, in
>> Linux, by looking at *any* topology information, because that just does
>> not make any sense, when vcpus move around.
>>
>> Let me state this again (hoping to make myself as clear as possible): no
>> matter in how much good shape we put CPUID support, no matter how
>> beautifully and consistently that will interact with both vNUMA,
>> licensing requirements and whatever else. It will be always possible for
>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>> should really not skew his load balancing logic toward any of those two
>> situations, as neither of them could be considered correct (since
>> nothing is!).
>>
>> For now, this only covers the PV case. HVM case shouldn't be any
>> different, but I haven't looked at how to make the same thing happen in
>> there as well.
>>
>> OVERALL DESCRIPTION
>> ===================
>> What this RFC patch does is, in the Xen PV case, configure scheduling
>> domains in such a way that there is only one of them, spanning all the
>> pCPUs of the guest.
>>
>> Note that the patch deals directly with scheduling domains, and there is
>> no need to alter the masks that will then be used for building and
>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>> the main difference between it and the patch proposed by Juergen here:
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>
>> This means that when, in future, we will fix CPUID handling and make it
>> comply with whatever logic or requirements we want, that won't have any
>> unexpected side effects on scheduling domains.
>>
>> Information about how the scheduling domains are being constructed
>> during boot are available in `dmesg', if the kernel is booted with the
>> 'sched_debug' parameter. It is also possible to look
>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>
>> With the patch applied, only one scheduling domain is created, called
>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>> tell that from the fact that every cpu* folder
>> in /proc/sys/kernel/sched_domain/ only have one subdirectory
>> ('domain0'), with all the tweaks and the tunables for our scheduling
>> domain.
>>
>> EVALUATION
>> ==========
>> I've tested this with UnixBench, and by looking at Xen build time, on a
>> 16, 24 and 48 pCPUs hosts. I've run the benchmarks in Dom0 only, for
>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>> something similar to this in DomU already, AFAUI).
>>
>> I've run the benchmarks with and without the patch applied ('patched'
>> and 'vanilla', respectively, in the tables below), and with different
>> number of build jobs (in case of the Xen build) or of parallel copy of
>> the benchmarks (in the case of UnixBench).
>>
>> What I get from the numbers is that the patch almost always brings
>> benefits, in some cases even huge ones. There are a couple of cases
>> where we regress, but always only slightly so, especially if comparing
>> that to the magnitude of some of the improvement that we get.
>>
>> Bear also in mind that these results are gathered from Dom0, and without
>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr pCPUs). If
>> we move things in DomU and do overcommit at the Xen scheduler level, I
>> am expecting even better results.
>>
>> RESULTS
>> =======
>> To have a quick idea of how a benchmark went, look at the '%
>> improvement' row of each table.
>>
>> I'll put these results online, in a googledoc spreadsheet or something
>> like that, to make them easier to read, as soon as possible.
>>
>> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
>> *** pCPUs 16 DOM0 vCPUS 16
>> *** RAM 12285 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs -j1 -j6 -j8 -j16** -j24
>> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> 153.72 152.41 35.33 34.93 30.7 30.33 26.79 25.97 26.88 26.21
>> 153.81 152.76 35.37 34.99 30.81 30.36 26.83 26.08 27 26.24
>> 153.93 152.79 35.37 35.25 30.92 30.39 26.83 26.13 27.01 26.28
>> 153.94 152.94 35.39 35.28 31.05 30.43 26.9 26.14 27.01 26.44
>> 153.98 153.06 35.45 35.31 31.17 30.5 26.95 26.18 27.02 26.55
>> 154.01 153.23 35.5 35.35 31.2 30.59 26.98 26.2 27.05 26.61
>> 154.04 153.34 35.56 35.42 31.45 30.76 27.12 26.21 27.06 26.78
>> 154.16 153.5 37.79 35.58 31.68 30.83 27.16 26.23 27.16 26.78
>> 154.18 153.71 37.98 35.61 33.73 30.9 27.49 26.32 27.16 26.8
>> 154.9 154.67 38.03 37.64 34.69 31.69 29.82 26.38 27.2 28.63
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Avg. 154.067 153.241 36.177 35.536 31.74 30.678 27.287 26.184 27.055 26.732
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Std. Dev. 0.325 0.631 1.215 0.771 1.352 0.410 0.914 0.116 0.095 0.704
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> % improvement 0.536 1.772 3.346 4.042 1.194
>> ========================================================================================================================================
>> ====================================================================================================================================================
>> UNIXBENCH
>> ====================================================================================================================================================
>> # parallel copies 1 parallel 6 parrallel 8 parallel 16 parallel** 24 parallel
>> vanilla/patched vanilla patched vanilla pached vanilla patched vanilla patched vanilla patched
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables 2302.2 2302.1 13157.8 12262.4 15691.5 15860.1 18927.7 19078.5 18654.3 18855.6
>> Double-Precision Whetstone 620.2 620.2 3481.2 3566.9 4669.2 4551.5 7610.1 7614.3 11558.9 11561.3
>> Execl Throughput 184.3 186.7 884.6 905.3 1168.4 1213.6 2134.6 2210.2 2250.9 2265
>> File Copy 1024 bufsize 2000 maxblocks 780.8 783.3 1243.7 1255.5 1250.6 1215.7 1080.9 1094.2 1069.8 1062.5
>> File Copy 256 bufsize 500 maxblocks 479.8 482.8 781.8 803.6 806.4 781 682.9 707.7 698.2 694.6
>> File Copy 4096 bufsize 8000 maxblocks 1617.6 1593.5 2739.7 2943.4 2818.3 2957.8 2389.6 2412.6 2371.6 2423.8
>> Pipe Throughput 363.9 361.6 2068.6 2065.6 2622 2633.5 4053.3 4085.9 4064.7 4076.7
>> Pipe-based Context Switching 70.6 207.2 369.1 1126.8 623.9 1431.3 1970.4 2082.9 1963.8 2077
>> Process Creation 103.1 135 503 677.6 618.7 855.4 1138 1113.7 1195.6 1199
>> Shell Scripts (1 concurrent) 723.2 765.3 4406.4 4334.4 5045.4 5002.5 5861.9 5844.2 5958.8 5916.1
>> Shell Scripts (8 concurrent) 2243.7 2715.3 5694.7 5663.6 5694.7 5657.8 5637.1 5600.5 5582.9 5543.6
>> System Call Overhead 330 330.1 1669.2 1672.4 2028.6 1996.6 2920.5 2947.1 2923.9 2952.5
>> System Benchmarks Index Score 496.8 567.5 1861.9 2106 2220.3 2441.3 2972.5 3007.9 3103.4 3125.3
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score) 14.231 13.110 9.954 1.191 0.706
>> ====================================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs 24 DOM0 vCPUS 16
>> *** RAM 36851 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs -j1 -j8 -j12 -j24** -j32
>> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> 119.49 119.47 23.37 23.29 20.12 19.85 17.99 17.9 17.82 17.8
>> 119.59 119.64 23.52 23.31 20.16 19.99 18.19 18.05 18.23 17.89
>> 119.59 119.65 23.53 23.35 20.19 20.08 18.26 18.09 18.35 17.91
>> 119.72 119.75 23.63 23.41 20.2 20.14 18.54 18.1 18.4 17.95
>> 119.95 119.86 23.68 23.42 20.24 20.19 18.57 18.15 18.44 18.03
>> 119.97 119.9 23.72 23.51 20.38 20.31 18.61 18.21 18.49 18.03
>> 119.97 119.91 25.03 23.53 20.38 20.42 18.75 18.28 18.51 18.08
>> 120.01 119.98 25.05 23.93 20.39 21.69 19.99 18.49 18.52 18.6
>> 120.24 119.99 25.12 24.19 21.67 21.76 20.08 19.74 19.73 19.62
>> 120.66 121.22 25.16 25.36 21.94 21.85 20.26 20.3 19.92 19.81
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Avg. 119.919 119.937 24.181 23.73 20.567 20.628 18.924 18.531 18.641 18.372
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Std. Dev. 0.351 0.481 0.789 0.642 0.663 0.802 0.851 0.811 0.658 0.741
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> % improvement -0.015 1.865 -0.297 2.077 1.443
>> ========================================================================================================================================
>> ====================================================================================================================================================
>> UNIXBENCH
>> ====================================================================================================================================================
>> # parallel copies 1 parallel 8 parallel 12 parallel 24 parallel** 32 parallel
>> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables 2650.1 2664.6 18967.8 19060.4 27534.1 27046.8 30077.9 30110.6 30542.1 30358.7
>> Double-Precision Whetstone 713.7 713.5 5463.6 5455.1 7863.9 7923.8 12725.1 12727.8 17474.3 17463.3
>> Execl Throughput 280.9 283.8 1724.4 1866.5 2029.5 2367.6 2370 2521.3 2453 2506.8
>> File Copy 1024 bufsize 2000 maxblocks 891.1 894.2 1423 1457.7 1385.6 1482.2 1226.1 1224.2 1235.9 1265.5
>> File Copy 256 bufsize 500 maxblocks 546.9 555.4 949 972.1 882.8 878.6 821.9 817.7 784.7 810.8
>> File Copy 4096 bufsize 8000 maxblocks 1743.4 1722.8 3406.5 3438.9 3314.3 3265.9 2801.9 2788.3 2695.2 2781.5
>> Pipe Throughput 426.8 423.4 3207.9 3234 4635.1 4708.9 7326 7335.3 7327.2 7319.7
>> Pipe-based Context Switching 110.2 223.5 680.8 1602.2 998.6 2324.6 3122.1 3252.7 3128.6 3337.2
>> Process Creation 130.7 224.4 1001.3 1043.6 1209 1248.2 1337.9 1380.4 1338.6 1280.1
>> Shell Scripts (1 concurrent) 1140.5 1257.5 5462.8 6146.4 6435.3 7206.1 7425.2 7636.2 7566.1 7636.6
>> Shell Scripts (8 concurrent) 3492 3586.7 7144.9 7307 7258 7320.2 7295.1 7296.7 7248.6 7252.2
>> System Call Overhead 387.7 387.5 2398.4 2367 2793.8 2752.7 3735.7 3694.2 3752.1 3709.4
>> System Benchmarks Index Score 634.8 712.6 2725.8 3005.7 3232.4 3569.7 3981.3 4028.8 4085.2 4126.3
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score) 12.256 10.269 10.435 1.193 1.006
>> ====================================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs 48 DOM0 vCPUS 16
>> *** RAM 393138 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs -j1 -j20 -j24 -j48** -j62
>> vanilla/patched vanilla patched vanilla patched vanilla patched vanilla patched vanilla patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> 267.78 233.25 36.53 35.53 35.98 34.99 33.46 32.13 33.57 32.54
>> 268.42 233.92 36.82 35.56 36.12 35.2 34.24 32.24 33.64 32.56
>> 268.85 234.39 36.92 35.75 36.15 35.35 34.48 32.86 33.67 32.74
>> 268.98 235.11 36.96 36.01 36.25 35.46 34.73 32.89 33.97 32.83
>> 269.03 236.48 37.04 36.16 36.45 35.63 34.77 32.97 34.12 33.01
>> 269.54 237.05 40.33 36.59 36.57 36.15 34.97 33.09 34.18 33.52
>> 269.99 238.24 40.45 36.78 36.58 36.22 34.99 33.69 34.28 33.63
>> 270.11 238.48 41.13 39.98 40.22 36.24 38 33.92 34.35 33.87
>> 270.96 239.07 41.66 40.81 40.59 36.35 38.99 34.19 34.49 37.24
>> 271.84 240.89 42.07 41.24 40.63 40.06 39.07 36.04 34.69 37.59
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Avg. 269.55 236.688 38.991 37.441 37.554 36.165 35.77 33.402 34.096 33.953
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> Std. Dev. 1.213 2.503 2.312 2.288 2.031 1.452 2.079 1.142 0.379 1.882
>> ---------------------------------------------------------------------------------------------------------------------------------------
>> % improvement 12.191 3.975 3.699 6.620 0.419
>> ========================================================================================================================================
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen. Wouldn't it make more sense to stick with
> 1/6/8/16/24? That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological
information.
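
A minimal sketch of how such an opt-out could look, assuming a boot-time
parameter that gates the flat setup (the parameter name and the use of
strtobool() are illustrative only, not something in the posted patch):

#include <linux/init.h>
#include <linux/string.h>
#include <linux/errno.h>

/*
 * Hypothetical knob: booting the guest with "xen_flat_sched_domains=0"
 * would keep the normal, topology-based scheduling domains instead of
 * the single flat 'VCPU' domain.
 */
static bool xen_flat_sched_domains = true;

static int __init parse_xen_flat_sched_domains(char *arg)
{
    if (!arg)
        return -EINVAL;
    return strtobool(arg, &xen_flat_sched_domains);
}
early_param("xen_flat_sched_domains", parse_xen_flat_sched_domains);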

-George

2015-08-31 16:13:11

by Boris Ostrovsky

[permalink] [raw]
Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy



On 08/20/2015 02:16 PM, Juergen Groß wrote:
> On 08/18/2015 05:55 PM, Dario Faggioli wrote:
>> [...]
>> REQUEST FOR COMMENTS
>> ====================
>> Basically, the kind of feedback I'd be really glad to hear is:
>> - what you guys thing of the approach,
>
> Yesterday at the end of the developer meeting we (Andrew, Elena and
> myself) discussed this topic again.
>
> Regarding a possible future scenario where credit2 eventually supports
> gang scheduling of hyperthreads (which is desirable for security
> [side-channel attacks] and fairness reasons), my patch seems better
> suited for that direction than yours. Correct me if I'm wrong, but I
> think scheduling domains won't enable the guest kernel's scheduler to
> migrate threads more easily between hyperthreads as opposed to other
> vcpus, while my approach can easily be extended to do so.
>
>> - whether you think, looking at this preliminary set of numbers, that
>> this is something worth continuing investigating,
>
> I believe that, as both approaches lead to the same topology information
> being used by the scheduler (all vcpus are regarded as equal), your
> numbers should apply to my patch as well. Would you mind verifying this?

If set_cpu_sibling_map()'s has_mp is false, wouldn't we effectively have
both of your patches?
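
(For context: that check lives in arch/x86/kernel/smpboot.c and boils down
to roughly the following; this is a paraphrase of the current code, not a
verbatim copy.)

void set_cpu_sibling_map(int cpu)
{
    bool has_smt = smp_num_siblings > 1;
    bool has_mp = has_smt || boot_cpu_data.x86_max_cores > 1;
    struct cpuinfo_x86 *c = &cpu_data(cpu);

    if (!has_mp) {
        /* No SMT/multi-core reported: each CPU becomes its own
         * thread sibling, core and LLC group. */
        cpumask_set_cpu(cpu, topology_sibling_cpumask(cpu));
        cpumask_set_cpu(cpu, cpu_llc_shared_mask(cpu));
        cpumask_set_cpu(cpu, topology_core_cpumask(cpu));
        c->booted_cores = 1;
        return;
    }

    /* ... otherwise the thread/core/LLC masks are built from topology ... */
}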

Also, it seems to me that Xen guests would not be the only ones having
to deal with topology inconsistencies due to migrating VCPUs. Don't KVM
guests, for example, have the same problem? And if so, perhaps we
should try solving it in a non-Xen-specific way (especially given that
both of those patches look pretty simple and thus are presumably easy to
integrate into common code).
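
For illustration, the common-code variant could be as small as registering
a one-level topology table with set_sched_topology(); a rough sketch (the
names here are invented, and this is not literally either of the posted
patches):

#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/init.h>

/* Put every (v)CPU into one single, flat scheduling domain. */
static const struct cpumask *vcpu_flat_mask(int cpu)
{
    return cpu_possible_mask;
}

static struct sched_domain_topology_level vcpu_flat_topology[] = {
    { vcpu_flat_mask, SD_INIT_NAME(VCPU) },
    { NULL, },
};

/* Any guest (Xen PV, KVM, ...) whose vcpus migrate could call this early. */
void __init guest_use_flat_sched_domains(void)
{
    set_sched_topology(vcpu_flat_topology);
}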

And, as George already pointed out, this should be an optional feature
--- if a guest spans physical nodes and VCPUs are pinned then we don't
always want flat topology/domains.

-boris


>
> I still believe making the guest scheduler's decisions independent from
> cpuid values is the way to go, as this will enable us to support more
> scenarios (e.g. cpuid-based licensing). For HVM guests and old PV guests,
> mangling the cpuid should still be done, though.
>
>> - if yes, what other workloads and benchmark it would make sense to
>> throw at it.
>
> As you already mentioned an overcommitted host should be looked at as
> well.
>
>
> Thanks for doing the measurements,
>
>
> Juergen