MIME-Version: 1.0
In-Reply-To: <55DEE556.3010802@citrix.com>
References: <1439913332.4239.134.camel@citrix.com>
	<55DEE556.3010802@citrix.com>
Date: Thu, 27 Aug 2015 18:05:39 +0100
Message-ID: <CAFLBxZZ1wP8KZKuJxjJubeo2FyiSSzTytTWL2xf8Z1DDtg7b3w@mail.gmail.com>
Subject: Re: [Xen-devel] [PATCH RFC] xen: if on Xen, "flatten" the scheduling
 domain hierarchy
From: George Dunlap <dunlapg@umich.edu>
To: George Dunlap <george.dunlap@citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>,
        "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>,
        Juergen Gross <jgross@suse.com>,
        Andrew Cooper <Andrew.Cooper3@citrix.com>,
        "Luis R. Rodriguez" <mcgrof@do-not-panic.com>,
        linux-kernel <linux-kernel@vger.kernel.org>,
        David Vrabel <david.vrabel@citrix.com>,
        Boris Ostrovsky <boris.ostrovsky@oracle.com>,
        Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 21765
Lines: 245

On Thu, Aug 27, 2015 at 11:24 AM, George Dunlap
<george.dunlap@citrix.com> wrote:
> On 08/18/2015 04:55 PM, Dario Faggioli wrote:
>> Hey everyone,
>>
>> So, as a follow-up to what we were discussing in this thread:
>>
>>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>>
>> I started looking in more detail at scheduling domains in the Linux
>> kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>> of interacting, while this thing I'm proposing here is completely
>> independent from them both.
>>
>> In fact, no matter whether vNUMA is supported and enabled, and no matter
>> whether CPUID is reporting accurate, random, meaningful or completely
>> misleading information, I think that we should do something about how
>> scheduling domains are built.
>>
>> Fact is, unless we use 1:1 pinning that is immutable across the whole
>> guest lifetime, scheduling domains should not be constructed, in
>> Linux, by looking at *any* topology information, because that just does
>> not make any sense when vcpus move around.
>>
>> Let me state this again (hoping to make myself as clear as possible): no
>> matter how good a shape we put CPUID support in, and no matter how
>> beautifully and consistently that interacts with vNUMA, licensing
>> requirements and whatever else, it will always be possible for
>> vCPU #0 and vCPU #3 to be scheduled on two SMT threads at time t1, and
>> on two different NUMA nodes at time t2. Hence, the Linux scheduler
>> should really not skew its load balancing logic toward either of those
>> two situations, as neither of them can be considered correct (since
>> nothing is!).
>>
>> For now, this only covers the PV case. The HVM case shouldn't be any
>> different, but I haven't looked at how to make the same thing happen
>> there as well.
>>
>> OVERALL DESCRIPTION
>> ===================
>> What this RFC patch does is, in the Xen PV case, configure scheduling
>> domains in such a way that there is only one of them, spanning all the
>> vCPUs of the guest.
>>
>> Note that the patch deals directly with scheduling domains, and there is
>> no need to alter the masks that will then be used for building and
>> reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That is
>> the main difference between it and the patch proposed by Juergen here:
>> http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>>
>> This means that when, in the future, we fix CPUID handling and make it
>> comply with whatever logic or requirements we want, that won't have any
>> unexpected side effects on scheduling domains.
>>
>> Information about how the scheduling domains are being constructed
>> during boot is available in `dmesg', if the kernel is booted with the
>> 'sched_debug' parameter. It is also possible to look
>> at /proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>>
>> With the patch applied, only one scheduling domain is created, called
>> the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>> tell that from the fact that every cpu* folder
>> in /proc/sys/kernel/sched_domain/ has only one subdirectory
>> ('domain0'), with all the tweaks and the tunables for our scheduling
>> domain.
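
(A minimal sketch of that check, not part of the patch. The helper names
below are made up for illustration, and the default path is simply the one
quoted above; it may live elsewhere or be absent depending on the kernel
version and debug options. The script counts the domainN subdirectories
under each cpuN directory and reports whether every CPU has exactly one.)

```python
from pathlib import Path


def count_sched_domains(root):
    """Map each cpuN directory under root to its number of domainM subdirectories."""
    counts = {}
    for cpu_dir in sorted(Path(root).glob("cpu[0-9]*")):
        if cpu_dir.is_dir():
            counts[cpu_dir.name] = sum(
                1 for d in cpu_dir.glob("domain[0-9]*") if d.is_dir()
            )
    return counts


def is_flat(counts):
    """True when at least one CPU was found and each has exactly one domain."""
    return bool(counts) and all(n == 1 for n in counts.values())


if __name__ == "__main__":
    # Path as given in the text above; treat it as an assumption on other kernels.
    counts = count_sched_domains("/proc/sys/kernel/sched_domain")
    print(counts)
    print("flat" if is_flat(counts) else "not flat (or no data found)")
```

With the patch applied one would expect every count to be 1; an empty
result only means the directory was not found where the script looked.
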
>>
>> EVALUATION
>> ==========
>> I've tested this with UnixBench, and by looking at Xen build time, on
>> hosts with 16, 24 and 48 pCPUs. I've run the benchmarks in Dom0 only, for
>> now, but I plan to re-run them in DomUs soon (Juergen may be doing
>> something similar to this in DomU already, AFAUI).
>>
>> I've run the benchmarks with and without the patch applied ('patched'
>> and 'vanilla', respectively, in the tables below), and with different
>> numbers of build jobs (in the case of the Xen build) or of parallel copies
>> of the benchmarks (in the case of UnixBench).
>>
>> What I get from the numbers is that the patch almost always brings
>> benefits, in some cases even huge ones. There are a couple of cases
>> where we regress, but always only slightly so, especially if comparing
>> that to the magnitude of some of the improvements that we get.
>>
>> Bear also in mind that these results are gathered from Dom0, and without
>> any overcommitment at the vCPU level (i.e., nr. vCPUs == nr. pCPUs). If
>> we move things to DomU and do overcommit at the Xen scheduler level, I
>> expect even better results.
>>
>> RESULTS
>> =======
>> To have a quick idea of how a benchmark went, look at the '%
>> improvement' row of each table.
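
(For reference, the figures in that row appear to be the relative change of
the patched average against the vanilla average, with the sign chosen so
that positive means better. The formula is inferred from the numbers in the
tables below, not stated in the mail, and pct_improvement is a name made up
here.)

```python
def pct_improvement(vanilla, patched, lower_is_better):
    """Relative change of patched against vanilla, in percent.

    The sign is arranged so that a positive result means the patched
    kernel did better, whichever direction "better" is for the metric.
    """
    if lower_is_better:
        return (vanilla - patched) / vanilla * 100.0
    return (patched - vanilla) / vanilla * 100.0


if __name__ == "__main__":
    # Averages copied from the 16-pCPU host's tables.
    # MAKE XEN at -j1 is a build time, so lower is better.
    print(round(pct_improvement(154.067, 153.241, lower_is_better=True), 3))
    # UnixBench index score with 1 parallel copy, so higher is better.
    print(round(pct_improvement(496.8, 567.5, lower_is_better=False), 3))
```

Run against those two pairs, this gives 0.536 and 14.231, matching the
table rows; a negative value marks a regression.
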
>>
>> I'll put these results online, in a googledoc spreadsheet or something
>> like that, to make them easier to read, as soon as possible.
>>
>> *** Intel(R) Xeon(R) E5620 @ 2.40GHz
>> *** pCPUs      16        DOM0 vCPUS  16
>> *** RAM        12285 MB  DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs                     -j1                   -j6                   -j8                   -j16**                -j24
>> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>                               153.72     152.41      35.33      34.93       30.7      30.33      26.79      25.97      26.88      26.21
>>                               153.81     152.76      35.37      34.99      30.81      30.36      26.83      26.08         27      26.24
>>                               153.93     152.79      35.37      35.25      30.92      30.39      26.83      26.13      27.01      26.28
>>                               153.94     152.94      35.39      35.28      31.05      30.43       26.9      26.14      27.01      26.44
>>                               153.98     153.06      35.45      35.31      31.17       30.5      26.95      26.18      27.02      26.55
>>                               154.01     153.23       35.5      35.35       31.2      30.59      26.98       26.2      27.05      26.61
>>                               154.04     153.34      35.56      35.42      31.45      30.76      27.12      26.21      27.06      26.78
>>                               154.16      153.5      37.79      35.58      31.68      30.83      27.16      26.23      27.16      26.78
>>                               154.18     153.71      37.98      35.61      33.73       30.9      27.49      26.32      27.16       26.8
>>                                154.9     154.67      38.03      37.64      34.69      31.69      29.82      26.38       27.2      28.63
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Avg.                        154.067    153.241     36.177     35.536      31.74     30.678     27.287     26.184     27.055     26.732
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Std. Dev.                     0.325      0.631      1.215      0.771      1.352      0.410      0.914      0.116      0.095      0.704
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  % improvement                            0.536                 1.772                 3.346                 4.042                 1.194
>> ========================================================================================================================================
>> ====================================================================================================================================================
>> UNIXBENCH
>> ====================================================================================================================================================
>> # parallel copies                            1 parallel            6 parallel            8 parallel            16 parallel**         24 parallel
>> vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables       2302.2     2302.1    13157.8    12262.4    15691.5    15860.1    18927.7    19078.5    18654.3    18855.6
>> Double-Precision Whetstone                  620.2      620.2     3481.2     3566.9     4669.2     4551.5     7610.1     7614.3    11558.9    11561.3
>> Execl Throughput                            184.3      186.7      884.6      905.3     1168.4     1213.6     2134.6     2210.2     2250.9       2265
>> File Copy 1024 bufsize 2000 maxblocks       780.8      783.3     1243.7     1255.5     1250.6     1215.7     1080.9     1094.2     1069.8     1062.5
>> File Copy 256 bufsize 500 maxblocks         479.8      482.8      781.8      803.6      806.4        781      682.9      707.7      698.2      694.6
>> File Copy 4096 bufsize 8000 maxblocks      1617.6     1593.5     2739.7     2943.4     2818.3     2957.8     2389.6     2412.6     2371.6     2423.8
>> Pipe Throughput                             363.9      361.6     2068.6     2065.6       2622     2633.5     4053.3     4085.9     4064.7     4076.7
>> Pipe-based Context Switching                 70.6      207.2      369.1     1126.8      623.9     1431.3     1970.4     2082.9     1963.8       2077
>> Process Creation                            103.1        135        503      677.6      618.7      855.4       1138     1113.7     1195.6       1199
>> Shell Scripts (1 concurrent)                723.2      765.3     4406.4     4334.4     5045.4     5002.5     5861.9     5844.2     5958.8     5916.1
>> Shell Scripts (8 concurrent)               2243.7     2715.3     5694.7     5663.6     5694.7     5657.8     5637.1     5600.5     5582.9     5543.6
>> System Call Overhead                          330      330.1     1669.2     1672.4     2028.6     1996.6     2920.5     2947.1     2923.9     2952.5
>> System Benchmarks Index Score               496.8      567.5     1861.9       2106     2220.3     2441.3     2972.5     3007.9     3103.4     3125.3
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)                       14.231                13.110                 9.954                 1.191                 0.706
>> ====================================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs      24        DOM0 vCPUS  16
>> *** RAM        36851 MB  DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs                     -j1                   -j8                   -j12                   -j24**               -j32
>> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>                               119.49     119.47      23.37      23.29      20.12      19.85      17.99       17.9      17.82       17.8
>>                               119.59     119.64      23.52      23.31      20.16      19.99      18.19      18.05      18.23      17.89
>>                               119.59     119.65      23.53      23.35      20.19      20.08      18.26      18.09      18.35      17.91
>>                               119.72     119.75      23.63      23.41       20.2      20.14      18.54       18.1       18.4      17.95
>>                               119.95     119.86      23.68      23.42      20.24      20.19      18.57      18.15      18.44      18.03
>>                               119.97      119.9      23.72      23.51      20.38      20.31      18.61      18.21      18.49      18.03
>>                               119.97     119.91      25.03      23.53      20.38      20.42      18.75      18.28      18.51      18.08
>>                               120.01     119.98      25.05      23.93      20.39      21.69      19.99      18.49      18.52       18.6
>>                               120.24     119.99      25.12      24.19      21.67      21.76      20.08      19.74      19.73      19.62
>>                               120.66     121.22      25.16      25.36      21.94      21.85      20.26       20.3      19.92      19.81
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Avg.                        119.919    119.937     24.181      23.73     20.567     20.628     18.924     18.531     18.641     18.372
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Std. Dev.                     0.351      0.481      0.789      0.642      0.663      0.802      0.851      0.811      0.658      0.741
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  % improvement                           -0.015                 1.865                -0.297                 2.077                 1.443
>> ========================================================================================================================================
>> ====================================================================================================================================================
>> UNIXBENCH
>> ====================================================================================================================================================
>> # parallel copies                            1 parallel            8 parallel             12 parallel           24 parallel**         32 parallel
>> vanilla/patched                          vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> Dhrystone 2 using register variables       2650.1     2664.6    18967.8    19060.4    27534.1    27046.8    30077.9    30110.6    30542.1    30358.7
>> Double-Precision Whetstone                  713.7      713.5     5463.6     5455.1     7863.9     7923.8    12725.1    12727.8    17474.3    17463.3
>> Execl Throughput                            280.9      283.8     1724.4     1866.5     2029.5     2367.6       2370     2521.3       2453     2506.8
>> File Copy 1024 bufsize 2000 maxblocks       891.1      894.2       1423     1457.7     1385.6     1482.2     1226.1     1224.2     1235.9     1265.5
>> File Copy 256 bufsize 500 maxblocks         546.9      555.4        949      972.1      882.8      878.6      821.9      817.7      784.7      810.8
>> File Copy 4096 bufsize 8000 maxblocks      1743.4     1722.8     3406.5     3438.9     3314.3     3265.9     2801.9     2788.3     2695.2     2781.5
>> Pipe Throughput                             426.8      423.4     3207.9       3234     4635.1     4708.9       7326     7335.3     7327.2     7319.7
>> Pipe-based Context Switching                110.2      223.5      680.8     1602.2      998.6     2324.6     3122.1     3252.7     3128.6     3337.2
>> Process Creation                            130.7      224.4     1001.3     1043.6       1209     1248.2     1337.9     1380.4     1338.6     1280.1
>> Shell Scripts (1 concurrent)               1140.5     1257.5     5462.8     6146.4     6435.3     7206.1     7425.2     7636.2     7566.1     7636.6
>> Shell Scripts (8 concurrent)                 3492     3586.7     7144.9       7307       7258     7320.2     7295.1     7296.7     7248.6     7252.2
>> System Call Overhead                        387.7      387.5     2398.4       2367     2793.8     2752.7     3735.7     3694.2     3752.1     3709.4
>> System Benchmarks Index Score               634.8      712.6     2725.8     3005.7     3232.4     3569.7     3981.3     4028.8     4085.2     4126.3
>> ----------------------------------------------------------------------------------------------------------------------------------------------------
>> % increase (of the Index Score)                       12.256                10.269                10.435                 1.193                 1.006
>> ====================================================================================================================================================
>>
>> *** Intel(R) Xeon(R) X5650 @ 2.67GHz
>> *** pCPUs      48        DOM0 vCPUS  16
>> *** RAM        393138 MB DOM0 Memory 9955 MB
>> *** NUMA nodes 2
>> =======================================================================================================================================
>> MAKE XEN (lower == better)
>> =======================================================================================================================================
>> # of build jobs                     -j1                   -j20                   -j24                  -j48**               -j62
>> vanilla/patched              vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched    vanilla    patched
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>                               267.78     233.25      36.53      35.53      35.98      34.99      33.46      32.13      33.57      32.54
>>                               268.42     233.92      36.82      35.56      36.12       35.2      34.24      32.24      33.64      32.56
>>                               268.85     234.39      36.92      35.75      36.15      35.35      34.48      32.86      33.67      32.74
>>                               268.98     235.11      36.96      36.01      36.25      35.46      34.73      32.89      33.97      32.83
>>                               269.03     236.48      37.04      36.16      36.45      35.63      34.77      32.97      34.12      33.01
>>                               269.54     237.05      40.33      36.59      36.57      36.15      34.97      33.09      34.18      33.52
>>                               269.99     238.24      40.45      36.78      36.58      36.22      34.99      33.69      34.28      33.63
>>                               270.11     238.48      41.13      39.98      40.22      36.24         38      33.92      34.35      33.87
>>                               270.96     239.07      41.66      40.81      40.59      36.35      38.99      34.19      34.49      37.24
>>                               271.84     240.89      42.07      41.24      40.63      40.06      39.07      36.04      34.69      37.59
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Avg.                         269.55    236.688     38.991     37.441     37.554     36.165      35.77     33.402     34.096     33.953
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  Std. Dev.                     1.213      2.503      2.312      2.288      2.031      1.452      2.079      1.142      0.379      1.882
>> ---------------------------------------------------------------------------------------------------------------------------------------
>>  % improvement                           12.191                 3.975                 3.699                 6.620                 0.419
>> ========================================================================================================================================
>
> I'm a bit confused here as to why, if dom0 has 16 vcpus in all of your
> tests, you change the -j number (apparently) based on the number of
> pcpus available to Xen.  Wouldn't it make more sense to stick with
> 1/6/8/16/24?  That would allow us to have actually comparable numbers.
>
> But in any case, it seems to me that the numbers do show a uniform
> improvement and no regressions -- I think this approach looks really
> good, particularly as it is so small and well-contained.

That said, it's probably a good idea to make this optional somehow, so
that if people do decide to do a pinning / partitioning approach, the
guest scheduler actually can take advantage of topological
information.

 -George