Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Date:   Sat, 27 Apr 2019 11:06:57 +0200
From:   Ingo Molnar <mingo@kernel.org>
To:     Mel Gorman <mgorman@techsingularity.net>
Cc:     Aubrey Li <aubrey.intel@gmail.com>,
        Julien Desfossez <jdesfossez@digitalocean.com>,
        Vineeth Remanan Pillai <vpillai@digitalocean.com>,
        Nishanth Aravamudan <naravamudan@digitalocean.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Tim Chen <tim.c.chen@linux.intel.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Paul Turner <pjt@google.com>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        Linux List Kernel Mailing <linux-kernel@vger.kernel.org>,
        Subhra Mazumdar <subhra.mazumdar@oracle.com>,
        Frédéric Weisbecker <fweisbec@gmail.com>,
        Kees Cook <keescook@chromium.org>,
        Greg Kerr <kerrnel@google.com>, Phil Auld <pauld@redhat.com>,
        Aaron Lu <aaron.lwe@gmail.com>,
        Valentin Schneider <valentin.schneider@arm.com>,
        Pawan Gupta <pawan.kumar.gupta@linux.intel.com>,
        Paolo Bonzini <pbonzini@redhat.com>,
        Jiri Kosina <jkosina@suse.cz>
Subject: Re: [RFC PATCH v2 00/17] Core scheduling v2
Message-ID: <20190427090657.GB99668@gmail.com>
References: <cover.1556025155.git.vpillai@digitalocean.com>
 <CAERHkrtOCbLQ-tFq9ujjnyaudtd_e0UaSA2GQG64JqdS6cuTKg@mail.gmail.com>
 <20190424140013.GA14594@sinkpad>
 <CAERHkrtVjU4SaHjPGNb140qBiU+wVELGaH6j9Z=cctToZ0qn9g@mail.gmail.com>
 <20190425095508.GA8387@gmail.com>
 <20190425144619.GX18914@techsingularity.net>
 <20190425185343.GA122353@gmail.com>
 <20190425213145.GY18914@techsingularity.net>
 <20190426094545.GD126896@gmail.com>
 <20190426101947.GZ18914@techsingularity.net>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20190426101947.GZ18914@techsingularity.net>
User-Agent: Mutt/1.10.1 (2018-07-13)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk


* Mel Gorman <mgorman@techsingularity.net> wrote:

> On Fri, Apr 26, 2019 at 11:45:45AM +0200, Ingo Molnar wrote:
> > 
> > * Mel Gorman <mgorman@techsingularity.net> wrote:
> > 
> > > > > I can show a comparison with equal levels of parallelisation but with 
> > > > > HT off, it is a completely broken configuration and I do not think a 
> > > > > comparison like that makes any sense.
> > > > 
> > > > I would still be interested in that comparison, because I'd like
> > > > to learn whether there's any true *inherent* performance advantage to 
> > > > HyperThreading for that particular workload, for exactly tuned 
> > > > parallelism.
> > > > 
> > > 
> > > It really isn't a fair comparison. MPI seems to behave very differently
> > > when a machine is saturated. It's documented as changing its behaviour
> > > as it tries to avoid the worst consequences of saturation.
> > > 
> > > Curiously, the results on the 2-socket machine were not as bad as I
> > > feared when the HT configuration is running with twice the number of
> > > threads as there are CPUs
> > > 
> > > Amean     bt      771.15 (   0.00%)     1086.74 * -40.93%*
> > > Amean     cg      445.92 (   0.00%)      543.41 * -21.86%*
> > > Amean     ep       70.01 (   0.00%)       96.29 * -37.53%*
> > > Amean     is       16.75 (   0.00%)       21.19 * -26.51%*
> > > Amean     lu      882.84 (   0.00%)      595.14 *  32.59%*
> > > Amean     mg       84.10 (   0.00%)       80.02 *   4.84%*
> > > Amean     sp     1353.88 (   0.00%)     1384.10 *  -2.23%*
> > 
> > Yeah, so what I wanted to suggest is a parallel numeric throughput test 
> > with few inter-process data dependencies, and see whether HT actually 
> > improves total throughput versus the no-HT case.
> > 
> > No over-saturation - but exactly as many threads as logical CPUs.
> > 
> > I.e. with 20 physical cores and 40 logical CPUs the numbers to compare 
> > would be a 'nosmt' benchmark running 20 threads, versus a SMT test 
> > running 40 threads.
> > 
> > I.e. how much does SMT improve total throughput when the workload's 
> > parallelism is tuned to utilize 100% of the available CPUs?
> > 
> > Does this make sense?
> > 
> 
> Yes. Here is the comparison.
> 
> Amean     bt      678.75 (   0.00%)      789.13 * -16.26%*
> Amean     cg      261.22 (   0.00%)      428.82 * -64.16%*
> Amean     ep       55.36 (   0.00%)       84.41 * -52.48%*
> Amean     is       13.25 (   0.00%)       17.82 * -34.47%*
> Amean     lu     1065.08 (   0.00%)     1090.44 (  -2.38%)
> Amean     mg       89.96 (   0.00%)       84.28 *   6.31%*
> Amean     sp     1579.52 (   0.00%)     1506.16 *   4.64%*
> Amean     ua      611.87 (   0.00%)      663.26 *  -8.40%*
> 
> This is the socket machine and with HT On, there are 80 logical CPUs
> versus HT Off with 40 logical CPUs.

That's very interesting - so for most workloads HyperThreading is a 
massive loss, and for 'mg' and 'sp' it's a 5-6% win?

I'm wondering how much of, say, the 'cg' workload's -64% loss could be 
task placement inefficiency - or are these all probably effects of 80 
threads competing for the same cache and memory resources and thus 
utilizing them way too inefficiently?

Are these relatively simple numeric workloads, with not much scheduling 
and good overall pinning of tasks, or is it more complex than that?

Also, the takeaway appears to be: by using HT there's a potential 
advantage of +6% on the benefit side, but a potential -50%+ performance 
hit on the risk side?
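
To make that arithmetic explicit, here is a minimal sketch that recomputes 
the percentages from the two columns of the table quoted above. It assumes 
the first column is the HT-off baseline, the second column is HT-on, and 
that the values are elapsed times (lower is better). The names RESULTS, 
relative_gain and summarize are made up for illustration:

```python
# Figures copied from the quoted table: (first column, second column).
# Assumption: first column = HT off (baseline), second column = HT on,
# and the values are elapsed times, so lower is better.
RESULTS = {
    "bt": (678.75, 789.13),
    "cg": (261.22, 428.82),
    "ep": (55.36, 84.41),
    "is": (13.25, 17.82),
    "lu": (1065.08, 1090.44),
    "mg": (89.96, 84.28),
    "sp": (1579.52, 1506.16),
    "ua": (611.87, 663.26),
}


def relative_gain(baseline, candidate):
    """Percent gain of candidate over baseline for a lower-is-better metric.

    Positive means the candidate is faster, negative means it is slower.
    """
    return (baseline - candidate) / baseline * 100.0


def summarize(results):
    """Return per-workload gains plus the best and worst workload names."""
    gains = {name: relative_gain(b, c) for name, (b, c) in results.items()}
    best = max(gains, key=gains.get)
    worst = min(gains, key=gains.get)
    return gains, best, worst


if __name__ == "__main__":
    gains, best, worst = summarize(RESULTS)
    for name, gain in gains.items():
        print(f"{name}: {gain:+.2f}%")
    print(f"best: {best} ({gains[best]:+.2f}%)")
    print(f"worst: {worst} ({gains[worst]:+.2f}%)")
```

The recomputed values land close to the quoted percentages; small 
last-digit differences are presumably due to the table inputs being 
rounded to two decimals.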

I believe these results also *strongly* support a much stricter task 
placement policy on SMT systems at up to 50% saturation - it's almost 
always going to be a win for workloads that are actually trying to 
fulfill some useful role.
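
As a toy illustration of what "stricter placement" could mean here (this 
is not how the kernel scheduler is implemented, just a sketch of the idea): 
fill one hardware thread on every core before using any SMT sibling, so 
that below 50% saturation no two tasks share a core. The function 
place_tasks and its parameters are hypothetical:

```python
def place_tasks(n_tasks, n_cores, threads_per_core=2):
    """Assign tasks to (core, thread) slots, spreading across cores first.

    Toy model only: tasks 0..n_cores-1 each get a core to themselves
    (thread 0); SMT siblings (thread 1, ...) are used only once every
    core already runs one task.
    """
    capacity = n_cores * threads_per_core
    if n_tasks > capacity:
        raise ValueError("more tasks than logical CPUs")
    slots = []
    for i in range(n_tasks):
        # divmod walks all cores on thread 0 before moving to thread 1.
        thread, core = divmod(i, n_cores)
        slots.append((core, thread))
    return slots


def shared_cores(slots):
    """Count cores that run more than one task."""
    per_core = {}
    for core, _thread in slots:
        per_core[core] = per_core.get(core, 0) + 1
    return sum(1 for count in per_core.values() if count > 1)


if __name__ == "__main__":
    # 40 cores / 80 logical CPUs, as in the machine discussed above.
    print(shared_cores(place_tasks(40, 40)))
    print(shared_cores(place_tasks(50, 40)))
```

With 40 cores, anything up to 40 tasks leaves every sibling idle in this 
model; sharing only starts once saturation goes past 50%.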

Thanks,

	Ingo