Subject: Re: [PATCH RFC] xen: if on Xen, "flatten" the scheduling domain hierarchy
From: Konrad Rzeszutek Wilk
Date: Tue, 18 Aug 2015 09:53:13 -0700
To: Dario Faggioli, "xen-devel@lists.xenproject.org", herbert.van.den.bergh@oracle.com
CC: Juergen Gross, Andrew Cooper, "Luis R. Rodriguez", David Vrabel, Boris Ostrovsky,
    linux-kernel, Stefano Stabellini, George Dunlap
In-Reply-To: <1439913332.4239.134.camel@citrix.com>
Message-ID: <269B91D8-6656-43C3-9912-CED232C4E014@oracle.com>

On August 18, 2015 8:55:32 AM PDT, Dario Faggioli wrote:
>Hey everyone,
>
>So, as a followup of what we were discussing in this thread:
>
>  [Xen-devel] PV-vNUMA issue: topology is misinterpreted by the guest
>  http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg03241.html
>
>I started looking in more detail at scheduling domains in the Linux
>kernel. Now, that thread was about CPUID and vNUMA, and their weird way
>of interacting, while what I'm proposing here is completely independent
>of them both.
>
>In fact, no matter whether vNUMA is supported and enabled, and no
>matter whether CPUID is reporting accurate, random, meaningful or
>completely misleading information, I think that we should do something
>about how scheduling domains are built.
>
>Fact is, unless we use 1:1 pinning that is immutable across the guest's
>whole lifetime, scheduling domains should not be constructed, in Linux,
>by looking at *any* topology information, because that just does not
>make any sense when vCPUs move around.
>
>Let me state this again (hoping to make myself as clear as possible):
>no matter how good a shape we get CPUID support into, and no matter how
>beautifully and consistently it interacts with vNUMA, licensing
>requirements and whatever else, it will always be possible for vCPU #0
>and vCPU #3 to be scheduled on two SMT threads at time t1, and on two
>different NUMA nodes at time t2. Hence, the Linux scheduler really
>should not skew its load balancing logic toward either of those two
>situations, as neither of them can be considered correct (since nothing
>is!).

What about Windows guests?

>For now, this only covers the PV case. The HVM case shouldn't be any
>different, but I haven't looked at how to make the same thing happen
>there as well.
>
>OVERALL DESCRIPTION
>===================
>What this RFC patch does is, in the Xen PV case, configure scheduling
>domains in such a way that there is only one of them, spanning all the
>pCPUs of the guest.

Wow. That is a pretty simple patch!!
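If I read the description right, the Xen-side core of it should boil down to
registering a single-level topology with the scheduler core, roughly like the
sketch below. To be clear, this is reconstructed from the text above, not
quoted from the patch, and the function name and placement are my own guesses:

/* arch/x86/xen/ somewhere -- sketch only, not copied from the patch. */
#include <linux/sched.h>     /* set_sched_topology(), SD_INIT_NAME() */
#include <linux/topology.h>  /* cpu_cpu_mask() */

/* One level spanning every vCPU, so the scheduler builds a single
 * "VCPU" domain and nothing below or above it. */
static struct sched_domain_topology_level xen_sched_domain_topology[] = {
        { cpu_cpu_mask, SD_INIT_NAME(VCPU) },
        { NULL, },
};

static void xen_setup_flat_sched_topology(void)
{
        /* Replace the default SMT/MC/DIE-aware levels wholesale. */
        set_sched_topology(xen_sched_domain_topology);
}

with xen_setup_flat_sched_topology() called from the PV SMP bringup path. If
that really is all there is to it, even better.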
>Note that the patch deals directly with scheduling domains, and there
>is no need to alter the masks that will then be used for building and
>reporting the topology (via CPUID, /proc/cpuinfo, /sysfs, etc.). That
>is the main difference between it and the patch proposed by Juergen here:
>http://lists.xenproject.org/archives/html/xen-devel/2015-07/msg05088.html
>
>This means that when, in future, we fix CPUID handling and make it
>comply with whatever logic or requirements we want, that won't have any
>unexpected side effects on scheduling domains.
>
>Information about how the scheduling domains are being constructed
>during boot is available in `dmesg', if the kernel is booted with the
>'sched_debug' parameter. It is also possible to look at
>/proc/sys/kernel/sched_domain/cpu*, and at /proc/schedstat.
>
>With the patch applied, only one scheduling domain is created, called
>the 'VCPU' domain, spanning all the guest's (or Dom0's) vCPUs. You can
>tell that from the fact that every cpu* folder in
>/proc/sys/kernel/sched_domain/ has only one subdirectory ('domain0'),
>with all the tweaks and the tunables for our scheduling domain.
>
> ...
>
>REQUEST FOR COMMENTS
>====================
>Basically, the kind of feedback I'd be really glad to hear is:
> - what you guys think of the approach,
> - whether you think, looking at this preliminary set of numbers, that
>   this is something worth continuing to investigate,
> - if yes, what other workloads and benchmarks it would make sense to
>   throw at it.
>

The thing that I was worried about is that we would be modifying the
generic code, but your changes are all in Xen code! Woot!

In terms of workloads, I am CCing Herbert, who I hope can provide
advice on this. Herbert, the full email is here:

http://lists.xen.org/archives/html/xen-devel/2015-08/msg01691.html

>Thanks and Regards,
>Dario
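P.S. For whoever ends up running benchmarks on this: a quick way to confirm,
from inside the guest, that the flattening actually took effect is to look at
the /proc paths Dario mentions above. A minimal userspace sketch of my own
(hypothetical; it assumes CONFIG_SCHED_DEBUG is enabled, which is needed for
/proc/sys/kernel/sched_domain to exist in the first place):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        char name[64] = "";
        FILE *f = fopen("/proc/sys/kernel/sched_domain/cpu0/domain0/name", "r");

        /* With the patch, cpu0 (like every other CPU) should have exactly
         * one domain directory, and its name should read "VCPU". */
        if (!f || !fgets(name, sizeof(name), f)) {
                printf("cpu0 has no scheduling domain at all?\n");
                return 1;
        }
        fclose(f);
        name[strcspn(name, "\n")] = '\0';

        /* A second level in the hierarchy would show up as 'domain1'. */
        if (access("/proc/sys/kernel/sched_domain/cpu0/domain1", F_OK) == 0)
                printf("cpu0: more than one level (domain0 = %s)\n", name);
        else
                printf("cpu0: single level, domain0 = %s\n", name);

        return 0;
}

With the patch applied I would expect it to print "cpu0: single level,
domain0 = VCPU".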