Subject: Re: [RFC 00/60] Coscheduling for Linux
From: "Jan H. Schönherr" <jschoenh@amazon.de>
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Paul Turner, Vincent Guittot,
 Morten Rasmussen, Tim Chen
Date: Wed, 26 Sep 2018 11:35:44 +0200
Message-ID: <88a58ef0-4175-a247-9b48-076ffe1c750e@amazon.de>
In-Reply-To: <20180917133703.GU24124@hirez.programming.kicks-ass.net>
References: <20180907214047.26914-1-jschoenh@amazon.de>
 <20180914111251.GC24106@hirez.programming.kicks-ass.net>
 <1d86f497-9fef-0b19-50d6-d46ef1c0bffa@amazon.de>
 <20180917133703.GU24124@hirez.programming.kicks-ass.net>

On 09/17/2018 03:37 PM, Peter Zijlstra wrote:
> On Fri, Sep 14, 2018 at 06:25:44PM +0200, Jan H. Schönherr wrote:
>> With gang scheduling as defined by Feitelson and Rudolph [6], you'd have to
>> explicitly schedule idle time. With coscheduling as defined by Ousterhout [7],
>> you don't. In this patch set, the scheduling of idle time is "merely" a quirk
>> of the implementation. And even with this implementation, there's nothing
>> stopping you from down-sizing the width of the coscheduled set to take out
>> the idle vCPUs dynamically, cutting down on fragmentation.
>
> The thing is, if you drop the full width gang scheduling, you instantly
> require the paravirt spinlock / tlb-invalidate stuff again.
Can't say much about tlb-invalidate, but yes to the spinlock stuff: if there
isn't any additional information available, all runnable tasks/vCPUs have to
be coscheduled to avoid lock holder preemption. With additional information
about tasks potentially holding locks or potentially spinning on a lock, it
would be possible to coschedule smaller subsets -- no idea whether that
would be any more efficient, though.

> Of course, the constraints of L1TF itself requires the explicit
> scheduling of idle time under a bunch of conditions.

That is true for some of the resource contention use cases, too. Though,
they are much more relaxed wrt. their requirements on the simultaneousness
of the context switch.

> I did not read your [7] in much detail (also very bad quality scan that
> :-/); but I don't get how they leap from 'thrashing' to co-scheduling.

In my personal interpretation, that analogy refers to the case where the
waiting time for a lock is shorter than the time for a context switch --
but where the context switch was done anyway, "thrashing" the CPU.

Anyway, I only brought it up because everyone has a different understanding
of what "coscheduling" or "gang scheduling" actually means. The memorable
quotes are from Ousterhout:

    "A task force is coscheduled if all of its runnable processes are
    executing simultaneously on different processors. Each of the
    processes in that task force is also said to be coscheduled."

(where a "task force" is a group of closely cooperating tasks), and from
Feitelson and Rudolph:

    "[Gang scheduling is defined] as the scheduling of a group of threads
    to run on a set of processors at the same time, on a one-to-one basis."

(with the additional assumption of time slices, collective preemption, and
that threads don't relinquish the CPU during their time slice). That makes
gang scheduling much more specific, while coscheduling just refers to the
fact that some things are executed simultaneously.
> Their initial problem, where A generates data that B needs and the 3
> scenarios:
>
> 1) A has to wait for B
> 2) B has to wait for A
> 3) the data gets buffered
>
> Seems fairly straight forward and is indeed quite common; needing
> co-scheduling for that, I'm not convinced.
>
> We have of course added all sorts of adaptive wait loops in the kernel
> to deal with just that issue.
>
> With co-scheduling you 'ensure' B is running when A is, but that doesn't
> mean you can actually make more progress, you could just be burning a
> lot of CPU cycles (which could've been spent doing other work).

I don't think that coscheduling should be applied blindly. Just like the
adaptive wait loops you mentioned: in the beginning there was active
waiting; it wasn't that great, so passive waiting was invented; then it
turned out that the overhead is too high in some cases, so let's spin
adaptively for a moment.

We went from uncoordinated scheduling to system-wide coordinated scheduling
(which turned out to be not very efficient for many cases). And now we are
in the phase of finding the right adaptiveness. There is work on enabling
coscheduling only on demand (when a parallel application profits from it)
or on being more fuzzy about it (giving the scheduler more freedom); there
is work on moving away from system-wide coordination to (dynamically)
smaller isles (where I see my own work as well). And "recently" the
resource contention and security use cases have been leaving their
impression on the topic as well.

> I'm also not convinced co-scheduling makes _any_ sense outside SMT --
> does one of the many papers you cite make a good case for !SMT
> co-scheduling? It just doesn't make sense to co-schedule the LLC domain,
> that's 16+ cores on recent chips.

There's the resource contention stuff, much of which targets the last level
cache or memory controller bandwidth. So, that is making a case for
coscheduling larger parts than SMT.
However, I didn't find anything in a short search that would already cover
some of the more recent processors with 16+ cores.

There's the auto-tuning of parallel algorithms to a certain system
architecture. That would also profit from LLC coscheduling (and slightly
larger time slices) to run multiple of those in parallel. Again, no idea
for recent processors.

There's work to coschedule whole clusters, which goes beyond the scope of
a single system, but also predates recent systems. (Search for, e.g.,
"implicit coscheduling".)

So, 16+ cores is unknown territory, AFAIK. But not every recent system has
16+ cores, or will have 16+ cores in the near future.

Regards
Jan