Date: Fri, 14 Sep 2018 13:12:51 +0200
From: Peter Zijlstra
To: Jan H. Schönherr
Cc: Ingo Molnar, linux-kernel@vger.kernel.org, Paul Turner, Vincent Guittot, Morten Rasmussen, Tim Chen
Subject: Re: [RFC 00/60] Coscheduling for Linux
Message-ID: <20180914111251.GC24106@hirez.programming.kicks-ass.net>
In-Reply-To: <20180907214047.26914-1-jschoenh@amazon.de>
References: <20180907214047.26914-1-jschoenh@amazon.de>

On Fri, Sep 07, 2018 at 11:39:47PM +0200, Jan H. Schönherr wrote:

> This patch series extends CFS with support for coscheduling. The
> implementation is versatile enough to cover many different coscheduling
> use-cases, while at the same time being non-intrusive, so that behavior of
> legacy workloads does not change.

I don't call this non-intrusive.

> Peter Zijlstra once called coscheduling a "scalability nightmare waiting to
> happen". Well, with this patch series, coscheduling certainly happened.

I'll beg to differ; this isn't anywhere near something to consider merging.
Also, 'happened' suggests a certain stage of completeness; this again
doesn't qualify.

> However, I disagree on the scalability nightmare. :)

There are known scalability problems with the existing cgroup muck; you
just made things a ton worse. The existing cgroup overhead is significant;
you also made that many times worse.

The cgroup stuff needs cleanups and optimization, not this.

> B) Why would I want this?
> In the L1TF context, it prevents other applications from loading
> additional data into the L1 cache, while one application tries to leak
> data.

That is the whole and only reason you did this; and it doesn't even begin
to cover the requirements for it.

Not to mention I detest cgroups, for their inherent complexity and the
performance costs associated with them. _If_ we're going to do something
for L1TF, then I feel it should not depend on cgroups. It is, after all,
perfectly possible to run a kvm thingy without cgroups.

> 1. Execute parallel applications that rely on active waiting or synchronous
> execution concurrently with other applications.
>
> The prime example in this class are probably virtual machines. Here,
> coscheduling is an alternative to paravirtualized spinlocks, pause loop
> exiting, and other techniques with its own set of advantages and
> disadvantages over the other approaches.

Note that in order to avoid PLE and paravirt spinlocks and paravirt
tlb-invalidate you have to gang-schedule the _entire_ VM, not just SMT
siblings.

Now explain to me how you're going to gang-schedule a VM with a good number
of vCPU threads (say spanning a number of nodes) while preserving the rest
of CFS, without it turning into a massive trainwreck?

Such things (gang scheduling VMs) _are_ possible, but not within the
confines of something like CFS; they are also fairly inefficient because,
as you do note, you will have to explicitly schedule idle time for idle
vCPUs.

Things like the Tableau scheduler are what come to mind, but I'm not sure
how to integrate that with a general purpose scheduling scheme. You pretty
much have to dedicate a set of CPUs to just scheduling VMs with such a
scheduler. And that would call for cpuset-v2 integration along with a new
scheduling class.

And then people will complain again that partitioning a system isn't
dynamic enough and we need magic :/

(and this too would be tricky to virtualize itself)

> C) How does it work?
> --------------------
>
> This patch series introduces hierarchical runqueues, that represent larger
> and larger fractions of the system. By default, there is one runqueue per
> scheduling domain. These additional levels of runqueues are activated by
> the "cosched_max_level=" kernel command line argument. The bottom level is
> 0.

You gloss over a ton of details here; many of which are non-trivial and
marked broken in your patches. Unless you have solid suggestions on how to
deal with all of them, this is a complete non-starter.

The per-cpu IRQ/steal time accounting, for example. The task timeline isn't
the same on every CPU because of those. You now basically require steal
time and IRQ load to match between CPUs. That places very strict
requirements and effectively breaks virt invariance. That is, the scheduler
now behaves significantly differently inside a VM than it does outside of
it -- without the guest being gang-scheduled itself and having physical
pinning to reflect the same topology, the coschedule=1 thing should not be
exposed in a guest. And that is a major failing IMO.
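[ To illustrate that point with a rough standalone sketch -- simplified,
  with made-up names, and not the actual kernel code: each CPU subtracts
  its own IRQ and steal time before charging wall time to the running
  task, so the per-CPU task clock advances at a different rate on every
  CPU. ]

/*
 * Simplified sketch, not the actual kernel code: the per-CPU task clock
 * only advances by wall time that was not spent in IRQ context and was
 * not stolen by the hypervisor.  CFS charges task runtime (and hence
 * vruntime) against this clock.
 */
#include <stdint.h>

struct rq_sketch {
	uint64_t clock;           /* raw per-CPU clock, ns */
	uint64_t clock_task;      /* what tasks are charged against, ns */
	uint64_t prev_irq_time;   /* IRQ time already accounted, ns */
	uint64_t prev_steal_time; /* steal time already accounted, ns */
};

/*
 * Advance this CPU's clocks by 'delta' ns of wall time, given the CPU's
 * cumulative IRQ and steal time counters.
 */
static void update_rq_clock_task_sketch(struct rq_sketch *rq, uint64_t delta,
					uint64_t cpu_irq_time,
					uint64_t cpu_steal_time)
{
	uint64_t irq_delta = cpu_irq_time - rq->prev_irq_time;
	uint64_t steal = cpu_steal_time - rq->prev_steal_time;

	rq->clock += delta;               /* raw clock always advances */

	if (irq_delta > delta)
		irq_delta = delta;
	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;

	if (steal > delta)
		steal = delta;
	rq->prev_steal_time += steal;
	delta -= steal;

	rq->clock_task += delta;          /* IRQ/steal time is not charged */
}

[ Two CPUs with different IRQ or steal load therefore accumulate different
  amounts of clock_task over the same wall-clock interval, which is exactly
  why the task timelines diverge. ]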
Also, I think you're sharing a cfs_rq between CPUs:

+	init_cfs_rq(&sd->shared->rq.cfs);

That is broken; the virtual runtime stuff needs nontrivial modifications
for multiple CPUs. And if you do that, I've no idea how you're dealing with
SMP affinities.
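[ For context, a stripped-down sketch of the vruntime bookkeeping --
  simplified, with made-up names, not the actual kernel code: it is written
  around exactly one CPU owning the cfs_rq, with a single 'curr' entity,
  that CPU's task clock, and a monotonic per-runqueue min_vruntime. ]

#include <stdint.h>

#define NICE_0_LOAD 1024ULL

struct sched_entity_sketch {
	uint64_t vruntime;    /* weight-scaled runtime received so far */
	uint64_t exec_start;  /* clock_task snapshot when it was picked */
	uint64_t load_weight; /* a nice-0 task has weight NICE_0_LOAD */
};

struct cfs_rq_sketch {
	struct sched_entity_sketch *curr; /* the one currently running entity */
	uint64_t min_vruntime;            /* monotonic floor for new entities */
};

/* Charge the currently running entity for the time since exec_start. */
static void update_curr_sketch(struct cfs_rq_sketch *cfs_rq, uint64_t clock_task)
{
	struct sched_entity_sketch *curr = cfs_rq->curr;
	uint64_t delta_exec;

	if (!curr || clock_task <= curr->exec_start)
		return;

	delta_exec = clock_task - curr->exec_start;
	curr->exec_start = clock_task;

	/* nice-0 tasks accrue vruntime at wall speed, others scaled by weight */
	curr->vruntime += delta_exec * NICE_0_LOAD / curr->load_weight;

	/* min_vruntime may only ever move forward */
	if (curr->vruntime > cfs_rq->min_vruntime)
		cfs_rq->min_vruntime = curr->vruntime;
}

[ With two CPUs driving one cfs_rq there are two running entities, two
  different clock_task values, and concurrent updates to min_vruntime;
  none of the above survives that without significant surgery. ]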
> You currently have to explicitly set affinities of tasks within coscheduled
> task groups, as load balancing is not implemented for them at this point.

You don't even begin to outline how you preserve smp-nice fairness.

> D) What can I *not* do with this?
> ---------------------------------
>
> Besides the missing load-balancing within coscheduled task-groups, this
> implementation has the following properties, which might be considered
> short-comings.
>
> This particular implementation focuses on SCHED_OTHER tasks managed by CFS
> and allows coscheduling them. Interrupts as well as tasks in higher
> scheduling classes are currently out-of-scope: they are assumed to be
> negligible interruptions as far as coscheduling is concerned and they do
> *not* cause a preemption of a whole group. This implementation could be
> extended to cover higher scheduling classes. Interrupts, however, are an
> orthogonal issue.
>
> The collective context switch from one coscheduled set of tasks to another
> -- while fast -- is not atomic. If a use-case needs the absolute guarantee
> that all tasks of the previous set have stopped executing before any task
> of the next set starts executing, an additional hand-shake/barrier needs to
> be added.

IOW it's completely friggin useless for L1TF.

> E) What's the overhead?
> -----------------------
>
> Each (active) hierarchy level has roughly the same effect as one additional
> level of nested cgroups. In addition -- at this stage -- there may be some
> additional lock contention if you coschedule larger fractions of the system
> with a dynamic task set.

Have you actually read your own code? What about that atrocious locking
you sprinkle all over the place? 'some additional lock contention' doesn't
even begin to describe that horror show.

Hint: we're not going to increase the lockdep subclasses, and most
certainly not for scheduler locking.

All in all, I'm not inclined to consider this approach; it complicates an
already overly complicated thing (cpu-cgroups) and has a ton of unresolved
issues, while at the same time it doesn't (and cannot) meet the goal it was
made for.