Date: Thu, 24 Aug 2017 09:53:26 +0200
From: Luca Abeni
To: Mathieu Poirier
Cc: Ingo Molnar, Peter Zijlstra, tj@kernel.org, vbabka@suse.cz,
    Li Zefan, akpm@linux-foundation.org, weiyongjun1@huawei.com,
    Juri Lelli, Steven Rostedt, Claudio Scordino,
    Daniel Bristot de Oliveira, "linux-kernel@vger.kernel.org",
    Tommaso Cucinotta
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
Message-ID: <20170824095326.4f5c1777@luca>
References: <1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org>
    <20170822142136.3604336e@luca>

On Wed, 23 Aug 2017 13:47:13 -0600
Mathieu Poirier wrote:

> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
> >> operations. When CPUhotplug and some CUPset manipulation take place root
> >> domains are destroyed and new ones created, loosing at the same time DL
> >> accounting pertaining to utilisation.
> >
> > Thanks for looking at this longstanding issue! I am just back from
> > vacations; in the next days I'll try your patches.
> > Do you have some kind of scripts for reproducing the issue
> > automatically? (I see that in the original email Steven described how
> > to reproduce it manually; I just wonder if anyone already scripted the
> > test).
>
> I didn't bother scripting it since it is so easy to do. I'm eager to
> see how things work out on your end.

Ok, so I'll try to reproduce the issue manually as described in Steven's
original email; I'll run some tests as soon as I finish with some stuff
that accumulated during my vacation.

[...]
> >> OPEN ISSUE:
> >>
> >> Regardless of how we proceed (using existing CPUset list or new ones) we
> >> need to deal with DL tasks that span more than one root domain, something
> >> that will typically happen after a CPUset operation. For example, if we
> >> split the number of available CPUs on a system in two CPUsets and then turn
> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
> >> parent CPUset will end up spanning two root domains.
> >>
> >> One way to deal with this is to prevent CPUset operations from happening
> >> when such condition is detected, as enacted in this set.
> >
> > I think this is the simplest (if not only?) solution if we want to use
> > gEDF in each root domain.
>
> Global Earliest Deadline First? Is my interpretation correct?

Right. As far as I understand, the original SCHED_DEADLINE design is to
partition the CPUs into disjoint sets, and then use global EDF scheduling
on each one of those sets (this guarantees bounded tardiness, and if you
run some additional admission tests in user space you can also guarantee
that every deadline is strictly respected).

> >> Although simple
> >> this approach feels brittle and akin to a "whack-a-mole" game. A better
> >> and more reliable approach would be to teach the DL scheduler to deal with
> >> tasks that span multiple root domains, a serious and substantial
> >> undertaking.
> >>
> >> I am sending this as a starting point for discussion. I would be grateful
> >> if you could take the time to comment on the approach and most importantly
> >> provide input on how to deal with the open issue underlined above.
> >
> > I suspect that if we want to guarantee bounded tardiness then we have to
> > go for a solution similar to the one suggested by Tommaso some time ago
> > (if I remember well):
> >
> > if we want to create some "second level cpusets" inside a "parent
> > cpuset", allowing deadline tasks to be placed inside both the "parent
> > cpuset" and the "second level cpusets", then we have to subtract the
> > "second level cpusets" maximum utilizations from the "parent cpuset"
> > utilization.
> >
> > I am not sure how difficult it can be to implement this...
>
> Humm... I am missing some context here.

Or maybe I misunderstood the issue you were seeing (I am no expert on
cpusets). Is it related to hierarchies of cpusets (with one cpuset
contained inside another one)? Can you describe how to reproduce the
problematic situation?

> Nonetheless the approach I
> was contemplating was to repeat the current mathematics to all the
> root domains accessible from a p->cpus_allowed's flag.

I think in the original SCHED_DEADLINE design there should be only one
root domain compatible with the task's affinity... If this does not
happen, I suspect it is a bug (Juri, can you confirm?).

My understanding is that with SCHED_DEADLINE cpusets should be used to
partition the system's CPUs into disjoint sets (and I think there is one
root domain for each one of those disjoint sets). And the task affinity
mask should correspond to the CPUs composing the set in which the task
is executing.

> As such we'd
> have the same acceptance test but repeated to more than one root
> domain. To do that time can be an issue but the real problem I see is
> related to the current DL code. It is geared around a single root
> domain and changing that means meddling in a lot of places. I had a
> prototype that was beginning to address that but decided to gather
> people's opinion before getting in too deep.

I still do not fully understand this (I got the impression that this is
related to hierarchies of cpusets, but I am not sure if this
understanding is correct). Maybe an example would help me to understand.


			Thanks,
				Luca