From: Mathieu Poirier
Date: Wed, 23 Aug 2017 13:47:13 -0600
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
To: Luca Abeni
Cc: Ingo Molnar, Peter Zijlstra, tj@kernel.org, vbabka@suse.cz,
    Li Zefan, akpm@linux-foundation.org, weiyongjun1@huawei.com,
    Juri Lelli, Steven Rostedt, Claudio Scordino,
    Daniel Bristot de Oliveira, linux-kernel@vger.kernel.org,
    Tommaso Cucinotta

On 22 August 2017 at 06:21, Luca Abeni wrote:
> Hi Mathieu,

Good day to you,

>
> On Wed, 16 Aug 2017 15:20:36 -0600
> Mathieu Poirier wrote:
>
>> This is a renewed attempt at fixing a problem reported by Steve
>> Rostedt [1] where DL bandwidth accounting is not recomputed after
>> CPUset and CPUhotplug operations. When CPUhotplug and some CPUset
>> manipulations take place, root domains are destroyed and new ones
>> created, losing at the same time the DL accounting pertaining to
>> utilisation.
>
> Thanks for looking at this longstanding issue! I am just back from
> vacation; in the next days I'll try your patches.
> Do you have some kind of script for reproducing the issue
> automatically? (I see that in the original email Steven described
> how to reproduce it manually; I just wonder if anyone has already
> scripted the test.)

I didn't bother scripting it since it is so easy to do. I'm eager to
see how things work out on your end.

>
>> An earlier attempt by Juri [2] used the scheduling classes'
>> rq_online() and rq_offline() methods, something that highlighted a
>> problem with sleeping DL tasks. The email thread that followed
>> envisioned creating a list of sleeping tasks to circle through when
>> recomputing DL accounting.
>>
>> In this set the problem is addressed by relying on the existing
>> list of tasks (sleeping or not) already maintained by CPUsets. When
>> CPUset or CPUhotplug operations have completed, we circle through
>> the list of tasks maintained by each CPUset looking for DL tasks.
>> When a DL task is found, its utilisation is added to the root
>> domain it pertains to by way of its runqueue.
>>
>> The advantage of proceeding this way is that the recomputing of DL
>> accounting is done the same way for both active and inactive tasks,
>> along with guaranteeing that DL accounting for tasks ends up in the
>> correct root domain regardless of the CPUset topology. The
>> disadvantage is that circling through all the tasks in a CPUset can
>> be time consuming. The counter argument is that both CPUset and
>> CPUhotplug operations are time consuming in the first place.
>
> I do not know the cpuset code very well, but I agree that your
> approach looks better than creating an additional list for blocked
> deadline tasks.
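To make the discussion concrete, the heart of the set boils down to
something like the sketch below. This is heavily simplified: the
function name is one I'm inventing for this mail, all locking is
omitted, and the real code would go through the proper dl_bw helpers.
It only shows the walk over a cpuset's tasks and where the bandwidth
of each DL task ends up:

	/* Sketch only, not the actual patch. */
	static void cpuset_restore_dl_bw(struct cpuset *cs)
	{
		struct css_task_iter it;
		struct task_struct *task;

		/* Walk every task attached to this cpuset. */
		css_task_iter_start(&cs->css, &it);
		while ((task = css_task_iter_next(&it))) {
			struct rq *rq;

			if (!task_has_dl_policy(task))
				continue;

			/*
			 * The root domain is reached through the
			 * task's runqueue, so the bandwidth lands in
			 * the right rd regardless of the CPUset
			 * topology.
			 */
			rq = task_rq(task);
			rq->rd->dl_bw.total_bw += task->dl.dl_bw;
		}
		css_task_iter_end(&it);
	}

Since the iteration covers sleeping tasks as well as runnable ones,
active and inactive tasks are handled by the exact same path.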
>
>> OPEN ISSUE:
>>
>> Regardless of how we proceed (using the existing CPUset lists or
>> new ones) we need to deal with DL tasks that span more than one
>> root domain, something that will typically happen after a CPUset
>> operation. For example, if we split the number of available CPUs on
>> a system into two CPUsets and then turn off the
>> 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> parent CPUset will end up spanning two root domains.
>>
>> One way to deal with this is to prevent CPUset operations from
>> happening when such a condition is detected, as enacted in this
>> set.
>
> I think this is the simplest (if not only?) solution if we want to
> use gEDF in each root domain.

Global Earliest Deadline First? Is my interpretation correct?

>
>> Although simple, this approach feels brittle and akin to a
>> "whack-a-mole" game. A better and more reliable approach would be
>> to teach the DL scheduler to deal with tasks that span multiple
>> root domains, a serious and substantial undertaking.
>>
>> I am sending this as a starting point for discussion. I would be
>> grateful if you could take the time to comment on the approach and
>> most importantly provide input on how to deal with the open issue
>> underlined above.
>
> I suspect that if we want to guarantee bounded tardiness then we
> have to go for a solution similar to the one suggested by Tommaso
> some time ago (if I remember well):
>
> if we want to create some "second level cpusets" inside a "parent
> cpuset", allowing deadline tasks to be placed inside both the
> "parent cpuset" and the "second level cpusets", then we have to
> subtract the "second level cpusets" maximum utilizations from the
> "parent cpuset" utilization.
>
> I am not sure how difficult it can be to implement this...

Humm... I am missing some context here. Nonetheless, the approach I
was contemplating was to repeat the current mathematics for all the
root domains reachable from a task's p->cpus_allowed mask. As such
we'd have the same acceptance test, but repeated over more than one
root domain. The time taken to do that can be an issue, but the real
problem I see is with the current DL code: it is geared around a
single root domain, and changing that means meddling in a lot of
places. I had a prototype that was beginning to address that but
decided to gather people's opinion before getting in too deep. (A
rough sketch of what I have in mind is at the end of this mail.)

>
> If, instead, we want to guarantee that all deadlines are respected,
> then we need to have a look at Brandenburg's paper on arbitrary
> affinities:
> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf

Ouch, that's an extended read...

>
> Thanks,
> Luca
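P.S. For completeness, below is the shape of the multi-root-domain
acceptance test I was contemplating above. Again only a sketch:
dl_bw_of(), dl_bw_cpus() and __dl_overflow() are the existing
helpers, the function itself is invented for this mail, and the
locking is reduced to the bare minimum:

	/*
	 * Run the existing admission test once per root domain
	 * covered by the task's allowed CPUs. Admission fails if
	 * any one of them would be overloaded.
	 */
	static int dl_task_fits_all_rd(struct task_struct *p, u64 new_bw)
	{
		cpumask_var_t visited;
		int cpu, ret = 0;

		if (!zalloc_cpumask_var(&visited, GFP_KERNEL))
			return -ENOMEM;

		for_each_cpu(cpu, &p->cpus_allowed) {
			struct dl_bw *dl_b;

			/* Root domain already checked? */
			if (cpumask_test_cpu(cpu, visited))
				continue;

			rcu_read_lock_sched();
			dl_b = dl_bw_of(cpu);
			if (__dl_overflow(dl_b, dl_bw_cpus(cpu),
					  0, new_bw))
				ret = -EBUSY;

			/* Mark all CPUs of this root domain as seen. */
			cpumask_or(visited, visited,
				   cpu_rq(cpu)->rd->span);
			rcu_read_unlock_sched();

			if (ret)
				break;
		}

		free_cpumask_var(visited);
		return ret;
	}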