From: Mathieu Poirier
Date: Thu, 24 Aug 2017 14:32:20 -0600
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
To: Luca Abeni
Cc: Ingo Molnar, Peter Zijlstra, tj@kernel.org, vbabka@suse.cz, Li Zefan,
    akpm@linux-foundation.org, weiyongjun1@huawei.com, Juri Lelli,
    Steven Rostedt, Claudio Scordino, Daniel Bristot de Oliveira,
    "linux-kernel@vger.kernel.org", Tommaso Cucinotta

On 24 August 2017 at 01:53, Luca Abeni wrote:
> On Wed, 23 Aug 2017 13:47:13 -0600
> Mathieu Poirier wrote:
>> >> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> >> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> >> operations. When CPUhotplug and some CPUset manipulation take place root
>> >> domains are destroyed and new ones created, losing at the same time DL
>> >> accounting pertaining to utilisation.
>> >
>> > Thanks for looking at this longstanding issue! I am just back from
>> > vacations; in the next days I'll try your patches.
>> > Do you have some kind of scripts for reproducing the issue
>> > automatically? (I see that in the original email Steven described how
>> > to reproduce it manually; I just wonder if anyone already scripted the
>> > test).
>>
>> I didn't bother scripting it since it is so easy to do. I'm eager to
>> see how things work out on your end.
>
> Ok, so I'll try to reproduce the issue manually as described in Steven's
> original email; I'll run some tests as soon as I finish with some stuff
> that accumulated during vacations.
>
> [...]
>> >> OPEN ISSUE:
>> >>
>> >> Regardless of how we proceed (using existing CPUset list or new ones) we
>> >> need to deal with DL tasks that span more than one root domain, something
>> >> that will typically happen after a CPUset operation. For example, if we
>> >> split the number of available CPUs on a system in two CPUsets and then turn
>> >> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> >> parent CPUset will end up spanning two root domains.
>> >>
>> >> One way to deal with this is to prevent CPUset operations from happening
>> >> when such condition is detected, as enacted in this set.
>> >
>> > I think this is the simplest (if not only?) solution if we want to use
>> > gEDF in each root domain.
>>
>> Global Earliest Deadline First? Is my interpretation correct?
>
> Right. As far as I understand, the original SCHED_DEADLINE design is to
> partition the CPUs in disjoint sets, and then use global EDF scheduling
> on each one of those sets (this guarantees bounded tardiness, and if
> you run some additional admission tests in user space you can also
> guarantee the hard respect of every deadline).
>
>
>> >> Although simple
>> >> this approach feels brittle and akin to a "whack-a-mole" game. A better
>> >> and more reliable approach would be to teach the DL scheduler to deal with
>> >> tasks that span multiple root domains, a serious and substantial
>> >> undertaking.
>> >>
>> >> I am sending this as a starting point for discussion. I would be grateful
>> >> if you could take the time to comment on the approach and most importantly
>> >> provide input on how to deal with the open issue underlined above.
>> >
>> > I suspect that if we want to guarantee bounded tardiness then we have to
>> > go for a solution similar to the one suggested by Tommaso some time ago
>> > (if I remember well):
>> >
>> > if we want to create some "second level cpusets" inside a "parent
>> > cpuset", allowing deadline tasks to be placed inside both the "parent
>> > cpuset" and the "second level cpusets", then we have to subtract the
>> > "second level cpusets" maximum utilizations from the "parent cpuset"
>> > utilization.
>> >
>> > I am not sure how difficult it can be to implement this...
>>
>> Humm... I am missing some context here.
>
> Or maybe I misunderstood the issue you were seeing (I am no expert on
> cpusets). Is it related to hierarchies of cpusets (with one cpuset
> contained inside another one)?

Having spent a lot of time in the CPUset code, I can understand the
confusion. CPUsets can be arranged in a hierarchy, _seemingly_ creating
overlapping root domains. Fortunately that isn't the case: overlapping
CPUsets are merged to create non-overlapping root domains. This happens in
rebuild_sched_domains_locked() [1], where generate_sched_domains() [2]
transforms any CPUset topology into disjoint domains.

> Can you describe how to reproduce the problematic situation?

Let's start with a 4 CPU system (in this case the Q401c Dragon board)
where patches 1/7 and 2/7 have been applied to a vanilla kernel. I'm also
using Juri's tools [3,4] as described in Steve's email [5].

root@linaro-developer:/home/linaro# uname -a
Linux linaro-developer 4.13.0-rc5-00012-g98bf1310205e #149 SMP PREEMPT Thu Aug 24 13:12:39 MDT 2017 aarch64 GNU/Linux
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# cat /sys/devices/system/cpu/online
0-3
root@linaro-developer:/home/linaro#
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[2]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0
root@linaro-developer:/home/linaro#

This checks out as expected. Now let's create 2 CPUsets and make sure new
root domains are created by setting the 'sched_load_balance' flag to '0'
on the default CPUset.

root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@linaro-developer:/sys/fs/cgroup/cpuset#

At this point runqueue0 and runqueue1 point to root domain A while
runqueue2 and runqueue3 point to root domain B (something that can't be
seen without adding more instrumentation). Newly created tasks can roam
on all the CPUs available:

root@linaro-developer:/home/linaro# ./burn &
[1] 3973
root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
Cpus_allowed:   f
root@linaro-developer:/home/linaro#

The above demonstrates that even though we have two CPUsets, new tasks
belong to the "default" CPUset and as such can use all the available CPUs.
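
To make the mask above easier to read, here is a minimal sketch (plain Python, not part of any of the tools used in this thread) that decodes a Cpus_allowed value and lists which root domains it touches. The helper name mask_to_cpus is hypothetical, and the mapping of CPUs to root domains A and B is simply the one described in the text, since the kernel does not export it here.

```python
def mask_to_cpus(hex_mask):
    """Decode a Cpus_allowed hex mask (as shown in /proc/<pid>/status)
    into the set of CPU ids whose bits are set."""
    # Larger systems print comma-separated hex words; drop the commas.
    value = int(hex_mask.replace(",", ""), 16)
    return {cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1}


# Assumption: root domain layout as described in the text above.
root_domains = {"A": {0, 1}, "B": {2, 3}}

task_cpus = mask_to_cpus("f")
# A root domain is "spanned" if the task may run on at least one of its CPUs.
spanned = sorted(name for name, cpus in root_domains.items() if cpus & task_cpus)

print(sorted(task_cpus), spanned)
```

With the mask `f` from the transcript, the task is allowed on CPUs 0 to 3 and therefore overlaps both root domains.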

Now let's make task 3973 a DL task:

root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
dl_rq[0]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0        <------ Problem
dl_rq[1]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 0        <------ Problem
dl_rq[2]:
  .dl_nr_running                 : 1
  .dl_nr_migratory               : 1
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718   <------ As expected
dl_rq[3]:
  .dl_nr_running                 : 0
  .dl_nr_migratory               : 0
  .dl_bw->bw                     : 996147
  .dl_bw->total_bw               : 943718   <------ As expected
root@linaro-developer:/home/linaro/jlelli#

When task 3973 was promoted to a DL task it was running on either CPU2 or
CPU3. The acceptance test was done on root domain B and the task
utilisation added as expected. But as pointed out above, task 3973 can
still be scheduled on CPU0 and CPU1, and that is a problem since the
utilisation hasn't been added there as well. The task is now spread over
two root domains rather than a single one, as currently expected by the DL
code (note that there are many ways to reproduce this situation).

In its current form the patchset prevents specific operations from being
carried out if we recognise that a task could end up spanning more than a
single root domain. But that will break as soon as we find a new way to
create a DL task that spans multiple domains (and I may not have caught
them all either).

Another way to fix this is to do an acceptance test on all the root
domains of a task. So above we'd run the acceptance test on root domains A
and B before promoting the task. Of course we'd also have to add the
utilisation of that task to both root domains. Although simple, it goes to
the core of the DL scheduler and touches pretty much every aspect of it,
something I'm reluctant to embark on.

[1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
[2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
[3]. https://github.com/jlelli/tests.git
[4]. https://github.com/jlelli/schedtool-dl.git
[5]. https://lkml.org/lkml/2016/2/3/966

>
>> Nonetheless the approach I
>> was contemplating was to repeat the current mathematics to all the
>> root domains accessible from a p->cpus_allowed's flag.
>
> I think in the original SCHED_DEADLINE design there should be only one
> root domain compatible with the task's affinity... If this does not
> happen, I suspect it is a bug (Juri, can you confirm?).
>
> My understanding is that with SCHED_DEADLINE cpusets should be used to
> partition the system's CPUs in disjoint sets (and I think there is one
> root domain for each one of those disjoint sets). And the task affinity
> mask should correspond with the CPUs composing the set in which the
> task is executing.
>
>
>> As such we'd
>> have the same acceptance test but repeated to more than one root
>> domain. To do that time can be an issue but the real problem I see is
>> related to the current DL code. It is geared around a single root
>> domain and changing that means meddling in a lot of places. I had a
>> prototype that was beginning to address that but decided to gather
>> people's opinion before getting in too deep.
>
> I still do not fully understand this (I got the impression that this is
> related to hierarchies of cpusets, but I am not sure if this
> understanding is correct). Maybe an example would help me to understand.

The above should say it all - please get back to me if I haven't expressed
myself clearly.

>
>
> Thanks,
> Luca
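
For reference, the numbers in the sched_debug output and the "acceptance test on every root domain" idea can be sketched in a few lines of Python. This is a rough model, not kernel code: it assumes the bandwidth values are runtime/period ratios scaled by 2^20 (which matches the 943718 shown for 900000:1000000), that the 996147 limit corresponds to 950000/1000000, and that the per-domain check is "limit times number of CPUs must not be exceeded". The function admit_on_all and the dictionary layout are hypothetical and only illustrate the idea discussed above.

```python
BW_SHIFT = 20  # assumed fixed-point scale for bandwidth ratios


def to_ratio(period_us, runtime_us):
    """Utilisation as a fixed-point fraction: (runtime << 20) / period, truncated."""
    return (runtime_us << BW_SHIFT) // period_us


def dl_overflow(bw, cpus, total_bw, new_bw):
    """True if adding new_bw would exceed the domain capacity bw * cpus."""
    return bw * cpus < total_bw + new_bw


def admit_on_all(domains, task_cpus, new_bw):
    """Hypothetical multi-domain test: admit the task only if every root
    domain it can run on has room, then charge new_bw to each of them."""
    touched = [d for d in domains if d["cpus"] & task_cpus]
    if any(dl_overflow(d["bw"], len(d["cpus"]), d["total_bw"], new_bw)
           for d in touched):
        return False
    for d in touched:
        d["total_bw"] += new_bw
    return True


limit = to_ratio(1_000_000, 950_000)    # assumed source of .dl_bw->bw
task_bw = to_ratio(1_000_000, 900_000)  # schedtool -t 900000:1000000

domains = [
    {"cpus": {0, 1}, "bw": limit, "total_bw": 0},  # root domain A
    {"cpus": {2, 3}, "bw": limit, "total_bw": 0},  # root domain B
]
admitted = admit_on_all(domains, {0, 1, 2, 3}, task_bw)
print(limit, task_bw, admitted, [d["total_bw"] for d in domains])
```

In this model the task from the reproduction is accounted in both root domains A and B, rather than only in the one it happened to be running on when it was promoted.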