MIME-Version: 1.0
In-Reply-To: <20170822142136.3604336e@luca>
References: <1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org> <20170822142136.3604336e@luca>
From: Mathieu Poirier <mathieu.poirier@linaro.org>
Date: Wed, 23 Aug 2017 13:47:13 -0600
Message-ID: <CANLsYkxjHvx37+kNK8SKFM3NFz2G1vuzDCDuuQA9N_QvYCJbNg@mail.gmail.com>
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
To: Luca Abeni <luca.abeni@santannapisa.it>
Cc: Ingo Molnar <mingo@redhat.com>, Peter Zijlstra <peterz@infradead.org>,
        tj@kernel.org, vbabka@suse.cz, Li Zefan <lizefan@huawei.com>,
        akpm@linux-foundation.org, weiyongjun1@huawei.com,
        Juri Lelli <juri.lelli@arm.com>, Steven Rostedt <rostedt@goodmis.org>,
        Claudio Scordino <claudio@evidence.eu.com>,
        Daniel Bristot de Oliveira <bristot@redhat.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5034
Lines: 114

On 22 August 2017 at 06:21, Luca Abeni <luca.abeni@santannapisa.it> wrote:
> Hi Mathieu,

Good day to you,

>
> On Wed, 16 Aug 2017 15:20:36 -0600
> Mathieu Poirier <mathieu.poirier@linaro.org> wrote:
>
>> This is a renewed attempt at fixing a problem reported by Steve Rostedt [1]
>> where DL bandwidth accounting is not recomputed after CPUset and CPUhotplug
>> operations.  When CPUhotplug and some CUPset manipulation take place root
>> domains are destroyed and new ones created, loosing at the same time DL
>> accounting pertaining to utilisation.
>
> Thanks for looking at this longstanding issue! I am just back from
> vacations; in the next days I'll try your patches.
> Do you have some kind of scripts for reproducing the issue
> automatically? (I see that in the original email Steven described how
> to reproduce it manually; I just wonder if anyone already scripted the
> test).

I didn't bother scripting it since it is so easy to do.  I'm eager to
see how things work out on your end.

>
>> An earlier attempt by Juri [2] used the scheduling classes' rq_online() and
>> rq_offline() methods, something that highlighted a problem with sleeping
>> DL tasks. The email thread that followed envisioned creating a list of
>> sleeping tasks to circle through when recomputing DL accounting.
>>
>> In this set the problem is addressed by relying on existing list of tasks
>> (sleeping or not) already maintained by CPUsets. When CPUset or
>> CPUhotplug operations have completed we circle through the list of tasks
>> maintained by each CPUset looking for DL tasks.  When a DL task is found
>> its utilisation is added to the root domain it pertains to by way of its
>> runqueue.
>>
>> The advantage of proceeding this way is that recomputing of DL accounting
>> is done the same way for both active and inactive tasks, along with
>> guaranteeing that DL accounting for tasks end up in the correct root
>> domain regardless of the CPUset topology.  The disadvantage is that
>> circling through all the tasks in a CPUset can be time consuming.  The
>> counter argument is that both CPUset and CPUhotplug operations are time
>> consuming in the first place.
>
> I do not know the cpuset code too much, but I agree that your approach
> looks better than creating an additional list for blocked deadline
> tasks.
>
>
>> OPEN ISSUE:
>>
>> Regardless of how we proceed (using existing CPUset list or new ones) we
>> need to deal with DL tasks that span more than one root domain,  something
>> that will typically happen after a CPUset operation.  For example, if we
>> split the number of available CPUs on a system in two CPUsets and then turn
>> off the 'sched_load_balance' flag on the parent CPUset, DL tasks in the
>> parent CPUset will end up spanning two root domains.
>>
>> One way to deal with this is to prevent CPUset operations from happening
>> when such condition is detected, as enacted in this set.
>
> I think this is the simplest (if not only?) solution if we want to use
> gEDF in each root domain.

Global Earliest Deadline First?  Is my interpretation correct?

>
>> Although simple
>> this approach feels brittle and akin to a "whack-a-mole" game.  A better
>> and more reliable approach would be to teach the DL scheduler to deal with
>> tasks that span multiple root domains, a serious and substantial
>> undertaking.
>>
>> I am sending this as a starting point for discussion.  I would be grateful
>> if you could take the time to comment on the approach and most importantly
>> provide input on how to deal with the open issue underlined above.
>
> I suspect that if we want to guarantee bounded tardiness then we have to
> go for a solution similar to the one suggested by Tommaso some time ago
> (if I remember well):
>
> if we want to create some "second level cpusets" inside a "parent
> cpuset", allowing deadline tasks to be placed inside both the "parent
> cpuset" and the "second level cpusets", then we have to subtract the
> "second level cpusets" maximum utilizations from the "parent cpuset"
> utilization.
>
> I am not sure how difficult it can be to implement this...

Humm...  I am missing some context here.  Nonetheless the approach I
was contemplating was to repeat the current mathematics to all the
root domains accessible from a p->cpus_allowed's flag.  As such we'd
have the same acceptance test but repeated to more than one root
domain.  To do that time can be an issue but the real problem I see is
related to the current DL code.  It is geared around a single root
domain and changing that means meddling in a lot of places.  I had a
prototype that was beginning to address that but decided to gather
people's opinion before getting in too deep.

>
>
> If, instead, we want to allow to guarantee the respect of all the
> deadlines, then we need to have a look at Brandenburg's paper on
> arbitrary affinities:
> https://people.mpi-sws.org/~bbb/papers/pdf/rtsj14.pdf
>

Ouch, that's an extended read...

>
>                         Thanks,
>                                 Luca