Subject: Re: [PATCH] sched/deadline/rtmutex: Fix a PI crash for deadline tasks
From: Xunlei Pang <xlpang@redhat.com>
To: Peter Zijlstra
Cc: Steven Rostedt, linux-kernel@vger.kernel.org, Juri Lelli, Ingo Molnar, Thomas Gleixner
Date: Tue, 12 Apr 2016 11:08:04 +0800
Message-ID: <570C6694.50803@redhat.com>
In-Reply-To: <570A0D37.8080400@redhat.com>

On 2016/04/10 at 16:22, Xunlei Pang wrote:
> On 2016/04/09 at 21:29, Peter Zijlstra wrote:
>> On Sat, Apr 09, 2016 at 11:25:39AM +0800, Xunlei Pang wrote:
>>
>>>> In any case, I just realized we do not in fact provide this guarantee
>>>> (of pointing to a blocked task) that needs a bit more work.
>>> The current patch calls rt_mutex_adjust_prio() before wake_up_q() wakes
>>> the wakee; at that moment the wakee has already been removed from
>>> current's pi_waiters by rt_mutex_slowunlock()->mark_wakeup_next_waiter(),
>>> so rt_mutex_adjust_prio() won't see this wakee. I think this should not
>>> be a problem.
>> No, any wakeup after mark_wakeup_next_waiter() will make the task run.
>> And since we must always consider spurious wakeups, we cannot rely on us
>> (eg. our wake_up_q call) being the one and only.
>>
>> Therefore it is possible and the only thing that stands between us and
>> doom is the fact that the wake_q stuff holds a task reference.
>>
>> But we cannot guarantee that the task we have a pointer to is in fact
>> blocked.
> Does that really matter? The pointer is accessed on behalf of current, and
> current will call rt_mutex_adjust_prio() very soon to update the right
> pointer.
>
> Also, the pointer is used to update current's deadline/runtime; we could
> restore these params in rt_mutex_setprio() for deboost cases. I just
> checked the current code: it does nothing to restore current's
> deadline/runtime when deboosting. Maybe we can leave this job to future
> deadline bandwidth inheritance?
>
> Regards,
> Xunlei

I spotted another issue: we access pi_task without holding any lock of that
task in enqueue_task_dl(). The "dl_prio(pi_task->normal_prio)" condition we
check there is not enough, because pi_task's "dl_period" and "dl_runtime"
can still change under us. If pi_task switches to a !deadline class, its
dl_runtime is cleared to 0 and we hit a forever loop in
replenish_dl_entity() below:

	while (dl_se->runtime <= 0) {
		dl_se->deadline += pi_se->dl_period;
		dl_se->runtime += pi_se->dl_runtime;
	}

or we hit "BUG_ON(pi_se->dl_runtime <= 0);".
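To make the race concrete, the unlocked access looks roughly like this (a
trimmed paraphrase of enqueue_task_dl(), not the exact code; unrelated
branches are elided):

	static void enqueue_task_dl(struct rq *rq, struct task_struct *p, int flags)
	{
		struct task_struct *pi_task = rt_mutex_get_top_task(p);
		struct sched_dl_entity *pi_se = &p->dl;

		/*
		 * Nothing here pins pi_task's scheduling class or its
		 * dl_runtime/dl_period; they can change right after the check.
		 */
		if (pi_task && p->dl.dl_boosted && dl_prio(pi_task->normal_prio))
			pi_se = &pi_task->dl;

		...
		enqueue_dl_entity(&p->dl, pi_se, flags);
		...
	}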
That's all because, without holding any lock of that task, there is no
guarantee. So I'm thinking of adding more members to rt_mutex_waiter (we
don't lose memory, as it is defined on the stack) and using this structure
instead of task_struct as the top waiter (i.e. using get_pi_top_waiter()
instead of get_pi_top_task()), like:

Step 1:
	struct rt_mutex_waiter {
		int prio;
	+	/* Updated under the waiter's pi_lock and the rt_mutex lock */
	+	u64 dl_runtime, dl_period;
	+	/*
	+	 * Copied directly from dl_runtime, dl_period (under the same
	+	 * rt_mutex lock), while holding the owner's pi_lock, rq lock,
	+	 * and the rt_mutex lock.
	+	 */
	+	u64 dl_runtime_copy, dl_period_copy;

Similarly, add a member to task_struct:

	#ifdef CONFIG_RT_MUTEXES
		/* PI waiters blocked on a rt_mutex held by this task */
		struct rb_root pi_waiters;
		struct rb_node *pi_waiters_leftmost;
	+	struct rb_node *pi_waiters_leftmost_copy; /* updated under pi_lock and rq lock */
		/* Deadlock detection and priority inheritance handling */
		struct rt_mutex_waiter *pi_blocked_on;
	#endif

Then we can update "pi_waiters_leftmost_copy" together with "dl_runtime_copy"
and "dl_period_copy" under the rq lock, so that enqueue_task_dl() can access
them without any problem.

Step 2:
We must update "dl_runtime_copy" and "dl_period_copy" under the rt_mutex
lock, because they are copied from "dl_runtime" and "dl_period" of
rt_mutex_waiter. So we add a new update function, to be called whenever we
hold both the rq lock and the rt_mutex lock; it can mainly be implemented in
rt_mutex_setprio() (see the sketch at the end of this mail).

Step 3:
Note that rt_mutex_setprio() can be called without the rtmutex lock held, by
rt_mutex_adjust_prio(); we can add a parameter to indicate not doing the
copy-updating work at that place. The same applies to rt_mutex_setprio()
(add a new waiter parameter and keep the original "prio" parameter). Then we
can do the copy-updating work in mark_wakeup_next_waiter() before unlocking
current's pi_lock, as long as we hold the rq lock, because the rtmutex lock
and the owner's pi_lock are already held.

This can also solve the issue you mentioned, with only a little overhead
involved.
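To illustrate Step 2, here is a minimal sketch of what such an update helper
could look like. The helper name pi_copy_top_waiter() and the exact field and
rb-node names are only assumptions for illustration, not the final patch:

	/*
	 * Illustrative only: snapshot the top pi-waiter for the owner.
	 * Caller must hold the owner's pi_lock, the owner's rq lock and the
	 * rt_mutex lock, so the *_copy fields later read by
	 * enqueue_task_dl() stay stable.
	 */
	static void pi_copy_top_waiter(struct task_struct *owner)
	{
		struct rt_mutex_waiter *top;

		owner->pi_waiters_leftmost_copy = owner->pi_waiters_leftmost;
		if (!owner->pi_waiters_leftmost_copy)
			return;

		top = rb_entry(owner->pi_waiters_leftmost_copy,
			       struct rt_mutex_waiter, pi_tree_entry);
		/* Snapshot the deadline parameters published in Step 1. */
		top->dl_runtime_copy = top->dl_runtime;
		top->dl_period_copy  = top->dl_period;
	}

rt_mutex_setprio() (and, per Step 3, mark_wakeup_next_waiter() once the rq
lock is taken there) would call something like this, and enqueue_task_dl()
would then read the *_copy fields through get_pi_top_waiter() instead of
dereferencing pi_task.

Regards,
Xunlei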