Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67;
Subject: Re: INFO: rcu detected stall in do_idle
To:     Juri Lelli <juri.lelli@redhat.com>
Cc:     luca abeni <luca.abeni@santannapisa.it>,
        Peter Zijlstra <peterz@infradead.org>,
        Thomas Gleixner <tglx@linutronix.de>,
        Juri Lelli <juri.lelli@gmail.com>,
        syzbot <syzbot+385468161961cee80c31@syzkaller.appspotmail.com>,
        Borislav Petkov <bp@alien8.de>,
        "H. Peter Anvin" <hpa@zytor.com>,
        LKML <linux-kernel@vger.kernel.org>, mingo@redhat.com,
        nstange@suse.de, syzkaller-bugs@googlegroups.com, henrik@austad.us,
        Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>,
        Claudio Scordino <claudio@evidence.eu.com>
References: <20181018104713.GC21611@localhost.localdomain>
 <20181018130811.61337932@luca64>
 <20181019113942.GH3121@hirez.programming.kicks-ass.net>
 <20181019225005.61707c64@nowhere>
 <20181024120335.GE29272@localhost.localdomain>
 <20181030104554.GB8177@hirez.programming.kicks-ass.net>
 <20181030120804.2f30c2da@sweethome>
 <2942706f-db18-6d38-02f7-ef21205173ca@redhat.com>
 <20181031164009.GM18091@localhost.localdomain>
 <027899c5-c5ca-b214-2a87-abe17579724a@redhat.com>
 <20181101055512.GO18091@localhost.localdomain>
From:   Daniel Bristot de Oliveira <bristot@redhat.com>
Message-ID: <1bf857dc-d6ac-e505-82bd-dd28449d3a60@redhat.com>
Date:   Fri, 2 Nov 2018 11:00:36 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101
 Thunderbird/60.2.1
MIME-Version: 1.0
In-Reply-To: <20181101055512.GO18091@localhost.localdomain>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk

On 11/1/18 6:55 AM, Juri Lelli wrote:
>> I meant, I am not against the/a fix, i just think that... it is more complicated
>> that it seems.
>>
>> For example: Let's assume that we have a non-rt bad thread A in CPU 0 generating
>> IPIs because of static key update, and a good dl thread B in the CPU 1.
>>
>> In this case, the thread B could run less than what was reserved for it, but it
>> was not causing the interrupts. It is not fair to put a penalty in the thread B.
>>
>> The same is valid for a dl thread running in the same CPU that is receiving a
>> lot of network packets to another application, and other legit cases.
>>
>> In the end, if we want to avoid non-rt threads starving, we need to prioritize
>> them some time, but in this case, we return to the DL server for non-rt threads.
>>
>> Thoughts?
> And I see your point. :-)
> 
> I'd also add (maybe you mentioned this as well) that it seems the same
> could happen with RT throttling safety measure, as we are using
> clock_task there as well to account runtime and throttle stuff.

Yes! The same problem can happen with the rt scheduler as well! I first saw it
with the rt throttling mechanism, when I was trying to make it work at
microsecond granularity (it is only enforced in the scheduler tick, so in
practice it has ms granularity). After using hr timers to do the enforcement at
microsecond granularity, I tried leaving just a few us for the non-rt tasks.
But as the IRQ runtime was higher than those few us, the rt_rq was never
throttled. It is the same/similar behavior we see here.

As we think of rt throttling as "preventing the rt workload from consuming more
than rt_runtime/rt_period", and considering that IRQs are a level of task with a
fixed priority higher than all the real-time related schedulers, i.e., deadline
and rt, we can safely argue that IRQ time belongs to the pool of rt workload and
should be accounted in the rt_runtime. The easiest way to do it is to use
rq_clock() in the measurement. I agree.
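
A toy model of the difference, in Python rather than kernel code. It assumes
that clock_task leaves IRQ time out of the accounting while rq_clock() includes
it; the function name, its parameters and the numbers are made up for
illustration, and the real throttling code is not structured like this:

```python
def simulate_rt_period(period_us, rt_runtime_us, irq_us, charge_irq):
    """Toy model of one rt_period on one CPU running an rt busy loop.

    charge_irq=False mimics accounting with clock_task (IRQ time is not
    charged to the rt budget); charge_irq=True mimics rq_clock().
    Returns (throttled, us_left_for_non_rt). Hypothetical helper.
    """
    irq_us = min(irq_us, period_us)
    # Budget the rt task may still consume after IRQ time is (not) charged.
    budget = max(rt_runtime_us - (irq_us if charge_irq else 0), 0)
    # CPU time that is physically left once the IRQs have run.
    cpu_left = period_us - irq_us
    # The rt task runs until its budget is gone or the period ends.
    rt_ran = min(budget, cpu_left)
    throttled = rt_ran == budget
    return throttled, cpu_left - rt_ran


# 1000 us period, 10 us meant for non-rt, 20 us of IRQs in the period.
print(simulate_rt_period(1000, 990, 20, charge_irq=False))
print(simulate_rt_period(1000, 990, 20, charge_irq=True))
```

In the first call the period ends before the rt budget is exhausted, so the
rt_rq is never throttled and nothing is left for non-rt tasks. In the second,
the IRQ time is charged to the budget and the 10 us survive.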

The point is that the CBS has a dual goal: it prevents a task from running for
more than its runtime (a throttling behavior), but it is also used as a
guarantee of runtime for the case in which the task behaves and the system is
not overloaded. (We can have more load than we can schedule on a
multiprocessor, but that is another story.)

The obvious reasoning here is: Ok boy, but the system IS overloaded in this
case, we have an RCU stall! And that is true if you look at the processor
starving RCU. But if the system has more than one CPU, it could have CPU time
available on another CPU. So, we could just move the dl task from one CPU to
another.

Btw, that is another point. We have the AC with the sum of the utilization of
all CPUs. But we do no enforcement of per-cpu utilization. If one sets up a
single thread with runtime=deadline=period (in a system with more than one CPU)
and runs it in a busy-loop, we will eventually have an RCU stall as well (I just
did it on my box, I got a soft lockup). I know this is a different problem. But,
maybe, there is a general solution for both issues:

For instance, if the sum of the execution time of all "tasks" with priority
higher than the OTHER class (rt, dl, stop_machine, IRQs, NMIs, Hypervisor?) on a
CPU is higher than rt_runtime in the rt_period, we need to avoid what is
"avoidable" by trying to move rt and dl threads away from that CPU. Another
possibility is to bump the priority of the OTHER class (and we are back to the
DL server).
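
A rough sketch of both issues in Python, under simplifying assumptions: the
admission test is reduced to a plain sum of runtime/period against
nr_cpus * rt_runtime/rt_period, the 950000/1000000 us values are assumed to be
the defaults, and overloaded_cpus() is a hypothetical per-CPU check that does
not exist in the kernel.

```python
from fractions import Fraction

RT_RUNTIME_US = 950_000    # assumed default for sched_rt_runtime_us
RT_PERIOD_US = 1_000_000   # assumed default for sched_rt_period_us


def global_ac_admits(dl_tasks, nr_cpus):
    """Simplified global admission test over (runtime, period) pairs."""
    cap = nr_cpus * Fraction(RT_RUNTIME_US, RT_PERIOD_US)
    total = sum(Fraction(runtime, period) for runtime, period in dl_tasks)
    return total <= cap


def overloaded_cpus(usage_us):
    """Hypothetical per-CPU check over one rt_period.

    usage_us maps cpu -> {class name: us consumed}, covering everything
    above the OTHER class. Returns cpu -> (excess_us, movable_us) for the
    CPUs whose total exceeds RT_RUNTIME_US. movable_us is the rt + dl
    share, i.e. the "avoidable" part that could be migrated away.
    """
    result = {}
    for cpu, per_class in usage_us.items():
        total = sum(per_class.values())
        if total > RT_RUNTIME_US:
            movable = per_class.get("rt", 0) + per_class.get("dl", 0)
            result[cpu] = (total - RT_RUNTIME_US, movable)
    return result


# A single busy-loop thread with runtime == period on a 2-CPU box is
# admitted globally (1.0 <= 1.9), yet it saturates the CPU it runs on.
busy_loop = [(10_000, 10_000)]
print(global_ac_admits(busy_loop, nr_cpus=2))

# CPU 0: the busy-loop dl thread. CPU 1: a small dl task plus heavy IRQs.
usage = {0: {"dl": 1_000_000}, 1: {"dl": 200_000, "irq": 800_000}}
print(overloaded_cpus(usage))
```

Both CPUs are flagged by the same check, but on CPU 1 only the 200000 us of dl
time is movable; the IRQ share is not something migration can avoid.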

- Dude, wouldn't it be easier to just change the CBS?

Yeah, but by changing the CBS, we may end up breaking the algorithms/properties
that rely on CBS... like GRUB, user-space/kernel-space synchronization...

> OTOH, when something like you describe happens, guarantees are probably
> already out of the window and we should just do our best to at least
> keep the system "working"? (maybe only to warn the user that something
> bad has happened)

Btw, don't get me wrong, I am not against changing CBS: I am just trying to
raise other viewpoints to avoid touching the base of the DL scheduler, and to
avoid punishing a thread that behaves well.

Anyway, notifying that dl+rt+IRQ time is higher than the rt_runtime is another
good thing to do as well. We will be notified anyway, either by RCU or
softlockup... but those are warnings about side effects. By notifying that we
have an overload of rt-or-higher workload, we will be pointing to the cause.

Thoughts?

-- Daniel