Subject: Re: [patch 3/3] x86: kvm guest side support for KVM_HC_RT_PRIO
 hypercall\
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
        Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>, mingo@redhat.com,
        kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
        Thomas Gleixner <tglx@linutronix.de>
References: <20170922100004.ydmaxvgpc2zx7j25@hirez.programming.kicks-ass.net>
 <20170922105609.deln6kylvvpaijg7@hirez.programming.kicks-ass.net>
 <20170922123305.GB29608@amt.cnet>
 <20170922125556.cyzybj6c7jqypbmo@hirez.programming.kicks-ass.net>
 <951aaa3f-b20d-6f67-9454-f193f4445fc7@redhat.com>
 <20170923134114.qdfdegrd6afqrkut@hirez.programming.kicks-ass.net>
 <855950672.7912001.1506258344142.JavaMail.zimbra@redhat.com>
 <20170925025751.GB30813@amt.cnet>
 <20170925091316.bnwpiscs2bvpdxk5@hirez.programming.kicks-ass.net>
 <00ff8cbf-4e41-a950-568c-3bd95e155d4b@redhat.com>
 <20170926224925.GA9119@amt.cnet>
From: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <6f4afefd-8726-13ff-371e-0d3896b4cf6a@redhat.com>
Date: Wed, 27 Sep 2017 11:37:48 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
 Thunderbird/52.3.0
MIME-Version: 1.0
In-Reply-To: <20170926224925.GA9119@amt.cnet>
Content-Type: text/plain; charset=utf-8
Content-Language: en-US
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 5398
Lines: 142

On 27/09/2017 00:49, Marcelo Tosatti wrote:
> On Mon, Sep 25, 2017 at 05:12:42PM +0200, Paolo Bonzini wrote:
>> On 25/09/2017 11:13, Peter Zijlstra wrote:
>>> On Sun, Sep 24, 2017 at 11:57:53PM -0300, Marcelo Tosatti wrote:
>>>> I think you are missing the following point:
>>>>
>>>> "vcpu0 can be interrupted when its not in a spinlock protected section, 
>>>> otherwise it can't."
>>
>> Who says that?  Certainly a driver can dedicate a single VCPU to
>> periodic polling of the device, in such a way that the polling does not
>> require a spinlock.
> 
> This sequence:
> 
> 
> VCPU-0					VCPU-1 (running realtime workload)
> 
> takes spinlock A
> scheduled out				
> 					spinlock(A) (busy spins until
> 						     VCPU-0 is scheduled
> 						     back in)
> scheduled in
> finishes execution of 
> code under protected section
> releases spinlock(A)			
> 
> 					takes spinlock(A)

Yes, but then you have

                                        busy waits for flag to be set
  polls device
  scheduled out
  device receives data
                                        ...
  scheduled in
  set flag

or

  check for work to do
  scheduled out
                                        submit work
                                        busy waits for work to complete
  scheduled in
  do work

None of which have anything to do with spinlocks.  They're just
different ways to get priority inversion, and this...

>>>> So this point of "everything needs to be RT and the priorities must be
>>>> designed carefully", is this: 
>>>>
>>>> 	WHEN in spinlock protected section (more specifically, when 
>>>> 	spinlock protected section _shared with realtime vcpus_),
>>>>
>>>> 	priority of vcpu0 > priority of emulator thread
>>>>
>>>> 	OTHERWISE
>>>>
>>>> 	priority of vcpu0 < priority of emulator thread.
>>
>> This is _not_ designed carefully, this is messy.
> 
> This is very precise to me. What is "messy" about it? (its clearly
> defined).

... it's not how you design RT systems to avoid priority inversion.
It's just solving an instance of the issue.

>> The emulator thread can interrupt the VCPU thread, so it has to be at
>> higher RT priority (+ priority inheritance of mutexes).  
> 
> It can only do that _when_ the VCPU thread is not running a critical
> section which a higher priority task depends on.

All VCPUs must have the same priority.  There's no such thing as a
housekeeping VCPU.

There could be one sacrificial VCPU that you place on the same physical
CPU as the emulator thread, but that's it.  The whole system must run
smoothly even if you place those on the same physical CPU, without any
PV hacks.

> So when you say "The emulator thread can interrupt the VCPU thread", 
> you're saying that it has to be modified to interrupt for a maximum
> amount of time (say 15us).
> 
> Is that what you are suggesting?

This is correct.  But I think you are missing a fundamental point, that
is not specific to virt.  If a thread 1 can trigger an event in thread
2, and thread 2 runs at priority N, thread 1 must be placed at priority >N.

In this case, the emulator thread can signal I/O completion to the VCPU:
it doesn't matter if the VCPU is polling or using interrupts, the
emulator thread must be placed at higher priority than the VCPU.  This
is the root cause of the SeaBIOS issue that you were seeing.  Yes, it
was me who suggested moving VCPU0 and the emulator thread to
SCHED_NORMAL, but that was just a bandaid until the real fix was done
(which is to set up SCHED_FIFO priorities correctly and use PI mutexes).

In fact, this is not specific to the emulator thread.  It applies just
as well to vhost threads, to QEMU iothreads, even to the KVM PIT
emulation thread if the guest were using it.

> Paolo, you don't control how many interruptions of the emulator thread
> happen per second.

Indeed these non-VCPU threads should indeed run rarely and for a small
amount of time only to achieve bounded latencies in the VCPUs.  But if
they don't, those are simply bugs and we fix them.  In fact, sources of
frequent interruptions have all been fixed or moved outside the main
thread; for example, disks can use separate iothreads.  Configuration
problems are also bugs, just in a different place.

For a very basic VM that I've just tried (whatever QEMU does by default,
only tweak was not using the QEMU GUI), there is exactly one
interruption per second when the VM is idle.  That interruption in turn
is caused by udev periodically probing the guest CDROM (!).  So that's a
configuration problem: remove the CDROM, and it does one interruption
every 10-30 seconds, again all of them guest triggered.

Again: if you have many interruptions, it's not a flaw in KVM or QEMU's
design, it's just that someone is doing something stupid.  It could be
the guest (e.g. unnecessary devices or daemons as in the example above),
QEMU (e.g. the RTC emulation used to trigger QEMU timers twice a second
just to increment the clock), or the management (e.g. polling "is the VM
running" 50 times per second).  But it can and must be fixed.

The design should be the basic one: attribute the right priority to each
thread and things just work.

Paolo

> So if you let the emulator thread interrupt the
> emulator thread at all times, without some kind of bounding 
> of these interruptions per time unit, you have a similar
> problem as (*) (where the realtime task is scheduled).
>