Date: Fri, 23 Sep 2022 15:21:45 +0200
From: David Hildenbrand
Organization: Red Hat
To: Emanuele Giuseppe Esposito, Sean Christopherson, Paolo Bonzini,
    Maxim Levitsky
Cc: kvm@vger.kernel.org, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
    Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    Dave Hansen, x86@kernel.org, "H. Peter Anvin",
    linux-kernel@vger.kernel.org, Like Xu
Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates
References: <20220909104506.738478-1-eesposit@redhat.com>
 <5f0345d2-d4d1-f4fe-86ba-6e22561cb6bd@redhat.com>
 <37b3162e-7b3a-919f-80e2-f96eca7d4b4c@redhat.com>
 <55d7f0bd-ace1-506b-ea5b-105a86290114@redhat.com>
In-Reply-To: <55d7f0bd-ace1-506b-ea5b-105a86290114@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 23.09.22 15:10, Emanuele Giuseppe Esposito wrote:
>
> Am 19/09/2022 um 19:30 schrieb David Hildenbrand:
>> On 19.09.22 09:53, David Hildenbrand wrote:
>>> On 18.09.22 18:13, Emanuele Giuseppe Esposito wrote:
>>>>
>>>> Am 09/09/2022 um 16:30 schrieb Sean Christopherson:
>>>>> On Fri, Sep 09, 2022, Emanuele Giuseppe Esposito wrote:
>>>>>> KVM is currently capable of receiving a single memslot update through
>>>>>> the KVM_SET_USER_MEMORY_REGION ioctl.
>>>>>> The problem arises when we want to atomically perform multiple
>>>>>> updates, so that readers of the active memslot list avoid seeing
>>>>>> incomplete states.
>>>>>>
>>>>>> For example, in RHBZ
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1979276
>>>>>
>>>>> I don't have access.  Can you provide a TL;DR?
>>>>
>>>> You should be able to have access to it now.
>>>>
>>>>>> we see how non-atomic updates cause boot failure, because vCPUs
>>>>>> will see a partial update (old memslot deleted, new one not yet
>>>>>> created) and will crash.
>>>>>
>>>>> Why not simply pause vCPUs in this scenario?  This is an awful lot
>>>>> of complexity to take on for something that appears to be solvable
>>>>> in userspace.
>>>>
>>>> I think it is not that easy to solve in userspace: see
>>>> https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/
>>>>
>>>> "Using pause_all_vcpus()/resume_all_vcpus() is not possible, as it will
>>>> temporarily drop the BQL - something most callers can't handle (esp.
>>>> when called from vcpu context e.g., in virtio code)."
>>>
>>> Can you please comment on the bigger picture? The patch from me works
>>> around *exactly that*, and for that reason, contains that comment.
>>
>> FWIW, I hacked up my RFC to perform atomic updates on any memslot
>> transactions (not just resizes) where ranges to add overlap with ranges
>> to remove.
>>
>> https://github.com/davidhildenbrand/qemu/tree/memslot
>>
>> I only performed a simple boot check under x86-64 (where I can see region
>> resizes) and some make checks -- pretty sure it has some rough edges, but
>> it should indicate what's possible and what the possible price might be.
>> [one could wire up a new KVM ioctl and call it conditionally on support
>> if really required]
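
To make the failure mode under discussion concrete, below is a minimal
sketch of the non-atomic sequence the thread is about: with the current
one-region-per-ioctl API, replacing a memslot means deleting it in one
KVM_SET_USER_MEMORY_REGION call and re-creating it in a second one, so a
vCPU that touches the range in between sees neither slot. The slot number,
sizes and helper name are made up for illustration; only struct
kvm_userspace_memory_region and the KVM_SET_USER_MEMORY_REGION ioctl are
actual KVM API.

#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <stdint.h>

/*
 * Illustrative only: replace (e.g. resize) a guest memory slot with the
 * existing one-slot-per-ioctl interface.  Between the two ioctls the
 * guest physical range is not backed by any memslot, which is the window
 * in which a running vCPU can observe a partial state and crash the guest.
 */
static int resize_slot_nonatomic(int vm_fd, uint32_t slot, uint64_t gpa,
                                 uint64_t new_size, void *new_hva)
{
    struct kvm_userspace_memory_region region = {
        .slot            = slot,
        .guest_phys_addr = gpa,
        .memory_size     = 0,            /* size 0 deletes the slot */
    };

    if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region) < 0)
        return -1;

    /* <-- vCPUs may run here and fault on the now-unbacked range --> */

    region.memory_size    = new_size;    /* re-create with the new size */
    region.userspace_addr = (uintptr_t)new_hva;
    return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}

Roughly speaking, both proposals in this thread close that window:
Emanuele's by letting a single ioctl carry several such updates, David's
by keeping the vCPUs out of KVM_RUN while QEMU issues the individual calls.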
>
> A benefit of my ioctl implementation is that it could also be used by
> other hypervisors, which then would not need to share this kind of "hack".
> However, after also talking with Maxim and Paolo, we all agreed that the
> main disadvantage of your approach is that it does not scale with the
> number of vCPUs. It is also suboptimal to stop *all* vCPUs just to allow a
> memslot update (KVM only pauses the vCPUs that access the modified
> memslots instead).
>
> So I took some measurements to see what the performance difference is
> between my implementation and yours. I used a machine where I could
> replicate the bug mentioned in the bugzilla, an AMD EPYC 7413 24-core
> processor with kernel 5.19.0 (+ my patches).
>
> Then I measured the time QEMU spends in kvm_commit (i.e. in memslot
> updates) while booting a VM. In other words, if kvm_commit takes 10 ms and
> QEMU calls it 20 times, "time to boot" is 200 ms. kvm_commit is not called
> anymore after boot, so this measurement is easy to compare over multiple
> invocations of QEMU.
>
> I ran the tests with different numbers of cores: 1, 2, 4, 8, 16, 24. The
> QEMU command line is the same used to replicate the bug:
> ./qemu-system-x86_64 --overcommit cpu-pm=on --smp $v --accel kvm
> --display none >> ~/smp_$v;
>
> Each boot is repeated 100 times, and from the results I compute the
> average and stddev (in milliseconds).
>
> ioctl:
> -smp 1:  Average: 2.1ms   Stdev: 0.8ms
> -smp 2:  Average: 2.5ms   Stdev: 1.5ms
> -smp 4:  Average: 2.2ms   Stdev: 1.1ms
> -smp 8:  Average: 2.4ms   Stdev: 0.7ms
> -smp 16: Average: 3.6ms   Stdev: 2.4ms (1000 repetitions)
> -smp 24: Average: 12.5ms  Stdev: 0.9ms (1000 repetitions)
>
> pause/resume: (https://github.com/davidhildenbrand/qemu/tree/memslot)
> -smp 1:  Average: 2.2ms   Stdev: 1.2ms
> -smp 2:  Average: 3.0ms   Stdev: 1.4ms
> -smp 4:  Average: 3.1ms   Stdev: 1.3ms
> -smp 8:  Average: 3.4ms   Stdev: 1.4ms
> -smp 16: Average: 12ms    Stdev: 7.0ms (1000 repetitions)
> -smp 24: Average: 20ms    Stdev: 7.3ms (1000 repetitions)
>
> Above 24 vCPUs performance gets worse quickly, but I think it is already
> quite clear that the results for the ioctl scale better as the number of
> vCPUs increases, while pausing the vCPUs becomes slower already with 16
> vCPUs.

Right, the question is if it happens sufficiently often that we even care,
and if there are not ways to mitigate.

It doesn't necessarily have to scale with the #VCPUs, I think. What should
dominate the overall time in theory is how long it takes for one VCPU (the
slowest one) to leave the kernel.

I wondered if

1) it might be easier to have a single KVM mechanism/call to kick all VCPUs
out of KVM instead of doing it per VCPU. That alone might already speed
things up considerably.

2) the biggest overhead is actually taking the locks in QEMU rather than
actually waiting for the VCPUs. Maybe we could optimize that as well. (for
now I use one lock per VCPU because it felt like it would reduce the ioctl
overhead; maybe there is a better alternative to balance between both
users)

So treat my patch as a completely unoptimized version.

-- 
Thanks,

David / dhildenb
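
For reference on point 1) above, getting the vCPUs out of the kernel today
costs roughly one syscall per vCPU. A rough sketch of such a per-vCPU kick
from userspace follows; the struct vcpu bookkeeping and SIG_VCPU_KICK are
hypothetical names a VMM might use, while kvm_run->immediate_exit and
KVM_RUN returning -EINTR on a pending unblocked signal are documented KVM
behaviour.

#include <linux/kvm.h>
#include <pthread.h>
#include <signal.h>

/* Assumption: the VMM reserves this signal and keeps it unblocked in the
 * vCPU threads while they sit in ioctl(KVM_RUN). */
#define SIG_VCPU_KICK SIGUSR1

/* Hypothetical per-vCPU state kept by the VMM. */
struct vcpu {
    pthread_t thread;       /* thread that calls ioctl(KVM_RUN)    */
    struct kvm_run *run;    /* mmap'ed kvm_run area of this vCPU   */
};

/*
 * Kick every vCPU out of KVM_RUN, one signal per vCPU: setting
 * immediate_exit catches threads that have not (re-)entered the kernel
 * yet, the signal interrupts threads already inside KVM_RUN (the ioctl
 * returns -1 with errno == EINTR).  A single KVM-level "kick them all"
 * call, as suggested above, would replace this loop with one syscall.
 */
static void kick_all_vcpus(struct vcpu *vcpus, int nr_vcpus)
{
    for (int i = 0; i < nr_vcpus; i++) {
        vcpus[i].run->immediate_exit = 1;
        pthread_kill(vcpus[i].thread, SIG_VCPU_KICK);
    }
}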