Message-ID: <599d7449ca94a7ef31b406330bd2bb0ad1870f2b.camel@redhat.com>
Subject: Re: KVM's support for non default APIC base
From: Maxim Levitsky
To: Sean Christopherson
Cc: kvm@vger.kernel.org, "open list:X86 ARCHITECTURE (32-BIT AND 64-BIT)",
 Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov, Wanpeng Li,
 Paolo Bonzini, Thomas Gleixner, "H. Peter Anvin", Ingo Molnar,
 "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)"
Date: Mon, 02 Aug 2021 12:20:02 +0300
In-Reply-To: <564fd4461c73a4ec08d68e2364401db981ecba3a.camel@redhat.com>
References: <20210713142023.106183-1-mlevitsk@redhat.com>
 <20210713142023.106183-9-mlevitsk@redhat.com>
 <564fd4461c73a4ec08d68e2364401db981ecba3a.camel@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, 2021-07-22 at 12:12 +0300, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> >
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> >
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()?
> > E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info. It'd
> > also give us a good excuse to rename try_async_pf() :-)
> >
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> >
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess
>
> Hi Sean, Paolo and everyone else:
>
> I am exploring the approach that you proposed and I noticed that we have very inconsistent
> code that handles APIC base relocation for the in-kernel local APIC.
> I do know that APIC base relocation is not supported, and I don't have anything against
> this as long as VMs don't use that feature, but I do want the handling to be consistent.
>
> I did a study of the code that is involved in this mess and I would like to hear your opinion:
>
> There are basically 3 modes of operation of the in-kernel local APIC:
>
> Regular unaccelerated local APIC:
>
> -> The APIC MMIO base address is stored in 'apic->base_address' and updated in
>    kvm_lapic_set_base, which is called on an MSR write of 'MSR_IA32_APICBASE'
>    (by both SVM and VMX).
>    The value is even migrated.
>
> -> APIC MMIO reads/writes are done from the MMU when we access an MMIO page:
>    vcpu_mmio_write always calls apic_mmio_write, which checks if the write is in
>    the apic->base_address page and, if so, forwards the write to the local APIC with
>    the offset into that page.
>
> -> The APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
>    and thus must contain no guest memslots.
>    If the guest relocates the APIC base somewhere where we have a memslot,
>    the memslot will take priority, while on real hardware the LAPIC is likely to
>    take priority.
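[For readers following along, the base-tracking and in-range check described above can be sketched as a self-contained illustration. This is not the kernel's actual code: the constant names mirror the kernel's MSR_IA32_APICBASE definitions, but the base mask here is simplified and ignores MAXPHYADDR.]

```c
#include <assert.h>
#include <stdint.h>

#define APIC_DEFAULT_PHYS_BASE   0xfee00000ULL
#define MSR_IA32_APICBASE_ENABLE (1ULL << 11)
/* Simplified: the real mask extends up to MAXPHYADDR. */
#define MSR_IA32_APICBASE_BASE   0xfffff000ULL

/* How apic->base_address is derived from an IA32_APICBASE value:
 * the base is the page-aligned portion of the MSR. */
static uint64_t apic_base_address(uint64_t msr_val)
{
	return msr_val & MSR_IA32_APICBASE_BASE;
}

/* The range check apic_mmio_write performs before forwarding an
 * access to the local APIC: the access must fall within the 4 KiB
 * page at base_address; the register offset is the low 12 bits. */
static int apic_mmio_in_range(uint64_t base_address, uint64_t addr)
{
	return addr >= base_address && addr < base_address + 0x1000;
}

static uint32_t apic_mmio_offset(uint64_t base_address, uint64_t addr)
{
	return (uint32_t)(addr - base_address);
}
```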
>
> APICv:
>
> -> The default APIC MMIO base (0xfee00000) is covered by a dummy page which is
>    allocated from qemu's process using __x86_set_memory_region.
>
>    This is done once in alloc_apic_access_page, which is called on vcpu creation
>    (but only once, when the memslot is not yet enabled).
>
> -> To avoid pinning this page into qemu's memory, the reference to it
>    is dropped in alloc_apic_access_page
>    (see commit c24ae0dcd3e8695efa43e71704d1fc4bc7e29e9b).
>
> -> kvm_arch_mmu_notifier_invalidate_range checks if we are invalidating GPA 0xfee00000
>    and, if so, raises a KVM_REQ_APIC_PAGE_RELOAD request.
>
> -> kvm_vcpu_reload_apic_access_page handles the KVM_REQ_APIC_PAGE_RELOAD request by calling
>    kvm_x86_ops.set_apic_access_page_addr, which is only implemented on VMX
>    (vmx_set_apic_access_page_addr).
>
> -> vmx_set_apic_access_page_addr does gfn_to_page on the 0xfee00000 GPA,
>    and if the result is valid, writes the physical address of this page to the
>    APIC_ACCESS_ADDR vmcs field.
>
>    (This is a major difference from AVIC: AVIC's avic_vapic_bar is a *GPA*, while
>    APIC_ACCESS_ADDR is a host physical address which the hypervisor is supposed to
>    map at the APIC MMIO GPA using EPT.)
>
>    Note that if we have an error here, we might end up with an invalid APIC_ACCESS_ADDR field.
>
> -> Writes to the HPA of that special page (which has a GPA of 0xfee00000 and is mapped
>    via EPT) go to APICv or cause special VM exits (EXIT_REASON_APIC_ACCESS,
>    EXIT_REASON_APIC_WRITE):
>
>    * EXIT_REASON_APIC_ACCESS (which is used for the older, limited 'flexpriority' mode
>      that practically only emulates the TPR) actually emulates the instruction to learn
>      the written value, but we have a special case in vcpu_is_mmio_gpa which makes the
>      emulation treat accesses to the default APIC base as MMIO.
>
>    * EXIT_REASON_APIC_WRITE is a trap VM exit which comes with full APICv; since it also
>      has the offset in the exit qualification and the value is already in the APIC page,
>      this info is just passed to kvm_lapic_reg_write.
>
> -> If the APIC base is relocated, APICv isn't aware of it, and writes to the new APIC base
>    (assuming that we have no memslots covering it) will go through standard APIC MMIO
>    emulation and *might* create a mess.
>
> AVIC:
>
> -> The default APIC MMIO base (0xfee00000) is also covered by a dummy page which is
>    allocated from qemu's process using __x86_set_memory_region in avic_update_access_page,
>    which is called on vcpu creation (also only once) and from SVM's dynamic AVIC inhibition.
>
> -> The reference to this page is not dropped, thus there is no KVM_REQ_APIC_PAGE_RELOAD
>    handler. I think we should do the same thing we do for APICv here?
>
> -> avic_vapic_bar is a GPA and thus contains 0xfee00000, but writes to MSR_IA32_APICBASE
>    do update it (avic_update_vapic_bar, which is called from the MSR_IA32_APICBASE write
>    in SVM code).
>
>    Thus if the guest relocates the APIC base to a writable memory page, AVIC would actually
>    happen to work (opposite to the stock xAPIC handling, which only works when the APIC
>    base is in an MMIO area).
>
> -> Writes to the GPA in avic_vapic_bar are first checked against the NPT (but the HPA
>    written there is ignored), and then either go to AVIC or cause
>    SVM_EXIT_AVIC_UNACCELERATED_ACCESS, which carries the offset of the write in exit_info_1
>    (there is also SVM_EXIT_AVIC_INCOMPLETE_IPI, which is raised on some ICR writes).
>
> As far as I know, the only good reason to relocate the APIC base is to access it from real
> mode, which is not something that is done these days by modern BIOSes.
>
> I vote to make it read-only (#GP on MSR_IA32_APICBASE write when a non-default base is set
> and the APIC is enabled) and remove all remains of the support for a variable APIC base.
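[For concreteness, the read-only rule proposed above could look roughly like the predicate below. This is a hypothetical standalone sketch, not actual KVM code; the helper name is invented and the base mask is simplified, ignoring MAXPHYADDR.]

```c
#include <assert.h>
#include <stdint.h>

#define APIC_DEFAULT_PHYS_BASE   0xfee00000ULL
#define MSR_IA32_APICBASE_ENABLE (1ULL << 11)
#define MSR_IA32_APICBASE_BASE   0xfffff000ULL  /* simplified mask */

/* Hypothetical predicate for the proposal: a write to
 * MSR_IA32_APICBASE should #GP when it both enables the APIC and
 * moves the base away from the architectural default. Returns
 * nonzero if the write should fault. */
static int apicbase_write_should_gp(uint64_t new_val)
{
	uint64_t base = new_val & MSR_IA32_APICBASE_BASE;
	int enabled = !!(new_val & MSR_IA32_APICBASE_ENABLE);

	return enabled && base != APIC_DEFAULT_PHYS_BASE;
}
```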
> (we already have a warning when the APIC base is set to a non-default value)
>
> Best regards,
> 	Maxim Levitsky
>

Ping.

Best regards,
	Maxim Levitsky