Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp1756791rwd; Fri, 9 Jun 2023 01:28:10 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ4agZcsJcynyKeuNHnS4Yohi0iD0kfo6JzvY34x5MJXY8DwuymBY8i/WOoJSom5SVmKzs5G X-Received: by 2002:a05:620a:28c9:b0:75b:23a0:dece with SMTP id l9-20020a05620a28c900b0075b23a0decemr540037qkp.76.1686299290726; Fri, 09 Jun 2023 01:28:10 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686299290; cv=none; d=google.com; s=arc-20160816; b=VqI6u3NS7oEjmuzxg4KZUPybt6aRgHwMFkVdovQXpEx1SQTjsCZS8mi2DNPL9eA81U FbDMXLwm88HiuqIViroIHIHaYZBvqP93wa2fXJ07RoJS2fmlDCxC34zTxDme2F/JttYK R7oYpWTb9fKG+wnDNY6pRYM5oKIf2v+cGr56FX4O/QoLCWJJySm3b39FPofE713zEwmU CFJ4V6M+5vosRoqNjedIl29xzRXQS9zEeOwbqv2sZwrJUPxZ/oAyJDr4jWbo9UByGM7X A7RWwinYGvjwbxTpGk1tIr0q+ZDOMCnGYkNhrpp/R65u7vX5LGUd5/yvGqN1F+pk9RxJ sD/Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:user-agent:references:in-reply-to :subject:cc:to:from:message-id:date:dkim-signature; bh=Ua2g7PD+OGr+gd6GTike8CF6BfAEfnCtyIWum8my1XA=; b=Em8Vbt3FmVGNuKYvDtVqK1LrcRfGheNcB59sPkFuuEzEhmpb+FjRLyAz3jBU19eyjm 5RhV2sutmaWHqqBULTt8yEPEcX0Rmy0IzbkhZRhZDxtcAiWeqjPzYQOdvBkpXDCY0jOp dGKWdBI2sdBojsi9ajNHcBqTKC6oGgKombYxQFw9s/iyxIf/UmOUMg9Ig1RINre97T/S gPXn/BovuTQBqbSNfd0Bba80IVVRbAd6edcb9b8O8Nj9d3NNhys7/eO1KbYTvEcNPlXQ sgJOI2n3G0asHQlcdDWvXs4q/DB9Cnvz4Nc7TPJShTDy7O6R+2W+VZ3H7YvezgC9DadL 9ftg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=rEnpL7Q7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id h22-20020a17090aa89600b002562c783d8asi2336836pjq.85.2023.06.09.01.27.55; Fri, 09 Jun 2023 01:28:10 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=rEnpL7Q7; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S240117AbjFIIHf (ORCPT + 99 others); Fri, 9 Jun 2023 04:07:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43716 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S240127AbjFIIHN (ORCPT ); Fri, 9 Jun 2023 04:07:13 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 24F4C3AAB; Fri, 9 Jun 2023 01:06:42 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 617D565454; Fri, 9 Jun 2023 08:06:41 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C1AA5C4339B; Fri, 9 Jun 2023 08:06:40 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1686298000; bh=GlOUNYh02FN/IPnOkb4OtR6mtW+ZvDYMAdY8/TixCYk=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=rEnpL7Q7W4NHTV6E32FIBEBeqDEFEPgddUNNINHnS2sfCgLZTqld4RhsxHrfESZfs NqFEptFLCYpc70mNt17+Nz7RyxafRIlv3ScetTI4OwCwYXGTZi3KcMaSbEs/z3n27/ AKSyE8ZsfH/P+Ydic+8Y8fX9w0OcoJMjpc2AjntpBd0H7A4n82d7+nvWuScifnbAkH pFrvUt4L2mRIOvLl8rmi4lU05m+zhvxJkNXgOecrax9YyCiU4RQVT5W+oBrxjEfpHw GxOhrdk+0ncR5eOk/FUd98mCanrHgVUjjqDD67j0TJ72kHCvdd6DIuj1sr45pXvn17 CNj+XpaGqrZlg== Received: from 152.5.30.93.rev.sfr.net ([93.30.5.152] helo=wait-a-minute.misterjones.org) by disco-boy.misterjones.org with esmtpsa (TLS1.3) tls TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384 (Exim 4.95) (envelope-from ) id 1q7X8o-0040hX-Dn; Fri, 09 Jun 2023 09:06:38 +0100 Date: Fri, 09 Jun 2023 09:06:36 +0100 Message-ID: <877csdnjoz.wl-maz@kernel.org> From: Marc Zyngier To: Gavin Shan Cc: kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, oliver.upton@linux.dev, hshuai@redhat.com, zhenyzha@redhat.com, shan.gavin@gmail.com, Andrea Arcangeli , Peter Xu , David Hildenbrand Subject: Re: [PATCH] KVM: Avoid illegal stage2 mapping on invalid memory slot In-Reply-To: <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com> References: <20230608090348.414990-1-gshan@redhat.com> <87bkhqnhzm.wl-maz@kernel.org> <695fe5a2-a295-4105-b31b-4cc92b089659@redhat.com> User-Agent: Wanderlust/2.15.9 (Almost Unreal) SEMI-EPG/1.14.7 (Harue) FLIM-LB/1.14.9 (=?UTF-8?B?R29qxY0=?=) APEL-LB/10.8 EasyPG/1.0.0 Emacs/28.2 (x86_64-pc-linux-gnu) MULE/6.0 (HANACHIRUSATO) MIME-Version: 1.0 (generated by SEMI-EPG 1.14.7 - "Harue") Content-Type: text/plain; charset=US-ASCII X-SA-Exim-Connect-IP: 93.30.5.152 X-SA-Exim-Rcpt-To: gshan@redhat.com, kvmarm@lists.linux.dev, kvm@vger.kernel.org, linux-kernel@vger.kernel.org, pbonzini@redhat.com, seanjc@google.com, oliver.upton@linux.dev, hshuai@redhat.com, zhenyzha@redhat.com, shan.gavin@gmail.com, aarcange@redhat.com, peterx@redhat.com, david@redhat.com X-SA-Exim-Mail-From: maz@kernel.org X-SA-Exim-Scanned: No (on disco-boy.misterjones.org); SAEximRunCond expanded to false X-Spam-Status: No, score=-4.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, 09 Jun 2023 00:17:34 +0100, Gavin Shan wrote: > > Hi Marc, > > [Cc Andrea/David/Peter Xu] > > On 6/9/23 12:31 AM, Marc Zyngier wrote: > > On Thu, 08 Jun 2023 10:03:48 +0100, > > Gavin Shan wrote: > >> > >> We run into guest hang in edk2 firmware when KSM is kept as running > >> on the host. The edk2 firmware is waiting for status 0x80 from QEMU's > >> pflash device (TYPE_PFLASH_CFI01) during the operation for sector > >> erasing or buffered write. The status is returned by reading the > >> memory region of the pflash device and the read request should > >> have been forwarded to QEMU and emulated by it. Unfortunately, the > >> read request is covered by an illegal stage2 mapping when the guest > >> hang issue occurs. The read request is completed with QEMU bypassed and > >> wrong status is fetched. > >> > >> The illegal stage2 mapping is populated due to same page mering by > >> KSM at (C) even the associated memory slot has been marked as invalid > >> at (B). > >> > >> CPU-A CPU-B > >> ----- ----- > >> ioctl(kvm_fd, KVM_SET_USER_MEMORY_REGION) > >> kvm_vm_ioctl_set_memory_region > >> kvm_set_memory_region > >> __kvm_set_memory_region > >> kvm_set_memslot(kvm, old, NULL, KVM_MR_DELETE) > >> kvm_invalidate_memslot > >> kvm_copy_memslot > >> kvm_replace_memslot > >> kvm_swap_active_memslots (A) > >> kvm_arch_flush_shadow_memslot (B) > >> same page merging by KSM > >> kvm_mmu_notifier_change_pte > >> kvm_handle_hva_range > >> __kvm_handle_hva_range (C) > >> > >> Fix the issue by skipping the invalid memory slot at (C) to avoid the > >> illegal stage2 mapping. Without the illegal stage2 mapping, the read > >> request for the pflash's status is forwarded to QEMU and emulated by > >> it. The correct pflash's status can be returned from QEMU to break > >> the infinite wait in edk2 firmware. > > > > Huh, nice one :-(. > > > > Yeah, it's a sneaky one :) > > >> > >> Cc: stable@vger.kernel.org # v5.13+ > >> Fixes: 3039bcc74498 ("KVM: Move x86's MMU notifier memslot walkers to generic code") > >> Reported-by: Shuai Hu > >> Reported-by: Zhenyu Zhang > >> Signed-off-by: Gavin Shan > >> --- > >> virt/kvm/kvm_main.c | 3 +++ > >> 1 file changed, 3 insertions(+) > >> > >> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > >> index 479802a892d4..7f81a3a209b6 100644 > >> --- a/virt/kvm/kvm_main.c > >> +++ b/virt/kvm/kvm_main.c > >> @@ -598,6 +598,9 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm, > >> unsigned long hva_start, hva_end; > >> slot = container_of(node, struct > >> kvm_memory_slot, hva_node[slots->node_idx]); > >> + if (slot->flags & KVM_MEMSLOT_INVALID) > >> + continue; > >> + > >> hva_start = max(range->start, slot->userspace_addr); > >> hva_end = min(range->end, slot->userspace_addr + > >> (slot->npages << PAGE_SHIFT)); > > > > I don't immediately see what makes it safer. If we're not holding one > > of slots_{,arch_}lock in the notifier, we can still race against the > > update, can't we? I don't think holding the srcu lock helps us here. > > [...] > change_pte() is always surrounded by invalidate_range_start and > invalidate_range_end(). It means kvm->mn_active_invalidate_count is always > larger than zero when change_pte() is called. With this condition > (kvm->mn_active_invalidate_count > 0), The swapping between the inactive > and active memory slots by kvm_swap_active_memslots() can't be done. > So there are two cases for one memory slot when change_pte() is called: > (a) it has been marked as KVM_MEMSLOT_INVALID in the active memory slots > by kvm_invalidate_memslot(), called before invalidate_range_start(); > (b) the memory slot has been deleted from the active memory slots. We're > only concerned by (a) when the active memory slots are iterated in > __kvm_handle_hva_range(). OK, so to sum it up: - the memslot cannot be swapped while we're walking the active memslots because kvm->mn_active_invalidate_count is elevated, and kvm_swap_active_memslots() will busy loop until this has reached 0 again - holding the srcu read_lock prevents an overlapping memslot update from being published at the wrong time (synchronize_srcu_expedited() in kvm_swap_active_memslots()) If the above holds, then I agree the fix looks correct. I'd definitely want to see some of this rationale captured in the commit message though. Thanks, M. > > static void kvm_mmu_notifier_change_pte(...) > { > : > /* > * .change_pte() must be surrounded by .invalidate_range_{start,end}(). > * If mmu_invalidate_in_progress is zero, then no in-progress > * invalidations, including this one, found a relevant memslot at > * start(); rechecking memslots here is unnecessary. Note, a false > * positive (count elevated by a different invalidation) is sub-optimal > * but functionally ok. > */ > WARN_ON_ONCE(!READ_ONCE(kvm->mn_active_invalidate_count)); > if (!READ_ONCE(kvm->mmu_invalidate_in_progress)) > return; > : > } > > > The srcu lock in __kvm_handle_hva_range() prevents the swapping of > the active and inactive memory slots by kvm_swap_active_memslots(). For > this particular case, it's not relevant because the swapping between > the inactive and active memory slots has been done for once, before > invalidate_range_start() is called. -- Without deviation from the norm, progress is not possible.