Received: by 2002:a05:6358:1087:b0:cb:c9d3:cd90 with SMTP id j7csp1272687rwi; Wed, 26 Oct 2022 13:05:31 -0700 (PDT) X-Google-Smtp-Source: AMsMyM6ukHC1bU7um0nhF/g988/uPLNcabOiJPJyaf7m9l8eq8BfgxLB3MrAGp91GfF4hPGT5T/8 X-Received: by 2002:a17:906:3e01:b0:781:6462:700f with SMTP id k1-20020a1709063e0100b007816462700fmr39623055eji.307.1666814730989; Wed, 26 Oct 2022 13:05:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666814730; cv=none; d=google.com; s=arc-20160816; b=ijC+RgFtPMYXIWpgACXQl0vFdVlbE5vwOZO2KccXqmxV/s1l4uP8myFvuVLb0IcTCa Rwb+52OPlCDtVPlw01qB0OXRclly6IgUo0JpOruYfvbbg2QVTcTWXcih/VuG+2vnhFq7 r3eF+u+4ZGnEGPwJ/1K9GXdbqOfh6Pag18i7swbBbOKWoyCU1dQHO6zO2EAbv22IE/oJ Ba37KWvjAuED5UVbI7y8m96JuCirHsH6r0+sNtDmccHYMBVZHhGiuB4OYSnN5V7FcXDk Ewu6iEZny1DtdxpQhMwUTXi6Ef/8+pvAaG+2DRWYmf1edomJJsr/EeXJsbkfB4+jxkeQ A7WQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=jGjEp8J6XveduvHsdvLy4QcugylDX5h7FyuVeFj9/ag=; b=zqs8miT5/O6T/TFtMZaVeAdpBcjFfYGumQwq7zuA5jViYOAjqyZ5LVI4sTmq7ahd5j c0MTGZU/8uO5pV5LNNTqCNHMTFEV3zWcGDfCz79wTysIbjZ9pzLqNBv+UcF/H6Iwx41k WQYzXg2VHTFw955kwbH5dEVplMC7UNro323Gzl+QodMD4ZmyB88EjL+pv0wSX0MTCuDt kopVexHXXbtI9aGsV+0r9FP6tnvnFtzZxG6Yuss7IRkiUH32xvjPJHCCj2wirLt3paxM 6SHRuGJAnFegXKskxjya4NOBgnk03DsH+Je/awxDLVsXPCrG2/23Z1qeXlAdEV3d+tLZ b83w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=bIJBCxmZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qf14-20020a1709077f0e00b007aaf5a45f5asi6573519ejc.47.2022.10.26.13.04.44; Wed, 26 Oct 2022 13:05:30 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b=bIJBCxmZ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234113AbiJZTda (ORCPT + 99 others); Wed, 26 Oct 2022 15:33:30 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:39668 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S230090AbiJZTdZ (ORCPT ); Wed, 26 Oct 2022 15:33:25 -0400 Received: from mail-pl1-x62f.google.com (mail-pl1-x62f.google.com [IPv6:2607:f8b0:4864:20::62f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0358197EFD for ; Wed, 26 Oct 2022 12:33:22 -0700 (PDT) Received: by mail-pl1-x62f.google.com with SMTP id io19so10221324plb.8 for ; Wed, 26 Oct 2022 12:33:21 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=jGjEp8J6XveduvHsdvLy4QcugylDX5h7FyuVeFj9/ag=; b=bIJBCxmZ8BtTAakelkNL7MAnGlWTzIee1l87zbz64oAh2qKROAwI8gemqHkXLx8oA6 /tkgzX7RYgja7idwCJDgCr3u2YnDsw1Rad4YVTWE6psgL4ux4GLoC/BbArA+0gE/EXHa GedtvmLeWsTzHHz+g0qYCPVjHpSV2bY4c0jRkHHKlV0pDj61mqSFnel7Qn4Fm325L3zV Wb/kPJRMYwopVsA2LH9ibigkenj+uIx7wTy/di5tqLY5os65c0+WIf3Nj/vIi4YIytJU W3n93LiGaMKjfxnfoWKDV/cu6D/qFmRCi9g6hI2wkOiyVsiOVfdrLOg4n1dBaAIhZFQ2 MOpg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=jGjEp8J6XveduvHsdvLy4QcugylDX5h7FyuVeFj9/ag=; b=eJk+0JxVTmtRc+LbRsxzrGibmC+XYGSWz2/00tMT9UGuwf2Hs5PmjGrZJPtd71Lyed TnF9WH/BUEt9c5HZsJnVCWlhgjYoINMgsTXh8vIGQxgTRELt2aXIvPZdP1WDdqq94MmS /hcLf7K1erNSBPAsGv9+05FuFvWViEZuMZmYa0nW/4IJRGqaJxaLFkTHthigJciSVHAC kEgcfYX/gLOYsFTgA6Mh+nxmsCp62qrZTsFaen63B+PBKxmHpadEDz/3RJ2bjVEhnFaR Zq8Y9GXfdo0VJNar+rl1rrbvsi1DZF4OlqBdT0WUXFbKpPAa3223aaYmjudJGRkNc8fd tSsA== X-Gm-Message-State: ACrzQf3LpCf2Ia7VQfzw29I2HO8uS1DeassR4wUXjf+CZeNhXkuixhu/ sRoXV7qLlp+tni6j52j6nNF1IQ== X-Received: by 2002:a17:90b:4d0d:b0:1fb:a86d:e752 with SMTP id mw13-20020a17090b4d0d00b001fba86de752mr5948777pjb.120.1666812801340; Wed, 26 Oct 2022 12:33:21 -0700 (PDT) Received: from google.com (7.104.168.34.bc.googleusercontent.com. [34.168.104.7]) by smtp.gmail.com with ESMTPSA id y4-20020a17090a390400b00212735c8898sm1429389pjb.30.2022.10.26.12.33.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 26 Oct 2022 12:33:20 -0700 (PDT) Date: Wed, 26 Oct 2022 19:33:17 +0000 From: Sean Christopherson To: Paolo Bonzini Cc: Christian Borntraeger , Emanuele Giuseppe Esposito , kvm@vger.kernel.org, Jonathan Corbet , Maxim Levitsky , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , David Hildenbrand , x86@kernel.org, "H. Peter Anvin" , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm) Message-ID: References: <20221022154819.1823133-1-eesposit@redhat.com> <2701ce67-bfff-8c0c-4450-7c4a281419de@redhat.com> <384b2622-8d7f-ce02-1452-84a86e3a5697@linux.ibm.com> <5a26c107-9ab5-60ee-0e9c-a9955dfe313d@redhat.com> <02c910bb-3ea0-fa84-7a1c-92fb9e8b03de@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Oct 26, 2022, Paolo Bonzini wrote: > On 10/26/22 01:07, Sean Christopherson wrote: > > > > - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT > > > > > > Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it > > > a guest bug to access the memory (i.e. ignore the strange read-only changes > > > which only happen at boot, and which I agree are QEMU-specific)? > > > > Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is. > > It is not possible to atomically read the dirty bitmap and delete a memslot. > When you delete a memslot, the bitmap is gone. In this case however memory > accesses to the deleted memslot are a guest bug, so stopping KVM-GT would > not be necessary. If accesses to the deleted memslot are a guest bug, why do you care about pausing vCPUs? I don't mean to be beligerent, I'm genuinely confused. > So while I'm being slowly convinced that QEMU should find a way to pause its > vCPUs around memslot changes, I'm not sure that pausing everything is needed > in general. > > > > > And because of the nature of KVM, to support this API on all architectures, KVM > > > > needs to make change on all architectures, whereas userspace should be able to > > > > implement a generic solution. > > > > > > Yes, I agree that this is essentially just a more efficient kill(). > > > Emanuele, perhaps you can put together a patch to x86/vmexit.c in > > > kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in > > > a for(;;) busy wait, to measure the various ways to do it? > > > > I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide > > a more performant solution, something else entirely? > > Well, a bit of all of them and perhaps that's the problem. And while the > issues at hand *are* self-inflicted wounds on part of QEMU, it seems to me > that the underlying issues are general. > > For example, Alex Graf and I looked back at your proposal of a userspace > exit for "bad" accesses to memory, wondering if it could help with Hyper-V > VTLs too. To recap, the "higher privileged" code at VTL1 can set up VM-wide > restrictions on access to some pages through a hypercall > (HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be able to > access those pages. The hypercall would be handled in userspace and would > invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions, > and this ioctl would set up a VM-wide permission bitmap that would be used > when building page tables. > > Using such a bitmap instead of memslots makes it possible to cause userspace > vmexits on VTL mapping violations with efficient data structures. And it > would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to > read the KVM dirty bitmap just before removing a memslot. What exactly is the behavior you're trying to achieve for KVM_GET_DIRTY_LOG => delete? If KVM provides KVM_EXIT_MEMORY_FAULT, can you not achieve the desired behavior by doing mprotect(PROT_NONE) => KVM_GET_DIRTY_LOG => delete? If PROT_NONE causes the memory to be freed, won't mprotect(PROT_READ) do what you want even without KVM_EXIT_MEMORY_FAULT? > However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN > ioctls) would not be blocked, due to the lack of a way to report the exit. Aren't all of those out of scope? E.g. in a very hypothetical world where XEN's event channel is being used with VTLs, if VTL1 makes the event channel inaccessible, that's a guest and/or userspace configuration issue and the guest is hosed no matter what KVM does. Ditto for these case where KVM-GT's buffer is blocked. I'm guessing the ITS is similar? > The intersection of these features with VTLs should be very small (sometimes > zero since VTLs are x86 only), but the ioctls would be a problem so I'm > wondering what your thoughts are on this. How do the ioctls() map to VTLs? I.e. are they considered VTL0, VTL1, out-of-band? > Also, while the exit API could be the same, it is not clear to me that the > permission bitmap would be a good match for entirely "void" memslots used to > work around non-atomic memslot changes. So for now let's leave this aside > and only consider the KVM_GET_DIRTY_LOG case. As above, can't userspace just mprotect() the entire memslot to prevent writes between getting the dirty log and deleting the memslot?