Received: by 2002:a05:6358:1087:b0:cb:c9d3:cd90 with SMTP id j7csp1134712rwi; Wed, 26 Oct 2022 11:10:06 -0700 (PDT) X-Google-Smtp-Source: AMsMyM7yPmFFpAYFzUM7w0u5WMv/xFT3TGM1dndNhxdtM35rFOlKJwsQMHWkkLNh4hdM6L1PFM7P X-Received: by 2002:a17:907:70a:b0:741:78ab:dce5 with SMTP id xb10-20020a170907070a00b0074178abdce5mr39359783ejb.527.1666807806051; Wed, 26 Oct 2022 11:10:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1666807806; cv=none; d=google.com; s=arc-20160816; b=vHCFhrPiCitEwxUKcu3sYbZ4//uQK0435NndQRB6Mrg/t0gbrugNlfJWaNOnMyidOJ tB8/6nx5gWwaMz1v66AP1Np5uUZXpj4MLFNFrleS0tmB8MedEJ/oF3Rb0H+QIIldO9tz S8VpeqE4OWOm/fFQeZ3BaXJybAXIIB7BrFr8Ij5pqHiZuQlW8PglnUwDzEbKEdc81iYR N3ZifH6nezqicFL/tlt4BWxrqdHppMQtvaw4PWLsbS61XDg5IWySlQ7ahfs7Vgx0Sa9E Yc3PXGSiW5mDxE76UqWvxEF1apk0ZKXqRl9cpUc6im9Q3NczYJtMZxWYno/KFcvVDm1L oNmg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=geniLJHmWq3kpgcGVKm4jCdheUxhYshUAvvdO1yrzTQ=; b=osGUud2ofKEKOKM6hQ2ZFPOVz0Z8r99VOttVS5qD7c+Gp59Gzza9kn3p7+Sw47ITPg OP04T+2QIl5dm9xmtfI1fUrGzPZsqz3hG6AE1PZFx2gOlz2BIGPCi73x0zs6oBO/IJix I/F2WoHftRE3QtaraAti93pWLjRMQBEmuADGslHOxvfq9MBt3UlA31AutLahATkLY4Fj da0+gWyBa2/zUVczjAy+XB5Swg/bYEfk6wYxMcAdsOm8i/RKe2Fulxqw7jhDvXu8gHvC 9G3+iXfDCIuiHpoKK6GQLHZ7lIIKEnIRF5LFBZ64vDTzr4GredY4Ra9QLguNw9Yb27wT BmOg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ZpTEV4cg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id dt4-20020a170907728400b0073d9c29892csi7171470ejc.939.2022.10.26.11.09.30; Wed, 26 Oct 2022 11:10:06 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=ZpTEV4cg; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234228AbiJZRwf (ORCPT + 99 others); Wed, 26 Oct 2022 13:52:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56114 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234235AbiJZRwY (ORCPT ); Wed, 26 Oct 2022 13:52:24 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2C245108DCC for ; Wed, 26 Oct 2022 10:52:15 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1666806735; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=geniLJHmWq3kpgcGVKm4jCdheUxhYshUAvvdO1yrzTQ=; b=ZpTEV4cgsSVP+VR8YE55RSluSXq6vh4I+E1yE1h4L9pY9cE+TseMiinFgIK12NqSABPq0G XVU1D3LjR5JvXI/frmS2ULsFyR0eLMWPSkSW4I64aguFROyPbnrO1v2KcLK6SV6ONX/07X vJtzpJGPTY8ygmugu91VBoJDRSWXiJA= Received: from mail-wm1-f72.google.com (mail-wm1-f72.google.com [209.85.128.72]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-163-Rf7KdyzOM_W2LQOX2UPDnA-1; Wed, 26 Oct 2022 13:52:13 -0400 X-MC-Unique: Rf7KdyzOM_W2LQOX2UPDnA-1 Received: by mail-wm1-f72.google.com with SMTP id iw9-20020a05600c54c900b003cf4eea332cso419256wmb.1 for ; Wed, 26 Oct 2022 10:52:13 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=geniLJHmWq3kpgcGVKm4jCdheUxhYshUAvvdO1yrzTQ=; b=HY31WGhDwNUPuCdLLu4mcHUe0Gll8DaeEbkEkt5F0HIeHhHnqxN8Wf5LnOoN3TB8cL VAK3Oqrkda9PzL8qKseLjrJav3oh/GovdgVuffesdKn337l+GivpZmiKDq3uxbqtD9IC BfnhEmU+OiKk3oXe6o6dX6q3lkL7EAahl510SEUapIwKDoeMeTZo3jp1O3bBdobjuKyj UC+ihwnZY93IKDpDwfCTTOPurv9qnQP+exMdG2KO+8S5coUr+2wdZeYSluF+lV14Soce qs8BKTv3EKyRsuk0FMa8VMHvoVHdt8FaZaLaBJfLZpB4vciNnbVW+rWxOPCKDBRKmIbi PsUA== X-Gm-Message-State: ACrzQf073vAIMQDO5lJ9Wzza6NqZV8EuQdNDbVDK+Gr92ImcbinNZ2Qq Zi6LCP2I3Zhg5KJzOoRPJV00Y6oxl53UHBc/zLMSyrO6AuApYw+QgSCS3is6YY2NR2+haU2Nzy8 BlbLVbZJpx0DJKjSpGVEutIx5 X-Received: by 2002:a05:6000:1ac7:b0:232:b68c:9e0 with SMTP id i7-20020a0560001ac700b00232b68c09e0mr30940564wry.54.1666806732600; Wed, 26 Oct 2022 10:52:12 -0700 (PDT) X-Received: by 2002:a05:6000:1ac7:b0:232:b68c:9e0 with SMTP id i7-20020a0560001ac700b00232b68c09e0mr30940544wry.54.1666806732335; Wed, 26 Oct 2022 10:52:12 -0700 (PDT) Received: from ?IPV6:2001:b07:6468:f312:1c09:f536:3de6:228c? ([2001:b07:6468:f312:1c09:f536:3de6:228c]) by smtp.googlemail.com with ESMTPSA id g12-20020a05600c310c00b003b4cba4ef71sm2448055wmo.41.2022.10.26.10.52.10 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Wed, 26 Oct 2022 10:52:11 -0700 (PDT) Message-ID: Date: Wed, 26 Oct 2022 19:52:10 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.2.1 Subject: Hyper-V VTLs, permission bitmaps and userspace exits (was Re: [PATCH 0/4] KVM: API to block and resume all running vcpus in a vm) Content-Language: en-US To: Sean Christopherson Cc: Christian Borntraeger , Emanuele Giuseppe Esposito , kvm@vger.kernel.org, Jonathan Corbet , Maxim Levitsky , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , David Hildenbrand , x86@kernel.org, "H. Peter Anvin" , linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org References: <20221022154819.1823133-1-eesposit@redhat.com> <2701ce67-bfff-8c0c-4450-7c4a281419de@redhat.com> <384b2622-8d7f-ce02-1452-84a86e3a5697@linux.ibm.com> <5a26c107-9ab5-60ee-0e9c-a9955dfe313d@redhat.com> <02c910bb-3ea0-fa84-7a1c-92fb9e8b03de@redhat.com> From: Paolo Bonzini In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Spam-Status: No, score=-2.6 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, RCVD_IN_MSPIKE_H2,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 10/26/22 01:07, Sean Christopherson wrote: > I don't think it's realistic to make accesses outside of KVM_RUN go away, e.g. > see the ARM ITS discussion in the dirty ring thread. kvm_xen_set_evtchn() also > explicitly depends on writing guest memory without going through KVM_RUN (and > apparently can be invoked from a kernel thread?!?). Yeah, those are the pages that must be considered dirty when using the dirty ring. > In theory, I do actually like the idea of restricting memory access to KVM_RUN, > but in reality I just think that forcing everything into KVM_RUN creates far more > problems than it solves. E.g. my complaint with KVM_REQ_GET_NESTED_STATE_PAGES > is that instead of syncrhonously telling userspace it has a problem, KVM chugs > along as if everything is fine and only fails at later point in time. I doubt > userspace would actually do anything differently, i.e. the VM is likely hosed no > matter what, but deferring work adds complexity in KVM and makes it more difficult > to debug problems when they occur. > >>> - to stop anything else in the system that consumes KVM memslots, e.g. KVM GT >> >> Is this true if you only look at the KVM_GET_DIRTY_LOG case and consider it >> a guest bug to access the memory (i.e. ignore the strange read-only changes >> which only happen at boot, and which I agree are QEMU-specific)? > > Yes? I don't know exactly what "the KVM_GET_DIRTY_LOG case" is. It is not possible to atomically read the dirty bitmap and delete a memslot. When you delete a memslot, the bitmap is gone. In this case however memory accesses to the deleted memslot are a guest bug, so stopping KVM-GT would not be necessary. So while I'm being slowly convinced that QEMU should find a way to pause its vCPUs around memslot changes, I'm not sure that pausing everything is needed in general. >>> And because of the nature of KVM, to support this API on all architectures, KVM >>> needs to make change on all architectures, whereas userspace should be able to >>> implement a generic solution. >> >> Yes, I agree that this is essentially just a more efficient kill(). >> Emanuele, perhaps you can put together a patch to x86/vmexit.c in >> kvm-unit-tests, where CPU0 keeps changing memslots and the other CPUs are in >> a for(;;) busy wait, to measure the various ways to do it? > > I'm a bit confused. Is the goal of this to simplify QEMU, dedup VMM code, provide > a more performant solution, something else entirely? Well, a bit of all of them and perhaps that's the problem. And while the issues at hand *are* self-inflicted wounds on part of QEMU, it seems to me that the underlying issues are general. For example, Alex Graf and I looked back at your proposal of a userspace exit for "bad" accesses to memory, wondering if it could help with Hyper-V VTLs too. To recap, the "higher privileged" code at VTL1 can set up VM-wide restrictions on access to some pages through a hypercall (HvModifyVtlProtectionMask). After the hypercall, VTL0 would not be able to access those pages. The hypercall would be handled in userspace and would invoke a KVM_SET_MEMORY_REGION_PERM ioctl to restrict the RWX permissions, and this ioctl would set up a VM-wide permission bitmap that would be used when building page tables. Using such a bitmap instead of memslots makes it possible to cause userspace vmexits on VTL mapping violations with efficient data structures. And it would also be possible to use this mechanism around KVM_GET_DIRTY_LOG, to read the KVM dirty bitmap just before removing a memslot. However, external accesses to the regions (ITS, Xen, KVM-GT, non KVM_RUN ioctls) would not be blocked, due to the lack of a way to report the exit. The intersection of these features with VTLs should be very small (sometimes zero since VTLs are x86 only), but the ioctls would be a problem so I'm wondering what your thoughts are on this. Also, while the exit API could be the same, it is not clear to me that the permission bitmap would be a good match for entirely "void" memslots used to work around non-atomic memslot changes. So for now let's leave this aside and only consider the KVM_GET_DIRTY_LOG case. Paolo