Received: by 2002:a05:6358:1087:b0:cb:c9d3:cd90 with SMTP id j7csp715108rwi; Thu, 13 Oct 2022 04:24:56 -0700 (PDT) X-Google-Smtp-Source: AMsMyM6ML+meHDJawrWtGBZE1eiJ1dVHH0Ng81vN9I+GEaHI7dyxy1EdAcIb9ghNtfnK9XbpRPN6 X-Received: by 2002:a17:90b:3805:b0:20d:4e77:371f with SMTP id mq5-20020a17090b380500b0020d4e77371fmr10863123pjb.170.1665660296477; Thu, 13 Oct 2022 04:24:56 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1665660296; cv=none; d=google.com; s=arc-20160816; b=mVq8Dv0hQNx8mT9NSe9M9QGKGqihGCD6n96TSKaQvHlk5rXKJ9hBWQXO/497vOtipS vnDRGDoV/iEnJ4B1C1zOh3IMVvDKhAXfzH19wU+WvFcUeaRXEfdUf80BTm/uZGIg3ZbO bR3xgUX4lo5HHYNNBgYEE7y++ZHr15Ll2tPO0HMj/JmoWUYmrCn337+9jWqLtbDC9YyP rmiWWg/i7xKPmIZ/g70cdZqtLKMSNS8p6ur8u8moxUBX4h97DQguFJlkkp8vxk8vZgyD mNzjUW/KdUIPaavtKb4LXFLoYYPUowbuntyMEVMRIXtfIdBNCJf64jYQEddqXDex6UBF i1bw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:in-reply-to:from :references:cc:to:content-language:subject:user-agent:mime-version :date:message-id:dkim-signature; bh=j29ngQCSFDZIffbvLhxq5+79cMSaVQV90U1dHeLGPXo=; b=088M8YkWktwo2kwfhI5b4tk/U0qmKF6LFDwk56GMwJ+7992tjVWw1TlHsFlDVgtt6T lhAf4xE32rUDZ47q0A+6ZiJjH8qwW79rHokO8Wc/JI7Pp9m4604jw6Ll50LyTMiydfdn bld6SAcP7lkYIZLqfGU0ZhUgoKfa/K2bGp7+FesuyCiIuNg2nvzCLZ3WdNT9067uqS82 G2vLcoy/T3p8Lvqret5i9dy66YOynJ5+2f4xOjzgGaDbvXBZ0i4nVZNGmcrTQP9oMycC rvsvCECo3C9mAsTy9y8iGHEdDcjLeqJBAtQjeIPKNvNUWBoYUOKktaizTRxIwjAXdxqy WvnA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Hkv4JVjv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c33-20020a17090a492400b00205b268cd58si4568179pjh.181.2022.10.13.04.24.30; Thu, 13 Oct 2022 04:24:56 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b=Hkv4JVjv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229470AbiJMLM0 (ORCPT + 99 others); Thu, 13 Oct 2022 07:12:26 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:57872 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229513AbiJMLMT (ORCPT ); Thu, 13 Oct 2022 07:12:19 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 23BA711E440 for ; Thu, 13 Oct 2022 04:12:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1665659537; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=j29ngQCSFDZIffbvLhxq5+79cMSaVQV90U1dHeLGPXo=; b=Hkv4JVjvIM+UwZelyNL1ngYDhR1BcvB7bLI9pFRFTpDQmn+Ca9MdvVjX3/eKIimy9HAwxz qK5dQAGFg3mIh5UTUc4ay+2g6Q81p/HBa+1boPLoB7CMYR6JJJePARp5anoUlxxnkbAOZU +l7mHLyqzZexwn16R0W+MgaO/x6lH38= Received: from mail-wr1-f69.google.com (mail-wr1-f69.google.com [209.85.221.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_128_GCM_SHA256) id us-mta-437-MJDZCpj6PpiuTxJ5nBPzgQ-1; Thu, 13 Oct 2022 07:12:16 -0400 X-MC-Unique: MJDZCpj6PpiuTxJ5nBPzgQ-1 Received: by mail-wr1-f69.google.com with SMTP id s30-20020adfa29e000000b002302b9671feso415233wra.20 for ; Thu, 13 Oct 2022 04:12:16 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=j29ngQCSFDZIffbvLhxq5+79cMSaVQV90U1dHeLGPXo=; b=u1VrlFLSQ0V9lT4D1TD+eQcHbyMw0WcHuTsOWz06sEp/I2HI1jVpaSJcuzakJsPrgx MFLatq3W8xfSrpCRQvRNcW7BeN9BTlM89QziL50R4C1occ47OJ7jyHG+aznc8DIHPPtS XGxl1Z9krAYTZHPXlID8OqR6uZEufgykACpm2P1juBxtU6t2qEjE2rm+H/EIkfPYRsSO 8BcEPetZL0G2VbxbJdzGTNbUle4rj37ryg53TaCX8GxfF2SruoVDVjeQpjXAcGO8X+bM CnhXQqZPaTmULen1XJZ70pQw2rlwgqInbhQm3/enDW3QL1miwQnV9VVc/NuvCV6r9nWt XNnA== X-Gm-Message-State: ACrzQf1NjrF0snLSxt4owzaJGPlBjma/1YGfFdxR1KDwOvfTkWpq/3Ek 4am90oodB+HfD27CcNStfh7tKOHNNCREPz0H5ck7MCRO26bB1W5EVOOMJ6/6kPA+Io78tWykUyF Sxw5+weMfVtsMxxFWDo0kkPOb X-Received: by 2002:a5d:456b:0:b0:230:9e5b:c64c with SMTP id a11-20020a5d456b000000b002309e5bc64cmr12453717wrc.211.1665659535079; Thu, 13 Oct 2022 04:12:15 -0700 (PDT) X-Received: by 2002:a5d:456b:0:b0:230:9e5b:c64c with SMTP id a11-20020a5d456b000000b002309e5bc64cmr12453690wrc.211.1665659534756; Thu, 13 Oct 2022 04:12:14 -0700 (PDT) Received: from [192.168.149.123] (58.254.164.109.static.wline.lns.sme.cust.swisscom.ch. [109.164.254.58]) by smtp.gmail.com with ESMTPSA id t1-20020adff041000000b0022ac38fb20asm1780424wro.111.2022.10.13.04.12.12 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Thu, 13 Oct 2022 04:12:13 -0700 (PDT) Message-ID: <12ac9bf1-bcd6-d684-d265-1ad524b99feb@redhat.com> Date: Thu, 13 Oct 2022 13:12:12 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 Subject: Re: [RFC PATCH 0/9] kvm: implement atomic memslot updates Content-Language: en-US To: David Hildenbrand , Sean Christopherson , Paolo Bonzini Cc: Maxim Levitsky , kvm@vger.kernel.org, Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , Dave Hansen , x86@kernel.org, "H. Peter Anvin" , linux-kernel@vger.kernel.org, Like Xu References: <111a46c1-7082-62e3-4f3a-860a95cd560a@redhat.com> <14d5b8f2-7cb6-ce24-c7a7-32aa9117c953@redhat.com> <3b04db9d-0177-7e6e-a54c-a28ada8b1d36@redhat.com> <8534dfe4-bc71-2c14-b268-e610a3111d14@redhat.com> <261aff0b-874e-0644-e0c8-97e0a9bfbe04@redhat.com> From: Emanuele Giuseppe Esposito In-Reply-To: Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-4.0 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,NICE_REPLY_A, RCVD_IN_DNSWL_LOW,SPF_HELO_NONE,SPF_NONE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Am 13/10/2022 um 10:44 schrieb David Hildenbrand: > On 13.10.22 09:43, Emanuele Giuseppe Esposito wrote: >> >> >> Am 29/09/2022 um 23:39 schrieb Sean Christopherson: >>> If we really want to provide a better experience for userspace, why >>> not provide >>> more primitives to address those specific use cases?  E.g. fix KVM's >>> historic wart >>> of disallowing toggling of KVM_MEM_READONLY, and then add one or more >>> ioctls to: >>> >>>    - Merge 2+ memory regions that are contiguous in GPA and HVA >>>    - Split a memory region into 2+ memory regions >>>    - Truncate a memory region >>>    - Grow a memory region >> >> I looked again at the code and specifically the use case that triggers >> the crash in bugzilla. I *think* (correct me if I am wrong), that the >> only operation that QEMU performs right now is "grow/shrink". > > I remember that there were BUG reports where we'd actually split and run > into that problem. Just don't have them at hand. I think they happened > during early boot when the OS re-configured some PCI thingies. If you could point me where this is happening, it would be nice. So far I could not find or see any split/merge operation. > >> >> So *if* we want to go this way, we could start with a simple grow/shrink >> API. >> >> Even though we need to consider that this could bring additional >> complexity in QEMU. Currently, DELETE+CREATE (grow/shrink) is not >> performed one after the other (in my innocent fantasy I was expecting to >> find 2 subsequent ioctls in the code), but in QEMU's >> address_space_set_flatview(), which seems to first remove all regions >> and then add them when changing flatviews. >> >> address_space_update_topology_pass(as, old_view2, new_view, >> adding=false); >> address_space_update_topology_pass(as, old_view2, new_view, adding=true); >> >> I don't think we can change this, as other listeners also rely on such >> ordering, but we can still batch all callback requests like I currently >> do and process them in kvm_commit(), figuring there which operation is >> which. >> >> In other words, we would have something like: >> >> region_del() --> DELETE memslot X -> add it to the queue of operations >> region_del() --> DELETE memslot Y -> add it to the queue of operations >> region_add() --> CREATE memslot X (size doubled) -> add it to the queue >> of operations >> region_add() --> CREATE memslot Y (size halved) -> add it to the queue >> of operations >> ... >> commit() --> scan QUEUE and figure what to do -> GROW X (+size), SHRINK >> Y (-size) -> issue 2 ioctls, GROW and SHRINK. > > I communicated resizes (region_resize()) to the notifier in patch #3 of > https://lore.kernel.org/qemu-devel/20200312161217.3590-1-david@redhat.com/ > > You could use that independently of the remainder. It should be less > controversial ;) > > But I think it only tackles part of the more generic problem (split/merge). Yes, very useful! Using that patch we would have a single place where to issue grow/shrink ioctls. I don't think we need to inhibit ioctls, since the operation will be atomic (this time in the true meaning, since we don't need INVALIDATE. > >> >>> That would probably require more KVM code overall, but each operation >>> would be more >>> tightly bounded and thus simpler to define.  And I think more precise >>> APIs would >>> provide other benefits, e.g. growing a region wouldn't require first >>> deleting the >>> current region, and so could avoid zapping SPTEs and destroying >>> metadata.  Merge, >>> split, and truncate would have similar benefits, though they might be >>> more >>> difficult to realize in practice. >> >> So essentially grow would not require INVALIDATE. Makes sense, but would >> it work also with shrink? I guess so, as the memslot is still present >> (but shrinked) right? >> >> Paolo, would you be ok with this smaller API? Probably just starting >> with grow and shrink first. >> >> I am not against any of the two approaches: >> - my approach has the disadvantage that the list could be arbitrarily >> long, and it is difficult to rollback the intermediate changes if >> something breaks during the request processing (but could be simplified >> by making kvm exit or crash). >> >> - Sean approach could potentially provide more burden to the userspace, >> as we need to figure which operation is which. Also from my >> understanding split and merge won't be really straightforward to >> implement, especially in userspace. >> >> David, any concern from userspace prospective on this "CISC" approach? > > In contrast to resizes in QEMU that only affect a single memory > region/slot, splitting/merging is harder to factor out and communicate > to a notifier. As an alternative, we could handle it in the commit stage > in the notifier itself, similar to what my prototype does, and figure > out what needs to be done in there and how to call the proper KVM > interface (and which interface to call). > > With virtio-mem (in the future) we might see merges of 2 slots into a > single one, by closing a gap in-between them. In "theory" we could > combine some updates into a single transaction. But it won't be 100s ... > > I think I'd prefer an API that doesn't require excessive ad-hoc > extensions later once something in QEMU changes. > > > I think in essence, all we care about is performing atomic changes that > *have to be atomic*, because something we add during a transaction > overlaps with something we remove during a transaction. Not necessarily > all updates in a transaction! > > My prototype essentially detects that scenario, but doesn't call new KVM > interfaces to deal with these. With "prototype" I assume you mean the patch linked above (region_resize), not the userspace-only proposal you sent initially right? > > I assume once we take that into consideration, we can mostly assume that > any such atomic updates (only of the regions that really have to be part > of an atomic update) won't involve more than a handful of memory > regions. We could add a sane KVM API limit. > > And if we run into that API limit in QEMU, we can print a warning and do > it non-atomically. > If we implement single operations (split/merge/grow/shrink), we don't even need that limit. Except for merge, maybe. Ok, if it'ok for you all I can try to use David patch and implement some simple grow/shrink. Then we need to figure where and when exactly QEMU performs split and merge operations, and maybe implement something similar to what you did in your proposal? Thank you, Emanuele