Date: Tue, 2 Feb 2021 15:43:46 -0800
From: Sean Christopherson
To: "Maciej S. Szmigiero"
Szmigiero" Cc: Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Igor Mammedov , Marc Zyngier , James Morse , Julien Thierry , Suzuki K Poulose , Huacai Chen , Aleksandar Markovic , Paul Mackerras , Christian Borntraeger , Janosch Frank , David Hildenbrand , Cornelia Huck , Claudio Imbrenda , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/2] KVM: Scalable memslots implementation Message-ID: References: <4d748e0fd50bac68ece6952129aed319502b6853.1612140117.git.maciej.szmigiero@oracle.com> <9e6ca093-35c3-7cca-443b-9f635df4891d@maciej.szmigiero.name> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <9e6ca093-35c3-7cca-443b-9f635df4891d@maciej.szmigiero.name> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Feb 02, 2021, Maciej S. Szmigiero wrote: > On 02.02.2021 02:33, Sean Christopherson wrote: > > > Making lookup and memslot management operations O(log(n)) brings > > > some performance benefits (tested on a Xeon 8167M machine): > > > 509 slots in use: > > > Test Before After Improvement > > > Map 0,0246s 0,0240s 2% > > > Unmap 0,0833s 0,0318s 62% > > > Unmap 2M 0,00177s 0,000917s 48% > > > Move active 0,0000959s 0,0000816s 15% > > > Move inactive 0,0000960s 0,0000799s 17% > > > > I assume "move" refers to the gfn? If so, I believe this can be ignored for the > > most part as it's not a common operation, and already has a lot of leading zeros :-) > > Even if it is not a common operation (today) making it better is > still a good thing. > > The move test result has a lot of leading zeros since it is moving just > a single memslot and that does not take a lot of time in the absolute > sense. Yes, that's my point. The absolute time is barely measurable, this is an extremely rare operation, and the optimal approach isn't orders of magnitude faster, i.e. we can comfortably ignore the "move" performance when weighing options. > > > Slot setup 0,0107s 0,00825s 23% > > > > What does "slot setup" measure? I assume it's one-time pain? If so, then we > > can probably ignore this as use cases that care about millisecond improvements > > in boot time are unlikely to have 50 memslots, let alone 500+ memslots. > > This value shows how long it took the test to add all these memslots. > > Strictly speaking, it also includes the time spent allocating > the backing memory and time spent in the (userspace) selftest framework > vm_userspace_mem_region_add() function, but since these operations are > exactly the same for both in-kernel memslots implementations the > difference in results is all due to the new kernel code (that is, this > patch). > > The result also shows how the performance of the create memslot operation > scales with various count of memslots in use (the measurement is always > done with the same guest memory size). > > Hyper-V SynIC may require up to two additional slots per vCPU. > A large guest with with 128 vCPUs will then use 256 memslots for this > alone. > Also, performance improvements add up. I generally agree, but if this is literally a one time savings of a millisecond or so, for VM with a boot time measured in seconds or even tends of seconds... > At (guest) runtime live migration uses the memslot set flags operation > to turn on and off dirty pages logging. Do you have numbers for the overhead of enabling dirty logging? I assume the per-memslot overhead will be similr to the "move" microbenchmark? 
> Hot{un,}plug of memory and some other devices (like GPUs) create and
> delete memslots, too.
>
> > I'm not nitpicking the benchmarks to discredit your measurements,
> > rather to point out that I suspect the only thing that's "broken", and
> > that anyone truly cares about, is unmapping, i.e. hva->memslot lookups.
> > If that is indeed the case, would it be sufficient to focus on speeding
> > up _just_ the hva lookups?
> >
> > Specifically, I think we can avoid copying the "active vs. inactive"
> > scheme that is used for the main gfn-based array by having the hva tree
> > resolve to an _id_, not to the memslot itself.  I.e. bounce through
> > id_to_index, which is coupled with the main array, so that lookups are
> > always done on the "active" memslots, without also having to implement
> > an "inactive" hva tree.
>
> I guess you mean to still turn id_to_index into a hash table, since
> otherwise a VMM which uses just two memslots, but numbered 0 and 508,
> will have a 509-entry id_to_index array allocated.

That should be irrelevant for the purposes of optimizing hva lookups, and
mostly irrelevant for optimizing memslot updates.  Using a hash table is
almost purely a memory optimization; it really only matters when the max
number of memslots skyrockets, which is a separate discussion from
optimizing hva lookups.
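To make the reader side of the idea concrete, a rough sketch (completely
untested; kvm_hva_range and hva_lookup_memslot are hypothetical names, only
interval_tree_iter_first(), container_of() and id_to_memslot() are existing
kernel APIs):

  #include <linux/interval_tree.h>
  #include <linux/kvm_host.h>

  /* Hypothetical node type: the hva tree resolves to a memslot _id_. */
  struct kvm_hva_range {
          struct interval_tree_node node; /* node.start/last = hva range */
          int id;                         /* memslot ID, not a pointer */
  };

  /* Hypothetical fast lookup: hva -> id -> (active) memslot. */
  static struct kvm_memory_slot *
  hva_lookup_memslot(struct kvm_memslots *slots,
                     struct rb_root_cached *tree, unsigned long hva)
  {
          struct interval_tree_node *node;

          node = interval_tree_iter_first(tree, hva, hva);
          if (!node)
                  return NULL;

          /*
           * Bounce through id_to_index: a not-yet-visible (KVM_MR_CREATE)
           * slot has id_to_index[id] == -1, so id_to_memslot() returns
           * NULL.
           */
          return id_to_memslot(slots,
                               container_of(node, struct kvm_hva_range,
                                            node)->id);
  }

Because the tree stores only IDs, there's no need for an "inactive" copy of
the tree: a stale entry simply resolves through the active id_to_index.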
> > For deletion, seeing the defunct/invalid memslot is not a functional
> > problem; it's technically a performance "problem", but one that we
> > already have.  For creation, id_to_index will be -1, and so the memslot
> > lookup will return NULL until the new memslot is visible.
>
> This sounds like you would keep the id_to_index array / hash table
> separate from the main array, as it is in the old code (I read "coupled
> with the main array" above as a suggestion to move it into the part that
> gets resized when memslots are created or deleted in the current code,
> that is, struct kvm_memslots).

What I meant by "coupled" is that, in the current code, the id_to_index and
main memslots arrays are updated in tandem; it's impossible for readers to
see unsynchronized arrays.

> Then if you create or delete a memslot, the memslots located further in
> the memslot array (with a lower gfn than the processed slot) will have
> their indices shifted - you can't atomically update all of them.
>
> But overall, this solution (and the one with id_to_index moved into the
> main array, too) is still O(n) per memslot operation, as you still need
> to copy the array to either make space for the new memslot or to remove
> the hole left by the removed memslot.

Yes, but that problem can be solved separately from the performance issue
with hva lookups.

> Due to that scaling issue it's rather hard to use 32k memslots with the
> old code; the improvement was something like 20+ times on an early
> version of this code.
>
> And if we start adding special cases for things like flag changes or gfn
> moves to work around their scaling issues, the code will quickly grow
> even more complicated.

> > All hva lookups would obviously need to be changed, but the touchpoint
> > for the write would be quite small, e.g.
> >
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 8367d88ce39b..c03beb4833b2 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -1220,6 +1220,20 @@ static int kvm_set_memslot(struct kvm *kvm,
> >         if (r)
> >                 goto out_slots;
> >
> > +       /*
> > +        * Update the hva=>id tree if a memslot is being deleted or
> > +        * created.  No update is required for moving a memslot or
> > +        * changing its flags, as those don't affect its _id_.  For
> > +        * deletion, the memslot has been zapped and flushed, fast hva
> > +        * lookups are guaranteed to be nops.  For creation, the new
> > +        * memslot isn't visible until the final installation is
> > +        * complete.  Fast hva lookups may prematurely see the entry,
> > +        * but id_to_memslot() will return NULL upon seeing
> > +        * id_to_index[id] == -1.
> > +        */
> > +       if (change == KVM_MR_DELETE)
> > +               kvm_hva_tree_remove(...);
> > +       else if (change == KVM_MR_CREATE)
> > +               kvm_hva_tree_insert(...);
> > +
> >         update_memslots(slots, new, change);
> >         slots = install_new_memslots(kvm, as_id, slots);
> >
> > I'm not opposed to using more sophisticated storage for the gfn lookups,
> > but only if there's a good reason for doing so.  IMO, the rbtree isn't
> > simpler, just different.  Memslot modifications are unlikely to be a hot
> > path (and if they are, x86's "zap everything" implementation is a far
> > bigger problem), and it's hard to beat the memory footprint of a raw
> > array.  That doesn't leave much motivation for such a big change to
> > some of KVM's scariest (for me) code.
>
> Improvements can be done step-by-step;
> kvm_mmu_invalidate_zap_pages_in_memslot() can be rewritten, too, in the
> future, if necessary.
> After all, the complaints are that this change alone is too big.

It's not simply that it's too big, it's that it solves several problems in
a single patch that can, and should, be done in separate patches.  Dumping
everything into a single patch makes bisecting nearly worthless, e.g. if
fast hva lookups break a non-x86 architecture, we should be able to bisect
to exactly that, not to a massive patch that completely rewrites all of the
memslot code in one fell swoop.  Mega patches with multiple logical changes
are also extremely difficult to review.

See 'Patch preparation' in Documentation/process/5.Posting.rst for more
info on splitting up patches.

> I think that if you look not at the patch itself, but at the resulting
> code, the new implementation looks rather straightforward,

Sorry to be blunt, but that's just not how Linux kernel development works.
Again, I am not opposed to any particular idea/approach in this patch, but
the individual enhancements absolutely need to be split into separate
patches.

I focused on the hva tree because I think it has, by far, the best bang for
the buck.  The performance benefits are clear, the changes can be made with
minimal impact to existing code, and each architecture can opt in one at a
time.  What I'm suggesting is that we first get the fast hva lookups
merged, and then worry about getting KVM to play nice with tens of
thousands of memslots.

> there are comments at every step in kvm_set_memslot() to explain exactly
> what is being done.  Not only does it already scale well, but it is also
> flexible enough to accommodate further improvements or even new
> operations.
>
> The new implementation also uses the standard kernel {rb,interval}-tree
> and hash table implementations as its basic data structures, so it
> automatically benefits from any generic improvements to these.
>
> All for the low price of just 174 net lines of code added.
>
> Maciej