Date: Mon, 1 Feb 2021 17:33:58 -0800
From: Sean Christopherson
To: "Maciej S. Szmigiero"
Szmigiero" Cc: Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Igor Mammedov , Marc Zyngier , James Morse , Julien Thierry , Suzuki K Poulose , Huacai Chen , Aleksandar Markovic , Paul Mackerras , Christian Borntraeger , Janosch Frank , David Hildenbrand , Cornelia Huck , Claudio Imbrenda , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH 2/2] KVM: Scalable memslots implementation Message-ID: References: <4d748e0fd50bac68ece6952129aed319502b6853.1612140117.git.maciej.szmigiero@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <4d748e0fd50bac68ece6952129aed319502b6853.1612140117.git.maciej.szmigiero@oracle.com> Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The patch numbering and/or threading is messed up. This should either be a standalone patch, or fully incorporated into the same series as the selftests changes. But, that's somewhat of a moot point... On Mon, Feb 01, 2021, Maciej S. Szmigiero wrote: > From: "Maciej S. Szmigiero" > > The current memslot code uses a (reverse) gfn-ordered memslot array for > keeping track of them. > This only allows quick binary search by gfn, quick lookup by hva is not > possible (the implementation has to do a linear scan of the whole memslot > array). > > Because the memslot array that is currently in use cannot be modified > every memslot management operation (create, delete, move, change flags) > has to make a copy of the whole array so it has a scratch copy to work > on. > > Strictly speaking, however, it is only necessary to make copy of the > memslot that is being modified, copying all the memslots currently > present is just a limitation of the array-based memslot implementation. > > Two memslot sets, however, are still needed so the VM continues to > run on the currently active set while the requested operation is being > performed on the second, currently inactive one. > > In order to have two memslot sets, but only one copy of the actual > memslots it is necessary to split out the memslot data from the > memslot sets. > > The memslots themselves should be also kept independent of each other > so they can be individually added or deleted. > > These two memslot sets should normally point to the same set of > memslots. They can, however, be desynchronized when performing a > memslot management operation by replacing the memslot to be modified > by its copy. > After the operation is complete, both memslot sets once again > point to the same, common set of memslot data. > > This commit implements the aforementioned idea. This needs split into at least 5+ patches, and probably more like 10+ to have a realistic chance of getting it thoroughly reviewed. E.g. changes that can easily be split out: - Move s390's "approximate" logic into search_memslots. - Introduce n_memslots_pages - Using a hash for id_to_index - Changing KVM_USER_MEM_SLOTS to SHRT_MAX - kvm_zap_gfn_range() changes/optimizations AFAICT, tracking both the active and inactive memslot in memslots_all could also be done in a separate patch, though I can't tell if that would actually gain anything. > The new implementation uses two trees to perform quick lookups: > For tracking of gfn an ordinary rbtree is used since memslots cannot > overlap in the guest address space and so this data structure is > sufficient for ensuring that lookups are done quickly. 
Switching to an rbtree for the gfn lookups also appears to be a separate
change, though that's little more than a guess based on a quick read of the
changes.

> For tracking of hva, however, an interval tree is needed since they
> can overlap between memslots.

...

> Making lookup and memslot management operations O(log(n)) brings
> some performance benefits (tested on a Xeon 8167M machine):
> 509 slots in use:
> Test           Before       After        Improvement
> Map            0,0246s      0,0240s       2%
> Unmap          0,0833s      0,0318s      62%
> Unmap 2M       0,00177s     0,000917s    48%
> Move active    0,0000959s   0,0000816s   15%
> Move inactive  0,0000960s   0,0000799s   17%

I assume "move" refers to the gfn?  If so, I believe this can be ignored for
the most part as it's not a common operation, and already has a lot of
leading zeros :-)

> Slot setup     0,0107s      0,00825s     23%

What does "slot setup" measure?  I assume it's one-time pain?  If so, then we
can probably ignore this, as use cases that care about millisecond
improvements in boot time are unlikely to have 50 memslots, let alone 500+
memslots.

I'm not nitpicking the benchmarks to discredit your measurements, but rather
to point out that I suspect the only thing that's "broken", and that anyone
truly cares about, is unmapping, i.e. hva->memslot lookups.

If that is indeed the case, would it be sufficient to focus on speeding up
_just_ the hva lookups?  Specifically, I think we can avoid copying the
"active vs. inactive" scheme that is used for the main gfn-based array by
having the hva tree resolve to an _id_, not to the memslot itself.  I.e.
bounce through id_to_index, which is coupled with the main array, so that
lookups are always done on the "active" memslots, without also having to
implement an "inactive" hva tree.

For deletion, seeing the defunct/invalid memslot is not a functional problem;
it's technically a performance "problem", but one that we already have.

For creation, id_to_index will be -1, and so the memslot lookup will return
NULL until the new memslot is visible.

All hva lookups would obviously need to be changed (the read side is sketched
at the end of this mail), but the touchpoint for the write would be quite
small, e.g.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 8367d88ce39b..c03beb4833b2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1220,6 +1220,20 @@ static int kvm_set_memslot(struct kvm *kvm,
 	if (r)
 		goto out_slots;
 
+	/*
+	 * Update the hva=>id tree if a memslot is being deleted or created.
+	 * No update is required for moving a memslot or changing its flags,
+	 * as those don't affect its _id_.  For deletion, the memslot has been
+	 * zapped and flushed, fast hva lookups are guaranteed to be nops.  For
+	 * creation, the new memslot isn't visible until the final installation
+	 * is complete.  Fast hva lookups may prematurely see the entry, but
+	 * id_to_memslot() will return NULL upon seeing id_to_index[id] == -1.
+	 */
+	if (change == KVM_MR_DELETE)
+		kvm_hva_tree_remove(...);
+	else if (change == KVM_MR_CREATE)
+		kvm_hva_tree_insert(...);
+
 	update_memslots(slots, new, change);
 
 	slots = install_new_memslots(kvm, as_id, slots);

I'm not opposed to using more sophisticated storage for the gfn lookups, but
only if there's a good reason for doing so.  IMO, the rbtree isn't simpler,
just different.  Memslot modifications are unlikely to be a hot path (and if
they are, x86's "zap everything" implementation is a far bigger problem), and
it's hard to beat the memory footprint of a raw array.  That doesn't leave
much motivation for such a big change to some of KVM's scariest (for me)
code.
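Returning to the hva=>id idea above, the read side would be similarly small.
The sketch below is purely illustrative and not existing KVM code;
kvm_hva_tree_lookup() is an invented stand-in for the interval-tree query,
assumed to return a memslot id on a hit and a negative value on a miss.

/*
 * Illustrative sketch only: the hva tree stores just a memslot _id_, and the
 * lookup bounces through id_to_index (via id_to_memslot()) so it always
 * resolves against the currently active memslots, without needing a second,
 * "inactive" hva tree.
 */
static struct kvm_memory_slot *sketch_hva_to_memslot(struct kvm_memslots *slots,
						     unsigned long hva)
{
	/* Hypothetical helper: interval-tree lookup, returns a slot id or -1. */
	int id = kvm_hva_tree_lookup(slots, hva);

	if (id < 0)
		return NULL;

	/*
	 * id_to_memslot() returns NULL while id_to_index[id] == -1, i.e.
	 * before a newly created slot has actually been installed, which is
	 * what makes a prematurely visible hva tree entry benign.
	 */
	return id_to_memslot(slots, id);
}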