To: Sean Christopherson
Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Igor Mammedov,
    Marc Zyngier, James Morse, Julien Thierry, Suzuki K Poulose, Huacai Chen,
    Aleksandar Markovic, Paul Mackerras, Christian Borntraeger, Janosch Frank,
    David Hildenbrand, Cornelia Huck, Claudio Imbrenda, Joerg Roedel,
    kvm@vger.kernel.org, linux-kernel@vger.kernel.org
From: "Maciej S. Szmigiero"
Subject: Re: [PATCH v5 01/13] KVM: x86: Cache total page count to avoid traversing the memslot array
Message-ID: <8017cf9d-2b03-0c27-b78a-41b3d03c308b@maciej.szmigiero.name>
Date: Wed, 3 Nov 2021 12:59:27 +0100

On 01.11.2021 23:29, Sean Christopherson wrote:
> On Wed, Oct 20, 2021, Sean Christopherson wrote:
>> On Wed, Oct 20, 2021, Maciej S. Szmigiero wrote:
>>> On 20.10.2021 00:24, Sean Christopherson wrote:
>>>> E.g. the whole thing can be
>>>>
>>>> 	if (!kvm->arch.n_requested_mmu_pages &&
>>>> 	    (change == KVM_MR_CREATE || change == KVM_MR_DELETE)) {
>>>> 		unsigned long nr_mmu_pages;
>>>>
>>>> 		if (change == KVM_MR_CREATE) {
>>>> 			kvm->arch.n_memslots_pages += new->npages;
>>>> 		} else {
>>>> 			WARN_ON(kvm->arch.n_memslots_pages < old->npages);
>>>> 			kvm->arch.n_memslots_pages -= old->npages;
>>>> 		}
>>>>
>>>> 		nr_mmu_pages = (unsigned long)kvm->arch.n_memslots_pages;
>>>> 		nr_mmu_pages *= (KVM_PERMILLE_MMU_PAGES / 1000);
>>>
>>> The above line will set nr_mmu_pages to zero: KVM_PERMILLE_MMU_PAGES
>>> is 20, so integer-dividing it by 1000 results in a multiplication
>>> coefficient of zero.
>>
>> Ugh, math.  And thus do_div() to avoid the whole 64-bit divide issue on
>> 32-bit KVM.  Bummer.
>
> I was revisiting this today because (a) simply making n_memslots_pages a u64
> doesn't cleanly handle the case where the resulting nr_mmu_pages would wrap,

Handling this case without capping total n_memslots_pages would require
either capping memslots_pages on 32-bit KVM so that it fits in 32 bits, or
changing kvm_mmu_change_mmu_pages() and all the logic further down to
accept u64s.

> (b) any fix in that area should really go in a separate patch to fix
> kvm_mmu_calculate_default_mmu_pages() and then carry that behavior forward.
>
> But as I dove deeper (and deeper), I came to the conclusion that supporting a
> total number of memslot pages that doesn't fit in an unsigned long is a waste
> of our time.  With a 32-bit kernel, userspace can at most address 3 GB of
> virtual memory, whereas wrapping the total number of pages would require 4 TB+
> of guest physical memory.  Even with x86's second address space for SMM, that
> means userspace would need to alias all of guest memory more than one
> _thousand_ times.  And on older hardware with MAXPHYADDR < 43, the guest
> couldn't actually access any of those aliases even if userspace lied about
> guest.MAXPHYADDR.
>
> So unless I'm missing something, or PPC or MIPS has some crazy way for a
> 32-bit host to support 4 TB of guest memory, my vote would be to explicitly
> disallow creating more memslot pages than can fit in an unsigned long.  Then
> x86 KVM could reuse the cached nr_memslot_pages and x86's MMU wouldn't have
> to update a big pile of code to support a scenario that, practically
> speaking, is useless.
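
A side note for anyone skimming the diff below: the
"(kvm->nr_memslot_pages + new.npages) < kvm->nr_memslot_pages" test it adds
is the usual unsigned wraparound check; unsigned addition in C wraps modulo
2^N, so the sum is smaller than an operand exactly when it wrapped.  A
minimal standalone sketch of the idiom (plain userspace C with a made-up
helper name and values, not the kernel code):

	#include <assert.h>
	#include <limits.h>
	#include <stdbool.h>

	/* Would adding "add" pages to the current total wrap an unsigned long? */
	static bool nr_pages_would_wrap(unsigned long cur, unsigned long add)
	{
		return cur + add < cur;
	}

	int main(void)
	{
		assert(!nr_pages_would_wrap(1024, 1024));		/* no wrap */
		assert(nr_pages_would_wrap(ULONG_MAX - 10, 11));	/* wraps to 0 */
		return 0;
	}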
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 72b329e82089..acabdbdef5cf 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -552,6 +552,7 @@ struct kvm {
>  	 */
>  	struct mutex slots_arch_lock;
>  	struct mm_struct *mm; /* userspace tied to this vm */
> +	unsigned long nr_memslot_pages;
>  	struct kvm_memslots __rcu *memslots[KVM_ADDRESS_SPACE_NUM];
>  	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 8bf4b89cfa03..c63fc5c05322 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1567,6 +1567,15 @@ static void kvm_commit_memory_region(struct kvm *kvm,
>  				     const struct kvm_memory_slot *new,
>  				     enum kvm_mr_change change)
>  {
> +	/*
> +	 * Update the total number of memslot pages before calling the arch
> +	 * hook so that architectures can consume the result directly.
> +	 */
> +	if (change == KVM_MR_DELETE)
> +		kvm->nr_memslot_pages -= old->npages;
> +	else if (change == KVM_MR_CREATE)
> +		kvm->nr_memslot_pages += new->npages;
> +
>  	kvm_arch_commit_memory_region(kvm, old, new, change);
>
>  	/*
> @@ -1738,6 +1747,9 @@ int __kvm_set_memory_region(struct kvm *kvm,
>  		if (!old || !old->npages)
>  			return -EINVAL;
>
> +		if (WARN_ON_ONCE(kvm->nr_memslot_pages < old->npages))
> +			return -EIO;
> +
>  		memset(&new, 0, sizeof(new));
>  		new.id = id;
>  		new.as_id = as_id;
> @@ -1756,6 +1768,13 @@ int __kvm_set_memory_region(struct kvm *kvm,
>
>  	if (!old || !old->npages) {
>  		change = KVM_MR_CREATE;
> +
> +		/*
> +		 * To simplify KVM internals, the total number of pages across
> +		 * all memslots must fit in an unsigned long.
> +		 */
> +		if ((kvm->nr_memslot_pages + new.npages) < kvm->nr_memslot_pages)
> +			return -EINVAL;
>  	} else { /* Modify an existing slot. */
>  		if ((new.userspace_addr != old->userspace_addr) ||
>  		    (new.npages != old->npages) ||
>

Capping total n_memslots_pages makes sense to me to avoid the (existing)
nr_mmu_pages wraparound issue; I will update the next patchset version
accordingly.

Thanks,
Maciej
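
P.S. For completeness, a standalone illustration of the operand-order issue
from the nr_mmu_pages sub-thread above (plain userspace C with an example
page count picked purely for illustration; only the KVM_PERMILLE_MMU_PAGES
value of 20 comes from the discussion itself):

	#include <stdio.h>

	#define KVM_PERMILLE_MMU_PAGES 20	/* 20 per mille == 2% */

	int main(void)
	{
		unsigned long long nr_pages = 1000000;	/* example memslot page total */

		/* Divide first: 20 / 1000 is 0 in integer math, so the result is 0. */
		unsigned long long wrong = nr_pages * (KVM_PERMILLE_MMU_PAGES / 1000);

		/*
		 * Multiply first, then divide, to keep the 2% ratio.  As noted
		 * above, a u64 page count on 32-bit KVM would need do_div() or
		 * similar for this division.
		 */
		unsigned long long right = nr_pages * KVM_PERMILLE_MMU_PAGES / 1000;

		printf("wrong = %llu, right = %llu\n", wrong, right);	/* 0 vs 20000 */
		return 0;
	}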