Date: Thu, 21 Oct 2021 11:48:31 +0900
From: Sergey Senozhatsky
To: David Matlack
Cc: Sergey Senozhatsky, Paolo Bonzini, Sean Christopherson,
 Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 Suleiman Souhlal, kvm list, LKML
Subject: Re: [PATCHV2 1/3] KVM: x86: introduce kvm_mmu_pte_prefetch structure
References: <20211019153214.109519-1-senozhatsky@chromium.org>
 <20211019153214.109519-2-senozhatsky@chromium.org>

On (21/10/20 08:56), David Matlack wrote:
> > > The spinlock is overkill. I'd suggest something like this:
> > > - When VM-ioctl is invoked to update prefetch count, store it in
> > > kvm_arch. No synchronization with vCPUs needed.
> > > - When a vCPU takes a fault: Read the prefetch count from kvm_arch. If
> > > different than the count at the last fault, re-allocate the vCPU
> > > prefetch array. (So you'll need to add the prefetch array and count to
> > > kvm_vcpu_arch as well.)
> > >
> > > No extra locks are needed. vCPUs that fault after the VM-ioctl will
> > > get the new prefetch count. We don't really care if a prefetch count
> > > update races with a vCPU fault as long as vCPUs are careful to only
> > > read the count once (i.e. use READ_ONCE(vcpu->kvm.prefetch_count)) and
> > > use that. Assuming prefetch count ioctls are rare, the re-allocation
> > > on the fault path will be rare as well.
> >
> > So reallocation from the fault path should happen before the vCPU
> > takes the mmu_lock?
>
> Yes. Take a look at mmu_topup_memory_caches for an example of
> allocating in the fault path prior to taking the mmu lock.
>
> > And READ_ONCE(prefetch_count) should also happen before the vCPU
> > takes mmu_lock, I assume, so we need to pass it as a parameter to all
> > the functions that will access the prefetch array.
>
> Store the value of READ_ONCE(prefetch_count) in struct kvm_vcpu_arch,
> because you also need to know if it changes on the next fault. Then
> you also don't have to add a parameter to a bunch of functions in the
> fault path.
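
Something along these lines, I take it (a sketch only -- the field and
helper names below are made up for illustration, they are not from the
series):

/*
 * Illustrative sketch of the lock-free scheme: called early in the
 * fault path, before mmu_lock is taken.
 */
static int mmu_topup_pte_prefetch_array(struct kvm_vcpu *vcpu)
{
	/* Read the count once; racing with a VM-ioctl update is fine. */
	unsigned int count = READ_ONCE(vcpu->kvm->arch.pte_prefetch_count);
	u64 *ents;

	/* The common case: the array already matches the current count. */
	if (count == vcpu->arch.pte_prefetch_count)
		return 0;

	ents = kcalloc(count, sizeof(*ents), GFP_KERNEL_ACCOUNT);
	if (!ents)
		return -ENOMEM;

	kfree(vcpu->arch.pte_prefetch_ents);
	vcpu->arch.pte_prefetch_ents = ents;
	/* Remember what we sized for, to compare on the next fault. */
	vcpu->arch.pte_prefetch_count = count;
	return 0;
}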
> > > Note: You could apply this same approach to a module param, except
> > > vCPUs would be reading the module param rather than vcpu->kvm during
> > > each fault.
> > >
> > > And the other alternative, like you suggested in the other patch, is
> > > to use a vCPU ioctl. That would side-step the synchronization issue
> > > because vCPU ioctls require the vCPU mutex. So the reallocation could
> > > be done in the ioctl and not at fault time.
> >
> > One more idea, I wonder what you think:
> >
> > There is an upper limit on the number of PTEs we prefault, which is 128
> > as of now, but I think 64 will be good enough, or maybe even 32. So we
> > can always allocate a MAX_PTE_PREFETCH_NUM-entry array in vcpu->arch and
> > the ioctl() will change ->num_ents only, which is always in the
> > (0, MAX_PTE_PREFETCH_NUM - 1] range. This way we never have to
> > reallocate anything, we just adjust the "maximum index" value.
>
> 128 * 8 would be 1KB per vCPU. That is probably reasonable, but I
> don't think the re-allocation would be that complex.

128 is probably too large. What I'm thinking is "32 * 8" per vCPU.
Then it can even be something like:

---

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5271fce6cd65..b3a436f8fdf5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -607,6 +607,11 @@ struct kvm_vcpu_xen {
 	u64 runstate_times[4];
 };
 
+struct kvm_mmu_pte_prefetch {
+	u64 ents[32];
+	unsigned int num_ents; /* max prefetch value for this vCPU */
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -682,6 +687,8 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
 
+	struct kvm_mmu_pte_prefetch mmu_pte_prefetch;
+
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
 	 * In vcpu_run, we switch between the user and guest FPU contexts.
---
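
With that struct, the ioctl side could reduce to a bounds check plus a
store (again just a sketch; the handler name and plumbing here are
hypothetical):

static int kvm_vcpu_set_pte_prefetch(struct kvm_vcpu *vcpu, u64 num_ents)
{
	struct kvm_mmu_pte_prefetch *p = &vcpu->arch.mmu_pte_prefetch;

	/* Bound num_ents by the fixed array size; nothing to reallocate. */
	if (!num_ents || num_ents > ARRAY_SIZE(p->ents))
		return -EINVAL;

	/*
	 * A vCPU ioctl already holds the vCPU mutex, so a plain store
	 * is enough; the fault path only reads num_ents.
	 */
	p->num_ents = num_ents;
	return 0;
}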
> > > Taking a step back, can you say a bit more about your usecase?
> >
> > We are looking at various ways of reducing the number of vm-exits. There
> > is only one VM running on the device (a pretty low-end laptop).
>
> When you say reduce the number of vm-exits, can you be more specific?
> Are you trying to reduce the time it takes for vCPUs to fault in
> memory during VM startup?

VM boot is the test I'm running, yes, and I see some improvements with
pte-prefault == 16.

> I just mention it because there are likely other techniques you can
> apply that would not require modifying KVM code (e.g. prefaulting the
> host memory before running the VM, using the TDP MMU instead of the
> legacy MMU to allow parallel faults, using hugepages to map in more
> memory per fault, etc.)

THP would be awesome. We have THP enabled on devices that have 8-plus
gigabytes of RAM, but we don't have THP enabled on low-end devices that
only have 4 gigabytes of RAM. On low-end devices KVM with THP causes a
host memory regression that we cannot accept, hence for 4 GB devices we
try various "other solutions".

We are using TDP. And somehow I never see (literally never) async PFs;
it's always hva_to_pfn_fast(), hva_to_pfn_slow(), or __direct_map()
from tdp_page_fault().
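
For reference, the "prefault the host memory before running the VM"
idea can be done entirely on the VMM side -- a userspace sketch, not
tied to any particular VMM:

#include <stddef.h>
#include <sys/mman.h>

/*
 * Back guest RAM with a mapping that is populated at mmap() time, so
 * the host pages already exist when the guest first touches them and
 * KVM's fault path can typically resolve via hva_to_pfn_fast().
 */
static void *alloc_guest_ram(size_t size)
{
	void *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	return mem == MAP_FAILED ? NULL : mem;
}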