Date: Wed, 28 Sep 2022 07:05:50 -0400
From: Marc Zyngier <maz@kernel.org>
To: Oliver Upton <oliver.upton@linux.dev>
Cc: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>,
 James Morse <james.morse@arm.com>, Alexandru Elisei <alexandru.elisei@arm.com>,
 Suzuki K Poulose <suzuki.poulose@arm.com>, Ricardo Koller <ricarkol@google.com>,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 kvmarm@lists.cs.columbia.edu, Quentin Perret <qperret@google.com>
Subject: Re: [PATCH v2] KVM: arm64: Limit stage2_apply_range() batch size to 1GB
Message-ID: <86o7uz7o8h.wl-maz@kernel.org>
References: <20220926222146.661633-1-oliver.upton@linux.dev>
 <86v8p96og1.wl-maz@kernel.org>
On Tue, 27 Sep 2022 15:43:10 -0400, Oliver Upton wrote:
>
> Hi Marc,
>
> On Tue, Sep 27, 2022 at 07:34:22AM -0400, Marc Zyngier wrote:
>
> [...]
>
> > > diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
> > > index c9a13e487187..5d05bb92e129 100644
> > > --- a/arch/arm64/kvm/mmu.c
> > > +++ b/arch/arm64/kvm/mmu.c
> > > @@ -31,6 +31,12 @@ static phys_addr_t hyp_idmap_vector;
> > >
> > >  static unsigned long io_map_base;
> > >
> > > +static inline phys_addr_t stage2_apply_range_next(phys_addr_t addr, phys_addr_t end)
> >
> > Please drop the inline. I'm sure the compiler will perform its
> > magic.
> >
> > Can I also bikeshed a bit about the name? This doesn't "apply"
> > anything, nor does it return the next range. It really computes the
> > end of the current one.
> >
> > Something like stage2_range_addr_end() would at least be consistent
> > with the rest of the arm64 code (grep for _addr_end ...).
>
> Bikeshed all you want :) But yeah, I like your suggestion.
>
> > > +{
> > > +	phys_addr_t boundary = round_down(addr + SZ_1G, SZ_1G);
> >
> > nit: the rest of the code is using ALIGN_DOWN(). Any reason why this
> > can't be used here?
>
> Nope!
>
> > > +
> > > +	return (boundary - 1 < end - 1) ? boundary : end;
> > > +}
> > >
> > >  /*
> > >   * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
> > > @@ -52,7 +58,7 @@ static int stage2_apply_range(struct kvm *kvm, phys_addr_t addr,
> > >  	if (!pgt)
> > >  		return -EINVAL;
> > >
> > > -	next = stage2_pgd_addr_end(kvm, addr, end);
> > > +	next = stage2_apply_range_next(addr, end);
> > >  	ret = fn(pgt, addr, next - addr);
> > >  	if (ret)
> > >  		break;
> > >
> >
> > The main problem I see with this is that some entries now get visited
> > multiple times if they cover more than a single 1GB entry (like a
> > 512GB level-0 entry with 4k pages and 48bit IPA). As long as this
> > isn't destructive (CMOs, for example), this is probably OK.
> > For operations that are not idempotent (such as
> > stage2_unmap_walker), this is a significant change in behaviour.
> >
> > My concern is that we have on one side a walker that is strictly
> > driven by the page-table sizes, and we now get an arbitrary value
> > that isn't necessarily a multiple of the block sizes. Yes, this works
> > right now because you can't create a block mapping larger than 1GB
> > with any of the supported page sizes.
> >
> > But with 52bit VA/PA support, this changes: we can have 512GB (4k),
> > 64GB (16k) and 4TB (64k) block mappings at S2. We don't support this
> > yet at S2, but when this hits, we'll be in potential trouble.
>
> Ah, I didn't fully capture the reasoning about the batch size. I had
> thought about batching by operating on at most 1 block of the largest
> supported granularity, but that felt like an inefficient walk,
> restarting from the root every 32M (for 16K paging).
>
> OTOH, if/when we add support for larger blocks in S2 we will run into
> the same exact problem if we batch on the largest block size. If
> dirty logging caused the large blocks to be shattered down to leaf
> granularity then we will visit a crazy amount of PTEs before
> releasing the lock.
>
> I guess what I'm getting at is that we need to detect lock contention
> and the need to reschedule in the middle of the walk instead of at
> some predetermined boundary, though that ventures into the territory
> of 'TDP MMU' features...
>
> So, it seems to me we can crack this a few ways:
>
>  1. Batch at the arbitrary 1GB, since it works currently and produces
>     a more efficient walk for all page sizes. I can rework some of
>     the logic in kvm_level_supports_block_mapping() such that we can
>     BUILD_BUG_ON() if the largest block size exceeds 1GB. This kicks
>     the can down the road on a better implementation.
>
>  2. Batch by the largest supported block mapping size.
>     This will result in less efficient walks for !4K page sizes and
>     will likely produce soft lockups when we support even larger
>     blocks. Nonetheless, the implementation will remain correct
>     regardless of block size.
>
>  3. Test for lock contention and need_resched() in the middle of the
>     walk, rewalking from wherever we left off when scheduled again.
>     The TDP MMU already does this, so it could be wasted effort
>     adding support for it to the ARM MMU if we are to switch over at
>     some point.
>
> WDYT?

[+ Quentin, who is looking at this as well]

At this stage, I'm more keen on option (2). It solves your immediate
issue, and while it doesn't improve the humongous block mapping case,
it doesn't change the behaviour of the existing walkers (you are
guaranteed to visit a leaf only once per traversal).

Option (3) is more of a long-term goal, and I'd rather we improve
things now.

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.