Date: Wed, 7 Feb 2024 12:24:26 +0000
From: Mark Rutland
To: Will Deacon
Cc: Matthew Wilcox, Nanyong Sun, Catalin Marinas, mike.kravetz@oracle.com,
	muchun.song@linux.dev, akpm@linux-foundation.org,
	anshuman.khandual@arm.com, wangkefeng.wang@huawei.com,
	linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [PATCH v3 0/3] A Solution to Re-enable hugetlb vmemmap optimize
References: <20240113094436.2506396-1-sunnanyong@huawei.com>
	<20240207111252.GA22167@willie-the-truck>
	<20240207121125.GA22234@willie-the-truck>
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20240207121125.GA22234@willie-the-truck>

On Wed, Feb 07, 2024 at 12:11:25PM +0000, Will Deacon wrote:
> On Wed, Feb 07, 2024 at 11:21:17AM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 07, 2024 at 11:12:52AM +0000, Will Deacon wrote:
> > > On Sat, Jan 27, 2024 at 01:04:15PM +0800, Nanyong Sun wrote:
> > > >
> > > > On 2024/1/26 2:06, Catalin Marinas wrote:
> > > > > On Sat, Jan 13, 2024 at 05:44:33PM +0800, Nanyong Sun wrote:
> > > > > > HVO was previously disabled on arm64 [1] due to the lack of the necessary
> > > > > > BBM (break-before-make) logic when changing page tables.
> > > > > > This set of patches fixes this by adding the necessary BBM sequence when
> > > > > > changing page tables, and by supporting vmemmap page fault handling to
> > > > > > fix up kernel address translation faults if the vmemmap is concurrently accessed.
> > > > > I'm not keen on this approach. I'm not even sure it's safe. In the
> > > > > second patch, you take the init_mm.page_table_lock on the fault path but
> > > > > are we sure this is unlocked when the fault was taken?
> > > > I think this situation is impossible. In the implementation of the second
> > > > patch, when the page table is being corrupted
> > > > (the time window when a page fault may occur), vmemmap_update_pte() already
> > > > holds the init_mm.page_table_lock,
> > > > and does not unlock it until the page table update is done. Another thread could not hold
> > > > the init_mm.page_table_lock and
> > > > also trigger a page fault at the same time.
> > > > If I have missed any points in my thinking, please correct me. Thank you.
> > >
> > > It still strikes me as incredibly fragile to handle the fault, and trying
> > > to reason about all the users of 'struct page' is impossible. For example,
> > > can the fault happen from irq context?
> >
> > The pte lock cannot be taken in irq context (which I think is what
> > you're asking?) While it is not possible to reason about all users of
> > struct page, we are somewhat relieved of that work by noting that this is
> > only for hugetlbfs, so we don't need to reason about slab, page tables,
> > netmem or zsmalloc.
>
> My concern is that an interrupt handler tries to access a 'struct page'
> which faults due to another core splitting a pmd mapping for the vmemmap.
> In this case, I think we'll end up trying to resolve the fault from irq
> context, which will try to take the spinlock.

I think that (as per my comments on patch 2) a similar deadlock can happen
on RT even if the vmemmap is only accessed in regular process context, and
at minimum this needs better commentary and/or lockdep assertions.

I'd also prefer that we dropped this for now.

> Avoiding the fault would make this considerably more robust, and the
> architecture has introduced features to avoid break-before-make in some
> circumstances (see FEAT_BBM and its levels), so having this optimisation
> conditional on that would seem to be a better approach in my opinion.

FWIW, that's my position too.

Mark.
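
P.S. To make the locking concern above concrete, the scheme being discussed
has roughly the following shape. This is an illustrative sketch only, not the
actual diff from the series: vmemmap_update_pte() is named in the thread, but
vmemmap_handle_fault() and the exact sequence of calls here are assumptions
on my part.

	#include <linux/mm.h>
	#include <linux/pgtable.h>
	#include <asm/tlbflush.h>

	/*
	 * Sketch: both the updater and the vmemmap fault handler serialise on
	 * init_mm.page_table_lock, so a vmemmap access that faults while the
	 * lock is already held (e.g. from irq context while another CPU is in
	 * the middle of the update, or on RT where this spinlock can sleep)
	 * cannot make progress.
	 */
	static void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte)
	{
		spin_lock(&init_mm.page_table_lock);

		/*
		 * Break-before-make: clear the old entry and invalidate the
		 * TLB before installing the new entry.
		 */
		pte_clear(&init_mm, addr, ptep);
		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
		set_pte_at(&init_mm, addr, ptep, pte);

		spin_unlock(&init_mm.page_table_lock);
	}

	static bool vmemmap_handle_fault(unsigned long addr)
	{
		/*
		 * A fault taken in a context that cannot wait for the lock
		 * (or that already holds it) deadlocks here; this is the case
		 * Will describes for irq context and the one I'm worried
		 * about on RT.
		 */
		spin_lock(&init_mm.page_table_lock);
		/* ... re-walk the kernel page table and retry the access ... */
		spin_unlock(&init_mm.page_table_lock);
		return true;
	}

If we do keep anything along these lines, I'd want at least a
lockdep_assert_held() in the helpers and a comment spelling out exactly which
contexts are allowed to touch the vmemmap.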