Message-ID: <38cc14e0-e534-499a-a04f-463871e559de@arm.com>
Date: Mon, 29 Apr 2024 10:37:42 +0530
Subject: Re: [PATCH v1] mm: Fix race between __split_huge_pmd_locked() and GUP-fast
From: Anshuman Khandual
To: Ryan Roberts, Andrew Morton, Zi Yan, "Aneesh Kumar K.V"
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240425170704.3379492-1-ryan.roberts@arm.com>
 <39d0236c-4297-4001-bc7e-f516bc674f7b@arm.com>
 <87df74b7-2352-4ad6-b9f9-baed9291cbed@arm.com>
In-Reply-To: <87df74b7-2352-4ad6-b9f9-baed9291cbed@arm.com>

On 4/26/24 13:13, Ryan Roberts wrote:
> On 26/04/2024 05:19, Anshuman Khandual wrote:
>> On 4/25/24 22:37, Ryan Roberts wrote:
>>> __split_huge_pmd_locked() can be called for a present THP, devmap or
>>> (non-present) migration entry. It calls pmdp_invalidate()
>>> unconditionally on the pmdp and only determines if it is present or not
>>> based on the returned old pmd. This is a problem for the migration entry
>>> case because pmd_mkinvalid(), called by pmdp_invalidate(), must only be
>>> called for a present pmd.
>>
>> pmdp_invalidate() must be called only for a present PMD - is this expected
>> by core MM? Does it cause any problem otherwise?
>
> I'm saying that calling pmdp_invalidate() on a pte_present()==true pte is
> the only semantic that makes sense. And, yes, it causes a problem if called
> on a pte_present()==false pte - that's exactly what I'm describing in this
> commit log.
>
> To labour the point, this is the logical type hierarchy of PTEs (and
> block-mapped PMDs) as I see it:
>
> ---8<----
>
> pte
> |- present
> |  |- valid
> |  |- invalid
> |
> |- not_present
>    |- none
>    |- swap_pte
>
> present: All fields must be interpreted the way the HW sees them. e.g.
>          pte_pfn(), pte_write(), pte_dirty(), pte_young(), pte_mkwrite(),
>          pte_mkold() can all be legitimately used to query and modify the
>          pte.
>
> valid: The HW may access the pte, interpret the fields and create a TLB
>        entry, etc.
>
> invalid: The HW will never access the pte or create a TLB entry for it.
>
> not_present: The fields are SW-defined. HW never accesses the PTE.
>
> none: Unused; represents a hole.
>
> swap_pte: Contains a swap entry and swap pte bits. The contained swap entry
>           may be one of a few different types, e.g. an actual swap entry, a
>           migration entry, hw poison, etc.

Sure, makes sense.

> ---8<----
>
> We test present vs not_present with pte_present()
>
> We test none vs swap_pte with pte_none()

Agreed.
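To restate the taxonomy in terms of the existing helpers (an illustrative
sketch only, nothing new being proposed here):

/* Illustrative only: how the three top-level states are told apart. */
static void classify_pte(pte_t pte)
{
	if (pte_present(pte)) {
		/*
		 * present: fields are HW-defined, so pte_pfn(),
		 * pte_write() etc. may legitimately be used. The pte
		 * may be valid or invalid, which core-mm cannot
		 * generically query.
		 */
	} else if (pte_none(pte)) {
		/* not_present/none: unused, represents a hole. */
	} else {
		/*
		 * not_present/swap_pte: a SW-defined swap, migration
		 * or hwpoison entry; the HW-style field accessors
		 * must not be applied to it.
		 */
	}
}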
>
> valid vs invalid is slightly more vague. The core-mm can move a PMD from
> valid -> invalid by calling pmd_mkinvalid(). But it can't query the state.
> And it can't do this generically for a PTE.
>
> Based on that lot, it makes no sense to me that we should permit calling
> pmd_mkinvalid() on a non-present pte. Indeed, we don't permit calling
> pte_mkwrite() etc. on a non-present pte. And those functions are not
> defensive; they don't check that the pte is present before making the
> change. They just trust that the core-mm will not call them for
> non-present ptes.
>
> The alternative approach would be to make pmdp_invalidate() defensive so
> that it checks the pmd is present before making any changes. But it
> doesn't semantically make sense to invalidate a non-present pmd in the
> first place, so why call it under these circumstances? There is also a
> practical problem in that some arches implement their own
> pmdp_invalidate(), so you would want to make all those defensive too,
> which would grow the size of the change.

I would suggest adding warnings in such arch-specific pmdp_invalidate()
helpers to catch further unexpected calls in a non-present PMD state.
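Something like the following, say (a sketch only - the helper name here is
made up, and each arch override would call it on entry):

/*
 * Hypothetical helper mirroring the check being added to the generic
 * pmdp_invalidate(); arch-specific pmdp_invalidate[_ad]() overrides
 * would call this before touching the pmd.
 */
static inline void pmdp_invalidate_assert_present(pmd_t *pmdp)
{
	/*
	 * A non-present pmd here is a migration/swap entry, and
	 * invalidating one corrupts it - warn loudly.
	 */
	VM_WARN_ON_ONCE(!pmd_present(*pmdp));
}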
>
>>> On arm64 at least, pmd_mkinvalid() will mark the pmd such that any
>>> future call to pmd_present() will return true. And therefore any
>>
>> IIRC the following semantics need to be followed, as expected by core MM:
>>
>> -------------------------------------------------------------------------
>> | PMD states     | pmd_present | pmd_trans_huge |
>> -------------------------------------------------------------------------
>> | Mapped         | Yes         | Yes            |
>> -------------------------------------------------------------------------
>> | Splitting      | Yes         | Yes            |
>> -------------------------------------------------------------------------
>> | Migration/Swap | No          | No             |
>> -------------------------------------------------------------------------
>
> Indeed. The problem, as I see it, is that if pmd_mkinvalid() is called on
> a "Migration/Swap" pmd, then a future call to pmd_present() will return
> Yes, which is clearly wrong. pmd_trans_huge() will also return Yes due to:
>
> static inline int pmd_trans_huge(pmd_t pmd)
> {
> 	return pmd_val(pmd) && pmd_present(pmd) && !(pmd_val(pmd) & PMD_TABLE_BIT);
> }

Agreed, pmd_mkinvalid() on "Migration/Swap" entries making any subsequent
pmd_present() and pmd_trans_huge() return true is problematic. As you said,
the solution is to prevent pmd_mkinvalid() on "Migration/Swap" entries.

> At least this happens for arm64, although in the other email Zi suggests
> that other arches look like they will do this too.
>
> The reason is that arm64's pmd_mkinvalid() unconditionally sets
> PMD_PRESENT_INVALID (bit 59) and clears PMD_SECT_VALID (bit 0) in the pte.
> So the next time pmd_present() is called it will see PMD_PRESENT_INVALID
> is set and return true.
>
>>
>>> lockless pgtable walker could see the migration entry pmd in this state
>>> and start interpreting the fields as if it were present, leading to
>>> BadThings (TM). GUP-fast appears to be one such lockless pgtable walker.
>>
>> Could you please explain how bad things might happen?
>
> See the 2 places where pmdp_get_lockless() is called in gup.c, without the
> PTL. These could both return the swap pte for which pmd_mkinvalid() has
> been called. In both cases, this would lead to the pmd_present() check
> erroneously returning true, eventually causing incorrect interpretation of
> the pte fields. e.g.:
>
> gup_pmd_range()
> 	pmd_t pmd = pmdp_get_lockless(pmdp);
> 	gup_huge_pmd(pmd, ...)
> 		page = nth_page(pmd_page(orig), (addr & ~PMD_MASK) >> PAGE_SHIFT);
>
> page is guff.
>
> Let me know what you think!

Agreed, the page here might not be valid any more for the GUP, as the
migration would have changed the pfn in the meantime.
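To spell out why, this is roughly the pattern at issue (an illustrative
sketch, simplified from the gup.c paths above; refcounting and error
handling omitted):

/*
 * A lockless walker reads the pmd once without holding the PTL, so
 * everything derived from that value must be safe against concurrent
 * modification of the page table.
 */
static unsigned long lockless_pmd_to_pfn(pmd_t *pmdp, unsigned long addr)
{
	pmd_t pmd = pmdp_get_lockless(pmdp);	/* no PTL held */

	if (!pmd_present(pmd))
		return 0;	/* swap/migration entry: fields are SW-defined */

	/*
	 * Only safe if pmd_mkinvalid() is never applied to a
	 * non-present pmd: otherwise the check above can pass for a
	 * migration entry whose "pfn" bits are really a swap offset,
	 * and pmd_page() hands back a bogus page.
	 */
	return page_to_pfn(pmd_page(pmd)) + ((addr & ~PMD_MASK) >> PAGE_SHIFT);
}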
> Thanks,
> Ryan
>
>>
>>> I suspect the same is possible on other architectures.
>>>
>>> Fix this by only calling pmdp_invalidate() for a present pmd. And for
>>> good measure let's add a warning to the generic implementation of
>>> pmdp_invalidate(). I've manually reviewed all other
>>> pmdp_invalidate[_ad]() call sites and believe all others to be
>>> conformant.
>>>
>>> This is a theoretical bug found during code review. I don't have any
>>> test case to trigger it in practice.
>>>
>>> Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
>>> Signed-off-by: Ryan Roberts
>>> ---
>>>
>>> Applies on top of v6.9-rc5. Passes all the mm selftests on arm64.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>
>>>  mm/huge_memory.c     | 5 +++--
>>>  mm/pgtable-generic.c | 2 ++
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 89f58c7603b2..80939ad00718 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -2513,12 +2513,12 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  	 * for this pmd), then we flush the SMP TLB and finally we write the
>>>  	 * non-huge version of the pmd entry with pmd_populate.
>>>  	 */
>>> -	old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>>
>>> -	pmd_migration = is_pmd_migration_entry(old_pmd);
>>> +	pmd_migration = is_pmd_migration_entry(*pmd);
>>>  	if (unlikely(pmd_migration)) {
>>>  		swp_entry_t entry;
>>>
>>> +		old_pmd = *pmd;
>>>  		entry = pmd_to_swp_entry(old_pmd);
>>>  		page = pfn_swap_entry_to_page(entry);
>>>  		write = is_writable_migration_entry(entry);
>>> @@ -2529,6 +2529,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		soft_dirty = pmd_swp_soft_dirty(old_pmd);
>>>  		uffd_wp = pmd_swp_uffd_wp(old_pmd);
>>>  	} else {
>>> +		old_pmd = pmdp_invalidate(vma, haddr, pmd);
>>>  		page = pmd_page(old_pmd);
>>>  		folio = page_folio(page);
>>>  		if (pmd_dirty(old_pmd)) {
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index 4fcd959dcc4d..74e34ea90656 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -198,6 +198,7 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>  		      pmd_t *pmdp)
>>>  {
>>> +	VM_WARN_ON(!pmd_present(*pmdp));
>>>  	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>>>  	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>>>  	return old;
>>> @@ -208,6 +209,7 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>  pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
>>>  			 pmd_t *pmdp)
>>>  {
>>> +	VM_WARN_ON(!pmd_present(*pmdp));
>>>  	return pmdp_invalidate(vma, address, pmdp);
>>>  }
>>>  #endif
>>> --
>>> 2.25.1