From: Anshuman Khandual
Subject: Re: [RFC 1/2] arm64/mm: Change THP helpers to comply with generic MM semantics
To: Catalin Marinas
Cc: linux-mm@kvack.org, Will Deacon, Mark Rutland, Marc Zyngier,
 Suzuki Poulose, linux-arm-kernel@lists.infradead.org,
 linux-kernel@vger.kernel.org, Andrea Arcangeli
References: <1561639696-16361-1-git-send-email-anshuman.khandual@arm.com>
 <1561639696-16361-2-git-send-email-anshuman.khandual@arm.com>
 <20190628102003.GA56463@arrakis.emea.arm.com>
Message-ID: <82237e21-1f14-ab6e-0f80-9706141e2172@arm.com>
In-Reply-To: <20190628102003.GA56463@arrakis.emea.arm.com>
Date: Tue, 2 Jul 2019 09:07:28 +0530

On 06/28/2019 03:50 PM, Catalin Marinas wrote:
> Hi Anshuman,

Hello Catalin,

> On Thu, Jun 27, 2019 at 06:18:15PM +0530, Anshuman Khandual wrote:
>> pmd_present() and pmd_trans_huge() are expected to behave in the
>> following manner during the various phases of a given PMD. This is
>> derived from a previous detailed discussion on the topic [1] and the
>> present THP documentation [2].
>>
>> pmd_present(pmd):
>>
>> - Returns true if pmd refers to system RAM with a valid pmd_page(pmd)
>> - Returns false if pmd does not refer to system RAM - invalid pmd_page(pmd)
>>
>> pmd_trans_huge(pmd):
>>
>> - Returns true if pmd refers to system RAM and is a trans huge mapping
>>
>> ----------------------------------------------------------
>> | PMD states     | pmd_present     | pmd_trans_huge      |
>> ----------------------------------------------------------
>> | Mapped         | Yes             | Yes                 |
>> ----------------------------------------------------------
>> | Splitting      | Yes             | Yes                 |
>> ----------------------------------------------------------
>> | Migration/Swap | No              | No                  |
>> ----------------------------------------------------------
>
> Before we actually start fixing this, I would strongly suggest that you
> add a boot selftest (see lib/Kconfig.debug for other similar cases)
> which checks the consistency of the page table macros w.r.t. the
> expected mm semantics. Once the mm maintainers agree with the
> semantics, it will really help architecture maintainers in implementing
> them correctly.

Sure, and it will help all architectures stay in sync wrt these semantics.

> You wouldn't need actual page tables, just things like assertions on
> pmd_trans_huge(pmd_mkhuge(pmd)) == true. You could go further and have
> checks on pmdp_invalidate(&dummy_vma, dummy_addr, &dummy_pmd) with the
> dummy_* variables on the stack.

Hmm. I guess macros which operate directly on a page table entry will be
okay, but the ones which check for specific states of the VMA or MM might
be a bit tricky - emulate VMA/MM states while on the stack? But sure, will
explore adding such a test.
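To make that concrete, here is a rough sketch of what such a boot-time
check could assert. Illustrative only, not an actual patch: the init
function name is made up, and pfn_pmd()/pmd_mkhuge() availability varies
slightly by architecture.

/*
 * Rough sketch only - assumes a kernel build with
 * CONFIG_TRANSPARENT_HUGEPAGE. Function name is made up for illustration.
 */
#include <linux/init.h>
#include <linux/bug.h>
#include <linux/mm.h>

static int __init pmd_semantics_selftest(void)
{
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	/* Build a dummy trans huge PMD on the stack from an arbitrary pfn. */
	pmd_t pmd = pmd_mkhuge(pfn_pmd(0UL, PAGE_KERNEL));

	/* A mapped huge PMD must look present and trans huge. */
	WARN_ON(!pmd_present(pmd));
	WARN_ON(!pmd_trans_huge(pmd));

	/*
	 * Per the semantics in the table above, an invalidated (splitting)
	 * PMD must still be pmd_present() and pmd_trans_huge() even though
	 * it is no longer valid for the MMU to walk.
	 */
	pmd = pmd_mknotpresent(pmd);
	WARN_ON(!pmd_present(pmd));
	WARN_ON(!pmd_trans_huge(pmd));
#endif
	return 0;
}
late_initcall(pmd_semantics_selftest);

Something along these lines would flag an architecture whose
pmd_mknotpresent() breaks the splitting protocol right at boot, rather
than as a subtle THP corruption later.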
>> The problem:
>>
>> PMD is first invalidated with pmdp_invalidate() before its splitting.
>> This invalidation clears PMD_SECT_VALID as below.
>>
>> PMD split -> pmdp_invalidate() -> pmd_mknotpresent() -> clears PMD_SECT_VALID
>>
>> Once PMD_SECT_VALID gets cleared, pmd_present() returns false on the
>> PMD entry.
>
> I think that's an inconsistency in the expected semantics here. Do you
> mean that pmd_present(pmd_mknotpresent(pmd)) should be true?

Actually that is true (more so if we are using the generic
pmdp_invalidate()). In general, pmd_present() needs to keep returning
true after pmdp_invalidate() in order to correctly represent a splitting
THP. That is what Andrea explained earlier on this thread
(https://lkml.org/lkml/2018/10/17/231). Extracting the relevant sections
from that thread:

"pmd_present never meant the real present bit in the pte was set, it
just means the pmd points to RAM. It means it doesn't point to swap or
migration entry and you can do pmd_to_page and it works fine."

"The clear of the real present bit during pmd (virtual) splitting is
done with pmdp_invalidate, that is created specifically to keep
pmd_trans_huge=true, pmd_present=true despite the present bit not being
set. So you could imagine _PAGE_PSE as the real present bit."

pmd_present() and pmd_mknotpresent() are not exact inverses. The problem
is that every platform using the generic pmdp_invalidate() calls
pmd_mknotpresent(), which invariably removes the valid or present bit
from the entry. The point to note here is that pmd_mknotpresent()
invalidates the entry from the MMU's point of view, whereas
pmd_present() does not check for an MMU-valid PMD entry. Hence
pmd_present(pmd_mknotpresent(pmd)) can still be true.

In the absence of a positive section mapping bit on arm64, PTE_SPECIAL
is set temporarily to remember that this was a mapped PMD which got
invalidated recently but still points to memory, so pmd_present() must
evaluate true for it. pmd_mknotpresent() does not make the entry
!pmd_present(); it just invalidates it.

> If not, do we need to implement our own pmdp_invalidate() or change the
> generic one to set a "special" bit instead of just a pmd_mknotpresent?

arm64 could subscribe to __HAVE_ARCH_PMDP_INVALIDATE and implement its
own pmdp_invalidate(), in order to not call pmd_mknotpresent() and
instead operate on the invalid and special bits directly. That is not
going to alter the relevant semantics here, but AFAICS it might be a bit
better, as it saves pmd_mknotpresent() from setting a special bit which
it is not supposed to touch.

AFAICS there is no compelling reason for the generic pmdp_invalidate()
to change either. It calls pmd_mknotpresent(), which invalidates the
entry through the valid or present bit, and platforms which have a
dedicated huge page bit still test positive for pmd_present() after the
invalidation, so it works for them. A platform specific override is only
required when invalidation via pmd_mknotpresent() alone is not enough.
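For illustration, a minimal sketch of what such an override could look
like - not the actual patch. The pmd_mkinvalid() helper name is made up
here, and the sketch assumes the usual arm64 definitions
(PMD_SECT_VALID, PTE_SPECIAL) plus the generic pmdp_establish() and
flush_pmd_tlb_range() helpers.

/*
 * Hypothetical arm64 override, with __HAVE_ARCH_PMDP_INVALIDATE defined.
 * A sketch for discussion only, not the actual patch.
 */

/* Invalidate the entry for the MMU while keeping pmd_present() true. */
static inline pmd_t pmd_mkinvalid(pmd_t pmd)	/* name made up */
{
	/*
	 * PTE_SPECIAL is (ab)used as a software marker meaning "this PMD
	 * still maps RAM, it is merely invalid while the THP is split".
	 */
	pmd = __pmd(pmd_val(pmd) | PTE_SPECIAL);
	return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID);
}

pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
		      pmd_t *pmdp)
{
	/* Atomically install the invalid entry and return the old one. */
	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

	flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
	return old;
}

With something like this, pmd_mknotpresent() would not need to know
about the splitting protocol at all, which also touches on the concern
about its semantics below.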
"The clear of the real present bit during pmd (virtual) splitting is done with pmdp_invalidate, that is created specifically to keeps pmd_trans_huge=true, pmd_present=true despite the present bit is not set. So you could imagine _PAGE_PSE as the real present bit." pmd_present() and pmd_mknotpresent() are not exact inverse. Problem is all platforms using generic pmdp_invalidate() calls pmd_mknotpresent() which invariably across platforms remove the valid or present bit from the entry. The point to note here is that pmd_mknotpresent() invalidates the entry from MMU point of view but pmd_present() does not check for a MMU valid PMD entry. Hence pmd_present(pmd_mknotpresent(pmd)) can still be true. In absence of a positive section mapping bit on arm64, PTE_SPECIAL is being set temporarily to remember that it was a mapped PMD which got invalidated recently but which still points to memory. Hence pmd_present() must evaluate true. pmd_mknotpresent() does not make !pmd_present() it just invalidates the entry. > we need to implement our own pmdp_invalidate() or change the generic one > to set a "special" bit instead of just a pmd_mknotpresent? Though arm64 can subscribe __HAVE_ARCH_PMDP_INVALIDATE and implement it's own pmdp_invalidate() in order to not call pmd_mknotpresent() and instead operate on the invalid and special bits directly. But its not going to alter relevant semantics here. AFAICS it might be bit better as it saves pmd_mknotpresent() from putting in that special bit in there which it is not supposed do. IFAICS there is no compelling reason for generic pmdp_invalidate() to change either. It calls pmd_mknotpresent() which invalidates the entry through valid or present bit and platforms which have dedicated huge page bit can still test positive for pmd_present() after it's invalidation. It works for such platforms. Platform specific override is required when invalidation via pmd_mknotpresent() is not enough. > >> +static inline int pmd_present(pmd_t pmd) >> +{ >> + if (pte_present(pmd_pte(pmd))) >> + return 1; >> + >> + return pte_special(pmd_pte(pmd)); >> +} > [...] >> +static inline pmd_t pmd_mknotpresent(pmd_t pmd) >> +{ >> + pmd = pte_pmd(pte_mkspecial(pmd_pte(pmd))); >> + return __pmd(pmd_val(pmd) & ~PMD_SECT_VALID); >> +} > > I'm not sure I agree with the semantics here where pmd_mknotpresent() > does not actually make pmd_present() == false. As Andrea explained, pmd_present() does not check validity of the PMD entry from MMU perspective but the presence of a valid pmd_page() which still refers to a valid struct page in the memory. It is irrespective of whether the entry in itself is valid for MMU walk or not. + Cc: Andrea Arcangeli I have added Andrea on this thread if he would like to add something. - Anshuman