Date: Tue, 16 Oct 2018 22:09:30 -0400
From: Andrea Arcangeli
To: Zi Yan
Cc: Anshuman Khandual, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kirill.shutemov@linux.intel.com, akpm@linux-foundation.org, mhocko@suse.com, will.deacon@arm.com, Naoya Horiguchi
Subject: Re: [PATCH] mm/thp: Correctly differentiate between mapped THP and PMD migration entry
Message-ID: <20181017020930.GN30832@redhat.com>
Hello Zi,

On Sun, Oct 14, 2018 at 08:53:55PM -0400, Zi Yan wrote:
> Hi Andrea, what is the purpose/benefit of making x86’s pmd_present() return true
> for a THP under splitting? Does it cause problems when ARM64’s pmd_present()
> returns false in the same situation?

!pmd_present means the pmd is a migration or swap entry and doesn't point
to RAM, so pmd_to_page(*pmd) would return an undefined result.

During splitting the physical page is still very much pointed to by the
pmd, as long as pmd_trans_huge returns true and you hold the pmd_lock.

pmd_trans_huge must be true at all times for a transhuge pmd that points
to a hugepage, or the VM fast paths won't serialize with the pmd_lock.
That is the only reason why, and it's a very good reason: it avoids
taking the pmd_lock when walking over non-transhuge pmds (i.e. when no
THPs are allocated).

Now, if we have to keep _PAGE_PSE set and keep pmd_trans_huge returning
true at all times, why would you want pmd_present to return false? How
could it help if pmd_trans_huge returns true but pmd_present returns
false, even though pmd_to_page works fine and the pmd really is still
pointing to the page?

When userland faults on such a !pmd_present pmd, the page fault takes
the swap or migration path, but that's the wrong path if the pmd points
to RAM.

What we need to do during the split is to invalidate the huge TLB entry.
There's no pmd_trans_splitting anymore, so we only clear the present bit
in the pmd even though pmd_present still returns true (just like
PROT_NONE, nothing new in this respect). pmd_present never meant the
hardware present bit was set; it only means the pmd points to RAM, i.e.
it isn't a swap or migration entry and pmd_to_page works fine on it.

We need to invalidate the TLB by clearing the present bit and flushing
the TLB before overwriting the transhuge pmd with the regular ptes (i.e.
before making it non-huge). That is actually required by an erratum (L1
cache aliasing of the same mapping through two TLB entries of two
different sizes broke some old CPUs and triggered machine checks). It's
not something fundamentally necessary from a common-code point of view;
it's a risk from a hardware (not software) standpoint. And before you
can get rid of the pmd you need a TLB flush anyway, to be sure the other
CPUs stop using it, so you may as well clear the present bit before
doing the really costly thing (the TLB flush with IPIs). Clearing the
present bit during the TLB flush is a cost that gets lost in the noise.
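To make that concrete, this is roughly what the x86 helpers end up
looking like (a simplified sketch in the spirit of
arch/x86/include/asm/pgtable.h, not the verbatim source; the real
versions also deal with _PAGE_DEVMAP and friends):

static inline int pmd_present(pmd_t pmd)
{
	/*
	 * _PAGE_PSE is never cleared while the pmd maps a hugepage, so a
	 * pmd under splitting (or a PROT_NONE pmd) still counts as
	 * present: it points to RAM, not to a swap or migration entry.
	 */
	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
}

static inline int pmd_trans_huge(pmd_t pmd)
{
	/* stays true at all times while the pmd points to a hugepage */
	return pmd_val(pmd) & _PAGE_PSE;
}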
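And the split itself is, schematically, the sequence just described:
invalidate and flush first, repopulate with regular ptes last. This is a
heavily simplified sketch of what __split_huge_pmd_locked() does under
the pmd_lock (split_pmd_sketch is just an illustrative name; dirty/young
bits, migration entries, mapcount accounting and the freeze handling are
all omitted):

static void split_pmd_sketch(struct vm_area_struct *vma, pmd_t *pmd,
			     unsigned long haddr)
{
	struct mm_struct *mm = vma->vm_mm;
	struct page *page;
	pgtable_t pgtable;
	pmd_t old_pmd, _pmd;
	unsigned long addr;
	int i;

	/*
	 * Clear the hardware present bit and flush the huge TLB entry.
	 * pmd_trans_huge()/pmd_present() keep returning true because
	 * _PAGE_PSE stays set, so the other VM paths still serialize on
	 * the pmd_lock.
	 */
	old_pmd = pmdp_invalidate(vma, haddr, pmd);
	page = pmd_page(old_pmd);

	/* Withdraw the deposited page table and prefill it with regular ptes. */
	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
	pmd_populate(mm, &_pmd, pgtable);
	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		pte_t *pte = pte_offset_map(&_pmd, addr);

		set_pte_at(mm, addr, pte, mk_pte(page + i, vma->vm_page_prot));
		pte_unmap(pte);
	}

	/* Only now is the (already invalidated) huge pmd replaced for real. */
	smp_wmb();
	pmd_populate(mm, pmd, pgtable);
}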
The clearing of the real present bit during the pmd (virtual) split is
done with pmdp_invalidate, which was created specifically to keep
pmd_trans_huge=true and pmd_present=true even though the present bit is
not set. So you could imagine _PAGE_PSE as the real present bit.

Before the physical split was deferred and decoupled from the virtual
memory pmd split, pmd_trans_splitting allowed us to wait for the split
to finish and to keep all gup_fast at bay during it (while the page was
still mapped readable and writable in userland by other CPUs). Now the
physical split is deferred, so you just split the pmd locally, and only
a physical split invoked on the page (not the virtual split invoked on
the pmd with split_huge_pmd) has to keep gup at bay. It does so by
freezing the refcount, so all gup_fast callers fail in
page_cache_get_speculative during the freeze. This removed the need for
the pmd_trans_splitting check in gup_fast (when the splitting bit was
set, gup_fast had to fall back to the non-fast gup), but it means a
hugepage can no longer be physically split while it is gup-pinned. The
main difference is that freezing the refcount can fail, so the code must
learn to cope with such a failure and defer the split.

Decoupling the physical and virtual splits also introduced the need to
track the doublemap case with a new PG_double_map flag. It makes the
refcounting of hugepages trivial in comparison (identical to hugetlbfs,
in fact), but it requires total_mapcount to account for all those huge
and non-huge mappings. It primarily pays off for adding THP to tmpfs,
where the physical split may have to be deferred for pagecache reasons
anyway.

Thanks,

Andrea
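PS: the refcount freeze vs gup_fast interplay above looks, schematically,
like this (try_freeze_for_split and gup_fast_grab_sketch are illustrative
names, not the real functions; page_ref_freeze and
page_cache_get_speculative are the real helpers):

/*
 * Physical split side (sketch of what split_huge_page_to_list relies
 * on): the split only proceeds if the refcount can be frozen to 0,
 * i.e. nobody holds an extra (gup) pin. extra_pins counts the
 * references we expect from the mappings/pagecache.
 */
static bool try_freeze_for_split(struct page *head, int extra_pins)
{
	/* Atomically replace the expected refcount with 0, or fail. */
	return page_ref_freeze(head, 1 + extra_pins);
}

/*
 * gup_fast side (sketch): the speculative reference fails while the
 * refcount is frozen, so gup_fast backs off instead of pinning a page
 * in the middle of a physical split.
 */
static struct page *gup_fast_grab_sketch(struct page *head)
{
	if (!page_cache_get_speculative(head))
		return NULL;	/* frozen: fall back to the slow gup path */
	return head;
}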