Date: Sun, 14 Mar 2021 21:16:35 -0600
From: Yu Zhao
To: Dave Hansen
Cc: linux-mm@kvack.org, Alex Shi, Andrew Morton, Dave Hansen, Hillf Danton,
    Johannes Weiner, Joonsoo Kim, Matthew Wilcox, Mel Gorman, Michal Hocko,
    Roman Gushchin, Vlastimil Babka, Wei Yang, Yang Shi, Ying Huang,
    linux-kernel@vger.kernel.org, page-reclaim@google.com
Subject: Re: [PATCH v1 06/14] mm, x86: support the access bit on non-leaf PMD entries
References: <20210313075747.3781593-1-yuzhao@google.com> <20210313075747.3781593-7-yuzhao@google.com>

On Sun, Mar 14, 2021 at 04:22:03PM -0700, Dave Hansen wrote:
> On 3/12/21 11:57 PM, Yu Zhao wrote:
> > Some architectures support the accessed bit on non-leaf PMD entries
> > (parents) in addition to leaf PTE entries (children) where pages are
> > mapped, e.g., x86_64 sets the accessed bit on a
> > parent when using it as part of linear-address translation [1]. Page
> > table walkers who are interested in the accessed bit on children can
> > take advantage of this: they do not need to search the children when
> > the accessed bit is not set on a parent, given that they have
> > previously cleared the accessed bit on this parent in addition to
> > its children.
>
> I'd like to hear a *LOT* more about how this is going to be used.
>
> The one part of this which is entirely missing is the interaction with
> the TLB and mid-level paging structure caches. The CPU is pretty
> aggressive about setting non-leaf accessed bits when TLB entries are
> created. This *looks* to be depending on that behavior, but it would be
> nice to spell it out explicitly.

Good point. Let me start with a couple of observations we've made:

1) Some applications create very sparse address spaces, for various
reasons. A notable example is those using the Scudo memory allocator:
they usually have double-digit numbers of PTE entries for each PMD
entry (and thousands of VMAs for just a few hundred MBs of memory
usage, sigh...).

2) Scans of an address space (from the reclaim path) are much less
frequent than context switches of it. Even under our heaviest memory
pressure (30%+ overcommitted; guess how much we've profited from
it :) ), the two are still orders of magnitude apart.

Specifically, on our smallest system (2GB, with PCID), we observed no
difference between flushing and not flushing the TLB in terms of page
selections. We actually observed more TLB misses under heavier memory
pressure, and our theory is that this is due to the increased memory
footprint that causes the pressure in the first place.

There are two use cases for the accessed bit on non-leaf PMD entries:
hot tracking and cold tracking. I'll focus on cold tracking, which is
what this series is about.
Since non-leaf entries are more likely to be cached, in theory the
false negative rate is higher compared with leaf entries, as the CPU
won't set the accessed bit again until the next TLB miss. (Here a
false negative means the accessed bit isn't set on an entry that has
been used since we cleared it. And IIRC, there are also false
positives, i.e., the accessed bit is set on entries used only by
speculative execution.) But this is not a problem because of the
second observation aforementioned.

Now let's consider the worst-case scenario: what happens when we hit a
false negative on a non-leaf PMD entry? We think the pages mapped by
the PTE entries of this PMD entry are inactive and try to reclaim
them, until we see the accessed bit set on one of the PTE entries.
This will cost us one futile attempt for, at most, all 512 PTE
entries. A glance at lru_gen_scan_around() in the 11th patch would
explain exactly why. If you are guessing that function embodies the
same idea as "fault around", you are right.

And there are two places that could benefit from this patch (and the
next) immediately, independently of this series. One is
clear_refs_test_walk() in fs/proc/task_mmu.c. The other is
madvise_pageout_page_range() and madvise_cold_page_range() in
mm/madvise.c. Both are page table walkers that clear the accessed bit.

I think I've covered a lot of ground but I'm sure there is a lot more.
So please feel free to add, and I'll include everything we discuss
here in the next version.