Received: by 2002:a25:ab43:0:0:0:0:0 with SMTP id u61csp768193ybi; Fri, 24 May 2019 11:07:06 -0700 (PDT) X-Google-Smtp-Source: APXvYqykK242DdehFUKqAu4NeRXG7cVD0BPSt81HxEpuYBKsjwf9vqekXDV0ZrAYi02lTXMJHWXy X-Received: by 2002:a17:902:5c5:: with SMTP id f63mr106385920plf.327.1558721226252; Fri, 24 May 2019 11:07:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1558721226; cv=none; d=google.com; s=arc-20160816; b=jkRcLluafhHZZnMb9Q+eM6SsrZ2E1DhX2tMYXjBKu+4+SCihumINLDGr6xvguIoxs5 HQoUsdD1ONfMXcGU81CPGW3pFnN8xYqcH0OJPQpEr5cNnzHs3Yl4mapV2jJFt4f/S6KB WsTjo6Vym1aE0gXtvrBl0fbNVsl+6jNyLAoabACzVjZuYjNz0cPCcGKbV3MQJKmAnFfu Wlj8S6Q6gaE5GckhMxDInw+jwbPXc7CsJkikLGvzqGoUArRkSNNSrPTNQ9ERKT/ZN/5Q PdB6yDkch2m0Rf76jLSh6bQ9UMxi30zfLXwJfmo1ZR8/BaOmSMWJiIBzG0fwN6G9pkKM REfw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=ac3B/Xe+w3ROTUibljmXKgMNFheZh3fEEGPnecMKhL8=; b=mXRhRBOf4j5e0ADfWvyuyXogGiaigiSLtlp0hLC7cG+jwP7ljgQV6PAcI19aCnaAmL YkUBs9j+6D+nUJ2xBerQp+/xzRS/YcWbNECeg/2vK6vkPlK14v+mG+p4KPE99aPX/XeI fcpBjDDA3Z8hlwoQ23wDdvGBOSaiut/n6EYWNopCpXYAOMflsIhikubEb8SeCLHonz9H KdffgLIjxBlOHQE/2UMGZqKbQNbjaau1fqm+De/7BbCjov0/8XvTuIBVoWvyKj9iqokb yxMxXWYQr10nC/AGvpsaFgJiZo2c2bLSkpMhTSIStjREgJez5HVQ4V1Wr5PLQbIcwDD6 vfeg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=r1OoUil1; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t28si5208382pfe.145.2019.05.24.11.06.50; Fri, 24 May 2019 11:07:06 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=r1OoUil1; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728739AbfEXSEU (ORCPT + 99 others); Fri, 24 May 2019 14:04:20 -0400 Received: from mail-yw1-f68.google.com ([209.85.161.68]:44410 "EHLO mail-yw1-f68.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727138AbfEXSET (ORCPT ); Fri, 24 May 2019 14:04:19 -0400 Received: by mail-yw1-f68.google.com with SMTP id e74so3954768ywe.11 for ; Fri, 24 May 2019 11:04:18 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=ac3B/Xe+w3ROTUibljmXKgMNFheZh3fEEGPnecMKhL8=; b=r1OoUil15z41t05s7Vkc/zjldTqEYgmmxywmhKuW54yfFoy/DzoLQdlPOyYRIwSPds sEHDpgwMAcqLq1YXHhMikBuTJisl+J2t2x60kdRX3fh+DCltNpsWy5VjfrYhj93J/WV9 hrRLr/L897Xl/6cxVjDjl8EB/mnP3sCPl+QE8oW6qtneUdfiqvfxgUyLSqP/wtg/foth i+Wxt0gU2HJ0rfVhaCvQ6415HVjOy7xHl2d9q0YhQjQFIqCXXQvxMliZxaEgaiXcRr6u Ovm0UwnXvMH3SnusLsATmcCksvmzuf4hl8kotBCewE2zc/q8LmW+uDruXRlmZFlvwfT8 EPlw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=ac3B/Xe+w3ROTUibljmXKgMNFheZh3fEEGPnecMKhL8=; b=bpUDz9WK0QsJ4Vloa3nMt65h2Gp88QlIXt0tPoO7qMsZubpsTZMrFYeqayA0nVpEnL cjAx9y6X3MhesnwCgutjkuFi3uz+qyhvXzF0uluQHYlsUrrcHY9JBA2t/Bq4fD//ofJQ gFHNwkn17R9S6eZdYiedB+J1sqRY74IaC3jxSK5R125ZDgvhtfnek3r0zwHiYKkzmJn1 MFVqS3Rpy7uxQvXwGat99v4svFOfXyiuH6fvKymdDPr/32oMjrnT7StXEAifwFIuK8va pjjGPHTwnTYPLTWrVcOciFi2AaYyW8ibG2DB3/rD/a81ku8C9nVSyB0lmPpeFCO1tXd0 o8Iw== X-Gm-Message-State: APjAAAVUmOglzOB8mmlNf667nCJmuOryWF1se0j5DMi4Q+4vBIBTDolf ze3sD2gU1FQJAdlHFEgS9UC7th3sAY9GfGMAQpAwsA== X-Received: by 2002:a0d:d9d7:: with SMTP id b206mr28725213ywe.398.1558721058115; Fri, 24 May 2019 11:04:18 -0700 (PDT) MIME-Version: 1.0 References: <20190524153148.18481-1-hannes@cmpxchg.org> <20190524160417.GB1075@bombadil.infradead.org> <20190524173900.GA11702@cmpxchg.org> In-Reply-To: <20190524173900.GA11702@cmpxchg.org> From: Shakeel Butt Date: Fri, 24 May 2019 11:04:06 -0700 Message-ID: Subject: Re: [PATCH] mm: fix page cache convergence regression To: Johannes Weiner Cc: Matthew Wilcox , Linus Torvalds , Andrew Morton , Linux MM , Cgroups , LKML , Kernel Team Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, May 24, 2019 at 10:41 AM Johannes Weiner wrote: > > On Fri, May 24, 2019 at 09:04:17AM -0700, Matthew Wilcox wrote: > > On Fri, May 24, 2019 at 11:31:48AM -0400, Johannes Weiner wrote: > > > diff --git a/include/linux/xarray.h b/include/linux/xarray.h > > > index 0e01e6129145..cbbf76e4c973 100644 > > > --- a/include/linux/xarray.h > > > +++ b/include/linux/xarray.h > > > @@ -292,6 +292,7 @@ struct xarray { > > > spinlock_t xa_lock; > > > /* private: The rest of the data structure is not to be used directly. */ > > > gfp_t xa_flags; > > > + gfp_t xa_gfp; > > > void __rcu * xa_head; > > > }; > > > > No. I'm willing to go for a xa_flag which says to use __GFP_ACCOUNT, but > > you can't add another element to the struct xarray. > > Ok, we can generalize per-tree gfp flags later if necessary. > > Below is the updated fix that uses an XA_FLAGS_ACCOUNT flag instead. > > --- > From 63a0dbc571ff38f7c072c62d6bc28192debe37ac Mon Sep 17 00:00:00 2001 > From: Johannes Weiner > Date: Fri, 24 May 2019 10:12:46 -0400 > Subject: [PATCH] mm: fix page cache convergence regression > > Since a28334862993 ("page cache: Finish XArray conversion"), on most > major Linux distributions, the page cache doesn't correctly transition > when the hot data set is changing, and leaves the new pages thrashing > indefinitely instead of kicking out the cold ones. > > On a freshly booted, freshly ssh'd into virtual machine with 1G RAM > running stock Arch Linux: > > [root@ham ~]# ./reclaimtest.sh > + dd of=workingset-a bs=1M count=0 seek=600 > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + ./mincore workingset-a > 153600/153600 workingset-a > + dd of=workingset-b bs=1M count=0 seek=600 > + cat workingset-b > + cat workingset-b > + cat workingset-b > + cat workingset-b > + ./mincore workingset-a workingset-b > 104029/153600 workingset-a > 120086/153600 workingset-b > + cat workingset-b > + cat workingset-b > + cat workingset-b > + cat workingset-b > + ./mincore workingset-a workingset-b > 104029/153600 workingset-a > 120268/153600 workingset-b > > workingset-b is a 600M file on a 1G host that is otherwise entirely > idle. No matter how often it's being accessed, it won't get cached. > > While investigating, I noticed that the non-resident information gets > aggressively reclaimed - /proc/vmstat::workingset_nodereclaim. This is > a problem because a workingset transition like this relies on the > non-resident information tracked in the page cache tree of evicted > file ranges: when the cache faults are refaults of recently evicted > cache, we challenge the existing active set, and that allows a new > workingset to establish itself. > > Tracing the shrinker that maintains this memory revealed that all page > cache tree nodes were allocated to the root cgroup. This is a problem, > because 1) the shrinker sizes the amount of non-resident information > it keeps to the size of the cgroup's other memory and 2) on most major > Linux distributions, only kernel threads live in the root cgroup and > everything else gets put into services or session groups: > > [root@ham ~]# cat /proc/self/cgroup > 0::/user.slice/user-0.slice/session-c1.scope > > As a result, we basically maintain no non-resident information for the > workloads running on the system, thus breaking the caching algorithm. > > Looking through the code, I found the culprit in the above-mentioned > patch: when switching from the radix tree to xarray, it dropped the > __GFP_ACCOUNT flag from the tree node allocations - the flag that > makes sure the allocated memory gets charged to and tracked by the > cgroup of the calling process - in this case, the one doing the fault. > > To fix this, allow xarray users to specify per-tree flag that makes > xarray allocate nodes using __GFP_ACCOUNT. Then restore the page cache > tree annotation to request such cgroup tracking for the cache nodes. > > With this patch applied, the page cache correctly converges on new > workingsets again after just a few iterations: > > [root@ham ~]# ./reclaimtest.sh > + dd of=workingset-a bs=1M count=0 seek=600 > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + cat workingset-a > + ./mincore workingset-a > 153600/153600 workingset-a > + dd of=workingset-b bs=1M count=0 seek=600 > + cat workingset-b > + ./mincore workingset-a workingset-b > 124607/153600 workingset-a > 87876/153600 workingset-b > + cat workingset-b > + ./mincore workingset-a workingset-b > 81313/153600 workingset-a > 133321/153600 workingset-b > + cat workingset-b > + ./mincore workingset-a workingset-b > 63036/153600 workingset-a > 153600/153600 workingset-b > > Cc: stable@vger.kernel.org # 4.20+ > Signed-off-by: Johannes Weiner Reviewed-by: Shakeel Butt