Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp8041425rwb; Wed, 23 Nov 2022 14:39:17 -0800 (PST) X-Google-Smtp-Source: AA0mqf7SzcIMKcc6N/Gs9R4qWHmxNB8LpfymBcSZ4eV1Vn6qaHML4jExyL4Q2WfTz5OgdWsL1Zft X-Received: by 2002:a17:906:4f0d:b0:73d:dfb2:d188 with SMTP id t13-20020a1709064f0d00b0073ddfb2d188mr25842047eju.426.1669243157029; Wed, 23 Nov 2022 14:39:17 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1669243157; cv=none; d=google.com; s=arc-20160816; b=eSbRTLnNeXP5Z8ZG31doLl6jn9wQc/KhEFLePt/MvEiqz61GVxnPNw7S1LMWEUrdpf +dQKqv32cg4vUrIqFmue7j8ft1o7NomWcISBgPUTZwouQLmEN01wMbzcbGQy8DjTKmPU 77fv3fAdVtUQGJHJyVCA3Jfjk2pa6qsrj9Qtm9KA0d+hGi9yT7Xgy6spvs1rVeDe0XM3 Qr+6DY4ls/4YnLh23K0bbMCW/26GWgEja7gsQxsYqO2vYU33joGwpfqBr+8QPUK+JsR8 yD0U7ENpIFJ2loIcbLM36mYqqfXY5JnthumUiWySqri+VcxkAXC3gdBaYXGQRZqfYpQP 5B2g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=eXUl97TIC36iX2gzZ4z9T7kVADet1dCMIeHKCLo4dC4=; b=RuWu/XkpRyO62wIf0SpGCcyS4xCsebDcnZYBfPHy48ZKko/t6SXfH6nnEUmf2mpJe+ OkIUBoq96daYuK9HcQAQtTHkBLincfO0NyJG7Mk/BecBH6u0TByk1E3ZT6aLuokH1cBj Ky9BA1AnkQDXRscvYmx6x5R7Bo6t4mLrvsL5QpKqlOF/F85FzD2ocS4ONaKTfHu+Q00A CUMo67tPKewNx6VfcRkEpoNARKN+cEMd5FnxXI7f8PSdolFki4TeVa5F5uRr36VGdTTp 1grG1hamgnuu3oTipgv7LdMYJs3o35yMR29YWB+3v1kqlev8sehgjmslBBk1SOsHdGAk t5DQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=1d778WwT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id ce18-20020a170906b25200b0078db18d796dsi12216120ejb.982.2022.11.23.14.38.55; Wed, 23 Nov 2022 14:39:17 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=1d778WwT; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238904AbiKWV5v (ORCPT + 88 others); Wed, 23 Nov 2022 16:57:51 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34650 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235336AbiKWV5s (ORCPT ); Wed, 23 Nov 2022 16:57:48 -0500 Received: from mail-qt1-x835.google.com (mail-qt1-x835.google.com [IPv6:2607:f8b0:4864:20::835]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 288B722BEC for ; Wed, 23 Nov 2022 13:57:45 -0800 (PST) Received: by mail-qt1-x835.google.com with SMTP id h16so93671qtu.2 for ; Wed, 23 Nov 2022 13:57:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20210112.gappssmtp.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=eXUl97TIC36iX2gzZ4z9T7kVADet1dCMIeHKCLo4dC4=; b=1d778WwTHiFnSrOOtipV1oy8hteglhzctqXDcakElk44hR7nj0Kw1HZUKFrkLK1flS 1dy2ejEUeQqR3zUjNbKe64kHFHaCeNPfR+mHEFOOM8LiucYx2e8pjbjLk1BMo3Sbf3K9 wZL/Jay42f6APIg1zJmYPEYEFZuFaYZ54oLip8A+55aEI0PiSRx6eLERACZ4gNoB/nRm LTOQd16sAxmRdLvYL+ViYyxYH5UhhXYq81mueLRa3i7YmQ4Cty/v68jpsO7pFl5W2ApC fFUnjjrFHbN4SCxNJgdTGJifAWOycqM5/tm/mDgGyEo4EcDqRaXaB8j0C8GsSVJo9CBj 8CpQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=eXUl97TIC36iX2gzZ4z9T7kVADet1dCMIeHKCLo4dC4=; b=4DXL7haKzAj3XvGYJ/ZYWiPcFfxMBeRu0uyS37TMVj5xsn20Gik7MBaIawj4lpMqLB ymlFOdl/WVlhHIecTWWX0HrNvibLkhG4LqiurQnrZX22Hg1lh1vbMe8OwwgZZOeD12r5 tAu9xohKwurg1GPTY0er/ZUo+kX4PAqAKFGKWn6zKRMLajISsP+Kw5E6nEQcwbOB03KJ 86I0+lt5r3Q644pI8xPt0QWraBIQlcZ/IrQCysCORdUj+/OmXoTfc722kVRwkprymDUb 7TFzqrUbZGG9EvR2et4LT0dB4KGzx/6JWWJ1ZHqvv1Y0NEXwOpgcrDPmvfK4VCNUwji8 HOtw== X-Gm-Message-State: ANoB5pl3trNoadKgo3XAuQN1TtzAriHC1XaTUY9+LJKI6lcEVMTxEMLR VC2xJSWxNEerIxLKlr7Bfg5/Rw== X-Received: by 2002:ac8:4a10:0:b0:3a5:2556:42ed with SMTP id x16-20020ac84a10000000b003a5255642edmr13309728qtq.422.1669240664291; Wed, 23 Nov 2022 13:57:44 -0800 (PST) Received: from localhost ([2620:10d:c091:480::1:bc4]) by smtp.gmail.com with ESMTPSA id t18-20020a05620a451200b006fafc111b12sm13219537qkp.83.2022.11.23.13.57.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 23 Nov 2022 13:57:43 -0800 (PST) Date: Wed, 23 Nov 2022 16:58:10 -0500 From: Johannes Weiner To: Mina Almasry Cc: Huang Ying , Yang Shi , Yosry Ahmed , Tim Chen , weixugc@google.com, shakeelb@google.com, gthelen@google.com, fvdl@google.com, Michal Hocko , Roman Gushchin , Muchun Song , Andrew Morton , linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org Subject: Re: [RFC PATCH V1] mm: Disable demotion from proactive reclaim Message-ID: References: <20221122203850.2765015-1-almasrymina@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Nov 23, 2022 at 01:20:57PM -0800, Mina Almasry wrote: > On Wed, Nov 23, 2022 at 10:00 AM Johannes Weiner wrote: > > > > Hello Mina, > > > > On Tue, Nov 22, 2022 at 12:38:45PM -0800, Mina Almasry wrote: > > > Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg > > > reclaim""), the proactive reclaim interface memory.reclaim does both > > > reclaim and demotion. This is likely fine for us for latency critical > > > jobs where we would want to disable proactive reclaim entirely, and is > > > also fine for latency tolerant jobs where we would like to both > > > proactively reclaim and demote. > > > > > > However, for some latency tiers in the middle we would like to demote but > > > not reclaim. This is because reclaim and demotion incur different latency > > > costs to the jobs in the cgroup. Demoted memory would still be addressable > > > by the userspace at a higher latency, but reclaimed memory would need to > > > incur a pagefault. > > > > > > To address this, I propose having reclaim-only and demotion-only > > > mechanisms in the kernel. There are a couple possible > > > interfaces to carry this out I considered: > > > > > > 1. Disable demotion in the memory.reclaim interface and add a new > > > demotion interface (memory.demote). > > > 2. Extend memory.reclaim with a "demote=" flag to configure the demotion > > > behavior in the kernel like so: > > > - demote=0 would disable demotion from this call. > > > - demote=1 would allow the kernel to demote if it desires. > > > - demote=2 would only demote if possible but not attempt any > > > other form of reclaim. > > > > Unfortunately, our proactive reclaim stack currently relies on > > memory.reclaim doing both. It may not stay like that, but I'm a bit > > wary of changing user-visible semantics post-facto. > > > > In patch 2, you're adding a node interface to memory.demote. Can you > > add this to memory.reclaim instead? This would allow you to control > > demotion and reclaim independently as you please: if you call it on a > > node with demotion targets, it will demote; if you call it on a node > > without one, it'll reclaim. And current users will remain unaffected. > > Hello Johannes, thanks for taking a look! > > I can certainly add the "nodes=" arg to memory.reclaim and you're > right, that would help in bridging the gap. However, if I understand > the underlying code correctly, with only the nodes= arg the kernel > will indeed attempt demotion first, but the kernel will also merrily > fall back to reclaiming if it can't demote the full amount. I had > hoped to have the flexibility to protect latency sensitive jobs from > reclaim entirely while attempting to do demotion. The fallback to reclaim actually strikes me as wrong. Think of reclaim as 'demoting' the pages to the storage tier. If we have a RAM -> CXL -> storage hierarchy, we should demote from RAM to CXL and from CXL to storage. If we reclaim a page from RAM, it means we 'demote' it directly from RAM to storage, bypassing potentially a huge amount of pages colder than it in CXL. That doesn't seem right. If demotion fails, IMO it shouldn't satisfy the reclaim request by breaking the layering. Rather it should deflect that pressure to the lower layers to make room. This makes sure we maintain an aging pipeline that honors the memory tier hierarchy. So I'm hesitant to design cgroup controls around the current behavior. > The above is just one angle of the issue. Another angle (which Yosry > would care most about I think) is that at Google we call > memory.reclaim mainly when memory.current is too close to memory.max > and we expect the memory usage of the cgroup to drop as a result of a > success memory.reclaim call. I suspect once we take in commit > 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), > we would run into that regression, but I defer to Yosry here, he may > have a solution for that in mind already. IMO it should both demote and reclaim. Simliar to how memory.reclaim on a non-tiered memory system would both deactivate active pages and reclaim inactive pages.