Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp3279786iob; Mon, 16 May 2022 18:00:37 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwIUkjsRpch9LF/5aNnAz4a3Mn/a3+1Hq4SdNg040k9NnMmnkROp5qe3piMQB5tiDEaHlEK X-Received: by 2002:aa7:c849:0:b0:427:d464:e66f with SMTP id g9-20020aa7c849000000b00427d464e66fmr16125159edt.411.1652749236930; Mon, 16 May 2022 18:00:36 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1652749236; cv=none; d=google.com; s=arc-20160816; b=DPfM1CqL6LY0ozImcQifBiU9W5W8nWHvZJRx5Fp1kZxQqWEIHPE/2dMjMnoGEkVNY3 1dhmVTs4NpLTNdV4QQL23frwfLsjP3JkNGu0dsAp3YIDIU68HrAwLCQcmmu3OnbD+E76 ZC/n9L/Glw7WM0a8xLwJS0hC1TlznOySBEXjeYHhFxluRd0ZkRvUAMquZLB05dBRSF7D dlF3ZaYUSXYeT7ltVlySWcHOfxBY35IW99pzORHkM3Mpt6T6n0SVlOSfq1DO1If/9P8E hCOnvq9Ht45xQInhbaTpON98cdLeaqk1fKln84XOufUg9Opg1qZjsUkdz4MQgt3YVov9 f7yg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=ZkVPoKiRyxZYCKFlzur2Du6imn3HJo9Ujd09H1rvEZI=; b=U9NPU0/61Sw2IGSIP5fIMIkaVM2kw1knhQKjsGZwSoIHzFMCol0xYxkxmDf0gR9Chz oLRcdyJv+t2b7rTv58StD/9Zkcm7wXeH8+0L1Z4fGjoZ0MFsI7weUC5SOvR8+Fl2YRMs 5q4fcWvsqGOnLzhWH/h6pG6HjzQvk0NdN+WxDoU39mKi+wHR28vJouD6HgTzkxFzodn8 ogG+YP9c+J6kzeoNwnVRdpSaQz+x9V6hYwSQdE7hNfhIy4V+IGaWR+4ILc6itWmx4zpf j/fK6YN1QxSAb60Ql5czmy1Zre63iVbU1UIFdpG/ni85jecx6H7jo3p3NC3ZmCep+38G f7dA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=vZTbPP2p; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id f2-20020a50fc82000000b0042aac6ad1d1si5086540edq.94.2022.05.16.18.00.11; Mon, 16 May 2022 18:00:36 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=vZTbPP2p; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231624AbiEPUe7 (ORCPT + 99 others); Mon, 16 May 2022 16:34:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:56938 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1351247AbiEPUCI (ORCPT ); Mon, 16 May 2022 16:02:08 -0400 Received: from mail-qv1-xf2a.google.com (mail-qv1-xf2a.google.com [IPv6:2607:f8b0:4864:20::f2a]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4A032403F2 for ; Mon, 16 May 2022 13:01:10 -0700 (PDT) Received: by mail-qv1-xf2a.google.com with SMTP id e20so56983qvr.6 for ; Mon, 16 May 2022 13:01:10 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20210112.gappssmtp.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=ZkVPoKiRyxZYCKFlzur2Du6imn3HJo9Ujd09H1rvEZI=; b=vZTbPP2pWMooTwGBeObJt4v/OCxVqBMUtRl4gbZrrAlmglNAeFTKpKTvI8pBBQYN9c iqwo0hsTQ6M2urOAjmu0s5oyECHNDKV4KnVlwY6nOMTEdrAJ9NqVkrSARYKkYn/wwoFO m+zVXG2ALswGiW+CS9Jh21tKovWXrZW+YRHz4yMcE/Ggs4ybkWkJUMkfbsZ5s90LQn77 e2o98eTC86/Ft/yL5wJbt69b21smwsBJNqGNjMH+Zm1QxWMvkFIlIIgSML55tuD2zjKt 83b5fFamCTAEiXM24TcMDjWVlr3rXOlwoXH0BafPlBO8IAUs6LbNf+qDt89E1wHEDtRT pLQA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=ZkVPoKiRyxZYCKFlzur2Du6imn3HJo9Ujd09H1rvEZI=; b=MSCDJXX0c7juqhTWFjjPgBToJIsbOSVQqb7PAJ5RfuHWxxrn+DjPtBmPRQdt2CYqxa Qa5jS+hkZSbuvRbKWEZ2HJUCdIqJvbglu/wnurzNn5qnD5kD993IpBFvFGSjjmBnVeOk QsNGBNNHlxi+1/lJ90L5nlMBYjGErFhrDqkHwIUWLi0QY/mucwTgC5kaJa8pM/JdUoso FllAUZkG7sgRLIUACNFWCdQ9JmPHvfzCqe55BVbFQFGMSiKM33gZraXPlY8JFc0r18Z5 roDSuPIV/nkBTofppif7EcL8L5QErL6lHwZHR6sVSg42jFTFtFSah0wCJXM89CkWnHb9 Jqdw== X-Gm-Message-State: AOAM5328CbkmIIYT7mU2miPSIOAKIvsr0a0G5NUugOB6dLu/UIVN7XEq lBKk25zWy3AL7UIeivo9fZaejA== X-Received: by 2002:a05:6214:e6b:b0:45b:474:1035 with SMTP id jz11-20020a0562140e6b00b0045b04741035mr17194607qvb.39.1652731267078; Mon, 16 May 2022 13:01:07 -0700 (PDT) Received: from localhost ([2620:10d:c091:480::1:fcf2]) by smtp.gmail.com with ESMTPSA id o7-20020a37be07000000b0069fc13ce203sm6698397qkf.52.2022.05.16.13.01.06 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 16 May 2022 13:01:06 -0700 (PDT) Date: Mon, 16 May 2022 16:01:05 -0400 From: Johannes Weiner To: Michal =?iso-8859-1?Q?Koutn=FD?= Cc: Andrew Morton , Michal Hocko , Roman Gushchin , Shakeel Butt , Seth Jennings , Dan Streetman , Minchan Kim , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH v2 6/6] zswap: memcg accounting Message-ID: References: <20220510152847.230957-1-hannes@cmpxchg.org> <20220510152847.230957-7-hannes@cmpxchg.org> <20220511173218.GB31592@blackbody.suse.cz> <20220513151426.GC16096@blackbody.suse.cz> <20220516143459.GA17557@blackbody.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20220516143459.GA17557@blackbody.suse.cz> X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Michal, On Mon, May 16, 2022 at 04:34:59PM +0200, Michal Koutn? wrote: > On Fri, May 13, 2022 at 01:08:13PM -0400, Johannes Weiner wrote: > > Right, it's accounted as a subset rather than fully disjointed. But it > > is a limitable counter of its own, so I exported it as such, with a > > current and a max knob. This is comparable to the kmem counter in v1. > > That counter and limit didn't turn out well. I liked the analogy to > writeback (and dirty limit) better. Right, I was only talking about the design decision to add a usage knob alongside the limit. The counter failed for a different reason. > > From an API POV it would be quite strange to have max for a counter > > that has no current. Likewise it would be strange for a major memory > > consumer to be missing from memory.stat. > > My understanding would be to have all memory.stat entries as you > propose, no extra .current counter and the .max knob for zswap > configuration. There is a longer-term advantage of sticking to the usage+limit pair precedent. Even though the usage knob happens to correspond to a memory.stat item in this case, we had also discussed the possibility of breaking out a "Compressed" item in memory.stat instead, of which zswap would only be a subset. It didn't pan out this way this time - for unrelated reasons. But it's conceivable this will happen in another scenario down the line, and then you'd need a separate usage knob anyway. It's a good idea to stay consistent. There is also an immediate practical advantage. zswap is limitable, so an auto-tuning service might want to monitor its usage at a higher frequency, with a higher precision, and with a higher urgency than memory stats are ususally logged. A dedicated usage knob allows doing that. memory.stat does not: it is a bigger file that needs to be searched with a string parser for every sample; it's flushed lazily, so it can be less precise than desired; yet, when it does flush, it flushes the entire tree rather than just the target group, making it more expensive an erratic than desired as well. > > It needs to be configured to the workload's access frequency curve, > > which can be done with trial-and-error (reasonable balance between > > zswpins and pswpins) or in a more targeted manner using tools such as > > page_idle, damon etc. > > [...] > > Because for load tuning, bytes make much more sense. That's how you > > measure the workingset, so a percentage is an awkward indirection. At > > the cgroup level, it makes even less sense: all memcg tunables are in > > bytes, it would be quite weird to introduce a "max" that is 0-100. Add > > the confusion of how percentages would propagate down the hierarchy... > > Thanks for the explanation. I guess there's no simple tranformation of > in-kernel available information that'd allow a more semantic > configuration of this value. The rather crude absolute value requires > (but also simply allows) some calibration or responsive tuning. Right. If you think about it, the same can be said about the memory limit itself. It's a crude, absolute number the kernel asks of you. Yet the optimal setting depends on the workload's ever-changing access frequency histogram, the speed of the storage backend for paging and swapping, and the workload's tolerances for paging latencies. Hopefully one day we'll be able to set a pressure/latency threshold on the cgroup, and have the kernel optimally distribute the workingset across RAM, zswap, second-tier memory, and storage - preferring the cheapest and slowest possible backing for every page while meeting the SLA. The proactive reclaim and memory offloading work is pursuing this. For now, we need the interface that allows tuning and exploring from userspace. When there is something that's ready for a general purpose OS kernel, those absolute knobs won't get in the way - just like memory.max doesn't get in the way of proactive reclaim today. Anyway, I'm just outlining where I'm coming from with this. It looks like we agree on the end result. > > Flushing unnecessary groups with a ratelimit doesn't sound like an > > improvement to me. > > Then I'm only concerned about a situation when there's a single deep > memcg that undergoes both workingset_refault() and zswap querying. > The latter (bare call to cgroup_rstat_flush()) won't reset > stats_flush_threshold, so the former (or the async flush more likely) > would attempt a flush too. The flush work (on the leaf memcg) would be > done twice even though it may be within the tolerance of cumulated > error the second time. > > This is a thing that might require attention in the future (depending on > some data how it actually performs). I see how the current approach is > justified. Yes, we can optimize it should the need arise. So far it's been fine. Thanks for your thoughts, Michal.