Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp954561iob; Fri, 13 May 2022 17:35:55 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzBmeWrnsUuU6c6yvRvXq0nQgUZJ5ImoFbLSl25/qIFV4V1bPPc3tLSBjcX0KjOlnIMWJL2 X-Received: by 2002:a05:600c:5105:b0:394:7d22:aa93 with SMTP id o5-20020a05600c510500b003947d22aa93mr6662589wms.107.1652488554885; Fri, 13 May 2022 17:35:54 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1652488554; cv=none; d=google.com; s=arc-20160816; b=mek/0Pt4D+9z5Vn0DLOUcAqJm1knODnghq9FohDk5kvVbXNGQJ+Mf2AjM5UkfgJCEg z3aXnreow7+O1SZaSpTIEZ+B8M+PNYwNAR9t6In0TpgW5KD4IxRYisuITRqwTsrB0DS8 Uh5XgflR/v48k4F1tJfG5WedfKkimrFJqpTNQzNYFQFht9Rzsf1Q+80kq5lkcLbEJbvv 4lYLQdI1gKB5EGjhK6887N+7ka0OpbKJ7ot1PqMuTie2CDp4hgjzhWaqX1btehHr7Kga cMTI2V1UXlXPfQV/s6LMVApMUj/s8PaUmwfsyNQdbjmWMjJj1IEI0Yb2Zp8TGgkM+Qw9 W4KQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature; bh=zXhkDIVf27CV50GWaqhnTSClMrVWORrrk1UTjwHnru8=; b=GxB/piAQwJWHd+3l0WjRUlDTHwGKohK9e3msNZCZ7hZbAoqbj5HgxoOKRWwdQKVDoo u7vVLi6skAelDO5EBViV+e/yLXF8waVmTAaSXm4jYzzvkYOCBnJqjYNLs6GPuiipV17D pZi9f6xwDiCtmZfwmsXqwvLzaynVCk60b29pzyWlVUOrqLvDl1etnoniTGmrS3XVwLb7 cQ1VLM6jXly9Mu29Cw11AY/CJZDqL0bzM8A9IyrAOH4OrA60ALWdD8RVuhNukbBLysYC dJ+EIzNnfRR4JV07RlQmiasarToAfVW5L4tli0FXv0JhCpC3jcICJIq4PMKzwdGGkZaw Zj9w== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=cnwlzxqB; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from lindbergh.monkeyblade.net (lindbergh.monkeyblade.net. [23.128.96.19]) by mx.google.com with ESMTPS id z3-20020a05600c0a0300b0038ff27d68d4si8528701wmp.128.2022.05.13.17.35.54 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 13 May 2022 17:35:54 -0700 (PDT) Received-SPF: softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) client-ip=23.128.96.19; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=cnwlzxqB; spf=softfail (google.com: domain of transitioning linux-kernel-owner@vger.kernel.org does not designate 23.128.96.19 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 5305133929F; Fri, 13 May 2022 16:27:05 -0700 (PDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1358871AbiEMRIU (ORCPT + 99 others); Fri, 13 May 2022 13:08:20 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:59464 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1378302AbiEMRIS (ORCPT ); Fri, 13 May 2022 13:08:18 -0400 Received: from mail-qt1-x835.google.com (mail-qt1-x835.google.com [IPv6:2607:f8b0:4864:20::835]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 46DA16128C for ; Fri, 13 May 2022 10:08:16 -0700 (PDT) Received: by mail-qt1-x835.google.com with SMTP id v14so6139719qtc.3 for ; Fri, 13 May 2022 10:08:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20210112.gappssmtp.com; s=20210112; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to; bh=zXhkDIVf27CV50GWaqhnTSClMrVWORrrk1UTjwHnru8=; b=cnwlzxqBMjsJ+1WY7JYGk0zmaRis0uxKC7n7kcvotpz3nsqvchPQgSX6XYMUfhDe6C zQX+xYwhjdbi9MoF65ZpjqRRK7MJ0XKfxooT6ZqFqqo5RjIKjDa+M6SKyoLVYCac5pnq F/Ub9Xe3njx+6EalimZC+4nsWYG/wjM6ZV3DJI5rfIRoltmydpP3cLFDVWHPmw472OJh wjt3Bfjw86NWL6eyRuQrhiFfaHCEQEWvic5An3l6SAxg8tWA6OZwBS1CE8KjO+VtdCEW IPvb1TfV5UEO/ymbUOlWpPhhEU9+bxxgQMSRElxcmydKAxpio18y2YkuyD7/w+eB4Eo5 aUFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to; bh=zXhkDIVf27CV50GWaqhnTSClMrVWORrrk1UTjwHnru8=; b=4C/E4EPZvOEWBEylP0fl6m96VPnbylqXadosWYYfGJv7NGp8FuGADxLPvzHc0k1Ieo iN1CwkhTKU4FqTAQy8FIlcsnrFjUcPVR3sZ6O/tUXYsvDHLefcikFVeLSlqDdk39KROm oTv8K/TSJGuHyJBxyAT8AbbotvAx8r6sPvbrViQIbz5z27e5n7d3zZaR0V7raJhroEQc jSbnOH5PjrzexsV0vCK3mHnreJjIUHunxecfoAU/SgRTtYg39fDoy6/pShBKRGIr9JOY aiGMRWcfZ/tvLWjEulRJwduIgCilPecEP1tyNsKIX030LjniWEM2dH8tAket00IRMf/O QoFQ== X-Gm-Message-State: AOAM531i9n7/sSJkiDAmr6SxIGHmzCgY9rr6k3kBaYTrA2WLMXMK1HSi R1XA10Lodb6gd3QMZyDMtCi1dA== X-Received: by 2002:a05:622a:104e:b0:2f3:f7a5:62e6 with SMTP id f14-20020a05622a104e00b002f3f7a562e6mr5423719qte.582.1652461695399; Fri, 13 May 2022 10:08:15 -0700 (PDT) Received: from localhost ([2620:10d:c091:480::1:14fe]) by smtp.gmail.com with ESMTPSA id 143-20020a370995000000b0069fc13ce207sm1622277qkj.56.2022.05.13.10.08.14 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 13 May 2022 10:08:14 -0700 (PDT) Date: Fri, 13 May 2022 13:08:13 -0400 From: Johannes Weiner To: Michal =?iso-8859-1?Q?Koutn=FD?= Cc: Andrew Morton , Michal Hocko , Roman Gushchin , Shakeel Butt , Seth Jennings , Dan Streetman , Minchan Kim , linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com Subject: Re: [PATCH v2 6/6] zswap: memcg accounting Message-ID: References: <20220510152847.230957-1-hannes@cmpxchg.org> <20220510152847.230957-7-hannes@cmpxchg.org> <20220511173218.GB31592@blackbody.suse.cz> <20220513151426.GC16096@blackbody.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20220513151426.GC16096@blackbody.suse.cz> X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,RDNS_NONE, SPF_HELO_NONE,T_SCC_BODY_TEXT_LINE autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hello Michal, On Fri, May 13, 2022 at 05:14:26PM +0200, Michal Koutn? wrote: > On Wed, May 11, 2022 at 03:06:56PM -0400, Johannes Weiner wrote: > > Correct. After which the uncompressed page is reclaimed and uncharged. > > So the zswapout process will reduce the charge bottom line. > > A zswap object falling under memory.current was my first thinking, I was > confused why it's exported as a separate counter memory.zswap.current > (which IMO suggests disjoint counting) and it doubles a > memory.stat:zswap entry. > > Is the separate memory.zswap.current good for anything? (Except maybe > avoiding global rstat flush on memory.stat read but that'd be an > undesired precendent.) Right, it's accounted as a subset rather than fully disjointed. But it is a limitable counter of its own, so I exported it as such, with a current and a max knob. This is comparable to the kmem counter in v1. From an API POV it would be quite strange to have max for a counter that has no current. Likewise it would be strange for a major memory consumer to be missing from memory.stat. > (Ad the eventually reduced footprint, the transitional excursion above > memcg's (or ancestor's) limit should be limited by number of parallel > reclaims running (each one at most a page, right?), so it doesn't seem > necessary to tackle (now).) Correct. > > memory.zswap.* are there to configure zswap policy, within the > > boundaries of available memory - it's by definition a subset. > > I see how the .max works when equal to 0 or "max". The intermediate > values are more difficult to reason about. It needs to be configured to the workload's access frequency curve, which can be done with trial-and-error (reasonable balance between zswpins and pswpins) or in a more targeted manner using tools such as page_idle, damon etc. > Also, I can see that on the global level, zswap is configured relatively > (/sys/module/zswap/parameters/max_pool_percent). > You wrote that the actual configured value is workload specific, would > it be simpler to have also relative zswap limit per memcg? > > (Relative wrt memory.max, it'd be rather just a convenience with this > simple ratio, however, it'd correspond to the top level limit. OTOH, the > relatives would have counter-intuitive hierarchical behavior. I don't > mean this should be changed, rather wondering why this variant was > chosen.) A percentage isn't a bad way to pick a global default limit for a kernel feature. But it would have been preferable if zswap had used the percentage internally and made the knob based in bytes (like min_free_kbytes for example). Because for load tuning, bytes make much more sense. That's how you measure the workingset, so a percentage is an awkward indirection. At the cgroup level, it makes even less sense: all memcg tunables are in bytes, it would be quite weird to introduce a "max" that is 0-100. Add the confusion of how percentages would propagate down the hierarchy... > > +bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) > > +{ > > + struct mem_cgroup *memcg, *original_memcg; > > + bool ret = true; > > + > > + original_memcg = get_mem_cgroup_from_objcg(objcg); > > + for (memcg = original_memcg; memcg != root_mem_cgroup; > > + memcg = parent_mem_cgroup(memcg)) { > > + unsigned long max = READ_ONCE(memcg->zswap_max); > > + unsigned long pages; > > + > > + if (max == PAGE_COUNTER_MAX) > > + continue; > > + if (max == 0) { > > + ret = false; > > + break; > > + } > > + > > + cgroup_rstat_flush(memcg->css.cgroup); > > Here, I think it'd be better not to bypass mem_cgroup_flush_stats() (the > mechanism is approximate and you traverse all ancestors anyway), i.e. > mem_cgroup_flush_stats() before the loop instead of this. I don't traverse all ancestors, I bail on disabled groups and skip unlimited ones. This saves a lot of flushes in practice right now: our heaviest swapping cgroups have zswap disabled (max=0) because they're lowpri and forced to disk. Likewise, the zswap users have their zswap limit several levels down from the root, and I currently don't ever flush the higher levels (max=PAGE_COUNTER_MAX). Flushing unnecessary groups with a ratelimit doesn't sound like an improvement to me. Thanks