Received: by 2002:a05:7412:419a:b0:f3:1519:9f41 with SMTP id i26csp4941794rdh; Wed, 29 Nov 2023 15:46:04 -0800 (PST) X-Google-Smtp-Source: AGHT+IHIpvOzshkb6z10yhTyzHvHXlaqS03b9ZKoP2ZnRQo3ooPaY9Hyb+8E390Qno1aB6h+JuZt X-Received: by 2002:a05:6a00:1f0f:b0:6cd:dd0d:980f with SMTP id be15-20020a056a001f0f00b006cddd0d980fmr2358132pfb.30.1701301564236; Wed, 29 Nov 2023 15:46:04 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1701301564; cv=none; d=google.com; s=arc-20160816; b=gqsUGt69Na07qINtI9UirQJEc99mhYy3g6z4fuFOCzMfEEItRa/AQe3cIGY+9ri4iE HusmP6Z3ZN9z/AR8IfI+8PQVAFy697sThyuCiP6ImJSfa/ojzP0oBN3tQ405rK52qSOa bn4ecLglVlgHdBKu5Rq1lbixbhXXsRK8D0drbFPuwA/eCmOCLOUAgUzLZs+cvCR+4RL7 vUG8hOY/i4zl5DXfZHzXFOgeBRezMO6tb84Jeb3sCzVwH6ufsKMSpzq58c1VtK1vKdrA 98WIDP4olF5j9t+2YYG5RXuKnC7bZZJwETkadsQyaqkeeohUhWsmQGKp2YGxgprzDr6O aqcQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=/tOCbkwNSpvGdMwE6tBYjY/uOZ+dKCwcjhdWtFECtR0=; fh=CQe3bLE0SvRazpTPupXM0NE1aafNrdD0FdNEToDOM4E=; b=OiBYvDQRT4QyCIPt3OpsN2NqITwEI6s+O8WtWBEO4CaVW/51IWy1jFHlQ2qpaVfuVy iG0Qql453H5De3KTcDgC1X841shOc4LSJVoR/QaFhmcuHMTDw0IpR9S01snjWWXwhAg9 VTH5solLoS1hVqrR2cCNZnfihzITFV4T0qaH7an8xygjZ7fH5QJNZ4UVzPos0SImXjOb 14xDizHYIu5F/ao2wRygu3QAMJH1Xraob3MCPdni2PembZJTcD3H+sXdrh9pr4LgKeLs uM7PlhbB1PFLel2IVNIvc5kDbaQ///Kb4N0NbvLN0vjv4Wk2ovkQZrkobWWdhQB4Rgfn XOrg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=jLHBOu9N; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from pete.vger.email (pete.vger.email. [23.128.96.36]) by mx.google.com with ESMTPS id s13-20020a63e80d000000b00589fcc39ef1si15557887pgh.365.2023.11.29.15.46.03 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 29 Nov 2023 15:46:04 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) client-ip=23.128.96.36; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20230601 header.b=jLHBOu9N; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.36 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by pete.vger.email (Postfix) with ESMTP id 326BF80811C4; Wed, 29 Nov 2023 15:45:12 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.11 at pete.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1343661AbjK2Xoq (ORCPT + 99 others); Wed, 29 Nov 2023 18:44:46 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:51532 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229575AbjK2Xop (ORCPT ); Wed, 29 Nov 2023 18:44:45 -0500 Received: from mail-io1-xd35.google.com (mail-io1-xd35.google.com [IPv6:2607:f8b0:4864:20::d35]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1308E10A; Wed, 29 Nov 2023 15:44:51 -0800 (PST) Received: by mail-io1-xd35.google.com with SMTP id ca18e2360f4ac-7b350130c3fso5786139f.3; Wed, 29 Nov 2023 15:44:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1701301490; x=1701906290; darn=vger.kernel.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=/tOCbkwNSpvGdMwE6tBYjY/uOZ+dKCwcjhdWtFECtR0=; b=jLHBOu9NYC+a1uQERP5D+q5+PW2S4rhxggPMBGeTHmDpLmU7ICkZaq2Nz9yTuav1vY V+9rtul+fHeULsZIAbw7UZpxlev7WR6kZTWkfGWKnLHw9hfihl4V+TmBV2kcQ+7fWiwB ofMLdA29LKMLrfXcx5jx98MiXEIJa3Ka7qyv7Rp1Rc0Rd5dTuaTOdSfeX83aFlj020af MZ9hx2b3Vtfj8Gwf/TVCCmclDyZC42s71ks9Zuf1/mYhX7Rn7O4M5XEmb4m4oYt3ABH7 7/KCwb2nwtfo3qix9fZ4KR1QbN/C2V7o51ybqdFh6nrxsAt+YlLKCBzfZLw5usihbGS/ FUXg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1701301490; x=1701906290; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=/tOCbkwNSpvGdMwE6tBYjY/uOZ+dKCwcjhdWtFECtR0=; b=Ec6nOsOMvDCRiWMNGXuA8c/ak5vUY2bf0+MhfN21TZBvjw68Fr/RkOHbOHRK4vO/Mf oB1a2nXjM8hn47fOAdvUpdj+ewsN5mGiBrZEu4cfip0LA4jmcDYXuY6vuDWle23kAbh1 momQNBWA/fT76YxcXUatbMUBz0XBlG4eIS73umhX/cevzITvcAZzNl+9bJZr5wJqNRUG ahOfzW9M3FLsI5Rs2h3mrNALoQxu8mDdD3stBZSCzJ0oy6erpO66+GsUqC9kh7am+Kob wbakivZL4kg4yVAIbc8JWPkGRlELnjeydzodArYfYmKDgUx+s4RGHficwQtxSwyUFNV0 3I1g== X-Gm-Message-State: AOJu0Yzr7hrUVOORIvXJjC8ghnH7i0KDQEnTd1huPQhwtt7q+ysl0x3P l7sNTXmxQp+peynYXDIiv4uy7G8SyR8YVootuws= X-Received: by 2002:a05:6e02:eca:b0:35d:fd2:6b3 with SMTP id i10-20020a056e020eca00b0035d0fd206b3mr7641812ilk.26.1701301490262; Wed, 29 Nov 2023 15:44:50 -0800 (PST) MIME-Version: 1.0 References: <20231127234600.2971029-1-nphamcs@gmail.com> <20231127234600.2971029-7-nphamcs@gmail.com> <20231129162150.GE135852@cmpxchg.org> In-Reply-To: <20231129162150.GE135852@cmpxchg.org> From: Nhat Pham Date: Wed, 29 Nov 2023 15:44:39 -0800 Message-ID: Subject: Re: [PATCH v7 6/6] zswap: shrinks zswap pool based on memory pressure To: Johannes Weiner Cc: akpm@linux-foundation.org, cerasuolodomenico@gmail.com, yosryahmed@google.com, sjenning@redhat.com, ddstreet@ieee.org, vitaly.wool@konsulko.com, mhocko@kernel.org, roman.gushchin@linux.dev, shakeelb@google.com, muchun.song@linux.dev, chrisl@kernel.org, linux-mm@kvack.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, shuah@kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-0.6 required=5.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on pete.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (pete.vger.email [0.0.0.0]); Wed, 29 Nov 2023 15:45:12 -0800 (PST) On Wed, Nov 29, 2023 at 8:21=E2=80=AFAM Johannes Weiner wrote: > > On Mon, Nov 27, 2023 at 03:46:00PM -0800, Nhat Pham wrote: > > Currently, we only shrink the zswap pool when the user-defined limit is > > hit. This means that if we set the limit too high, cold data that are > > unlikely to be used again will reside in the pool, wasting precious > > memory. It is hard to predict how much zswap space will be needed ahead > > of time, as this depends on the workload (specifically, on factors such > > as memory access patterns and compressibility of the memory pages). > > > > This patch implements a memcg- and NUMA-aware shrinker for zswap, that > > is initiated when there is memory pressure. The shrinker does not > > have any parameter that must be tuned by the user, and can be opted in > > or out on a per-memcg basis. > > > > Furthermore, to make it more robust for many workloads and prevent > > overshrinking (i.e evicting warm pages that might be refaulted into > > memory), we build in the following heuristics: > > > > * Estimate the number of warm pages residing in zswap, and attempt to > > protect this region of the zswap LRU. > > * Scale the number of freeable objects by an estimate of the memory > > saving factor. The better zswap compresses the data, the fewer pages > > we will evict to swap (as we will otherwise incur IO for relatively > > small memory saving). > > * During reclaim, if the shrinker encounters a page that is also being > > brought into memory, the shrinker will cautiously terminate its > > shrinking action, as this is a sign that it is touching the warmer > > region of the zswap LRU. > > > > As a proof of concept, we ran the following synthetic benchmark: > > build the linux kernel in a memory-limited cgroup, and allocate some > > cold data in tmpfs to see if the shrinker could write them out and > > improved the overall performance. Depending on the amount of cold data > > generated, we observe from 14% to 35% reduction in kernel CPU time used > > in the kernel builds. > > I think this is a really important piece of functionality for zswap. > > Memory compression has been around and in use for a long time, but the > question of zswap vs swap sizing has been in the room since the hybrid > mode was first proposed. Because depending on the reuse patterns, > putting zswap with a static size limit in front of an existing swap > file can be a net negative for performance as it consumes more memory. > > It's great to finally see a solution to this which makes zswap *much* > more general purpose. And something that distributions might want to > turn on per default when swap is configured. > > Actually to the point where I think there should be a config option to > enable the shrinker per default. Maybe not right away, but in a few > releases when this feature has racked up some more production time. Sure thingy - how does everyone feel about this? > > > @@ -687,6 +687,7 @@ struct page *swap_cluster_readahead(swp_entry_t ent= ry, gfp_t gfp_mask, > > &page_allocated, false); > > if (unlikely(page_allocated)) > > swap_readpage(page, false, NULL); > > + zswap_lruvec_swapin(page); > > The "lruvec" in the name vs the page parameter is a bit odd. > zswap_page_swapin() would be slightly better, but it still also sounds > like it would cause an actual swapin of some sort. > > zswap_record_swapin()? Hmm that sounds good to me. I'm not very good with naming, if that's not already evident :) > > > @@ -520,6 +575,95 @@ static struct zswap_entry *zswap_entry_find_get(st= ruct rb_root *root, > > return entry; > > } > > > > +/********************************* > > +* shrinker functions > > +**********************************/ > > +static enum lru_status shrink_memcg_cb(struct list_head *item, struct = list_lru_one *l, > > + spinlock_t *lock, void *arg); > > + > > +static unsigned long zswap_shrinker_scan(struct shrinker *shrinker, > > + struct shrink_control *sc) > > +{ > > + struct lruvec *lruvec =3D mem_cgroup_lruvec(sc->memcg, NODE_DATA(= sc->nid)); > > + unsigned long shrink_ret, nr_protected, lru_size; > > + struct zswap_pool *pool =3D shrinker->private_data; > > + bool encountered_page_in_swapcache =3D false; > > + > > + nr_protected =3D > > + atomic_long_read(&lruvec->zswap_lruvec_state.nr_zswap_pro= tected); > > + lru_size =3D list_lru_shrink_count(&pool->list_lru, sc); > > + > > + /* > > + * Abort if the shrinker is disabled or if we are shrinking into = the > > + * protected region. > > + */ > > + if (!zswap_shrinker_enabled || nr_protected >=3D lru_size - sc->n= r_to_scan) { > > + sc->nr_scanned =3D 0; > > + return SHRINK_STOP; > > + } > > I'm scratching my head at the protection check. zswap_shrinker_count() > already factors protection into account, so sc->nr_to_scan should only > be what is left on the list after excluding the protected area. > > Do we even get here if the whole list is protected? Is this to protect > against concurrent shrinking of the list through multiple shrinkers or > swapins? If so, a comment would be nice :) Yep, this is mostly for concurrent shrinkers. Please fact-check me, but IIUC if we have too many reclaimers all calling upon the zswap shrinker (before any of them can make substantial progress), we can have a situation where the total number of objects freed by the reclaimers will eat into the protection area of the zswap LRU (even if the number of freeable objects is scaled down by the compression ratio, and further scaled down internally in the shrinker/vmscan code). I've observed this tendency when there is a) a lot of worker threads in my benchmark and b) memory pressure. This is a crude/racy way to alleviate the issue. I think this is actually a wider problem than just zswap and zswap shrinker - we need better reclaimer throttling logic IMO. Perhaps this check should be done higher up the stack - something along the lines of having each reclaimer "register" its intention (number of objects it wants to reclaim) to a particular shrinker, allowing the shrinker to deny a reclaimer if there is already a strong reclaim driving force. Or some other throttling heuristics based on the number of freeable objects and the reclaimer registration data. > > Otherwise, this looks great to me! > > Just nitpicks, no show stoppers: > > Acked-by: Johannes Weiner