From: Domenico Cerasuolo
Date: Fri, 26 May 2023 10:52:20 +0200
Subject: Re: [PATCH] mm: zswap: shrink until can accept
To: Yosry Ahmed
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, sjenning@redhat.com,
    ddstreet@ieee.org, vitaly.wool@konsulko.com, hannes@cmpxchg.org,
    kernel-team@fb.com
References: <20230524065051.6328-1-cerasuolodomenico@gmail.com>

On Thu, May 25, 2023 at 9:10 PM Yosry Ahmed wrote:
>
> On Thu, May 25, 2023 at 9:53 AM Domenico Cerasuolo wrote:
> >
> > On Thu, May 25, 2023 at 2:59 AM Yosry Ahmed wrote:
> > >
> > > Hi Domenico,
> > >
> > > On Tue, May 23, 2023 at 11:50 PM Domenico Cerasuolo wrote:
> > > >
> > > > This update addresses an issue with the zswap reclaim mechanism, which
> > > > hinders the efficient offloading of cold pages to disk, thereby
> > > > compromising the preservation of the LRU order and consequently
> > > > diminishing, if not inverting, its performance benefits.
> > > >
> > > > The functioning of the zswap shrink worker was found to be inadequate,
> > > > as shown by a basic benchmark test. For the test, a kernel build was
> > > > used as a reference, with its memory confined to 1G via a cgroup and
> > > > a 5G swap file provided. The results below are averages of three runs
> > > > without the use of zswap:
> > > >
> > > > real 46m26s
> > > > user 35m4s
> > > > sys 7m37s
> > > >
> > > > With zswap (zbud) enabled and max_pool_percent set to 1 (in a 32G
> > > > system), the results changed to:
> > > >
> > > > real 56m4s
> > > > user 35m13s
> > > > sys 8m43s
> > > >
> > > > written_back_pages: 18
> > > > reject_reclaim_fail: 0
> > > > pool_limit_hit: 1478
> > > >
> > > > Besides the evident regression, one thing to notice from this data is
> > > > the extremely low number of written_back_pages and pool_limit_hit.
> > > >
> > > > The pool_limit_hit counter, which is increased in zswap_frontswap_store
> > > > when zswap is completely full, doesn't account for a particular
> > > > scenario: once zswap hits its limit, zswap_pool_reached_full is set to
> > > > true; with this flag on, zswap_frontswap_store rejects pages if zswap is
> > > > still above the acceptance threshold. Once we include the rejections due
> > > > to zswap_pool_reached_full && !zswap_can_accept(), the number goes from
> > > > 1478 to a significant 21578266.
> > > >
> > > > Zswap is stuck in an undesirable state where it rejects pages because
> > > > it's above the acceptance threshold, yet fails to attempt memory
> > > > reclamation. This happens because the shrink work is only queued when
> > > > zswap_frontswap_store detects that it's full and the work itself only
> > > > reclaims one page per run.
> > > >
> > > > This state results in hot pages getting written directly to disk,
> > > > while cold ones remain in memory, waiting only to be invalidated. The
> > > > LRU order is completely broken and zswap ends up being just an overhead
> > > > without providing any benefits.
> > > >
> > > > This commit applies two changes: a) the shrink worker is set to reclaim
> > > > pages until the acceptance threshold is met and b) the task is also
> > > > enqueued when zswap is not full but still above the threshold.
> > > >
> > > > Testing this suggested update showed much better numbers:
> > > >
> > > > real 36m37s
> > > > user 35m8s
> > > > sys 9m32s
> > > >
> > > > written_back_pages: 10459423
> > > > reject_reclaim_fail: 12896
> > > > pool_limit_hit: 75653
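For context, the "acceptance threshold" mentioned above is the hysteresis
check zswap applies once the pool has hit its limit; from memory, the check
in mm/zswap.c is roughly the following (a sketch, not the exact code):

	static bool zswap_can_accept(void)
	{
		/* pages allowed: accept_thr_percent of the max pool size */
		return zswap_accept_thr_percent * zswap_max_pool_percent *
				totalram_pages() / (100 * 100) >
			/* pages currently held by the compressed pool */
			DIV_ROUND_UP(zswap_pool_total_size, PAGE_SIZE);
	}

So the shrink worker only has to write back enough entries to bring the pool
back under that fraction of the limit before stores are accepted again.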
> > >
> > > Impressive numbers, and great find in general!
> > >
> > > I wonder how other workloads benefit/regress from this change.
> > > Anything else that you happened to run? :)
> > >
> >
> > Hi Yosry,
> >
> > thanks for the quick feedback!
> >
> > Besides stressors, I haven't tested any other actual workload, we don't
> > have writeback in production yet so I can't provide any data from there.
> > I was wondering what kind of actual workload I could use to test this on
> > my desktop, but I couldn't think of anything relevant, I'm open to
> > suggestions though :)
>
> Nothing in particular in mind either. Perhaps others have ideas.
>
> > >
> > > > Fixes: 45190f01dd40 ("mm/zswap.c: add allocation hysteresis if pool limit is hit")
> > > > Signed-off-by: Domenico Cerasuolo
> > > > ---
> > > >  mm/zswap.c | 10 +++++++---
> > > >  1 file changed, 7 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/mm/zswap.c b/mm/zswap.c
> > > > index 59da2a415fbb..2ee0775d8213 100644
> > > > --- a/mm/zswap.c
> > > > +++ b/mm/zswap.c
> > > > @@ -587,9 +587,13 @@ static void shrink_worker(struct work_struct *w)
> > > >  {
> > > >         struct zswap_pool *pool = container_of(w, typeof(*pool),
> > > >                                                 shrink_work);
> > > > +       int ret;
> > > >
> > > > -       if (zpool_shrink(pool->zpool, 1, NULL))
> > > > -               zswap_reject_reclaim_fail++;
> > > > +       do {
> > > > +               ret = zpool_shrink(pool->zpool, 1, NULL);
> > > > +               if (ret)
> > > > +                       zswap_reject_reclaim_fail++;
> > > > +       } while (!zswap_can_accept() && ret != -EINVAL);
> > >
> > > One question/comment here about the retry logic.
> > >
> > > So I looked through the awfully convoluted writeback code, and there
> > > are multiple layers, and some of them tend to overwrite the return
> > > value of the layer underneath :(
> > >
> > > For zsmalloc (as an example):
> > > zpool_shrink()->zs_zpool_shrink()->zs_reclaim_page()->zpool_ops.evict()->zswap_writeback_entry().
> > >
> > > First of all, that zpool_ops is an unnecessarily confusing
> > > indirection, but anyway.
> > >
> > > - zswap_writeback_entry() will either return -ENOMEM or -EEXIST on error.
> > > - zs_reclaim_page()/zbud_reclaim_page() will return -EINVAL if the lru
> > >   is empty, and -EAGAIN on other errors.
> > > - zs_zpool_shrink()/zbud_zpool_shrink() will mostly propagate the
> > >   return value of zs_reclaim_page()/zbud_reclaim_page().
> > > - zpool_shrink() will return -EINVAL if the driver does not support
> > >   shrinking, otherwise it will propagate the return value from the
> > >   driver.
> > >
> > > So it looks like we will get -EINVAL only if the driver lru is empty
> > > or the driver does not support writeback, so rightfully we should not
> > > retry.
> > >
> > > If zswap_writeback_entry() returns -EEXIST, it probably means that we
> > > raced with another task decompressing the page, so rightfully we
> > > should retry to write back another page instead.
> > >
> > > If zswap_writeback_entry() returns -ENOMEM, it doesn't necessarily
> > > mean that we couldn't allocate memory (unfortunately), it looks like
> > > we will return -ENOMEM in other cases as well. Arguably, we can retry
> > > in all cases, because even if we were actually out of memory, we are
> > > trying to make an allocation that will eventually free more memory.
> > >
> > > In all cases, I think it would be nicer if we retry if ret == -EAGAIN
> > > instead. It is semantically more sane. In this specific case it is
> > > functionally a NOP as zs_reclaim_page()/zbud_reclaim_page() will mostly
> > > return -EAGAIN anyway, but in case someone tries to use saner errnos
> > > in the future, this will age better.
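To make that concrete, a shrink loop that retries only on -EAGAIN could look
roughly like this, reusing ret and pool from the hunk quoted above (an
illustrative sketch, not a tested patch):

	do {
		ret = zpool_shrink(pool->zpool, 1, NULL);
		if (ret) {
			zswap_reject_reclaim_fail++;
			/* give up on anything other than a transient failure */
			if (ret != -EAGAIN)
				break;
		}
	} while (!zswap_can_accept());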
> >
> > Retrying if ret == -EAGAIN seems much nicer indeed, will change it.
> >
> > >
> > > Also, do we intentionally want to keep retrying forever on failure? Do
> > > we want to add a max number of retries?
> >
> > If the drivers guaranteed that zpool_shrink will remove at least an entry
> > from their LRU, the retry wouldn't be needed because the LRU would
> > eventually be emptied. But this is an assumption on the implementation of
>
> I don't think any zpool driver can guarantee to write back at least one
> page. It can fail for reasons beyond its control (e.g. it cannot
> allocate a page to decompress to).
>
> > the zpool, so yes, we could use a max number of retries. Being a sanity
> > check, it should overshoot the required number of iterations in order to
> > avoid a premature break; what about retrying a maximum of
> > zswap_stored_pages times?
>
> Why is it just a sanity check? Consider a case where the system is under
> such extreme memory pressure that the drivers cannot allocate pages to
> decompress to. The drivers would be continuously competing with all other
> allocations on the machine. I think we should give up at some point. In
> any case, you changed zswap_frontswap_store() to goto shrink if
> !zswap_can_accept(), so the next time we try to compress a page to zswap
> we will queue the shrink work and try again anyway -- perhaps under better
> circumstances.
>
> I think zswap_stored_pages is too much; keep in mind that it also includes
> same-filled pages, which are not even stored in the zpool drivers. Maybe
> we should allow a fixed number of failures. If zpool_shrink() is
> successful, keep going until zswap_can_accept(). Otherwise allow a fixed
> number of failures before giving up.
>

Yes, I think I got carried away by a writeback frenzy: while testing the LRU
refactor I had to fight to get a decent amount of writebacks, and I ended up
being too afraid that the shrink work would stop.

What about using MAX_RECLAIM_RETRIES? I like the idea of putting a limit, but
I wouldn't know how to pick a fixed number.

> Maybe we can improve the error codes propagated through the writeback code
> as well to improve the retry logic. For example, if
> zswap_writeback_entry() returns EEXIST then retrying should be harmless,
> but ENOMEM might be harmful. Both of these are propagated as EAGAIN from
> zs_zpool_shrink()/zbud_zpool_shrink().
>

That would be a nice improvement indeed, I'd queue it after the LRU refactor
though.

> > > > >                 zswap_pool_put(pool);
> > > >  }
> > > >
> > > > @@ -1188,7 +1192,7 @@ static int zswap_frontswap_store(unsigned type, pgoff_t offset,
> > > >                 if (zswap_pool_reached_full) {
> > > >                         if (!zswap_can_accept()) {
> > > >                                 ret = -ENOMEM;
> > > > -                               goto reject;
> > > > +                               goto shrink;
> > > >                         } else
> > > >                                 zswap_pool_reached_full = false;
> > > >                 }
> > > > --
> > > > 2.34.1
> > > >
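For reference, combining the two suggestions above (retry only on -EAGAIN,
plus a bounded number of failures) the shrink worker could end up looking
roughly like the sketch below; it assumes MAX_RECLAIM_RETRIES from
mm/internal.h is an acceptable bound, which is still up for discussion:

	static void shrink_worker(struct work_struct *w)
	{
		struct zswap_pool *pool = container_of(w, typeof(*pool),
							shrink_work);
		int ret, failures = 0;

		do {
			ret = zpool_shrink(pool->zpool, 1, NULL);
			if (ret) {
				zswap_reject_reclaim_fail++;
				/* permanent errors: LRU empty or no writeback support */
				if (ret != -EAGAIN)
					break;
				/* transient errors: cap how often we keep trying */
				if (++failures == MAX_RECLAIM_RETRIES)
					break;
			}
			/* writeback can be slow, don't hog the CPU */
			cond_resched();
		} while (!zswap_can_accept());
		zswap_pool_put(pool);
	}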