Date: Thu, 7 Mar 2024 19:55:01 +0000
From: Matthew Wilcox <willy@infradead.org>
To: linux-mm@kvack.org
Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: On the optimum size of a batch

I've had a few conversations recently about how many objects should be
in a batch in some disparate contexts, so I thought I'd write down my
opinion so I can refer to it in future.

TLDR: Start your batch size around 10, and adjust it when measurements
tell you to.

In this model, let's look at the cost of allocating N objects from an
allocator.  Assume there's a fixed cost, say 4 (units are not relevant
here), for going into the allocator, and then a cost of 1 unit per
object (eg we take a spinlock, pull N objects out of the data structure
and release the spinlock again).  Allocating 100 objects one at a time
would cost 500 units.  Our best case is a single batch of 100, which
costs 104 units, saving 396 units.

But we probably don't know how many objects we're going to need to
allocate, so we pull objects from the allocator in smaller batches.
Here's a table showing the costs for different batch sizes:

Batch size    Cost of allocating 100    thousand    million
     1        500 (5 * 100)             5000        5M
     2        300 (6 * 50)              3000        3M
     4        200 (8 * 25)              2000        2M
     8        156 (12 * 13)             1500        1.5M
    16        140 (20 * 7)              1260        1.25M
    32        144 (36 * 4)              1152        1.13M
    64        136 (68 * 2)              1088        1.06M
   128        132 (132 * 1)             1056        1.03M

You can see the knee of this curve is around 8.  It fluctuates a bit
after that, depending on how many "left over" objects we have once
we've allocated the 100 we turned out to need.  Even if we think we're
going to be dealing with a _lot_ of objects (the thousand and million
columns), we've got most of the advantage by the time we get to a batch
of 8 (eg a reduction of 3.5M from a total possible reduction of 4M),
and while I wouldn't sneeze at getting a few more percentage points of
overhead reduction, we're scrabbling at the margins now, not getting
big wins.

This is a simple model for only one situation.  If a lock contention
breakdown shows that the fixed cost of going into the allocator is much
higher than 4 units, that would lead us to a larger batch size.

Another consideration is how much of each object we have to touch.
put_pages_list() is frequently called with batches of 500 pages.  In
order to free a folio, we have to manipulate its contents, so we touch
at least one cacheline per object.  And we make multiple passes over
the batch: first decrementing the refcount and removing the folio from
the LRU list; second uncharging the folios from the memcg (writes to
folio->memcg_data); third calling free_pages_prepare(), which, eg, sets
->mapping to NULL; fourth putting the folio on the pcp list (writing to
the list_head).  With 500 folios on the list, that touches 500 * 64
bytes (about 31kB) of cache, which only just fits into the L1 cache of
a modern laptop CPU (never mind whatever else we might want to have in
the L1).  Capping the batch size at 15 (as my recent patches do) uses
only about 1kB of L1, which is a much more reasonable amount of cache
to take up, and we can be pretty sure the first folio in the batch is
still in L1 when the last one has finished being processed.
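
If you want to play with the model, here's a small userspace sketch
that reproduces the table above.  The 4-unit fixed cost and the 1-unit
per-object cost are just the assumptions stated earlier (not measured
numbers); the only other ingredient is that obtaining `total` objects
with a given batch size takes ceil(total / batch) trips into the
allocator.

	/* batch_cost.c - reproduce the cost table above.
	 *
	 * Model: each trip into the allocator costs FIXED_COST units,
	 * plus PER_OBJECT units for every object pulled out.  To get
	 * `total` objects with a given batch size we need
	 * ceil(total / batch) trips.
	 */
	#include <stdio.h>

	#define FIXED_COST	4	/* cost of entering the allocator */
	#define PER_OBJECT	1	/* cost per object handed out */

	static unsigned long cost(unsigned long total, unsigned long batch)
	{
		unsigned long trips = (total + batch - 1) / batch;

		return trips * (FIXED_COST + batch * PER_OBJECT);
	}

	int main(void)
	{
		unsigned long totals[] = { 100, 1000, 1000000 };
		unsigned long batch;

		printf("batch      100     1000   1000000\n");
		for (batch = 1; batch <= 128; batch *= 2)
			printf("%5lu %8lu %8lu %9lu\n", batch,
			       cost(totals[0], batch),
			       cost(totals[1], batch),
			       cost(totals[2], batch));
		return 0;
	}

Raising FIXED_COST (to mimic a heavily contended lock) pushes the knee
of the curve to a larger batch size, which is the point made above.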