Date: Mon, 11 Mar 2024 17:58:01 -0400
From: Kent Overstreet
To: Matthew Wilcox
Cc: Dave Chinner, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: On the optimum size of a batch
Message-ID: <3rn4qk754v7esfz3gtcjwoxoguidu5o6bv4mumwmfmzaq27bir@5td5wd4xdjoj>

On Mon, Mar 11, 2024 at 03:41:35AM +0000, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote:
> > > Batch size   Cost of allocating
> > >              100              thousand   million
> > > 1            500 (5 * 100)    5000       5M
> > > 2            300 (6 * 50)     3000       3M
> > > 4            200 (8 * 25)     2000       2M
> > > 8            156 (12 * 13)    1500       1.5M
> > > 16           140 (20 * 7)     1260       1.25M
> > > 32           144 (36 * 4)     1152       1.13M
> > > 64           136 (68 * 2)     1088       1.06M
> > > 128          132 (132 * 1)    1056       1.03M
> >
> > Isn't this just repeating the fundamental observation that SLUB is
> > based on? i.e. it can use high-order pages so that it can
> > pre-allocate optimally sized batches of objects regardless of their
> > size? i.e. it tries to size the backing page order to allocate in
> > chunks of 30-40 objects at a time?
>
> What SLUB is currently doing is inefficient. One of the conversations
> I had (off-list) about appropriate batch size is in relation to
> SLUB-ng, where one of the participants claimed that a batch size of
> 32 was obviously too small (it wasn't; the performance problem was
> due to a bug).
>
> What you're thinking about is the cost of allocating from the page
> allocator (which is now much cheaper than it used to be, but should
> be cheaper than it currently is). But there is another inefficiency
> to consider, which is that the slab allocator has a per-slab lock,
> and while you can efficiently remove and add a number of objects to
> a single slab, you might only have one or two free objects per slab.
> To work around this, some of the more performance-sensitive parts of
> the kernel have implemented their own allocators in front of slab.
> This is clearly a bad thing for all of us, and hence Vlastimil has
> been working on a better approach.
>
> https://lore.kernel.org/linux-mm/20231129-slub-percpu-caches-v3-0-6bcf536772bc@suse.cz/
>
> > Except for SLUB we're actually allocating in the hundreds of
> > millions to billions of objects on machines with TBs of RAM. IOWs we
> > really want to be much further down the curve than 8 - batches of at
> > least 32-64 have significantly lower cost and that matters when
> > scaling to (and beyond) hundreds of millions of objects....
>
> But that doesn't necessarily mean that you want a larger batch size.
> Because you're not just allocating, you're also freeing, and over a
> large enough timescale the number of objects allocated and freed is
> approximately equal. In the SLUB case, your batch size needs to be
> large enough to absorb most of the allocation-vs-free bias jitter;
> that is, if you know they always alternate AFAFAFAFAF, a batch size
> of 2 would be fine. If you know you get four allocations followed by
> four frees, having a batch size of 5 would be fine. We'd never go to
> the parent allocator if we got a AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF
> pattern.
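To make the arithmetic behind the quoted table concrete, here is a
minimal userspace sketch that reproduces it. The cost model is inferred
from the table's own numbers rather than stated explicitly in the
thread: each trip to the parent allocator costs a fixed 4 units plus 1
unit per object, so allocating N objects in batches of B costs
ceil(N/B) * (4 + B), and the amortized per-object cost of 1 + 4/B is
why the curve flattens quickly past a batch size of roughly 16-32.

#include <stdio.h>

/*
 * Reproduce the quoted batch-size cost table.  Assumed model: each
 * batch allocation costs 4 fixed units plus 1 unit per object, so
 * N objects allocated in batches of B cost ceil(N/B) * (4 + B).
 */
static unsigned long cost(unsigned long n, unsigned long batch)
{
	unsigned long trips = (n + batch - 1) / batch;	/* ceil(n / batch) */

	return trips * (4 + batch);
}

int main(void)
{
	printf("batch    per-100   per-1000     per-1M\n");
	for (unsigned long b = 1; b <= 128; b *= 2)
		printf("%5lu  %9lu  %9lu  %9lu\n",
		       b, cost(100, b), cost(1000, b), cost(1000000, b));
	return 0;
}

For batch sizes 8 and 16 this prints 156/1500/1500012 and 140/1260/1250020,
matching the 1.5M and 1.25M entries above once rounded.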
Taking a step back - you're talking about one particular use case for
fbatch, but fbatch is used for lots of different things (there are lots
of places we want to vectorize and batch!), and there's more than one
cache effect to consider.

It's not just data cache that's an issue; if we only considered data
cache, a smaller fbatch might be preferable, so that our working set
stays entirely in L1. But icache usage always wants a larger batch
size: "do this one thing with all the pages", then jump to unrelated
code and do the next thing - a larger batch size means fewer icache
misses. And icache is branchier and harder to prefetch than accesses
to fbatch - that's just a vector - but you're going to see that a lot
less in microbenchmarks than in real world code.

In the real world, you want to hand off the _whole_ dataset you're
working on from one codepath to the next wherever practical. This is
why I've been harping on the readahead path - making the fs repeatedly
do xarray lookups to get the next folio, when the core mm/readahead.c
code _already did those lookups_, instead of just passing it a vector,
is totally stupid... and if fbatch had been a proper dynamically sized
vector instead of a small fixed-size array, this never would've
happened in the first place...
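As an illustration of that last point, here is a minimal userspace
sketch of what a dynamically sized fbatch could look like. The names
(struct folio_vec, fvec_push, fvec_exit) are hypothetical, not an
existing kernel API, and struct folio is left opaque; in-tree, the
darray helpers that bcachefs carries are the closest analogue to this
growable-vector pattern.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct folio;	/* opaque stand-in for the kernel's struct folio */

/* Hypothetical growable vector of folio pointers. */
struct folio_vec {
	struct folio	**data;
	size_t		nr;
	size_t		capacity;
};

/* Append a folio, doubling capacity as needed (amortized O(1)). */
static int fvec_push(struct folio_vec *v, struct folio *f)
{
	if (v->nr == v->capacity) {
		size_t new_cap = v->capacity ? v->capacity * 2 : 16;
		struct folio **d = realloc(v->data, new_cap * sizeof(*d));

		if (!d)
			return -1;	/* kernel code would return -ENOMEM */
		v->data = d;
		v->capacity = new_cap;
	}
	v->data[v->nr++] = f;
	return 0;
}

static void fvec_exit(struct folio_vec *v)
{
	free(v->data);
	memset(v, 0, sizeof(*v));
}

int main(void)
{
	struct folio_vec v = { 0 };

	/*
	 * Readahead would push each folio as it does its xarray lookup,
	 * then hand the whole vector to the filesystem in one call
	 * instead of making it redo the lookups one folio at a time.
	 * Fake pointer values here, since this is a userspace sketch.
	 */
	for (uintptr_t i = 1; i <= 1000; i++) {
		if (fvec_push(&v, (struct folio *)i)) {
			fvec_exit(&v);
			return 1;
		}
	}
	printf("%zu folios, capacity %zu\n", v.nr, v.capacity);
	fvec_exit(&v);
	return 0;
}

The point of the sketch is only that the caller never hits a hard
PAGEVEC_SIZE-style cap: however many folios readahead looked up, the
whole set travels together to the next codepath.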