I've had a few conversations recently about how many objects should be
in a batch in some disparate contexts, so I thought I'd write down my
opinion so I can refer to it in future. TLDR: Start your batch size
around 10, adjust the batch size when measurements tell you to change it.
In this model, let's look at the cost of allocating N objects from an
allocator. Assume there's a fixed cost, say 4 (units are not relevant
here) for going into the allocator and then there's a 1 unit cost per
object (eg we're taking a spinlock, pulling N objects out of the data
structure and releasing the spinlock again).
Allocating 100 * 1 objects would cost 500 units. Our best case is that
we could save 396 units by allocating a batch of 100. But we probably
don't know how many objects we're going to need to allocate, so we pull
objects from the allocator in smaller batches. Here's a table showing
the costs for different batch sizes:
Batch size   Cost of allocating 100    thousand    million
         1   500 (5 * 100)                 5000      5M
         2   300 (6 * 50)                  3000      3M
         4   200 (8 * 25)                  2000      2M
         8   156 (12 * 13)                 1500      1.5M
        16   140 (20 * 7)                  1260      1.25M
        32   144 (36 * 4)                  1152      1.13M
        64   136 (68 * 2)                  1088      1.06M
       128   132 (132 * 1)                 1056      1.03M
You can see the knee of this curve is around 8. It fluctuates a bit after
that depending on how many "left over" objects we have after allocating
the 100 it turned out that we needed. Even if we think we're going to
be dealing with a _lot_ of objects (the thousand and million columns),
we've got most of the advantage by the time we get to 8 (eg a reduction
of 3.5M from a total possible reduction of 4M), and while I wouldn't
sneeze at getting a few more percentage points of overhead reduction,
we're scrabbling at the margins now, not getting big wins.
This is a simple model for only one situation. If we have a locking
contention breakdown, the overhead cost might be much higher than 4 units,
and that would lead us to a larger batch size.
Another consideration is how much of each object we have to touch.
put_pages_list() is frequently called with batches of 500 pages. In order
to free a folio, we have to manipulate its contents, so touching at least
one cacheline per object. And we make multiple passes over the batch,
first decrementing the refcount, removing it from the lru list; second
uncharging the folios from the memcg (writes to folio->memcg_data);
third calling free_pages_prepare which, eg, sets ->mapping to NULL;
fourth putting the folio on the pcp list (writing to the list_head).
With 500 folios on the list, that uses 500 * 64 bytes of cache which
just barely fits into the L1 cache of a modern laptop CPU (never mind
whatever else we might want to have in the L1). Capping the batch size
at 15 (as my recent patches do) uses only 1kB of L1, which is a much
more reasonable amount of cache to take up. We can be pretty sure the
first one is still in it when the last one has finished being processed.
From: Matthew Wilcox
> Sent: 07 March 2024 19:55
>
> I've had a few conversations recently about how many objects should be
> in a batch in some disparate contexts, so I thought I'd write down my
> opinion so I can refer to it in future. TLDR: Start your batch size
> around 10, adjust the batch size when measurements tell you to change it.
>
> In this model, let's look at the cost of allocating N objects from an
> allocator. Assume there's a fixed cost, say 4 (units are not relevant
> here) for going into the allocator and then there's a 1 unit cost per
> object (eg we're taking a spinlock, pulling N objects out of the data
> structure and releasing the spinlock again).
I think you are trying to measure the length of a piece of string.
(and not the ones in the box labelled 'pieces of string too small to keep')
What I did recently for a global+local buffer allocator was to make
each entry on the global list itself be a list of objects.
So if the local list was empty it was a single cmpxchg to grab
(about 100) items.
Similarly if the local free list got too big a single locked
operation would free a block of items.
That significantly reduced lock contention.
It was necessary to split the free of very long lists - otherwise
a single item on the global list could contain silly numbers of items.
This was userspace, and we don't talk about the test that ended up
with ALL system memory on one linked list.
I managed to kill enough (remote) things to get a working shell.
It took the system about 20 minutes just to count the linked list!
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On Thu, Mar 07, 2024 at 07:55:01PM +0000, Matthew Wilcox wrote:
> I've had a few conversations recently about how many objects should be
> in a batch in some disparate contexts, so I thought I'd write down my
> opinion so I can refer to it in future. TLDR: Start your batch size
> around 10, adjust the batch size when measurements tell you to change it.
>
> In this model, let's look at the cost of allocating N objects from an
> allocator. Assume there's a fixed cost, say 4 (units are not relevant
> here) for going into the allocator and then there's a 1 unit cost per
> object (eg we're taking a spinlock, pulling N objects out of the data
> structure and releasing the spinlock again).
>
> Allocating 100 * 1 objects would cost 500 units. Our best case is that
> we could save 396 units by allocating a batch of 100. But we probably
> don't know how many objects we're going to need to allocate, so we pull
> objects from the allocator in smaller batches. Here's a table showing
> the costs for different batch sizes:
>
> Batch size   Cost of allocating 100    thousand    million
>          1   500 (5 * 100)                 5000      5M
>          2   300 (6 * 50)                  3000      3M
>          4   200 (8 * 25)                  2000      2M
>          8   156 (12 * 13)                 1500      1.5M
>         16   140 (20 * 7)                  1260      1.25M
>         32   144 (36 * 4)                  1152      1.13M
>         64   136 (68 * 2)                  1088      1.06M
>        128   132 (132 * 1)                 1056      1.03M
Isn't this just repeating the fundamental observation that SLUB is
based on? i.e. it can use high-order pages so that it can
pre-allocate optimally sized batches of objects regardless of their
size? i.e. it tries to size the backing page order to allocate in
chunks of 30-40 objects at a time?
> You can see the knee of this curve is around 8. It fluctuates a bit after
> that depending on how many "left over" objects we have after allocating
> the 100 it turned out that we needed. Even if we think we're going to
> be dealing with a _lot_ of objects (the thousand and million columns),
> we've got most of the advantage by the time we get to 8 (eg a reduction
> of 3.5M from a total possible reduction of 4M), and while I wouldn't
> sneeze at getting a few more percentage points of overhead reduction,
> we're scrabbling at the margins now, not getting big wins.
Except for SLUB we're actually allocating in the hundreds of
millions to billions of objects on machines with TBs of RAM. IOWs we
really want to be much further down the curve than 8 - batches of at
least 32-64 have significantly lower cost and that matters when
scaling to (and beyond) hundreds of millions of objects....
> This is a simple model for only one situation. If we have a locking
> contention breakdown, the overhead cost might be much higher than 4 units,
> and that would lead us to a larger batch size.
>
> Another consideration is how much of each object we have to touch.
> put_pages_list() is frequently called with batches of 500 pages. In order
> to free a folio, we have to manipulate its contents, so touching at least
> one cacheline per object.
Right, that's simply the cost of the batch cache footprint issue
rather than a "fixed cost mitigation" described for allocation.
So I'm not sure what you're trying to say here? We've known about
these batch optimisation considerations for a long, long time and
that batch size optimisation is always algorithm and access pattern
dependent, so.... ???
> And we make multiple passes over the batch,
> first decrementing the refcount, removing it from the lru list; second
> uncharging the folios from the memcg (writes to folio->memcg_data);
> third calling free_pages_prepare which, eg, sets ->mapping to NULL;
> fourth putting the folio on the pcp list (writing to the list_head).
Sounds like "batch cache footprint" would be reduced by inverting
that algorithm and doing all the work on a single object in a single
pass, rather than doing it in multiple passes. That way the cache
footprint of the batch is determined entirely by the size of the
data structures accessed to process each object in the batch.
i.e. if you are going to take an L1 cache miss accessing every
object in the batch anyway, then reducing batch size doesn't improve
overall per-object processing efficiency. All it does is keep the
processing cost down to a single L1 cache miss per object in the
batch. The tradeoff for this is more frequent batch refills, so this
only works if the additional fixed cost for obtaining each batch is
lower than the cost of multiple L1 cache misses per object....
All this says to me is that sometimes the batch size is not actually
the problem that needs fixing - changing the algorithm
and/or processing pipeline to remove the possibility of repeated
accesses to individual objects in the batch reduces batch size
selection to the same "fixed cost mitigation" case you started
with....
-Dave.
--
Dave Chinner
[email protected]
On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote:
> > Batch size   Cost of allocating 100    thousand    million
> >          1   500 (5 * 100)                 5000      5M
> >          2   300 (6 * 50)                  3000      3M
> >          4   200 (8 * 25)                  2000      2M
> >          8   156 (12 * 13)                 1500      1.5M
> >         16   140 (20 * 7)                  1260      1.25M
> >         32   144 (36 * 4)                  1152      1.13M
> >         64   136 (68 * 2)                  1088      1.06M
> >        128   132 (132 * 1)                 1056      1.03M
>
> Isn't this just repeating the fundamental observation that SLUB is
> based on? i.e. it can use high-order pages so that it can
> pre-allocate optimally sized batches of objects regardless of their
> size? i.e. it tries to size the backing page order to allocate in
> chunks of 30-40 objects at a time?
What SLUB is currently doing is inefficient. One of the conversations
I had (off-list) about appropriate batch size is in relation to SLUB-ng
where one of the participants claimed that the batch size of 32 was
obviously too small (it wasn't; the performance problem was due to a
bug).
What you're thinking about is the cost of allocating from the page
allocator (which is now much cheaper than it used to be, but should
be cheaper than it currently is). But there is another inefficiency
to consider, which is that the slab allocator has a per-slab lock,
and while you can efficiently remove and add a number of objects to
a single slab, you might only have one or two free objects per slab.
To work around this some of the more performance sensitive parts of the
kernel have implemented their own allocator in front of slab. This is
clearly a bad thing for all of us, and hence Vlastimil has been working
on a better approach.
https://lore.kernel.org/linux-mm/[email protected]/
> Except for SLUB we're actually allocating in the hundreds of
> millions to billions of objects on machines with TBs of RAM. IOWs we
> really want to be much further down the curve than 8 - batches of at
> least 32-64 have significantly lower cost and that matters when
> scaling to (and beyond) hundreds of millions of objects....
But that doesn't necessarily mean that you want a larger batch size.
Because you're not just allocating, you're also freeing and over a
large enough timescale the number of objects allocated and freed is
approximately equal. In the SLUB case, your batch size needs to be
large enough to absorb most of the allocation-vs-free bias jitter; that
is, if you know they always alternate AFAFAFAFAF, a batch size of 2 would
be fine. If you know you get four allocations followed by four frees,
having a batch size of 5 would be fine. We'd never go to the parent
allocator if we got an AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF pattern.
> > This is a simple model for only one situation. If we have a locking
> > contention breakdown, the overhead cost might be much higher than 4 units,
> > and that would lead us to a larger batch size.
> >
> > Another consideration is how much of each object we have to touch.
> > put_pages_list() is frequently called with batches of 500 pages. In order
> > to free a folio, we have to manipulate its contents, so touching at least
> > one cacheline per object.
>
> Right, that's simply the cost of the batch cache footprint issue
> rather than a "fixed cost mitigation" described for allocation.
No, it's not, it's an illustration that too large a batch size can
actively harm you.
> So I'm not sure what you're trying to say here? We've known about
> these batch optimisation considerations for a long, long time and
> that batch size optimisation is always algorithm and access pattern
> dependent, so.... ???
People forget these "things we've always known". I went looking and
couldn't find a good writeup of this, so did my own. In addition to the
percpu slub batch size, various people have opined that the folio_batch
size (15 objects) is too small for doing things like writeback and
readahead. They're going to have to bring data to convince me.
> > And we make multiple passes over the batch,
> > first decrementing the refcount, removing it from the lru list; second
> > uncharging the folios from the memcg (writes to folio->memcg_data);
> > third calling free_pages_prepare which, eg, sets ->mapping to NULL;
> > fourth putting the folio on the pcp list (writing to the list_head).
>
> Sounds like "batch cache footprint" would be reduced by inverting
> that algorithm and doing all the work on a single object in a single
> pass, rahter than doing it in multiple passes. That way the cache
> footprint of the batch is determined entirely by the size of the
> data structures accessed to process each object in the batch.
Well, now you're just opining without having studied the problem, and
I have, so I can say confidently that you're wrong. You could read
the code if you like.
From: Matthew Wilcox
> Sent: 11 March 2024 03:42
..
> But that doesn't necessarily mean that you want a larger batch size.
> Because you're not just allocating, you're also freeing and over a
> large enough timescale the number of objects allocated and freed is
> approximately equal. In the SLUB case, your batch size needs to be
> large enough to absorb most of the allocation-vs-free bias jitter; that
> is if you know they always alternate AFAFAFAFAF a batch size of 2 would
> be fine. If you know you get four allocations followed by four frees,
> having a batch size of 5 would be fine. We'd never go to the parent
> allocator if we got an AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF pattern.
That isn't really the allocation pattern you need to worry about.
A per-CPU free list should be large enough to handle it.
The problem (as the first people doing a sparc SMP port found) is
that you'll get one bit of code that 'pumps' items from one
free list to another.
So you have to decide that you have too many local free objects
and then give some back to the global free list.
Keeping them on the global list as a block of 'n' items can
make things far better.
Indeed 'arrays of pointers' are likely to be better than a
linked list.
Caches in front of SLUB (or similar) really shouldn't be needed.
Except, perhaps, to indicate which list the items come from
and, maybe for some extra allocation stats.
There might be rcu oddities - where the memory can't be used
for a different structure. But there are probably alternative
solutions to that one.
The page free code is a different problem.
I suspect that needs to process batches of items to avoid
repeated (expensive) atomic accesses.
But it definitely needs to avoid thrashing the L1 cache.
David
On Mon, Mar 11, 2024 at 03:41:35AM +0000, Matthew Wilcox wrote:
> On Mon, Mar 11, 2024 at 01:12:45PM +1100, Dave Chinner wrote:
> > > Batch size   Cost of allocating 100    thousand    million
> > >          1   500 (5 * 100)                 5000      5M
> > >          2   300 (6 * 50)                  3000      3M
> > >          4   200 (8 * 25)                  2000      2M
> > >          8   156 (12 * 13)                 1500      1.5M
> > >         16   140 (20 * 7)                  1260      1.25M
> > >         32   144 (36 * 4)                  1152      1.13M
> > >         64   136 (68 * 2)                  1088      1.06M
> > >        128   132 (132 * 1)                 1056      1.03M
> >
> > Isn't this just repeating the fundamental observation that SLUB is
> > based on? i.e. it can use high-order pages so that it can
> > pre-allocate optimally sized batches of objects regardless of their
> > size? i.e. it tries to size the backing page order to allocate in
> > chunks of 30-40 objects at a time?
>
> What SLUB is currently doing is inefficient. One of the conversations
> I had (off-list) about appropriate batch size is in relation to SLUB-ng
> where one of the participants claimed that the batch size of 32 was
> obviously too small (it wasn't; the performance problem was due to a
> bug).
>
> What you're thinking about is the cost of allocating from the page
> allocator (which is now much cheaper than it used to be, but should
> be cheaper than it currently is). But there is another inefficiency
> to consider, which is that the slab allocator has a per-slab lock,
> and while you can efficiently remove and add a number of objects to
> a single slab, you might only have one or two free objects per slab.
> To work around this some of the more performance sensitive parts of the
> kernel have implemented their own allocator in front of slab. This is
> clearly a bad thing for all of us, and hence Vlastimil has been working
> on a better approach.
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> > Except for SLUB we're actually allocating in the hundreds of
> > millions to billions of objects on machines with TBs of RAM. IOWs we
> > really want to be much further down the curve than 8 - batches of at
> > least 32-64 have significantly lower cost and that matters when
> > scaling to (and beyond) hundreds of millions of objects....
>
> But that doesn't necessarily mean that you want a larger batch size.
> Because you're not just allocating, you're also freeing and over a
> large enough timescale the number of objects allocated and freed is
> approximately equal. In the SLUB case, your batch size needs to be
> large enough to absorb most of the allocation-vs-free bias jitter; that
> is if you know they always alternate AFAFAFAFAF a batch size of 2 would
> be fine. If you know you get four allocations followed by four frees,
> having a batch size of 5 would be fine. We'd never go to the parent
> allocator if we got an AFAAFFAAAFFFAAAAFFFFAAFFAFAAFAAFFF pattern.
taking a step back - you're talking about one particular use case for
fbatch, but fbatch is used for lots of different things (there are lots
of places we want to vectorize and batch!).
There's also lots of different cache effects to consider. It's not just
data cache that's an issue; if we only consider data cache a smaller
fbatch might be preferable, so that our working set stays entirely in
L1.
But - icache usage always wants a larger batch size; "do this one thing
with all the pages", then jump to unrelated code and do the next thing -
larger batch size means fewer icache misses. And icache is branchier and
harder to prefetch than access to fbatch - it's just a vector - but
you're going to be seeing that a lot less in microbenchmarks than in
real world code.
Real world, you want to just hand off the _whole_ dataset you're working
on from one codepath to the next wherever practical.
this is why I've been harping on the readahead path - the way it makes
the fs repeatedly do xarray lookups to get the next folio when the core
mm/readahead.c code _already did that_ instead of just passing it a
vector is totally stupid...
and, if fbatch had been a proper dynamically sized vector instead of a
small fixed size array, this never would've happened in the first
place...