2023-06-26 19:14:48

by Matthew Wilcox

Subject: Re: [PATCH] fs/buffer.c: remove per-CPU buffer_head lookup cache

On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> Upon closer investigation, it was found that in the current codebase,
> lookup_bh_lru is slower than __find_get_block_slow:
>
> 114 ns per __find_get_block
> 68 ns per __find_get_block_slow
>
> So remove the per-CPU buffer_head caching.

LOL. That's amazing. I can't even see why it's so expensive. The
local_irq_disable(), perhaps? Your test case is the best possible
one for lookup_bh_lru() where you're not even doing the copy.
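
For context, the fast path under discussion looks roughly like this (a
simplified sketch along the lines of fs/buffer.c at the time; details may
differ between kernel versions). The "copy" is the move-to-front shuffle,
which is skipped when the hit is already at slot 0, as it is when the same
block is looked up repeatedly:

static struct buffer_head *
lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
{
        struct buffer_head *ret = NULL;
        unsigned int i;

        check_irqs_on();
        bh_lru_lock();          /* local_irq_disable() on SMP */
        for (i = 0; i < BH_LRU_SIZE; i++) {
                struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);

                if (bh && bh->b_blocknr == block && bh->b_bdev == bdev &&
                    bh->b_size == size) {
                        /* Move-to-front: only runs when the hit was not
                         * already at the head of the per-CPU array. */
                        if (i) {
                                while (i) {
                                        __this_cpu_write(bh_lrus.bhs[i],
                                                __this_cpu_read(bh_lrus.bhs[i - 1]));
                                        i--;
                                }
                                __this_cpu_write(bh_lrus.bhs[0], bh);
                        }
                        get_bh(bh);
                        ret = bh;
                        break;
                }
        }
        bh_lru_unlock();
        return ret;
}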

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>


2023-06-26 23:47:37

by Dave Chinner

Subject: Re: [PATCH] fs/buffer.c: remove per-CPU buffer_head lookup cache

On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > Upon closer investigation, it was found that in the current codebase,
> > lookup_bh_lru is slower than __find_get_block_slow:
> >
> > 114 ns per __find_get_block
> > 68 ns per __find_get_block_slow
> >
> > So remove the per-CPU buffer_head caching.
>
> LOL. That's amazing. I can't even see why it's so expensive. The
> local_irq_disable(), perhaps? Your test case is the best possible
> one for lookup_bh_lru() where you're not even doing the copy.

I think it's even simpler than that.

i.e. the lookaside cache is being missed, so it's a pure cost and
the code is always having to call __find_get_block_slow() anyway.
Peeking at 16 buffers to not find a match is just as expensive as
walking 3-4 tree levels in an XArray to find the buffer in the first
place....

IMO, this is an example of how lookaside caches are only a benefit
if the working set of items largely fits in the lookaside cache and
the cache lookup itself is much, much slower than a lookaside cache
miss.
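
For comparison, the "slow" path is essentially one page-cache lookup (the
XArray walk) plus a short walk of the buffer_heads attached to that folio.
A simplified sketch in the spirit of __find_get_block_slow() (locking and
the page/folio details may differ between kernel versions):

static struct buffer_head *
find_block_slow_sketch(struct block_device *bdev, sector_t block)
{
        struct address_space *mapping = bdev->bd_inode->i_mapping;
        struct buffer_head *ret = NULL, *bh, *head;
        struct folio *folio;
        pgoff_t index;

        /* One XArray walk: find the folio that caches this block. */
        index = block >> (PAGE_SHIFT - bdev->bd_inode->i_blkbits);
        folio = __filemap_get_folio(mapping, index, FGP_ACCESSED, 0);
        if (IS_ERR(folio))
                return NULL;

        /* Then a short walk of the buffer ring on that folio. */
        spin_lock(&mapping->private_lock);
        head = folio_buffers(folio);
        if (head) {
                bh = head;
                do {
                        if (bh->b_blocknr == block) {
                                get_bh(bh);
                                ret = bh;
                                break;
                        }
                        bh = bh->b_this_page;
                } while (bh != head);
        }
        spin_unlock(&mapping->private_lock);
        folio_put(folio);
        return ret;
}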

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-06-27 00:18:18

by Matthew Wilcox

Subject: Re: [PATCH] fs/buffer.c: remove per-CPU buffer_head lookup cache

On Tue, Jun 27, 2023 at 09:30:09AM +1000, Dave Chinner wrote:
> On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> > On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > > Upon closer investigation, it was found that in the current codebase,
> > > lookup_bh_lru is slower than __find_get_block_slow:
> > >
> > > 114 ns per __find_get_block
> > > 68 ns per __find_get_block_slow
> > >
> > > So remove the per-CPU buffer_head caching.
> >
> > LOL. That's amazing. I can't even see why it's so expensive. The
> > local_irq_disable(), perhaps? Your test case is the best possible
> > one for lookup_bh_lru() where you're not even doing the copy.
>
> I think it's even simpler than that.
>
> i.e. the lookaside cache is being missed, so it's a pure cost and
> the code is always having to call __find_get_block_slow() anyway.

How does that happen?

__find_get_block(struct block_device *bdev, sector_t block, unsigned size)
{
        struct buffer_head *bh = lookup_bh_lru(bdev, block, size);

        if (bh == NULL) {
                /* __find_get_block_slow will mark the page accessed */
                bh = __find_get_block_slow(bdev, block);
                if (bh)
                        bh_lru_install(bh);
The second (and all subsequent) calls to __find_get_block() should find
the BH in the LRU.

> IMO, this is an example of how lookaside caches are only a benefit
> if the working set of items largely fits in the lookaside cache and
> the cache lookup itself is much, much slower than a lookaside cache
> miss.

But the test code he posted always asks for the same buffer each time.
So it should find it in the lookaside cache?

2023-06-27 01:04:10

by Dave Chinner

Subject: Re: [PATCH] fs/buffer.c: remove per-CPU buffer_head lookup cache

On Tue, Jun 27, 2023 at 01:13:25AM +0100, Matthew Wilcox wrote:
> On Tue, Jun 27, 2023 at 09:30:09AM +1000, Dave Chinner wrote:
> > On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> > > On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > > > Upon closer investigation, it was found that in the current codebase,
> > > > lookup_bh_lru is slower than __find_get_block_slow:
> > > >
> > > > 114 ns per __find_get_block
> > > > 68 ns per __find_get_block_slow
> > > >
> > > > So remove the per-CPU buffer_head caching.
> > >
> > > LOL. That's amazing. I can't even see why it's so expensive. The
> > > local_irq_disable(), perhaps? Your test case is the best possible
> > > one for lookup_bh_lru() where you're not even doing the copy.
> >
> > I think it's even simpler than that.
> >
> > i.e. the lookaside cache is being missed, so it's a pure cost and
> > the code is always having to call __find_get_block_slow() anyway.
>
> How does that happen?
>
> __find_get_block(struct block_device *bdev, sector_t block, unsigned size)
> {
>         struct buffer_head *bh = lookup_bh_lru(bdev, block, size);
>
>         if (bh == NULL) {
>                 /* __find_get_block_slow will mark the page accessed */
>                 bh = __find_get_block_slow(bdev, block);
>                 if (bh)
>                         bh_lru_install(bh);
>
> The second (and all subsequent) calls to __find_get_block() should find
> the BH in the LRU.
>
> > IMO, this is an example of how lookaside caches are only a benefit
> > if the working set of items largely fits in the lookaside cache and
> > the cache lookup itself is much, much slower than a lookaside cache
> > miss.
>
> But the test code he posted always asks for the same buffer each time.
> So it should find it in the lookaside cache?

Oh.

for (i = 0; ....) {
        bh = __find_get_block(bdev, 1, 512);

That's a '1' being passed to __find_get_block, not 'i'.

/me goes and gets more coffee.

Maybe it's CONFIG_PREEMPT_RT=y doing something to the locks that
isn't obvious here...
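
If I'm remembering fs/buffer.c correctly, the locking in question is just
this, so the RT question is what local_irq_disable() ends up costing there:

/*
 * The per-CPU LRU only needs protecting against invalidate_bh_lrus(),
 * which is done by disabling interrupts (or preemption on !SMP).
 */
#ifdef CONFIG_SMP
#define bh_lru_lock()   local_irq_disable()
#define bh_lru_unlock() local_irq_enable()
#else
#define bh_lru_lock()   preempt_disable()
#define bh_lru_unlock() preempt_enable()
#endif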

-Dave.
--
Dave Chinner
[email protected]

2023-06-27 20:18:26

by Marcelo Tosatti

Subject: Re: [PATCH] fs/buffer.c: remove per-CPU buffer_head lookup cache

On Mon, Jun 26, 2023 at 07:47:42PM +0100, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 03:04:53PM -0300, Marcelo Tosatti wrote:
> > Upon closer investigation, it was found that in the current codebase,
> > lookup_bh_lru is slower than __find_get_block_slow:
> >
> > 114 ns per __find_get_block
> > 68 ns per __find_get_block_slow
> >
> > So remove the per-CPU buffer_head caching.
>
> LOL. That's amazing. I can't even see why it's so expensive. The
> local_irq_disable(), perhaps? Your test case is the best possible
> one for lookup_bh_lru() where you're not even doing the copy.

Oops, that was due to an incorrect buffer size being passed to the lookup
versus the size that was installed.

About 15 ns is due to IRQ disablement, and another 6 ns to checking that
preemption is disabled (from __this_cpu_read).

So the actual numbers for the single block lookup are
(searching for the same block number repeatedly):

42 ns per __find_get_block
68 ns per __find_get_block_slow

And the cost increases linearly as the test increases the number of blocks
being searched for:

For example, mod=3 means the loop cycles through (see the loop sketch below):

__find_get_block(blocknr=1)
__find_get_block(blocknr=2)
__find_get_block(blocknr=3)
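
In loop form, the test presumably looks something like this (a hypothetical
reconstruction based on the snippet quoted earlier; the iteration count and
the put_bh() are assumptions):

for (i = 0; i < NR_ITER; i++) {
        bh = __find_get_block(bdev, (i % mod) + 1, 512);
        if (bh)
                put_bh(bh);
}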

41 ns per __find_get_block mod=1
41 ns per __find_get_block mod=2
42 ns per __find_get_block mod=3
43 ns per __find_get_block mod=4
45 ns per __find_get_block mod=5
48 ns per __find_get_block mod=6
48 ns per __find_get_block mod=7
49 ns per __find_get_block mod=8
51 ns per __find_get_block mod=9
52 ns per __find_get_block mod=10
54 ns per __find_get_block mod=11
56 ns per __find_get_block mod=12
58 ns per __find_get_block mod=13
60 ns per __find_get_block mod=14
61 ns per __find_get_block mod=15
63 ns per __find_get_block mod=16
138 ns per __find_get_block mod=17
138 ns per __find_get_block mod=18
138 ns per __find_get_block mod=19   <-- results from first patch, when lookup_bh_lru is a miss
70 ns per __find_get_block_slow mod=1
71 ns per __find_get_block_slow mod=2
71 ns per __find_get_block_slow mod=3
71 ns per __find_get_block_slow mod=4
71 ns per __find_get_block_slow mod=5
72 ns per __find_get_block_slow mod=6
71 ns per __find_get_block_slow mod=7
72 ns per __find_get_block_slow mod=8
71 ns per __find_get_block_slow mod=9
71 ns per __find_get_block_slow mod=10
71 ns per __find_get_block_slow mod=11
71 ns per __find_get_block_slow mod=12
71 ns per __find_get_block_slow mod=13
71 ns per __find_get_block_slow mod=14
71 ns per __find_get_block_slow mod=15
71 ns per __find_get_block_slow mod=16
71 ns per __find_get_block_slow mod=17
72 ns per __find_get_block_slow mod=18
72 ns per __find_get_block_slow mod=19

ls on home directory:
hits: 2 misses: 91

find on a linux-2.6 git tree:
hits: 25453 misses: 51084

make clean on a linux-2.6 git tree:
hits: 247615 misses: 32855

make on a linux-2.6 git tree:
hits: 1410414 misses: 166896

In more detail, where each bucket below indicates at which index into the
per-CPU lookup_bh_lru array the buffer was found:

idx:       1     2     3    4     5     6    7    8    9   10  11  12  13  14  15  16
hits: 139506 24299 21597 7462 15790 19108 6477 1349 1237 938 845 636 637 523 431 454
misses: 65773

So I think it makes more sense to just disable the cache for isolated
CPUs.
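
One possible shape for that (a hypothetical sketch, not the actual
follow-up patch): bail out of the per-CPU LRU early when running on an
isolated CPU, so both lookup and install fall through to the page cache:

/* In lookup_bh_lru(), right after bh_lru_lock(): */
if (cpu_is_isolated(smp_processor_id())) {
        bh_lru_unlock();
        return NULL;
}

/* And similarly in bh_lru_install(), so isolated CPUs never populate
 * the per-CPU array in the first place: */
if (lru_cache_disabled() || cpu_is_isolated(smp_processor_id())) {
        bh_lru_unlock();
        return;
}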

> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

Thanks.