LinuxLists.cc - Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per

2023-09-12 20:58:26

Subject: Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression

On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> > Well that's the problem. Since I can't run the reproducer, there's
> > nothing I can do to troubleshoot the problem myself.
>
> We dug more into the perf and other profiling data from 0Day server
> running this case, and it seems that the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.
>
> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>
> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> ---------------- ---------------------------
>
> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs

I can't find the benchmark source, but my suspicion is that this
creates and deletes a lot of files in a directory. The 'stable
directory offsets' series uses xa_alloc_cyclic(), so we'll end up
with a very sparse radix tree. ie it'll look something like this:

0 - "."
1 - ".."
6 - "d"
27 - "y"
4000 - "fzz"
65537 - "czzz"
643289767 - "bzzzzzz"

(i didn't work out the names precisely here, but this is approximately
what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
almost all of them)

The radix tree does not handle this well. It'll allocate one node for:

entries 0-63 (covers the first 4 entries)
entries 0-4095
entries 3968-4031 (the first 5)
entries 0-262143
entries 65536-69631
entries 65536-65599 (the first 6)
entries 0-16777215
entries 0-1073741823
entries 637534208-654311423
entries 643039232-643301375
entries 643289088-643293183
entries 643289728-643289791 (all 7)

That ends up being 12 nodes (you get 7 nodes per page) to store 7
pointers. Admittedly to get here, you have to do 643289765 creations
and nearly as many deletions, so are we going to see it in a
non-benchmark situation?

The maple tree is more resilient against this kind of shenanigan, but
we're not there in terms of supporting the kind of allocation you
want. For this kind of allocation pattern, you'd get all 7 pointers
in a single 256-byte node.

2023-09-14 10:10:27

by Chuck Lever III

[permalink] [raw]

Subject: Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression

> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers.

I'm able to run the reproducer Feng provided. simple_offset_add()
nearly doubles the cost of shmem_mknod() thanks to the memory
allocations done in xas_create().

However, tmpfs is already fast compared to persistent filesystems.
For instance, even with the simple_offset patch applied:

tmpfs: 158079.00 Directory Searches/second
btrfs: 64978.88 Directory Searches/second

> Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?

Most directories in a tmpfs have a limited lifespan and thus are
unlikely to live long enough to be affected by this issue. The
only one that has a rather unlimited lifespan is the root
directory.

It's hard for me to tell whether this is a pervasive problem
or one we can live with until we find a more suitable data
structure. IMO the benefit of having stable directory offsets
far outweighs the eventual slow down in the root directory.

--
Chuck Lever

2024-01-04 19:36:34

by Chuck Lever III

[permalink] [raw]

Subject: Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression

> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>> Well that's the problem. Since I can't run the reproducer, there's
>>> nothing I can do to troubleshoot the problem myself.
>>
>> We dug more into the perf and other profiling data from 0Day server
>> running this case, and it seems that the new simple_offset_add()
>> called by shmem_mknod() brings extra cost related with slab,
>> specifically the 'radix_tree_node', which cause the regression.
>>
>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>
>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>> ---------------- ---------------------------
>>
>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>
> I can't find the benchmark source, but my suspicion is that this
> creates and deletes a lot of files in a directory. The 'stable
> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> with a very sparse radix tree. ie it'll look something like this:
>
> 0 - "."
> 1 - ".."
> 6 - "d"
> 27 - "y"
> 4000 - "fzz"
> 65537 - "czzz"
> 643289767 - "bzzzzzz"
>
> (i didn't work out the names precisely here, but this is approximately
> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> almost all of them)
>
> The radix tree does not handle this well. It'll allocate one node for:
>
> entries 0-63 (covers the first 4 entries)
> entries 0-4095
> entries 3968-4031 (the first 5)
> entries 0-262143
> entries 65536-69631
> entries 65536-65599 (the first 6)
> entries 0-16777215
> entries 0-1073741823
> entries 637534208-654311423
> entries 643039232-643301375
> entries 643289088-643293183
> entries 643289728-643289791 (all 7)
>
> That ends up being 12 nodes (you get 7 nodes per page) to store 7
> pointers. Admittedly to get here, you have to do 643289765 creations
> and nearly as many deletions, so are we going to see it in a
> non-benchmark situation?
>
> The maple tree is more resilient against this kind of shenanigan, but
> we're not there in terms of supporting the kind of allocation you
> want. For this kind of allocation pattern, you'd get all 7 pointers
> in a single 256-byte node.

Hello Matthew, it's been a couple of kernel releases, so
following up.

Is Maple tree ready for libfs to use it for managing directory
offsets?

Should we just go for broke and convert libfs from xarray to
Maple tree now?

--
Chuck Lever

2024-01-05 16:27:48

by Liam R. Howlett

[permalink] [raw]

Subject: Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression

* Chuck Lever III <[email protected]> [240104 14:33]:
>
>
> > On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <[email protected]> wrote:
> >
> > On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
> >>> Well that's the problem. Since I can't run the reproducer, there's
> >>> nothing I can do to troubleshoot the problem myself.
> >>
> >> We dug more into the perf and other profiling data from 0Day server
> >> running this case, and it seems that the new simple_offset_add()
> >> called by shmem_mknod() brings extra cost related with slab,
> >> specifically the 'radix_tree_node', which cause the regression.
> >>
> >> Here is some slabinfo diff for commit a2e459555c5f and its parent:
> >>
> >> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
> >> ---------------- ---------------------------
> >>
> >> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
> >> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
> >> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
> >
> > I can't find the benchmark source, but my suspicion is that this
> > creates and deletes a lot of files in a directory. The 'stable
> > directory offsets' series uses xa_alloc_cyclic(), so we'll end up
> > with a very sparse radix tree. ie it'll look something like this:
> >
> > 0 - "."
> > 1 - ".."
> > 6 - "d"
> > 27 - "y"
> > 4000 - "fzz"
> > 65537 - "czzz"
> > 643289767 - "bzzzzzz"
> >
> > (i didn't work out the names precisely here, but this is approximately
> > what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
> > almost all of them)
> >
> > The radix tree does not handle this well. It'll allocate one node for:
> >
> > entries 0-63 (covers the first 4 entries)
> > entries 0-4095
> > entries 3968-4031 (the first 5)
> > entries 0-262143
> > entries 65536-69631
> > entries 65536-65599 (the first 6)
> > entries 0-16777215
> > entries 0-1073741823
> > entries 637534208-654311423
> > entries 643039232-643301375
> > entries 643289088-643293183
> > entries 643289728-643289791 (all 7)
> >
> > That ends up being 12 nodes (you get 7 nodes per page) to store 7
> > pointers. Admittedly to get here, you have to do 643289765 creations
> > and nearly as many deletions, so are we going to see it in a
> > non-benchmark situation?
> >
> > The maple tree is more resilient against this kind of shenanigan, but
> > we're not there in terms of supporting the kind of allocation you
> > want. For this kind of allocation pattern, you'd get all 7 pointers
> > in a single 256-byte node.
>
> Hello Matthew, it's been a couple of kernel releases, so
> following up.
>
> Is Maple tree ready for libfs to use it for managing directory
> offsets?

The feature you are looking for is dense nodes. It will allow for
a compact tree when you have a number of single indexes mapping to
entries (ideal for many ranges of 1).

I'm actively working on dense nodes, which will yield 31 entries per
node when they are single index mappings. I'm hoping to have it
completed in the next few weeks and start beating it up with tests
before pushing it out.

>
> Should we just go for broke and convert libfs from xarray to
> Maple tree now?

We are trying to keep the API close for both the xarray and the maple
tree, so if you do the conversion to one then switching shouldn't be
much work. I'd try the maple tree to see if the performance is
acceptable today (I may be biased), but I don't know how big of an
effort this conversion would entail.

The maple tree will compress the NULL indexes to a single entry of NULL.
My main concern is the density of information and the number of
allocations the tree will do to keep up with the workload - both will
improve with the dense nodes feature.

If you convert to maple tree, you will get the update for free later as
the node type the tree chooses will be transparent to users.

If you need tagging then you should use the xarray as I haven't started
that feature yet - but I don't think you need that?

I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
call into the xarray. The maple tree counterpart isn't used much today
and may need to be optimised. If you can verify what this test does, we
could produce a test case for the maple tree test suite and optimise if
necessary.

Let us know if you have any other questions or need some pointers on how
to get started with a conversion.

Thanks,
Liam

2024-01-05 16:33:55

by Chuck Lever III

[permalink] [raw]

Subject: Re: [linus:master] [shmem] a2e459555c: aim9.disk_src.ops_per_sec -19.0% regression

> On Jan 5, 2024, at 11:27 AM, Liam Howlett <[email protected]> wrote:
>
> * Chuck Lever III <[email protected]> [240104 14:33]:
>>
>>
>>> On Sep 12, 2023, at 12:01 PM, Matthew Wilcox <[email protected]> wrote:
>>>
>>> On Tue, Sep 12, 2023 at 11:14:42PM +0800, Feng Tang wrote:
>>>>> Well that's the problem. Since I can't run the reproducer, there's
>>>>> nothing I can do to troubleshoot the problem myself.
>>>>
>>>> We dug more into the perf and other profiling data from 0Day server
>>>> running this case, and it seems that the new simple_offset_add()
>>>> called by shmem_mknod() brings extra cost related with slab,
>>>> specifically the 'radix_tree_node', which cause the regression.
>>>>
>>>> Here is some slabinfo diff for commit a2e459555c5f and its parent:
>>>>
>>>> 23a31d87645c6527 a2e459555c5f9da3e619b7e47a6
>>>> ---------------- ---------------------------
>>>>
>>>> 26363 +40.2% 36956 slabinfo.radix_tree_node.active_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.active_slabs
>>>> 26363 +40.3% 37001 slabinfo.radix_tree_node.num_objs
>>>> 941.00 +40.4% 1321 slabinfo.radix_tree_node.num_slabs
>>>
>>> I can't find the benchmark source, but my suspicion is that this
>>> creates and deletes a lot of files in a directory. The 'stable
>>> directory offsets' series uses xa_alloc_cyclic(), so we'll end up
>>> with a very sparse radix tree. ie it'll look something like this:
>>>
>>> 0 - "."
>>> 1 - ".."
>>> 6 - "d"
>>> 27 - "y"
>>> 4000 - "fzz"
>>> 65537 - "czzz"
>>> 643289767 - "bzzzzzz"
>>>
>>> (i didn't work out the names precisely here, but this is approximately
>>> what you'd get if you create files a-z, aa-zz, aaa-zzz, etc and delete
>>> almost all of them)
>>>
>>> The radix tree does not handle this well. It'll allocate one node for:
>>>
>>> entries 0-63 (covers the first 4 entries)
>>> entries 0-4095
>>> entries 3968-4031 (the first 5)
>>> entries 0-262143
>>> entries 65536-69631
>>> entries 65536-65599 (the first 6)
>>> entries 0-16777215
>>> entries 0-1073741823
>>> entries 637534208-654311423
>>> entries 643039232-643301375
>>> entries 643289088-643293183
>>> entries 643289728-643289791 (all 7)
>>>
>>> That ends up being 12 nodes (you get 7 nodes per page) to store 7
>>> pointers. Admittedly to get here, you have to do 643289765 creations
>>> and nearly as many deletions, so are we going to see it in a
>>> non-benchmark situation?
>>>
>>> The maple tree is more resilient against this kind of shenanigan, but
>>> we're not there in terms of supporting the kind of allocation you
>>> want. For this kind of allocation pattern, you'd get all 7 pointers
>>> in a single 256-byte node.
>>
>> Hello Matthew, it's been a couple of kernel releases, so
>> following up.
>>
>> Is Maple tree ready for libfs to use it for managing directory
>> offsets?
>
> The feature you are looking for is dense nodes. It will allow for
> a compact tree when you have a number of single indexes mapping to
> entries (ideal for many ranges of 1).
>
> I'm actively working on dense nodes, which will yield 31 entries per
> node when they are single index mappings. I'm hoping to have it
> completed in the next few weeks and start beating it up with tests
> before pushing it out.
>
>>
>> Should we just go for broke and convert libfs from xarray to
>> Maple tree now?
>
> We are trying to keep the API close for both the xarray and the maple
> tree, so if you do the conversion to one then switching shouldn't be
> much work. I'd try the maple tree to see if the performance is
> acceptable today (I may be biased), but I don't know how big of an
> effort this conversion would entail.
>
> The maple tree will compress the NULL indexes to a single entry of NULL.
> My main concern is the density of information and the number of
> allocations the tree will do to keep up with the workload - both will
> improve with the dense nodes feature.
>
> If you convert to maple tree, you will get the update for free later as
> the node type the tree chooses will be transparent to users.
>
> If you need tagging then you should use the xarray as I haven't started
> that feature yet - but I don't think you need that?

I don't recall using xarray tags for directory offset mapping.

> I also noticed that Matthew mentioned xa_alloc_cyclic() as the potential
> call into the xarray. The maple tree counterpart isn't used much today
> and may need to be optimised. If you can verify what this test does, we
> could produce a test case for the maple tree test suite and optimise if
> necessary.
>
> Let us know if you have any other questions or need some pointers on how
> to get started with a conversion.

Sounds like conversion is worth starting on, at least. I'll try
to clear some time to work on it.

--
Chuck Lever