2020-01-09 21:27:04

by Scott Cheloha

Subject: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache the memory blocks in
a radix tree. With a radix tree cache in place both memory subsystem
initialization and memory hotplug run palpably faster on systems with a
large number of memory blocks.

Signed-off-by: Scott Cheloha <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Nathan Lynch <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
(Sorry for the duplicate mail, forgot to mark this as v4)

v2 incorporates suggestions from David Hildenbrand.

v3 changes:
- Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

- Be conservative: don't use radix_tree_for_each_slot() in
walk_memory_blocks() yet. It introduces RCU, which could
change behavior (see the illustrative sketch after this
changelog). Walking the tree "by hand" with
find_memory_block_by_id() is slower but keeps the patch
simple.

v4 changes:
- Rewrite commit message to explicitly note the time
complexity improvements.

- Provide anecdotal accounts of time-savings in the changelog
(see below).
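
To illustrate the deferred approach from the v3 notes above, here is
a rough sketch of what a direct tree walk could look like. It is not
part of the patch; the function name and loop structure are purely
illustrative, and only radix_tree_for_each_slot(), phys_to_block_id(),
and walk_memory_blocks_func_t are existing interfaces. The
rcu_read_lock() section is exactly the behavioral change being avoided
for now, because func() may currently sleep:

/*
 * Illustrative sketch only: walking the radix tree cache directly.
 * Dereferencing slots requires RCU protection, so func() would run
 * under rcu_read_lock() and could no longer sleep.
 */
static int walk_memory_blocks_sketch(unsigned long start,
                                     unsigned long size, void *arg,
                                     walk_memory_blocks_func_t func)
{
        const unsigned long start_block_id = phys_to_block_id(start);
        const unsigned long end_block_id = phys_to_block_id(start + size - 1);
        struct radix_tree_iter iter;
        void __rcu **slot;
        int ret = 0;

        rcu_read_lock();
        radix_tree_for_each_slot(slot, &memory_blocks, &iter, start_block_id) {
                struct memory_block *mem = radix_tree_deref_slot(slot);

                if (iter.index > end_block_id)
                        break;
                if (!mem)
                        continue;
                ret = func(mem, arg);
                if (ret)
                        break;
        }
        rcu_read_unlock();
        return ret;
}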

[email protected] has asked for additional details on time
savings, so here are some results I've collected when measuring
memory_dev_init() with/without the patch.

1. A 32GB POWER9 VM with 16MB memblocks has 2048 blocks:

# Unpatched
[ 0.005121] adding memory block 0... ok
[...]
[ 0.095230] adding memory block 1024... ok
[...]
[ 0.304248] adding memory block 2047... ok
[ 0.304508] added all memory blocks

# Patched
[ 0.004701] adding memory block 0... ok
[...]
[ 0.033383] adding memory block 1024... ok
[...]
[ 0.061387] adding memory block 2047... ok
[ 0.061414] added all memory blocks

Unpatched, memory_dev_init() runs in about 0.299 seconds. Patched,
it runs in about 0.057 seconds. Savings of 0.242 seconds, or nearly
a quarter of a second.

2. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks:

# Unpatched
[ 13.703907] memory_dev_init: adding blocks
[ 13.703931] memory_dev_init: added block 0
[ 13.762678] memory_dev_init: added block 1024
[ 13.910359] memory_dev_init: added block 2048
[ 14.146941] memory_dev_init: added block 3072
[...]
[ 218.516235] memory_dev_init: added block 57344
[ 229.310467] memory_dev_init: added block 58368
[ 240.590857] memory_dev_init: added block 59392
[ 252.351665] memory_dev_init: added block 60416
[...]
[ 2152.023248] memory_dev_init: added block 128000
[ 2196.464430] memory_dev_init: added block 129024
[ 2241.746515] memory_dev_init: added block 130048
[ 2287.406099] memory_dev_init: added all blocks

# Patched
[ 13.696898] memory_dev_init: adding blocks
[ 13.696920] memory_dev_init: added block 0
[ 13.710966] memory_dev_init: added block 1024
[ 13.724865] memory_dev_init: added block 2048
[ 13.738802] memory_dev_init: added block 3072
[...]
[ 14.520999] memory_dev_init: added block 57344
[ 14.536355] memory_dev_init: added block 58368
[ 14.551747] memory_dev_init: added block 59392
[ 14.567128] memory_dev_init: added block 60416
[...]
[ 15.595638] memory_dev_init: added block 126976
[ 15.611761] memory_dev_init: added block 128000
[ 15.627889] memory_dev_init: added block 129024
[ 15.644048] memory_dev_init: added block 130048
[ 15.660035] memory_dev_init: added all blocks

Unpatched, memory_dev_init() runs in about 2275 seconds,
or ~37 minutes. Patched, memory_dev_init() runs in about
1.97 seconds. Savings of ~37 minutes.

I did not actually measure walk_memory_blocks(), but during
boot on this machine without the patch I got the following
(abbreviated) traces:

[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

Those traces disappeared with the patch, so I'm pretty sure
this patch shaves ~30 minutes off walk_memory_blocks()
at boot.
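
For context, walk_memory_blocks() calls find_memory_block_by_id()
once for every block id in the requested range, so with the old
linear device-list search the walk is roughly O(n^2) over all
blocks. A condensed paraphrase of that loop (trimmed, not verbatim):

        for (block_id = start_block_id; block_id <= end_block_id; block_id++) {
                mem = find_memory_block_by_id(block_id); /* O(n) before this patch */
                if (!mem)
                        continue;
                ret = func(mem, arg);
                put_device(&mem->dev);
                if (ret)
                        break;
        }

With the radix tree cache each of those lookups drops to O(log n),
which is consistent with the traces above disappearing.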

Given the above results I think it is safe to say that this patch will
dramatically improve boot times on large POWER systems.

drivers/base/memory.c | 36 +++++++++++++++++++++++-------------
1 file changed, 23 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8902930d5ef2 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,6 +19,7 @@
#include <linux/memory.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>
+#include <linux/radix-tree.h>
#include <linux/stat.h>
#include <linux/slab.h>

@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
.offline = memory_subsys_offline,
};

+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static RADIX_TREE(memory_blocks, GFP_KERNEL);
+
static BLOCKING_NOTIFIER_HEAD(memory_chain);

int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
/* A reference for the returned memory block device is acquired. */
static struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
- struct device *dev;
+ struct memory_block *mem;

- dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
- return dev ? to_memory_block(dev) : NULL;
+ mem = radix_tree_lookup(&memory_blocks, block_id);
+ if (mem)
+ get_device(&mem->dev);
+ return mem;
}

-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
struct memory_block *find_memory_block(struct mem_section *section)
{
unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,15 @@ int register_memory(struct memory_block *memory)
memory->dev.offline = memory->state == MEM_OFFLINE;

ret = device_register(&memory->dev);
- if (ret)
+ if (ret) {
put_device(&memory->dev);
-
+ return ret;
+ }
+ ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+ if (ret) {
+ put_device(&memory->dev);
+ device_unregister(&memory->dev);
+ }
return ret;
}

@@ -688,6 +696,8 @@ static void unregister_memory(struct memory_block *memory)
if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
return;

+ WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+
/* drop the ref. we got via find_memory_block() */
put_device(&memory->dev);
device_unregister(&memory->dev);
--
2.24.1


2020-01-09 22:01:26

by Andrew Morton

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:

> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
>
> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> a radix tree. With a radix tree cache in place both memory subsystem
> initialization and memory hotplug run palpably faster on systems with a
> large number of memory blocks.
>
> ...
>
> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> .offline = memory_subsys_offline,
> };
>
> +/*
> + * Memory blocks are cached in a local radix tree to avoid
> + * a costly linear search for the corresponding device on
> + * the subsystem bus.
> + */
> +static RADIX_TREE(memory_blocks, GFP_KERNEL);

What protects this tree from racy accesses?

2020-01-09 22:20:22

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup



> On 09.01.2020 at 23:00, Andrew Morton <[email protected]> wrote:
>
> On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:
>
>> Searching for a particular memory block by id is an O(n) operation
>> because each memory block's underlying device is kept in an unsorted
>> linked list on the subsystem bus.
>>
>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>> a radix tree. With a radix tree cache in place both memory subsystem
>> initialization and memory hotplug run palpably faster on systems with a
>> large number of memory blocks.
>>
>> ...
>>
>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>> .offline = memory_subsys_offline,
>> };
>>
>> +/*
>> + * Memory blocks are cached in a local radix tree to avoid
>> + * a costly linear search for the corresponding device on
>> + * the subsystem bus.
>> + */
>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
>
> What protects this tree from racy accesses?

I think the device hotplug lock currently (except during boot where no races can happen).

2020-01-09 22:29:11

by Andrew Morton

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <[email protected]> wrote:

>
>
> > On 09.01.2020 at 23:00, Andrew Morton <[email protected]> wrote:
> >
> > On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:
> >
> >> Searching for a particular memory block by id is an O(n) operation
> >> because each memory block's underlying device is kept in an unsorted
> >> linked list on the subsystem bus.
> >>
> >> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> >> a radix tree. With a radix tree cache in place both memory subsystem
> >> initialization and memory hotplug run palpably faster on systems with a
> >> large number of memory blocks.
> >>
> >> ...
> >>
> >> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> >> .offline = memory_subsys_offline,
> >> };
> >>
> >> +/*
> >> + * Memory blocks are cached in a local radix tree to avoid
> >> + * a costly linear search for the corresponding device on
> >> + * the subsystem bus.
> >> + */
> >> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> >
> > What protects this tree from racy accesses?
>
> I think the device hotplug lock currently (except during boot where no races can happen).
>

So this?

--- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
+++ a/drivers/base/memory.c
@@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
* Memory blocks are cached in a local radix tree to avoid
* a costly linear search for the corresponding device on
* the subsystem bus.
+ *
+ * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
+ * single-threadedness at boot time.
*/
static RADIX_TREE(memory_blocks, GFP_KERNEL);


But are we sure this is all true?

2020-01-09 22:36:56

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

> On 09.01.2020 at 23:28, Andrew Morton <[email protected]> wrote:
>
> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <[email protected]> wrote:
>
>>
>>
>>>> On 09.01.2020 at 23:00, Andrew Morton <[email protected]> wrote:
>>>
>>> On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:
>>>
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>>
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree. With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>>
>>>> ...
>>>>
>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>>>> .offline = memory_subsys_offline,
>>>> };
>>>>
>>>> +/*
>>>> + * Memory blocks are cached in a local radix tree to avoid
>>>> + * a costly linear search for the corresponding device on
>>>> + * the subsystem bus.
>>>> + */
>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>>
>>> What protects this tree from racy accesses?
>>
>> I think the device hotplug lock currently (except during boot where no races can happen).
>>
>
> So this?
>
> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
> +++ a/drivers/base/memory.c
> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
> * Memory blocks are cached in a local radix tree to avoid
> * a costly linear search for the corresponding device on
> * the subsystem bus.
> + *
> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
> + * single-threadedness at boot time.
> */
> static RADIX_TREE(memory_blocks, GFP_KERNEL);
>
>
> But are we sure this is all true?

I think the device hotplug lock, not the memory hotplug lock. Will double check later.
>

2020-01-10 09:33:37

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 09.01.20 23:35, David Hildenbrand wrote:
>> On 09.01.2020 at 23:28, Andrew Morton <[email protected]> wrote:
>>
>> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <[email protected]> wrote:
>>
>>>
>>>
>>>>> On 09.01.2020 at 23:00, Andrew Morton <[email protected]> wrote:
>>>>
>>>> On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:
>>>>
>>>>> Searching for a particular memory block by id is an O(n) operation
>>>>> because each memory block's underlying device is kept in an unsorted
>>>>> linked list on the subsystem bus.
>>>>>
>>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>>> a radix tree. With a radix tree cache in place both memory subsystem
>>>>> initialization and memory hotplug run palpably faster on systems with a
>>>>> large number of memory blocks.
>>>>>
>>>>> ...
>>>>>
>>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
>>>>> .offline = memory_subsys_offline,
>>>>> };
>>>>>
>>>>> +/*
>>>>> + * Memory blocks are cached in a local radix tree to avoid
>>>>> + * a costly linear search for the corresponding device on
>>>>> + * the subsystem bus.
>>>>> + */
>>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>>>
>>>> What protects this tree from racy accesses?
>>>
>>> I think the device hotplug lock currently (except during boot where no races can happen).
>>>
>>
>> So this?
>>
>> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
>> +++ a/drivers/base/memory.c
>> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
>> * Memory blocks are cached in a local radix tree to avoid
>> * a costly linear search for the corresponding device on
>> * the subsystem bus.
>> + *
>> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
>> + * single-threadedness at boot time.
>> */
>> static RADIX_TREE(memory_blocks, GFP_KERNEL);
>>
>>
>> But are we sure this is all true?
>
> I think the device hotplug lock, not the memory hotplug lock. Will double check later.

So all writers either hold the device_hotplug_lock or run during boot.
Documented e.g., for memory_dev_init(), create_memory_block_devices(),
remove_memory_block_devices().

The readers are mainly
- find_memory_block()
-> called via online_pages()/offline_pages() where we hold the
device_hotplug_lock
-> called from arch/powerpc/platforms/pseries/hotplug-memory.c,
where we always hold the device_hotplug_lock
- walk_memory_blocks()
-> Callers in drivers/acpi/acpi_memhotplug.c either hold the
device_hotplug_lock or are called early during boot
-> Callers in mm/memory_hotplug.c either hold the
device_hotplug_lock or are called early during boot
-> link_mem_sections() is called either early during boot or via
add_memory_resource() (whereby all callers either hold the
device_hotplug_lock or are called early during boot)
-> Callers in arch/powerpc/platforms/powernv/memtrace.c hold the
device_hotplug_lock

So we are fine.
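
To make the expected usage concrete, a minimal (hypothetical) reader
could look like the sketch below; count_block() and
count_blocks_in_range() are made-up names, while walk_memory_blocks()
and lock_device_hotplug()/unlock_device_hotplug() are the existing
interfaces:

static int count_block(struct memory_block *mem, void *arg)
{
        (*(unsigned long *)arg)++;
        return 0;
}

static unsigned long count_blocks_in_range(unsigned long start,
                                           unsigned long size)
{
        unsigned long nr = 0;

        lock_device_hotplug();
        walk_memory_blocks(start, size, &nr, count_block);
        unlock_device_hotplug();
        return nr;
}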

I suggest we document that expected behavior via

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..8c8dc081597e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -585,6 +585,8 @@ static struct memory_block
*find_memory_block_by_id(unsigned long block_id)
* tree or something here.
*
* This could be made generic for all device subsystems.
+ *
+ * Called under device_hotplug_lock.
*/
struct memory_block *find_memory_block(struct mem_section *section)
{
@@ -837,6 +839,8 @@ void __init memory_dev_init(void)
*
* In case func() returns an error, walking is aborted and the error is
* returned.
+ *
+ * Called under device_hotplug_lock.
*/
int walk_memory_blocks(unsigned long start, unsigned long size,
void *arg, walk_memory_blocks_func_t func)


Please note that the memory hotplug lock is not safe on the reader side.
But also not on the writer side after
https://lkml.kernel.org/r/157863061737.2230556.3959730620803366776.stgit@dwillia2-desk3.amr.corp.intel.com


--
Thanks,

David / dhildenb

2020-01-10 11:32:59

by Michal Hocko

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Fri 10-01-20 10:32:01, David Hildenbrand wrote:
> On 09.01.20 23:35, David Hildenbrand wrote:
> >> On 09.01.2020 at 23:28, Andrew Morton <[email protected]> wrote:
> >>
> >> On Thu, 9 Jan 2020 23:17:09 +0100 David Hildenbrand <[email protected]> wrote:
> >>
> >>>
> >>>
> >>>>> On 09.01.2020 at 23:00, Andrew Morton <[email protected]> wrote:
> >>>>
> >>>> On Thu, 9 Jan 2020 15:25:16 -0600 Scott Cheloha <[email protected]> wrote:
> >>>>
> >>>>> Searching for a particular memory block by id is an O(n) operation
> >>>>> because each memory block's underlying device is kept in an unsorted
> >>>>> linked list on the subsystem bus.
> >>>>>
> >>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> >>>>> a radix tree. With a radix tree cache in place both memory subsystem
> >>>>> initialization and memory hotplug run palpably faster on systems with a
> >>>>> large number of memory blocks.
> >>>>>
> >>>>> ...
> >>>>>
> >>>>> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> >>>>> .offline = memory_subsys_offline,
> >>>>> };
> >>>>>
> >>>>> +/*
> >>>>> + * Memory blocks are cached in a local radix tree to avoid
> >>>>> + * a costly linear search for the corresponding device on
> >>>>> + * the subsystem bus.
> >>>>> + */
> >>>>> +static RADIX_TREE(memory_blocks, GFP_KERNEL);
> >>>>
> >>>> What protects this tree from racy accesses?
> >>>
> >>> I think the device hotplug lock currently (except during boot where no races can happen).
> >>>
> >>
> >> So this?
> >>
> >> --- a/drivers/base/memory.c~drivers-base-memoryc-cache-blocks-in-radix-tree-to-accelerate-lookup-fix
> >> +++ a/drivers/base/memory.c
> >> @@ -61,6 +61,9 @@ static struct bus_type memory_subsys = {
> >> * Memory blocks are cached in a local radix tree to avoid
> >> * a costly linear search for the corresponding device on
> >> * the subsystem bus.
> >> + *
> >> + * Protected by mem_hotplug_lock in mem_hotplug_begin(), and by the guaranteed
> >> + * single-threadedness at boot time.
> >> */
> >> static RADIX_TREE(memory_blocks, GFP_KERNEL);
> >>
> >>
> >> But are we sure this is all true?
> >
> > I think the device hotplug lock, not the memory hotplug lock. Will double check later.
>
> So all writers either hold the device_hotplug_lock or run during boot.
> Documented e.g., for memory_dev_init(), create_memory_block_devices(),
> remove_memory_block_devices().
>
> The readers are mainly
> - find_memory_block()
> -> called via online_pages()/offline_pages() where we hold the
> device_hotplug_lock
> -> called from arch/powerpc/platforms/pseries/hotplug-memory.c,
> where we always hold the device_hotplug_lock
> - walk_memory_blocks()
> -> Callers in drivers/acpi/acpi_memhotplug.c either hold the
> device_hotplug_lock or are called early during boot
> -> Callers in mm/memory_hotplug.c either hold the
> device_hotplug_lock or are called early during boot
> -> link_mem_sections() is called either early during boot or via
> add_memory_resource() (whereby all callers either hold the
> device_hotplug_lock or are called early during boot)
> -> Callers in arch/powerpc/platforms/powernv/memtrace.c hold the
> device_hotplug_lock
>
> So we are fine.
>
> I suggest we document that expected behavior via

Thanks for documenting this! Adding a comment as you suggest makes
sense. Overall the locking expectations would be similar to
subsys_find_device_by_id(), which doesn't use any internal locking, so the
radix tree, which follows the lifetime of those objects, should be
compatible with the current implementation (so no new races, at least).

> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..8c8dc081597e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -585,6 +585,8 @@ static struct memory_block
> *find_memory_block_by_id(unsigned long block_id)
> * tree or something here.
> *
> * This could be made generic for all device subsystems.
> + *
> + * Called under device_hotplug_lock.
> */
> struct memory_block *find_memory_block(struct mem_section *section)
> {
> @@ -837,6 +839,8 @@ void __init memory_dev_init(void)
> *
> * In case func() returns an error, walking is aborted and the error is
> * returned.
> + *
> + * Called under device_hotplug_lock.
> */
> int walk_memory_blocks(unsigned long start, unsigned long size,
> void *arg, walk_memory_blocks_func_t func)
>
>
> Please note that the memory hotplug lock is not safe on the reader side.
> But also not on the writer side after
> https://lkml.kernel.org/r/157863061737.2230556.3959730620803366776.stgit@dwillia2-desk3.amr.corp.intel.com
>
>
> --
> Thanks,
>
> David / dhildenb

--
Michal Hocko
SUSE Labs

2020-01-15 19:11:09

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 09.01.20 22:25, Scott Cheloha wrote:
> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
>
> We can cut the lookup cost to O(log n) if we cache the memory blocks in
> a radix tree. With a radix tree cache in place both memory subsystem
> initialization and memory hotplug run palpably faster on systems with a
> large number of memory blocks.
>
> Signed-off-by: Scott Cheloha <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Acked-by: Nathan Lynch <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Soooo,

I just learned that radix trees are nowadays only a wrapper for xarray
(for quite a while already!), and that the xarray interface shall be
used in new code.

include/linux/radix-tree.h:

/* Keep unconverted code working */
#define radix_tree_root xarray
[...]
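
For reference, the conversion should be close to a line-for-line
substitution. A rough sketch of the call mapping, assuming we keep
GFP_KERNEL for the allocating store:

        mem = radix_tree_lookup(&memory_blocks, block_id);           /* old */
        mem = xa_load(&memory_blocks, block_id);                     /* new */

        ret = radix_tree_insert(&memory_blocks, id, mem);            /* old */
        ret = xa_err(xa_store(&memory_blocks, id, mem, GFP_KERNEL)); /* new */

        radix_tree_delete(&memory_blocks, id);                       /* old */
        xa_erase(&memory_blocks, id);                                /* new */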

Do we want to convert this code before sending it to Linus' tree? (resend
this patch, or a fixup on top)

--
Thanks,

David / dhildenb

2020-01-16 15:24:50

by Michal Hocko

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
> On 09.01.20 22:25, Scott Cheloha wrote:
> > Searching for a particular memory block by id is an O(n) operation
> > because each memory block's underlying device is kept in an unsorted
> > linked list on the subsystem bus.
> >
> > We can cut the lookup cost to O(log n) if we cache the memory blocks in
> > a radix tree. With a radix tree cache in place both memory subsystem
> > initialization and memory hotplug run palpably faster on systems with a
> > large number of memory blocks.
> >
> > Signed-off-by: Scott Cheloha <[email protected]>
> > Acked-by: David Hildenbrand <[email protected]>
> > Acked-by: Nathan Lynch <[email protected]>
> > Acked-by: Michal Hocko <[email protected]>
>
> Soooo,
>
> I just learned that radix trees are nowadays only a wrapper for xarray
> (for quite a while already!), and that the xarray interface shall be
> used in new code.

Good point. I somehow didn't realize this would add more work for a
later code refactoring. The mapping should be pretty straightforward.

Thanks for noticing!

--
Michal Hocko
SUSE Labs

2020-01-16 15:29:34

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 16.01.20 16:22, Michal Hocko wrote:
> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>> On 09.01.20 22:25, Scott Cheloha wrote:
>>> Searching for a particular memory block by id is an O(n) operation
>>> because each memory block's underlying device is kept in an unsorted
>>> linked list on the subsystem bus.
>>>
>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>> a radix tree. With a radix tree cache in place both memory subsystem
>>> initialization and memory hotplug run palpably faster on systems with a
>>> large number of memory blocks.
>>>
>>> Signed-off-by: Scott Cheloha <[email protected]>
>>> Acked-by: David Hildenbrand <[email protected]>
>>> Acked-by: Nathan Lynch <[email protected]>
>>> Acked-by: Michal Hocko <[email protected]>
>>
>> Soooo,
>>
>> I just learned that radix trees are nowadays only a wrapper for xarray
>> (for quite a while already!), and that the xarray interface shall be
>> used in new code.
>
> Good point. I somehow didn't realize this would add more work for a
> later code refactoring. The mapping should be pretty straightforward.

Yes it is. @Scott, care to send a fixup that does the mapping?

>
> Thanks for noticing!

Don noticed it, so thanks to him :)


--
Thanks,

David / dhildenb

2020-01-16 16:19:36

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 16.01.20 16:28, David Hildenbrand wrote:
> On 16.01.20 16:22, Michal Hocko wrote:
>> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>>> On 09.01.20 22:25, Scott Cheloha wrote:
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>>
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree. With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>>
>>>> Signed-off-by: Scott Cheloha <[email protected]>
>>>> Acked-by: David Hildenbrand <[email protected]>
>>>> Acked-by: Nathan Lynch <[email protected]>
>>>> Acked-by: Michal Hocko <[email protected]>
>>>
>>> Soooo,
>>>
>>> I just learned that radix trees are nowadays only a wrapper for xarray
>>> (for quite a while already!), and that the xarray interface shall be
>>> used in new code.
>>
>> Good point. I somehow didn't realize this would add more work for a
>> later code refactoring. The mapping should be pretty straightforward.
>
> Yes it is. @Scott, care to send a fixup that does the mapping?

Never having used an xarray, I gave it a quick shot. The following
should do the trick:


diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index c6d288fad493..c75dec35de43 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -19,7 +19,7 @@
#include <linux/memory.h>
#include <linux/memory_hotplug.h>
#include <linux/mm.h>
-#include <linux/radix-tree.h>
+#include <linux/xarray.h>
#include <linux/stat.h>
#include <linux/slab.h>

@@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
};

/*
- * Memory blocks are cached in a local radix tree to avoid
+ * Memory blocks are cached in a local xarray to avoid
* a costly linear search for the corresponding device on
* the subsystem bus.
*/
-static RADIX_TREE(memory_blocks, GFP_KERNEL);
+static DEFINE_XARRAY(memory_blocks);

static BLOCKING_NOTIFIER_HEAD(memory_chain);

@@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
struct memory_block *mem;

- mem = radix_tree_lookup(&memory_blocks, block_id);
+ mem = xa_load(&memory_blocks, block_id);
if (mem)
get_device(&mem->dev);
return mem;
@@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
put_device(&memory->dev);
return ret;
}
- ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
+ ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+ GFP_KERNEL));
if (ret) {
put_device(&memory->dev);
device_unregister(&memory->dev);
@@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
return;

- WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
+ WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);

/* drop the ref. we got via find_memory_block() */
put_device(&memory->dev);


--
Thanks,

David / dhildenb

2020-01-16 20:05:13

by Donald Dutile

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 1/16/20 10:28 AM, David Hildenbrand wrote:
> On 16.01.20 16:22, Michal Hocko wrote:
>> On Wed 15-01-20 20:09:48, David Hildenbrand wrote:
>>> On 09.01.20 22:25, Scott Cheloha wrote:
>>>> Searching for a particular memory block by id is an O(n) operation
>>>> because each memory block's underlying device is kept in an unsorted
>>>> linked list on the subsystem bus.
>>>>
>>>> We can cut the lookup cost to O(log n) if we cache the memory blocks in
>>>> a radix tree. With a radix tree cache in place both memory subsystem
>>>> initialization and memory hotplug run palpably faster on systems with a
>>>> large number of memory blocks.
>>>>
>>>> Signed-off-by: Scott Cheloha <[email protected]>
>>>> Acked-by: David Hildenbrand <[email protected]>
>>>> Acked-by: Nathan Lynch <[email protected]>
>>>> Acked-by: Michal Hocko <[email protected]>
>>>
>>> Soooo,
>>>
>>> I just learned that radix trees are nowadays only a wrapper for xarray
>>> (for quite a while already!), and that the xarray interface shall be
>>> used in new code.
>>
>> Good point. I somehow didn't realize this would add more work for a
>> later code refactoring. The mapping should be pretty straightforward.
>
> Yes it is. @Scott, care to send a fixup that does the mapping?
>
>>
>> Thanks for noticing!
>
> Don noticed it, so thanks to him :)
>
Backporting XArray to RHEL-8, while continuing to support the radix-tree API in RHEL-8 due to RHEL rulz, was a challenge; it wasn't stable/bug-free everywhere, etc. Hence my 'sensitivity' to seeing new radix-tree code upstream.

>

2020-01-17 09:37:32

by Michal Hocko

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Thu 16-01-20 17:17:54, David Hildenbrand wrote:
[...]
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index c6d288fad493..c75dec35de43 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -19,7 +19,7 @@
> #include <linux/memory.h>
> #include <linux/memory_hotplug.h>
> #include <linux/mm.h>
> -#include <linux/radix-tree.h>
> +#include <linux/xarray.h>
> #include <linux/stat.h>
> #include <linux/slab.h>
>
> @@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
> };
>
> /*
> - * Memory blocks are cached in a local radix tree to avoid
> + * Memory blocks are cached in a local xarray to avoid
> * a costly linear search for the corresponding device on
> * the subsystem bus.
> */
> -static RADIX_TREE(memory_blocks, GFP_KERNEL);
> +static DEFINE_XARRAY(memory_blocks);
>
> static BLOCKING_NOTIFIER_HEAD(memory_chain);
>
> @@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> struct memory_block *mem;
>
> - mem = radix_tree_lookup(&memory_blocks, block_id);
> + mem = xa_load(&memory_blocks, block_id);
> if (mem)
> get_device(&mem->dev);
> return mem;
> @@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
> put_device(&memory->dev);
> return ret;
> }
> - ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
> + ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
> + GFP_KERNEL));
> if (ret) {
> put_device(&memory->dev);
> device_unregister(&memory->dev);
> @@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
> if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
> return;
>
> - WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
> + WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
>
> /* drop the ref. we got via find_memory_block() */
> put_device(&memory->dev);

OK, this looks sensible. xa_store shouldn't ever return an existing
device as we do the lookup beforehand, so good. We might need to
reorganize the code if we want to drop the lookup, though.
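
If we ever do drop that lookup, one option (just a sketch, not
something that has been posted) would be xa_insert(), which refuses
to replace a populated entry and returns -EBUSY, so register_memory()
could detect a duplicate id on its own:

        /* Sketch only: xa_insert() fails with -EBUSY if the id is already cached. */
        ret = xa_insert(&memory_blocks, memory->dev.id, memory, GFP_KERNEL);
        if (ret) {
                put_device(&memory->dev);
                device_unregister(&memory->dev);
        }
        return ret;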

Thanks!
--
Michal Hocko
SUSE Labs

2020-01-20 09:16:55

by David Hildenbrand

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On 17.01.20 10:35, Michal Hocko wrote:
> On Thu 16-01-20 17:17:54, David Hildenbrand wrote:
> [...]
>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>> index c6d288fad493..c75dec35de43 100644
>> --- a/drivers/base/memory.c
>> +++ b/drivers/base/memory.c
>> @@ -19,7 +19,7 @@
>> #include <linux/memory.h>
>> #include <linux/memory_hotplug.h>
>> #include <linux/mm.h>
>> -#include <linux/radix-tree.h>
>> +#include <linux/xarray.h>
>> #include <linux/stat.h>
>> #include <linux/slab.h>
>>
>> @@ -58,11 +58,11 @@ static struct bus_type memory_subsys = {
>> };
>>
>> /*
>> - * Memory blocks are cached in a local radix tree to avoid
>> + * Memory blocks are cached in a local xarray to avoid
>> * a costly linear search for the corresponding device on
>> * the subsystem bus.
>> */
>> -static RADIX_TREE(memory_blocks, GFP_KERNEL);
>> +static DEFINE_XARRAY(memory_blocks);
>>
>> static BLOCKING_NOTIFIER_HEAD(memory_chain);
>>
>> @@ -566,7 +566,7 @@ static struct memory_block *find_memory_block_by_id(unsigned long block_id)
>> {
>> struct memory_block *mem;
>>
>> - mem = radix_tree_lookup(&memory_blocks, block_id);
>> + mem = xa_load(&memory_blocks, block_id);
>> if (mem)
>> get_device(&mem->dev);
>> return mem;
>> @@ -621,7 +621,8 @@ int register_memory(struct memory_block *memory)
>> put_device(&memory->dev);
>> return ret;
>> }
>> - ret = radix_tree_insert(&memory_blocks, memory->dev.id, memory);
>> + ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
>> + GFP_KERNEL));
>> if (ret) {
>> put_device(&memory->dev);
>> device_unregister(&memory->dev);
>> @@ -683,7 +684,7 @@ static void unregister_memory(struct memory_block *memory)
>> if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
>> return;
>>
>> - WARN_ON(radix_tree_delete(&memory_blocks, memory->dev.id) == NULL);
>> + WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
>>
>> /* drop the ref. we got via find_memory_block() */
>> put_device(&memory->dev);
>
> OK, this looks sensible. xa_store shouldn't ever return an existing
> device as we do the lookup beforehand, so good. We might need to
> reorganize the code if we want to drop the lookup, though.

Ping Scott. Will you resend a cleanup like this as a proper patch or
shall I?

--
Thanks,

David / dhildenb

2020-01-21 12:31:28

by Michal Hocko

Subject: Re: [PATCH v4] drivers/base/memory.c: cache blocks in radix tree to accelerate lookup

On Mon 20-01-20 10:15:32, David Hildenbrand wrote:
[...]
> Ping Scott. Will you resend a cleanup like this as a proper patch or
> shall I?

Maybe just fold it into the original patch which is already sitting in
mmotm?
--
Michal Hocko
SUSE Labs

2020-01-21 23:12:13

by Scott Cheloha

Subject: [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup

From: Scott Cheloha <[email protected]>

Searching for a particular memory block by id is an O(n) operation
because each memory block's underlying device is kept in an unsorted
linked list on the subsystem bus.

We can cut the lookup cost to O(log n) if we cache each memory block
in an xarray. This time complexity improvement is significant on
systems with many memory blocks. For example:

1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks. With this
change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
completes ~12ms faster.

Before:
[ 0.005042] memory_dev_init: adding memory blocks
[ 0.021591] memory_dev_init: added memory blocks
[ 0.022699] walk_memory_blocks: walking memory blocks
[ 0.038730] walk_memory_blocks: walked memory blocks 0-511

After:
[ 0.005057] memory_dev_init: adding memory blocks
[ 0.009415] memory_dev_init: added memory blocks
[ 0.010519] walk_memory_blocks: walking memory blocks
[ 0.014135] walk_memory_blocks: walked memory blocks 0-511

2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks. With
this change memory_dev_init() completes ~88ms faster and
walk_memory_blocks() completes ~87ms faster.

Before:
[ 0.252246] memory_dev_init: adding memory blocks
[ 0.395469] memory_dev_init: added memory blocks
[ 0.409413] walk_memory_blocks: walking memory blocks
[ 0.433028] walk_memory_blocks: walked memory blocks 0-511
[ 0.433094] walk_memory_blocks: walking memory blocks
[ 0.500244] walk_memory_blocks: walked memory blocks 131072-131583

After:
[ 0.245063] memory_dev_init: adding memory blocks
[ 0.299539] memory_dev_init: added memory blocks
[ 0.313609] walk_memory_blocks: walking memory blocks
[ 0.315287] walk_memory_blocks: walked memory blocks 0-511
[ 0.315349] walk_memory_blocks: walking memory blocks
[ 0.316988] walk_memory_blocks: walked memory blocks 131072-131583

3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks. With
this change we complete memory_dev_init() ~37 minutes faster and
walk_memory_blocks() at least ~30 minutes faster. The exact timing
for walk_memory_blocks() is missing, though I observed that the
soft lockups in walk_memory_blocks() disappeared with the change,
suggesting that lower bound.

Before:
[ 13.703907] memory_dev_init: adding blocks
[ 2287.406099] memory_dev_init: added all blocks
[ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
[ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160

After:
[ 13.696898] memory_dev_init: adding blocks
[ 15.660035] memory_dev_init: added all blocks
(the walk_memory_blocks traces disappear)

There should be no significant negative impact for machines with few
memory blocks. A sparse xarray has a small footprint and an O(log n)
lookup is negligibly slower than an O(n) lookup for only the smallest
number of memory blocks.

1. A 16GB x86 machine with 128MB memblocks has 132 blocks. With this
change memory_dev_init() completes ~300us faster and walk_memory_blocks()
completes no faster or slower. The improvement is pretty close to noise.

Before:
[ 0.224752] memory_dev_init: adding memory blocks
[ 0.227116] memory_dev_init: added memory blocks
[ 0.227183] walk_memory_blocks: walking memory blocks
[ 0.227183] walk_memory_blocks: walked memory blocks 0-131

After:
[ 0.224911] memory_dev_init: adding memory blocks
[ 0.226935] memory_dev_init: added memory blocks
[ 0.227089] walk_memory_blocks: walking memory blocks
[ 0.227089] walk_memory_blocks: walked memory blocks 0-131

Signed-off-by: Scott Cheloha <[email protected]>
Acked-by: David Hildenbrand <[email protected]>
Acked-by: Nathan Lynch <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
v2 incorporates suggestions from David Hildenbrand.

v3 changes:
- Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"

- Be conservative: don't use radix_tree_for_each_slot() in
walk_memory_blocks() yet. It introduces RCU which could
change behavior. Walking the tree "by hand" with
find_memory_block_by_id() is slower but keeps the patch
simple.

v4 changes:
- Rewrite commit message to explicitly note the time
complexity improvements.

- Provide anecdotal accounts of time-savings in changelog

v5 changes:
- Switch from the radix_tree API to the xarray API to conform
to current kernel preferences.

- Move the time savings accounts into the commit message itself.
Remeasure performance changes on the machines I had available.

It should be noted that I was not able to get time on the 32TB
machine to remeasure the improvements for v5. The quoted traces
are from v4 of the patch. However, the xarray API is used to
implement the radix_tree API, so I expect the performance changes
will be identical.

I did test v5 of the patch on the other machines mentioned in the
commit message to ensure there were no regressions.

drivers/base/memory.c | 37 ++++++++++++++++++++++++-------------
1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 799b43191dea..2178d3e6d063 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/stat.h>
#include <linux/slab.h>
+#include <linux/xarray.h>

#include <linux/atomic.h>
#include <linux/uaccess.h>
@@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
.offline = memory_subsys_offline,
};

+/*
+ * Memory blocks are cached in a local radix tree to avoid
+ * a costly linear search for the corresponding device on
+ * the subsystem bus.
+ */
+static DEFINE_XARRAY(memory_blocks);
+
static BLOCKING_NOTIFIER_HEAD(memory_chain);

int register_memory_notifier(struct notifier_block *nb)
@@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
/* A reference for the returned memory block device is acquired. */
static struct memory_block *find_memory_block_by_id(unsigned long block_id)
{
- struct device *dev;
+ struct memory_block *mem;

- dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
- return dev ? to_memory_block(dev) : NULL;
+ mem = xa_load(&memory_blocks, block_id);
+ if (mem)
+ get_device(&mem->dev);
+ return mem;
}

-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all device subsystems.
- */
struct memory_block *find_memory_block(struct mem_section *section)
{
unsigned long block_id = base_memory_block_id(__section_nr(section));
@@ -628,9 +630,16 @@ int register_memory(struct memory_block *memory)
memory->dev.offline = memory->state == MEM_OFFLINE;

ret = device_register(&memory->dev);
- if (ret)
+ if (ret) {
put_device(&memory->dev);
-
+ return ret;
+ }
+ ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
+ GFP_KERNEL));
+ if (ret) {
+ put_device(&memory->dev);
+ device_unregister(&memory->dev);
+ }
return ret;
}

@@ -688,6 +697,8 @@ static void unregister_memory(struct memory_block *memory)
if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
return;

+ WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
+
/* drop the ref. we got via find_memory_block() */
put_device(&memory->dev);
device_unregister(&memory->dev);
--
2.24.1

2020-01-22 10:44:50

by David Hildenbrand

Subject: Re: [PATCH v5] drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup

On 22.01.20 00:10, Scott Cheloha wrote:
> From: Scott Cheloha <[email protected]>
>
> Searching for a particular memory block by id is an O(n) operation
> because each memory block's underlying device is kept in an unsorted
> linked list on the subsystem bus.
>
> We can cut the lookup cost to O(log n) if we cache each memory block
> in an xarray. This time complexity improvement is significant on
> systems with many memory blocks. For example:
>
> 1. A 128GB POWER9 VM with 256MB memblocks has 512 blocks. With this
> change memory_dev_init() completes ~12ms faster and walk_memory_blocks()
> completes ~12ms faster.
>
> Before:
> [ 0.005042] memory_dev_init: adding memory blocks
> [ 0.021591] memory_dev_init: added memory blocks
> [ 0.022699] walk_memory_blocks: walking memory blocks
> [ 0.038730] walk_memory_blocks: walked memory blocks 0-511
>
> After:
> [ 0.005057] memory_dev_init: adding memory blocks
> [ 0.009415] memory_dev_init: added memory blocks
> [ 0.010519] walk_memory_blocks: walking memory blocks
> [ 0.014135] walk_memory_blocks: walked memory blocks 0-511
>
> 2. A 256GB POWER9 LPAR with 256MB memblocks has 1024 blocks. With
> this change memory_dev_init() completes ~88ms faster and
> walk_memory_blocks() completes ~87ms faster.
>
> Before:
> [ 0.252246] memory_dev_init: adding memory blocks
> [ 0.395469] memory_dev_init: added memory blocks
> [ 0.409413] walk_memory_blocks: walking memory blocks
> [ 0.433028] walk_memory_blocks: walked memory blocks 0-511
> [ 0.433094] walk_memory_blocks: walking memory blocks
> [ 0.500244] walk_memory_blocks: walked memory blocks 131072-131583
>
> After:
> [ 0.245063] memory_dev_init: adding memory blocks
> [ 0.299539] memory_dev_init: added memory blocks
> [ 0.313609] walk_memory_blocks: walking memory blocks
> [ 0.315287] walk_memory_blocks: walked memory blocks 0-511
> [ 0.315349] walk_memory_blocks: walking memory blocks
> [ 0.316988] walk_memory_blocks: walked memory blocks 131072-131583
>
> 3. A 32TB POWER9 LPAR with 256MB memblocks has 131072 blocks. With
> this change we complete memory_dev_init() ~37 minutes faster and
> walk_memory_blocks() at least ~30 minutes faster. The exact timing
> for walk_memory_blocks() is missing, though I observed that the
> soft lockups in walk_memory_blocks() disappeared with the change,
> suggesting that lower bound.
>
> Before:
> [ 13.703907] memory_dev_init: adding blocks
> [ 2287.406099] memory_dev_init: added all blocks
> [ 2347.494986] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2527.625378] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2707.761977] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 2887.899975] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3068.028318] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3248.158764] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3428.287296] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3608.425357] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3788.554572] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 3968.695071] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
> [ 4148.823970] [c000000014c5bb60] [c000000000869af4] walk_memory_blocks+0x94/0x160
>
> After:
> [ 13.696898] memory_dev_init: adding blocks
> [ 15.660035] memory_dev_init: added all blocks
> (the walk_memory_blocks traces disappear)
>
> There should be no significant negative impact for machines with few
> memory blocks. A sparse xarray has a small footprint and an O(log n)
> lookup is negligibly slower than an O(n) lookup for only the smallest
> number of memory blocks.
>
> 1. A 16GB x86 machine with 128MB memblocks has 132 blocks. With this
> change memory_dev_init() completes ~300us faster and walk_memory_blocks()
> completes no faster or slower. The improvement is pretty close to noise.
>
> Before:
> [ 0.224752] memory_dev_init: adding memory blocks
> [ 0.227116] memory_dev_init: added memory blocks
> [ 0.227183] walk_memory_blocks: walking memory blocks
> [ 0.227183] walk_memory_blocks: walked memory blocks 0-131
>
> After:
> [ 0.224911] memory_dev_init: adding memory blocks
> [ 0.226935] memory_dev_init: added memory blocks
> [ 0.227089] walk_memory_blocks: walking memory blocks
> [ 0.227089] walk_memory_blocks: walked memory blocks 0-131
>
> Signed-off-by: Scott Cheloha <[email protected]>
> Acked-by: David Hildenbrand <[email protected]>
> Acked-by: Nathan Lynch <[email protected]>
> Acked-by: Michal Hocko <[email protected]>
> ---
> v2 incorporates suggestions from David Hildenbrand.
>
> v3 changes:
> - Rebase atop "drivers/base/memory.c: drop the mem_sysfs_mutex"
>
> - Be conservative: don't use radix_tree_for_each_slot() in
> walk_memory_blocks() yet. It introduces RCU which could
> change behavior. Walking the tree "by hand" with
> find_memory_block_by_id() is slower but keeps the patch
> simple.
>
> v4 changes:
> - Rewrite commit message to explicitly note the time
> complexity improvements.
>
> - Provide anecdotal accounts of time-savings in changelog
>
> v5 changes:
> - Switch from the radix_tree API to the xarray API to conform
> to current kernel preferences.
>
> - Move the time savings accounts into the commit message itself.
> Remeasure performance changes on the machines I had available.
>
> It should be noted that I was not able to get time on the 32TB
> machine to remeasure the improvements for v5. The quoted traces
> are from v4 of the patch. However, the xarray API is used to
> implement the radix_tree API, so I expect the performance changes
> will be identical.
>
> I did test v5 of the patch on the other machines mentioned in the
> commit message to ensure there were no regressions.
>
> drivers/base/memory.c | 37 ++++++++++++++++++++++++-------------
> 1 file changed, 24 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 799b43191dea..2178d3e6d063 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -21,6 +21,7 @@
> #include <linux/mm.h>
> #include <linux/stat.h>
> #include <linux/slab.h>
> +#include <linux/xarray.h>
>
> #include <linux/atomic.h>
> #include <linux/uaccess.h>
> @@ -56,6 +57,13 @@ static struct bus_type memory_subsys = {
> .offline = memory_subsys_offline,
> };
>
> +/*
> + * Memory blocks are cached in a local radix tree to avoid
> + * a costly linear search for the corresponding device on
> + * the subsystem bus.
> + */
> +static DEFINE_XARRAY(memory_blocks);
> +
> static BLOCKING_NOTIFIER_HEAD(memory_chain);
>
> int register_memory_notifier(struct notifier_block *nb)
> @@ -572,20 +580,14 @@ int __weak arch_get_memory_phys_device(unsigned long start_pfn)
> /* A reference for the returned memory block device is acquired. */
> static struct memory_block *find_memory_block_by_id(unsigned long block_id)
> {
> - struct device *dev;
> + struct memory_block *mem;
>
> - dev = subsys_find_device_by_id(&memory_subsys, block_id, NULL);
> - return dev ? to_memory_block(dev) : NULL;
> + mem = xa_load(&memory_blocks, block_id);
> + if (mem)
> + get_device(&mem->dev);
> + return mem;
> }
>
> -/*
> - * For now, we have a linear search to go find the appropriate
> - * memory_block corresponding to a particular phys_index. If
> - * this gets to be a real problem, we can always use a radix
> - * tree or something here.
> - *
> - * This could be made generic for all device subsystems.
> - */
> struct memory_block *find_memory_block(struct mem_section *section)
> {
> unsigned long block_id = base_memory_block_id(__section_nr(section));
> @@ -628,9 +630,16 @@ int register_memory(struct memory_block *memory)
> memory->dev.offline = memory->state == MEM_OFFLINE;
>
> ret = device_register(&memory->dev);
> - if (ret)
> + if (ret) {
> put_device(&memory->dev);
> -
> + return ret;
> + }
> + ret = xa_err(xa_store(&memory_blocks, memory->dev.id, memory,
> + GFP_KERNEL));
> + if (ret) {
> + put_device(&memory->dev);
> + device_unregister(&memory->dev);
> + }
> return ret;
> }
>
> @@ -688,6 +697,8 @@ static void unregister_memory(struct memory_block *memory)
> if (WARN_ON_ONCE(memory->dev.bus != &memory_subsys))
> return;
>
> + WARN_ON(xa_erase(&memory_blocks, memory->dev.id) == NULL);
> +
> /* drop the ref. we got via find_memory_block() */
> put_device(&memory->dev);
> device_unregister(&memory->dev);
>

I think only the device_hotplug_lock documentation fixup from me is
still missing. So this, replacing the original patch, looks good to me!

Thanks Scott!

--
Thanks,

David / dhildenb