2019-06-13 16:48:14

by Nadav Amit

Subject: [PATCH 0/3] resource: find_next_iomem_res() improvements

Running some microbenchmarks on dax keeps showing find_next_iomem_res()
as a place in which a significant amount of time is spent. It appears that
in order to determine the cacheability that is required for the PTE,
lookup_memtype() is called, and this one traverses the resources list in
an inefficient manner. This patch-set tries to improve this situation.

The first patch fixes what appears to be unsafe locking in
find_next_iomem_res().

The second patch improves performance by searching the top level first,
to find a matching range, before going down to the children. The third
patch improves performance by caching the top level resource of the
last found resource in find_next_iomem_res().

Both of these optimizations rely on the fact that the ranges in the top
level do not overlap each other.
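
As an illustration, the core idea behind both optimizations (a
simplified sketch, not the code from the patches themselves) is that a
lookup can skip an entire subtree whenever its first-level range does
not intersect the requested [start, end] range:

	for (p = iomem_resource.child; p; p = p->sibling) {
		if (p->start > end)
			break;		/* passed the range; no match */
		if (p->end < start)
			continue;	/* subtree cannot intersect */
		/* only now descend into p->child ... */
	}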

Running sysbench on dax (Haswell, pmem emulation, with write_cache
disabled):

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

events (avg/stddev)
-------------------
5.2-rc3: 1247669.0000/16075.39
+patches: 1293408.5000/7720.69 (+3.5%)

Cc: Borislav Petkov <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>

Nadav Amit (3):
resource: Fix locking in find_next_iomem_res()
resource: Avoid unnecessary lookups in find_next_iomem_res()
resource: Introduce resource cache

kernel/resource.c | 96 ++++++++++++++++++++++++++++++++++++++---------
1 file changed, 79 insertions(+), 17 deletions(-)

--
2.20.1


2019-06-13 16:48:30

by Nadav Amit

Subject: [PATCH 3/3] resource: Introduce resource cache

For efficient search of resources, as needed to determine the memory
type for dax page-faults, introduce a cache of the most recently used
top-level resource. Caching the top-level should be safe as ranges in
that level do not overlap (unlike those of lower levels).

Keep the cache per-cpu to avoid possible contention. Whenever a resource
is added, removed or changed, invalidate the cache on all CPUs. The
invalidation takes place when the resource_lock is taken for write,
preventing possible races.

This patch provides relatively small performance improvements over the
previous patch (~0.5% on sysbench), but can benefit systems with many
resources.

Cc: Borislav Petkov <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
kernel/resource.c | 51 +++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index 51c3bf6d9b98..ab659d0eb52a 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -53,6 +53,12 @@ struct resource_constraint {

static DEFINE_RWLOCK(resource_lock);

+/*
+ * Cache of the top-level resource that was most recently used by
+ * find_next_iomem_res().
+ */
+static DEFINE_PER_CPU(struct resource *, resource_cache);
+
/*
* For memory hotplug, there is no way to free resource entries allocated
* by boot mem after the system is up. So for reusing the resource entry
@@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
}
}

+static void invalidate_resource_cache(void)
+{
+ int cpu;
+
+ lockdep_assert_held_exclusive(&resource_lock);
+
+ for_each_possible_cpu(cpu)
+ per_cpu(resource_cache, cpu) = NULL;
+}
+
void release_child_resources(struct resource *r)
{
write_lock(&resource_lock);
+ invalidate_resource_cache();
__release_child_resources(r);
write_unlock(&resource_lock);
}
@@ -281,6 +298,7 @@ struct resource *request_resource_conflict(struct resource *root, struct resourc
struct resource *conflict;

write_lock(&resource_lock);
+ invalidate_resource_cache();
conflict = __request_resource(root, new);
write_unlock(&resource_lock);
return conflict;
@@ -312,6 +330,7 @@ int release_resource(struct resource *old)
int retval;

write_lock(&resource_lock);
+ invalidate_resource_cache();
retval = __release_resource(old, true);
write_unlock(&resource_lock);
return retval;
@@ -343,7 +362,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
bool first_lvl, struct resource *res)
{
bool siblings_only = true;
- struct resource *p;
+ struct resource *p, *res_first_lvl;

if (!res)
return -EINVAL;
@@ -353,7 +372,15 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,

read_lock(&resource_lock);

- for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) {
+ /*
+ * If our cached entry preceded or matches the range that we are looking
+ * for, start with it. Otherwise, start from the beginning.
+ */
+ p = res_first_lvl = this_cpu_read(resource_cache);
+ if (!p || start < p->start)
+ p = iomem_resource.child;
+
+ for (; p; p = next_resource(p, siblings_only)) {
/* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
@@ -364,6 +391,10 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
if (p->end < start)
continue;

+ /* We only cache the first level for correctness */
+ if (p->parent == &iomem_resource)
+ res_first_lvl = p;
+
/*
* Now that we found a range that matches what we look for,
* check the flags and the descriptor. If we were not asked to
@@ -386,6 +417,9 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
res->end = min(end, p->end);
res->flags = p->flags;
res->desc = p->desc;
+
+ /* Update the cache */
+ this_cpu_write(resource_cache, res_first_lvl);
}

read_unlock(&resource_lock);
@@ -674,6 +708,7 @@ static int reallocate_resource(struct resource *root, struct resource *old,
struct resource *conflict;

write_lock(&resource_lock);
+ invalidate_resource_cache();

if ((err = __find_resource(root, old, &new, newsize, constraint)))
goto out;
@@ -744,6 +779,7 @@ int allocate_resource(struct resource *root, struct resource *new,
}

write_lock(&resource_lock);
+ invalidate_resource_cache();
err = find_resource(root, new, size, &constraint);
if (err >= 0 && __request_resource(root, new))
err = -EBUSY;
@@ -848,6 +884,7 @@ struct resource *insert_resource_conflict(struct resource *parent, struct resour
struct resource *conflict;

write_lock(&resource_lock);
+ invalidate_resource_cache();
conflict = __insert_resource(parent, new);
write_unlock(&resource_lock);
return conflict;
@@ -886,6 +923,7 @@ void insert_resource_expand_to_fit(struct resource *root, struct resource *new)
return;

write_lock(&resource_lock);
+ invalidate_resource_cache();
for (;;) {
struct resource *conflict;

@@ -926,6 +964,7 @@ int remove_resource(struct resource *old)
int retval;

write_lock(&resource_lock);
+ invalidate_resource_cache();
retval = __release_resource(old, false);
write_unlock(&resource_lock);
return retval;
@@ -985,6 +1024,7 @@ int adjust_resource(struct resource *res, resource_size_t start,
int result;

write_lock(&resource_lock);
+ invalidate_resource_cache();
result = __adjust_resource(res, start, size);
write_unlock(&resource_lock);
return result;
@@ -1059,6 +1099,8 @@ reserve_region_with_split(struct resource *root, resource_size_t start,
int abort = 0;

write_lock(&resource_lock);
+ invalidate_resource_cache();
+
if (root->start > start || root->end < end) {
pr_err("requested range [0x%llx-0x%llx] not in root %pr\n",
(unsigned long long)start, (unsigned long long)end,
@@ -1135,6 +1177,7 @@ struct resource * __request_region(struct resource *parent,
res->end = start + n - 1;

write_lock(&resource_lock);
+ invalidate_resource_cache();

for (;;) {
struct resource *conflict;
@@ -1168,6 +1211,7 @@ struct resource * __request_region(struct resource *parent,
schedule();
remove_wait_queue(&muxed_resource_wait, &wait);
write_lock(&resource_lock);
+ invalidate_resource_cache();
continue;
}
/* Uhhuh, that didn't work out.. */
@@ -1198,6 +1242,7 @@ void __release_region(struct resource *parent, resource_size_t start,
end = start + n - 1;

write_lock(&resource_lock);
+ invalidate_resource_cache();

for (;;) {
struct resource *res = *p;
@@ -1211,6 +1256,7 @@ void __release_region(struct resource *parent, resource_size_t start,
}
if (res->start != start || res->end != end)
break;
+
*p = res->sibling;
write_unlock(&resource_lock);
if (res->flags & IORESOURCE_MUXED)
@@ -1268,6 +1314,7 @@ int release_mem_region_adjustable(struct resource *parent,

p = &parent->child;
write_lock(&resource_lock);
+ invalidate_resource_cache();

while ((res = *p)) {
if (res->start >= end)
--
2.20.1

2019-06-13 16:49:16

by Nadav Amit

Subject: [PATCH 2/3] resource: Avoid unnecessary lookups in find_next_iomem_res()

find_next_iomem_res() shows up as a source of overhead in dax
benchmarks.

Improve performance by not considering children of the tree if the top
level does not match. Since the range of a parent should include the
ranges of its children, such a check is redundant.

Running sysbench on dax (pmem emulation, with write_cache disabled):

sysbench fileio --file-total-size=3G --file-test-mode=rndwr \
--file-io-mode=mmap --threads=4 --file-fsync-mode=fdatasync run

Provides the following results:

events (avg/stddev)
-------------------
5.2-rc3: 1247669.0000/16075.39
w/patch: 1286320.5000/16402.72 (+3%)

Cc: Borislav Petkov <[email protected]>
Cc: Toshi Kani <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Dan Williams <[email protected]>
Cc: Bjorn Helgaas <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
kernel/resource.c | 29 ++++++++++++++++++++++-------
1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/kernel/resource.c b/kernel/resource.c
index c0f7ba0ece52..51c3bf6d9b98 100644
--- a/kernel/resource.c
+++ b/kernel/resource.c
@@ -342,6 +342,7 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,
unsigned long flags, unsigned long desc,
bool first_lvl, struct resource *res)
{
+ bool siblings_only = true;
struct resource *p;

if (!res)
@@ -352,17 +353,31 @@ static int find_next_iomem_res(resource_size_t start, resource_size_t end,

read_lock(&resource_lock);

- for (p = iomem_resource.child; p; p = next_resource(p, first_lvl)) {
- if ((p->flags & flags) != flags)
- continue;
- if ((desc != IORES_DESC_NONE) && (desc != p->desc))
- continue;
+ for (p = iomem_resource.child; p; p = next_resource(p, siblings_only)) {
+ /* If we passed the resource we are looking for, stop */
if (p->start > end) {
p = NULL;
break;
}
- if ((p->end >= start) && (p->start <= end))
- break;
+
+ /* Skip until we find a range that matches what we look for */
+ if (p->end < start)
+ continue;
+
+ /*
+ * Now that we found a range that matches what we look for,
+ * check the flags and the descriptor. If we were not asked to
+ * use only the first level, start looking at children as well.
+ */
+ siblings_only = first_lvl;
+
+ if ((p->flags & flags) != flags)
+ continue;
+ if ((desc != IORES_DESC_NONE) && (desc != p->desc))
+ continue;
+
+ /* Found a match, break */
+ break;
}

if (p) {
--
2.20.1

2019-06-17 17:29:12

by Nadav Amit

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

> On Jun 15, 2019, at 3:16 PM, Sasha Levin <[email protected]> wrote:
>
> Hi,
>
> [This is an automated email]
>
> This commit has been processed because it contains a "Fixes:" tag,
> fixing commit: ff3cc952d3f0 resource: Add remove_resource interface.

This commit (Patch 3/3) does not have the “Fixes:” tag (and it is a
performance enhancement), so I don’t know why it was processed.

IOW: please do not backport it.

> The bot has tested the following trees: v5.1.9, v4.19.50, v4.14.125, v4.9.181.
>
> v5.1.9: Build OK!
> v4.19.50: Failed to apply! Possible dependencies:
> 010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
> 7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
> a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>
> v4.14.125: Failed to apply! Possible dependencies:
> 010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
> 0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
> 1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
> 4ac2aed837cb ("resource: Consolidate resource walking code")
> 7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
> a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
>
> v4.9.181: Failed to apply! Possible dependencies:
> 010a93bf97c7 ("resource: Fix find_next_iomem_res() iteration issue")
> 0e4c12b45aa8 ("x86/mm, resource: Use PAGE_KERNEL protection for ioremap of memory pages")
> 1d2e733b13b4 ("resource: Provide resource struct in resource walk callback")
> 4ac2aed837cb ("resource: Consolidate resource walking code")
> 60fe3910bb02 ("kexec_file: Allow arch-specific memory walking for kexec_add_buffer")
> 7a53bb309eb3 ("resource: Fix locking in find_next_iomem_res()")
> a0458284f062 ("powerpc: Add support code for kexec_file_load()")
> a98959fdbda1 ("resource: Include resource end in walk_*() interfaces")
> da6658859b9c ("powerpc: Change places using CONFIG_KEXEC to use CONFIG_KEXEC_CORE instead.")
> ec2b9bfaac44 ("kexec_file: Change kexec_add_buffer to take kexec_buf as argument.")
>
>
> How should we proceed with this patch?
>
> --
> Thanks,
> Sasha


2019-06-18 04:58:18

by Andrew Morton

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:

> For efficient search of resources, as needed to determine the memory
> type for dax page-faults, introduce a cache of the most recently used
> top-level resource. Caching the top-level should be safe as ranges in
> that level do not overlap (unlike those of lower levels).
>
> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> is added, removed or changed, invalidate the cache on all CPUs. The
> invalidation takes place when the resource_lock is taken for write,
> preventing possible races.
>
> This patch provides relatively small performance improvements over the
> previous patch (~0.5% on sysbench), but can benefit systems with many
> resources.

> --- a/kernel/resource.c
> +++ b/kernel/resource.c
> @@ -53,6 +53,12 @@ struct resource_constraint {
>
> static DEFINE_RWLOCK(resource_lock);
>
> +/*
> + * Cache of the top-level resource that was most recently used by
> + * find_next_iomem_res().
> + */
> +static DEFINE_PER_CPU(struct resource *, resource_cache);

A per-cpu cache which is accessed under a kernel-wide read_lock looks a
bit odd - the latency getting at that rwlock will swamp the benefit of
isolating the CPUs from each other when accessing resource_cache.

On the other hand, if we have multiple CPUs running
find_next_iomem_res() concurrently then yes, I see the benefit. Has
the benefit of using a per-cpu cache (rather than a kernel-wide one)
been quantified?


> @@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
> }
> }
>
> +static void invalidate_resource_cache(void)
> +{
> + int cpu;
> +
> + lockdep_assert_held_exclusive(&resource_lock);
> +
> + for_each_possible_cpu(cpu)
> + per_cpu(resource_cache, cpu) = NULL;
> +}

All the calls to invalidate_resource_cache() are rather a
maintainability issue - easy to miss one as the code evolves.

Can't we just make find_next_iomem_res() smarter? For example, start
the lookup from the cached point and if that failed, do a full sweep?
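
A sketch of that fallback (using a hypothetical __find_from() helper
just to show the shape of the idea):

	p = this_cpu_read(resource_cache);
	if (p)				/* try from the cached point first */
		p = __find_from(p, start, end);
	if (!p)				/* cache miss: do the full sweep */
		p = __find_from(iomem_resource.child, start, end);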

> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();
> + invalidate_resource_cache();

Ow. I guess the maintainability situation can be improved by renaming
resource_lock to something else (to avoid mishaps) then adding wrapper
functions. But still. I can't say this is a super-exciting patch :(

2019-06-18 06:46:10

by Nadav Amit

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

> On Jun 17, 2019, at 10:33 PM, Nadav Amit <[email protected]> wrote:
>
>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <[email protected]> wrote:
>>
>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:
>>
>>> For efficient search of resources, as needed to determine the memory
>>> type for dax page-faults, introduce a cache of the most recently used
>>> top-level resource. Caching the top-level should be safe as ranges in
>>> that level do not overlap (unlike those of lower levels).
>>>
>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>> is added, removed or changed, invalidate the cache on all CPUs. The
>>> invalidation takes place when the resource_lock is taken for write,
>>> preventing possible races.
>>>
>>> This patch provides relatively small performance improvements over the
>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>> resources.
>>
>>> --- a/kernel/resource.c
>>> +++ b/kernel/resource.c
>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>
>>> static DEFINE_RWLOCK(resource_lock);
>>>
>>> +/*
>>> + * Cache of the top-level resource that was most recently used by
>>> + * find_next_iomem_res().
>>> + */
>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>
>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>> bit odd - the latency getting at that rwlock will swamp the benefit of
>> isolating the CPUs from each other when accessing resource_cache.
>>
>> On the other hand, if we have multiple CPUs running
>> find_next_iomem_res() concurrently then yes, I see the benefit. Has
>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>> been quantified?
>
> No. I am not sure how easy it would be to measure it. On the other hand,
> the lock is not supposed to be contended (in most cases). At the time I saw
> numbers that showed that stores to “exclusive” cache lines can be as
> expensive as atomic operations [1]. I am not sure how up to date these
> numbers are though. In the benchmark I ran, multiple CPUs ran
> find_next_iomem_res() concurrently.
>
> [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf

Just to clarify - the main motivation behind the per-cpu variable is not
about contention, but about the fact that different processes/threads that
run concurrently might use different resources.

2019-06-18 06:46:22

by Dan Williams

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <[email protected]> wrote:
>
> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
> as a place in which a significant amount of time is spent. It appears that
> in order to determine the cacheability that is required for the PTE,
> lookup_memtype() is called, and this one traverses the resources list in
> an inefficient manner. This patch-set tries to improve this situation.

Let's just do this lookup once per device, cache that, and replay it
to modified vmf_insert_* routines that trust the caller to already
know the pgprot_values.

2019-06-18 06:48:08

by Nadav Amit

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

> On Jun 17, 2019, at 9:57 PM, Andrew Morton <[email protected]> wrote:
>
> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:
>
>> For efficient search of resources, as needed to determine the memory
>> type for dax page-faults, introduce a cache of the most recently used
>> top-level resource. Caching the top-level should be safe as ranges in
>> that level do not overlap (unlike those of lower levels).
>>
>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>> is added, removed or changed, invalidate the cache on all CPUs. The
>> invalidation takes place when the resource_lock is taken for write,
>> preventing possible races.
>>
>> This patch provides relatively small performance improvements over the
>> previous patch (~0.5% on sysbench), but can benefit systems with many
>> resources.
>
>> --- a/kernel/resource.c
>> +++ b/kernel/resource.c
>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>
>> static DEFINE_RWLOCK(resource_lock);
>>
>> +/*
>> + * Cache of the top-level resource that was most recently used by
>> + * find_next_iomem_res().
>> + */
>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>
> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> bit odd - the latency getting at that rwlock will swamp the benefit of
> isolating the CPUs from each other when accessing resource_cache.
>
> On the other hand, if we have multiple CPUs running
> find_next_iomem_res() concurrently then yes, I see the benefit. Has
> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> been quantified?

No. I am not sure how easy it would be to measure it. On the other hand,
the lock is not supposed to be contended (in most cases). At the time I saw
numbers that showed that stores to “exclusive” cache lines can be as
expensive as atomic operations [1]. I am not sure how up to date these
numbers are though. In the benchmark I ran, multiple CPUs ran
find_next_iomem_res() concurrently.

[1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf

>
>
>> @@ -262,9 +268,20 @@ static void __release_child_resources(struct resource *r)
>> }
>> }
>>
>> +static void invalidate_resource_cache(void)
>> +{
>> + int cpu;
>> +
>> + lockdep_assert_held_exclusive(&resource_lock);
>> +
>> + for_each_possible_cpu(cpu)
>> + per_cpu(resource_cache, cpu) = NULL;
>> +}
>
> All the calls to invalidate_resource_cache() are rather a
> maintainability issue - easy to miss one as the code evolves.
>
> Can't we just make find_next_iomem_res() smarter? For example, start
> the lookup from the cached point and if that failed, do a full sweep?

I may be able to do something like that to reduce the number of locations
that need to be updated, but you always need to invalidate if a resource is
removed. This might make the code more prone to bugs, since the logic of
when to invalidate becomes more complicated.

>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>> + invalidate_resource_cache();
>
> Ow. I guess the maintainability situation can be improved by renaming
> resource_lock to something else (to avoid mishaps) then adding wrapper
> functions. But still. I can't say this is a super-exciting patch :(

I considered doing so, but I was not sure it is better. If you want I’ll
implement it using wrapper functions (but both lock and unlock would need
to be wrapped for consistency).

The benefit of this patch over the previous one is not huge, and I don’t
know how to implement it any better (excluding wrapper function, if you
consider it better). If you want, you can just drop this patch.


2019-06-18 17:42:45

by Nadav Amit

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

> On Jun 17, 2019, at 11:44 PM, Dan Williams <[email protected]> wrote:
>
> On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <[email protected]> wrote:
>> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
>> as a place in which a significant amount of time is spent. It appears that
>> in order to determine the cacheability that is required for the PTE,
>> lookup_memtype() is called, and this one traverses the resources list in
>> an inefficient manner. This patch-set tries to improve this situation.
>
> Let's just do this lookup once per device, cache that, and replay it
> to modified vmf_insert_* routines that trust the caller to already
> know the pgprot_values.

IIUC, one device can have multiple regions with different characteristics,
which require different cachability. Apparently, that is the reason there
is a tree of resources. Please be more specific about where you want to
cache it.

Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
see being done in quite a few cases), but I don’t know the code well enough
to be certain that every vma should have a single protection and that it
should not change afterwards.

2019-06-18 18:32:31

by Dan Williams

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Tue, Jun 18, 2019 at 10:42 AM Nadav Amit <[email protected]> wrote:
>
> > On Jun 17, 2019, at 11:44 PM, Dan Williams <[email protected]> wrote:
> >
> > On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <[email protected]> wrote:
> >> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
> >> as a place in which a significant amount of time is spent. It appears that
> >> in order to determine the cacheability that is required for the PTE,
> >> lookup_memtype() is called, and this one traverses the resources list in
> >> an inefficient manner. This patch-set tries to improve this situation.
> >
> > Let's just do this lookup once per device, cache that, and replay it
> > to modified vmf_insert_* routines that trust the caller to already
> > know the pgprot_values.
>
> IIUC, one device can have multiple regions with different characteristics,
> which require different cachability.

Not for pmem. It will always be one common cacheability setting for
the entirety of persistent memory.

> Apparently, that is the reason there
> is a tree of resources. Please be more specific about where you want to
> cache it.

The reason for lookup_memtype() was to try to prevent mixed
cacheability settings of pages across different processes. The
mapping type for pmem/dax is established by one of:

drivers/nvdimm/pmem.c:413:  addr = devm_memremap_pages(dev, &pmem->pgmap);
drivers/nvdimm/pmem.c:425:  addr = devm_memremap_pages(dev, &pmem->pgmap);
drivers/nvdimm/pmem.c:432:  addr = devm_memremap(dev, pmem->phys_addr,
drivers/nvdimm/pmem.c-433-                       pmem->size, ARCH_MEMREMAP_PMEM);

...and is constant for the life of the device and all subsequent mappings.

> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> see being done in quite a few cases), but I don’t know the code well enough
> to be certain that every vma should have a single protection and that it
> should not change afterwards.

No, I'm thinking this would naturally fit as a property hanging off a
'struct dax_device', and then create a version of vmf_insert_mixed()
and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
saved value.
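
A sketch of that direction (the pgprot field, dax_insert_pfn(), and
vmf_insert_mixed_prot() below are hypothetical names, used only for
illustration):

	struct dax_device {
		/* ... existing fields ... */
		pgprot_t pgprot;	/* established once at device init */
	};

	static vm_fault_t dax_insert_pfn(struct vm_fault *vmf, pfn_t pfn,
					 struct dax_device *dax_dev)
	{
		/* bypasses track_pfn_insert()/lookup_memtype() */
		return vmf_insert_mixed_prot(vmf->vma, vmf->address, pfn,
					     dax_dev->pgprot);
	}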

2019-06-18 21:57:58

by Nadav Amit

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

> On Jun 18, 2019, at 11:30 AM, Dan Williams <[email protected]> wrote:
>
> On Tue, Jun 18, 2019 at 10:42 AM Nadav Amit <[email protected]> wrote:
>>> On Jun 17, 2019, at 11:44 PM, Dan Williams <[email protected]> wrote:
>>>
>>> On Wed, Jun 12, 2019 at 9:59 PM Nadav Amit <[email protected]> wrote:
>>>> Running some microbenchmarks on dax keeps showing find_next_iomem_res()
>>>> as a place in which a significant amount of time is spent. It appears that
>>>> in order to determine the cacheability that is required for the PTE,
>>>> lookup_memtype() is called, and this one traverses the resources list in
>>>> an inefficient manner. This patch-set tries to improve this situation.
>>>
>>> Let's just do this lookup once per device, cache that, and replay it
>>> to modified vmf_insert_* routines that trust the caller to already
>>> know the pgprot_values.
>>
>> IIUC, one device can have multiple regions with different characteristics,
>> which require different cachability.
>
> Not for pmem. It will always be one common cacheability setting for
> the entirety of persistent memory.
>
>> Apparently, that is the reason there
>> is a tree of resources. Please be more specific about where you want to
>> cache it.
>
> The reason for lookup_memtype() was to try to prevent mixed
> cacheability settings of pages across different processes. The
> mapping type for pmem/dax is established by one of:
>
> drivers/nvdimm/pmem.c:413:  addr = devm_memremap_pages(dev, &pmem->pgmap);
> drivers/nvdimm/pmem.c:425:  addr = devm_memremap_pages(dev, &pmem->pgmap);
> drivers/nvdimm/pmem.c:432:  addr = devm_memremap(dev, pmem->phys_addr,
> drivers/nvdimm/pmem.c-433-                       pmem->size, ARCH_MEMREMAP_PMEM);
>
> ...and is constant for the life of the device and all subsequent mappings.
>
>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>> see being done in quite a few cases), but I don’t know the code well enough
>> to be certain that every vma should have a single protection and that it
>> should not change afterwards.
>
> No, I'm thinking this would naturally fit as a property hanging off a
> 'struct dax_device', and then create a version of vmf_insert_mixed()
> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> saved value.

Thanks for the detailed explanation. I’ll give it a try (the moment I find
some free time). I still think that patch 2/3 is beneficial, but based on
your feedback, patch 3/3 should be dropped.

2019-06-19 13:01:52

by Bjorn Helgaas

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <[email protected]> wrote:
>
> > On Jun 17, 2019, at 10:33 PM, Nadav Amit <[email protected]> wrote:
> >
> >> On Jun 17, 2019, at 9:57 PM, Andrew Morton <[email protected]> wrote:
> >>
> >> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:
> >>
> >>> For efficient search of resources, as needed to determine the memory
> >>> type for dax page-faults, introduce a cache of the most recently used
> >>> top-level resource. Caching the top-level should be safe as ranges in
> >>> that level do not overlap (unlike those of lower levels).
> >>>
> >>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> >>> is added, removed or changed, invalidate the cache on all CPUs. The
> >>> invalidation takes place when the resource_lock is taken for write,
> >>> preventing possible races.
> >>>
> >>> This patch provides relatively small performance improvements over the
> >>> previous patch (~0.5% on sysbench), but can benefit systems with many
> >>> resources.
> >>
> >>> --- a/kernel/resource.c
> >>> +++ b/kernel/resource.c
> >>> @@ -53,6 +53,12 @@ struct resource_constraint {
> >>>
> >>> static DEFINE_RWLOCK(resource_lock);
> >>>
> >>> +/*
> >>> + * Cache of the top-level resource that was most recently used by
> >>> + * find_next_iomem_res().
> >>> + */
> >>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
> >>
> >> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> >> bit odd - the latency getting at that rwlock will swamp the benefit of
> >> isolating the CPUs from each other when accessing resource_cache.
> >>
> >> On the other hand, if we have multiple CPUs running
> >> find_next_iomem_res() concurrently then yes, I see the benefit. Has
> >> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> >> been quantified?
> >
> > No. I am not sure how easy it would be to measure it. On the other hand,
> > the lock is not supposed to be contended (in most cases). At the time I saw
> > numbers that showed that stores to “exclusive” cache lines can be as
> > expensive as atomic operations [1]. I am not sure how up to date these
> > numbers are though. In the benchmark I ran, multiple CPUs ran
> > find_next_iomem_res() concurrently.
> >
> > [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
>
> Just to clarify - the main motivation behind the per-cpu variable is not
> > about contention, but about the fact that different processes/threads that
> run concurrently might use different resources.

IIUC, the underlying problem is that dax relies heavily on ioremap(),
and ioremap() on x86 takes too long because it relies on
find_next_iomem_res() via the __ioremap_caller() ->
__ioremap_check_mem() -> walk_mem_res() path.

The fact that x86 is the only arch that does this much work in
ioremap() makes me wonder. Is there something unique about x86
mapping attributes that requires this extra work, or is there some way
this could be reworked to avoid searching the resource map in the
first place?

Bjorn

2019-06-19 20:35:41

by Nadav Amit

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

> On Jun 19, 2019, at 6:00 AM, Bjorn Helgaas <[email protected]> wrote:
>
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <[email protected]> wrote:
>>> On Jun 17, 2019, at 10:33 PM, Nadav Amit <[email protected]> wrote:
>>>
>>>> On Jun 17, 2019, at 9:57 PM, Andrew Morton <[email protected]> wrote:
>>>>
>>>> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:
>>>>
>>>>> For efficient search of resources, as needed to determine the memory
>>>>> type for dax page-faults, introduce a cache of the most recently used
>>>>> top-level resource. Caching the top-level should be safe as ranges in
>>>>> that level do not overlap (unlike those of lower levels).
>>>>>
>>>>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
>>>>> is added, removed or changed, invalidate the cache on all CPUs. The
>>>>> invalidation takes place when the resource_lock is taken for write,
>>>>> preventing possible races.
>>>>>
>>>>> This patch provides relatively small performance improvements over the
>>>>> previous patch (~0.5% on sysbench), but can benefit systems with many
>>>>> resources.
>>>>
>>>>> --- a/kernel/resource.c
>>>>> +++ b/kernel/resource.c
>>>>> @@ -53,6 +53,12 @@ struct resource_constraint {
>>>>>
>>>>> static DEFINE_RWLOCK(resource_lock);
>>>>>
>>>>> +/*
>>>>> + * Cache of the top-level resource that was most recently used by
>>>>> + * find_next_iomem_res().
>>>>> + */
>>>>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
>>>>
>>>> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
>>>> bit odd - the latency getting at that rwlock will swamp the benefit of
>>>> isolating the CPUs from each other when accessing resource_cache.
>>>>
>>>> On the other hand, if we have multiple CPUs running
>>>> find_next_iomem_res() concurrently then yes, I see the benefit. Has
>>>> the benefit of using a per-cpu cache (rather than a kernel-wide one)
>>>> been quantified?
>>>
>>> No. I am not sure how easy it would be to measure it. On the other hand,
>>> the lock is not supposed to be contended (in most cases). At the time I saw
>>> numbers that showed that stores to “exclusive” cache lines can be as
>>> expensive as atomic operations [1]. I am not sure how up to date these
>>> numbers are though. In the benchmark I ran, multiple CPUs ran
>>> find_next_iomem_res() concurrently.
>>>
>>> [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
>>
>> Just to clarify - the main motivation behind the per-cpu variable is not
>> about contention, but about the fact that different processes/threads that
>> run concurrently might use different resources.
>
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.

I don’t know much about this path and whether it is painful. The path I was
referring to is during page-fault handling:

- handle_mm_fault
 - __handle_mm_fault
  - do_wp_page
   - ext4_dax_fault
    - ext4_dax_huge_fault
     - dax_iomap_fault
      - dax_iomap_pte_fault
       - vmf_insert_mixed_mkwrite
        - __vm_insert_mixed
         - track_pfn_insert
          - lookup_memtype
           - pat_pagerange_is_ram

But indeed track_pfn_insert() is x86-specific. I guess the differences are
due to the page-table controlling the cachability in x86 (PAT), but I don’t
know much about other architectures and whether they have similar
cachability controls in the page-tables.
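
For reference, the x86 step that performs the lookup is roughly the
following (a simplified sketch of the existing logic in
arch/x86/mm/pat.c, not a verbatim copy):

	/* track_pfn_insert(): set the PAT cache-mode bits in *prot */
	enum page_cache_mode pcm = lookup_memtype(pfn_t_to_phys(pfn));

	*prot = __pgprot((pgprot_val(*prot) & ~_PAGE_CACHE_MASK) |
			 cachemode2protval(pcm));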

2019-06-19 21:55:14

by Dan Williams

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

[ add Andi ]

On Wed, Jun 19, 2019 at 6:00 AM Bjorn Helgaas <[email protected]> wrote:
>
> On Tue, Jun 18, 2019 at 12:40 AM Nadav Amit <[email protected]> wrote:
> >
> > > On Jun 17, 2019, at 10:33 PM, Nadav Amit <[email protected]> wrote:
> > >
> > >> On Jun 17, 2019, at 9:57 PM, Andrew Morton <[email protected]> wrote:
> > >>
> > >> On Wed, 12 Jun 2019 21:59:03 -0700 Nadav Amit <[email protected]> wrote:
> > >>
> > >>> For efficient search of resources, as needed to determine the memory
> > >>> type for dax page-faults, introduce a cache of the most recently used
> > >>> top-level resource. Caching the top-level should be safe as ranges in
> > >>> that level do not overlap (unlike those of lower levels).
> > >>>
> > >>> Keep the cache per-cpu to avoid possible contention. Whenever a resource
> > >>> is added, removed or changed, invalidate the cache on all CPUs. The
> > >>> invalidation takes place when the resource_lock is taken for write,
> > >>> preventing possible races.
> > >>>
> > >>> This patch provides relatively small performance improvements over the
> > >>> previous patch (~0.5% on sysbench), but can benefit systems with many
> > >>> resources.
> > >>
> > >>> --- a/kernel/resource.c
> > >>> +++ b/kernel/resource.c
> > >>> @@ -53,6 +53,12 @@ struct resource_constraint {
> > >>>
> > >>> static DEFINE_RWLOCK(resource_lock);
> > >>>
> > >>> +/*
> > >>> + * Cache of the top-level resource that was most recently used by
> > >>> + * find_next_iomem_res().
> > >>> + */
> > >>> +static DEFINE_PER_CPU(struct resource *, resource_cache);
> > >>
> > >> A per-cpu cache which is accessed under a kernel-wide read_lock looks a
> > >> bit odd - the latency getting at that rwlock will swamp the benefit of
> > >> isolating the CPUs from each other when accessing resource_cache.
> > >>
> > >> On the other hand, if we have multiple CPUs running
> > >> find_next_iomem_res() concurrently then yes, I see the benefit. Has
> > >> the benefit of using a per-cpu cache (rather than a kernel-wide one)
> > >> been quantified?
> > >
> > > No. I am not sure how easy it would be to measure it. On the other hand,
> > > the lock is not supposed to be contended (in most cases). At the time I saw
> > > numbers that showed that stores to “exclusive” cache lines can be as
> > > expensive as atomic operations [1]. I am not sure how up to date these
> > > numbers are though. In the benchmark I ran, multiple CPUs ran
> > > find_next_iomem_res() concurrently.
> > >
> > > [1] http://sigops.org/s/conferences/sosp/2013/papers/p33-david.pdf
> >
> > Just to clarify - the main motivation behind the per-cpu variable is not
> > about contention, but about the fact that different processes/threads that
> > run concurrently might use different resources.
>
> IIUC, the underlying problem is that dax relies heavily on ioremap(),
> and ioremap() on x86 takes too long because it relies on
> find_next_iomem_res() via the __ioremap_caller() ->
> __ioremap_check_mem() -> walk_mem_res() path.
>
> The fact that x86 is the only arch that does this much work in
> ioremap() makes me wonder. Is there something unique about x86
> mapping attributes that requires this extra work, or is there some way
> this could be reworked to avoid searching the resource map in the
> first place?

The underlying issue is that the x86-PAT implementation wants to
ensure that conflicting mappings are not set up for the same physical
address. This is mentioned in the developer manuals as problematic on
some cpus. Andi, are lookup_memtype() and track_pfn_insert() still
relevant?

2019-06-20 21:32:39

by Andi Kleen

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

Dan Williams <[email protected]> writes:
>
> The underlying issue is that the x86-PAT implementation wants to
> ensure that conflicting mappings are not set up for the same physical
> address. This is mentioned in the developer manuals as problematic on
> some cpus. Andi, are lookup_memtype() and track_pfn_insert() still
> relevant?

There have been discussions about it in the past, and the right answer
will likely differ for different CPUs: But so far the official answer
for Intel CPUs is that these caching conflicts should be avoided.

So I guess the cache in the original email makes sense for now.

-Andi

2019-06-20 23:13:37

by Dan Williams

Subject: Re: [PATCH 3/3] resource: Introduce resource cache

On Thu, Jun 20, 2019 at 2:31 PM Andi Kleen <[email protected]> wrote:
>
> Dan Williams <[email protected]> writes:
> >
> > The underlying issue is that the x86-PAT implementation wants to
> > ensure that conflicting mappings are not set up for the same physical
> > address. This is mentioned in the developer manuals as problematic on
> > some cpus. Andi, are lookup_memtype() and track_pfn_insert() still
> > relevant?
>
> There have been discussions about it in the past, and the right answer
> will likely differ for different CPUs: But so far the official answer
> for Intel CPUs is that these caching conflicts should be avoided.
>

Ok.

> So I guess the cache in the original email makes sense for now.

I wouldn't go that far, but it does mean that if we go ahead with
caching the value as a dax_device property there should at least be a
debug option to assert that the device value conforms to all the other
mappings.

Another failing of the track_pfn_insert() and lookup_memtype()
implementation is that it makes it awkward to handle marking mappings
UC to prevent speculative consumption of poison. That is something
that is better handled, in my opinion, by asking the device for the
pgprot and coordinating shooting down any WB mappings of the same
physical page.

2019-07-16 22:04:10

by Andrew Morton

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:

> > ...and is constant for the life of the device and all subsequent mappings.
> >
> >> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >> see being done in quite a few cases), but I don’t know the code well enough
> >> to be certain that every vma should have a single protection and that it
> >> should not change afterwards.
> >
> > No, I'm thinking this would naturally fit as a property hanging off a
> > 'struct dax_device', and then create a version of vmf_insert_mixed()
> > and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> > saved value.
>
> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> some free time). I still think that patch 2/3 is beneficial, but based on
> your feedback, patch 3/3 should be dropped.

It has been a while. What should we do with

resource-fix-locking-in-find_next_iomem_res.patch
resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch

?

2019-07-16 22:08:21

by Nadav Amit

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

> On Jul 16, 2019, at 3:00 PM, Andrew Morton <[email protected]> wrote:
>
> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
>
>>> ...and is constant for the life of the device and all subsequent mappings.
>>>
>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>> to be certain that every vma should have a single protection and that it
>>>> should not change afterwards.
>>>
>>> No, I'm thinking this would naturally fit as a property hanging off a
>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>> saved value.
>>
>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>> some free time). I still think that patch 2/3 is beneficial, but based on
>> your feedback, patch 3/3 should be dropped.
>
> It has been a while. What should we do with
>
> resource-fix-locking-in-find_next_iomem_res.patch
> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
>
> ?

I didn’t get to follow Dan Williams’ advice. But both of the two patches are
fine in my opinion and should go upstream. The first one fixes a bug and the
second one improves performance considerably (and removes most of the
overhead). Future improvements can go on top of these patches and are not
expected to conflict.

So I think they should go upstream - the first one immediately, the second
one when possible.

2019-07-16 22:09:53

by Dan Williams

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <[email protected]> wrote:
>
> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
>
> > > ...and is constant for the life of the device and all subsequent mappings.
> > >
> > >> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> > >> see being done in quite a few cases), but I don’t know the code well enough
> > >> to be certain that every vma should have a single protection and that it
> > >> should not change afterwards.
> > >
> > > No, I'm thinking this would naturally fit as a property hanging off a
> > > 'struct dax_device', and then create a version of vmf_insert_mixed()
> > > and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> > > saved value.
> >
> > Thanks for the detailed explanation. I’ll give it a try (the moment I find
> > some free time). I still think that patch 2/3 is beneficial, but based on
> > your feedback, patch 3/3 should be dropped.
>
> It has been a while. What should we do with
>
> resource-fix-locking-in-find_next_iomem_res.patch

This one looks obviously correct to me, you can add:

Reviewed-by: Dan Williams <[email protected]>

> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch

This one is a good bug report that we need to go fix pgprot lookups
for dax, but I don't think we need to increase the trickiness of the
core resource lookup code in the meantime.

2019-07-16 22:14:31

by Nadav Amit

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

> On Jul 16, 2019, at 3:07 PM, Dan Williams <[email protected]> wrote:
>
> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <[email protected]> wrote:
>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
>>
>>>> ...and is constant for the life of the device and all subsequent mappings.
>>>>
>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>>> to be certain that every vma should have a single protection and that it
>>>>> should not change afterwards.
>>>>
>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>> saved value.
>>>
>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>> your feedback, patch 3/3 should be dropped.
>>
>> It has been a while. What should we do with
>>
>> resource-fix-locking-in-find_next_iomem_res.patch
>
> This one looks obviously correct to me, you can add:
>
> Reviewed-by: Dan Williams <[email protected]>
>
>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
>
> This one is a good bug report that we need to go fix pgprot lookups
> for dax, but I don't think we need to increase the trickiness of the
> core resource lookup code in the meantime.

I think that traversing big parts of the tree that are known to be
irrelevant is wasteful no matter what, and this code is used in other cases.

I don’t think the new code is so tricky - can you point to the part of the
code that you find tricky?


2019-07-16 22:23:39

by Dan Williams

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <[email protected]> wrote:
>
> > On Jul 16, 2019, at 3:07 PM, Dan Williams <[email protected]> wrote:
> >
> > On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <[email protected]> wrote:
> >> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
> >>
> >>>> ...and is constant for the life of the device and all subsequent mappings.
> >>>>
> >>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >>>>> see being done in quite a few cases), but I don’t know the code well enough
> >>>>> to be certain that every vma should have a single protection and that it
> >>>>> should not change afterwards.
> >>>>
> >>>> No, I'm thinking this would naturally fit as a property hanging off a
> >>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
> >>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> >>>> saved value.
> >>>
> >>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> >>> some free time). I still think that patch 2/3 is beneficial, but based on
> >>> your feedback, patch 3/3 should be dropped.
> >>
> >> It has been a while. What should we do with
> >>
> >> resource-fix-locking-in-find_next_iomem_res.patch
> >
> > This one looks obviously correct to me, you can add:
> >
> > Reviewed-by: Dan Williams <[email protected]>
> >
> >> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> >
> > This one is a good bug report that we need to go fix pgprot lookups
> > for dax, but I don't think we need to increase the trickiness of the
> > core resource lookup code in the meantime.
>
> I think that traversing big parts of the tree that are known to be
> irrelevant is wasteful no matter what, and this code is used in other cases.
>
> I don’t think the new code is so tricky - can you point to the part of the
> code that you find tricky?

Given dax can be updated to avoid this abuse of find_next_iomem_res(),
it was a general observation that the patch adds more lines than it
removes and is not strictly necessary. I'm ambivalent as to whether it
is worth pushing upstream. If anything the changelog is going to be
invalidated by a change to dax to avoid find_next_iomem_res(). Can you
update the changelog to be relevant outside of the dax case?

2019-07-16 22:30:54

by Nadav Amit

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

> On Jul 16, 2019, at 3:20 PM, Dan Williams <[email protected]> wrote:
>
> On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <[email protected]> wrote:
>>> On Jul 16, 2019, at 3:07 PM, Dan Williams <[email protected]> wrote:
>>>
>>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <[email protected]> wrote:
>>>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
>>>>
>>>>>> ...and is constant for the life of the device and all subsequent mappings.
>>>>>>
>>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
>>>>>>> see being done in quite a few cases), but I don’t know the code well enough
>>>>>>> to be certain that every vma should have a single protection and that it
>>>>>>> should not change afterwards.
>>>>>>
>>>>>> No, I'm thinking this would naturally fit as a property hanging off a
>>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
>>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
>>>>>> saved value.
>>>>>
>>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
>>>>> some free time). I still think that patch 2/3 is beneficial, but based on
>>>>> your feedback, patch 3/3 should be dropped.
>>>>
>>>> It has been a while. What should we do with
>>>>
>>>> resource-fix-locking-in-find_next_iomem_res.patch
>>>
>>> This one looks obviously correct to me, you can add:
>>>
>>> Reviewed-by: Dan Williams <[email protected]>
>>>
>>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
>>>
>>> This one is a good bug report that we need to go fix pgprot lookups
>>> for dax, but I don't think we need to increase the trickiness of the
>>> core resource lookup code in the meantime.
>>
>> I think that traversing big parts of the tree that are known to be
>> irrelevant is wasteful no matter what, and this code is used in other cases.
>>
>> I don’t think the new code is so tricky - can you point to the part of the
>> code that you find tricky?
>
> Given dax can be updated to avoid this abuse of find_next_iomem_res(),
> it was a general observation that the patch adds more lines than it
> removes and is not strictly necessary. I'm ambivalent as to whether it
> is worth pushing upstream. If anything the changelog is going to be
> invalidated by a change to dax to avoid find_next_iomem_res(). Can you
> update the changelog to be relevant outside of the dax case?

Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code
according to my calculations. :)

Having said that, if you think I might have made a mistake, or you are
concerned with some bug I might have caused, please let me know. I
understand that this logic might have been lying around for some time.

I can update the commit log, emphasizing the redundant search operations as
motivation and then mentioning dax as an instance that suffers from this
overhead, and say it should be handled regardless of this patch-set.
Once I find time, I am going to deal with DAX, unless you beat me to it.

Thanks,
Nadav

2019-07-16 22:48:10

by Dan Williams

Subject: Re: [PATCH 0/3] resource: find_next_iomem_res() improvements

On Tue, Jul 16, 2019 at 3:29 PM Nadav Amit <[email protected]> wrote:
>
> > On Jul 16, 2019, at 3:20 PM, Dan Williams <[email protected]> wrote:
> >
> > On Tue, Jul 16, 2019 at 3:13 PM Nadav Amit <[email protected]> wrote:
> >>> On Jul 16, 2019, at 3:07 PM, Dan Williams <[email protected]> wrote:
> >>>
> >>> On Tue, Jul 16, 2019 at 3:01 PM Andrew Morton <[email protected]> wrote:
> >>>> On Tue, 18 Jun 2019 21:56:43 +0000 Nadav Amit <[email protected]> wrote:
> >>>>
> >>>>>> ...and is constant for the life of the device and all subsequent mappings.
> >>>>>>
> >>>>>>> Perhaps you want to cache the cachability-mode in vma->vm_page_prot (which I
> >>>>>>> see being done in quite a few cases), but I don’t know the code well enough
> >>>>>>> to be certain that every vma should have a single protection and that it
> >>>>>>> should not change afterwards.
> >>>>>>
> >>>>>> No, I'm thinking this would naturally fit as a property hanging off a
> >>>>>> 'struct dax_device', and then create a version of vmf_insert_mixed()
> >>>>>> and vmf_insert_pfn_pmd() that bypass track_pfn_insert() to insert that
> >>>>>> saved value.
> >>>>>
> >>>>> Thanks for the detailed explanation. I’ll give it a try (the moment I find
> >>>>> some free time). I still think that patch 2/3 is beneficial, but based on
> >>>>> your feedback, patch 3/3 should be dropped.
> >>>>
> >>>> It has been a while. What should we do with
> >>>>
> >>>> resource-fix-locking-in-find_next_iomem_res.patch
> >>>
> >>> This one looks obviously correct to me, you can add:
> >>>
> >>> Reviewed-by: Dan Williams <[email protected]>
> >>>
> >>>> resource-avoid-unnecessary-lookups-in-find_next_iomem_res.patch
> >>>
> >>> This one is a good bug report that we need to go fix pgprot lookups
> >>> for dax, but I don't think we need to increase the trickiness of the
> >>> core resource lookup code in the meantime.
> >>
> >> I think that traversing big parts of the tree that are known to be
> >> irrelevant is wasteful no matter what, and this code is used in other cases.
> >>
> >> I don’t think the new code is so tricky - can you point to the part of the
> >> code that you find tricky?
> >
> > Given dax can be updated to avoid this abuse of find_next_iomem_res(),
> > it was a general observation that the patch adds more lines than it
> > removes and is not strictly necessary. I'm ambivalent as to whether it
> > is worth pushing upstream. If anything the changelog is going to be
> > invalidated by a change to dax to avoid find_next_iomem_res(). Can you
> > update the changelog to be relevant outside of the dax case?
>
> Well, 8 lines are comments, 4 are empty lines, so it adds 3 lines of code
> according to my calculations. :)
>
> Having said that, if you think I might have made a mistake, or you are
> concerned with some bug I might have caused, please let me know. I
> understand that this logic might have been lying around for some time.

Like I said, I'm ambivalent and not NAK'ing it. It looks ok, but at
the same time something is wrong if a hot path is constantly
re-looking up a resource. The fact that it shows up in profiles when
that happens could be considered useful.

> I can update the commit log, emphasizing the redundant search operations as
> motivation and then mentioning dax as an instance that suffers from this
> overhead, and say it should be handled regardless of this patch-set.
> Once I find time, I am going to deal with DAX, unless you beat me to it.

It turns out that the ability to ask the driver for pgprot bits is
useful above and beyond this performance optimization. For example I'm
looking to use it to support disabling speculation into pages with
known media errors by letting the driver consult its 'badblock' list.
There are also uses for passing the key-id for persistent memory
encrypted by MKTME.
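
A sketch of a driver-supplied pgprot along those lines (the helper
names here are hypothetical):

	static pgprot_t pmem_pgprot(struct dax_device *dax_dev, pfn_t pfn)
	{
		/*
		 * Map pages with known media errors uncached so that
		 * speculation cannot consume the poison.
		 */
		if (dax_pfn_is_badblock(dax_dev, pfn))	/* hypothetical */
			return pgprot_noncached(PAGE_KERNEL);

		return dax_dev->pgprot;			/* typically WB */
	}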