2024-06-05 00:36:48

by Andrii Nakryiko

Subject: [PATCH v3 0/9] ioctl()-based API to query VMAs from /proc/<pid>/maps

Implement a binary ioctl()-based interface to the /proc/<pid>/maps file to
allow applications to query VMA information more efficiently than reading
*all* VMAs nonselectively through the text-based /proc/<pid>/maps interface.

Patch #3 goes into a lot of detail and background on some common patterns of
using /proc/<pid>/maps in the area of performance profiling and subsequent
symbolization of captured stack traces. As mentioned in that patch, patterns
of VMA querying can differ depending on the specific use case, but they can
generally be grouped into two main categories: the need to query a small
subset of VMAs covering a given batch of addresses, or reading/storing/caching
all (typically executable) VMAs upfront for later processing.

The new PROCMAP_QUERY ioctl() API added in this patch set was motivated by the
former usage pattern. Patch #9 adds a tool that faithfully reproduces an
efficient VMA matching pass of a symbolizer, collecting the subset of covering
VMAs for a given set of addresses as efficiently as possible. This tool serves
both as a testing ground and as a benchmarking tool. It implements everything
both for the currently existing text-based /proc/<pid>/maps interface and for
the newly-added PROCMAP_QUERY ioctl().

Based on the discussion of the previous revision of this patch set, it turned
out that this ioctl() API is competitive even with the highly-optimized
text-based pre-processing pattern that the perf tool uses. Following that perf
discussion, this revision adds more flexibility in specifying a subset of VMAs
that are of interest. It's now possible to specify the desired permissions of
VMAs (e.g., request only executable ones) and/or restrict the query to only
file-backed VMAs. This further improves efficiency when using the new API
thanks to more selective (e.g., executable-only) querying.
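
To illustrate the intended usage, here's a minimal user-space sketch. It
assumes the UAPI struct and flag names added in patch #3, and elides error
handling:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h> /* struct procmap_query, PROCMAP_QUERY */

/* find executable, file-backed VMA covering addr (or the next one after it) */
static int query_exec_vma(int maps_fd, unsigned long addr,
			  struct procmap_query *q)
{
	memset(q, 0, sizeof(*q));
	q->size = sizeof(*q); /* for backwards/forwards compatibility */
	q->query_addr = addr;
	q->query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA |
			 PROCMAP_QUERY_VMA_EXECUTABLE |
			 PROCMAP_QUERY_FILE_BACKED_VMA;
	return ioctl(maps_fd, PROCMAP_QUERY, q); /* 0 on success, -1 + errno */
}

On success the kernel fills in q->vma_start, q->vma_end, and q->vma_offset
(plus VMA name and build ID, if requested), which is exactly the information
a symbolizer's VMA matching pass needs.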

In addition to the custom benchmarking tool from patch #9 and the experimental
perf integration (available at [0]), Daniel Mueller has since also implemented
an experimental integration into blazesym (see [1]), a library used for stack
trace symbolization by our server fleet-wide profiler and by another on-device
profiler agent that runs on weaker ARM devices. The latter ARM-based device
profiler is especially sensitive to performance, so we benchmarked and
compared the text-based /proc/<pid>/maps solution to the equivalent one using
the PROCMAP_QUERY ioctl().

Results are very encouraging, giving us a 5x improvement for the end-to-end
so-called "address normalization" pass, which is the part of the symbolization
process that happens locally on the ARM device before data is sent out for
further heavier-weight processing on a more powerful remote server. Note that
this is not an artificial microbenchmark: it's a full end-to-end API call
being measured with real-world data on a real-world device.

TEXT-BASED
==========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [49.777 µs 49.982 µs 50.250 µs]

IOCTL-BASED
===========
Benchmarking main/normalize_process_no_build_ids_uncached_maps
main/normalize_process_no_build_ids_uncached_maps
time: [10.328 µs 10.391 µs 10.457 µs]
change: [−79.453% −79.304% −79.166%] (p = 0.00 < 0.05)
Performance has improved.

As seen above, we get a drop from 50µs down to 10µs for exactly the same
amount of work, with the same data and target process.

Results for more synthetic benchmarks that hammer /proc/<pid>/maps processing
specifically can be found in patch #9. In short, we see about a 40x
improvement with our custom benchmark tool (the exact number varies with the
captured set of addresses; the previous revision used a different set, giving
about a 35x improvement). And even for the perf-based benchmark, the ioctl()
approach is on par or slightly ahead when using permission-based filtering
(fetching only executable VMAs).

Another big change since v1 is the use of an RCU-protected per-VMA lock during
querying, which mm folks requested in place of the mmap_lock-based protection
used by the text-based /proc/<pid>/maps implementation. For that, we added a
new internal API that is equivalent to find_vma(); see patch #1.

One thing that did not change is basing this new API on an ioctl() command
against the /proc/<pid>/maps file. An ioctl()-based API on top of pidfd was
considered, but it has its own downsides. Implementing ioctl() directly on
pidfd would cause access permission checks on every single ioctl(), which
leads to performance concerns and a potential flood of capable() audit
messages. It also prevents a nice pattern, possible with /proc/<pid>/maps, in
which an application opens its own /proc/self/maps FD (requiring no additional
capabilities) and passes this FD to a profiling agent for querying. To achieve
a similar pattern, a new file would have to be created from the pidfd just for
VMA querying, which is considered inferior to just querying a /proc/<pid>/maps
FD as proposed in the current approach. These aspects were discussed in the
hallway track at the recent LSF/MM/BPF 2024 conference, and sticking to a
procfs ioctl() was the final agreement we arrived at.
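
For reference, the FD-passing pattern mentioned above is a one-liner on the
target side (a sketch; the actual FD transfer, e.g. SCM_RIGHTS over a Unix
socket, is elided):

#include <fcntl.h>

/* in the target process; no extra capabilities required */
int maps_fd = open("/proc/self/maps", O_RDONLY);

/*
 * ... pass maps_fd to the profiling agent, which can then issue
 * ioctl(maps_fd, PROCMAP_QUERY, ...) against this process without
 * having to pass /proc/<pid>/maps access checks itself ...
 */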

This patch set is based on top of next-20240604 tag in linux-next tree.

Note: currently there is a race when using RCU-protected per-VMA locking,
which Liam is fixing. RFC patches can be found at [2]. This patch set can
proceed in parallel with that solution.

[0] https://github.com/anakryiko/linux/commits/procfs-proc-maps-ioctl-v2/
[1] https://github.com/libbpf/blazesym/pull/675
[2] https://lore.kernel.org/linux-mm/[email protected]/

v2->v3:
- drop mmap_lock aggressively under CONFIG_PER_VMA_LOCK (Liam);
- code massaging to abstract per-VMA vs mmap_lock differences (Liam);
v1->v2:
- per-VMA lock is used, if possible (Liam, Suren);
- added file-backed VMA querying (perf folks);
- added permission-based VMA querying (perf folks);
- split out build ID into separate patch (Suren);
- better documented API, added mention of ioctl() into procfs docs (Greg).

Andrii Nakryiko (9):
mm: add find_vma()-like API but RCU protected and taking VMA lock
fs/procfs: extract logic for getting VMA name constituents
fs/procfs: implement efficient VMA querying API for /proc/<pid>/maps
fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API
fs/procfs: add build ID fetching to PROCMAP_QUERY API
docs/procfs: call out ioctl()-based PROCMAP_QUERY command existence
tools: sync uapi/linux/fs.h header into tools subdir
selftests/bpf: make use of PROCMAP_QUERY ioctl if available
selftests/bpf: add simple benchmark tool for /proc/<pid>/maps APIs

Documentation/filesystems/proc.rst | 8 +
fs/proc/task_mmu.c | 412 +++++++++++++--
include/linux/mm.h | 8 +
include/uapi/linux/fs.h | 156 +++++-
mm/memory.c | 62 +++
tools/include/uapi/linux/fs.h | 550 ++++++++++++++++++++
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 2 +-
tools/testing/selftests/bpf/procfs_query.c | 386 ++++++++++++++
tools/testing/selftests/bpf/test_progs.c | 3 +
tools/testing/selftests/bpf/test_progs.h | 2 +
tools/testing/selftests/bpf/trace_helpers.c | 104 +++-
12 files changed, 1623 insertions(+), 71 deletions(-)
create mode 100644 tools/include/uapi/linux/fs.h
create mode 100644 tools/testing/selftests/bpf/procfs_query.c

--
2.43.0



2024-06-05 00:37:27

by Andrii Nakryiko

Subject: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

Existing lock_vma_under_rcu() API assumes exact VMA match, so it's not
a 100% equivalent of find_vma(). There are use cases that do want
find_vma() semantics of finding an exact VMA or the next one.

Also, it's important for such an API to let user distinguish between not
being able to get per-VMA lock and not having any VMAs at or after
provided address.

As such, this patch adds a new find_vma()-like API,
find_and_lock_vma_rcu(), which finds the exact or next VMA, attempts to
take the per-VMA lock, and, if that fails, returns ERR_PTR(-EBUSY). It
still returns NULL if there is no VMA at or after the address. In the
successful case it returns a valid, non-isolated VMA with the VMA lock
taken.

This API will be used in a subsequent patch in this set to implement
a new user-facing API for querying process VMAs.
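
For clarity, a hypothetical caller would distinguish the three possible
outcomes roughly like this:

	vma = find_and_lock_vma_rcu(mm, addr);
	if (IS_ERR(vma))
		return PTR_ERR(vma);	/* -EBUSY: per-VMA lock not taken */
	if (!vma)
		return -ENOENT;		/* no VMA at or after addr */
	/* ... vma is stable, non-isolated, and read-locked ... */
	vma_end_read(vma);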

Cc: Mike Rapoport <[email protected]>
Cc: Suren Baghdasaryan <[email protected]>
Cc: Liam Howlett <[email protected]>
Signed-off-by: Andrii Nakryiko <[email protected]>
---
include/linux/mm.h | 8 ++++++
mm/memory.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 70 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c41c82bcbec2..3ab52b7e124c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -776,6 +776,8 @@ static inline void assert_fault_locked(struct vm_fault *vmf)
mmap_assert_locked(vmf->vma->vm_mm);
}

+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address);
struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address);

@@ -790,6 +792,12 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma)
static inline void vma_mark_detached(struct vm_area_struct *vma,
bool detached) {}

+static inline struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ return ERR_PTR(-EOPNOTSUPP);
+}
+
static inline struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
unsigned long address)
{
diff --git a/mm/memory.c b/mm/memory.c
index eef4e482c0c2..c9517742bd6d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5913,6 +5913,68 @@ struct vm_area_struct *lock_mm_and_find_vma(struct mm_struct *mm,
#endif

#ifdef CONFIG_PER_VMA_LOCK
+/*
+ * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
+ * next VMA. Search is done under RCU protection, without taking or assuming
+ * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
+ *
+ * @mm: The mm_struct to check
+ * @addr: The address
+ *
+ * Returns: The VMA associated with addr, or the next VMA.
+ * May return %NULL in the case of no VMA at addr or above.
+ * If the VMA is being modified and can't be locked, -EBUSY is returned.
+ */
+struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
+ unsigned long address)
+{
+ MA_STATE(mas, &mm->mm_mt, address, address);
+ struct vm_area_struct *vma;
+ int err;
+
+ rcu_read_lock();
+retry:
+ vma = mas_find(&mas, ULONG_MAX);
+ if (!vma) {
+ err = 0; /* no VMA, return NULL */
+ goto inval;
+ }
+
+ if (!vma_start_read(vma)) {
+ err = -EBUSY;
+ goto inval;
+ }
+
+ /*
+ * Check since vm_start/vm_end might change before we lock the VMA.
+ * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
+ * address or the next one, so we only make sure VMA wasn't updated to
+ * end before the address.
+ */
+ if (unlikely(vma->vm_end <= address)) {
+ err = -EBUSY;
+ goto inval_end_read;
+ }
+
+ /* Check if the VMA got isolated after we found it */
+ if (vma->detached) {
+ vma_end_read(vma);
+ count_vm_vma_lock_event(VMA_LOCK_MISS);
+ /* The area was replaced with another one */
+ goto retry;
+ }
+
+ rcu_read_unlock();
+ return vma;
+
+inval_end_read:
+ vma_end_read(vma);
+inval:
+ rcu_read_unlock();
+ count_vm_vma_lock_event(VMA_LOCK_ABORT);
+ return ERR_PTR(err);
+}
+
/*
* Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be
* stable and not isolated. If the VMA is not found or is being modified the
--
2.43.0


2024-06-05 00:39:05

by Andrii Nakryiko

Subject: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

Attempt to use RCU-protected per-VMA lock when looking up requested VMA
as much as possible, only falling back to mmap_lock if per-VMA lock
failed. This is done so that querying of VMAs doesn't interfere with
other critical tasks, like page fault handling.

This has been suggested by mm folks, and we make use of a newly added
internal API that works like find_vma(), but tries to use per-VMA lock.

We have two sets of setup/query/teardown helper functions with different
implementations depending on availability of per-VMA lock (conditioned
on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.

When per-VMA lock is available, lookup is done under RCU, attempting to
take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
immediately. In this configuration mmap_lock is never held for long,
minimizing disruptions while querying.

When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
using find_vma() API, and then unlock mmap_lock at the very end once as
well. In this setup we avoid locking/unlocking mmap_lock on every looked
up VMA (depending on query parameters we might need to iterate a few of
them).

Signed-off-by: Andrii Nakryiko <[email protected]>
---
fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 46 insertions(+)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 614fbe5d0667..140032ffc551 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
PROCMAP_QUERY_VMA_FLAGS \
)

+#ifdef CONFIG_PER_VMA_LOCK
+static int query_vma_setup(struct mm_struct *mm)
+{
+ /* in the presence of per-VMA lock we don't need any setup/teardown */
+ return 0;
+}
+
+static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
+{
+ /* in the presence of per-VMA lock we need to unlock vma, if present */
+ if (vma)
+ vma_end_read(vma);
+}
+
+static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
+{
+ struct vm_area_struct *vma;
+
+ /* try to use less disruptive per-VMA lock */
+ vma = find_and_lock_vma_rcu(mm, addr);
+ if (IS_ERR(vma)) {
+ /* failed to take per-VMA lock, fallback to mmap_lock */
+ if (mmap_read_lock_killable(mm))
+ return ERR_PTR(-EINTR);
+
+ vma = find_vma(mm, addr);
+ if (vma) {
+ /*
+ * We cannot use vma_start_read() as it may fail due to
+ * false locked (see comment in vma_start_read()). We
+ * can avoid that by directly locking vm_lock under
+ * mmap_lock, which guarantees that nobody can lock the
+ * vma for write (vma_start_write()) under us.
+ */
+ down_read(&vma->vm_lock->lock);
+ }
+
+ mmap_read_unlock(mm);
+ }
+
+ return vma;
+}
+#else
static int query_vma_setup(struct mm_struct *mm)
{
return mmap_read_lock_killable(mm);
@@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
{
return find_vma(mm, addr);
}
+#endif

static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
unsigned long addr, u32 flags)
@@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
skip_vma:
/*
* If the user needs closest matching VMA, keep iterating.
+ * But before we proceed we might need to unlock current VMA.
*/
addr = vma->vm_end;
+ vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
goto next_vma;
no_vma:
--
2.43.0


2024-06-05 05:43:24

by Matthew Wilcox

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> +/*
> + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> + * next VMA. Search is done under RCU protection, without taking or assuming
> + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.

You know this is supposed to be the _short_ description, right?
Three lines is way too long. The full description goes between the
arguments and the Return: line.

> + * @mm: The mm_struct to check
> + * @addr: The address
> + *
> + * Returns: The VMA associated with addr, or the next VMA.
> + * May return %NULL in the case of no VMA at addr or above.
> + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> + */
> +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> + unsigned long address)
> +{
> + MA_STATE(mas, &mm->mm_mt, address, address);
> + struct vm_area_struct *vma;
> + int err;
> +
> + rcu_read_lock();
> +retry:
> + vma = mas_find(&mas, ULONG_MAX);
> + if (!vma) {
> + err = 0; /* no VMA, return NULL */
> + goto inval;
> + }
> +
> + if (!vma_start_read(vma)) {
> + err = -EBUSY;
> + goto inval;
> + }
> +
> + /*
> + * Check since vm_start/vm_end might change before we lock the VMA.
> + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> + * address or the next one, so we only make sure VMA wasn't updated to
> + * end before the address.
> + */
> + if (unlikely(vma->vm_end <= address)) {
> + err = -EBUSY;
> + goto inval_end_read;
> + }
> +
> + /* Check if the VMA got isolated after we found it */
> + if (vma->detached) {
> + vma_end_read(vma);
> + count_vm_vma_lock_event(VMA_LOCK_MISS);
> + /* The area was replaced with another one */

Surely you need to mas_reset() before you goto retry?

> + goto retry;
> + }

2024-06-05 13:44:43

by Liam R. Howlett

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

* Matthew Wilcox <[email protected]> [240604 20:57]:
> On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > +/*
> > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > + * next VMA. Search is done under RCU protection, without taking or assuming
> > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
>
> You know this is supposed to be the _short_ description, right?
> Three lines is way too long. The full description goes between the
> arguments and the Return: line.
>
> > + * @mm: The mm_struct to check
> > + * @addr: The address
> > + *
> > + * Returns: The VMA associated with addr, or the next VMA.
> > + * May return %NULL in the case of no VMA at addr or above.
> > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > + */
> > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > + unsigned long address)
> > +{
> > + MA_STATE(mas, &mm->mm_mt, address, address);
> > + struct vm_area_struct *vma;
> > + int err;
> > +
> > + rcu_read_lock();
> > +retry:
> > + vma = mas_find(&mas, ULONG_MAX);
> > + if (!vma) {
> > + err = 0; /* no VMA, return NULL */
> > + goto inval;
> > + }
> > +
> > + if (!vma_start_read(vma)) {
> > + err = -EBUSY;
> > + goto inval;
> > + }
> > +
> > + /*
> > + * Check since vm_start/vm_end might change before we lock the VMA.
> > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > + * address or the next one, so we only make sure VMA wasn't updated to
> > + * end before the address.
> > + */
> > + if (unlikely(vma->vm_end <= address)) {
> > + err = -EBUSY;
> > + goto inval_end_read;
> > + }
> > +
> > + /* Check if the VMA got isolated after we found it */
> > + if (vma->detached) {
> > + vma_end_read(vma);
> > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > + /* The area was replaced with another one */
>
> Surely you need to mas_reset() before you goto retry?

Probably more than that. We've found and may have adjusted the
index/last; we should reconfigure the maple state. You should probably
use mas_set(), which will reset the maple state and set the index and
last to address.


>
> > + goto retry;
> > + }
>

2024-06-05 16:14:25

by Andrii Nakryiko

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
>
> * Matthew Wilcox <[email protected]> [240604 20:57]:
> > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > +/*
> > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> >
> > You know this is supposed to be the _short_ description, right?
> > Three lines is way too long. The full description goes between the
> > arguments and the Return: line.

Sure, I'll adjust.

> >
> > > + * @mm: The mm_struct to check
> > > + * @addr: The address
> > > + *
> > > + * Returns: The VMA associated with addr, or the next VMA.
> > > + * May return %NULL in the case of no VMA at addr or above.
> > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > + */
> > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > + unsigned long address)
> > > +{
> > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > + struct vm_area_struct *vma;
> > > + int err;
> > > +
> > > + rcu_read_lock();
> > > +retry:
> > > + vma = mas_find(&mas, ULONG_MAX);
> > > + if (!vma) {
> > > + err = 0; /* no VMA, return NULL */
> > > + goto inval;
> > > + }
> > > +
> > > + if (!vma_start_read(vma)) {
> > > + err = -EBUSY;
> > > + goto inval;
> > > + }
> > > +
> > > + /*
> > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > + * end before the address.
> > > + */
> > > + if (unlikely(vma->vm_end <= address)) {
> > > + err = -EBUSY;
> > > + goto inval_end_read;
> > > + }
> > > +
> > > + /* Check if the VMA got isolated after we found it */
> > > + if (vma->detached) {
> > > + vma_end_read(vma);
> > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > + /* The area was replaced with another one */
> >
> > Surely you need to mas_reset() before you goto retry?
>
> Probably more than that. We've found and may have adjusted the
> index/last; we should reconfigure the maple state. You should probably
> use mas_set(), which will reset the maple state and set the index and
> last to address.

Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
address)` case, I presume we want to do the same, right? Basically, on
each retry start from the `address` unconditionally, no matter what's
the reason for retry.

>
>
> >
> > > + goto retry;
> > > + }
> >

2024-06-05 16:24:32

by Andrii Nakryiko

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> >
> > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > +/*
> > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > >
> > > You know this is supposed to be the _short_ description, right?
> > > Three lines is way too long. The full description goes between the
> > > arguments and the Return: line.
>
> Sure, I'll adjust.
>
> > >
> > > > + * @mm: The mm_struct to check
> > > > + * @addr: The address
> > > > + *
> > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > + */
> > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > + unsigned long address)
> > > > +{
> > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > + struct vm_area_struct *vma;
> > > > + int err;
> > > > +
> > > > + rcu_read_lock();
> > > > +retry:
> > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > + if (!vma) {
> > > > + err = 0; /* no VMA, return NULL */
> > > > + goto inval;
> > > > + }
> > > > +
> > > > + if (!vma_start_read(vma)) {
> > > > + err = -EBUSY;
> > > > + goto inval;
> > > > + }
> > > > +
> > > > + /*
> > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > + * end before the address.
> > > > + */
> > > > + if (unlikely(vma->vm_end <= address)) {
> > > > + err = -EBUSY;
> > > > + goto inval_end_read;
> > > > + }
> > > > +
> > > > + /* Check if the VMA got isolated after we found it */
> > > > + if (vma->detached) {
> > > > + vma_end_read(vma);
> > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > + /* The area was replaced with another one */
> > >
> > > Surely you need to mas_reset() before you goto retry?
> >
> > Probably more than that. We've found and may have adjusted the
> > index/last; we should reconfigure the maple state. You should probably
> > use mas_set(), which will reset the maple state and set the index and
> > last to address.
>
> Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> address)` case, I presume we want to do the same, right? Basically, on
> each retry start from the `address` unconditionally, no matter what's
> the reason for retry.

ah, never mind, we don't retry in that situation, I'll just put
`mas_set(&mas, address);` right before `goto retry;`. Unless we should
actually retry in the case when VMA got moved before the requested
address, not sure, let me know what you think. Presumably retrying
will allow us to get the correct VMA without the need to fall back to
mmap_lock?
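
In other words, the detached branch would become (a sketch of the fix
being discussed):

	/* Check if the VMA got isolated after we found it */
	if (vma->detached) {
		vma_end_read(vma);
		count_vm_vma_lock_event(VMA_LOCK_MISS);
		/* The area was replaced with another one */
		mas_set(&mas, address); /* restart the search from address */
		goto retry;
	}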

>
> >
> >
> > >
> > > > + goto retry;
> > > > + }
> > >

2024-06-05 16:27:43

by Andrii Nakryiko

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> > >
> > > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > +/*
> > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > >
> > > > You know this is supposed to be the _short_ description, right?
> > > > Three lines is way too long. The full description goes between the
> > > > arguments and the Return: line.
> >
> > Sure, I'll adjust.
> >
> > > >
> > > > > + * @mm: The mm_struct to check
> > > > > + * @addr: The address
> > > > > + *
> > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > + */
> > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > + unsigned long address)
> > > > > +{
> > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > + struct vm_area_struct *vma;
> > > > > + int err;
> > > > > +
> > > > > + rcu_read_lock();
> > > > > +retry:
> > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > + if (!vma) {
> > > > > + err = 0; /* no VMA, return NULL */
> > > > > + goto inval;
> > > > > + }
> > > > > +
> > > > > + if (!vma_start_read(vma)) {
> > > > > + err = -EBUSY;
> > > > > + goto inval;
> > > > > + }
> > > > > +
> > > > > + /*
> > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > + * end before the address.
> > > > > + */
> > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > + err = -EBUSY;
> > > > > + goto inval_end_read;
> > > > > + }
> > > > > +
> > > > > + /* Check if the VMA got isolated after we found it */
> > > > > + if (vma->detached) {
> > > > > + vma_end_read(vma);
> > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > + /* The area was replaced with another one */
> > > >
> > > > Surely you need to mas_reset() before you goto retry?
> > >
> > > Probably more than that. We've found and may have adjusted the
> > > index/last; we should reconfigure the maple state. You should probably
> > > use mas_set(), which will reset the maple state and set the index and
> > > last to address.
> >
> > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > address)` case, I presume we want to do the same, right? Basically, on
> > each retry start from the `address` unconditionally, no matter what's
> > the reason for retry.
>
> ah, never mind, we don't retry in that situation, I'll just put
> `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> actually retry in the case when VMA got moved before the requested
> address, not sure, let me know what you think. Presumably retrying
> will allow us to get the correct VMA without the need to fall back to
> mmap_lock?

sorry, one more question as I look some more around this (unfamiliar
to me) piece of code. I see that lock_vma_under_rcu counts
VMA_LOCK_MISS on retry, but I see that there is actually a
VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
Should I use MISS as well, or actually count a RETRY?

>
> >
> > >
> > >
> > > >
> > > > > + goto retry;
> > > > > + }
> > > >

2024-06-05 17:04:12

by Liam R. Howlett

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

* Andrii Nakryiko <[email protected]> [240605 12:27]:
> On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> > > >
> > > > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > +/*
> > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > >
> > > > > You know this is supposed to be the _short_ description, right?
> > > > > Three lines is way too long. The full description goes between the
> > > > > arguments and the Return: line.
> > >
> > > Sure, I'll adjust.
> > >
> > > > >
> > > > > > + * @mm: The mm_struct to check
> > > > > > + * @addr: The address
> > > > > > + *
> > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > + */
> > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > + unsigned long address)
> > > > > > +{
> > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > + struct vm_area_struct *vma;
> > > > > > + int err;
> > > > > > +
> > > > > > + rcu_read_lock();
> > > > > > +retry:
> > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > + if (!vma) {
> > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > + goto inval;
> > > > > > + }
> > > > > > +
> > > > > > + if (!vma_start_read(vma)) {
> > > > > > + err = -EBUSY;
> > > > > > + goto inval;
> > > > > > + }
> > > > > > +
> > > > > > + /*
> > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > + * end before the address.
> > > > > > + */
> > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > + err = -EBUSY;
> > > > > > + goto inval_end_read;
> > > > > > + }
> > > > > > +
> > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > + if (vma->detached) {
> > > > > > + vma_end_read(vma);
> > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > + /* The area was replaced with another one */
> > > > >
> > > > > Surely you need to mas_reset() before you goto retry?
> > > >
> > > > Probably more than that. We've found and may have adjusted the
> > > > index/last; we should reconfigure the maple state. You should probably
> > > > use mas_set(), which will reset the maple state and set the index and
> > > > last to address.
> > >
> > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > address)` case, I presume we want to do the same, right? Basically, on
> > > each retry start from the `address` unconditionally, no matter what's
> > > the reason for retry.
> >
> > ah, never mind, we don't retry in that situation, I'll just put
> > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > actually retry in the case when VMA got moved before the requested
> > address, not sure, let me know what you think. Presumably retrying
> > will allow us to get the correct VMA without the need to fall back to
> > mmap_lock?
>
> sorry, one more question as I look some more around this (unfamiliar
> to me) piece of code. I see that lock_vma_under_rcu counts
> VMA_LOCK_MISS on retry, but I see that there is actually a
> VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> Should I use MISS as well, or actually count a RETRY?
>

VMA_LOCK_MISS is used here because we missed the VMA due to a write
happening to move the vma (rather rare). The VMA_LOCK missed the vma.

VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
A retry is needed after the VMA_LOCK did not work under rcu locking.

Thanks,
Liam

2024-06-05 23:16:25

by Suren Baghdasaryan

Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
>
> Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> as much as possible, only falling back to mmap_lock if per-VMA lock
> failed. This is done so that querying of VMAs doesn't interfere with
> other critical tasks, like page fault handling.
>
> This has been suggested by mm folks, and we make use of a newly added
> internal API that works like find_vma(), but tries to use per-VMA lock.
>
> We have two sets of setup/query/teardown helper functions with different
> implementations depending on availability of per-VMA lock (conditioned
> on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
>
> When per-VMA lock is available, lookup is done under RCU, attempting to
> take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> immediately. In this configuration mmap_lock is never held for long,
> minimizing disruptions while querying.
>
> When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> using find_vma() API, and then unlock mmap_lock at the very end once as
> well. In this setup we avoid locking/unlocking mmap_lock on every looked
> up VMA (depending on query parameters we might need to iterate a few of
> them).
>
> Signed-off-by: Andrii Nakryiko <[email protected]>
> ---
> fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 46 insertions(+)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 614fbe5d0667..140032ffc551 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> PROCMAP_QUERY_VMA_FLAGS \
> )
>
> +#ifdef CONFIG_PER_VMA_LOCK
> +static int query_vma_setup(struct mm_struct *mm)
> +{
> + /* in the presence of per-VMA lock we don't need any setup/teardown */
> + return 0;
> +}
> +
> +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> +{
> + /* in the presence of per-VMA lock we need to unlock vma, if present */
> + if (vma)
> + vma_end_read(vma);
> +}
> +
> +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> +{
> + struct vm_area_struct *vma;
> +
> + /* try to use less disruptive per-VMA lock */
> + vma = find_and_lock_vma_rcu(mm, addr);
> + if (IS_ERR(vma)) {
> + /* failed to take per-VMA lock, fallback to mmap_lock */
> + if (mmap_read_lock_killable(mm))
> + return ERR_PTR(-EINTR);
> +
> + vma = find_vma(mm, addr);
> + if (vma) {
> + /*
> + * We cannot use vma_start_read() as it may fail due to
> + * false locked (see comment in vma_start_read()). We
> + * can avoid that by directly locking vm_lock under
> + * mmap_lock, which guarantees that nobody can lock the
> + * vma for write (vma_start_write()) under us.
> + */
> + down_read(&vma->vm_lock->lock);

Hi Andrii,
The above pattern of locking VMA under mmap_lock and then dropping
mmap_lock is becoming more common. Matthew had an RFC proposal for an
API to do this here:
https://lore.kernel.org/all/[email protected]/. It
might be worth reviving that discussion.

> + }
> +
> + mmap_read_unlock(mm);

Later on in your code you are calling get_vma_name() which might call
anon_vma_name() to retrieve user-defined VMA name. After this patch
this operation will be done without holding mmap_lock, however per
https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
this function has to be called with mmap_lock held for read. Indeed
with debug flags enabled you should hit this assertion:
https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.

> + }
> +
> + return vma;
> +}
> +#else
> static int query_vma_setup(struct mm_struct *mm)
> {
> return mmap_read_lock_killable(mm);
> @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> {
> return find_vma(mm, addr);
> }
> +#endif
>
> static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> unsigned long addr, u32 flags)
> @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> skip_vma:
> /*
> * If the user needs closest matching VMA, keep iterating.
> + * But before we proceed we might need to unlock current VMA.
> */
> addr = vma->vm_end;
> + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> goto next_vma;
> no_vma:
> --
> 2.43.0
>

2024-06-05 23:24:40

by Suren Baghdasaryan

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <[email protected]> wrote:
>
> * Andrii Nakryiko <[email protected]> [240605 12:27]:
> > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> > > > >
> > > > > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > +/*
> > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > >
> > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > Three lines is way too long. The full description goes between the
> > > > > > arguments and the Return: line.
> > > >
> > > > Sure, I'll adjust.
> > > >
> > > > > >
> > > > > > > + * @mm: The mm_struct to check
> > > > > > > + * @addr: The address
> > > > > > > + *
> > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > + */
> > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > + unsigned long address)
> > > > > > > +{
> > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > + struct vm_area_struct *vma;
> > > > > > > + int err;
> > > > > > > +
> > > > > > > + rcu_read_lock();
> > > > > > > +retry:
> > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > + if (!vma) {
> > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > + goto inval;
> > > > > > > + }
> > > > > > > +
> > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto inval;
> > > > > > > + }
> > > > > > > +
> > > > > > > + /*
> > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > + * end before the address.
> > > > > > > + */
> > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > + err = -EBUSY;
> > > > > > > + goto inval_end_read;
> > > > > > > + }
> > > > > > > +
> > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > + if (vma->detached) {
> > > > > > > + vma_end_read(vma);
> > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > + /* The area was replaced with another one */
> > > > > >
> > > > > > Surely you need to mas_reset() before you goto retry?
> > > > >
> > > > > Probably more than that. We've found and may have adjusted the
> > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > last to address.
> > > >
> > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > each retry start from the `address` unconditionally, no matter what's
> > > > the reason for retry.
> > >
> > > ah, never mind, we don't retry in that situation, I'll just put
> > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > actually retry in the case when VMA got moved before the requested
> > > address, not sure, let me know what you think. Presumably retrying
> > > will allow us to get the correct VMA without the need to fall back to
> > > mmap_lock?
> >
> > sorry, one more question as I look some more around this (unfamiliar
> > to me) piece of code. I see that lock_vma_under_rcu counts
> > VMA_LOCK_MISS on retry, but I see that there is actually a
> > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > Should I use MISS as well, or actually count a RETRY?
> >
>
> VMA_LOCK_MISS is used here because we missed the VMA due to a write
> happening to move the vma (rather rare). The VMA_LOCK missed the vma.
>
> VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> A retry is needed after the VMA_LOCK did not work under rcu locking.

Originally lock_vma_under_rcu() was used only inside page fault path,
so these counters helped us quantify how effective VMA locking is when
handling page faults. With more users of that function these counters
will be affected by other paths as well. I'm not sure but I think it
makes sense to use them only inside page fault path, IOW we should
probably move count_vm_vma_lock_event() calls outside of
lock_vma_under_rcu() and add them only when handling page faults.

>
> Thanks,
> Liam

2024-06-06 16:52:40

by Andrii Nakryiko

Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Wed, Jun 5, 2024 at 4:22 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <[email protected]> wrote:
> >
> > * Andrii Nakryiko <[email protected]> [240605 12:27]:
> > > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > > <[email protected]> wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> > > > > >
> > > > > > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > > +/*
> > > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > > >
> > > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > > Three lines is way too long. The full description goes between the
> > > > > > > arguments and the Return: line.
> > > > >
> > > > > Sure, I'll adjust.
> > > > >
> > > > > > >
> > > > > > > > + * @mm: The mm_struct to check
> > > > > > > > + * @addr: The address
> > > > > > > > + *
> > > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > > + */
> > > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > > + unsigned long address)
> > > > > > > > +{
> > > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > + int err;
> > > > > > > > +
> > > > > > > > + rcu_read_lock();
> > > > > > > > +retry:
> > > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > > + if (!vma) {
> > > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > > + goto inval;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > > + err = -EBUSY;
> > > > > > > > + goto inval;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + /*
> > > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > > + * end before the address.
> > > > > > > > + */
> > > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > > + err = -EBUSY;
> > > > > > > > + goto inval_end_read;
> > > > > > > > + }
> > > > > > > > +
> > > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > > + if (vma->detached) {
> > > > > > > > + vma_end_read(vma);
> > > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > > + /* The area was replaced with another one */
> > > > > > >
> > > > > > > Surely you need to mas_reset() before you goto retry?
> > > > > >
> > > > > > Probably more than that. We've found and may have adjusted the
> > > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > > last to address.
> > > > >
> > > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > > each retry start from the `address` unconditionally, no matter what's
> > > > > the reason for retry.
> > > >
> > > > ah, never mind, we don't retry in that situation, I'll just put
> > > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > > actually retry in the case when VMA got moved before the requested
> > > > address, not sure, let me know what you think. Presumably retrying
> > > > will allow us to get the correct VMA without the need to fall back to
> > > > mmap_lock?
> > >
> > > sorry, one more question as I look some more around this (unfamiliar
> > > to me) piece of code. I see that lock_vma_under_rcu counts
> > > VMA_LOCK_MISS on retry, but I see that there is actually a
> > > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > > Should I use MISS as well, or actually count a RETRY?
> > >
> >
> > VMA_LOCK_MISS is used here because we missed the VMA due to a write
> > happening to move the vma (rather rare). The VMA_LOCK missed the vma.
> >
> > VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> > A retry is needed after the VMA_LOCK did not work under rcu locking.
>
> Originally lock_vma_under_rcu() was used only inside page fault path,
> so these counters helped us quantify how effective VMA locking is when
> handling page faults. With more users of that function these counters
> will be affected by other paths as well. I'm not sure but I think it
> makes sense to use them only inside page fault path, IOW we should
> probably move count_vm_vma_lock_event() calls outside of
> lock_vma_under_rcu() and add them only when handling page faults.

Alright, seems like I should then just drop count_vm_vma_lock_event()
from the API I'm adding.

>
> >
> > Thanks,
> > Liam

2024-06-06 17:12:54

by Suren Baghdasaryan

Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > >
> > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > failed. This is done so that querying of VMAs doesn't interfere with
> > > other critical tasks, like page fault handling.
> > >
> > > This has been suggested by mm folks, and we make use of a newly added
> > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > >
> > > We have two sets of setup/query/teardown helper functions with different
> > > implementations depending on availability of per-VMA lock (conditioned
> > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > >
> > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > immediately. In this configuration mmap_lock is never held for long,
> > > minimizing disruptions while querying.
> > >
> > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > up VMA (depending on query parameters we might need to iterate a few of
> > > them).
> > >
> > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > ---
> > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 46 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 614fbe5d0667..140032ffc551 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > PROCMAP_QUERY_VMA_FLAGS \
> > > )
> > >
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +static int query_vma_setup(struct mm_struct *mm)
> > > +{
> > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > + return 0;
> > > +}
> > > +
> > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > + if (vma)
> > > + vma_end_read(vma);
> > > +}
> > > +
> > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + struct vm_area_struct *vma;
> > > +
> > > + /* try to use less disruptive per-VMA lock */
> > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > + if (IS_ERR(vma)) {
> > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > + if (mmap_read_lock_killable(mm))
> > > + return ERR_PTR(-EINTR);
> > > +
> > > + vma = find_vma(mm, addr);
> > > + if (vma) {
> > > + /*
> > > + * We cannot use vma_start_read() as it may fail due to
> > > + * false locked (see comment in vma_start_read()). We
> > > + * can avoid that by directly locking vm_lock under
> > > + * mmap_lock, which guarantees that nobody can lock the
> > > + * vma for write (vma_start_write()) under us.
> > > + */
> > > + down_read(&vma->vm_lock->lock);
> >
> > Hi Andrii,
> > The above pattern of locking VMA under mmap_lock and then dropping
> > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > API to do this here:
> > https://lore.kernel.org/all/[email protected]/. It
> > might be worth reviving that discussion.
>
> Sure, it would be nice to have generic and blessed primitives to use
> here. But the good news is that once this is all figured out by you mm
> folks, it should be easy to make use of those primitives here, right?
>
> >
> > > + }
> > > +
> > > + mmap_read_unlock(mm);
> >
> > Later on in your code you are calling get_vma_name() which might call
> > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > this operation will be done without holding mmap_lock, however per
> > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > this function has to be called with mmap_lock held for read. Indeed
> > with debug flags enabled you should hit this assertion:
> > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> Sigh... Ok, what's the suggestion then? Should it be some variant of
> mmap_assert_locked() || vma_assert_locked() logic, or it's not so
> simple?
>
> Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> all these gotchas are figured out for /proc/<pid>/maps anyway, and
> then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> APIs on top of whatever the final approach will end up being the right
> one?
>
> Liam, any objections to this? The whole point of this patch set is to
> add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> implementation is structured in a way that should be easily amenable
> to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> things that need to be figured for existing text-based
> /proc/<pid>/maps anyways, I think it would be best to use mmap_lock
> for now for this new API, and then adopt the same final
> CONFIG_PER_VMA_LOCK-aware solution.

I agree that you should start simple, using mmap_lock first and then
work on improvements. Would the proposed solution become useless with
coarse mmap_lock'ing?

>
> >
> > > + }
> > > +
> > > + return vma;
> > > +}
> > > +#else
> > > static int query_vma_setup(struct mm_struct *mm)
> > > {
> > > return mmap_read_lock_killable(mm);
> > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > > {
> > > return find_vma(mm, addr);
> > > }
> > > +#endif
> > >
> > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > unsigned long addr, u32 flags)
> > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > skip_vma:
> > > /*
> > > * If the user needs closest matching VMA, keep iterating.
> > > + * But before we proceed we might need to unlock current VMA.
> > > */
> > > addr = vma->vm_end;
> > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > > goto next_vma;
> > > no_vma:
> > > --
> > > 2.43.0
> > >

2024-06-06 17:16:27

by Andrii Nakryiko

Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> >
> > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > as much as possible, only falling back to mmap_lock if per-VMA lock
> > failed. This is done so that querying of VMAs doesn't interfere with
> > other critical tasks, like page fault handling.
> >
> > This has been suggested by mm folks, and we make use of a newly added
> > internal API that works like find_vma(), but tries to use per-VMA lock.
> >
> > We have two sets of setup/query/teardown helper functions with different
> > implementations depending on availability of per-VMA lock (conditioned
> > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> >
> > When per-VMA lock is available, lookup is done under RCU, attempting to
> > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > immediately. In this configuration mmap_lock is never held for long,
> > minimizing disruptions while querying.
> >
> > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > using find_vma() API, and then unlock mmap_lock at the very end once as
> > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > up VMA (depending on query parameters we might need to iterate a few of
> > them).
> >
> > Signed-off-by: Andrii Nakryiko <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 46 insertions(+)
> >
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index 614fbe5d0667..140032ffc551 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > PROCMAP_QUERY_VMA_FLAGS \
> > )
> >
> > +#ifdef CONFIG_PER_VMA_LOCK
> > +static int query_vma_setup(struct mm_struct *mm)
> > +{
> > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > + return 0;
> > +}
> > +
> > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > +{
> > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > + if (vma)
> > + vma_end_read(vma);
> > +}
> > +
> > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > +{
> > + struct vm_area_struct *vma;
> > +
> > + /* try to use less disruptive per-VMA lock */
> > + vma = find_and_lock_vma_rcu(mm, addr);
> > + if (IS_ERR(vma)) {
> > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > + if (mmap_read_lock_killable(mm))
> > + return ERR_PTR(-EINTR);
> > +
> > + vma = find_vma(mm, addr);
> > + if (vma) {
> > + /*
> > + * We cannot use vma_start_read() as it may fail due to
> > + * false locked (see comment in vma_start_read()). We
> > + * can avoid that by directly locking vm_lock under
> > + * mmap_lock, which guarantees that nobody can lock the
> > + * vma for write (vma_start_write()) under us.
> > + */
> > + down_read(&vma->vm_lock->lock);
>
> Hi Andrii,
> The above pattern of locking VMA under mmap_lock and then dropping
> mmap_lock is becoming more common. Matthew had an RFC proposal for an
> API to do this here:
> https://lore.kernel.org/all/[email protected]/. It
> might be worth reviving that discussion.

Sure, it would be nice to have generic and blessed primitives to use
here. But the good news is that once this is all figured out by you mm
folks, it should be easy to make use of those primitives here, right?
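
For illustration, the fallback pattern under discussion (read-lock the
VMA while holding mmap_lock, then drop mmap_lock) could be wrapped up
roughly as below. This is only a sketch distilled from the quoted v3
code; the helper name is hypothetical, not an existing kernel API:

static struct vm_area_struct *lock_vma_under_mmap_lock(struct mm_struct *mm,
							unsigned long addr)
{
	struct vm_area_struct *vma;

	if (mmap_read_lock_killable(mm))
		return ERR_PTR(-EINTR);

	vma = find_vma(mm, addr);
	if (vma) {
		/*
		 * Taking vm_lock directly while mmap_lock is held cannot
		 * race with vma_start_write(), so unlike vma_start_read()
		 * this read lock cannot spuriously fail.
		 */
		down_read(&vma->vm_lock->lock);
	}

	mmap_read_unlock(mm);
	return vma;
}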

>
> > + }
> > +
> > + mmap_read_unlock(mm);
>
> Later on in your code you are calling get_vma_name() which might call
> anon_vma_name() to retrieve user-defined VMA name. After this patch
> this operation will be done without holding mmap_lock, however per
> https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> this function has to be called with mmap_lock held for read. Indeed
> with debug flags enabled you should hit this assertion:
> https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.

Sigh... Ok, what's the suggestion then? Should it be some variant of
mmap_assert_locked() || vma_assert_locked() logic, or is it not so
simple?
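
A combined assertion along those lines could look roughly like the
following sketch; the helper name is made up here, and whether simply
accepting either lock is actually sufficient is exactly the open
question:

static inline void vma_assert_mm_or_vma_locked(struct vm_area_struct *vma)
{
	/* accept either the coarse mmap_lock or the per-VMA read lock */
	VM_BUG_ON_VMA(!rwsem_is_locked(&vma->vm_mm->mmap_lock) &&
		      !rwsem_is_locked(&vma->vm_lock->lock), vma);
}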

Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
all these gotchas are figured out for /proc/<pid>/maps anyway, and
then we can adapt both text-based and ioctl-based /proc/<pid>/maps
APIs on top of whatever approach ends up being the right one?

Liam, any objections to this? The whole point of this patch set is to
add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
implementation is structured in a way that should be easily amenable
to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
things that need to be figured out for the existing text-based
/proc/<pid>/maps anyway, I think it would be best to use mmap_lock
for now for this new API, and then adopt the same final
CONFIG_PER_VMA_LOCK-aware solution.

>
> > + }
> > +
> > + return vma;
> > +}
> > +#else
> > static int query_vma_setup(struct mm_struct *mm)
> > {
> > return mmap_read_lock_killable(mm);
> > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > {
> > return find_vma(mm, addr);
> > }
> > +#endif
> >
> > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > unsigned long addr, u32 flags)
> > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > skip_vma:
> > /*
> > * If the user needs closest matching VMA, keep iterating.
> > + * But before we proceed we might need to unlock current VMA.
> > */
> > addr = vma->vm_end;
> > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > goto next_vma;
> > no_vma:
> > --
> > 2.43.0
> >
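
To put the helpers quoted above in context, the ioctl entry point in
this series uses them roughly as follows. This is a simplified
reconstruction from the discussion, with error handling and attribute
copying abbreviated, not a verbatim excerpt from the patch:

static int do_procmap_query(struct mm_struct *mm, unsigned long addr, u32 flags)
{
	struct vm_area_struct *vma;
	int err;

	/* mmap_read_lock_killable() without CONFIG_PER_VMA_LOCK, no-op with it */
	err = query_vma_setup(mm);
	if (err)
		return err;

	vma = query_matching_vma(mm, addr, flags);
	if (IS_ERR(vma)) {
		err = PTR_ERR(vma);
		vma = NULL;
		goto out;
	}

	/* ... fill in vma_start/vma_end/flags/name and copy to user space ... */
	err = 0;
out:
	/* vma_end_read() with CONFIG_PER_VMA_LOCK, mmap_read_unlock() without */
	query_vma_teardown(mm, vma);
	return err;
}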

2024-06-06 17:21:23

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 1/9] mm: add find_vma()-like API but RCU protected and taking VMA lock

On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Jun 5, 2024 at 4:22 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 10:03 AM Liam R. Howlett <[email protected]> wrote:
> > >
> > > * Andrii Nakryiko <[email protected]> [240605 12:27]:
> > > > On Wed, Jun 5, 2024 at 9:24 AM Andrii Nakryiko
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Wed, Jun 5, 2024 at 9:13 AM Andrii Nakryiko
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Wed, Jun 5, 2024 at 6:33 AM Liam R. Howlett <[email protected]> wrote:
> > > > > > >
> > > > > > > * Matthew Wilcox <[email protected]> [240604 20:57]:
> > > > > > > > On Tue, Jun 04, 2024 at 05:24:46PM -0700, Andrii Nakryiko wrote:
> > > > > > > > > +/*
> > > > > > > > > + * find_and_lock_vma_rcu() - Find and lock the VMA for a given address, or the
> > > > > > > > > + * next VMA. Search is done under RCU protection, without taking or assuming
> > > > > > > > > + * mmap_lock. Returned VMA is guaranteed to be stable and not isolated.
> > > > > > > >
> > > > > > > > You know this is supposed to be the _short_ description, right?
> > > > > > > > Three lines is way too long. The full description goes between the
> > > > > > > > arguments and the Return: line.
> > > > > >
> > > > > > Sure, I'll adjust.
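
Following the kernel-doc convention Matthew describes (a one-line
summary, with the full description between the arguments and the
Return: line), the adjusted comment would look something like this;
the wording here is illustrative only:

/**
 * find_and_lock_vma_rcu() - Find and read-lock the VMA for an address.
 * @mm: The mm_struct to check
 * @addr: The address
 *
 * Search for the VMA covering @addr, or for the next VMA after it, under
 * RCU protection, without taking or assuming mmap_lock. The returned VMA
 * is guaranteed to be stable and not isolated.
 *
 * Returns: The VMA associated with @addr, or the next VMA.
 * May return %NULL in the case of no VMA at @addr or above.
 * If the VMA is being modified and can't be locked, -EBUSY is returned.
 */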
> > > > > >
> > > > > > > >
> > > > > > > > > + * @mm: The mm_struct to check
> > > > > > > > > + * @addr: The address
> > > > > > > > > + *
> > > > > > > > > + * Returns: The VMA associated with addr, or the next VMA.
> > > > > > > > > + * May return %NULL in the case of no VMA at addr or above.
> > > > > > > > > + * If the VMA is being modified and can't be locked, -EBUSY is returned.
> > > > > > > > > + */
> > > > > > > > > +struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
> > > > > > > > > + unsigned long address)
> > > > > > > > > +{
> > > > > > > > > + MA_STATE(mas, &mm->mm_mt, address, address);
> > > > > > > > > + struct vm_area_struct *vma;
> > > > > > > > > + int err;
> > > > > > > > > +
> > > > > > > > > + rcu_read_lock();
> > > > > > > > > +retry:
> > > > > > > > > + vma = mas_find(&mas, ULONG_MAX);
> > > > > > > > > + if (!vma) {
> > > > > > > > > + err = 0; /* no VMA, return NULL */
> > > > > > > > > + goto inval;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + if (!vma_start_read(vma)) {
> > > > > > > > > + err = -EBUSY;
> > > > > > > > > + goto inval;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + /*
> > > > > > > > > + * Check since vm_start/vm_end might change before we lock the VMA.
> > > > > > > > > + * Note, unlike lock_vma_under_rcu() we are searching for VMA covering
> > > > > > > > > + * address or the next one, so we only make sure VMA wasn't updated to
> > > > > > > > > + * end before the address.
> > > > > > > > > + */
> > > > > > > > > + if (unlikely(vma->vm_end <= address)) {
> > > > > > > > > + err = -EBUSY;
> > > > > > > > > + goto inval_end_read;
> > > > > > > > > + }
> > > > > > > > > +
> > > > > > > > > + /* Check if the VMA got isolated after we found it */
> > > > > > > > > + if (vma->detached) {
> > > > > > > > > + vma_end_read(vma);
> > > > > > > > > + count_vm_vma_lock_event(VMA_LOCK_MISS);
> > > > > > > > > + /* The area was replaced with another one */
> > > > > > > >
> > > > > > > > Surely you need to mas_reset() before you goto retry?
> > > > > > >
> > > > > > > Probably more than that. We've found and may have adjusted the
> > > > > > > index/last; we should reconfigure the maple state. You should probably
> > > > > > > use mas_set(), which will reset the maple state and set the index and
> > > > > > > last to address.
> > > > > >
> > > > > > Yep, makes sense, thanks. As for the `unlikely(vma->vm_end <=
> > > > > > address)` case, I presume we want to do the same, right? Basically, on
> > > > > > each retry start from the `address` unconditionally, no matter what's
> > > > > > the reason for retry.
> > > > >
> > > > > ah, never mind, we don't retry in that situation, I'll just put
> > > > > `mas_set(&mas, address);` right before `goto retry;`. Unless we should
> > > > > actually retry in the case when VMA got moved before the requested
> > > > > address, not sure, let me know what you think. Presumably retrying
> > > > > will allow us to get the correct VMA without the need to fall back to
> > > > > mmap_lock?
> > > >
> > > > sorry, one more question as I look some more around this (unfamiliar
> > > > to me) piece of code. I see that lock_vma_under_rcu counts
> > > > VMA_LOCK_MISS on retry, but I see that there is actually a
> > > > VMA_LOCK_RETRY stat as well. Any reason it's a MISS instead of RETRY?
> > > > Should I use MISS as well, or actually count a RETRY?
> > > >
> > >
> > > VMA_LOCK_MISS is used here because we missed the VMA due to a write
> > > happening to move the vma (rather rare): the per-VMA lock attempt missed the vma.
> > >
> > > VMA_LOCK_RETRY is used to indicate we need to retry under the mmap lock.
> > > A retry is needed after the per-VMA lock did not work under RCU locking.
> >
> > Originally lock_vma_under_rcu() was used only inside page fault path,
> > so these counters helped us quantify how effective VMA locking is when
> > handling page faults. With more users of that function these counters
> > will be affected by other paths as well. I'm not sure but I think it
> > makes sense to use them only inside page fault path, IOW we should
> > probably move count_vm_vma_lock_event() calls outside of
> > lock_vma_under_rcu() and add them only when handling page faults.
>
> Alright, seems like I should then just drop count_vm_vma_lock_event()
> from the API I'm adding.

That would be my preference but as I said, I'm not 100% sure about
this direction.

>
> >
> > >
> > > Thanks,
> > > Liam
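
Putting the suggestions from this sub-thread together, i.e. resetting
the maple state with mas_set() before retrying and leaving the
VMA_LOCK_* event counting to the page fault path, the lookup would end
up roughly as below. This is a sketch of where the discussion points,
based on the quoted v3 code, not the final patch:

struct vm_area_struct *find_and_lock_vma_rcu(struct mm_struct *mm,
					     unsigned long address)
{
	MA_STATE(mas, &mm->mm_mt, address, address);
	struct vm_area_struct *vma;
	int err;

	rcu_read_lock();
retry:
	vma = mas_find(&mas, ULONG_MAX);
	if (!vma) {
		err = 0; /* no VMA at or after address, return NULL */
		goto inval;
	}

	if (!vma_start_read(vma)) {
		err = -EBUSY; /* VMA is being modified, caller falls back to mmap_lock */
		goto inval;
	}

	/* VMA might have been moved to end before the requested address */
	if (unlikely(vma->vm_end <= address)) {
		err = -EBUSY;
		goto inval_end_read;
	}

	/* VMA was isolated after we found it: reset maple state and retry */
	if (vma->detached) {
		vma_end_read(vma);
		mas_set(&mas, address);
		goto retry;
	}

	rcu_read_unlock();
	return vma;

inval_end_read:
	vma_end_read(vma);
inval:
	rcu_read_unlock();
	return err ? ERR_PTR(err) : NULL;
}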

2024-06-06 17:37:19

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <[email protected]> wrote:
>
> * Andrii Nakryiko <[email protected]> [240606 12:52]:
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * false locked (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/[email protected]/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> The documentation on the first link says to hold the lock or take a
> reference, but then we assert the lock. If you take a reference to the
> anon vma name, then we will trigger the assert. Either the
> documentation needs changing or the assert is incorrect - or I'm missing
> something?

I think the documentation is correct. It says that at the time of
calling anon_vma_name() the mmap_lock should be locked (hence the
assertion). Then the user can raise the anon_vma_name refcount, drop
mmap_lock, and safely continue using the anon_vma_name object. IOW this is
fine:

mmap_read_lock(vma->mm);
anon_name = anon_vma_name(vma);
anon_vma_name_get(anon_name);
mmap_read_unlock(vma->mm);
// keep using anon_name
anon_vma_name_put(anon_name);


>
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever approach ends up being the right one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for the existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> The reason I was hoping to have the new interface use the per-vma
> locking from the start is to ensure the guarantees that we provide to
> the users would not change. We'd also avoid adding yet another
> mmap_lock user.
>
> I also didn't think it would complicate your series too much, so I
> understand why you want to revert to the old locking semantics. I'm
> fine with you continuing with the series on the old lock. Thanks for
> trying to make this work.
>
> Regards,
> Liam

2024-06-06 17:45:42

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

* Andrii Nakryiko <[email protected]> [240606 12:52]:
> On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > >
> > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > failed. This is done so that querying of VMAs doesn't interfere with
> > > other critical tasks, like page fault handling.
> > >
> > > This has been suggested by mm folks, and we make use of a newly added
> > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > >
> > > We have two sets of setup/query/teardown helper functions with different
> > > implementations depending on availability of per-VMA lock (conditioned
> > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > >
> > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > immediately. In this configuration mmap_lock is never held for long,
> > > minimizing disruptions while querying.
> > >
> > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > up VMA (depending on query parameters we might need to iterate a few of
> > > them).
> > >
> > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > ---
> > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 1 file changed, 46 insertions(+)
> > >
> > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > index 614fbe5d0667..140032ffc551 100644
> > > --- a/fs/proc/task_mmu.c
> > > +++ b/fs/proc/task_mmu.c
> > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > PROCMAP_QUERY_VMA_FLAGS \
> > > )
> > >
> > > +#ifdef CONFIG_PER_VMA_LOCK
> > > +static int query_vma_setup(struct mm_struct *mm)
> > > +{
> > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > + return 0;
> > > +}
> > > +
> > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > +{
> > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > + if (vma)
> > > + vma_end_read(vma);
> > > +}
> > > +
> > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > + struct vm_area_struct *vma;
> > > +
> > > + /* try to use less disruptive per-VMA lock */
> > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > + if (IS_ERR(vma)) {
> > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > + if (mmap_read_lock_killable(mm))
> > > + return ERR_PTR(-EINTR);
> > > +
> > > + vma = find_vma(mm, addr);
> > > + if (vma) {
> > > + /*
> > > + * We cannot use vma_start_read() as it may fail due to
> > > + * false locked (see comment in vma_start_read()). We
> > > + * can avoid that by directly locking vm_lock under
> > > + * mmap_lock, which guarantees that nobody can lock the
> > > + * vma for write (vma_start_write()) under us.
> > > + */
> > > + down_read(&vma->vm_lock->lock);
> >
> > Hi Andrii,
> > The above pattern of locking VMA under mmap_lock and then dropping
> > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > API to do this here:
> > https://lore.kernel.org/all/[email protected]/. It
> > might be worth reviving that discussion.
>
> Sure, it would be nice to have generic and blessed primitives to use
> here. But the good news is that once this is all figured out by you mm
> folks, it should be easy to make use of those primitives here, right?
>
> >
> > > + }
> > > +
> > > + mmap_read_unlock(mm);
> >
> > Later on in your code you are calling get_vma_name() which might call
> > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > this operation will be done without holding mmap_lock, however per
> > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > this function has to be called with mmap_lock held for read. Indeed
> > with debug flags enabled you should hit this assertion:
> > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.

The documentation on the first link says to hold the lock or take a
reference, but then we assert the lock. If you take a reference to the
anon vma name, then we will trigger the assert. Either the
documentation needs changing or the assert is incorrect - or I'm missing
something?

>
> Sigh... Ok, what's the suggestion then? Should it be some variant of
> mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> simple?
>
> Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> all these gotchas are figured out for /proc/<pid>/maps anyway, and
> then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> APIs on top of whatever approach ends up being the right one?
>
> Liam, any objections to this? The whole point of this patch set is to
> add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> implementation is structured in a way that should be easily amenable
> to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> things that need to be figured out for the existing text-based
> /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> for now for this new API, and then adopt the same final
> CONFIG_PER_VMA_LOCK-aware solution.

The reason I was hoping to have the new interface use the per-vma
locking from the start is to ensure the guarantees that we provide to
the users would not change. We'd also avoid adding yet another
mmap_lock user.

I also didn't think it would complicate your series too much, so I
understand why you want to revert to the old locking semantics. I'm
fine with you continuing with the series on the old lock. Thanks for
trying to make this work.

Regards,
Liam

2024-06-06 18:03:52

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Thu, Jun 6, 2024 at 10:12 AM Suren Baghdasaryan <[email protected]> wrote:
>
> On Thu, Jun 6, 2024 at 9:52 AM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * false locked (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/[email protected]/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever approach ends up being the right one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for the existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> I agree that you should start simple, using mmap_lock first and then
> work on improvements. Would the proposed solution become useless with
> coarse mmap_lock'ing?

Sorry, it's not clear what you mean by "proposed solution". If you
mean this new ioctl-based API, then no, it's still very useful and
fast even if we take mmap_lock.

But if you mean vm_lock, then I'd say that the anon_vma_name()
complication makes vm_lock not attractive anymore, because vma_name
will be requested pretty much always. And if we need to take mmap_lock
anyway, then what's the point of per-VMA lock, right?

I'd like to be a good citizen here and help you guys not add new
mmap_lock users (and adopt per-VMA lock more widely), but I'm not sure
I can solve the anon_vma_name() conundrum, unfortunately.

Ultimately, I do care the most about having this new API available for
my customers to take advantage of, of course.
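
For reference, from user space the locking discussion above is
invisible; a minimal usage sketch of the API this series adds,
assuming the struct layout and flag names proposed in this series
(they may still change before the final version):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* PROCMAP_QUERY and struct procmap_query */

static int query_one(int maps_fd, unsigned long addr)
{
	struct procmap_query q;

	memset(&q, 0, sizeof(q));
	q.size = sizeof(q);
	q.query_addr = addr;
	/* find the VMA covering addr, or the closest one after it */
	q.query_flags = PROCMAP_QUERY_COVERING_OR_NEXT_VMA;

	if (ioctl(maps_fd, PROCMAP_QUERY, &q) < 0)
		return -1;

	printf("vma: [%llx, %llx)\n",
	       (unsigned long long)q.vma_start,
	       (unsigned long long)q.vma_end);
	return 0;
}

/* usage: int fd = open("/proc/<pid>/maps", O_RDONLY); query_one(fd, addr); */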

>
> >
> > >
> > > > + }
> > > > +
> > > > + return vma;
> > > > +}
> > > > +#else
> > > > static int query_vma_setup(struct mm_struct *mm)
> > > > {
> > > > return mmap_read_lock_killable(mm);
> > > > @@ -402,6 +445,7 @@ static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsig
> > > > {
> > > > return find_vma(mm, addr);
> > > > }
> > > > +#endif
> > > >
> > > > static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > > unsigned long addr, u32 flags)
> > > > @@ -441,8 +485,10 @@ static struct vm_area_struct *query_matching_vma(struct mm_struct *mm,
> > > > skip_vma:
> > > > /*
> > > > * If the user needs closest matching VMA, keep iterating.
> > > > + * But before we proceed we might need to unlock current VMA.
> > > > */
> > > > addr = vma->vm_end;
> > > > + vma_end_read(vma); /* no-op under !CONFIG_PER_VMA_LOCK */
> > > > if (flags & PROCMAP_QUERY_COVERING_OR_NEXT_VMA)
> > > > goto next_vma;
> > > > no_vma:
> > > > --
> > > > 2.43.0
> > > >

2024-06-06 18:10:00

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <[email protected]> wrote:
>
> * Andrii Nakryiko <[email protected]> [240606 12:52]:
> > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> > >
> > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > > >
> > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > other critical tasks, like page fault handling.
> > > >
> > > > This has been suggested by mm folks, and we make use of a newly added
> > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > >
> > > > We have two sets of setup/query/teardown helper functions with different
> > > > implementations depending on availability of per-VMA lock (conditioned
> > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > >
> > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > immediately. In this configuration mmap_lock is never held for long,
> > > > minimizing disruptions while querying.
> > > >
> > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > them).
> > > >
> > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > > ---
> > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 46 insertions(+)
> > > >
> > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > index 614fbe5d0667..140032ffc551 100644
> > > > --- a/fs/proc/task_mmu.c
> > > > +++ b/fs/proc/task_mmu.c
> > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > )
> > > >
> > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > +{
> > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > +{
> > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > + if (vma)
> > > > + vma_end_read(vma);
> > > > +}
> > > > +
> > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > +{
> > > > + struct vm_area_struct *vma;
> > > > +
> > > > + /* try to use less disruptive per-VMA lock */
> > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > + if (IS_ERR(vma)) {
> > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > + if (mmap_read_lock_killable(mm))
> > > > + return ERR_PTR(-EINTR);
> > > > +
> > > > + vma = find_vma(mm, addr);
> > > > + if (vma) {
> > > > + /*
> > > > + * We cannot use vma_start_read() as it may fail due to
> > > > + * false locked (see comment in vma_start_read()). We
> > > > + * can avoid that by directly locking vm_lock under
> > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > + * vma for write (vma_start_write()) under us.
> > > > + */
> > > > + down_read(&vma->vm_lock->lock);
> > >
> > > Hi Andrii,
> > > The above pattern of locking VMA under mmap_lock and then dropping
> > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > API to do this here:
> > > https://lore.kernel.org/all/[email protected]/. It
> > > might be worth reviving that discussion.
> >
> > Sure, it would be nice to have generic and blessed primitives to use
> > here. But the good news is that once this is all figured out by you mm
> > folks, it should be easy to make use of those primitives here, right?
> >
> > >
> > > > + }
> > > > +
> > > > + mmap_read_unlock(mm);
> > >
> > > Later on in your code you are calling get_vma_name() which might call
> > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > this operation will be done without holding mmap_lock, however per
> > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > this function has to be called with mmap_lock held for read. Indeed
> > > with debug flags enabled you should hit this assertion:
> > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
>
> The documentation on the first link says to hold the lock or take a
> reference, but then we assert the lock. If you take a reference to the
> anon vma name, then we will trigger the assert. Either the
> documentation needs changing or the assert is incorrect - or I'm missing
> something?
>
> >
> > Sigh... Ok, what's the suggestion then? Should it be some variant of
> > mmap_assert_locked() || vma_assert_locked() logic, or is it not so
> > simple?
> >
> > Maybe I should just drop the CONFIG_PER_VMA_LOCK changes for now until
> > all these gotchas are figured out for /proc/<pid>/maps anyway, and
> > then we can adapt both text-based and ioctl-based /proc/<pid>/maps
> > APIs on top of whatever approach ends up being the right one?
> >
> > Liam, any objections to this? The whole point of this patch set is to
> > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > implementation is structured in a way that should be easily amenable
> > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > things that need to be figured out for the existing text-based
> > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > for now for this new API, and then adopt the same final
> > CONFIG_PER_VMA_LOCK-aware solution.
>
> The reason I was hoping to have the new interface use the per-vma
> locking from the start is to ensure the guarantees that we provide to
> the users would not change. We'd also avoid adding yet another
> mmap_lock user.
>

Yep, it's completely understandable. And you see that I changed the
structure quite a lot to abstract away mmap_lock vs vm_lock details.
I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and it
seems like it should be addressed first, but I'm just not qualified
enough to do this.

> I also didn't think it would complicate your series too much, so I
> understand why you want to revert to the old locking semantics. I'm
> fine with you continuing with the series on the old lock. Thanks for
> trying to make this work.
>

I'm happy to keep the existing structure of the code, and
(intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate
patches, so it's easy to do. I'd love to help adopt a per-VMA lock
once all the pieces are figured out. Hopefully anon_vma_name() is the
last one remaining :) So please keep me cc'ed on relevant patches.

As I mentioned, I just don't feel like I would be able to solve the
anon_vma_name() problem, but of course I wouldn't want to be
completely blocked by it either.

> Regards,
> Liam

2024-06-06 18:10:04

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

* Suren Baghdasaryan <[email protected]> [240606 13:33]:
> On Thu, Jun 6, 2024 at 10:15 AM Liam R. Howlett <[email protected]> wrote:
> >
> > * Andrii Nakryiko <[email protected]> [240606 12:52]:
> > > On Wed, Jun 5, 2024 at 4:16 PM Suren Baghdasaryan <[email protected]> wrote:
> > > >
> > > > On Tue, Jun 4, 2024 at 5:25 PM Andrii Nakryiko <[email protected]> wrote:
> > > > >
> > > > > Attempt to use RCU-protected per-VMA lock when looking up requested VMA
> > > > > as much as possible, only falling back to mmap_lock if per-VMA lock
> > > > > failed. This is done so that querying of VMAs doesn't interfere with
> > > > > other critical tasks, like page fault handling.
> > > > >
> > > > > This has been suggested by mm folks, and we make use of a newly added
> > > > > internal API that works like find_vma(), but tries to use per-VMA lock.
> > > > >
> > > > > We have two sets of setup/query/teardown helper functions with different
> > > > > implementations depending on availability of per-VMA lock (conditioned
> > > > > on CONFIG_PER_VMA_LOCK) to abstract per-VMA lock subtleties.
> > > > >
> > > > > When per-VMA lock is available, lookup is done under RCU, attempting to
> > > > > take a per-VMA lock. If that fails, we fall back to mmap_lock, but then
> > > > > proceed to unconditionally grab per-VMA lock again, dropping mmap_lock
> > > > > immediately. In this configuration mmap_lock is never held for long,
> > > > > minimizing disruptions while querying.
> > > > >
> > > > > When per-VMA lock is compiled out, we take mmap_lock once, query VMAs
> > > > > using find_vma() API, and then unlock mmap_lock at the very end once as
> > > > > well. In this setup we avoid locking/unlocking mmap_lock on every looked
> > > > > up VMA (depending on query parameters we might need to iterate a few of
> > > > > them).
> > > > >
> > > > > Signed-off-by: Andrii Nakryiko <[email protected]>
> > > > > ---
> > > > > fs/proc/task_mmu.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > > 1 file changed, 46 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > > > > index 614fbe5d0667..140032ffc551 100644
> > > > > --- a/fs/proc/task_mmu.c
> > > > > +++ b/fs/proc/task_mmu.c
> > > > > @@ -388,6 +388,49 @@ static int pid_maps_open(struct inode *inode, struct file *file)
> > > > > PROCMAP_QUERY_VMA_FLAGS \
> > > > > )
> > > > >
> > > > > +#ifdef CONFIG_PER_VMA_LOCK
> > > > > +static int query_vma_setup(struct mm_struct *mm)
> > > > > +{
> > > > > + /* in the presence of per-VMA lock we don't need any setup/teardown */
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void query_vma_teardown(struct mm_struct *mm, struct vm_area_struct *vma)
> > > > > +{
> > > > > + /* in the presence of per-VMA lock we need to unlock vma, if present */
> > > > > + if (vma)
> > > > > + vma_end_read(vma);
> > > > > +}
> > > > > +
> > > > > +static struct vm_area_struct *query_vma_find_by_addr(struct mm_struct *mm, unsigned long addr)
> > > > > +{
> > > > > + struct vm_area_struct *vma;
> > > > > +
> > > > > + /* try to use less disruptive per-VMA lock */
> > > > > + vma = find_and_lock_vma_rcu(mm, addr);
> > > > > + if (IS_ERR(vma)) {
> > > > > + /* failed to take per-VMA lock, fallback to mmap_lock */
> > > > > + if (mmap_read_lock_killable(mm))
> > > > > + return ERR_PTR(-EINTR);
> > > > > +
> > > > > + vma = find_vma(mm, addr);
> > > > > + if (vma) {
> > > > > + /*
> > > > > + * We cannot use vma_start_read() as it may fail due to
> > > > > + * false locked (see comment in vma_start_read()). We
> > > > > + * can avoid that by directly locking vm_lock under
> > > > > + * mmap_lock, which guarantees that nobody can lock the
> > > > > + * vma for write (vma_start_write()) under us.
> > > > > + */
> > > > > + down_read(&vma->vm_lock->lock);
> > > >
> > > > Hi Andrii,
> > > > The above pattern of locking VMA under mmap_lock and then dropping
> > > > mmap_lock is becoming more common. Matthew had an RFC proposal for an
> > > > API to do this here:
> > > > https://lore.kernel.org/all/[email protected]/. It
> > > > might be worth reviving that discussion.
> > >
> > > Sure, it would be nice to have generic and blessed primitives to use
> > > here. But the good news is that once this is all figured out by you mm
> > > folks, it should be easy to make use of those primitives here, right?
> > >
> > > >
> > > > > + }
> > > > > +
> > > > > + mmap_read_unlock(mm);
> > > >
> > > > Later on in your code you are calling get_vma_name() which might call
> > > > anon_vma_name() to retrieve user-defined VMA name. After this patch
> > > > this operation will be done without holding mmap_lock, however per
> > > > https://elixir.bootlin.com/linux/latest/source/include/linux/mm_types.h#L582
> > > > this function has to be called with mmap_lock held for read. Indeed
> > > > with debug flags enabled you should hit this assertion:
> > > > https://elixir.bootlin.com/linux/latest/source/mm/madvise.c#L96.
> >
> > The documentation on the first link says to hold the lock or take a
> > reference, but then we assert the lock. If you take a reference to the
> > anon vma name, then we will trigger the assert. Either the
> > documentation needs changing or the assert is incorrect - or I'm missing
> > something?
>
> I think the documentation is correct. It says that at the time of
> calling anon_vma_name() the mmap_lock should be locked (hence the
> assertion). Then the user can raise the anon_vma_name refcount, drop
> mmap_lock, and safely continue using the anon_vma_name object. IOW this is
> fine:
>
> mmap_read_lock(vma->mm);
> anon_name = anon_vma_name(vma);
> anon_vma_name_get(anon_name);
> mmap_read_unlock(vma->mm);
> // keep using anon_name
> anon_vma_name_put(anon_name);
>

I consider that an optimistic view of what will happen, but okay.


2024-06-06 18:33:25

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v3 4/9] fs/procfs: use per-VMA RCU-protected locking in PROCMAP_QUERY API

* Andrii Nakryiko <[email protected]> [240606 14:09]:

...

> > > Liam, any objections to this? The whole point of this patch set is to
> > > add a new API, not all the CONFIG_PER_VMA_LOCK gotchas. My
> > > implementation is structured in a way that should be easily amenable
> > > to CONFIG_PER_VMA_LOCK changes, but if there are a few more subtle
> > > things that need to be figured out for the existing text-based
> > > /proc/<pid>/maps anyway, I think it would be best to use mmap_lock
> > > for now for this new API, and then adopt the same final
> > > CONFIG_PER_VMA_LOCK-aware solution.
> >
> > The reason I was hoping to have the new interface use the per-vma
> > locking from the start is to ensure the guarantees that we provide to
> > the users would not change. We'd also avoid adding yet another
> > mmap_lock user.
> >
>
> Yep, it's completely understandable. And you see that I changed the
> structure quite a lot to abstract away mmap_lock vs vm_lock details.
> I'm afraid anon_vma_name() is quite an obstacle, unfortunately, and it
> seems like it should be addressed first, but I'm just not qualified
> enough to do this.
>
> > I also didn't think it would complicate your series too much, so I
> > understand why you want to revert to the old locking semantics. I'm
> > fine with you continuing with the series on the old lock. Thanks for
> > trying to make this work.
> >
>
> I'm happy to keep the existing structure of the code, and
> (intentionally) all the CONFIG_PER_VMA_LOCK logic is in separate
> patches, so it's easy to do. I'd love to help adopt a per-VMA lock
> once all the pieces are figured out. Hopefully anon_vma_name() is the
> last one remaining :) So please keep me cc'ed on relevant patches.
>
> As I mentioned, I just don't feel like I would be able to solve the
> anon_vma_name() problem, but of course I wouldn't want to be
> completely blocked by it either.
>

Absolutely. Thanks for trying. To be clear, I'm fine with you dropping
the per-vma locking from this interface as well.

Thanks,
Liam