2024-03-21 14:57:53

by Jonathan Haslam

Subject: [PATCH] uprobes: reduce contention on uprobes_tree access

Active uprobes are stored in an RB tree and accesses to this tree are
dominated by read operations. Currently these accesses are serialized by
a spinlock but this leads to enormous contention when large numbers of
threads are executing active probes.

This patch converts the spinlock used to serialize access to the
uprobes_tree RB tree into a reader-writer spinlock. This lock type
aligns naturally with the overwhelmingly read-only nature of the tree
usage here. Although the addition of reader-writer spinlocks is
discouraged [0], this fix is proposed as an interim solution while an
RCU-based approach is implemented (that work is in a nascent form). This
fix also has the benefit of being trivial, self-contained and therefore
simple to backport.

This change has been tested against production workloads that exhibit
significant contention on the spinlock, and a reduction of almost an order
of magnitude in mean uprobe execution time is observed (28 -> 3.5 microsecs).

[0] https://docs.kernel.org/locking/spinlocks.html

Signed-off-by: Jonathan Haslam <[email protected]>
---
kernel/events/uprobes.c | 22 +++++++++++-----------
1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 929e98c62965..42bf9b6e8bc0 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
*/
#define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)

-static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
+static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */

#define UPROBES_HASH_SZ 13
/* serialize uprobe->pending_list */
@@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
struct uprobe *uprobe;

- spin_lock(&uprobes_treelock);
+ read_lock(&uprobes_treelock);
uprobe = __find_uprobe(inode, offset);
- spin_unlock(&uprobes_treelock);
+ read_unlock(&uprobes_treelock);

return uprobe;
}
@@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
struct uprobe *u;

- spin_lock(&uprobes_treelock);
+ write_lock(&uprobes_treelock);
u = __insert_uprobe(uprobe);
- spin_unlock(&uprobes_treelock);
+ write_unlock(&uprobes_treelock);

return u;
}
@@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
if (WARN_ON(!uprobe_is_active(uprobe)))
return;

- spin_lock(&uprobes_treelock);
+ write_lock(&uprobes_treelock);
rb_erase(&uprobe->rb_node, &uprobes_tree);
- spin_unlock(&uprobes_treelock);
+ write_unlock(&uprobes_treelock);
RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
put_uprobe(uprobe);
}
@@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
min = vaddr_to_offset(vma, start);
max = min + (end - start) - 1;

- spin_lock(&uprobes_treelock);
+ read_lock(&uprobes_treelock);
n = find_node_in_range(inode, min, max);
if (n) {
for (t = n; t; t = rb_prev(t)) {
@@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
get_uprobe(u);
}
}
- spin_unlock(&uprobes_treelock);
+ read_unlock(&uprobes_treelock);
}

/* @vma contains reference counter, not the probed instruction. */
@@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
min = vaddr_to_offset(vma, start);
max = min + (end - start) - 1;

- spin_lock(&uprobes_treelock);
+ read_lock(&uprobes_treelock);
n = find_node_in_range(inode, min, max);
- spin_unlock(&uprobes_treelock);
+ read_unlock(&uprobes_treelock);

return !!n;
}
--
2.43.0



2024-03-21 16:18:45

by Andrii Nakryiko

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Thu, Mar 21, 2024 at 7:57 AM Jonathan Haslam
<[email protected]> wrote:
>
> Active uprobes are stored in an RB tree and accesses to this tree are
> dominated by read operations. Currently these accesses are serialized by
> a spinlock but this leads to enormous contention when large numbers of
> threads are executing active probes.
>
> This patch converts the spinlock used to serialize access to the
> uprobes_tree RB tree into a reader-writer spinlock. This lock type
> aligns naturally with the overwhelmingly read-only nature of the tree
> usage here. Although the addition of reader-writer spinlocks are
> discouraged [0], this fix is proposed as an interim solution while an
> RCU based approach is implemented (that work is in a nascent form). This
> fix also has the benefit of being trivial, self contained and therefore
> simple to backport.

Yep, makes sense, I think we'll want to backport this ASAP to some of
the old kernels we have. Thanks!

Acked-by: Andrii Nakryiko <[email protected]>

>
> This change has been tested against production workloads that exhibit
> significant contention on the spinlock and an almost order of magnitude
> reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> [0] https://docs.kernel.org/locking/spinlocks.html
>
> Signed-off-by: Jonathan Haslam <[email protected]>
> ---
> kernel/events/uprobes.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 929e98c62965..42bf9b6e8bc0 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> */
> #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
>
> -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
>
> #define UPROBES_HASH_SZ 13
> /* serialize uprobe->pending_list */
> @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> {
> struct uprobe *uprobe;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> uprobe = __find_uprobe(inode, offset);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>
> return uprobe;
> }
> @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> {
> struct uprobe *u;
>
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
> u = __insert_uprobe(uprobe);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>
> return u;
> }
> @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> if (WARN_ON(!uprobe_is_active(uprobe)))
> return;
>
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
> RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> put_uprobe(uprobe);
> }
> @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> if (n) {
> for (t = n; t; t = rb_prev(t)) {
> @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> get_uprobe(u);
> }
> }
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
> }
>
> /* @vma contains reference counter, not the probed instruction. */
> @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>
> return !!n;
> }
> --
> 2.43.0
>

2024-03-24 03:29:12

by Ingo Molnar

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access


* Jonathan Haslam <[email protected]> wrote:

> Active uprobes are stored in an RB tree and accesses to this tree are
> dominated by read operations. Currently these accesses are serialized by
> a spinlock but this leads to enormous contention when large numbers of
> threads are executing active probes.
>
> This patch converts the spinlock used to serialize access to the
> uprobes_tree RB tree into a reader-writer spinlock. This lock type
> aligns naturally with the overwhelmingly read-only nature of the tree
> usage here. Although the addition of reader-writer spinlocks are
> discouraged [0], this fix is proposed as an interim solution while an
> RCU based approach is implemented (that work is in a nascent form). This
> fix also has the benefit of being trivial, self contained and therefore
> simple to backport.
>
> This change has been tested against production workloads that exhibit
> significant contention on the spinlock and an almost order of magnitude
> reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).

Have you considered/measured per-CPU RW semaphores?

Thanks,

Ingo

2024-03-25 11:35:54

by Masami Hiramatsu

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Thu, 21 Mar 2024 07:57:35 -0700
Jonathan Haslam <[email protected]> wrote:

> Active uprobes are stored in an RB tree and accesses to this tree are
> dominated by read operations. Currently these accesses are serialized by
> a spinlock but this leads to enormous contention when large numbers of
> threads are executing active probes.
>
> This patch converts the spinlock used to serialize access to the
> uprobes_tree RB tree into a reader-writer spinlock. This lock type
> aligns naturally with the overwhelmingly read-only nature of the tree
> usage here. Although the addition of reader-writer spinlocks are
> discouraged [0], this fix is proposed as an interim solution while an
> RCU based approach is implemented (that work is in a nascent form). This
> fix also has the benefit of being trivial, self contained and therefore
> simple to backport.
>
> This change has been tested against production workloads that exhibit
> significant contention on the spinlock and an almost order of magnitude
> reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).

Looks good to me.

Acked-by: Masami Hiramatsu (Google) <[email protected]>

BTW, how did you measure the overhead? I think spinlock overhead
will depend on how much lock contention happens.

Thank you,

>
> [0] https://docs.kernel.org/locking/spinlocks.html
>
> Signed-off-by: Jonathan Haslam <[email protected]>
> ---
> kernel/events/uprobes.c | 22 +++++++++++-----------
> 1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 929e98c62965..42bf9b6e8bc0 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> */
> #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
>
> -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
>
> #define UPROBES_HASH_SZ 13
> /* serialize uprobe->pending_list */
> @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> {
> struct uprobe *uprobe;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> uprobe = __find_uprobe(inode, offset);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>
> return uprobe;
> }
> @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> {
> struct uprobe *u;
>
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
> u = __insert_uprobe(uprobe);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
>
> return u;
> }
> @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> if (WARN_ON(!uprobe_is_active(uprobe)))
> return;
>
> - spin_lock(&uprobes_treelock);
> + write_lock(&uprobes_treelock);
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> - spin_unlock(&uprobes_treelock);
> + write_unlock(&uprobes_treelock);
> RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> put_uprobe(uprobe);
> }
> @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> if (n) {
> for (t = n; t; t = rb_prev(t)) {
> @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> get_uprobe(u);
> }
> }
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
> }
>
> /* @vma contains reference counter, not the probed instruction. */
> @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> - spin_lock(&uprobes_treelock);
> + read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> - spin_unlock(&uprobes_treelock);
> + read_unlock(&uprobes_treelock);
>
> return !!n;
> }
> --
> 2.43.0
>


--
Masami Hiramatsu (Google) <[email protected]>

2024-03-25 19:21:52

by Jonathan Haslam

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

Hi Ingo,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Have you considered/measured per-CPU RW semaphores?

No I hadn't but thanks hugely for suggesting it! In initial measurements
it seems to be between 20-100% faster than the RW spinlocks! Apologies for
all the exclamation marks but I'm very excited. I'll do some more testing
tomorrow but so far it's looking very good.

Thanks again for the input.

Jon.

2024-03-25 23:14:51

by Andrii Nakryiko

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Mon, Mar 25, 2024 at 12:12 PM Jonathan Haslam
<[email protected]> wrote:
>
> Hi Ingo,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
> >
> > Have you considered/measured per-CPU RW semaphores?
>
> No I hadn't but thanks hugely for suggesting it! In initial measurements
> it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> all the exclamation marks but I'm very excited. I'll do some more testing
> tomorrow but so far it's looking very good.
>

Documentation ([0]) says that locking for writing calls
synchronize_rcu(), is that right? If that's true, attaching multiple
uprobes (including just attaching a single BPF multi-uprobe) will take
a really long time. We need to confirm we are not significantly
regressing this. And if we do, we need to take measures in the BPF
multi-uprobe attachment code path to make sure that a single
multi-uprobe attachment is still fast.

If my worries above turn out to be true, it still feels like a first
good step should be landing this patch as is (and getting it backported to
older kernels), and then have percpu rw-semaphore as a final (and a
bit more invasive) solution (it's RCU-based, so feels like a good
primitive to settle on), making sure not to regress multi-uprobes
(we'll probably need some batched API for multiple uprobes).

Thoughts?

[0] https://docs.kernel.org/locking/percpu-rw-semaphore.html
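
For context, here is a minimal sketch of what the uprobes_treelock sites
might look like if converted to the per-CPU rw-semaphore primitive from
[0] (illustrative only, not code from this patch; the name
uprobes_treesem is made up):

static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treesem); /* hypothetical name */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        /* Read side: mostly a per-CPU counter update, no shared cache line. */
        percpu_down_read(&uprobes_treesem);
        uprobe = __find_uprobe(inode, offset);
        percpu_up_read(&uprobes_treesem);

        return uprobe;
}

static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
        struct uprobe *u;

        /* Write side: waits for readers and incurs an RCU grace period. */
        percpu_down_write(&uprobes_treesem);
        u = __insert_uprobe(uprobe);
        percpu_up_write(&uprobes_treesem);

        return u;
}

The read side stays cheap, but every percpu_down_write() pays the
synchronize_rcu() cost, which is exactly the concern above about
attaching many uprobes one at a time.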

> Thanks again for the input.
>
> Jon.

2024-03-25 23:44:42

by Jonathan Haslam

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

Hi Masami,

> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <[email protected]>
>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.

Absolutely. I have the original production workload to test this with and
a derived one that mimics this test case. The production case has ~24
threads running on a 192 core system which access 14 USDTs around 1.5
million times per second in total (across all USDTs). My test case is
similar but can drive a higher rate of USDT access across more threads and
therefore generate higher contention.

All measurements are done using bpftrace scripts around relevant parts of
code in uprobes.c and application code.

Jon.

>
> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam <[email protected]>
> > ---
> > kernel/events/uprobes.c | 22 +++++++++++-----------
> > 1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > */
> > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> >
> > #define UPROBES_HASH_SZ 13
> > /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > {
> > struct uprobe *uprobe;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > uprobe = __find_uprobe(inode, offset);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return uprobe;
> > }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > {
> > struct uprobe *u;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > u = __insert_uprobe(uprobe);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> >
> > return u;
> > }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > if (WARN_ON(!uprobe_is_active(uprobe)))
> > return;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > put_uprobe(uprobe);
> > }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > if (n) {
> > for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > get_uprobe(u);
> > }
> > }
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> > }
> >
> > /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return !!n;
> > }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-03-26 11:55:29

by Jonathan Haslam

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

> > > Have you considered/measured per-CPU RW semaphores?
> >
> > No I hadn't but thanks hugely for suggesting it! In initial measurements
> > it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> > all the exclamation marks but I'm very excited. I'll do some more testing
> > tomorrow but so far it's looking very good.
> >
>
> Documentation ([0]) says that locking for writing calls
> synchronize_rcu(), is that right? If that's true, attaching multiple
> uprobes (including just attaching a single BPF multi-uprobe) will take
> a really long time. We need to confirm we are not significantly
> regressing this. And if we do, we need to take measures in the BPF
> multi-uprobe attachment code path to make sure that a single
> multi-uprobe attachment is still fast.
>
> If my worries above turn out to be true, it still feels like a first
> good step should be landing this patch as is (and get it backported to
> older kernels), and then have percpu rw-semaphore as a final (and a
> bit more invasive) solution (it's RCU-based, so feels like a good
> primitive to settle on), making sure to not regress multi-uprobes
> (we'll probably will need some batched API for multiple uprobes).
>
> Thoughts?

Agreed. In the percpu_down_write() path we call rcu_sync_enter() which is
what calls into synchronize_rcu(). I haven't done the measurements yet but
I would imagine this has to regress probe attachment, at least in the
uncontended case. Of course, reads are by far the dominant mode here but
we probably shouldn't punish writes excessively. I will do some
measurements to quantify the write penalty here.

I agree that a batched interface for probe attachment is needed here. The
usual mode of operation for us is that we have a number of USDTs (uprobes)
in hand and we want to enable and disable them in one shot. Removing the
need to do multiple locking operations is definitely an efficiency
improvement that needs to be done. Tie that together with per-CPU RW
semaphores and this should scale extremely well in both a read and write
case.
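
To make the batching idea concrete, here is a rough sketch (the helper
below is hypothetical; no such batched API exists in uprobes.c today, and
duplicate/error handling is omitted) of taking the tree lock once for a
whole set of probes instead of once per probe:

/* Hypothetical batch helper, not an existing kernel API. */
static void insert_uprobe_batch(struct uprobe **uprobes, int cnt)
{
        int i;

        /* One writer-side round trip amortised over the whole batch. */
        write_lock(&uprobes_treelock);
        for (i = 0; i < cnt; i++)
                __insert_uprobe(uprobes[i]);
        write_unlock(&uprobes_treelock);
}

The same shape applies with a per-CPU RW semaphore: a single
percpu_down_write()/percpu_up_write() pair spreads the synchronize_rcu()
cost across all the probes in the batch.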

Jon.

2024-03-26 16:02:15

by Andrii Nakryiko

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu <[email protected]> wrote:
>
> On Thu, 21 Mar 2024 07:57:35 -0700
> Jonathan Haslam <[email protected]> wrote:
>
> > Active uprobes are stored in an RB tree and accesses to this tree are
> > dominated by read operations. Currently these accesses are serialized by
> > a spinlock but this leads to enormous contention when large numbers of
> > threads are executing active probes.
> >
> > This patch converts the spinlock used to serialize access to the
> > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > aligns naturally with the overwhelmingly read-only nature of the tree
> > usage here. Although the addition of reader-writer spinlocks are
> > discouraged [0], this fix is proposed as an interim solution while an
> > RCU based approach is implemented (that work is in a nascent form). This
> > fix also has the benefit of being trivial, self contained and therefore
> > simple to backport.
> >
> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) <[email protected]>

Masami,

Given the discussion around per-cpu rw semaphore and need for
(internal) batched attachment API for uprobes, do you think you can
apply this patch as is for now? We can then gain initial improvements
in scalability that are also easy to backport, and Jonathan will work
on a more complete solution based on per-cpu RW semaphore, as
suggested by Ingo.

>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.
>
> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam <[email protected]>
> > ---
> > kernel/events/uprobes.c | 22 +++++++++++-----------
> > 1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > */
> > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> >
> > #define UPROBES_HASH_SZ 13
> > /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > {
> > struct uprobe *uprobe;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > uprobe = __find_uprobe(inode, offset);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return uprobe;
> > }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > {
> > struct uprobe *u;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > u = __insert_uprobe(uprobe);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> >
> > return u;
> > }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > if (WARN_ON(!uprobe_is_active(uprobe)))
> > return;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > put_uprobe(uprobe);
> > }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > if (n) {
> > for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > get_uprobe(u);
> > }
> > }
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> > }
> >
> > /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > min = vaddr_to_offset(vma, start);
> > max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> > n = find_node_in_range(inode, min, max);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> > return !!n;
> > }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-03-26 23:43:02

by Masami Hiramatsu

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Tue, 26 Mar 2024 09:01:47 -0700
Andrii Nakryiko <[email protected]> wrote:

> On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Thu, 21 Mar 2024 07:57:35 -0700
> > Jonathan Haslam <[email protected]> wrote:
> >
> > > Active uprobes are stored in an RB tree and accesses to this tree are
> > > dominated by read operations. Currently these accesses are serialized by
> > > a spinlock but this leads to enormous contention when large numbers of
> > > threads are executing active probes.
> > >
> > > This patch converts the spinlock used to serialize access to the
> > > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > > aligns naturally with the overwhelmingly read-only nature of the tree
> > > usage here. Although the addition of reader-writer spinlocks are
> > > discouraged [0], this fix is proposed as an interim solution while an
> > > RCU based approach is implemented (that work is in a nascent form). This
> > > fix also has the benefit of being trivial, self contained and therefore
> > > simple to backport.
> > >
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
> >
> > Looks good to me.
> >
> > Acked-by: Masami Hiramatsu (Google) <[email protected]>
>
> Masami,
>
> Given the discussion around per-cpu rw semaphore and need for
> (internal) batched attachment API for uprobes, do you think you can
> apply this patch as is for now? We can then gain initial improvements
> in scalability that are also easy to backport, and Jonathan will work
> on a more complete solution based on per-cpu RW semaphore, as
> suggested by Ingo.

Yeah, it is interesting to use a per-cpu rw semaphore for uprobes.
I would like to wait for the next version.

Thank you,

>
> >
> > BTW, how did you measure the overhead? I think spinlock overhead
> > will depend on how much lock contention happens.
> >
> > Thank you,
> >
> > >
> > > [0] https://docs.kernel.org/locking/spinlocks.html
> > >
> > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > ---
> > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > */
> > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > >
> > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > >
> > > #define UPROBES_HASH_SZ 13
> > > /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > {
> > > struct uprobe *uprobe;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > uprobe = __find_uprobe(inode, offset);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >
> > > return uprobe;
> > > }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > {
> > > struct uprobe *u;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > > u = __insert_uprobe(uprobe);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >
> > > return u;
> > > }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > return;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > put_uprobe(uprobe);
> > > }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > min = vaddr_to_offset(vma, start);
> > > max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > n = find_node_in_range(inode, min, max);
> > > if (n) {
> > > for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > get_uprobe(u);
> > > }
> > > }
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > > }
> > >
> > > /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > min = vaddr_to_offset(vma, start);
> > > max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > n = find_node_in_range(inode, min, max);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >
> > > return !!n;
> > > }
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>


--
Masami Hiramatsu (Google) <[email protected]>

2024-03-26 23:43:18

by Masami Hiramatsu

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Mon, 25 Mar 2024 19:04:59 +0000
Jonathan Haslam <[email protected]> wrote:

> Hi Masami,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
> >
> > Looks good to me.
> >
> > Acked-by: Masami Hiramatsu (Google) <[email protected]>
> >
> > BTW, how did you measure the overhead? I think spinlock overhead
> > will depend on how much lock contention happens.
>
> Absolutely. I have the original production workload to test this with and
> a derived one that mimics this test case. The production case has ~24
> threads running on a 192 core system which access 14 USDTs around 1.5
> million times per second in total (across all USDTs). My test case is
> similar but can drive a higher rate of USDT access across more threads and
> therefore generate higher contention.

Thanks for the info. So this result is measured on a sufficiently large
machine with high parallelism, where lock contention really matters.
Can you also include this information with the numbers in the next version?

Thank you,

>
> All measurements are done using bpftrace scripts around relevant parts of
> code in uprobes.c and application code.
>
> Jon.
>
> >
> > Thank you,
> >
> > >
> > > [0] https://docs.kernel.org/locking/spinlocks.html
> > >
> > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > ---
> > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > index 929e98c62965..42bf9b6e8bc0 100644
> > > --- a/kernel/events/uprobes.c
> > > +++ b/kernel/events/uprobes.c
> > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > */
> > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > >
> > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > >
> > > #define UPROBES_HASH_SZ 13
> > > /* serialize uprobe->pending_list */
> > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > {
> > > struct uprobe *uprobe;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > uprobe = __find_uprobe(inode, offset);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >
> > > return uprobe;
> > > }
> > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > {
> > > struct uprobe *u;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > > u = __insert_uprobe(uprobe);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > >
> > > return u;
> > > }
> > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > return;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + write_lock(&uprobes_treelock);
> > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > - spin_unlock(&uprobes_treelock);
> > > + write_unlock(&uprobes_treelock);
> > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > put_uprobe(uprobe);
> > > }
> > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > min = vaddr_to_offset(vma, start);
> > > max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > n = find_node_in_range(inode, min, max);
> > > if (n) {
> > > for (t = n; t; t = rb_prev(t)) {
> > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > get_uprobe(u);
> > > }
> > > }
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > > }
> > >
> > > /* @vma contains reference counter, not the probed instruction. */
> > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > min = vaddr_to_offset(vma, start);
> > > max = min + (end - start) - 1;
> > >
> > > - spin_lock(&uprobes_treelock);
> > > + read_lock(&uprobes_treelock);
> > > n = find_node_in_range(inode, min, max);
> > > - spin_unlock(&uprobes_treelock);
> > > + read_unlock(&uprobes_treelock);
> > >
> > > return !!n;
> > > }
> > > --
> > > 2.43.0
> > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>


--
Masami Hiramatsu (Google) <[email protected]>

2024-03-27 17:06:45

by Jonathan Haslam

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

> > Masami,
> >
> > Given the discussion around per-cpu rw semaphore and need for
> > (internal) batched attachment API for uprobes, do you think you can
> > apply this patch as is for now? We can then gain initial improvements
> > in scalability that are also easy to backport, and Jonathan will work
> > on a more complete solution based on per-cpu RW semaphore, as
> > suggested by Ingo.
>
> Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> I would like to wait for the next version.

My initial tests show a nice improvement over the RW spinlocks but a
significant regression in acquiring a write lock. I've got a few days
vacation over Easter but I'll aim to get some more formalised results out
to the thread toward the end of next week.

Jon.

>
> Thank you,
>
> >
> > >
> > > BTW, how did you measure the overhead? I think spinlock overhead
> > > will depend on how much lock contention happens.
> > >
> > > Thank you,
> > >
> > > >
> > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > >
> > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > ---
> > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > >
> > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > --- a/kernel/events/uprobes.c
> > > > +++ b/kernel/events/uprobes.c
> > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > */
> > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > >
> > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > >
> > > > #define UPROBES_HASH_SZ 13
> > > > /* serialize uprobe->pending_list */
> > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > {
> > > > struct uprobe *uprobe;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > > uprobe = __find_uprobe(inode, offset);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > >
> > > > return uprobe;
> > > > }
> > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > {
> > > > struct uprobe *u;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + write_lock(&uprobes_treelock);
> > > > u = __insert_uprobe(uprobe);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + write_unlock(&uprobes_treelock);
> > > >
> > > > return u;
> > > > }
> > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > return;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + write_lock(&uprobes_treelock);
> > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + write_unlock(&uprobes_treelock);
> > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > put_uprobe(uprobe);
> > > > }
> > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > min = vaddr_to_offset(vma, start);
> > > > max = min + (end - start) - 1;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > > n = find_node_in_range(inode, min, max);
> > > > if (n) {
> > > > for (t = n; t; t = rb_prev(t)) {
> > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > get_uprobe(u);
> > > > }
> > > > }
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > > }
> > > >
> > > > /* @vma contains reference counter, not the probed instruction. */
> > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > min = vaddr_to_offset(vma, start);
> > > > max = min + (end - start) - 1;
> > > >
> > > > - spin_lock(&uprobes_treelock);
> > > > + read_lock(&uprobes_treelock);
> > > > n = find_node_in_range(inode, min, max);
> > > > - spin_unlock(&uprobes_treelock);
> > > > + read_unlock(&uprobes_treelock);
> > > >
> > > > return !!n;
> > > > }
> > > > --
> > > > 2.43.0
> > > >
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-03-28 00:18:59

by Masami Hiramatsu

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Wed, 27 Mar 2024 17:06:01 +0000
Jonathan Haslam <[email protected]> wrote:

> > > Masami,
> > >
> > > Given the discussion around per-cpu rw semaphore and need for
> > > (internal) batched attachment API for uprobes, do you think you can
> > > apply this patch as is for now? We can then gain initial improvements
> > > in scalability that are also easy to backport, and Jonathan will work
> > > on a more complete solution based on per-cpu RW semaphore, as
> > > suggested by Ingo.
> >
> > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > I would like to wait for the next version.
>
> My initial tests show a nice improvement on the over RW spinlocks but
> significant regression in acquiring a write lock. I've got a few days
> vacation over Easter but I'll aim to get some more formalised results out
> to the thread toward the end of next week.

As long as the write lock is only taken on the cold path, I think you can
choose the per-cpu RW semaphore. Since it does not busy-wait, the total
system performance impact will be small.
I look forward to your formalized results :)

Thank you,

>
> Jon.
>
> >
> > Thank you,
> >
> > >
> > > >
> > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > will depend on how much lock contention happens.
> > > >
> > > > Thank you,
> > > >
> > > > >
> > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > >
> > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > ---
> > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > --- a/kernel/events/uprobes.c
> > > > > +++ b/kernel/events/uprobes.c
> > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > */
> > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > >
> > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > >
> > > > > #define UPROBES_HASH_SZ 13
> > > > > /* serialize uprobe->pending_list */
> > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > {
> > > > > struct uprobe *uprobe;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > > uprobe = __find_uprobe(inode, offset);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > >
> > > > > return uprobe;
> > > > > }
> > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > {
> > > > > struct uprobe *u;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + write_lock(&uprobes_treelock);
> > > > > u = __insert_uprobe(uprobe);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + write_unlock(&uprobes_treelock);
> > > > >
> > > > > return u;
> > > > > }
> > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > return;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + write_lock(&uprobes_treelock);
> > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + write_unlock(&uprobes_treelock);
> > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > put_uprobe(uprobe);
> > > > > }
> > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > min = vaddr_to_offset(vma, start);
> > > > > max = min + (end - start) - 1;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > > n = find_node_in_range(inode, min, max);
> > > > > if (n) {
> > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > get_uprobe(u);
> > > > > }
> > > > > }
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > > }
> > > > >
> > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > min = vaddr_to_offset(vma, start);
> > > > > max = min + (end - start) - 1;
> > > > >
> > > > > - spin_lock(&uprobes_treelock);
> > > > > + read_lock(&uprobes_treelock);
> > > > > n = find_node_in_range(inode, min, max);
> > > > > - spin_unlock(&uprobes_treelock);
> > > > > + read_unlock(&uprobes_treelock);
> > > > >
> > > > > return !!n;
> > > > > }
> > > > > --
> > > > > 2.43.0
> > > > >
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <[email protected]>
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>


--
Masami Hiramatsu (Google) <[email protected]>

2024-03-28 00:49:20

by Andrii Nakryiko

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <[email protected]> wrote:
>
> On Wed, 27 Mar 2024 17:06:01 +0000
> Jonthan Haslam <[email protected]> wrote:
>
> > > > Masami,
> > > >
> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement on the over RW spinlocks but
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As far as the write lock is only on the cold path, I think you can choose
> per-cpu RW semaphore. Since it does not do busy wait, the total system
> performance impact will be small.

No, Masami, unfortunately it's not as simple. In BPF we have BPF
multi-uprobe, which can be used to attach to thousands of user
functions. It currently creates one uprobe at a time, as we don't
really have a batched API. If each such uprobe registration now takes
a (relatively) long time, then, multiplied by the number of attach-to
user functions, it will be a horrible regression in terms of
attachment/detachment performance.

So when we switch to per-CPU rw semaphore, we'll need to provide an
internal batch uprobe attach/detach API to make sure that attaching to
multiple uprobes is still fast.

Which is why I was asking to land this patch as is, as it relieves the
scalability pains in production and is easy to backport to old
kernels. And then we can work on batched APIs and switch to per-CPU rw
semaphore.

So I hope you can reconsider and accept improvements in this patch,
while Jonathan will keep working on an even better final solution.
Thanks!

> I look forward to your formalized results :)
>
> Thank you,
>
> >
> > Jon.
> >
> > >
> > > Thank you,
> > >
> > > >
> > > > >
> > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > will depend on how much lock contention happens.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > >
> > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > ---
> > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > --- a/kernel/events/uprobes.c
> > > > > > +++ b/kernel/events/uprobes.c
> > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > */
> > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > >
> > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > >
> > > > > > #define UPROBES_HASH_SZ 13
> > > > > > /* serialize uprobe->pending_list */
> > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > {
> > > > > > struct uprobe *uprobe;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return uprobe;
> > > > > > }
> > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > {
> > > > > > struct uprobe *u;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > > u = __insert_uprobe(uprobe);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return u;
> > > > > > }
> > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > return;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > put_uprobe(uprobe);
> > > > > > }
> > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > min = vaddr_to_offset(vma, start);
> > > > > > max = min + (end - start) - 1;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > n = find_node_in_range(inode, min, max);
> > > > > > if (n) {
> > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > get_uprobe(u);
> > > > > > }
> > > > > > }
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > > }
> > > > > >
> > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > min = vaddr_to_offset(vma, start);
> > > > > > max = min + (end - start) - 1;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > n = find_node_in_range(inode, min, max);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return !!n;
> > > > > > }
> > > > > > --
> > > > > > 2.43.0
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <[email protected]>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-03-29 17:34:26

by Andrii Nakryiko

Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko
<[email protected]> wrote:
>
> On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <[email protected]> wrote:
> >
> > On Wed, 27 Mar 2024 17:06:01 +0000
> > Jonthan Haslam <[email protected]> wrote:
> >
> > > > > Masami,
> > > > >
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement on the over RW spinlocks but
> > > significant regression in acquiring a write lock. I've got a few days
> > > vacation over Easter but I'll aim to get some more formalised results out
> > > to the thread toward the end of next week.
> >
> > As far as the write lock is only on the cold path, I think you can choose
> > per-cpu RW semaphore. Since it does not do busy wait, the total system
> > performance impact will be small.
>
> No, Masami, unfortunately it's not as simple. In BPF we have BPF
> multi-uprobe, which can be used to attach to thousands of user
> functions. It currently creates one uprobe at a time, as we don't
> really have a batched API. If each such uprobe registration will now
> take a (relatively) long time, when multiplied by number of attach-to
> user functions, it will be a horrible regression in terms of
> attachment/detachment performance.
>
> So when we switch to per-CPU rw semaphore, we'll need to provide an
> internal batch uprobe attach/detach API to make sure that attaching to
> multiple uprobes is still fast.
>
> Which is why I was asking to land this patch as is, as it relieves the
> scalability pains in production and is easy to backport to old
> kernels. And then we can work on batched APIs and switch to per-CPU rw
> semaphore.
>
> So I hope you can reconsider and accept improvements in this patch,
> while Jonathan will keep working on even better final solution.
> Thanks!
>
> > I look forward to your formalized results :)
> >

BTW, as part of BPF selftests, we have a multi-attach test for uprobes
and USDTs, reporting attach/detach timings:
$ sudo ./test_progs -v -t uprobe_multi_test/bench
bpf_testmod.ko is already unloaded.
Loading bpf_testmod.ko...
Successfully loaded bpf_testmod.ko.
test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
test_bench_attach_uprobe: attached in 0.120s
test_bench_attach_uprobe: detached in 0.092s
#400/5 uprobe_multi_test/bench_uprobe:OK
test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
test_bench_attach_usdt:PASS:usdt_count 0 nsec
test_bench_attach_usdt: attached in 0.124s
test_bench_attach_usdt: detached in 0.064s
#400/6 uprobe_multi_test/bench_usdt:OK
#400 uprobe_multi_test:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
Successfully unloaded bpf_testmod.ko.

So it should be easy for Jonathan to validate his changes with this.

> > Thank you,
> >
> > >
> > > Jon.
> > >
> > > >
> > > > Thank you,
> > > >
> > > > >
> > > > > >
> > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > will depend on how much lock contention happens.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > >
> > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > ---
> > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > */
> > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > >
> > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > >
> > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > /* serialize uprobe->pending_list */
> > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > {
> > > > > > > struct uprobe *uprobe;
> > > > > > >
> > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > >
> > > > > > > return uprobe;
> > > > > > > }
> > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > {
> > > > > > > struct uprobe *u;
> > > > > > >
> > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > >
> > > > > > > return u;
> > > > > > > }
> > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > return;
> > > > > > >
> > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > put_uprobe(uprobe);
> > > > > > > }
> > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > max = min + (end - start) - 1;
> > > > > > >
> > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > if (n) {
> > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > get_uprobe(u);
> > > > > > > }
> > > > > > > }
> > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > }
> > > > > > >
> > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > max = min + (end - start) - 1;
> > > > > > >
> > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > >
> > > > > > > return !!n;
> > > > > > > }
> > > > > > > --
> > > > > > > 2.43.0
> > > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <[email protected]>
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>

2024-03-30 00:36:51

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Fri, 29 Mar 2024 10:33:57 -0700
Andrii Nakryiko <[email protected]> wrote:

> On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko
> <[email protected]> wrote:
> >
> > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <[email protected]> wrote:
> > >
> > > On Wed, 27 Mar 2024 17:06:01 +0000
> > > Jonthan Haslam <[email protected]> wrote:
> > >
> > > > > > Masami,
> > > > > >
> > > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > > suggested by Ingo.
> > > > >
> > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > > I would like to wait for the next version.
> > > >
> > > > My initial tests show a nice improvement over the RW spinlocks but a
> > > > significant regression in acquiring a write lock. I've got a few days
> > > > vacation over Easter but I'll aim to get some more formalised results out
> > > > to the thread toward the end of next week.
> > >
> > > As long as the write lock is only on the cold path, I think you can choose a
> > > per-cpu RW semaphore. Since it does not busy-wait, the total system
> > > performance impact will be small.
> >
> > No, Masami, unfortunately it's not as simple. In BPF we have BPF
> > multi-uprobe, which can be used to attach to thousands of user
> > functions. It currently creates one uprobe at a time, as we don't
> > really have a batched API. If each such uprobe registration will now
> > take a (relatively) long time, when multiplied by number of attach-to
> > user functions, it will be a horrible regression in terms of
> > attachment/detachment performance.

Ah, got it. So attachment/detachment performance needs to be taken into account as well.

> >
> > So when we switch to per-CPU rw semaphore, we'll need to provide an
> > internal batch uprobe attach/detach API to make sure that attaching to
> > multiple uprobes is still fast.

Yeah, we need such an interface, something like register_uprobes(...).
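
A purely illustrative shape for such an interface - the names and signatures
below are hypothetical, not an existing kernel API - might be:

struct uprobe_target {
        struct inode    *inode;         /* probed file */
        loff_t          offset;         /* probe offset within that file */
};

/* register/unregister one consumer on many targets, taking the
 * uprobes_tree write lock once (or once per chunk) rather than per probe */
int uprobe_register_batch(struct uprobe_target *targets, int cnt,
                          struct uprobe_consumer *uc);
void uprobe_unregister_batch(struct uprobe_target *targets, int cnt,
                             struct uprobe_consumer *uc);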

> >
> > Which is why I was asking to land this patch as is, as it relieves the
> > scalability pains in production and is easy to backport to old
> > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > semaphore.

OK, then I'll push this to for-next at this moment.
Please share if you have a good idea for the batch interface which can be
backported. I guess it would involve userspace changes too.

Thank you!

> >
> > So I hope you can reconsider and accept improvements in this patch,
> > while Jonathan will keep working on even better final solution.
> > Thanks!
> >
> > > I look forward to your formalized results :)
> > >
>
> BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> and USDTs, reporting attach/detach timings:
> $ sudo ./test_progs -v -t uprobe_multi_test/bench
> bpf_testmod.ko is already unloaded.
> Loading bpf_testmod.ko...
> Successfully loaded bpf_testmod.ko.
> test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> test_bench_attach_uprobe: attached in 0.120s
> test_bench_attach_uprobe: detached in 0.092s
> #400/5 uprobe_multi_test/bench_uprobe:OK
> test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> test_bench_attach_usdt:PASS:usdt_count 0 nsec
> test_bench_attach_usdt: attached in 0.124s
> test_bench_attach_usdt: detached in 0.064s
> #400/6 uprobe_multi_test/bench_usdt:OK
> #400 uprobe_multi_test:OK
> Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> Successfully unloaded bpf_testmod.ko.
>
> So it should be easy for Jonathan to validate his changes with this.
>
> > > Thank you,
> > >
> > > >
> > > > Jon.
> > > >
> > > > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > >
> > > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > > will depend on how much lock contention happens.
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > >
> > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > >
> > > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > > ---
> > > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > */
> > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > >
> > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > >
> > > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > > /* serialize uprobe->pending_list */
> > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > {
> > > > > > > > struct uprobe *uprobe;
> > > > > > > >
> > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > >
> > > > > > > > return uprobe;
> > > > > > > > }
> > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > {
> > > > > > > > struct uprobe *u;
> > > > > > > >
> > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > >
> > > > > > > > return u;
> > > > > > > > }
> > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > return;
> > > > > > > >
> > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > > put_uprobe(uprobe);
> > > > > > > > }
> > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > max = min + (end - start) - 1;
> > > > > > > >
> > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > if (n) {
> > > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > get_uprobe(u);
> > > > > > > > }
> > > > > > > > }
> > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > }
> > > > > > > >
> > > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > max = min + (end - start) - 1;
> > > > > > > >
> > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > >
> > > > > > > > return !!n;
> > > > > > > > }
> > > > > > > > --
> > > > > > > > 2.43.0
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <[email protected]>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <[email protected]>


--
Masami Hiramatsu (Google) <[email protected]>

2024-03-30 05:26:33

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Fri, Mar 29, 2024 at 5:36 PM Masami Hiramatsu <[email protected]> wrote:
>
> On Fri, 29 Mar 2024 10:33:57 -0700
> Andrii Nakryiko <[email protected]> wrote:
>
> > On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko
> > <[email protected]> wrote:
> > >
> > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu <[email protected]> wrote:
> > > >
> > > > On Wed, 27 Mar 2024 17:06:01 +0000
> > > > Jonthan Haslam <[email protected]> wrote:
> > > >
> > > > > > > Masami,
> > > > > > >
> > > > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > > > suggested by Ingo.
> > > > > >
> > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > > > I would like to wait for the next version.
> > > > >
> > > > > My initial tests show a nice improvement over the RW spinlocks but a
> > > > > significant regression in acquiring a write lock. I've got a few days
> > > > > vacation over Easter but I'll aim to get some more formalised results out
> > > > > to the thread toward the end of next week.
> > > >
> > > > As long as the write lock is only on the cold path, I think you can choose a
> > > > per-cpu RW semaphore. Since it does not busy-wait, the total system
> > > > performance impact will be small.
> > >
> > > No, Masami, unfortunately it's not as simple. In BPF we have BPF
> > > multi-uprobe, which can be used to attach to thousands of user
> > > functions. It currently creates one uprobe at a time, as we don't
> > > really have a batched API. If each such uprobe registration will now
> > > take a (relatively) long time, when multiplied by number of attach-to
> > > user functions, it will be a horrible regression in terms of
> > > attachment/detachment performance.
>
> Ah, got it. So attachment/detachment performance needs to be taken into account as well.
>
> > >
> > > So when we switch to per-CPU rw semaphore, we'll need to provide an
> > > internal batch uprobe attach/detach API to make sure that attaching to
> > > multiple uprobes is still fast.
>
> Yeah, we need such an interface, something like register_uprobes(...).
>
> > >
> > > Which is why I was asking to land this patch as is, as it relieves the
> > > scalability pains in production and is easy to backport to old
> > > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > > semaphore.
>
> OK, then I'll push this to for-next at this moment.

Great, thanks a lot!

> Please share if you have a good idea for the batch interface which can be
> backported. I guess it would involve userspace changes too.
>

Yep, we'll investigate the best way to provide a batch interface for
uprobes and will send patches.

> Thank you!
>
> > >
> > > So I hope you can reconsider and accept improvements in this patch,
> > > while Jonathan will keep working on even better final solution.
> > > Thanks!
> > >
> > > > I look forward to your formalized results :)
> > > >
> >
> > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > and USDTs, reporting attach/detach timings:
> > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > test_bench_attach_uprobe: attached in 0.120s
> > test_bench_attach_uprobe: detached in 0.092s
> > #400/5 uprobe_multi_test/bench_uprobe:OK
> > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > test_bench_attach_usdt: attached in 0.124s
> > test_bench_attach_usdt: detached in 0.064s
> > #400/6 uprobe_multi_test/bench_usdt:OK
> > #400 uprobe_multi_test:OK
> > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > Successfully unloaded bpf_testmod.ko.
> >
> > So it should be easy for Jonathan to validate his changes with this.
> >
> > > > Thank you,
> > > >
> > > > >
> > > > > Jon.
> > > > >
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > > > will depend on how much lock contention happens.
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > > > ---
> > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > */
> > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > >
> > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > >
> > > > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > > > /* serialize uprobe->pending_list */
> > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > {
> > > > > > > > > struct uprobe *uprobe;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return uprobe;
> > > > > > > > > }
> > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > {
> > > > > > > > > struct uprobe *u;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return u;
> > > > > > > > > }
> > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > > return;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > > > put_uprobe(uprobe);
> > > > > > > > > }
> > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > if (n) {
> > > > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > get_uprobe(u);
> > > > > > > > > }
> > > > > > > > > }
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return !!n;
> > > > > > > > > }
> > > > > > > > > --
> > > > > > > > > 2.43.0
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-04-03 11:05:31

by Jonathan Haslam

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement over the RW spinlocks but a
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As long as the write lock is only on the cold path, I think you can choose a
> per-cpu RW semaphore. Since it does not busy-wait, the total system
> performance impact will be small.
> I look forward to your formalized results :)

Sorry for the delay in getting back to you on this Masami.

I have used one of the bpf selftest benchmarks to provide some form of
comparison of the 3 different approaches (spinlock, RW spinlock and
per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop'
benchmark which just executes a single uprobe with a minimal bpf program
attached. The tests were done on a 32 core qemu/kvm instance.

Things to note about the results:

- The results are slightly variable so don't get too caught up on
individual thread count - it's the trend that is important.
- In terms of throughput with this specific benchmark a *very* macro view
is that the RW spinlock provides 40-60% more throughput than the
spinlock. The per-CPU RW semaphore provides in the order of 50-100%
more throughput than the spinlock.
- This doesn't fully reflect the large reduction in latency that we have
seen in application based measurements. However, it does demonstrate
that even the trivial change of going to a RW spinlock provides
significant benefits.

I haven't included the measurements on per-CPU RW semaphore write
performance as they are completely in line with those that Paul McKenney
posted on his journal [0]. On a 32 core system I see semaphore writes
take on the order of 25-28 millisecs - the cost of the synchronize_rcu().
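
For completeness, the per-CPU RW semaphore variant boils down to a mechanical
substitution of the lock type - the following is only a sketch under that
assumption, not the exact patch I tested:

DEFINE_STATIC_PERCPU_RWSEM(uprobes_treelock);   /* serialize rbtree access */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
        struct uprobe *uprobe;

        percpu_down_read(&uprobes_treelock);    /* read side stays cheap and per-CPU */
        uprobe = __find_uprobe(inode, offset);
        percpu_up_read(&uprobes_treelock);

        return uprobe;
}

/* writers (insert_uprobe/delete_uprobe) switch to percpu_down_write() and
 * percpu_up_write(), which is where the synchronize_rcu() cost above is paid */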

Each block of results below shows one line per execution of the benchmark (the
"Summary" line) and each line is a run with one more thread added - a
thread is a "producer". The lines are edited to remove extraneous output
that adds no value here.

The tests were executed with this driver script:

for num_threads in {1..20}
do
sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
done


spinlock

Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
Summary: hits 2.046 ± 0.004M/s ( 0.292M/prod)
Summary: hits 1.818 ± 0.002M/s ( 0.227M/prod)
Summary: hits 1.867 ± 0.024M/s ( 0.207M/prod)
Summary: hits 1.692 ± 0.003M/s ( 0.169M/prod)
Summary: hits 1.778 ± 0.004M/s ( 0.162M/prod)
Summary: hits 1.710 ± 0.011M/s ( 0.142M/prod)
Summary: hits 1.687 ± 0.022M/s ( 0.130M/prod)
Summary: hits 1.697 ± 0.004M/s ( 0.121M/prod)
Summary: hits 1.645 ± 0.011M/s ( 0.110M/prod)
Summary: hits 1.671 ± 0.002M/s ( 0.104M/prod)
Summary: hits 1.692 ± 0.014M/s ( 0.100M/prod)
Summary: hits 1.700 ± 0.015M/s ( 0.094M/prod)
Summary: hits 1.668 ± 0.005M/s ( 0.088M/prod)
Summary: hits 1.644 ± 0.004M/s ( 0.082M/prod)


RW spinlock

Summary: hits 1.465 ± 0.008M/s ( 1.465M/prod)
Summary: hits 1.750 ± 0.035M/s ( 0.875M/prod)
Summary: hits 2.164 ± 0.008M/s ( 0.721M/prod)
Summary: hits 2.235 ± 0.014M/s ( 0.559M/prod)
Summary: hits 2.202 ± 0.005M/s ( 0.440M/prod)
Summary: hits 2.816 ± 0.003M/s ( 0.469M/prod)
Summary: hits 2.364 ± 0.002M/s ( 0.338M/prod)
Summary: hits 2.327 ± 0.008M/s ( 0.291M/prod)
Summary: hits 2.147 ± 0.005M/s ( 0.239M/prod)
Summary: hits 2.266 ± 0.011M/s ( 0.227M/prod)
Summary: hits 2.483 ± 0.003M/s ( 0.226M/prod)
Summary: hits 2.573 ± 0.008M/s ( 0.214M/prod)
Summary: hits 2.569 ± 0.004M/s ( 0.198M/prod)
Summary: hits 2.507 ± 0.013M/s ( 0.179M/prod)
Summary: hits 2.165 ± 0.008M/s ( 0.144M/prod)
Summary: hits 2.524 ± 0.004M/s ( 0.158M/prod)
Summary: hits 2.059 ± 0.020M/s ( 0.121M/prod)
Summary: hits 2.332 ± 0.007M/s ( 0.130M/prod)
Summary: hits 2.404 ± 0.017M/s ( 0.127M/prod)
Summary: hits 2.187 ± 0.002M/s ( 0.109M/prod)


per-CPU RW semaphore

Summary: hits 1.341 ± 0.017M/s ( 1.341M/prod)
Summary: hits 2.397 ± 0.011M/s ( 1.198M/prod)
Summary: hits 3.547 ± 0.041M/s ( 1.182M/prod)
Summary: hits 4.108 ± 0.016M/s ( 1.027M/prod)
Summary: hits 3.138 ± 0.055M/s ( 0.628M/prod)
Summary: hits 3.247 ± 0.017M/s ( 0.541M/prod)
Summary: hits 2.877 ± 0.005M/s ( 0.411M/prod)
Summary: hits 2.880 ± 0.002M/s ( 0.360M/prod)
Summary: hits 2.579 ± 0.001M/s ( 0.287M/prod)
Summary: hits 2.982 ± 0.011M/s ( 0.298M/prod)
Summary: hits 2.603 ± 0.002M/s ( 0.237M/prod)
Summary: hits 3.013 ± 0.004M/s ( 0.251M/prod)
Summary: hits 3.059 ± 0.001M/s ( 0.235M/prod)
Summary: hits 2.721 ± 0.014M/s ( 0.194M/prod)
Summary: hits 2.943 ± 0.005M/s ( 0.196M/prod)
Summary: hits 3.366 ± 0.011M/s ( 0.210M/prod)
Summary: hits 2.459 ± 0.001M/s ( 0.145M/prod)
Summary: hits 3.023 ± 0.024M/s ( 0.168M/prod)
Summary: hits 2.919 ± 0.002M/s ( 0.154M/prod)
Summary: hits 2.569 ± 0.002M/s ( 0.128M/prod)

[0] https://paulmck.livejournal.com/67547.html

Thanks.

Jon.

>
> Thank you,
>
> >
> > Jon.
> >
> > >
> > > Thank you,
> > >
> > > >
> > > > >
> > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > will depend on how much lock contention happens.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > >
> > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > ---
> > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > --- a/kernel/events/uprobes.c
> > > > > > +++ b/kernel/events/uprobes.c
> > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > */
> > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > >
> > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > >
> > > > > > #define UPROBES_HASH_SZ 13
> > > > > > /* serialize uprobe->pending_list */
> > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > {
> > > > > > struct uprobe *uprobe;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return uprobe;
> > > > > > }
> > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > {
> > > > > > struct uprobe *u;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > > u = __insert_uprobe(uprobe);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return u;
> > > > > > }
> > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > return;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + write_lock(&uprobes_treelock);
> > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + write_unlock(&uprobes_treelock);
> > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > put_uprobe(uprobe);
> > > > > > }
> > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > min = vaddr_to_offset(vma, start);
> > > > > > max = min + (end - start) - 1;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > n = find_node_in_range(inode, min, max);
> > > > > > if (n) {
> > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > get_uprobe(u);
> > > > > > }
> > > > > > }
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > > }
> > > > > >
> > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > min = vaddr_to_offset(vma, start);
> > > > > > max = min + (end - start) - 1;
> > > > > >
> > > > > > - spin_lock(&uprobes_treelock);
> > > > > > + read_lock(&uprobes_treelock);
> > > > > > n = find_node_in_range(inode, min, max);
> > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > + read_unlock(&uprobes_treelock);
> > > > > >
> > > > > > return !!n;
> > > > > > }
> > > > > > --
> > > > > > 2.43.0
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <[email protected]>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-04-03 18:58:53

by Andrii Nakryiko

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Wed, Apr 3, 2024 at 4:05 AM Jonthan Haslam <[email protected]> wrote:
>
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement over the RW spinlocks but a
> > > significant regression in acquiring a write lock. I've got a few days
> > > vacation over Easter but I'll aim to get some more formalised results out
> > > to the thread toward the end of next week.
> >
> > As long as the write lock is only on the cold path, I think you can choose a
> > per-cpu RW semaphore. Since it does not busy-wait, the total system
> > performance impact will be small.
> > I look forward to your formalized results :)
>
> Sorry for the delay in getting back to you on this Masami.
>
> I have used one of the bpf selftest benchmarks to provide some form of
> comparison of the 3 different approaches (spinlock, RW spinlock and
> per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop'
> benchmark which just executes a single uprobe with a minimal bpf program
> attached. The tests were done on a 32 core qemu/kvm instance.
>

Thanks a lot for running benchmarks and providing results!

> Things to note about the results:
>
> - The results are slightly variable so don't get too caught up on
> individual thread count - it's the trend that is important.
> - In terms of throughput with this specific benchmark a *very* macro view
> is that the RW spinlock provides 40-60% more throughput than the
> spinlock. The per-CPU RW semaphore provides in the order of 50-100%
> more throughput than the spinlock.
> - This doesn't fully reflect the large reduction in latency that we have
> seen in application based measurements. However, it does demonstrate
> that even the trivial change of going to a RW spinlock provides
> significant benefits.

This is probably because trig-uprobe-nop creates a single uprobe that
is triggered on many CPUs. While in production we have also *many*
uprobes running on many CPUs. In this benchmark, besides contention on
uprobes_treelock, we are also hammering on other per-uprobe locks
(register_rwsem, also if you don't have [0] patch locally, there will
be another filter lock taken each time, filter->rwlock). There is also
atomic refcounting going on, which when you have the same uprobe
across all CPUs at the same time will cause a bunch of cache line
bouncing.

So yes, it's understandable that in practice in production you see an
even larger effect of optimizing uprobes_treelock than in this
micro-benchmark.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf
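
To make the above concrete, the hot path for a single probe hit has roughly
this shape (paraphrased for illustration, not verbatim kernel code):

        handle_swbp()
          -> find_active_uprobe() -> find_uprobe()  /* uprobes_treelock + get_uprobe() */
          -> handler_chain()                        /* down_read(&uprobe->register_rwsem) */
          -> put_uprobe()                           /* atomic refcount drop */

so with one shared uprobe being hit from every CPU, each of those steps keeps
bouncing the same cache lines.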

>
> I haven't included the measurements on per-CPU RW semaphore write
> performance as they are completely in line with those that Paul McKenney
> posted on his journal [0]. On a 32 core system I see semaphore writes
> take on the order of 25-28 millisecs - the cost of the synchronize_rcu().
>
> Each block of results below shows one line per execution of the benchmark (the
> "Summary" line) and each line is a run with one more thread added - a
> thread is a "producer". The lines are edited to remove extraneous output
> that adds no value here.
>
> The tests were executed with this driver script:
>
> for num_threads in {1..20}
> do
> sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary

just want to mention the -a (affinity) option that you can pass to the bench
tool; it will pin each thread on its own CPU. It generally makes tests
more uniform, eliminating CPU migration variability.
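
For example, the bench invocation in the quoted loop would just become
(assuming the flag combines with -p as-is):

sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary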

> done
>
>
> spinlock
>
> Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)

I also wanted to point out that the first measurement (1.453M/s in
this row) is total throughput across all threads, while the value in
parentheses (0.900M/prod) is the average throughput per thread. So
this M/prod value is the most interesting in this benchmark where we
assess the effect of reducing contention.
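
(For instance, the three-producer spinlock line quoted above is 2.701M/s in
total, which split across 3 producers gives the 0.900M/prod figure.)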

> Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)

[...]

2024-04-04 10:46:13

by Jonathan Haslam

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

> > Things to note about the results:
> >
> > - The results are slightly variable so don't get too caught up on
> > individual thread count - it's the trend that is important.
> > - In terms of throughput with this specific benchmark a *very* macro view
> > is that the RW spinlock provides 40-60% more throughput than the
> > spinlock. The per-CPU RW semaphore provides in the order of 50-100%
> > more throughput than the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we have
> > seen in application based measurements. However, it does demonstrate
> > that even the trivial change of going to a RW spinlock provides
> > significant benefits.
>
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs. While in production we have also *many*
> uprobes running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, also if you don't have [0] patch locally, there will
> be another filter lock taken each time, filter->rwlock). There is also
> atomic refcounting going on, which when you have the same uprobe
> across all CPUs at the same time will cause a bunch of cache line
> bouncing.
>
> So yes, it's understandable that in practice in production you see an
> even larger effect of optimizing uprobes_treelock than in this
> micro-benchmark.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this, Andrii. Yes, I do have the
filter->rwlock fix applied but, as you say, there are no doubt other
effects at play here as to be expected in such a synthetic workload. I'm
pleased with the outcomes though as they show a good result even if they
are at the lower end of what I expect.

The results also show that pursuing an RCU solution is definitely worth it
but that write penalty is brutal in the case of a full synchronize_rcu()!
Should be fun.
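
(The brutal part is the grace-period wait in the classic removal sequence -
sketched here only to show the pattern, not actual uprobes code:

        write_lock(&uprobes_treelock);          /* unpublish the node ... */
        rb_erase(&uprobe->rb_node, &uprobes_tree);
        write_unlock(&uprobes_treelock);
        synchronize_rcu();                      /* ... then wait out all pre-existing readers */
        put_uprobe(uprobe);                     /* freeing is only safe after the grace period */

so any RCU-based design will need to batch or defer that wait.)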

> > for num_threads in {1..20}
> > do
> > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
>
> just want to mention the -a (affinity) option that you can pass to the bench
> tool; it will pin each thread on its own CPU. It generally makes tests
> more uniform, eliminating CPU migration variability.

Thanks for pointing that flag out!

Jon.

>
> > done
> >
> >
> > spinlock
> >
> > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
>
> I also wanted to point out that the first measurement (1.453M/s in
> this row) is total throughput across all threads, while the value in
> parentheses (0.900M/prod) is the average throughput per thread. So
> this M/prod value is the most interesting in this benchmark where we
> assess the effect of reducing contention.
>
> > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
>
> [...]

2024-04-10 10:41:03

by Jonathan Haslam

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

Hi Masami,

> > > Which is why I was asking to land this patch as is, as it relieves the
> > > scalability pains in production and is easy to backport to old
> > > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > > semaphore.
>
> OK, then I'll push this to for-next at this moment.
> Please share if you have a good idea for the batch interface which can be
> backported. I guess it would involve userspace changes too.

Did you (or anyone else) need anything more from me on this one so that it
can be pushed? I provided some benchmark numbers but happy to provide
anything else that may be required.

Thanks!

Jon.

>
> Thank you!
>
> > >
> > > So I hope you can reconsider and accept improvements in this patch,
> > > while Jonathan will keep working on even better final solution.
> > > Thanks!
> > >
> > > > I look forward to your formalized results :)
> > > >
> >
> > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > and USDTs, reporting attach/detach timings:
> > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > test_bench_attach_uprobe: attached in 0.120s
> > test_bench_attach_uprobe: detached in 0.092s
> > #400/5 uprobe_multi_test/bench_uprobe:OK
> > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > test_bench_attach_usdt: attached in 0.124s
> > test_bench_attach_usdt: detached in 0.064s
> > #400/6 uprobe_multi_test/bench_usdt:OK
> > #400 uprobe_multi_test:OK
> > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > Successfully unloaded bpf_testmod.ko.
> >
> > So it should be easy for Jonathan to validate his changes with this.
> >
> > > > Thank you,
> > > >
> > > > >
> > > > > Jon.
> > > > >
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > > > will depend on how much lock contention happens.
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > >
> > > > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > > > ---
> > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > */
> > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > >
> > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > >
> > > > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > > > /* serialize uprobe->pending_list */
> > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > {
> > > > > > > > > struct uprobe *uprobe;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return uprobe;
> > > > > > > > > }
> > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > {
> > > > > > > > > struct uprobe *u;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return u;
> > > > > > > > > }
> > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > > return;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > > > put_uprobe(uprobe);
> > > > > > > > > }
> > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > if (n) {
> > > > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > get_uprobe(u);
> > > > > > > > > }
> > > > > > > > > }
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > >
> > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > >
> > > > > > > > > return !!n;
> > > > > > > > > }
> > > > > > > > > --
> > > > > > > > > 2.43.0
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > >
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-04-10 23:29:15

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Wed, 10 Apr 2024 11:38:11 +0100
Jonthan Haslam <[email protected]> wrote:

> Hi Masami,
>
> > > > Which is why I was asking to land this patch as is, as it relieves the
> > > > scalability pains in production and is easy to backport to old
> > > > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > > > semaphore.
> >
> > OK, then I'll push this to for-next at this moment.
> > Please share if you have a good idea for the batch interface which can be
> > backported. I guess it would involve userspace changes too.
>
> Did you (or anyone else) need anything more from me on this one so that it
> can be pushed? I provided some benchmark numbers but happy to provide
> anything else that may be required.

Yeah, if you can update it with the results, that would look better to me.
Or, shall I update the description myself?

Thank you,

>
> Thanks!
>
> Jon.
>
> >
> > Thank you!
> >
> > > >
> > > > So I hope you can reconsider and accept improvements in this patch,
> > > > while Jonathan will keep working on even better final solution.
> > > > Thanks!
> > > >
> > > > > I look forward to your formalized results :)
> > > > >
> > >
> > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > and USDTs, reporting attach/detach timings:
> > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > bpf_testmod.ko is already unloaded.
> > > Loading bpf_testmod.ko...
> > > Successfully loaded bpf_testmod.ko.
> > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > test_bench_attach_uprobe: attached in 0.120s
> > > test_bench_attach_uprobe: detached in 0.092s
> > > #400/5 uprobe_multi_test/bench_uprobe:OK
> > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > test_bench_attach_usdt: attached in 0.124s
> > > test_bench_attach_usdt: detached in 0.064s
> > > #400/6 uprobe_multi_test/bench_usdt:OK
> > > #400 uprobe_multi_test:OK
> > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > Successfully unloaded bpf_testmod.ko.
> > >
> > > So it should be easy for Jonathan to validate his changes with this.
> > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > Jon.
> > > > > >
> > > > > > >
> > > > > > > Thank you,
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > >
> > > > > > > > > Thank you,
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > > > > ---
> > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > > */
> > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > >
> > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > >
> > > > > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > > > > /* serialize uprobe->pending_list */
> > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > {
> > > > > > > > > > struct uprobe *uprobe;
> > > > > > > > > >
> > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > >
> > > > > > > > > > return uprobe;
> > > > > > > > > > }
> > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > {
> > > > > > > > > > struct uprobe *u;
> > > > > > > > > >
> > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > >
> > > > > > > > > > return u;
> > > > > > > > > > }
> > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > > > return;
> > > > > > > > > >
> > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > > > > put_uprobe(uprobe);
> > > > > > > > > > }
> > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > > >
> > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > > if (n) {
> > > > > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > > get_uprobe(u);
> > > > > > > > > > }
> > > > > > > > > > }
> > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > > >
> > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > >
> > > > > > > > > > return !!n;
> > > > > > > > > > }
> > > > > > > > > > --
> > > > > > > > > > 2.43.0
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > >
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <[email protected]>
> >
> >
> > --
> > Masami Hiramatsu (Google) <[email protected]>


--
Masami Hiramatsu (Google) <[email protected]>

2024-04-11 08:42:27

by Jonathan Haslam

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it would involve userspace changes too.
> >
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
>
> Yeah, if you can update it with the results, that would look better to me.
> Or, shall I update the description myself?

Sure, please feel free to update the description yourself.

Jon.

>
> Thank you,
>
> >
> > Thanks!
> >
> > Jon.
> >
> > >
> > > Thank you!
> > >
> > > > >
> > > > > So I hope you can reconsider and accept improvements in this patch,
> > > > > while Jonathan will keep working on even better final solution.
> > > > > Thanks!
> > > > >
> > > > > > I look forward to your formalized results :)
> > > > > >
> > > >
> > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > > > and USDTs, reporting attach/detach timings:
> > > > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > > > bpf_testmod.ko is already unloaded.
> > > > Loading bpf_testmod.ko...
> > > > Successfully loaded bpf_testmod.ko.
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > > > test_bench_attach_uprobe: attached in 0.120s
> > > > test_bench_attach_uprobe: detached in 0.092s
> > > > #400/5 uprobe_multi_test/bench_uprobe:OK
> > > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > > > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > > > test_bench_attach_usdt: attached in 0.124s
> > > > test_bench_attach_usdt: detached in 0.064s
> > > > #400/6 uprobe_multi_test/bench_usdt:OK
> > > > #400 uprobe_multi_test:OK
> > > > Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
> > > > Successfully unloaded bpf_testmod.ko.
> > > >
> > > > So it should be easy for Jonathan to validate his changes with this.
> > > >
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > Jon.
> > > > > > >
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > > > > > will depend on how much lock contention happens.
> > > > > > > > > >
> > > > > > > > > > Thank you,
> > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Jonathan Haslam <[email protected]>
> > > > > > > > > > > ---
> > > > > > > > > > > kernel/events/uprobes.c | 22 +++++++++++-----------
> > > > > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > > > > > > > */
> > > > > > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
> > > > > > > > > > >
> > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */
> > > > > > > > > > >
> > > > > > > > > > > #define UPROBES_HASH_SZ 13
> > > > > > > > > > > /* serialize uprobe->pending_list */
> > > > > > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
> > > > > > > > > > > {
> > > > > > > > > > > struct uprobe *uprobe;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > uprobe = __find_uprobe(inode, offset);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > > return uprobe;
> > > > > > > > > > > }
> > > > > > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
> > > > > > > > > > > {
> > > > > > > > > > > struct uprobe *u;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > > u = __insert_uprobe(uprobe);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > > return u;
> > > > > > > > > > > }
> > > > > > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > > > > > > > if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > > > > > > > return;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + write_lock(&uprobes_treelock);
> > > > > > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + write_unlock(&uprobes_treelock);
> > > > > > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> > > > > > > > > > > put_uprobe(uprobe);
> > > > > > > > > > > }
> > > > > > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > > > if (n) {
> > > > > > > > > > > for (t = n; t; t = rb_prev(t)) {
> > > > > > > > > > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> > > > > > > > > > > get_uprobe(u);
> > > > > > > > > > > }
> > > > > > > > > > > }
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > > }
> > > > > > > > > > >
> > > > > > > > > > > /* @vma contains reference counter, not the probed instruction. */
> > > > > > > > > > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned long start, unsigned long e
> > > > > > > > > > > min = vaddr_to_offset(vma, start);
> > > > > > > > > > > max = min + (end - start) - 1;
> > > > > > > > > > >
> > > > > > > > > > > - spin_lock(&uprobes_treelock);
> > > > > > > > > > > + read_lock(&uprobes_treelock);
> > > > > > > > > > > n = find_node_in_range(inode, min, max);
> > > > > > > > > > > - spin_unlock(&uprobes_treelock);
> > > > > > > > > > > + read_unlock(&uprobes_treelock);
> > > > > > > > > > >
> > > > > > > > > > > return !!n;
> > > > > > > > > > > }
> > > > > > > > > > > --
> > > > > > > > > > > 2.43.0
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Masami Hiramatsu (Google) <[email protected]>
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Masami Hiramatsu (Google) <[email protected]>
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <[email protected]>
>
>
> --
> Masami Hiramatsu (Google) <[email protected]>

2024-04-18 11:11:37

by Jonathan Haslam

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

Hi Masami,

> > > OK, then I'll push this to for-next at this moment.
> > > Please share if you have a good idea for the batch interface which can be
> > > backported. I guess it should involve updating userspace changes too.
> >
> > Did you (or anyone else) need anything more from me on this one so that it
> > can be pushed? I provided some benchmark numbers but happy to provide
> > anything else that may be required.
>
> Yeah, if you can update with the result, it looks better to me.
> Or, can I update the description?

Just checking whether you need anything more from me so that this can be
pushed to for-next. It would be really great if we could get this in. Thanks
for all your help.

Jon.


2024-04-19 00:43:56

by Masami Hiramatsu

[permalink] [raw]
Subject: Re: [PATCH] uprobes: reduce contention on uprobes_tree access

On Thu, 18 Apr 2024 12:10:59 +0100
Jonathan Haslam <[email protected]> wrote:

> Hi Masami,
>
> > > > OK, then I'll push this to for-next at this moment.
> > > > Please share if you have a good idea for the batch interface which can be
> > > > backported. I guess it should involve updating userspace changes too.
> > >
> > > Did you (or anyone else) need anything more from me on this one so that it
> > > can be pushed? I provided some benchmark numbers but happy to provide
> > > anything else that may be required.
> >
> > Yeah, if you can update with the result, it looks better to me.
> > Or, can I update the description?
>
> Just checking if you need me to do anything on this so that it can be
> pushed to for-next? Would be really great if we can get this in. Thanks
> for all your help.

Yes, the other uprobe performance improvements [1][2] include benchmark results,
but this patch doesn't. It would be helpful if you could update it with your
benchmark results.

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
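
For illustration, below is a sketch of the kind of contention microbenchmark that could produce such numbers: many threads hammer a trivial noinline function, a uprobe is attached to that function from outside the process (for example with something like "perf probe -x ./rwlock_bench probe_target" plus perf record), and the wall-clock cost per call is compared with and without the lock change. This is only an assumed methodology, not the harness used in [1][2] or in the production measurements quoted earlier in the thread, and all names in it are made up for the example.

/* rwlock_bench.c: build with gcc -O2 -pthread rwlock_bench.c -o rwlock_bench */
#include <pthread.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS 16
#define ITERS    1000000UL

/* Attach the uprobe to this symbol from outside the process. */
__attribute__((noinline)) void probe_target(void)
{
	asm volatile("" ::: "memory");  /* keep the call from being optimized away */
}

static void *worker(void *arg)
{
	(void)arg;
	for (unsigned long i = 0; i < ITERS; i++)
		probe_target();
	return NULL;
}

static double now_sec(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	double start, elapsed;
	int i;

	start = now_sec();
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, worker, NULL);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tids[i], NULL);
	elapsed = now_sec() - start;

	/* Threads run concurrently, so this approximates the latency each
	 * thread sees per call (and per uprobe hit once one is attached). */
	printf("%d threads, %lu calls each: %.3f us per call\n",
	       NTHREADS, ITERS, elapsed * 1e6 / ITERS);
	return 0;
}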

Thank you,



--
Masami Hiramatsu (Google) <[email protected]>