2020-10-21 09:15:48

by Axel Rasmussen

Subject: [PATCH v4 0/1] Add tracepoints around mmap_lock acquisition

This patchset adds tracepoints around mmap_lock acquisition. This is useful so
we can measure the latency of lock acquisition, in order to detect contention.

This version is based upon linux-next (since it depends on some recently-merged
patches [1] [2]).

Changes since v3:

- Switched EXPORT_SYMBOL to EXPORT_TRACEPOINT_SYMBOL, removed comment.

- Removed redundant trace_..._enabled() check.

- Defined the three TRACE_EVENTs separately, instead of sharing an event class.
The tradeoff is 524 more bytes in .text, but the start_locking and released
events no longer have a vestigial "success" field, so they're simpler +
faster.

Changes since v2:

- Refactored tracing helper functions so the helpers are simpler, but the locking
functions are slightly more verbose. Overall, this decreased the delta to
mmap_lock.h slightly.

- Fixed a typo in a comment. :)

Changes since v1:

- Functions renamed to reserve the "trace_" prefix for actual tracepoints.

- We no longer measure the duration directly. Instead, users are expected to
construct a synthetic event which computes the interval between "start
locking" and "acquire returned".

- The new helper for checking if tracepoints are enabled in a header is used to
avoid un-inlining any of the lock wrappers. This yields ~zero overhead if the
tracepoints aren't enabled, and therefore obviates the need for a Kconfig for
this change.

[1] https://lore.kernel.org/patchwork/patch/1316922/
[2] https://lore.kernel.org/patchwork/patch/1311996/

Axel Rasmussen (1):
mmap_lock: add tracepoints around lock acquisition

include/linux/mmap_lock.h | 93 ++++++++++++++++++++++++++++--
include/trace/events/mmap_lock.h | 98 ++++++++++++++++++++++++++++++++
mm/Makefile | 2 +-
mm/mmap_lock.c | 80 ++++++++++++++++++++++++++
4 files changed, 267 insertions(+), 6 deletions(-)
create mode 100644 include/trace/events/mmap_lock.h
create mode 100644 mm/mmap_lock.c

--
2.29.0.rc1.297.gfa9743e501-goog


2020-10-21 09:15:52

by Axel Rasmussen

Subject: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition

The goal of these tracepoints is to be able to debug lock contention
issues. This lock is acquired on most (all?) mmap / munmap / page fault
operations, so a multi-threaded process which does a lot of these can
experience significant contention.

We trace just before we start acquisition, when the acquisition returns
(whether it succeeded or not), and when the lock is released (or
downgraded). The events are broken out by lock type (read / write).

The events are also broken out by memcg path. For container-based
workloads, users often think of several processes in a memcg as a single
logical "task", so collecting statistics at this level is useful.

The end goal is to get latency information. This isn't directly included
in the trace events. Instead, users are expected to compute the time
between "start locking" and "acquire returned", using e.g. synthetic
events or BPF. The benefit we get from this is simpler code.
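
As a rough userspace illustration of that computation (not part of the patch),
the sketch below pairs a "start locking" record with the next "acquire
returned" record and subtracts the timestamps. The struct lock_event layout,
the tid field, and the choice to pair on (mm, tid) are assumptions made for
this example; the tracepoints themselves only carry mm, memcg_path, write and
success, with the timestamp and task supplied by the tracing infrastructure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace record of one trace event; not a kernel structure. */
enum ev_kind { EV_START_LOCKING, EV_ACQUIRE_RETURNED };

struct lock_event {
	enum ev_kind kind;
	uint64_t ts_ns;  /* timestamp attached by the tracing infrastructure */
	uintptr_t mm;    /* value of the mm= field, used as part of the key */
	int tid;         /* thread that emitted the event (assumed available) */
};

/*
 * Return the latency of the first acquire_returned event that follows
 * events[start] and matches it on (mm, tid), or -1 if there is none.
 */
static int64_t acquire_latency_ns(const struct lock_event *events, size_t n,
				  size_t start)
{
	assert(start < n && events[start].kind == EV_START_LOCKING);

	for (size_t i = start + 1; i < n; i++) {
		if (events[i].kind == EV_ACQUIRE_RETURNED &&
		    events[i].mm == events[start].mm &&
		    events[i].tid == events[start].tid)
			return (int64_t)(events[i].ts_ns - events[start].ts_ns);
	}
	return -1;
}
```

Pairing on the thread as well as the mm matters because several threads of one
process share an mm and may be waiting for the lock at the same time.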

Because we use tracepoint_enabled() to decide whether or not to trace,
this patch has effectively no overhead unless tracepoints are enabled at
runtime. If tracepoints are enabled, there is a performance impact, but
how much depends on exactly what e.g. the BPF program does.
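
The gating pattern can be modelled in plain userspace C, as a sketch only. A
bool stands in for the tracepoint's enabled state (the kernel uses a static
key, so the disabled case is cheaper than the branch shown here), and the
names model_write_lock(), trace_start_locking() and do_trace_start_locking()
are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the tracepoint's enabled state (a static key in the kernel). */
static bool start_locking_enabled;
static int slow_path_calls;

/*
 * Out-of-line slow path. In the patch this role is played by
 * __mmap_lock_do_trace_start_locking() in mm/mmap_lock.c.
 */
static void do_trace_start_locking(const void *mm, bool write)
{
	(void)mm;
	(void)write;
	/* The slow path is only ever reached while tracing is enabled. */
	assert(start_locking_enabled);
	slow_path_calls++;
}

/* Inline gate: one cheap check on the fast path, no call when disabled. */
static inline void trace_start_locking(const void *mm, bool write)
{
	if (start_locking_enabled)
		do_trace_start_locking(mm, write);
}

/* Toy lock wrapper; a real one would call down_write() after the gate. */
static inline void model_write_lock(const void *mm)
{
	trace_start_locking(mm, true);
}
```

Keeping the check inline and the event emission out of line is what lets the
lock wrappers stay inlined in the header.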

Reviewed-by: Michel Lespinasse <[email protected]>
Acked-by: Yafang Shao <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Axel Rasmussen <[email protected]>
---
include/linux/mmap_lock.h | 93 ++++++++++++++++++++++++++++--
include/trace/events/mmap_lock.h | 98 ++++++++++++++++++++++++++++++++
mm/Makefile | 2 +-
mm/mmap_lock.c | 80 ++++++++++++++++++++++++++
4 files changed, 267 insertions(+), 6 deletions(-)
create mode 100644 include/trace/events/mmap_lock.h
create mode 100644 mm/mmap_lock.c

diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h
index e1afb420fc94..453d82879266 100644
--- a/include/linux/mmap_lock.h
+++ b/include/linux/mmap_lock.h
@@ -1,11 +1,63 @@
#ifndef _LINUX_MMAP_LOCK_H
#define _LINUX_MMAP_LOCK_H

+#include <linux/lockdep.h>
+#include <linux/mm_types.h>
#include <linux/mmdebug.h>
+#include <linux/rwsem.h>
+#include <linux/tracepoint-defs.h>
+#include <linux/types.h>

#define MMAP_LOCK_INITIALIZER(name) \
.mmap_lock = __RWSEM_INITIALIZER((name).mmap_lock),

+DECLARE_TRACEPOINT(mmap_lock_start_locking);
+DECLARE_TRACEPOINT(mmap_lock_acquire_returned);
+DECLARE_TRACEPOINT(mmap_lock_released);
+
+#ifdef CONFIG_TRACING
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write);
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+ bool success);
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write);
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+ bool write)
+{
+ if (tracepoint_enabled(mmap_lock_start_locking))
+ __mmap_lock_do_trace_start_locking(mm, write);
+}
+
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+ bool write, bool success)
+{
+ if (tracepoint_enabled(mmap_lock_acquire_returned))
+ __mmap_lock_do_trace_acquire_returned(mm, write, success);
+}
+
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+ if (tracepoint_enabled(mmap_lock_released))
+ __mmap_lock_do_trace_released(mm, write);
+}
+
+#else /* !CONFIG_TRACING */
+
+static inline void __mmap_lock_trace_start_locking(struct mm_struct *mm,
+ bool write)
+{
+}
+static inline void __mmap_lock_trace_acquire_returned(struct mm_struct *mm,
+ bool write, bool success)
+{
+}
+static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write)
+{
+}
+
+#endif /* CONFIG_TRACING */
+
static inline void mmap_init_lock(struct mm_struct *mm)
{
init_rwsem(&mm->mmap_lock);
@@ -13,58 +65,88 @@ static inline void mmap_init_lock(struct mm_struct *mm)

static inline void mmap_write_lock(struct mm_struct *mm)
{
+ __mmap_lock_trace_start_locking(mm, true);
down_write(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, true, true);
}

static inline void mmap_write_lock_nested(struct mm_struct *mm, int subclass)
{
+ __mmap_lock_trace_start_locking(mm, true);
down_write_nested(&mm->mmap_lock, subclass);
+ __mmap_lock_trace_acquire_returned(mm, true, true);
}

static inline int mmap_write_lock_killable(struct mm_struct *mm)
{
- return down_write_killable(&mm->mmap_lock);
+ int ret;
+
+ __mmap_lock_trace_start_locking(mm, true);
+ ret = down_write_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, true, ret == 0);
+ return ret;
}

static inline bool mmap_write_trylock(struct mm_struct *mm)
{
- return down_write_trylock(&mm->mmap_lock) != 0;
+ bool ret;
+
+ __mmap_lock_trace_start_locking(mm, true);
+ ret = down_write_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, true, ret);
+ return ret;
}

static inline void mmap_write_unlock(struct mm_struct *mm)
{
up_write(&mm->mmap_lock);
+ __mmap_lock_trace_released(mm, true);
}

static inline void mmap_write_downgrade(struct mm_struct *mm)
{
downgrade_write(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, false, true);
}

static inline void mmap_read_lock(struct mm_struct *mm)
{
+ __mmap_lock_trace_start_locking(mm, false);
down_read(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, false, true);
}

static inline int mmap_read_lock_killable(struct mm_struct *mm)
{
- return down_read_killable(&mm->mmap_lock);
+ int ret;
+
+ __mmap_lock_trace_start_locking(mm, false);
+ ret = down_read_killable(&mm->mmap_lock);
+ __mmap_lock_trace_acquire_returned(mm, false, ret == 0);
+ return ret;
}

static inline bool mmap_read_trylock(struct mm_struct *mm)
{
- return down_read_trylock(&mm->mmap_lock) != 0;
+ bool ret;
+
+ __mmap_lock_trace_start_locking(mm, false);
+ ret = down_read_trylock(&mm->mmap_lock) != 0;
+ __mmap_lock_trace_acquire_returned(mm, false, ret);
+ return ret;
}

static inline void mmap_read_unlock(struct mm_struct *mm)
{
up_read(&mm->mmap_lock);
+ __mmap_lock_trace_released(mm, false);
}

static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
{
- if (down_read_trylock(&mm->mmap_lock)) {
+ if (mmap_read_trylock(mm)) {
rwsem_release(&mm->mmap_lock.dep_map, _RET_IP_);
+ __mmap_lock_trace_released(mm, false);
return true;
}
return false;
@@ -73,6 +155,7 @@ static inline bool mmap_read_trylock_non_owner(struct mm_struct *mm)
static inline void mmap_read_unlock_non_owner(struct mm_struct *mm)
{
up_read_non_owner(&mm->mmap_lock);
+ __mmap_lock_trace_released(mm, false);
}

static inline void mmap_assert_locked(struct mm_struct *mm)
diff --git a/include/trace/events/mmap_lock.h b/include/trace/events/mmap_lock.h
new file mode 100644
index 000000000000..2268ff967072
--- /dev/null
+++ b/include/trace/events/mmap_lock.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM mmap_lock
+
+#if !defined(_TRACE_MMAP_LOCK_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MMAP_LOCK_H
+
+#include <linux/tracepoint.h>
+#include <linux/types.h>
+
+struct mm_struct;
+
+TRACE_EVENT(mmap_lock_start_locking,
+
+ TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write),
+
+ TP_ARGS(mm, memcg_path, write),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __string(memcg_path, memcg_path)
+ __field(bool, write)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __assign_str(memcg_path, memcg_path);
+ __entry->write = write;
+ ),
+
+ TP_printk(
+ "mm=%p memcg_path=%s write=%s\n",
+ __entry->mm,
+ __get_str(memcg_path),
+ __entry->write ? "true" : "false"
+ )
+);
+
+TRACE_EVENT(mmap_lock_acquire_returned,
+
+ TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write,
+ bool success),
+
+ TP_ARGS(mm, memcg_path, write, success),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __string(memcg_path, memcg_path)
+ __field(bool, write)
+ __field(bool, success)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __assign_str(memcg_path, memcg_path);
+ __entry->write = write;
+ __entry->success = success;
+ ),
+
+ TP_printk(
+ "mm=%p memcg_path=%s write=%s success=%s\n",
+ __entry->mm,
+ __get_str(memcg_path),
+ __entry->write ? "true" : "false",
+ __entry->success ? "true" : "false"
+ )
+);
+
+TRACE_EVENT(mmap_lock_released,
+
+ TP_PROTO(struct mm_struct *mm, const char *memcg_path, bool write),
+
+ TP_ARGS(mm, memcg_path, write),
+
+ TP_STRUCT__entry(
+ __field(struct mm_struct *, mm)
+ __string(memcg_path, memcg_path)
+ __field(bool, write)
+ ),
+
+ TP_fast_assign(
+ __entry->mm = mm;
+ __assign_str(memcg_path, memcg_path);
+ __entry->write = write;
+ ),
+
+ TP_printk(
+ "mm=%p memcg_path=%s write=%s\n",
+ __entry->mm,
+ __get_str(memcg_path),
+ __entry->write ? "true" : "false"
+ )
+);
+
+#endif /* _TRACE_MMAP_LOCK_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/Makefile b/mm/Makefile
index 069f216e109e..b6cd2fffa492 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -52,7 +52,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
mm_init.o percpu.o slab_common.o \
compaction.o vmacache.o \
interval_tree.o list_lru.o workingset.o \
- debug.o gup.o $(mmu-y)
+ debug.o gup.o mmap_lock.o $(mmu-y)

# Give 'page_alloc' its own module-parameter namespace
page-alloc-y := page_alloc.o
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
new file mode 100644
index 000000000000..52741eff7bea
--- /dev/null
+++ b/mm/mmap_lock.c
@@ -0,0 +1,80 @@
+// SPDX-License-Identifier: GPL-2.0
+#define CREATE_TRACE_POINTS
+#include <trace/events/mmap_lock.h>
+
+#include <linux/mm.h>
+#include <linux/cgroup.h>
+#include <linux/memcontrol.h>
+#include <linux/mmap_lock.h>
+#include <linux/percpu.h>
+#include <linux/smp.h>
+#include <linux/trace_events.h>
+
+EXPORT_TRACEPOINT_SYMBOL(mmap_lock_start_locking);
+EXPORT_TRACEPOINT_SYMBOL(mmap_lock_acquire_returned);
+EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
+
+#ifdef CONFIG_MEMCG
+
+DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
+
+/*
+ * Write the given mm_struct's memcg path to a percpu buffer, and return a
+ * pointer to it. If the path cannot be determined, the buffer will contain the
+ * empty string.
+ *
+ * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
+ * disabled by the caller before calling us, and re-enabled only after the
+ * caller is done with the pointer.
+ */
+static const char *get_mm_memcg_path(struct mm_struct *mm)
+{
+ struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+ if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
+ char *buf = this_cpu_ptr(trace_memcg_path);
+
+ cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
+ return buf;
+ }
+ return "";
+}
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
+ do { \
+ get_cpu(); \
+ trace_mmap_lock_##type(mm, get_mm_memcg_path(mm), \
+ ##__VA_ARGS__); \
+ put_cpu(); \
+ } while (0)
+
+#else /* !CONFIG_MEMCG */
+
+#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
+ trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
+
+#endif /* CONFIG_MEMCG */
+
+/*
+ * Trace calls must be in a separate file, as otherwise there's a circular
+ * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
+ */
+
+void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
+{
+ TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
+
+void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
+ bool success)
+{
+ TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
+
+void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
+{
+ TRACE_MMAP_LOCK_EVENT(released, mm, write);
+}
+EXPORT_SYMBOL(__mmap_lock_do_trace_released);
--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-23 16:17:10

by Vlastimil Babka

Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition

On 10/20/20 8:47 PM, Axel Rasmussen wrote:
> The goal of these tracepoints is to be able to debug lock contention
> issues. This lock is acquired on most (all?) mmap / munmap / page fault
> operations, so a multi-threaded process which does a lot of these can
> experience significant contention.
>
> We trace just before we start acquisition, when the acquisition returns
> (whether it succeeded or not), and when the lock is released (or
> downgraded). The events are broken out by lock type (read / write).
>
> The events are also broken out by memcg path. For container-based
> workloads, users often think of several processes in a memcg as a single
> logical "task", so collecting statistics at this level is useful.
>
> The end goal is to get latency information. This isn't directly included
> in the trace events. Instead, users are expected to compute the time
> between "start locking" and "acquire returned", using e.g. synthetic
> events or BPF. The benefit we get from this is simpler code.
>
> Because we use tracepoint_enabled() to decide whether or not to trace,
> this patch has effectively no overhead unless tracepoints are enabled at
> runtime. If tracepoints are enabled, there is a performance impact, but
> how much depends on exactly what e.g. the BPF program does.
>
> Reviewed-by: Michel Lespinasse <[email protected]>
> Acked-by: Yafang Shao <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Signed-off-by: Axel Rasmussen <[email protected]>

All seems fine to me, except I started to wonder...

> +
> +#ifdef CONFIG_MEMCG
> +
> +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
> +
> +/*
> + * Write the given mm_struct's memcg path to a percpu buffer, and return a
> + * pointer to it. If the path cannot be determined, the buffer will contain the
> + * empty string.
> + *
> + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
> + * disabled by the caller before calling us, and re-enabled only after the
> + * caller is done with the pointer.

Is this enough? What if we fill the buffer and then an interrupt comes and the
handler calls here again? We overwrite the buffer and potentially report a wrong
cgroup after the execution resumes?
If nothing worse can happen (are interrupts disabled while the ftrace code is
copying from the buffer?), then it's probably ok?
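
To make the concern concrete, here is a small userspace model of the hazard.
It is only a sketch: shared_buf stands in for one CPU's trace_memcg_path, the
"interrupt" is simulated by a nested call at the vulnerable point, and all
function names are invented for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define PATH_MAX_LEN 64

static char shared_buf[PATH_MAX_LEN]; /* models one CPU's trace_memcg_path */

/* Models get_mm_memcg_path(): write the path and return the shared buffer. */
static const char *fill_path(const char *path)
{
	assert(strlen(path) < PATH_MAX_LEN);
	snprintf(shared_buf, sizeof(shared_buf), "%s", path);
	return shared_buf;
}

/* Models the trace event copying the string into the ring buffer. */
static void record_event(const char *path, char *out, size_t out_len)
{
	snprintf(out, out_len, "%s", path);
}

/*
 * Emit an event for task_path. If irq_path is non-NULL, model an interrupt
 * that arrives after the buffer is filled but before it is consumed, and
 * whose handler traces an event of its own.
 */
static void emit_event_shared(const char *task_path, const char *irq_path,
			      char *out, size_t out_len)
{
	const char *p = fill_path(task_path);

	if (irq_path) {
		char irq_out[PATH_MAX_LEN];

		record_event(fill_path(irq_path), irq_out, sizeof(irq_out));
	}
	record_event(p, out, out_len); /* p may now hold the handler's path */
}
```

With a simulated interrupt, the event for the interrupted context is recorded
with the handler's path, which is the "wrong cgroup" outcome described above.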

> + */
> +static const char *get_mm_memcg_path(struct mm_struct *mm)
> +{
> + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> +
> + if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> + char *buf = this_cpu_ptr(trace_memcg_path);
> +
> + cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
> + return buf;
> + }
> + return "";
> +}
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
> + do { \
> + get_cpu(); \
> + trace_mmap_lock_##type(mm, get_mm_memcg_path(mm), \
> + ##__VA_ARGS__); \
> + put_cpu(); \
> + } while (0)
> +
> +#else /* !CONFIG_MEMCG */
> +
> +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
> + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
> +
> +#endif /* CONFIG_MEMCG */
> +
> +/*
> + * Trace calls must be in a separate file, as otherwise there's a circular
> + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> + */
> +
> +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> +{
> + TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> +
> +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> + bool success)
> +{
> + TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> +
> +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> +{
> + TRACE_MMAP_LOCK_EVENT(released, mm, write);
> +}
> +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
>

2020-10-23 19:04:20

by Vlastimil Babka

Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition

On 10/23/20 7:38 PM, Axel Rasmussen wrote:
> On Fri, Oct 23, 2020 at 7:00 AM Vlastimil Babka <[email protected]> wrote:
>>
>> On 10/20/20 8:47 PM, Axel Rasmussen wrote:
>> > The goal of these tracepoints is to be able to debug lock contention
>> > issues. This lock is acquired on most (all?) mmap / munmap / page fault
>> > operations, so a multi-threaded process which does a lot of these can
>> > experience significant contention.
>> >
>> > We trace just before we start acquisition, when the acquisition returns
>> > (whether it succeeded or not), and when the lock is released (or
>> > downgraded). The events are broken out by lock type (read / write).
>> >
>> > The events are also broken out by memcg path. For container-based
>> > workloads, users often think of several processes in a memcg as a single
>> > logical "task", so collecting statistics at this level is useful.
>> >
>> > The end goal is to get latency information. This isn't directly included
>> > in the trace events. Instead, users are expected to compute the time
>> > between "start locking" and "acquire returned", using e.g. synthetic
>> > events or BPF. The benefit we get from this is simpler code.
>> >
>> > Because we use tracepoint_enabled() to decide whether or not to trace,
>> > this patch has effectively no overhead unless tracepoints are enabled at
>> > runtime. If tracepoints are enabled, there is a performance impact, but
>> > how much depends on exactly what e.g. the BPF program does.
>> >
>> > Reviewed-by: Michel Lespinasse <[email protected]>
>> > Acked-by: Yafang Shao <[email protected]>
>> > Acked-by: David Rientjes <[email protected]>
>> > Signed-off-by: Axel Rasmussen <[email protected]>
>>
>> All seem fine to me, except I started to wonder..
>>
>> > +
>> > +#ifdef CONFIG_MEMCG
>> > +
>> > +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
>> > +
>> > +/*
>> > + * Write the given mm_struct's memcg path to a percpu buffer, and return a
>> > + * pointer to it. If the path cannot be determined, the buffer will contain the
>> > + * empty string.
>> > + *
>> > + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
>> > + * disabled by the caller before calling us, and re-enabled only after the
>> > + * caller is done with the pointer.
>>
>> Is this enough? What if we fill the buffer and then an interrupt comes and the
>> handler calls here again? We overwrite the buffer and potentially report a wrong
>> cgroup after the execution resumes?
>> If nothing worse can happen (are interrupts disabled while the ftrace code is
>> copying from the buffer?), then it's probably ok?
>
> I think you're right, get_cpu()/put_cpu() only deals with preemption,
> not interrupts.
>
> I'm somewhat sure this code can be called in interrupt context, so I
> don't think we can use locks to prevent this situation. I think it
> works like this: say we acquire the lock, an interrupt happens, and
> then we try to acquire again on the same CPU; we can't sleep, so we're
> stuck.

Yes, we could perhaps trylock() and if it fails, give up on the memcg path.
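
A minimal userspace sketch of that idea, assuming a single "busy" flag per
buffer. The C11 atomic_flag stands in for whatever per-CPU primitive would
really be used, and path_buf_tryget(), path_buf_put() and trace_with_path()
are hypothetical names, not existing kernel functions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>

#define PATH_MAX_LEN 64

static atomic_flag buf_busy = ATOMIC_FLAG_INIT; /* models a per-CPU busy bit */
static char trylock_buf[PATH_MAX_LEN];

/* Return the buffer, or NULL if an interrupted context already owns it. */
static char *path_buf_tryget(void)
{
	if (atomic_flag_test_and_set(&buf_busy))
		return NULL;
	return trylock_buf;
}

static void path_buf_put(void)
{
	atomic_flag_clear(&buf_busy);
}

/*
 * Report path through the buffer if it is free; otherwise give up on the
 * path and report the empty string. Returns 1 if the path was reported.
 */
static int trace_with_path(const char *path, char *out, size_t out_len)
{
	char *buf = path_buf_tryget();

	assert(out_len > 0);
	if (!buf) {
		out[0] = '\0';
		return 0;
	}
	snprintf(buf, PATH_MAX_LEN, "%s", path);
	snprintf(out, out_len, "%s", buf);
	path_buf_put();
	return 1;
}
```

The nested context never waits, so it cannot deadlock against the context it
interrupted; the cost is that its event loses the memcg path.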

> I think we can't kmalloc here (instead of a percpu buffer) either,
> since I would guess that kmalloc may also acquire mmap_lock itself?

The overhead is not worth it anyway, for a tracepoint.

> Is adding local_irq_save()/local_irq_restore() in addition to
> get_cpu()/put_cpu() sufficient?

If you do that, then I guess you don't need get_cpu()/put_cpu() anymore. But
it's more costly.

But sounds like we are solving something that the tracing subsystem has to solve
as well to store the trace event data, so maybe Steven has some better idea?

>>
>> > + */
>> > +static const char *get_mm_memcg_path(struct mm_struct *mm)
>> > +{
>> > + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>> > +
>> > + if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
>> > + char *buf = this_cpu_ptr(trace_memcg_path);
>> > +
>> > + cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
>> > + return buf;
>> > + }
>> > + return "";
>> > +}
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
>> > + do { \
>> > + get_cpu(); \
>> > + trace_mmap_lock_##type(mm, get_mm_memcg_path(mm), \
>> > + ##__VA_ARGS__); \
>> > + put_cpu(); \
>> > + } while (0)
>> > +
>> > +#else /* !CONFIG_MEMCG */
>> > +
>> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
>> > + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
>> > +
>> > +#endif /* CONFIG_MEMCG */
>> > +
>> > +/*
>> > + * Trace calls must be in a separate file, as otherwise there's a circular
>> > + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
>> > + */
>> > +
>> > +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
>> > +{
>> > + TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
>> > +
>> > +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
>> > + bool success)
>> > +{
>> > + TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
>> > +
>> > +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
>> > +{
>> > + TRACE_MMAP_LOCK_EVENT(released, mm, write);
>> > +}
>> > +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
>> >
>>
>

2020-10-23 22:55:25

by Axel Rasmussen

Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition

On Fri, Oct 23, 2020 at 7:00 AM Vlastimil Babka <[email protected]> wrote:
>
> On 10/20/20 8:47 PM, Axel Rasmussen wrote:
> > The goal of these tracepoints is to be able to debug lock contention
> > issues. This lock is acquired on most (all?) mmap / munmap / page fault
> > operations, so a multi-threaded process which does a lot of these can
> > experience significant contention.
> >
> > We trace just before we start acquisition, when the acquisition returns
> > (whether it succeeded or not), and when the lock is released (or
> > downgraded). The events are broken out by lock type (read / write).
> >
> > The events are also broken out by memcg path. For container-based
> > workloads, users often think of several processes in a memcg as a single
> > logical "task", so collecting statistics at this level is useful.
> >
> > The end goal is to get latency information. This isn't directly included
> > in the trace events. Instead, users are expected to compute the time
> > between "start locking" and "acquire returned", using e.g. synthetic
> > events or BPF. The benefit we get from this is simpler code.
> >
> > Because we use tracepoint_enabled() to decide whether or not to trace,
> > this patch has effectively no overhead unless tracepoints are enabled at
> > runtime. If tracepoints are enabled, there is a performance impact, but
> > how much depends on exactly what e.g. the BPF program does.
> >
> > Reviewed-by: Michel Lespinasse <[email protected]>
> > Acked-by: Yafang Shao <[email protected]>
> > Acked-by: David Rientjes <[email protected]>
> > Signed-off-by: Axel Rasmussen <[email protected]>
>
> All seem fine to me, except I started to wonder..
>
> > +
> > +#ifdef CONFIG_MEMCG
> > +
> > +DEFINE_PER_CPU(char[MAX_FILTER_STR_VAL], trace_memcg_path);
> > +
> > +/*
> > + * Write the given mm_struct's memcg path to a percpu buffer, and return a
> > + * pointer to it. If the path cannot be determined, the buffer will contain the
> > + * empty string.
> > + *
> > + * Note: buffers are allocated per-cpu to avoid locking, so preemption must be
> > + * disabled by the caller before calling us, and re-enabled only after the
> > + * caller is done with the pointer.
>
> Is this enough? What if we fill the buffer and then an interrupt comes and the
> handler calls here again? We overwrite the buffer and potentially report a wrong
> cgroup after the execution resumes?
> If nothing worse can happen (are interrupts disabled while the ftrace code is
> copying from the buffer?), then it's probably ok?

I think you're right, get_cpu()/put_cpu() only deals with preemption,
not interrupts.

I'm somewhat sure this code can be called in interrupt context, so I
don't think we can use locks to prevent this situation. I think it
works like this: say we acquire the lock, an interrupt happens, and
then we try to acquire again on the same CPU; we can't sleep, so we're
stuck.

I think we can't kmalloc here (instead of a percpu buffer) either,
since I would guess that kmalloc may also acquire mmap_lock itself?

Is adding local_irq_save()/local_irq_restore() in addition to
get_cpu()/put_cpu() sufficient?

>
> > + */
> > +static const char *get_mm_memcg_path(struct mm_struct *mm)
> > +{
> > + struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> > +
> > + if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
> > + char *buf = this_cpu_ptr(trace_memcg_path);
> > +
> > + cgroup_path(memcg->css.cgroup, buf, MAX_FILTER_STR_VAL);
> > + return buf;
> > + }
> > + return "";
> > +}
> > +
> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
> > + do { \
> > + get_cpu(); \
> > + trace_mmap_lock_##type(mm, get_mm_memcg_path(mm), \
> > + ##__VA_ARGS__); \
> > + put_cpu(); \
> > + } while (0)
> > +
> > +#else /* !CONFIG_MEMCG */
> > +
> > +#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
> > + trace_mmap_lock_##type(mm, "", ##__VA_ARGS__)
> > +
> > +#endif /* CONFIG_MEMCG */
> > +
> > +/*
> > + * Trace calls must be in a separate file, as otherwise there's a circular
> > + * dependency between linux/mmap_lock.h and trace/events/mmap_lock.h.
> > + */
> > +
> > +void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
> > +{
> > + TRACE_MMAP_LOCK_EVENT(start_locking, mm, write);
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_start_locking);
> > +
> > +void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
> > + bool success)
> > +{
> > + TRACE_MMAP_LOCK_EVENT(acquire_returned, mm, write, success);
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_acquire_returned);
> > +
> > +void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)
> > +{
> > + TRACE_MMAP_LOCK_EVENT(released, mm, write);
> > +}
> > +EXPORT_SYMBOL(__mmap_lock_do_trace_released);
> >
>

2020-10-26 14:56:35

by Steven Rostedt

Subject: Re: [PATCH v4 1/1] mmap_lock: add tracepoints around lock acquisition

On Fri, 23 Oct 2020 19:56:49 +0200
Vlastimil Babka <[email protected]> wrote:

> > I'm somewhat sure this code can be called in interrupt context, so I
> > don't think we can use locks to prevent this situation. I think it
> > works like this: say we acquire the lock, an interrupt happens, and
> > then we try to acquire again on the same CPU; we can't sleep, so we're
> > stuck.
>
> Yes, we could perhaps trylock() and if it fails, give up on the memcg path.
>
> > I think we can't kmalloc here (instead of a percpu buffer) either,
> > since I would guess that kmalloc may also acquire mmap_lock itself?
>
> the overhead is not worth it anyway, for a tracepoint
>
> > Is adding local_irq_save()/local_irq_restore() in addition to
> > get_cpu()/put_cpu() sufficient?
>
> If you do that, then I guess you don't need get_cpu()/put_cpu() anymore. But
> it's more costly.
>
> But sounds like we are solving something that the tracing subsystem has to solve
> as well to store the trace event data, so maybe Steven has some better idea?

How big of a buffer do you need? The way that ftrace handles reserving
buffer for events (which coincidentally, I just talked about today at Open
Source Summit Europe!), is that I simply use local_add_return() and have a
buffer index. And the stack trace code does this as well.

For a temporary buffer, you would allocate it when the tracepoint is
enabled (which is why we have a "reg" and "unreg" form of the TRACE_EVENT
macro, called TRACE_EVENT_FN, for "function"). Make this temporary buffer
per cpu and size it for all the contexts that it could be called in. For
ftrace, we usually make it 4 (normal, softirq, irq and NMI context). Ftrace
uses local_add_return(), but since I'm guessing the interrupt context will
be done with its buffer after writing the event, you don't need to worry
about the counter being atomic.

You simply need to do:

	DEFINE_PER_CPU(char *, my_buffer);
	DEFINE_PER_CPU(int, my_buf_idx);

At initialization:

	for_each_possible_cpu(cpu) {
		/* One MY_BUFF_SIZE slot per context that can nest on a CPU. */
		per_cpu(my_buffer, cpu) = kmalloc(MY_BUFF_SIZE * context_needed,
						  GFP_KERNEL);
		if (!per_cpu(my_buffer, cpu))
			goto out_fail;
		per_cpu(my_buf_idx, cpu) = 0;
	}


Then for the event:

	preempt_disable();
	/* Claim the next slot; idx is the offset just past it. */
	idx = this_cpu_add_return(my_buf_idx, MY_BUFF_SIZE);
	current_buffer = this_cpu_read(my_buffer);
	buf = &current_buffer[idx - MY_BUFF_SIZE];

	copy_my_data_to_buffer(buf);
	trace_my_trace_point(buf);

	/* Release the slot before leaving this context. */
	this_cpu_sub(my_buf_idx, MY_BUFF_SIZE);
	preempt_enable();



Now if an interrupt were to come in, it would do the same thing, but would
use the part of the buffer after the first MY_BUFF_SIZE bytes, so you don't
need to worry about one corrupting the other. Once the index has been
incremented, an interrupt will use the portion of the buffer after the
"allocated" part. And even if it happened right at the index increment, the
interrupt would put the index back before returning to the context that it
interrupted.
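
The nesting behaviour can be checked with a small userspace model, given here
as a sketch only. Plain globals stand in for one CPU's per-cpu variables, the
interrupt is simulated by a nested call, and buf_get(), buf_put() and
emit_event_nested() are names invented for the example.

```c
#include <assert.h>
#include <stdio.h>

#define MY_BUFF_SIZE 64
#define CONTEXTS 4 /* normal, softirq, irq, NMI */

/* Plain globals modelling a single CPU's per-cpu buffer and index. */
static char my_buffer[MY_BUFF_SIZE * CONTEXTS];
static int my_buf_idx;

/* Claim the next free slot; models the add-and-return on the index. */
static char *buf_get(void)
{
	my_buf_idx += MY_BUFF_SIZE;
	assert(my_buf_idx <= MY_BUFF_SIZE * CONTEXTS);
	return &my_buffer[my_buf_idx - MY_BUFF_SIZE];
}

/* Release the most recently claimed slot. */
static void buf_put(void)
{
	my_buf_idx -= MY_BUFF_SIZE;
	assert(my_buf_idx >= 0);
}

/*
 * Emit an event for task_path. If irq_path is non-NULL, model an interrupt
 * that traces its own event while this context still owns its slot.
 */
static void emit_event_nested(const char *task_path, const char *irq_path,
			      char *out, size_t out_len)
{
	char *buf = buf_get();

	snprintf(buf, MY_BUFF_SIZE, "%s", task_path);
	if (irq_path) {
		char irq_out[MY_BUFF_SIZE];

		emit_event_nested(irq_path, NULL, irq_out, sizeof(irq_out));
	}
	snprintf(out, out_len, "%s", buf); /* the slot is still intact */
	buf_put();
}
```

Because the nested call claims and releases its own slot, the outer context
still sees its own data, and the index is back at zero once both are done.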

Is this what you need to deal with?

-- Steve