syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
is that an ongoing trace event might race with the tracepoint being
disabled (and therefore the _unreg() callback being called). Consider
this ordering:
T1: trace event fires, get_mm_memcg_path() is called
T1: get_memcg_path_buf() returns a buffer pointer
T2: trace_mmap_lock_unreg() is called, buffers are freed
T1: cgroup_path() is called with the now-freed buffer
The solution in this commit is to modify trace_mmap_lock_unreg() to
first stop new buffers from being handed out, and then to wait (spin)
until any existing buffer references are dropped (i.e., those trace
events complete).
I have a simple reproducer program which spins up two pools of threads,
doing the following in a tight loop:
Pool 1:
mmap(NULL, 4096, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)
munmap()
Pool 2:
echo 1 > /sys/kernel/debug/tracing/events/mmap_lock/enable
echo 0 > /sys/kernel/debug/tracing/events/mmap_lock/enable
This triggers the use-after-free very quickly. With this patch, I let it
run for an hour without any BUGs.
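A minimal sketch of such a reproducer (not the exact program used; the
thread counts, mapping size, and tracefs path are illustrative, and it
assumes root with tracefs mounted at /sys/kernel/debug/tracing):

/*
 * Hypothetical reproducer sketch: two mmap/munmap threads race against
 * two threads toggling the mmap_lock trace events.
 * Build with: gcc -O2 -pthread repro.c -o repro
 */
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

#define ENABLE_PATH "/sys/kernel/debug/tracing/events/mmap_lock/enable"

static void *mmap_pool(void *arg)
{
        (void)arg;
        for (;;) {
                void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p != MAP_FAILED)
                        munmap(p, 4096);
        }
        return NULL;
}

static void *toggle_pool(void *arg)
{
        (void)arg;
        for (;;) {
                int fd = open(ENABLE_PATH, O_WRONLY);

                if (fd < 0)
                        continue;
                write(fd, "1", 1);
                write(fd, "0", 1);
                close(fd);
        }
        return NULL;
}

int main(void)
{
        pthread_t t[4];

        pthread_create(&t[0], NULL, mmap_pool, NULL);
        pthread_create(&t[1], NULL, mmap_pool, NULL);
        pthread_create(&t[2], NULL, toggle_pool, NULL);
        pthread_create(&t[3], NULL, toggle_pool, NULL);
        pause();        /* Run until interrupted; watch dmesg for splats. */
        return 0;
}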
While fixing this, I also noticed and fixed a css ref leak. Previously
we called get_mem_cgroup_from_mm(), but we never called css_put() to
release that reference. get_mm_memcg_path() now does this properly.
[1]: https://syzkaller.appspot.com/bug?extid=19e6dd9943972fa1c58a
Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock acquisition")
Signed-off-by: Axel Rasmussen <[email protected]>
---
mm/mmap_lock.c | 100 +++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 85 insertions(+), 15 deletions(-)
diff --git a/mm/mmap_lock.c b/mm/mmap_lock.c
index 12af8f1b8a14..be38dc58278b 100644
--- a/mm/mmap_lock.c
+++ b/mm/mmap_lock.c
@@ -3,6 +3,7 @@
#include <trace/events/mmap_lock.h>
#include <linux/mm.h>
+#include <linux/atomic.h>
#include <linux/cgroup.h>
#include <linux/memcontrol.h>
#include <linux/mmap_lock.h>
@@ -18,13 +19,28 @@ EXPORT_TRACEPOINT_SYMBOL(mmap_lock_released);
#ifdef CONFIG_MEMCG
/*
- * Our various events all share the same buffer (because we don't want or need
- * to allocate a set of buffers *per event type*), so we need to protect against
- * concurrent _reg() and _unreg() calls, and count how many _reg() calls have
- * been made.
+ * This is unfortunately complicated... _reg() and _unreg() may be called
+ * in parallel, separately for each of our three event types. To save memory,
+ * all of the event types share the same buffers. Furthermore, trace events
+ * might happen in parallel with _unreg(); we need to ensure we don't free the
+ * buffers before all inflights have finished. Because these events happen
+ * "frequently", we also want to prevent new inflights from starting once the
+ * _unreg() process begins. And, for performance reasons, we want to avoid any
+ * locking in the trace event path.
+ *
+ * So:
+ *
+ * - Use a spinlock to serialize _reg() and _unreg() calls.
+ * - Keep track of nested _reg() calls with a lock-protected counter.
+ * - Define a flag indicating whether or not unregistration has begun (and
+ * therefore that there should be no new buffer uses going forward).
+ * - Keep track of inflight buffer users with a reference count.
*/
static DEFINE_SPINLOCK(reg_lock);
-static int reg_refcount;
+static int reg_types_rc; /* Protected by reg_lock. */
+static bool unreg_started; /* Doesn't need synchronization. */
+/* atomic_t instead of refcount_t, as we want ordered inc without locks. */
+static atomic_t inflight_rc = ATOMIC_INIT(0);
/*
* Size of the buffer for memcg path names. Ignoring stack trace support,
@@ -46,9 +62,14 @@ int trace_mmap_lock_reg(void)
unsigned long flags;
int cpu;
+ /*
+ * Serialize _reg() and _unreg(). Without this, e.g. _unreg() might
+ * start cleaning up while _reg() is only partially completed.
+ */
spin_lock_irqsave(&reg_lock, flags);
- if (reg_refcount++)
+ /* If the refcount is going 0->1, proceed with allocating buffers. */
+ if (reg_types_rc++)
goto out;
for_each_possible_cpu(cpu) {
@@ -62,6 +83,11 @@ int trace_mmap_lock_reg(void)
per_cpu(memcg_path_buf_idx, cpu) = 0;
}
+ /* Reset unreg_started flag, allowing new trace events. */
+ WRITE_ONCE(unreg_started, false);
+ /* Add the registration +1 to the inflight refcount. */
+ atomic_inc(&inflight_rc);
+
out:
spin_unlock_irqrestore(&reg_lock, flags);
return 0;
@@ -74,7 +100,8 @@ int trace_mmap_lock_reg(void)
break;
}
- --reg_refcount;
+ /* Since we failed, undo the earlier increment. */
+ --reg_types_rc;
spin_unlock_irqrestore(&reg_lock, flags);
return -ENOMEM;
@@ -87,9 +114,23 @@ void trace_mmap_lock_unreg(void)
spin_lock_irqsave(&reg_lock, flags);
- if (--reg_refcount)
+ /* If the refcount is going 1->0, proceed with freeing buffers. */
+ if (--reg_types_rc)
goto out;
+ /* This was the last registration; start preventing new events... */
+ WRITE_ONCE(unreg_started, true);
+ /* Remove the registration +1 from the inflight refcount. */
+ atomic_dec(&inflight_rc);
+ /*
+ * Wait for inflight refcount to be zero (all inflights stopped). Since
+ * we have a spinlock we can't sleep, so just spin. Because trace events
+ * are "fast", and because we stop new inflights from starting at this
+ * point with unreg_started, this should be a short spin.
+ */
+ while (atomic_read(&inflight_rc))
+ barrier();
+
for_each_possible_cpu(cpu) {
kfree(per_cpu(memcg_path_buf, cpu));
}
@@ -102,6 +143,20 @@ static inline char *get_memcg_path_buf(void)
{
int idx;
+ /*
+ * If unregistration is happening, stop. Yes, this check is racy;
+ * that's fine. It just means _unreg() might spin waiting for an extra
+ * event or two. Use-after-free is actually prevented by the refcount.
+ */
+ if (READ_ONCE(unreg_started))
+ return NULL;
+ /*
+ * Take a reference, unless the registration +1 has been released
+ * and there aren't already existing inflights (refcount is zero).
+ */
+ if (!atomic_inc_not_zero(&inflight_rc))
+ return NULL;
+
idx = this_cpu_add_return(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE) -
MEMCG_PATH_BUF_SIZE;
return &this_cpu_read(memcg_path_buf)[idx];
@@ -110,27 +165,42 @@ static inline char *get_memcg_path_buf(void)
static inline void put_memcg_path_buf(void)
{
this_cpu_sub(memcg_path_buf_idx, MEMCG_PATH_BUF_SIZE);
+ /* We're done with this buffer; drop the reference. */
+ atomic_dec(&inflight_rc);
}
/*
* Write the given mm_struct's memcg path to a percpu buffer, and return a
- * pointer to it. If the path cannot be determined, NULL is returned.
+ * pointer to it. If the path cannot be determined, or no buffer was available
+ * (because the trace event is being unregistered), NULL is returned.
*
* Note: buffers are allocated per-cpu to avoid locking, so preemption must be
* disabled by the caller before calling us, and re-enabled only after the
* caller is done with the pointer.
+ *
+ * The caller must call put_memcg_path_buf() once the buffer is no longer
+ * needed. This must be done while preemption is still disabled.
*/
static const char *get_mm_memcg_path(struct mm_struct *mm)
{
+ char *buf = NULL;
struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
- if (memcg != NULL && likely(memcg->css.cgroup != NULL)) {
- char *buf = get_memcg_path_buf();
+ if (memcg == NULL)
+ goto out;
+ if (unlikely(memcg->css.cgroup == NULL))
+ goto out_put;
- cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
- return buf;
- }
- return NULL;
+ buf = get_memcg_path_buf();
+ if (buf == NULL)
+ goto out_put;
+
+ cgroup_path(memcg->css.cgroup, buf, MEMCG_PATH_BUF_SIZE);
+
+out_put:
+ css_put(&memcg->css);
+out:
+ return buf;
}
#define TRACE_MMAP_LOCK_EVENT(type, mm, ...) \
--
2.29.2.454.gaff20da3a2-goog
On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen <[email protected]> wrote:
>
> syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
> is that an ongoing trace event might race with the tracepoint being
> disabled (and therefore the _unreg() callback being called). Consider
> this ordering:
>
> T1: trace event fires, get_mm_memcg_path() is called
> T1: get_memcg_path_buf() returns a buffer pointer
> T2: trace_mmap_lock_unreg() is called, buffers are freed
> T1: cgroup_path() is called with the now-freed buffer
Any reason to use the cgroup_path instead of the cgroup_ino? There are
other examples of trace points using cgroup_ino and no need to
allocate buffers. Also cgroup namespace might complicate the path
usage.
> Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock acquisition")
The original patch is in the mm tree, so the SHA1 is not stable.
Usually Andrew squashes the fixes into the original patches.
On Mon, 30 Nov 2020 15:35:04 -0800
Axel Rasmussen <[email protected]> wrote:
Looking at the original patch that this fixes, I'm thinking: why use a
spinlock in the reg/unreg callers? Registering and unregistering a
tracepoint can sleep (it takes mutexes), so the reg/unreg callbacks can
sleep too. As the use of get_mm_memcg_path() is done under
preempt_disable(), which acts as an RCU read-side critical section, you
could simply change the unregister to:
void trace_mmap_lock_unreg(void)
{
        int cpu;

        mutex_lock(&reg_lock);

        if (--reg_refcount)
                goto out;

        /* Make sure all users of memcg_path_buf are done */
        synchronize_rcu();

        for_each_possible_cpu(cpu) {
                kfree(per_cpu(memcg_path_buf, cpu));
        }
out:
        mutex_unlock(&reg_lock);
}
Obviously, you would need to change reg_lock to a mutex in the _reg()
function as well.
-- Steve
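A minimal sketch of what the matching _reg() side could look like under
this suggestion (illustrative only, not code from the thread; the per-cpu
allocation loop is elided and the reg_refcount handling is assumed to
stay as in the original):

static DEFINE_MUTEX(reg_lock);
static int reg_refcount;        /* Protected by reg_lock. */

int trace_mmap_lock_reg(void)
{
        int ret = 0;

        mutex_lock(&reg_lock);

        if (reg_refcount++)
                goto out;

        /*
         * ... allocate the per-cpu memcg_path_buf buffers here, as in the
         * original _reg(). With a mutex held instead of a spinlock,
         * sleeping GFP_KERNEL allocations become possible. On failure,
         * undo the increment and set ret = -ENOMEM.
         */
out:
        mutex_unlock(&reg_lock);
        return ret;
}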
On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <[email protected]> wrote:
>
> Any reason to use the cgroup_path instead of the cgroup_ino? There are
> other examples of trace points using cgroup_ino and no need to
> allocate buffers. Also cgroup namespace might complicate the path
> usage.
Hmm, so in general I would love to use a numeric identifier instead of a string.
I did some reading, and it looks like the cgroup_ino() mainly has to
do with writeback, instead of being just a general identifier?
https://www.kernel.org/doc/Documentation/cgroup-v2.txt
There is cgroup_id() which I think is almost what I'd want, but there
are a couple problems with it:
- I don't know of a way for userspace to translate IDs -> paths, to
make them human readable?
- Also I think the ID implementation we use for this is "dense",
meaning if a cgroup is removed, its ID is likely to be quickly reused.
> > Fixes: 0f818c4bc1f3 ("mm: mmap_lock: add tracepoints around lock acquisition")
>
> The original patch is in the mm tree, so the SHA1 is not stable.
> Usually Andrew squashes the fixes into the original patches.
Ah, I added this because it also shows up in linux-next, under the
next-20201130 tag. I'll remove it in v2, squashing is fine. :)
Axel Rasmussen <[email protected]> wrote:
> On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <[email protected]> wrote:
>>
>> Any reason to use the cgroup_path instead of the cgroup_ino? There are
>> other examples of trace points using cgroup_ino and no need to
>> allocate buffers. Also cgroup namespace might complicate the path
>> usage.
>
> Hmm, so in general I would love to use a numeric identifier instead of a string.
>
> I did some reading, and it looks like the cgroup_ino() mainly has to
> do with writeback, instead of being just a general identifier?
> https://www.kernel.org/doc/Documentation/cgroup-v2.txt
>
> There is cgroup_id() which I think is almost what I'd want, but there
> are a couple problems with it:
>
> - I don't know of a way for userspace to translate IDs -> paths, to
> make them human readable?
The id => name map can be built from user space with a tree walk.
Example:
$ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main]
20387 init.scope
31 system.slice
> - Also I think the ID implementation we use for this is "dense",
> meaning if a cgroup is removed, its ID is likely to be quickly reused.
On Tue, Dec 1, 2020 at 9:56 AM Greg Thelen <[email protected]> wrote:
>
> Axel Rasmussen <[email protected]> wrote:
>
> > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <[email protected]> wrote:
> >> Any reason to use the cgroup_path instead of the cgroup_ino? There are
> >> other examples of trace points using cgroup_ino and no need to
> >> allocate buffers. Also cgroup namespace might complicate the path
> >> usage.
> >
> > Hmm, so in general I would love to use a numeric identifier instead of a string.
> >
> > I did some reading, and it looks like the cgroup_ino() mainly has to
> > do with writeback, instead of being just a general identifier?
> > https://www.kernel.org/doc/Documentation/cgroup-v2.txt
I think you are confusing cgroup inodes with real filesystem inodes in that doc.
> >
> > There is cgroup_id() which I think is almost what I'd want, but there
> > are a couple problems with it:
> >
> > - I don't know of a way for userspace to translate IDs -> paths, to
> > make them human readable?
>
> The id => name map can be built from user space with a tree walk.
> Example:
>
> $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main]
> 20387 init.scope
> 31 system.slice
>
> > - Also I think the ID implementation we use for this is "dense",
> > meaning if a cgroup is removed, its ID is likely to be quickly reused.
> >
The IDs for cgroup nodes (kernfs underneath) are allocated from
idr_alloc_cyclic(), which hands out each new ID after the last allocated
one and wraps after around INT_MAX IDs. So the likelihood of repetition
is very low. Also, the file_handle returned by name_to_handle_at() for
cgroupfs contains the inode ID, which gives confidence to the claim that
ID reuse is unlikely.
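A small userspace sketch (hypothetical; the cgroup path is illustrative)
of reading that identifier from cgroupfs, both the inode number that
Greg's find example prints and the kernfs file handle from
name_to_handle_at():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1]
                                    : "/sys/fs/cgroup/memory/system.slice";
        struct file_handle *fh;
        struct stat st;
        int mount_id;

        /* The inode number is what 'find -printf %i' prints. */
        if (stat(path, &st) == 0)
                printf("%s: st_ino=%llu\n", path,
                       (unsigned long long)st.st_ino);

        /* name_to_handle_at() returns a kernfs file handle for cgroupfs. */
        fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
        if (!fh)
                return 1;
        fh->handle_bytes = MAX_HANDLE_SZ;
        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) == 0)
                printf("handle_type=%d handle_bytes=%u\n",
                       fh->handle_type, fh->handle_bytes);
        else
                perror("name_to_handle_at");
        free(fh);
        return 0;
}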
On Tue, Dec 1, 2020 at 10:42 AM Shakeel Butt <[email protected]> wrote:
> The IDs for cgroup nodes (kernfs underneath) are allocated from
> idr_alloc_cyclic(), which hands out each new ID after the last allocated
> one and wraps after around INT_MAX IDs. So the likelihood of repetition
> is very low. Also, the file_handle returned by name_to_handle_at() for
> cgroupfs contains the inode ID, which gives confidence to the claim that
> ID reuse is unlikely.
Ah, for some reason I remembered it using idr_alloc(), but you're
right, it does use cyclical IDs. Even so, tracepoints which expose
these IDs would still be difficult to use I think. Say we're trying to
collect a histogram of lock latencies over the course of some test
we're running. At the end, we want to produce some kind of
human-readable report.
cgroups may come and go throughout the test. Even if we never re-use
IDs, in order to be able to map all of them to human-readable paths,
it seems like we'd need some background process to poll the
/sys/fs/cgroup/memory directory tree as Greg described, keeping track
of the ID<->path mapping. This seems expensive, and even if we poll
relatively frequently we might still miss short-lived cgroups.
Trying to aggregate such statistics across physical machines, or
reboots of the same machine, is further complicated. The machine(s)
may be running the same application, which runs in a container with
the same path, but it'll end up with different IDs. So we'd have to
collect the ID<->path mapping from each, and then try to match up the
names for aggregation.
+Tejun Heo
On Tue, Dec 1, 2020 at 11:14 AM Axel Rasmussen <[email protected]> wrote:
> Ah, for some reason I remembered it using idr_alloc(), but you're
> right, it does use cyclical IDs. Even so, tracepoints which expose
> these IDs would still be difficult to use I think.
The writeback tracepoint in include/trace/events/writeback.h is
already using cgroup IDs. Actually it used to use cgroup_path but was
converted to cgroup_ino.
Tejun, how do you use these tracepoints?
> Say we're trying to
> collect a histogram of lock latencies over the course of some test
> we're running. At the end, we want to produce some kind of
> human-readable report.
>
I am assuming the test infra and the tracing infra are decoupled
entities and test infra is orchestrating the cgroups as well.
> cgroups may come and go throughout the test. Even if we never re-use
> IDs, in order to be able to map all of them to human-readable paths,
> it seems like we'd need some background process to poll the
> /sys/fs/cgroup/memory directory tree as Greg described, keeping track
> of the ID<->path mapping. This seems expensive, and even if we poll
> relatively frequently we might still miss short-lived cgroups.
>
> Trying to aggregate such statistics across physical machines, or
> reboots of the same machine, is further complicated. The machine(s)
> may be running the same application, which runs in a container with
> the same path, but it'll end up with different IDs. So we'd have to
> collect the ID<->path mapping from each, and then try to match up the
> names for aggregation.
How about adding another tracepoint in cgroup_create which will output
the ID along with the name or path? With a little post processing you
can get the same information. Also note that if the test is
deleting/creating the cgroup with the same name, you will miss that
information if filtering with just path.
IMHO cgroup IDs will make the kernel code much simpler with the
tradeoff of a bit more work in user space.
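A rough sketch of what such a tracepoint could look like (hypothetical
event and argument names; a real version would need the usual
TRACE_SYSTEM header boilerplate and a call site in the cgroup creation
path):

TRACE_EVENT(cgroup_create_id,

        TP_PROTO(struct cgroup *cgrp, const char *path),

        TP_ARGS(cgrp, path),

        TP_STRUCT__entry(
                __field(u64, id)
                __string(path, path)
        ),

        TP_fast_assign(
                __entry->id = cgroup_id(cgrp);
                __assign_str(path, path);
        ),

        TP_printk("id=%llu path=%s",
                  (unsigned long long)__entry->id, __get_str(path))
);

With something like this, post-processing can join the numeric ID
emitted by the mmap_lock events back to a human-readable path.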
On Tue, Dec 1, 2020 at 12:53 PM Shakeel Butt <[email protected]> wrote:
>
> How about adding another tracepoint in cgroup_create which will output
> the ID along with the name or path? With a little post processing you
> can get the same information. Also note that if the test is
> deleting/creating the cgroup with the same name, you will miss that
> information if filtering with just path.
>
> IMHO cgroup IDs will make the kernel code much simpler with the
> tradeoff of a bit more work in user space.
I like this idea! I think userspace can use the synthetic trace event
API to construct an event which includes the strings, like the one
I've added, if we had this separate ID<->path mapping tracepoint. If
so, it would be just as easy for userspace to use, but it would let us
deal with integer IDs everywhere else in the kernel, and keep the
complexity related to dealing with buffers limited to just one place.
That said, I'd prefer to pursue this as a follow-up thing, rather than
as part of fixing this bug. Seem reasonable?
On Tue, Dec 1, 2020 at 4:16 PM Axel Rasmussen <[email protected]> wrote:
>
> On Tue, Dec 1, 2020 at 12:53 PM Shakeel Butt <[email protected]> wrote:
> >
> > +Tejun Heo
> >
> > On Tue, Dec 1, 2020 at 11:14 AM Axel Rasmussen <[email protected]> wrote:
> > >
> > > On Tue, Dec 1, 2020 at 10:42 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Tue, Dec 1, 2020 at 9:56 AM Greg Thelen <[email protected]> wrote:
> > > > >
> > > > > Axel Rasmussen <[email protected]> wrote:
> > > > >
> > > > > > On Mon, Nov 30, 2020 at 5:34 PM Shakeel Butt <[email protected]> wrote:
> > > > > >>
> > > > > >> On Mon, Nov 30, 2020 at 3:43 PM Axel Rasmussen <[email protected]> wrote:
> > > > > >> >
> > > > > >> > syzbot reported[1] a use-after-free introduced in 0f818c4bc1f3. The bug
> > > > > >> > is that an ongoing trace event might race with the tracepoint being
> > > > > >> > disabled (and therefore the _unreg() callback being called). Consider
> > > > > >> > this ordering:
> > > > > >> >
> > > > > >> > T1: trace event fires, get_mm_memcg_path() is called
> > > > > >> > T1: get_memcg_path_buf() returns a buffer pointer
> > > > > >> > T2: trace_mmap_lock_unreg() is called, buffers are freed
> > > > > >> > T1: cgroup_path() is called with the now-freed buffer
> > > > > >>
> > > > > >> Any reason to use the cgroup_path instead of the cgroup_ino? There are
> > > > > >> other examples of trace points using cgroup_ino and no need to
> > > > > >> allocate buffers. Also cgroup namespace might complicate the path
> > > > > >> usage.
> > > > > >
> > > > > > Hmm, so in general I would love to use a numeric identifier instead of a string.
> > > > > >
> > > > > > I did some reading, and it looks like the cgroup_ino() mainly has to
> > > > > > do with writeback, instead of being just a general identifier?
> > > > > > https://www.kernel.org/doc/Documentation/cgroup-v2.txt
> > > >
> > > > I think you are confusing cgroup inodes with real filesystem inodes in that doc.
> > > >
> > > > > >
> > > > > > There is cgroup_id() which I think is almost what I'd want, but there
> > > > > > are a couple problems with it:
> > > > > >
> > > > > > - I don't know of a way for userspace to translate IDs -> paths, to
> > > > > > make them human readable?
> > > > >
> > > > > The id => name map can be built from user space with a tree walk.
> > > > > Example:
> > > > >
> > > > > $ find /sys/fs/cgroup/memory -type d -printf '%i %P\n' # ~ [main]
> > > > > 20387 init.scope
> > > > > 31 system.slice
> > > > >
> > > > > > - Also I think the ID implementation we use for this is "dense",
> > > > > > meaning if a cgroup is removed, its ID is likely to be quickly reused.
> > > > > >
> > > >
> > > > The IDs for cgroup nodes (kernfs underneath) are allocated with
> > > > idr_alloc_cyclic(), which hands out each new ID after the last allocated
> > > > one and wraps only after around INT_MAX IDs. So the likelihood of
> > > > repetition is very low. Also, the file_handle returned by
> > > > name_to_handle_at() for cgroupfs contains the inode ID, which supports
> > > > the claim that ID reuse is unlikely.
> > >
> > > Ah, for some reason I remembered it using idr_alloc(), but you're
> > > right, it does use cyclical IDs. Even so, tracepoints which expose
> > > these IDs would still be difficult to use I think.
> >
> > The writeback tracepoint in include/trace/events/writeback.h is
> > already using the cgroup IDs. Actually it used to use cgroup_path but
> > converted to cgroup_ino.
> >
> > Tejun, how do you use these tracepoints?
> >
> > > Say we're trying to
> > > collect a histogram of lock latencies over the course of some test
> > > we're running. At the end, we want to produce some kind of
> > > human-readable report.
> > >
> >
> > I am assuming the test infra and the tracing infra are decoupled
> > entities and test infra is orchestrating the cgroups as well.
> >
> > > cgroups may come and go throughout the test. Even if we never re-use
> > > IDs, in order to be able to map all of them to human-readable paths,
> > > it seems like we'd need some background process to poll the
> > > /sys/fs/cgroup/memory directory tree as Greg described, keeping track
> > > of the ID<->path mapping. This seems expensive, and even if we poll
> > > relatively frequently we might still miss short-lived cgroups.
> > >
> > > Trying to aggregate such statistics across physical machines, or
> > > reboots of the same machine, is further complicated. The machine(s)
> > > may be running the same application, which runs in a container with
> > > the same path, but it'll end up with different IDs. So we'd have to
> > > collect the ID<->path mapping from each, and then try to match up the
> > > names for aggregation.
> >
> > How about adding another tracepoint in cgroup_create which will output
> > the ID along with the name or path? With a little post processing you
> > can get the same information. Also note that if the test is
> > deleting/creating the cgroup with the same name, you will miss that
> > information if filtering with just path.
> >
> > IMHO cgroup IDs will make the kernel code much simpler with the
> > tradeoff of a bit more work in user space.
>
> I like this idea! If we had this separate ID<->path mapping
> tracepoint, I think userspace could use the synthetic trace event API
> to construct an event which includes the strings, like the one I've
> added. If so, it would be just as easy for userspace to use, but it
> would let us deal with integer IDs everywhere else in the kernel, and
> keep the complexity of dealing with buffers limited to just one place.
>
> That said, I'd prefer to pursue this as a follow-up, rather than as
> part of fixing this bug. Seem reasonable?
SGTM, but note that Andrew usually squashes all the patches into one
before sending to Linus. If you plan to replace the path buffer with
integer IDs, there's no need to spend time fixing the buffer-related bug.
On Tue, 1 Dec 2020 16:36:32 -0800
Shakeel Butt <[email protected]> wrote:
> SGTM, but note that Andrew usually squashes all the patches into one
> before sending to Linus. If you plan to replace the path buffer with
> integer IDs, there's no need to spend time fixing the buffer-related bug.
I don't think Andrew squashes all the patches. I believe he sends Linus
a patch series.
Andrew?
-- Steve
On Tue, Dec 1, 2020 at 5:07 PM Steven Rostedt <[email protected]> wrote:
>
> On Tue, 1 Dec 2020 16:36:32 -0800
> Shakeel Butt <[email protected]> wrote:
>
> > SGTM, but note that Andrew usually squashes all the patches into one
> > before sending to Linus. If you plan to replace the path buffer with
> > integer IDs, there's no need to spend time fixing the buffer-related bug.
>
> I don't think Andrew squashes all the patches. I believe he sends Linus
> a patch series.
I am talking about the patch and the following fixes to that patch.
Those are usually squashed into one patch.
Hello,
On Tue, Dec 01, 2020 at 12:53:46PM -0800, Shakeel Butt wrote:
> The writeback tracepoint in include/trace/events/writeback.h is
> already using the cgroup IDs. Actually it used to use cgroup_path but
> converted to cgroup_ino.
>
> Tejun, how do you use these tracepoints?
There've been some changes to cgroup ids recently and now cgroup id, ino and
its file_handle are all compatible. On 64bit ino machines, they're all the
same and won't be reused. On 32bit ino machines, the lower 32bit of full id
is used as ino. ino may be reused but not the full 64bit id.
You can map back cgroup id to path from userspace using open_by_handle_at().
The following is an example program which does path -> cgrp id -> path
mappings.
#define _GNU_SOURCE
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdint.h>

#ifndef FILEID_KERNFS
#define FILEID_KERNFS 0xfe
#endif

struct fh_store {
        struct file_handle fh;
        char stor[MAX_HANDLE_SZ];
};

uint64_t path_to_cgrp_id(const char *path)
{
        struct fh_store fh_store;
        struct file_handle *fh = &fh_store.fh;
        int mnt_id;

        fh->handle_bytes = MAX_HANDLE_SZ;

        if (name_to_handle_at(AT_FDCWD, path, fh, &mnt_id, 0)) {
                perror("name_to_handle_at");
                abort();
        }

        if (fh->handle_type != FILEID_KERNFS) {
                fprintf(stderr, "invalid handle_type 0x%x\n", fh->handle_type);
                abort();
        }

        return *(uint64_t *)fh->f_handle;
}

void cgrp_id_to_path(uint64_t cgrp_id, char *path_buf)
{
        struct fh_store fh_store;
        struct file_handle *fh = &fh_store.fh;
        char proc_path[PATH_MAX];
        int mnt_fd, fd;

        fh->handle_type = FILEID_KERNFS;
        fh->handle_bytes = sizeof(uint64_t);
        *(uint64_t *)fh->f_handle = cgrp_id;

        mnt_fd = open("/sys/fs/cgroup", O_RDONLY);
        if (mnt_fd < 0) {
                perror("open(\"/sys/fs/cgroup\")");
                abort();
        }

        fd = open_by_handle_at(mnt_fd, fh, O_RDONLY);
        if (fd < 0) {
                perror("open_by_handle_at");
                abort();
        }

        snprintf(proc_path, PATH_MAX, "/proc/self/fd/%d", fd);
        printf("proc_path=%s\n", proc_path);

        if (readlink(proc_path, path_buf, PATH_MAX) < 0) {
                perror("readlink");
                abort();
        }
}

int main(int argc, char **argv)
{
        char path_buf[PATH_MAX + 1] = "";
        uint64_t cgrp_id;

        if (argc != 2) {
                fprintf(stderr, "Usage: test-cgrp-id CGROUP_PATH\n");
                return 1;
        }

        cgrp_id = path_to_cgrp_id(argv[1]);
        printf("cgrp_id=%llu\n", (unsigned long long)cgrp_id);

        cgrp_id_to_path(cgrp_id, path_buf);
        printf("cgrp_path=%s\n", path_buf);

        return 0;
}
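(A minimal usage sketch for the program above, assuming it is saved as
test-cgrp-id.c, a cgroup2 hierarchy is mounted at /sys/fs/cgroup, and a
system.slice cgroup exists; note that open_by_handle_at() requires
CAP_DAC_READ_SEARCH, so the id -> path direction has to run as root:)
gcc -o test-cgrp-id test-cgrp-id.c
sudo ./test-cgrp-id /sys/fs/cgroup/system.slice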
On Wed, Dec 2, 2020 at 11:01 AM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Tue, Dec 01, 2020 at 12:53:46PM -0800, Shakeel Butt wrote:
> > The writeback tracepoint in include/trace/events/writeback.h is
> > already using the cgroup IDs. Actually it used to use cgroup_path but
> > converted to cgroup_ino.
> >
> > Tejun, how do you use these tracepoints?
>
Thanks Tejun, I really appreciate the example you provided. I do have one query.
> There've been some changes to cgroup ids recently and now cgroup id, ino and
> its file_handle are all compatible. On 64bit ino machines, they're all the
> same and won't be reused. On 32bit ino machines, the lower 32bit of full id
> is used as ino. ino may be reused but not the full 64bit id.
__kernfs_new_node() uses idr_alloc_cyclic(), which returns a 32bit
ID. If I understand this correctly, the full ID is generated the same
way on 32bit and 64bit machines, but on 64bit machines the whole ID is
the inode number, while on 32bit machines the lower 32 bits are the
inode number. Is that correct?
Hello,
On Wed, Dec 02, 2020 at 03:23:57PM -0800, Shakeel Butt wrote:
> > There've been some changes to cgroup ids recently and now cgroup id, ino and
> > its file_handle are all compatible. On 64bit ino machines, they're all the
> > same and won't be reused. On 32bit ino machines, the lower 32bit of full id
> > is used as ino. ino may be reused but not the full 64bit id.
>
> __kernfs_new_node() uses idr_alloc_cyclic(), which returns a 32bit
> ID. If I understand this correctly, the full ID is generated the same
> way on 32bit and 64bit machines, but on 64bit machines the whole ID is
> the inode number, while on 32bit machines the lower 32 bits are the
> inode number. Is that correct?
Yes, that's correct.
Thanks.
--
tejun
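(A small illustration of the relationship described above, assuming the
current kernfs encoding in which the low 32 bits of the 64-bit id double
as the inode number and the high 32 bits act as a generation counter;
the id value below is made up:)
id=21474840579    # hypothetical id, (5 << 32) + 4099
echo "ino=$((id & 0xffffffff)) gen=$((id >> 32))"
# prints: ino=4099 gen=5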
On 12/2/20 2:11 AM, Shakeel Butt wrote:
> On Tue, Dec 1, 2020 at 5:07 PM Steven Rostedt <[email protected]> wrote:
>>
>> On Tue, 1 Dec 2020 16:36:32 -0800
>> Shakeel Butt <[email protected]> wrote:
>>
>> > SGTM, but note that Andrew usually squashes all the patches into one
>> > before sending to Linus. If you plan to replace the path buffer with
>> > integer IDs, there's no need to spend time fixing the buffer-related bug.
>>
>> I don't think Andrew squashes all the patches. I believe he sends Linus
>> a patch series.
>
> I am talking about the patch and the following fixes to that patch.
> Those are usually squashed into one patch.
Yeah, if there's a way forward that avoids constructing the full path (and the
associated complexity) on each event and just uses an ID, let's convert to the
ID and squash it, for less churn. Especially if there are other existing
tracepoints that use the ID.
If there's further (somewhat orthogonal) work to make the IDs easier for
userspace, it can be added on top later, but we really shouldn't need to add
the current complex solution only to remove it later?
Thanks,
Vlastimil
On Fri, Dec 4, 2020 at 8:36 AM Vlastimil Babka <[email protected]> wrote:
>
> On 12/2/20 2:11 AM, Shakeel Butt wrote:
> > On Tue, Dec 1, 2020 at 5:07 PM Steven Rostedt <[email protected]> wrote:
> >>
> >> On Tue, 1 Dec 2020 16:36:32 -0800
> >> Shakeel Butt <[email protected]> wrote:
> >>
> >> > SGTM, but note that Andrew usually squashes all the patches into one
> >> > before sending to Linus. If you plan to replace the path buffer with
> >> > integer IDs, there's no need to spend time fixing the buffer-related bug.
> >>
> >> I don't think Andrew squashes all the patches. I believe he sends Linus
> >> a patch series.
> >
> > I am talking about the patch and the following fixes to that patch.
> > Those are usually squashed into one patch.
>
> Yeah, if there's a way forward that avoids constructing the full path (and the
> associated complexity) on each event and just uses an ID, let's convert to the
> ID and squash it, for less churn. Especially if there are other existing
> tracepoints that use the ID.
The thing I worry about is that, if this is inconvenient, folks won't
use it to send us data on how mmap_lock behaves under their workloads.
Anecdotally, I talked to a few teams within Google, and it seems those
using containers refer to them by path rather than by ID, and they
don't have existing infrastructure to keep track of the mapping. So to
collect data from those workloads, we'd have to wait for them to build
such a thing, vs. just asking them to run a bash one-liner. I haven't
done a wider survey, but I suspect the same is true for users of e.g.
Docker or Kubernetes.
>
> If there's further (somewhat orthogonal) work to make the IDs easier for
> userspace, it can be added on top later, but we really shouldn't need to add
> the current complex solution only to remove it later?
I'm on board with Shakeel's idea to export this on cgroup
creation/deletion instead, since it lets us keep the complexity in
exactly one place. I think such a thing would use the same code I'm
proposing now, though, so we wouldn't just be deleting it if we merge
it. Also, before doing that I think it's worth identifying a second
user within the kernel (maybe the writeback tracepoint mentioned
earlier in the thread?), so doing it incrementally seems reasonable to
me.
This plan is also in line with existing stuff we provide. Histogram
triggers provide utilities like ".execname" (PID -> execname) or
".sym" (address -> symbol) for convenience. Sure, userspace could
figure these things out for itself, but it's convenient to provide
them directly, and it's not so bad since we have exactly one place in
the tracing infrastructure that knows how to do these translations.
Actually, maybe a ".cgpath" variant would be better than a separate
tracepoint. I haven't looked at what either approach would require in
detail; maybe another reason to do this iteratively. :)
>
> Thanks,
> Vlastimil
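(For reference, the kind of hist-trigger convenience referred to above:
the documented .execname modifier resolves common_pid to a command name
at event time, and a hypothetical .cgpath modifier would play the same
role for cgroup ids. The event name below is assumed to be the one added
by the mmap_lock tracepoint series and may differ:)
echo 'hist:keys=common_pid.execname' > \
    /sys/kernel/debug/tracing/events/mmap_lock/mmap_lock_acquire_returned/trigger
cat /sys/kernel/debug/tracing/events/mmap_lock/mmap_lock_acquire_returned/hist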