2020-05-06 16:25:24

by Qian Cai

Subject: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

== task_struct leaks ==
There are task_struct leaks from time to time where someone forgot to call put_task_struct() somewhere. For example,

https://lore.kernel.org/lkml/[email protected]/

It is such a pain to debug this kind of leak at the moment, as all we can do is audit the code by checking all new put_task_struct() and get_task_struct() call sites, which is error-prone because there could be other new call sites like get_pid_task() that also need to be balanced with put_task_struct().

What do you think about adding some aux call traces for kmemleak in general? For example, if the tracked object is a task_struct, it would save call traces for the first and last calls of both get_task_struct() and put_task_struct(). Then, it could be expanded to track other refcount-based leaks in the future.
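
For illustration only, the hooks could look something like this (kmemleak_track_get()/kmemleak_track_put() are made-up names for the proposed aux-trace hooks, not existing kmemleak API; the refcount handling just mirrors the current helpers in include/linux/sched/task.h):

static inline struct task_struct *get_task_struct(struct task_struct *t)
{
        refcount_inc(&t->usage);
        kmemleak_track_get(t);  /* record the caller against t's kmemleak object */
        return t;
}

static inline void put_task_struct(struct task_struct *t)
{
        kmemleak_track_put(t);  /* record the caller before the potential free */
        if (refcount_dec_and_test(&t->usage))
                __put_task_struct(t);
}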

== call_rcu() leaks ==
Another issue that might be relevant is that sometimes kmemleak will give a lot of false positives (hundreds) because the memory was supposed to be freed by call_rcu() (for example, in dst_release()), but for some reason that takes a long time, probably waiting for grace periods or some kind of RCU stall, while the memory has already become an orphan. I am not sure how to resolve this properly other than figuring out why call_rcu() is taking so long to finish.

Another solution is to add aux call traces for both skb_dst_drop() and skb_dst_set() for this case, but there are many other places that free memory via call_rcu(), like inode freeing, etc.


2020-05-06 17:42:36

by Paul E. McKenney

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Wed, May 06, 2020 at 12:22:37PM -0400, Qian Cai wrote:
> == task_struct leaks ==
> There are task_struct leaks from time to time where someone forgot to call put_task_struct() somewhere. For example,
>
> https://lore.kernel.org/lkml/[email protected]/
>
> It is such a pain to debug this kind of leak at the moment, as all we can do is audit the code by checking all new put_task_struct() and get_task_struct() call sites, which is error-prone because there could be other new call sites like get_pid_task() that also need to be balanced with put_task_struct().
>
> What do you think about adding some aux call traces for kmemleak in general? For example, if the tracked object is a task_struct, it would save call traces for the first and last calls of both get_task_struct() and put_task_struct(). Then, it could be expanded to track other refcount-based leaks in the future.
>
> == call_rcu() leaks ==
> Another issue that might be relevant is that sometimes kmemleak will give a lot of false positives (hundreds) because the memory was supposed to be freed by call_rcu() (for example, in dst_release()), but for some reason that takes a long time, probably waiting for grace periods or some kind of RCU stall, while the memory has already become an orphan. I am not sure how to resolve this properly other than figuring out why call_rcu() is taking so long to finish.

I know nothing about kmemleak, but I won't let that stop me from making
random suggestions...

One approach is to do an rcu_barrier() inside kmemleak just before
printing leaked blocks, and check to see if any are still leaked after
the rcu_barrier().
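
Something like this, purely as a sketch (I am assuming kmemleak_scan() stands for whatever the existing scan/report entry point is; the kmemleak folks would know where the barrier really belongs):

static void kmemleak_report_after_rcu_flush(void)
{
        kmemleak_scan();        /* first pass: find candidate leaks as usual */

        /*
         * Wait for all previously queued call_rcu() callbacks to run, so
         * that anything that was only waiting for a grace period gets
         * freed (and hence untracked) here.
         */
        rcu_barrier();

        kmemleak_scan();        /* second pass: report only what is still unreferenced */
}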

If kmemleak works on crash dumps, another approach is to scan RCU's
callback lists. This will miss those callbacks that rcu_do_batch()
was in the middle of invoking, though. It also misses cases where
someone passes a linked structure to call_rcu(), and then frees the
structure piece by piece within the callback function.

> Another solution is to add aux call traces for both skb_dst_drop() and skb_dst_set() for this case, but there are many other places that free memory via call_rcu(), like inode freeing, etc.

And call_rcu() has no idea where the memory starts. And again, sometimes
there is memory linked from that passed to call_rcu() that will be freed
by the callback function.

In theory, these linked-structure cases could be handled by checking
the callback function and then traversing the links. I wouldn't be
that ambitious, but don't let me discourage you. ;-)

Thanx, Paul

2020-05-07 17:16:23

by Catalin Marinas

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Wed, May 06, 2020 at 10:40:19AM -0700, Paul E. McKenney wrote:
> On Wed, May 06, 2020 at 12:22:37PM -0400, Qian Cai wrote:
> > == call_rcu() leaks ==
> > Another issue that might be relevant is that sometimes kmemleak
> > will give a lot of false positives (hundreds) because the memory
> > was supposed to be freed by call_rcu() (for example, in
> > dst_release()), but for some reason that takes a long time,
> > probably waiting for grace periods or some kind of RCU stall,
> > while the memory has already become an orphan. I am not sure how
> > to resolve this properly other than figuring out why call_rcu()
> > is taking so long to finish.
>
> I know nothing about kmemleak, but I won't let that stop me from making
> random suggestions...
>
> One approach is to do an rcu_barrier() inside kmemleak just before
> printing leaked blocks, and check to see if any are still leaked after
> the rcu_barrier().

The main issue is that kmemleak doesn't stop the world when scanning
(which can take over a minute, depending on your hardware), so we get
lots of transient pointer misses. There are some heuristics but
obviously they don't always work.

With RCU, objects are queued for RCU freeing later and chained via
rcu_head.next (IIUC). Under load, this list can be pretty volatile and
if this happens during kmemleak scanning, it's sufficient to lose track
of a next pointer and the rest of the list would be reported as a leak.

I think rcu_barrier() just before starting the kmemleak scan may
help if it reduces the number of objects queued.

Now, I wonder whether kmemleak itself can break this RCU chain. The
kmemleak metadata is allocated on a slab alloc callback. The freeing,
however, is done using call_rcu() because originally calling back into
the slab freeing from kmemleak_free() didn't go well. Since the
kmemleak_object structure is not tracked by kmemleak, I wonder whether
its rcu_head would break this directed pointer reference graph.

Let's try the rcu_barrier() first and I'll think about the metadata case
over the weekend.

Thanks.

--
Catalin

2020-05-07 17:18:22

by Catalin Marinas

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Wed, May 06, 2020 at 12:22:37PM -0400, Qian Cai wrote:
> What do you think about adding some aux call traces for kmemleak in
> general? For example, if the tracked object is a task_struct, it
> would save call traces for the first and last calls of both
> get_task_struct() and put_task_struct(). Then, it could be expanded to
> track other refcount-based leaks in the future.

I don't mind adding additional tracking info if it helps with debugging.
But if it's for improving false positives, I'd prefer to look deeper
into why the pointer reference graph tracking failed.

Thanks.

--
Catalin

2020-05-07 17:30:52

by Qian Cai

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()



> On May 7, 2020, at 1:16 PM, Catalin Marinas <[email protected]> wrote:
>
> I don't mind adding additional tracking info if it helps with debugging.
> But if it's for improving false positives, I'd prefer to look deeper
> into why the pointer reference graph tracking failed.

No, the task struct leaks are real leaks. It is just painful to figure out the missing or misplaced put_task_struct() from the kmemleak reports at the moment.

2020-05-07 17:56:47

by Paul E. McKenney

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Thu, May 07, 2020 at 06:14:19PM +0100, Catalin Marinas wrote:
> On Wed, May 06, 2020 at 10:40:19AM -0700, Paul E. McKenney wrote:
> > On Wed, May 06, 2020 at 12:22:37PM -0400, Qian Cai wrote:
> > > == call_rcu() leaks ==
> > > Another issue that might be relevant is that sometimes kmemleak
> > > will give a lot of false positives (hundreds) because the memory
> > > was supposed to be freed by call_rcu() (for example, in
> > > dst_release()), but for some reason that takes a long time,
> > > probably waiting for grace periods or some kind of RCU stall,
> > > while the memory has already become an orphan. I am not sure how
> > > to resolve this properly other than figuring out why call_rcu()
> > > is taking so long to finish.
> >
> > I know nothing about kmemleak, but I won't let that stop me from making
> > random suggestions...
> >
> > One approach is to do an rcu_barrier() inside kmemleak just before
> > printing leaked blocks, and check to see if any are still leaked after
> > the rcu_barrier().
>
> The main issue is that kmemleak doesn't stop the world when scanning
> (which can take over a minute, depending on your hardware), so we get
> lots of transient pointer misses. There are some heuristics but
> obviously they don't always work.
>
> With RCU, objects are queued for RCU freeing later and chained via
> rcu_head.next (IIUC). Under load, this list can be pretty volatile and
> if this happens during kmemleak scanning, it's sufficient to lose track
> of a next pointer and the rest of the list would be reported as a leak.
>
> I think rcu_barrier() just before starting the kmemleak scan may
> help if it reduces the number of objects queued.

It might, especially if the call_rcu() rate is lower after the
rcu_barrier() than it was beforehand. Which might well be the case when
a large cleanup activity ended just before rcu_barrier() was invoked.

> Now, I wonder whether kmemleak itself can break this RCU chain. The
> kmemleak metadata is allocated on a slab alloc callback. The freeing,
> however, is done using call_rcu() because originally calling back into
> the slab freeing from kmemleak_free() didn't go well. Since the
> kmemleak_object structure is not tracked by kmemleak, I wonder whether
> its rcu_head would break this directed pointer reference graph.

It is true that kmemleak could treat being passed to call_rcu() as
being freed. However, it would need to know the rcu_head offset.
And there are (or were) a few places that pass linked structures to
call_rcu(), and kmemleak would presumably need to mark them all free
at that point. Or maybe accept the much lower false-positive rate from
not marking them.

> Let's try the rcu_barrier() first and I'll think about the metadata case
> over the weekend.

Looking forward to hearing how it goes!

Thanx, Paul

2020-05-09 10:12:46

by Catalin Marinas

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Thu, May 07, 2020 at 01:29:04PM -0400, Qian Cai wrote:
> On May 7, 2020, at 1:16 PM, Catalin Marinas <[email protected]> wrote:
> > I don't mind adding additional tracking info if it helps with debugging.
> > But if it's for improving false positives, I'd prefer to look deeper
> > into why the pointer reference graph tracking failed.
>
> No, the task struct leaks are real leaks. It is just painful to figure
> out the missing or misplaced put_task_struct() from the kmemleak
> reports at the moment.

We could log the callers to get_task_struct() and put_task_struct(),
something like __builtin_return_address(0) (how does this work if the
function is inlined?). If it's not the full backtrace, it shouldn't slow
down kmemleak considerably. I don't think it's worth logging only the
first/last calls to get/put. You'd hope that put is called in reverse
order to get.

I think it may be better if this is added as a new allocation pointed to
from kmemleak_object rather than increasing this structure since it will
be added on a case by case basis. When dumping the leak information, it
would also dump the get/put calls, in the order they were called. We
could add some simple refcount tracking (++ for get, -- for put) to
easily notice any imbalance.
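
Something along these lines, just as a sketch (the structure and field names below are made up; the only assumption is that kmemleak_object gains one pointer to such an aux block, allocated on demand):

struct kmemleak_refcount_event {
        unsigned long caller;   /* e.g. __builtin_return_address(0) */
        int delta;              /* +1 for get, -1 for put */
};

struct kmemleak_aux {
        int balance;                                    /* running get/put balance */
        unsigned int nr_events;
        struct kmemleak_refcount_event events[64];      /* arbitrary cap */
};

When dumping a leaked object, we'd walk events[] in order and print the callers; a non-zero balance would make the imbalance obvious at a glance.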

I'm pretty busy next week but happy to review if you have a patch ;).

--
Catalin

2020-05-10 21:31:20

by Qian Cai

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()



> On May 9, 2020, at 5:44 AM, Catalin Marinas <[email protected]> wrote:
>
> On Thu, May 07, 2020 at 01:29:04PM -0400, Qian Cai wrote:
>> On May 7, 2020, at 1:16 PM, Catalin Marinas <[email protected]> wrote:
>>> I don't mind adding additional tracking info if it helps with debugging.
>>> But if it's for improving false positives, I'd prefer to look deeper
>>> into why the pointer reference graph tracking failed.
>>
>> No, the task struct leaks are real leaks. It is just painful to figure
>> out the missing or misplaced put_task_struct() from the kmemleak
>> reports at the moment.
>
> We could log the callers to get_task_struct() and put_task_struct(),
> something like __builtin_return_address(0) (how does this work if the
> function is inlined?). If it's not the full backtrace, it shouldn't slow
> down kmemleak considerably. I don't think it's worth logging only the
> first/last calls to get/put. You'd hope that put is called in reverse
> order to get.
>
> I think it may be better if this is added as a new allocation pointed to
> from kmemleak_object rather than increasing this structure since it will
> be added on a case by case basis. When dumping the leak information, it
> would also dump the get/put calls, in the order they were called. We
> could add some simple refcount tracking (++ for get, -- for put) to
> easily notice any imbalance.
>
> I'm pretty busy next week but happy to review if you have a patch ;).

I am still thinking about a more generic way to handle all those refcount-based leaks without needing manual annotation of all those places. Today, I had another one,

unreferenced object 0xe6ff008924f28500 (size 128):
  comm "qemu-kvm", pid 4835, jiffies 4295141828 (age 6944.120s)
  hex dump (first 32 bytes):
    01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  ....kkkk.....N..
    ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
  backtrace:
    [<000000005ed1a868>] slab_post_alloc_hook+0x74/0x9c
    [<00000000c65ee7dc>] kmem_cache_alloc_trace+0x2b4/0x3d4
    [<000000009efa9e6e>] do_eventfd+0x54/0x1ac
    [<000000001146e724>] __arm64_sys_eventfd2+0x34/0x44
    [<0000000096fc3a61>] do_el0_svc+0x128/0x1dc
    [<000000005ae8f980>] el0_sync_handler+0xd0/0x268
    [<0000000043f2c790>] el0_sync+0x164/0x180

That one is an unbalanced eventfd_ctx_fileget() / eventfd_ctx_put() pair.

2020-05-12 14:18:16

by Catalin Marinas

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Sun, May 10, 2020 at 05:27:41PM -0400, Qian Cai wrote:
> On May 9, 2020, at 5:44 AM, Catalin Marinas <[email protected]> wrote:
> > On Thu, May 07, 2020 at 01:29:04PM -0400, Qian Cai wrote:
> >> On May 7, 2020, at 1:16 PM, Catalin Marinas <[email protected]> wrote:
> >>> I don't mind adding additional tracking info if it helps with debugging.
> >>> But if it's for improving false positives, I'd prefer to look deeper
> >>> into why the pointer reference graph tracking failed.
> >>
> >> No, the task struct leaks are real leaks. It is just painful to figure
> >> out the missing or misplaced put_task_struct() from the kmemleak
> >> reports at the moment.
> >
> > We could log the callers to get_task_struct() and put_task_struct(),
> > something like __builtin_return_address(0) (how does this work if the
> > function is inlined?). If it's not the full backtrace, it shouldn't slow
> > down kmemleak considerably. I don't think it's worth logging only the
> > first/last calls to get/put. You'd hope that put is called in reverse
> > order to get.
> >
> > I think it may be better if this is added as a new allocation pointed to
> > from kmemleak_object rather than increasing this structure since it will
> > be added on a case by case basis. When dumping the leak information, it
> > would also dump the get/put calls, in the order they were called. We
> > could add some simple refcount tracking (++ for get, -- for put) to
> > easily notice any imbalance.
> >
> > I'm pretty busy next week but happy to review if you have a patch ;).
>
> I am still thinking about a more generic way to handle all those
> refcount-based leaks without needing manual annotation of all those
> places. Today, I had another one,
>
> unreferenced object 0xe6ff008924f28500 (size 128):
>   comm "qemu-kvm", pid 4835, jiffies 4295141828 (age 6944.120s)
>   hex dump (first 32 bytes):
>     01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de  ....kkkk.....N..
>     ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff  ....kkkk........
>   backtrace:
>     [<000000005ed1a868>] slab_post_alloc_hook+0x74/0x9c
>     [<00000000c65ee7dc>] kmem_cache_alloc_trace+0x2b4/0x3d4
>     [<000000009efa9e6e>] do_eventfd+0x54/0x1ac
>     [<000000001146e724>] __arm64_sys_eventfd2+0x34/0x44
>     [<0000000096fc3a61>] do_el0_svc+0x128/0x1dc
>     [<000000005ae8f980>] el0_sync_handler+0xd0/0x268
>     [<0000000043f2c790>] el0_sync+0x164/0x180
>
> That one is an unbalanced eventfd_ctx_fileget() / eventfd_ctx_put() pair.

In this case it uses kref_get() to increment the refcount. We could add
a kmemleak_add_trace() which allocates a new array and stores the stack
trace, linked to the original object. Similarly for kref_put().
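
As a very rough sketch (kmemleak_add_trace() is a made-up name for the proposed interface, and the object lookup is only indicated by a comment):

#include <linux/kernel.h>
#include <linux/stacktrace.h>

void kmemleak_add_trace(const void *ptr, int delta)
{
        unsigned long entries[4];       /* a short trace is probably enough */
        unsigned int nr;

        /* skip this function itself; keep only the immediate callers */
        nr = stack_trace_save(entries, ARRAY_SIZE(entries), 1);

        /*
         * Look up the kmemleak_object covering ptr and append
         * (entries, nr, delta) to its aux trace array.
         */
}

kref_get()/kref_put() (or the relevant wrappers) would then call it with +1/-1.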

If we do this for each inc/dec call, I'd leave it off by default and
only enable it explicitly by cmdline argument or
/sys/kernel/debug/kmemleak when needed. In most cases you'd hope there is
no leak, so no point in tracking additional metadata. But if you do hit
a problem, just enable the additional tracking to help with the
debugging.

--
Catalin

2020-05-12 18:11:39

by Qian Cai

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()



> On May 12, 2020, at 10:15 AM, Catalin Marinas <[email protected]> wrote:
>
> In this case it uses kref_get() to increment the refcount. We could add
> a kmemleak_add_trace() which allocates a new array and stores the stack
> trace, linked to the original object. Similarly for kref_put().
>
> If we do this for each inc/dec call, I'd leave it off by default and
> only enable it explicitly by cmdline argument or
> /sys/kernel/debug/kmemleak when needed. In most cases you'd hope there is
> no leak, so no point in tracking additional metadata. But if you do hit
> a problem, just enable the additional tracking to help with the
> debugging.

Well, we would like those testing bots to report kmemleak findings (I know there would be many false positives) with that additional refcount-leak information in case they find any, although I have never seen one reported from those bots yet.

Since some of those bots will run fuzzers, it would be difficult to reproduce. Thus, the option has to be enabled by default somehow; otherwise, they could easily miss it in the first place. I'll look into whether we could make it fairly low overhead.

2020-05-13 11:19:17

by Catalin Marinas

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()

On Tue, May 12, 2020 at 02:09:30PM -0400, Qian Cai wrote:
>
>
> > On May 12, 2020, at 10:15 AM, Catalin Marinas <[email protected]> wrote:
> >
> > In this case it uses kref_get() to increment the refcount. We could add
> > a kmemleak_add_trace() which allocates a new array and stores the stack
> > trace, linked to the original object. Similarly for kref_put().
> >
> > If we do this for each inc/dec call, I'd leave it off by default and
> > only enable it explicitly by cmdline argument or
> > /sys/kernel/debug/kmemleak when needed. In most cases you'd hope there is
> > no leak, so no point in tracking additional metadata. But if you do hit
> > a problem, just enable the additional tracking to help with the
> > debugging.
>
> Well, we would like those testing bots to report kmemleak findings (I
> know there would be many false positives) with that additional
> refcount-leak information in case they find any, although I have never
> seen one reported from those bots yet.

I know the syzkaller guys tried to run the fuzzer with kmemleak enabled
and there were false positives that required human intervention. IIRC
they disabled it eventually. The proposal was for a new feature to
kmemleak to run the scanning under stop_machine() so that no other CPU
messes with linked lists etc. That would make kmemleak more reliable
under heavy load. Another option was to let the system cool down before
running the scanning.
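
The stop_machine() idea was roughly the following (a sketch only; kmemleak_scan() stands for the existing scan path, and whether a scan that long is acceptable under stop_machine() is precisely the open question):

#include <linux/stop_machine.h>

static int kmemleak_scan_stopped(void *unused)
{
        kmemleak_scan();        /* all other CPUs are parked with interrupts off */
        return 0;
}

/* in the scan thread, instead of calling kmemleak_scan() directly: */
stop_machine(kmemleak_scan_stopped, NULL, NULL);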

> Since some of those bots will run fuzzers, it would be difficult to
> reproduce. Thus, the option has to be enabled by default somehow;
> otherwise, they could easily miss it in the first place. I'll look
> into whether we could make it fairly low overhead.

I guess we don't need the full stack trace. About 4 function calls to
the refcount modification should be sufficient to get an idea.

--
Catalin

2020-05-17 22:35:36

by Qian Cai

Subject: Re: Kmemleak infrastructure improvement for task_struct leaks and call_rcu()



> On May 13, 2020, at 5:59 AM, Catalin Marinas <[email protected]> wrote:
>
> I guess we don't need the full stack trace. About 4 function calls to
> the refcount modification should be sufficient to get an idea.

That should work. I will probably only have cycles to code this after v5.7 is released (have to fight with regression bugs in linux-next first), so feel free to beat me to it.