2020-09-08 16:10:02

by Marco Elver

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote:
> On 9/8/20 5:31 PM, Marco Elver wrote:
> >>
> >> How much memory overhead does this end up having? I know it depends on
> >> the object size and so forth. But, could you give some real-world
> >> examples of memory consumption? Also, what's the worst case? Say I
> >> have a ton of worst-case-sized (32b) slab objects. Will I notice?
> >
> > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory
> > pool, no more KFENCE allocations will occur.
> > Documentation/dev-tools/kfence.rst gives a formula to calculate the
> > KFENCE pool size:
> >
> > The total memory dedicated to the KFENCE memory pool can be computed as::
> >
> > ( #objects + 1 ) * 2 * PAGE_SIZE
> >
> > Using the default config, and assuming a page size of 4 KiB, results in
> > dedicating 2 MiB to the KFENCE memory pool.
> >
> > Does that clarify this point? Or anything else that could help clarify
> > this?
>
> Hmm did you observe that with this limit, a long-running system would eventually
> converge to KFENCE memory pool being filled with long-aged objects, so there
> would be no space to sample new ones?

Sure, that's a possibility. But remember that we're not trying to
deterministically detect bugs on 1 system (if you wanted that, you
should use KASAN), but across a fleet of machines! The non-determinism
of which allocations end up in KFENCE ensures that we won't end up with
a fleet of machines with identical allocations. That's exactly what
we're after. Even if we eventually exhaust the pool, you'll still
detect bugs if there are any.

If you are overly worried, either the sample interval or the number of
available objects needs to be increased. The default of 255
is quite conservative, and even using something larger on a modern
system is hardly noticeable. Choosing a sample interval & number of
objects should also factor in how many machines you plan to deploy this
on. Monitoring /sys/kernel/debug/kfence/stats can help you here.
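
The pool-size formula quoted above is easy to check by hand. Below is a
minimal Python sketch of it; the function name kfence_pool_bytes is made
up for illustration, and only the formula, the default of 255 objects
and the 4 KiB page size come from this thread.

```python
def kfence_pool_bytes(num_objects: int, page_size: int = 4096) -> int:
    """Pool size per the documented formula: (#objects + 1) * 2 * PAGE_SIZE."""
    if num_objects < 1 or page_size < 1:
        raise ValueError("num_objects and page_size must be positive")
    return (num_objects + 1) * 2 * page_size


if __name__ == "__main__":
    # 255 is the default mentioned above; 1000 is just a larger example.
    for n in (255, 1000):
        size = kfence_pool_bytes(n)
        print(f"{n} objects -> {size} bytes ({size / 2**20:.2f} MiB)")
```

With 255 objects and 4 KiB pages this gives 2097152 bytes, i.e. the
2 MiB stated above; a pool of 1000 objects would need about 7.8 MiB.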

Thanks,
-- Marco


2020-09-11 07:37:34

by Dmitry Vyukov

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Tue, Sep 8, 2020 at 5:56 PM Marco Elver wrote:
>
> On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote:
> > On 9/8/20 5:31 PM, Marco Elver wrote:
> > >>
> > >> How much memory overhead does this end up having? I know it depends on
> > >> the object size and so forth. But, could you give some real-world
> > >> examples of memory consumption? Also, what's the worst case? Say I
> > >> have a ton of worst-case-sized (32b) slab objects. Will I notice?
> > >
> > > KFENCE objects are limited (default 255). If we exhaust KFENCE's memory
> > > pool, no more KFENCE allocations will occur.
> > > Documentation/dev-tools/kfence.rst gives a formula to calculate the
> > > KFENCE pool size:
> > >
> > > The total memory dedicated to the KFENCE memory pool can be computed as::
> > >
> > > ( #objects + 1 ) * 2 * PAGE_SIZE
> > >
> > > Using the default config, and assuming a page size of 4 KiB, results in
> > > dedicating 2 MiB to the KFENCE memory pool.
> > >
> > > Does that clarify this point? Or anything else that could help clarify
> > > this?
> >
> > Hmm did you observe that with this limit, a long-running system would eventually
> > converge to KFENCE memory pool being filled with long-aged objects, so there
> > would be no space to sample new ones?
>
> Sure, that's a possibility. But remember that we're not trying to
> deterministically detect bugs on 1 system (if you wanted that, you
> should use KASAN), but a fleet of machines! The non-determinism of which
> allocations will end up in KFENCE, will ensure we won't end up with a
> fleet of machines of identical allocations. That's exactly what we're
> after. Even if we eventually exhaust the pool, you'll still detect bugs
> if there are any.
>
> If you are overly worried, either the sample interval or number of
> available objects needs to be tweaked to be larger. The default of 255
> is quite conservative, and even using something larger on a modern
> system is hardly noticeable. Choosing a sample interval & number of
> objects should also factor in how many machines you plan to deploy this
> on. Monitoring /sys/kernel/debug/kfence/stats can help you here.

Hi Marco,

I reviewed the patches and they look good to me (minus some local
comments that I've left).

The main question/concern I have is what Vlastimil mentioned re
long-aged objects.
Is the default sample interval reasonable for typical
workloads? Do we have any guidelines on choosing the sample interval?
Should it depend on workload/use pattern?
By "reasonable" I mean: will the pool last long enough to still
sample something after hours/days? Have you tried any experiments with
some workload (both short-lived processes and long-lived
processes/namespaces) capturing the state of the pool? It can make sense
to do that to better understand the dynamics. I suspect that the rate
may need to be orders of magnitude lower.

Also I am wondering about the boot process (both kernel and init).
It's both inherently almost the same for the whole population of
machines and inherently produces persistent objects. Should we lower
the rate for the first minute of uptime? Or maybe make it proportional
to uptime?

I feel it's quite an important aspect. We can have this awesome idea
and implementation, but radically lower its utility by using a bad
sampling value (which will have a silent "failure mode" -- no bugs
detected).

But to make it clear: all of this does not conflict with the merge of
the first version. Just having a tunable sampling interval is good
enough. We will get the ultimate understanding only when we start
using it widely anyway.

2020-09-11 12:05:51

by Marco Elver

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov wrote:
> On Tue, Sep 8, 2020 at 5:56 PM Marco Elver wrote:
> > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote:
[...]
> > > Hmm did you observe that with this limit, a long-running system would eventually
> > > converge to KFENCE memory pool being filled with long-aged objects, so there
> > > would be no space to sample new ones?
> >
> > Sure, that's a possibility. But remember that we're not trying to
> > deterministically detect bugs on 1 system (if you wanted that, you
> > should use KASAN), but a fleet of machines! The non-determinism of which
> > allocations will end up in KFENCE, will ensure we won't end up with a
> > fleet of machines of identical allocations. That's exactly what we're
> > after. Even if we eventually exhaust the pool, you'll still detect bugs
> > if there are any.
> >
> > If you are overly worried, either the sample interval or number of
> > available objects needs to be tweaked to be larger. The default of 255
> > is quite conservative, and even using something larger on a modern
> > system is hardly noticeable. Choosing a sample interval & number of
> > objects should also factor in how many machines you plan to deploy this
> > on. Monitoring /sys/kernel/debug/kfence/stats can help you here.
>
> Hi Marco,
>
> I reviewed patches and they look good to me (minus some local comments
> that I've left).

Thank you.

> The main question/concern I have is what Vlastimil mentioned re
> long-aged objects.
> Is the default sample interval values reasonable for typical
> workloads? Do we have any guidelines on choosing the sample interval?
> Should it depend on workload/use pattern?

As I hinted at before, the sample interval & number of objects need
to depend on:
- number of machines,
- workload,
- acceptable overhead (performance, memory).

However, workload can vary greatly, and something more dynamic may be
needed. We do have the option to monitor
/sys/kernel/debug/kfence/stats and even change the sample interval at
runtime, e.g. from a user space tool that checks the currently used
objects, and as the pool gets closer to being exhausted, starts
increasing /sys/module/kfence/parameters/sample_interval.

Of course, if we figure out the best dynamic policy, we can add this
policy into the kernel. But I don't think it makes sense to hard-code
such a policy right now.
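
A user space tool along those lines could look roughly like the sketch
below. It is only an illustration: the function names, the 75%
threshold, the doubling policy and the upper limit are all assumptions
and not something KFENCE prescribes. The only things taken from this
thread are the idea of raising the sample interval as the pool fills up
and the path of the sample_interval parameter.

```python
SAMPLE_INTERVAL_PATH = "/sys/module/kfence/parameters/sample_interval"


def next_sample_interval(current_ms: int, allocated: int, capacity: int,
                         high_water: float = 0.75, max_ms: int = 10_000) -> int:
    """Return the sample interval (in ms) to use next.

    Hypothetical policy: once more than `high_water` of the pool is in
    use, double the interval (halving the sampling rate), capped at
    `max_ms`. In all other cases the interval is left unchanged.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    # Leave non-positive or already-large intervals alone.
    if current_ms <= 0 or current_ms >= max_ms:
        return current_ms
    if allocated / capacity > high_water:
        return min(current_ms * 2, max_ms)
    return current_ms


def write_sample_interval(interval_ms: int,
                          path: str = SAMPLE_INTERVAL_PATH) -> None:
    """Write the new interval; requires root and a writable parameter."""
    with open(path, "w") as f:
        f.write(f"{interval_ms}\n")
```

Such a tool would call next_sample_interval() once per polling period
with the current pool usage, and only write the parameter when the
returned value differs from the current one.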

> By "reasonable" I mean if the pool will last long enough to still
> sample something after hours/days? Have you tried any experiments with
> some workload (both short-lived processes and long-lived
> processes/namespaces) capturing state of the pool? It can make sense
> to do to better understand dynamics. I suspect that the rate may need
> to be orders of magnitude lower.

Yes, the current default sample interval is a lower bound, and is also
a reasonable default for testing. I expect real deployments to use
much higher sample intervals (lower rate).

So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that
the number of allocated KFENCE objects isn't artificially capped):

-- With a mostly vanilla config + KFENCE (sample interval 100 ms),
after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects
(total allocations >600). Those aren't always the same objects, with
roughly 2 allocations/frees per second.

-- Then running sysbench I/O benchmark, KFENCE objects allocated peak
at 82. During the benchmark, allocations/frees per second are closer
to 10-15. After the benchmark, the KFENCE objects allocated remain at
82, and allocations/frees per second fall back to ~2.

-- For the same system, changing the sample interval to 1 ms
(echo 1 > /sys/module/kfence/parameters/sample_interval) and re-running
the benchmark gives me: KFENCE objects allocated peak at exactly 500,
with ~500 allocations/frees per second. After that, allocated KFENCE
objects dropped a little to 496, and allocations/frees per second fell
back to ~2.

-- The long-lived objects are due to caches, and just running
'echo 1 > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects
back to 45.
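
Rates like the ones above can be derived from two snapshots of
/sys/kernel/debug/kfence/stats taken some time apart. The Python sketch
below assumes that the file consists of "name: value" lines and that it
contains counters named "total allocations" and "total frees"; both the
layout and the counter names are assumptions here and should be checked
against the actual file.

```python
def parse_kfence_stats(text: str) -> dict:
    """Parse 'name: value' lines into a dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon, in case a counter name contains one.
        name, sep, value = line.rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            continue
    return stats


def events_per_second(before: dict, after: dict, elapsed_s: float) -> float:
    """Allocations plus frees per second between two snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    keys = ("total allocations", "total frees")  # assumed counter names
    delta = sum(after[k] - before[k] for k in keys)
    return delta / elapsed_s
```

A monitoring script would read the file, sleep for a known period, read
it again, and pass both parsed snapshots to events_per_second().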

> Also I am wondering about the boot process (both kernel and init).
> It's both inherently almost the same for the whole population of
> machines and inherently produces persistent objects. Should we lower
> the rate for the first minute of uptime? Or maybe make it proportional
> to uptime?

It should depend on current usage, which is dependent on the workload.
I don't think uptime helps much, as seen above. If we imagine a user
space tool that tweaks this for us, we can initialize KFENCE with a
very large sample interval, and once booted, this user space
tool/script adjusts /sys/module/kfence/parameters/sample_interval.

At the very least, I think I'll just make
/sys/module/kfence/parameters/sample_interval root-writable
unconditionally, so that we can experiment with such a tool.

Lowering the rate for the first minute of uptime might also be an
option, although if we do that, we can also just move kfence_init() to
the end of start_kernel(). IMHO, it still makes sense to
sample normally during boot, because who knows how those allocations
are used with different workloads once the kernel is live. With a
sample interval of 1000 ms (which is closer to what we probably want
in production), I see no more than 20 KFENCE objects allocated after
boot. I think we can live with that.

> I feel it's quite an important aspect. We can have this awesome idea
> and implementation, but radically lower its utility by using bad
> sampling value (which will have silent "failure mode" -- no bugs
> detected).

As a first step, I think monitoring the entire fleet is key here
(collect /sys/kernel/debug/kfence/stats). Essentially, as long as
allocations/frees per second remain >0, we're probably fine, even if
we always run at max. KFENCE objects allocated.

An improvement over allocations/frees per second >0 would be
dynamically tweaking sample_interval based on how close we get to max
KFENCE objects allocated.

Yet another option is to skip KFENCE allocations based on the memcache
name, e.g. for those caches dedicated to long-lived allocations.
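
As a sketch of what the fleet monitoring suggested above might check,
the Python snippet below flags machines that have stopped sampling. The
MachineSample record and its fields are hypothetical; how the
per-machine numbers are collected from /sys/kernel/debug/kfence/stats
is left open.

```python
from dataclasses import dataclass


@dataclass
class MachineSample:
    host: str
    allocated: int             # KFENCE objects currently allocated
    capacity: int              # configured number of KFENCE objects
    events_per_second: float   # allocations + frees per second


def stalled_machines(samples):
    """Hosts whose pool is full and that see no allocations/frees."""
    return [s.host for s in samples
            if s.allocated >= s.capacity and s.events_per_second <= 0]


def fleet_pool_usage(samples):
    """Fraction of all KFENCE objects across the fleet that are in use."""
    total = sum(s.capacity for s in samples)
    return sum(s.allocated for s in samples) / total if total else 0.0
```

A machine running at max. KFENCE objects but with a rate >0 is not
flagged, in line with the criterion above; only a full pool with a rate
of zero counts as stalled.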

> But to make it clear: all of this does not conflict with the merge of
> the first version. Just having tunable sampling interval is good
> enough. We will get the ultimate understanding only when we start
> using it widely anyway.

Thanks,
-- Marco

2020-09-11 13:49:26

by Dmitry Vyukov

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Fri, Sep 11, 2020 at 2:03 PM Marco Elver wrote:
>
> On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov wrote:
> > On Tue, Sep 8, 2020 at 5:56 PM Marco Elver wrote:
> > > On Tue, Sep 08, 2020 at 05:36PM +0200, Vlastimil Babka wrote:
> [...]
> > > > Hmm did you observe that with this limit, a long-running system would eventually
> > > > converge to KFENCE memory pool being filled with long-aged objects, so there
> > > > would be no space to sample new ones?
> > >
> > > Sure, that's a possibility. But remember that we're not trying to
> > > deterministically detect bugs on 1 system (if you wanted that, you
> > > should use KASAN), but a fleet of machines! The non-determinism of which
> > > allocations will end up in KFENCE, will ensure we won't end up with a
> > > fleet of machines of identical allocations. That's exactly what we're
> > > after. Even if we eventually exhaust the pool, you'll still detect bugs
> > > if there are any.
> > >
> > > If you are overly worried, either the sample interval or number of
> > > available objects needs to be tweaked to be larger. The default of 255
> > > is quite conservative, and even using something larger on a modern
> > > system is hardly noticeable. Choosing a sample interval & number of
> > > objects should also factor in how many machines you plan to deploy this
> > > on. Monitoring /sys/kernel/debug/kfence/stats can help you here.
> >
> > Hi Marco,
> >
> > I reviewed patches and they look good to me (minus some local comments
> > that I've left).
>
> Thank you.
>
> > The main question/concern I have is what Vlastimil mentioned re
> > long-aged objects.
> > Is the default sample interval values reasonable for typical
> > workloads? Do we have any guidelines on choosing the sample interval?
> > Should it depend on workload/use pattern?
>
> As I hinted at before, the sample interval & number of objects needs
> to depend on:
> - number of machines,
> - workload,
> - acceptable overhead (performance, memory).
>
> However, workload can vary greatly, and something more dynamic may be
> needed. We do have the option to monitor
> /sys/kernel/debug/kfence/stats and even change the sample interval at
> runtime, e.g. from a user space tool that checks the currently used
> objects, and as the pool is closer to exhausted, starts increasing
> /sys/module/kfence/parameters/sample_interval.
>
> Of course, if we figure out the best dynamic policy, we can add this
> policy into the kernel. But I don't think it makes sense to hard-code
> such a policy right now.
>
> > By "reasonable" I mean if the pool will last long enough to still
> > sample something after hours/days? Have you tried any experiments with
> > some workload (both short-lived processes and long-lived
> > processes/namespaces) capturing state of the pool? It can make sense
> > to do to better understand dynamics. I suspect that the rate may need
> > to be orders of magnitude lower.
>
> Yes, the current default sample interval is a lower bound, and is also
> a reasonable default for testing. I expect real deployments to use
> much higher sample intervals (lower rate).
>
> So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that
> allocated KFENCE objects isn't artificially capped):
>
> -- With a mostly vanilla config + KFENCE (sample interval 100 ms),
> after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects
> (total allocations >600). Those aren't always the same objects, with
> roughly ~2 allocations/frees per second.
>
> -- Then running sysbench I/O benchmark, KFENCE objects allocated peak
> at 82. During the benchmark, allocations/frees per second are closer
> to 10-15. After the benchmark, the KFENCE objects allocated remain at
> 82, and allocations/frees per second fall back to ~2.
>
> -- For the same system, changing the sample interval to 1 ms (echo 1 >
> /sys/module/kfence/parameters/sample_interval), and re-running the
> benchmark gives me: KFENCE objects allocated peak at exactly 500, with
> ~500 allocations/frees per second. After that, allocated KFENCE
> objects dropped a little to 496, and allocations/frees per second fell
> back to ~2.
>
> -- The long-lived objects are due to caches, and just running 'echo 1
> > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to
> 45.

Interesting. What type of caches are these? If there is some type of
cache that holds a particularly large number of sampled objects, we
could potentially change the cache to release sampled objects eagerly.


2020-09-11 16:27:54

by Marco Elver

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov wrote:
> On Fri, Sep 11, 2020 at 2:03 PM Marco Elver wrote:
> > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov wrote:
[...]
> > > By "reasonable" I mean if the pool will last long enough to still
> > > sample something after hours/days? Have you tried any experiments with
> > > some workload (both short-lived processes and long-lived
> > > processes/namespaces) capturing state of the pool? It can make sense
> > > to do to better understand dynamics. I suspect that the rate may need
> > > to be orders of magnitude lower.
> >
> > Yes, the current default sample interval is a lower bound, and is also
> > a reasonable default for testing. I expect real deployments to use
> > much higher sample intervals (lower rate).
> >
> > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that
> > allocated KFENCE objects isn't artificially capped):
> >
> > -- With a mostly vanilla config + KFENCE (sample interval 100 ms),
> > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects
> > (total allocations >600). Those aren't always the same objects, with
> > roughly ~2 allocations/frees per second.
> >
> > -- Then running sysbench I/O benchmark, KFENCE objects allocated peak
> > at 82. During the benchmark, allocations/frees per second are closer
> > to 10-15. After the benchmark, the KFENCE objects allocated remain at
> > 82, and allocations/frees per second fall back to ~2.
> >
> > -- For the same system, changing the sample interval to 1 ms (echo 1 >
> > /sys/module/kfence/parameters/sample_interval), and re-running the
> > benchmark gives me: KFENCE objects allocated peak at exactly 500, with
> > ~500 allocations/frees per second. After that, allocated KFENCE
> > objects dropped a little to 496, and allocations/frees per second fell
> > back to ~2.
> >
> > -- The long-lived objects are due to caches, and just running 'echo 1
> > > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to
> > 45.
>
> Interesting. What type of caches is this? If there is some type of
> cache that caches particularly lots of sampled objects, we could
> potentially change the cache to release sampled objects eagerly.

The 2 major users of KFENCE objects for that workload are
'buffer_head' and 'bio-0'.

If we want to deal with those, I guess there are 2 options:

1. More complex, but more precise: make the users of them check
is_kfence_address() and release their buffers earlier.

2. Simpler, generic solution: make KFENCE stop returning allocations for
non-kmalloc_caches memcaches once more than ~90% of the pool is
exhausted. This assumes that creators of long-lived objects usually
set up their own memcaches.

I'm currently inclined to go for (2).
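
To make option (2) concrete, the decision it describes can be written
as a small predicate. This is a Python model for illustration only, not
kernel code; the function name and its parameters are made up, and the
0.9 threshold is the "~90%" from the description above.

```python
def should_use_kfence(is_kmalloc_cache: bool, allocated: int, capacity: int,
                      threshold: float = 0.9) -> bool:
    """Option (2) as a predicate.

    Below the threshold, any cache may receive KFENCE objects. Above
    it, only kmalloc caches may. A full pool cannot serve anything.
    """
    if allocated >= capacity:
        return False
    if allocated / capacity > threshold:
        return is_kmalloc_cache
    return True
```

With 1000 objects, a dedicated memcache such as 'buffer_head' would
still be sampled at 100 objects in use, but no longer at 950.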

Thanks,
-- Marco

2020-09-11 16:37:14

by Marco Elver

Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector

On Fri, 11 Sep 2020 at 15:33, Marco Elver wrote:
> On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov wrote:
> > On Fri, Sep 11, 2020 at 2:03 PM Marco Elver wrote:
> > > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov wrote:
> [...]
> > > > By "reasonable" I mean if the pool will last long enough to still
> > > > sample something after hours/days? Have you tried any experiments with
> > > > some workload (both short-lived processes and long-lived
> > > > processes/namespaces) capturing state of the pool? It can make sense
> > > > to do to better understand dynamics. I suspect that the rate may need
> > > > to be orders of magnitude lower.
> > >
> > > Yes, the current default sample interval is a lower bound, and is also
> > > a reasonable default for testing. I expect real deployments to use
> > > much higher sample intervals (lower rate).
> > >
> > > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that
> > > allocated KFENCE objects isn't artificially capped):
> > >
> > > -- With a mostly vanilla config + KFENCE (sample interval 100 ms),
> > > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE objects
> > > (total allocations >600). Those aren't always the same objects, with
> > > roughly ~2 allocations/frees per second.
> > >
> > > -- Then running sysbench I/O benchmark, KFENCE objects allocated peak
> > > at 82. During the benchmark, allocations/frees per second are closer
> > > to 10-15. After the benchmark, the KFENCE objects allocated remain at
> > > 82, and allocations/frees per second fall back to ~2.
> > >
> > > -- For the same system, changing the sample interval to 1 ms (echo 1 >
> > > /sys/module/kfence/parameters/sample_interval), and re-running the
> > > benchmark gives me: KFENCE objects allocated peak at exactly 500, with
> > > ~500 allocations/frees per second. After that, allocated KFENCE
> > > objects dropped a little to 496, and allocations/frees per second fell
> > > back to ~2.
> > >
> > > -- The long-lived objects are due to caches, and just running 'echo 1
> > > > /proc/sys/vm/drop_caches' reduced allocated KFENCE objects back to
> > > 45.
> >
> > Interesting. What type of caches is this? If there is some type of
> > cache that caches particularly lots of sampled objects, we could
> > potentially change the cache to release sampled objects eagerly.
>
> The 2 major users of KFENCE objects for that workload are
> 'buffer_head' and 'bio-0'.
>
> If we want to deal with those, I guess there are 2 options:
>
> 1. More complex, but more precise: make the users of them check
> is_kfence_address() and release their buffers earlier.
>
> 2. Simpler, generic solution: make KFENCE stop return allocations for
> non-kmalloc_caches memcaches after more than ~90% of the pool is
> exhausted. This assumes that creators of long-lived objects usually
> set up their own memcaches.
>
> I'm currently inclined to go for (2).

Ok, after some offline chat, we determined that (2) would be premature
and we can't really say if kmalloc should have precedence if we reach
some usage threshold. So for now, let's just leave it as-is and start
with the recommendation to monitor and adjust based on usage, fleet
size, etc.

Thanks,
-- Marco