From: Marco Elver
Date: Fri, 11 Sep 2020 18:33:52 +0200
Subject: Re: [PATCH RFC 00/10] KFENCE: A low-overhead sampling-based memory safety error detector
To: Dmitry Vyukov
Cc: Vlastimil Babka, Dave Hansen, Alexander Potapenko, Andrew Morton,
    Catalin Marinas, Christoph Lameter, David Rientjes, Joonsoo Kim,
    Mark Rutland, Pekka Enberg, "H. Peter Anvin", "Paul E. McKenney",
    Andrey Konovalov, Andrey Ryabinin, Andy Lutomirski, Borislav Petkov,
    Dave Hansen, Eric Dumazet, Greg Kroah-Hartman, Ingo Molnar, Jann Horn,
    Jonathan Corbet, Kees Cook, Peter Zijlstra, Qian Cai, Thomas Gleixner,
    Will Deacon, the arch/x86 maintainers, "open list:DOCUMENTATION",
    LKML, kasan-dev, Linux ARM, Linux-MM

On Fri, 11 Sep 2020 at 15:33, Marco Elver wrote:
> On Fri, 11 Sep 2020 at 15:10, Dmitry Vyukov wrote:
> > On Fri, Sep 11, 2020 at 2:03 PM Marco Elver wrote:
> > > On Fri, 11 Sep 2020 at 09:36, Dmitry Vyukov wrote:
> [...]
> > > > By "reasonable" I mean: will the pool last long enough to still
> > > > sample something after hours/days? Have you tried any experiments
> > > > with some workload (both short-lived and long-lived
> > > > processes/namespaces), capturing the state of the pool? It can
> > > > make sense to do this to better understand the dynamics. I
> > > > suspect the rate may need to be orders of magnitude lower.
> > >
> > > Yes, the current default sample interval is a lower bound, and is
> > > also a reasonable default for testing. I expect real deployments to
> > > use much higher sample intervals (lower rate).
> > >
> > > So here's some data (with CONFIG_KFENCE_NUM_OBJECTS=1000, so that
> > > the number of allocated KFENCE objects isn't artificially capped):
> > >
> > > -- With a mostly vanilla config + KFENCE (sample interval 100 ms),
> > > after ~40 min uptime (only boot, then idle) I see ~60 KFENCE
> > > objects (total allocations >600). Those aren't always the same
> > > objects, with roughly ~2 allocations/frees per second.
> > >
> > > -- Then running the sysbench I/O benchmark, allocated KFENCE
> > > objects peak at 82. During the benchmark, allocations/frees per
> > > second are closer to 10-15. After the benchmark, allocated KFENCE
> > > objects remain at 82, and allocations/frees per second fall back
> > > to ~2.
> > >
> > > -- For the same system, changing the sample interval to 1 ms
> > > (echo 1 > /sys/module/kfence/parameters/sample_interval) and
> > > re-running the benchmark gives me: allocated KFENCE objects peak
> > > at exactly 500, with ~500 allocations/frees per second. After
> > > that, allocated KFENCE objects dropped a little to 496, and
> > > allocations/frees per second fell back to ~2.
> > >
> > > -- The long-lived objects are due to caches, and just running
> > > 'echo 1 > /proc/sys/vm/drop_caches' reduced allocated KFENCE
> > > objects back to 45.
> >
> > Interesting. What type of caches is this? If there is some type of
> > cache that holds on to particularly many sampled objects, we could
> > potentially change the cache to release sampled objects eagerly.
>
> The 2 major users of KFENCE objects for that workload are
> 'buffer_head' and 'bio-0'.
>
> If we want to deal with those, I guess there are 2 options:
>
> 1. More complex, but more precise: make their users check
> is_kfence_address() and release their buffers earlier.
>
> 2. Simpler, generic solution: make KFENCE stop returning allocations
> for memcaches other than the kmalloc caches once more than ~90% of
> the pool is exhausted. This assumes that creators of long-lived
> objects usually set up their own memcaches.
>
> I'm currently inclined to go for (2).
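To make (2) concrete, the gating I had in mind would look roughly like
the sketch below. This is illustrative only: kfence_pool_usage_percent()
and is_kmalloc_cache() are made-up helper names, not actual KFENCE
internals, and the underlying __kfence_alloc() stands in for the real
allocation path.

static void *kfence_alloc_gated(struct kmem_cache *s, size_t size,
                                gfp_t flags)
{
        /*
         * Reserve the last ~10% of the pool for the kmalloc caches,
         * on the assumption that creators of long-lived objects set
         * up their own memcaches.
         */
        if (kfence_pool_usage_percent() > 90 && !is_kmalloc_cache(s))
                return NULL;    /* caller falls back to the normal allocator */

        return __kfence_alloc(s, size, flags);
}

The appeal is that no cache users need changing; the cost is baking in
a fixed policy about who gets the tail of the pool.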
Ok, after some offline chat, we determined that (2) would be premature,
and we can't really say whether kmalloc should have precedence once we
reach some usage threshold. So for now, let's just leave it as-is and
start with the recommendation to monitor and adjust based on usage,
fleet size, etc.

Thanks,
-- Marco
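P.S. For completeness, (1) would amount to cache users doing something
like the sketch below when releasing a buffer. Only is_kfence_address()
is from this series; the surrounding function and the two helpers are
made up for illustration.

static void example_put_bh(struct buffer_head *bh)
{
        /*
         * Free KFENCE-backed buffers eagerly instead of caching them,
         * so the sampled pool slot is recycled quickly.
         */
        if (is_kfence_address(bh->b_data)) {
                example_free_bh(bh);    /* hypothetical early release */
                return;
        }
        example_cache_bh(bh);           /* hypothetical caching path */
}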