As found in PaX, this adds a cheap check on heap consistency, just to
notice if things have gotten corrupted in the page lookup.
Signed-off-by: Kees Cook <[email protected]>
---
mm/slab.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/slab.h b/mm/slab.h
index 65e7c3fcac72..64447640b70c 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -384,6 +384,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
return s;
page = virt_to_head_page(x);
+ BUG_ON(!PageSlab(page));
cachep = page->slab_cache;
if (slab_equal_or_root(cachep, s))
return cachep;
--
2.7.4
--
Kees Cook
Pixel Security
On Fri, 31 Mar 2017 09:40:28 -0700 Kees Cook <[email protected]> wrote:
> As found in PaX, this adds a cheap check on heap consistency, just to
> notice if things have gotten corrupted in the page lookup.
"As found in PaX" isn't a very illuminating justification for such a
change. Was there a real kernel bug which this would have exposed, or
what?
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -384,6 +384,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
> return s;
>
> page = virt_to_head_page(x);
> + BUG_ON(!PageSlab(page));
> cachep = page->slab_cache;
> if (slab_equal_or_root(cachep, s))
> return cachep;
BUG_ON might be too severe. I expect the kindest VM_WARN_ON_ONCE()
would suffice here, but without more details it is hard to say.
On Fri, Mar 31, 2017 at 2:33 PM, Andrew Morton
<[email protected]> wrote:
> On Fri, 31 Mar 2017 09:40:28 -0700 Kees Cook <[email protected]> wrote:
>
>> As found in PaX, this adds a cheap check on heap consistency, just to
>> notice if things have gotten corrupted in the page lookup.
>
> "As found in PaX" isn't a very illuminating justification for such a
> change. Was there a real kernel bug which this would have exposed, or
> what?
I don't know off the top of my head, but given the kinds of heap
attacks I've been seeing, I think this added consistency check is
worth it given how inexpensive it is. When heap metadata gets
corrupted, we can get into nasty side-effects that can be
attacker-controlled, so better to catch obviously bad states as early
as possible.
>> --- a/mm/slab.h
>> +++ b/mm/slab.h
>> @@ -384,6 +384,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
>> return s;
>>
>> page = virt_to_head_page(x);
>> + BUG_ON(!PageSlab(page));
>> cachep = page->slab_cache;
>> if (slab_equal_or_root(cachep, s))
>> return cachep;
>
> BUG_ON might be too severe. I expect the kindest VM_WARN_ON_ONCE()
> would suffice here, but without more details it is hard to say.
So, WARN isn't enough to protect the kernel (execution continues and
the memory is still dereferenced for malicious purposes, etc). Perhaps
use CHECK_DATA_CORRUPTION() here, which can either WARN and take a
"safe" path, or BUG (depending on config paranoia of the builder).
I've got a series adding it in a number of other places, so I could
add this patch to that series?
-Kees
--
Kees Cook
Pixel Security
Kees Cook <[email protected]> writes:
> On Fri, Mar 31, 2017 at 2:33 PM, Andrew Morton
> <[email protected]> wrote:
>> On Fri, 31 Mar 2017 09:40:28 -0700 Kees Cook <[email protected]> wrote:
>>
>>> As found in PaX, this adds a cheap check on heap consistency, just to
>>> notice if things have gotten corrupted in the page lookup.
>>
>> "As found in PaX" isn't a very illuminating justification for such a
>> change. Was there a real kernel bug which this would have exposed, or
>> what?
>
> I don't know off the top of my head, but given the kinds of heap
> attacks I've been seeing, I think this added consistency check is
> worth it given how inexpensive it is. When heap metadata gets
> corrupted, we can get into nasty side-effects that can be
> attacker-controlled, so better to catch obviously bad states as early
> as possible.
There's your changelog :)
>>> --- a/mm/slab.h
>>> +++ b/mm/slab.h
>>> @@ -384,6 +384,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
>>> return s;
>>>
>>> page = virt_to_head_page(x);
>>> + BUG_ON(!PageSlab(page));
>>> cachep = page->slab_cache;
>>> if (slab_equal_or_root(cachep, s))
>>> return cachep;
>>
>> BUG_ON might be too severe. I expect the kindest VM_WARN_ON_ONCE()
>> would suffice here, but without more details it is hard to say.
>
> So, WARN isn't enough to protect the kernel (execution continues and
> the memory is still dereferenced for malicious purposes, etc).
You could do:

	if (WARN_ON(!PageSlab(page)))
		return NULL;
Though I see at least two callers that don't check for a NULL return.
Looking at the context, the tail of the function already contains:
pr_err("%s: Wrong slab cache. %s but object is from %s\n",
__func__, s->name, cachep->name);
WARN_ON_ONCE(1);
return s;
}
At least in slab.c it seems that would allow you to "free" an object
from one kmem_cache onto the array_cache of another kmem_cache, which
seems fishy. But maybe there's a check somewhere I'm missing?
cheers
On Mon, 3 Apr 2017, Michael Ellerman wrote:
> At least in slab.c it seems that would allow you to "free" an object
> from one kmem_cache onto the array_cache of another kmem_cache, which
> seems fishy. But maybe there's a check somewhere I'm missing?
kfree can be used to free any object from any slab cache.
kmem_cache_free() checks if the object belongs to the cache given.
On Mon, Apr 03, 2017 at 09:03:50AM -0500, Christoph Lameter wrote:
> On Mon, 3 Apr 2017, Michael Ellerman wrote:
>
> > At least in slab.c it seems that would allow you to "free" an object
> > from one kmem_cache onto the array_cache of another kmem_cache, which
> > seems fishy. But maybe there's a check somewhere I'm missing?
>
> kfree can be used to free any object from any slab cache.
Is that a guarantee? There's some wording in the RCU free code that
seems to indicate we can't rely on that being true.
On Fri 31-03-17 09:40:28, Kees Cook wrote:
> As found in PaX, this adds a cheap check on heap consistency, just to
> notice if things have gotten corrupted in the page lookup.
>
> Signed-off-by: Kees Cook <[email protected]>
NAK without a proper changelog. Seriously, we do not blindly apply
changes from other projects without a deep understanding of all
consequences.
> ---
> mm/slab.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/mm/slab.h b/mm/slab.h
> index 65e7c3fcac72..64447640b70c 100644
> --- a/mm/slab.h
> +++ b/mm/slab.h
> @@ -384,6 +384,7 @@ static inline struct kmem_cache *cache_from_obj(struct kmem_cache *s, void *x)
> return s;
>
> page = virt_to_head_page(x);
> + BUG_ON(!PageSlab(page));
> cachep = page->slab_cache;
> if (slab_equal_or_root(cachep, s))
> return cachep;
> --
> 2.7.4
>
>
> --
> Kees Cook
> Pixel Security
>
--
Michal Hocko
SUSE Labs
On Tue, 4 Apr 2017, Michal Hocko wrote:
> NAK without a proper changelog. Seriously, we do not blindly apply
> changes from other projects without a deep understanding of all
> consequences.
Functionality-wise this is trivial. A page must be a slab page in order
to determine the slab cache of an object. It's definitely not ok if
the page is not a slab page.
The main issue that may exist here is the adding of overhead to a critical
code path like kfree().
On Tue 04-04-17 10:07:23, Christoph Lameter wrote:
> On Tue, 4 Apr 2017, Michal Hocko wrote:
>
> > NAK without a proper changelog. Seriously, we do not blindly apply
> > changes from other projects without a deep understanding of all
> > consequences.
>
> Functionalitywise this is trivial. A page must be a slab page in order to
> be able to determine the slab cache of an object. Its definitely not ok if
> the page is not a slab page.
Yes, but we do not have to blow the kernel, right? Why cannot we simply
leak that memory?
> The main issue that may exist here is the adding of overhead to a critical
> code path like kfree().
Yes, nothing is for free. But if the attack surface is real then we
probably want to sacrifice a few cycles (to simply return ASAP without
further processing). This all should be in the changelog, ideally
with some numbers. I suspect this would be hard to measure in most
workloads.
--
Michal Hocko
SUSE Labs
On Tue, Apr 4, 2017 at 8:16 AM, Michal Hocko <[email protected]> wrote:
> On Tue 04-04-17 10:07:23, Cristopher Lameter wrote:
>> On Tue, 4 Apr 2017, Michal Hocko wrote:
>>
>> > NAK without a proper changelog. Seriously, we do not blindly apply
>> > changes from other projects without a deep understanding of all
>> > consequences.
>>
>> Functionalitywise this is trivial. A page must be a slab page in order to
>> be able to determine the slab cache of an object. Its definitely not ok if
>> the page is not a slab page.
>
> Yes, but we do not have to blow the kernel, right? Why cannot we simply
> leak that memory?
I can put this behind CHECK_DATA_CORRUPTION() instead of BUG(), which
allows the system builder to choose between WARN and BUG. Some people
absolutely want the kernel to BUG on data corruption as it could be an
attack.
>> The main issue that may exist here is the adding of overhead to a critical
>> code path like kfree().
>
> Yes, nothing is for free. But if the attack space is real then we
> probably want to sacrifice few cycles (to simply return ASAP without
> further further processing). This all should be in the changelog ideally
> with some numbers. I suspect this would be hard to measure in most
> workloads.
Given the trivial nature of the check, yeah, it seemed impossible to
actually show performance changes.
-Kees
--
Kees Cook
Pixel Security
On Tue 04-04-17 08:46:02, Kees Cook wrote:
> On Tue, Apr 4, 2017 at 8:16 AM, Michal Hocko <[email protected]> wrote:
> > On Tue 04-04-17 10:07:23, Cristopher Lameter wrote:
> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
> >>
> >> > NAK without a proper changelog. Seriously, we do not blindly apply
> >> > changes from other projects without a deep understanding of all
> >> > consequences.
> >>
> >> Functionalitywise this is trivial. A page must be a slab page in order to
> >> be able to determine the slab cache of an object. Its definitely not ok if
> >> the page is not a slab page.
> >
> > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> > leak that memory?
>
> I can put this behind CHECK_DATA_CORRUPTION() instead of BUG(), which
> allows the system builder to choose between WARN and BUG. Some people
> absolutely want the kernel to BUG on data corruption as it could be an
> attack.
CHECK_DATA_CORRUPTION sounds like a better fit to me. This would, however,
require handling the potential corruption by returning and leaking the
memory.
--
Michal Hocko
SUSE Labs
On Tue, Apr 4, 2017 at 8:58 AM, Michal Hocko <[email protected]> wrote:
> On Tue 04-04-17 08:46:02, Kees Cook wrote:
>> On Tue, Apr 4, 2017 at 8:16 AM, Michal Hocko <[email protected]> wrote:
>> > On Tue 04-04-17 10:07:23, Cristopher Lameter wrote:
>> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
>> >>
>> >> > NAK without a proper changelog. Seriously, we do not blindly apply
>> >> > changes from other projects without a deep understanding of all
>> >> > consequences.
>> >>
>> >> Functionalitywise this is trivial. A page must be a slab page in order to
>> >> be able to determine the slab cache of an object. Its definitely not ok if
>> >> the page is not a slab page.
>> >
>> > Yes, but we do not have to blow the kernel, right? Why cannot we simply
>> > leak that memory?
>>
>> I can put this behind CHECK_DATA_CORRUPTION() instead of BUG(), which
>> allows the system builder to choose between WARN and BUG. Some people
>> absolutely want the kernel to BUG on data corruption as it could be an
>> attack.
>
> CHECK_DATA_CORRUPTION sounds as better fit to me. This would, however
> require to handle the potenial corruption by returning and leaking the
> memory.
IIUC, that would be the "return s" path? I should likely change the
WARN_ON_ONCE there to be CHECK_DATA_CORRUPTION too. I'll add this to
my series.
-Kees
--
Kees Cook
Pixel Security
On Tue, 4 Apr 2017, Michal Hocko wrote:
> Yes, but we do not have to blow the kernel, right? Why cannot we simply
> leak that memory?
Because it is a serious bug to attempt to free a non-slab object using
slab operations. This is often the result of memory corruption, coding
errors, etc. The system needs to stop right there.
On Tue 04-04-17 14:13:06, Christoph Lameter wrote:
> On Tue, 4 Apr 2017, Michal Hocko wrote:
>
> > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> > leak that memory?
>
> Because it is a serious bug to attempt to free a non slab object using
> slab operations. This is often the result of memory corruption, coding
> errs etc. The system needs to stop right there.
Why when an alternative is a memory leak?
--
Michal Hocko
SUSE Labs
On Tue, 4 Apr 2017, Michal Hocko wrote:
> On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
> > On Tue, 4 Apr 2017, Michal Hocko wrote:
> >
> > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> > > leak that memory?
> >
> > Because it is a serious bug to attempt to free a non slab object using
> > slab operations. This is often the result of memory corruption, coding
> > errs etc. The system needs to stop right there.
>
> Why when an alternative is a memory leak?
Because the slab allocators fail also in case you free an object multiple
times etc etc. Continuation is supported by enabling a special resiliency
feature via the kernel command line. The alternative is selectable but not
the default.
On Tue 04-04-17 14:58:06, Christoph Lameter wrote:
> On Tue, 4 Apr 2017, Michal Hocko wrote:
>
> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
> > >
> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> > > > leak that memory?
> > >
> > > Because it is a serious bug to attempt to free a non slab object using
> > > slab operations. This is often the result of memory corruption, coding
> > > errs etc. The system needs to stop right there.
> >
> > Why when an alternative is a memory leak?
>
> Because the slab allocators fail also in case you free an object multiple
> times etc etc. Continuation is supported by enabling a special resiliency
> feature via the kernel command line. The alternative is selectable but not
> the default.
I disagree! We should try to continue as long as we _know_ that the
internal state of the allocator is still consistent and a further
operation will not spread the corruption even more. This is clearly not
the case for an invalid pointer to kfree.
I can see why checking for an early allocator corruption is not always
feasible and you can only detect after-the-fact but this is not the case
here and putting your system down just because some buggy code is trying
to free something it hasn't allocated is not really useful. I completely
agree with Linus that we overuse BUG way too much and this is just
another example of it.
--
Michal Hocko
SUSE Labs
On Tue, Apr 4, 2017 at 1:13 PM, Michal Hocko <[email protected]> wrote:
> On Tue 04-04-17 14:58:06, Cristopher Lameter wrote:
>> On Tue, 4 Apr 2017, Michal Hocko wrote:
>>
>> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
>> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
>> > >
>> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
>> > > > leak that memory?
>> > >
>> > > Because it is a serious bug to attempt to free a non slab object using
>> > > slab operations. This is often the result of memory corruption, coding
>> > > errs etc. The system needs to stop right there.
>> >
>> > Why when an alternative is a memory leak?
>>
>> Because the slab allocators fail also in case you free an object multiple
>> times etc etc. Continuation is supported by enabling a special resiliency
>> feature via the kernel command line. The alternative is selectable but not
>> the default.
>
> I disagree! We should try to continue as long as we _know_ that the
> internal state of the allocator is still consistent and a further
> operation will not spread the corruption even more. This is clearly not
> the case for an invalid pointer to kfree.
>
> I can see why checking for an early allocator corruption is not always
> feasible and you can only detect after-the-fact but this is not the case
> here and putting your system down just because some buggy code is trying
> to free something it hasn't allocated is not really useful. I completely
> agree with Linus that we overuse BUG way too much and this is just
> another example of it.
Instead of the proposed BUG here, what's the correct "safe" return value?
-Kees
--
Kees Cook
Pixel Security
On Mon 10-04-17 21:58:22, Kees Cook wrote:
> On Tue, Apr 4, 2017 at 1:13 PM, Michal Hocko <[email protected]> wrote:
> > On Tue 04-04-17 14:58:06, Cristopher Lameter wrote:
> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
> >>
> >> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
> >> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
> >> > >
> >> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> >> > > > leak that memory?
> >> > >
> >> > > Because it is a serious bug to attempt to free a non slab object using
> >> > > slab operations. This is often the result of memory corruption, coding
> >> > > errs etc. The system needs to stop right there.
> >> >
> >> > Why when an alternative is a memory leak?
> >>
> >> Because the slab allocators fail also in case you free an object multiple
> >> times etc etc. Continuation is supported by enabling a special resiliency
> >> feature via the kernel command line. The alternative is selectable but not
> >> the default.
> >
> > I disagree! We should try to continue as long as we _know_ that the
> > internal state of the allocator is still consistent and a further
> > operation will not spread the corruption even more. This is clearly not
> > the case for an invalid pointer to kfree.
> >
> > I can see why checking for an early allocator corruption is not always
> > feasible and you can only detect after-the-fact but this is not the case
> > here and putting your system down just because some buggy code is trying
> > to free something it hasn't allocated is not really useful. I completely
> > agree with Linus that we overuse BUG way too much and this is just
> > another example of it.
>
> Instead of the proposed BUG here, what's the correct "safe" return value?
I would assume that _you_ as the one who proposes the change would take
some time to read and understand the code and know this answer. This is
how we do changes to the kernel: have an objective, understand the code
and generate the patch.
I am really sad that this particular patch has shown that you didn't
bother to consider the latter part and blindly applied something that you
haven't thought through properly. Please try harder next time.
--
Michal Hocko
SUSE Labs
On Tue, Apr 11, 2017 at 6:46 AM, Michal Hocko <[email protected]> wrote:
> On Mon 10-04-17 21:58:22, Kees Cook wrote:
>> On Tue, Apr 4, 2017 at 1:13 PM, Michal Hocko <[email protected]> wrote:
>> > On Tue 04-04-17 14:58:06, Cristopher Lameter wrote:
>> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
>> >>
>> >> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
>> >> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
>> >> > >
>> >> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
>> >> > > > leak that memory?
>> >> > >
>> >> > > Because it is a serious bug to attempt to free a non slab object using
>> >> > > slab operations. This is often the result of memory corruption, coding
>> >> > > errs etc. The system needs to stop right there.
>> >> >
>> >> > Why when an alternative is a memory leak?
>> >>
>> >> Because the slab allocators fail also in case you free an object multiple
>> >> times etc etc. Continuation is supported by enabling a special resiliency
>> >> feature via the kernel command line. The alternative is selectable but not
>> >> the default.
>> >
>> > I disagree! We should try to continue as long as we _know_ that the
>> > internal state of the allocator is still consistent and a further
>> > operation will not spread the corruption even more. This is clearly not
>> > the case for an invalid pointer to kfree.
>> >
>> > I can see why checking for an early allocator corruption is not always
>> > feasible and you can only detect after-the-fact but this is not the case
>> > here and putting your system down just because some buggy code is trying
>> > to free something it hasn't allocated is not really useful. I completely
>> > agree with Linus that we overuse BUG way too much and this is just
>> > another example of it.
>>
>> Instead of the proposed BUG here, what's the correct "safe" return value?
>
> I would assume that _you_ as the one who proposes the change would take
> some time to read and understand the code and know this answer. This is
> how we do changes to the kernel: have an objective, understand the code
> and generate the patch.
>
> I am really sad that this particular patch has shown that you didn't
> bother to consider the later part and blindly applied something that you
> haven't thought through properly. Please try harder next time.
Our objectives are different: I want the kernel to immediately stop
when corruption is detected. Since others are interested in making it
survivable, I was hoping to get a hint about what such an improvement
would look like. Instead of this condescending attitude, can you
provide constructive help that will get our users closer to the safe
kernel operation we're all interested in?
-Kees
--
Kees Cook
Pixel Security
On Tue 11-04-17 07:14:01, Kees Cook wrote:
> On Tue, Apr 11, 2017 at 6:46 AM, Michal Hocko <[email protected]> wrote:
> > On Mon 10-04-17 21:58:22, Kees Cook wrote:
> >> On Tue, Apr 4, 2017 at 1:13 PM, Michal Hocko <[email protected]> wrote:
> >> > On Tue 04-04-17 14:58:06, Cristopher Lameter wrote:
> >> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
> >> >>
> >> >> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
> >> >> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
> >> >> > >
> >> >> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
> >> >> > > > leak that memory?
> >> >> > >
> >> >> > > Because it is a serious bug to attempt to free a non slab object using
> >> >> > > slab operations. This is often the result of memory corruption, coding
> >> >> > > errs etc. The system needs to stop right there.
> >> >> >
> >> >> > Why when an alternative is a memory leak?
> >> >>
> >> >> Because the slab allocators fail also in case you free an object multiple
> >> >> times etc etc. Continuation is supported by enabling a special resiliency
> >> >> feature via the kernel command line. The alternative is selectable but not
> >> >> the default.
> >> >
> >> > I disagree! We should try to continue as long as we _know_ that the
> >> > internal state of the allocator is still consistent and a further
> >> > operation will not spread the corruption even more. This is clearly not
> >> > the case for an invalid pointer to kfree.
> >> >
> >> > I can see why checking for an early allocator corruption is not always
> >> > feasible and you can only detect after-the-fact but this is not the case
> >> > here and putting your system down just because some buggy code is trying
> >> > to free something it hasn't allocated is not really useful. I completely
> >> > agree with Linus that we overuse BUG way too much and this is just
> >> > another example of it.
> >>
> >> Instead of the proposed BUG here, what's the correct "safe" return value?
> >
> > I would assume that _you_ as the one who proposes the change would take
> > some time to read and understand the code and know this answer. This is
> > how we do changes to the kernel: have an objective, understand the code
> > and generate the patch.
> >
> > I am really sad that this particular patch has shown that you didn't
> > bother to consider the later part and blindly applied something that you
> > haven't thought through properly. Please try harder next time.
>
> Our objectives are different: I want the kernel to immediately stop
> when corruption is detected. Since others are interested in making it
> survivable, I was hoping to get a hint about what such an improvement
> would look like.
I do not think sprinkling BUG_ONs will help that objective. And a BUG_ON
with IRQs disabled is unlikely to make an error survivable...
> Instead this condescending attitude, can you instead
> provide constructive help that will get our users closer to the safe
> kernel operation we're all interested in?
I would do something like...
---
diff --git a/mm/slab.c b/mm/slab.c
index bd63450a9b16..87c99a5e9e18 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -393,10 +393,15 @@ static inline void set_store_user_dirty(struct kmem_cache *cachep) {}
static int slab_max_order = SLAB_MAX_ORDER_LO;
static bool slab_max_order_set __initdata;
+static inline struct kmem_cache *page_to_cache(struct page *page)
+{
+ return page->slab_cache;
+}
+
static inline struct kmem_cache *virt_to_cache(const void *obj)
{
struct page *page = virt_to_head_page(obj);
- return page->slab_cache;
+ return page_to_cache(page);
}
static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
@@ -3813,14 +3818,18 @@ void kfree(const void *objp)
{
struct kmem_cache *c;
unsigned long flags;
+ struct page *page;
trace_kfree(_RET_IP_, objp);
if (unlikely(ZERO_OR_NULL_PTR(objp)))
return;
+ page = virt_to_head_page(objp);
+ if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
+ return;
local_irq_save(flags);
kfree_debugcheck(objp);
- c = virt_to_cache(objp);
+ c = page_to_cache(page);
debug_check_no_locks_freed(objp, c->object_size);
debug_check_no_obj_freed(objp, c->object_size);
--
Michal Hocko
SUSE Labs
On Tue, Apr 11, 2017 at 7:19 AM, Michal Hocko <[email protected]> wrote:
> On Tue 11-04-17 07:14:01, Kees Cook wrote:
>> On Tue, Apr 11, 2017 at 6:46 AM, Michal Hocko <[email protected]> wrote:
>> > On Mon 10-04-17 21:58:22, Kees Cook wrote:
>> >> On Tue, Apr 4, 2017 at 1:13 PM, Michal Hocko <[email protected]> wrote:
>> >> > On Tue 04-04-17 14:58:06, Cristopher Lameter wrote:
>> >> >> On Tue, 4 Apr 2017, Michal Hocko wrote:
>> >> >>
>> >> >> > On Tue 04-04-17 14:13:06, Cristopher Lameter wrote:
>> >> >> > > On Tue, 4 Apr 2017, Michal Hocko wrote:
>> >> >> > >
>> >> >> > > > Yes, but we do not have to blow the kernel, right? Why cannot we simply
>> >> >> > > > leak that memory?
>> >> >> > >
>> >> >> > > Because it is a serious bug to attempt to free a non slab object using
>> >> >> > > slab operations. This is often the result of memory corruption, coding
>> >> >> > > errs etc. The system needs to stop right there.
>> >> >> >
>> >> >> > Why when an alternative is a memory leak?
>> >> >>
>> >> >> Because the slab allocators fail also in case you free an object multiple
>> >> >> times etc etc. Continuation is supported by enabling a special resiliency
>> >> >> feature via the kernel command line. The alternative is selectable but not
>> >> >> the default.
>> >> >
>> >> > I disagree! We should try to continue as long as we _know_ that the
>> >> > internal state of the allocator is still consistent and a further
>> >> > operation will not spread the corruption even more. This is clearly not
>> >> > the case for an invalid pointer to kfree.
>> >> >
>> >> > I can see why checking for an early allocator corruption is not always
>> >> > feasible and you can only detect after-the-fact but this is not the case
>> >> > here and putting your system down just because some buggy code is trying
>> >> > to free something it hasn't allocated is not really useful. I completely
>> >> > agree with Linus that we overuse BUG way too much and this is just
>> >> > another example of it.
>> >>
>> >> Instead of the proposed BUG here, what's the correct "safe" return value?
>> >
>> > I would assume that _you_ as the one who proposes the change would take
>> > some time to read and understand the code and know this answer. This is
>> > how we do changes to the kernel: have an objective, understand the code
>> > and generate the patch.
>> >
>> > I am really sad that this particular patch has shown that you didn't
>> > bother to consider the later part and blindly applied something that you
>> > haven't thought through properly. Please try harder next time.
>>
>> Our objectives are different: I want the kernel to immediately stop
>> when corruption is detected. Since others are interested in making it
>> survivable, I was hoping to get a hint about what such an improvement
>> would look like.
>
> I do not think sprinkling BUG_ONs will help that objective. And BUG_ON
> under IRQ disable is likely not helping an error survivable...
Yes, agreed. Handling it cleanly is always better.
>> Instead this condescending attitude, can you instead
>> provide constructive help that will get our users closer to the safe
>> kernel operation we're all interested in?
>
> I would do something like...
> ---
> diff --git a/mm/slab.c b/mm/slab.c
> index bd63450a9b16..87c99a5e9e18 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -393,10 +393,15 @@ static inline void set_store_user_dirty(struct kmem_cache *cachep) {}
> static int slab_max_order = SLAB_MAX_ORDER_LO;
> static bool slab_max_order_set __initdata;
>
> +static inline struct kmem_cache *page_to_cache(struct page *page)
> +{
> + return page->slab_cache;
> +}
> +
> static inline struct kmem_cache *virt_to_cache(const void *obj)
> {
> struct page *page = virt_to_head_page(obj);
> - return page->slab_cache;
> + return page_to_cache(page);
> }
>
> static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
> @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
> {
> struct kmem_cache *c;
> unsigned long flags;
> + struct page *page;
>
> trace_kfree(_RET_IP_, objp);
>
> if (unlikely(ZERO_OR_NULL_PTR(objp)))
> return;
> + page = virt_to_head_page(objp);
> + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
> + return;
> local_irq_save(flags);
> kfree_debugcheck(objp);
> - c = virt_to_cache(objp);
> + c = page_to_cache(page);
> debug_check_no_locks_freed(objp, c->object_size);
>
> debug_check_no_obj_freed(objp, c->object_size);
Awesome! Thank you very much! I'll play with this.
-Kees
--
Kees Cook
Pixel Security
On Tue, 11 Apr 2017, Michal Hocko wrote:
> static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
> @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
> {
> struct kmem_cache *c;
> unsigned long flags;
> + struct page *page;
>
> trace_kfree(_RET_IP_, objp);
>
> if (unlikely(ZERO_OR_NULL_PTR(objp)))
> return;
> + page = virt_to_head_page(objp);
> + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
There is a flag SLAB_DEBUG_OBJECTS that is available for this check.
Consistency checks are configurable in the slab allocator.
Mentioned that before and got this lecture about data consistency checks.
On Tue, Apr 11, 2017 at 9:16 AM, Christoph Lameter <[email protected]> wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
>> static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
>> @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
>> {
>> struct kmem_cache *c;
>> unsigned long flags;
>> + struct page *page;
>>
>> trace_kfree(_RET_IP_, objp);
>>
>> if (unlikely(ZERO_OR_NULL_PTR(objp)))
>> return;
>> + page = virt_to_head_page(objp);
>> + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
>
> There is a flag SLAB_DEBUG_OBJECTS that is available for this check.
> Consistency checks are configuraable in the slab allocator.
>
> Mentioned that before and got this lecture about data consistency checks.
It seems that enabling the debug checks comes with a non-trivial
performance impact. I'd like to see consistency checks by default so
we can handle intentional heap corruption attacks better. This check
isn't expensive...
-Kees
--
Kees Cook
Pixel Security
On Tue, 11 Apr 2017, Kees Cook wrote:
> It seems that enabling the debug checks comes with a non-trivial
> performance impact. I'd like to see consistency checks by default so
> we can handle intentional heap corruption attacks better. This check
> isn't expensive...
It's in a very hot and frequently used code path.
On Tue, 11 Apr 2017, Kees Cook wrote:
> It seems that enabling the debug checks comes with a non-trivial
> performance impact. I'd like to see consistency checks by default so
> we can handle intentional heap corruption attacks better. This check
> isn't expensive...
Note also that these checks can be enabled and disabled at runtime for
each slab cache.
On Tue, Apr 11, 2017 at 9:23 AM, Christoph Lameter <[email protected]> wrote:
> On Tue, 11 Apr 2017, Kees Cook wrote:
>
>> It seems that enabling the debug checks comes with a non-trivial
>> performance impact. I'd like to see consistency checks by default so
>> we can handle intentional heap corruption attacks better. This check
>> isn't expensive...
>
> It's in a very hot and frequently used code path.
Yeah, absolutely. All the more reason to make sure the kernel can't be
attacked through it. :) As with the automotive industry analogy[1]
from Konstantin, we need to make sure Linux not only runs fast and
efficiently but also fails gracefully by default.
> Note also that these checks can be enabled and disabled at runtime for
> each slab cache.
Correct, but my understanding is that enabling them through the debug
system ends up being much more expensive than this smaller check. The
debug code is fairly comprehensive, but it hasn't been designed for
efficient attack detection, etc.
-Kees
[1] http://kernsec.org/files/lss2015/giant-bags-of-mostly-water.pdf
--
Kees Cook
Pixel Security
On Tue 11-04-17 11:16:42, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
> > static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
> > @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
> > {
> > struct kmem_cache *c;
> > unsigned long flags;
> > + struct page *page;
> >
> > trace_kfree(_RET_IP_, objp);
> >
> > if (unlikely(ZERO_OR_NULL_PTR(objp)))
> > return;
> > + page = virt_to_head_page(obj);
> > + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
>
> There is a flag SLAB_DEBUG_OBJECTS that is available for this check.
Which is way too late, at least for the kfree path. page->slab_cache
on anything other than a PageSlab page is just garbage. And my understanding
of the patch's objective is to stop that from happening.
> Consistency checks are configuraable in the slab allocator.
and they have to be compiled in (at least for SLAB) AFAIR.
--
Michal Hocko
SUSE Labs
On Tue, 11 Apr 2017, Michal Hocko wrote:
> >
> > There is a flag SLAB_DEBUG_OBJECTS that is available for this check.
>
> Which is way too late, at least for the kfree path. page->slab_cache
> on anything other than a PageSlab page is just garbage. And my understanding
> of the patch's objective is to stop that from happening.
We are looking here at SLAB. SLUB code can legitimately have a compound
page there because large allocations fall back to the page allocator.
Garbage would be attempting to free a page that has !PageSlab set but also
is not a compound page. That condition is already checked in kfree() with a
BUG_ON(), and that BUG_ON() has been there for a long time. Certainly we can
make SLAB consistent if there is no check there already. SLAB just
attempts to free that object, which will fail too.
So we are already handling that condition. Why change things? Add a BUG_ON
if you want to make SLAB consistent.
On Fri, 31 Mar 2017, Kees Cook wrote:
> As found in PaX, this adds a cheap check on heap consistency, just to
> notice if things have gotten corrupted in the page lookup.
Ok, this only affects kmem_cache_free() and not kfree(). For
kmem_cache_free() we already have a lot of stuff in the hot path due to
cgroups. If you want this also for kfree() then we need a separate patch.
Also for kmem_cache_free(): here we always have a slab cache, and thus we
could check the cache's flags to decide what behavior we want.
Acked-by: Christoph Lameter <[email protected]>
On Tue 11-04-17 13:03:01, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
> > >
> > > There is a flag SLAB_DEBUG_OBJECTS that is available for this check.
> >
> > Which is way too late, at least for the kfree path. page->slab_cache
> > on anything other than a PageSlab page is just garbage. And my understanding
> > of the patch's objective is to stop that from happening.
>
> We are looking here at SLAB. SLUB code can legitimately have a compound
> page there because large allocations fall back to the page allocator.
>
> Garbage would be attempting to free a page that has !PageSlab set but also
> is not a compound page. That condition is already checked in kfree() with a
> BUG_ON(), and that BUG_ON() has been there for a long time.
Are you talking about SLAB or SLUB here? The only
BUG_ON(!PageSlab(page)) in SLAB I can see is in kmem_freepages, and that
is way too late because we already rely on cachep, which is not
trustworthy. Or am I missing some other place you have in mind?
> Certainly we can
> make SLAB consistent if there is no check there already. Slab just
> attempts a free on that object which will fail too.
>
> So we are already handling that condition. Why change things? Add a BUG_ON
> if you want to make SLAB consistent.
I hate to repeat myself but let me do it for the last time in this
thread. BUG_ON for something that is recoverable is completely
inappropriate. And I consider kfree with a bogus pointer something that
we can easily recover from. There are other cases where the internal
state of the allocator is compromised to the point where continuing is
not possible and BUGing there is acceptable but kfree(garbage) is not
that case.
--
Michal Hocko
SUSE Labs
On Tue, 11 Apr 2017, Michal Hocko wrote:
> > So we are already handling that condition. Why change things? Add a BUG_ON
> > if you want to make SLAB consistent.
>
> I hate to repeat myself but let me do it for the last time in this
> thread. BUG_ON for something that is recoverable is completely
> inappropriate. And I consider kfree with a bogus pointer something that
> we can easily recover from. There are other cases where the internal
> state of the allocator is compromised to the point where continuing is
> not possible and BUGing there is acceptable but kfree(garbage) is not
> that case.
kfree(garbage) by the core kernel has so far been taken as a sign of
severe memory corruption and the kernels have been oopsing when this
occurred. This has been that way for a decade or so. kfree() is used by
the allocators and various other core kernel components. If the metadata
of the core kernel is compromised then it is safest to stop right there.
If you want to change things then someone has to do some work. What you
are saying is not the way things are implemented. Sorry.
Making both allocators consistent is OK with me and is an improvement of
the code.
On Tue 11-04-17 13:44:02, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
> > > So we are already handling that condition. Why change things? Add a BUG_ON
> > > if you want to make SLAB consistent.
> >
> > I hate to repeat myself but let me do it for the last time in this
> > thread. BUG_ON for something that is recoverable is completely
> > inappropriate. And I consider kfree with a bogus pointer something that
> > we can easily recover from. There are other cases where the internal
> > state of the allocator is compromised to the point where continuing is
> > not possible and BUGing there is acceptable but kfree(garbage) is not
> > that case.
>
> kfree(garbage) by the core kernel has so far been taken as a sign of
> severe memory corruption and the kernels have been oopsing when this
> occurred. This has been that way for a decade or so.
which doesn't make it a valid decision. We just overuse BUG*
> kfree() is used by
> the allocators and various other core kernel components. If the metadata
> of the core kernel is compromised then it is safest to stop right there.
>
> If you want to change things then someone has to do some work. What you
> are saying is not the way things are implemented. Sorry.
I didn't say anything like that. Hence the proposed patch which still
needs some more thinking and evaluation.
--
Michal Hocko
SUSE Labs
On Tue, 11 Apr 2017, Michal Hocko wrote:
> I didn't say anything like that. Hence the proposed patch which still
> needs some more thinking and evaluation.
This patch does not even affect kfree(). Could you start another
discussion thread where you discuss your suggestions for the changes in
the allocators and how we could go about this?
On Tue 11-04-17 13:59:44, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
> > I didn't say anything like that. Hence the proposed patch which still
> > needs some more thinking and evaluation.
>
> This patch does not even affect kfree().
Ehm? Are we even talking about the same thing? The whole discussion was
to catch invalid pointers to _kfree_ and why BUG* is not the best way to
handle that.
> Could you start another
> discussion thread where you discuss your suggestions for the changes in
> the allocators and how we could go about this?
I presume Kees will pursue
http://lkml.kernel.org/r/[email protected] or
something along those lines.
--
Michal Hocko
SUSE Labs
On Tue, 11 Apr 2017, Michal Hocko wrote:
> On Tue 11-04-17 13:59:44, Christoph Lameter wrote:
> > On Tue, 11 Apr 2017, Michal Hocko wrote:
> >
> > > I didn't say anything like that. Hence the proposed patch which still
> > > needs some more thinking and evaluation.
> >
> > This patch does not even affect kfree().
>
> Ehm? Are we even talking about the same thing? The whole discussion was
> to catch invalid pointers to _kfree_ and why BUG* is not the best way to
> handle that.
The patch does not do that. See my review. Invalid pointers to kfree() are
already caught with a BUG_ON(). See kfree() in mm/slub.c.
On Mon 17-04-17 10:22:29, Christoph Lameter wrote:
> On Tue, 11 Apr 2017, Michal Hocko wrote:
>
> > On Tue 11-04-17 13:59:44, Christoph Lameter wrote:
> > > On Tue, 11 Apr 2017, Michal Hocko wrote:
> > >
> > > > I didn't say anything like that. Hence the proposed patch which still
> > > > needs some more thinking and evaluation.
> > >
> > > This patch does not even affect kfree().
> >
> > Ehm? Are we even talking about the same thing? The whole discussion was
> > to catch invalid pointers to _kfree_ and why BUG* is not the best way to
> > handle that.
>
> The patch does not do that. See my review. Invalid pointers to kfree() are
> already caught with a BUG_ON(). See kfree() in mm/slub.c.
Are you even reading those emails? First of all, we are talking about
slab here. Secondly, I've already pointed out that the BUG_ON(!PageSlab)
in kmem_freepages is already too late because we would be operating on
potential garbage from an invalid page...
--
Michal Hocko
SUSE Labs
On Tue, 18 Apr 2017, Michal Hocko wrote:
> > The patch does not do that. See my review. Invalid pointers to kfree() are
> > already caught with a BUG_ON(). See kfree() in mm/slub.c.
>
> Are you even reading those emails? First of all, we are talking about
> slab here. Secondly, I've already pointed out that the BUG_ON(!PageSlab)
> in kmem_freepages is already too late because we would be operating on
> potential garbage from an invalid page...
Again this is confusing because you are discussing something that was not
covered by the patch submitted. Please start another discussion thread
on kfree() separately from the discussion of the patch here.
On Tue, 18 Apr 2017, Michal Hocko wrote:
> Are you even reading those emails? First of all, we are talking about
> slab here. Secondly, I've already pointed out that the BUG_ON(!PageSlab)
> in kmem_freepages is already too late because we would be operating on
> potential garbage from an invalid page...
Before I forget:
1. The patch affects both slab and slub since it patches mm/slab.h, which is
used by both allocators.
2. The check in the patch we are discussing here, when calling
kmem_cache_free(), will execute before kmem_freepages() is called
in slab.
On Tue 04-04-17 13:30:22, Michal Hocko wrote:
> On Fri 31-03-17 09:40:28, Kees Cook wrote:
> > As found in PaX, this adds a cheap check on heap consistency, just to
> > notice if things have gotten corrupted in the page lookup.
> >
> > Signed-off-by: Kees Cook <[email protected]>
>
> NAK without a proper changelog. Seriously, we do not blindly apply
> changes from other projects without a deep understanding of all
> consequences.
This still seems to be in the mmotm tree
http://www.ozlabs.org/~akpm/mmotm/broken-out/mm-add-additional-consistency-check.patch
I hope we have agreed that this is not the right approach to handling
invalid pointers in kfree and that we should rather go with something like
http://lkml.kernel.org/r/[email protected]
instead.
--
Michal Hocko
SUSE Labs
On Tue, Apr 11, 2017 at 7:19 AM, Michal Hocko <[email protected]> wrote:
> I would do something like...
> ---
> diff --git a/mm/slab.c b/mm/slab.c
> index bd63450a9b16..87c99a5e9e18 100644
> --- a/mm/slab.c
> +++ b/mm/slab.c
> @@ -393,10 +393,15 @@ static inline void set_store_user_dirty(struct kmem_cache *cachep) {}
> static int slab_max_order = SLAB_MAX_ORDER_LO;
> static bool slab_max_order_set __initdata;
>
> +static inline struct kmem_cache *page_to_cache(struct page *page)
> +{
> + return page->slab_cache;
> +}
> +
> static inline struct kmem_cache *virt_to_cache(const void *obj)
> {
> struct page *page = virt_to_head_page(obj);
> - return page->slab_cache;
> + return page_to_cache(page);
> }
>
> static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
> @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
> {
> struct kmem_cache *c;
> unsigned long flags;
> + struct page *page;
>
> trace_kfree(_RET_IP_, objp);
>
> if (unlikely(ZERO_OR_NULL_PTR(objp)))
> return;
> +	page = virt_to_head_page(objp);
> + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
> + return;
> local_irq_save(flags);
> kfree_debugcheck(objp);
> - c = virt_to_cache(objp);
> + c = page_to_cache(page);
> debug_check_no_locks_freed(objp, c->object_size);
>
> debug_check_no_obj_freed(objp, c->object_size);
Sorry for the delay, I've finally had time to look at this again.
So, this only handles the kfree() case, not the kmem_cache_free() nor
kmem_cache_free_bulk() cases, so it misses all the non-kmalloc
allocations (and kfree() ultimately calls down to kmem_cache_free()).
Similarly, my proposed patch missed the kfree() path. :P
As I work on a replacement, is the goal to avoid the checks while
under local_irq_save()? (i.e. I can't just put the check in
virt_to_cache(), etc.)
-Kees
--
Kees Cook
Pixel Security
On Thu 27-04-17 18:11:28, Kees Cook wrote:
> On Tue, Apr 11, 2017 at 7:19 AM, Michal Hocko <[email protected]> wrote:
> > I would do something like...
> > ---
> > diff --git a/mm/slab.c b/mm/slab.c
> > index bd63450a9b16..87c99a5e9e18 100644
> > --- a/mm/slab.c
> > +++ b/mm/slab.c
> > @@ -393,10 +393,15 @@ static inline void set_store_user_dirty(struct kmem_cache *cachep) {}
> > static int slab_max_order = SLAB_MAX_ORDER_LO;
> > static bool slab_max_order_set __initdata;
> >
> > +static inline struct kmem_cache *page_to_cache(struct page *page)
> > +{
> > + return page->slab_cache;
> > +}
> > +
> > static inline struct kmem_cache *virt_to_cache(const void *obj)
> > {
> > struct page *page = virt_to_head_page(obj);
> > - return page->slab_cache;
> > + return page_to_cache(page);
> > }
> >
> > static inline void *index_to_obj(struct kmem_cache *cache, struct page *page,
> > @@ -3813,14 +3818,18 @@ void kfree(const void *objp)
> > {
> > struct kmem_cache *c;
> > unsigned long flags;
> > + struct page *page;
> >
> > trace_kfree(_RET_IP_, objp);
> >
> > if (unlikely(ZERO_OR_NULL_PTR(objp)))
> > return;
> > +	page = virt_to_head_page(objp);
> > + if (CHECK_DATA_CORRUPTION(!PageSlab(page)))
> > + return;
> > local_irq_save(flags);
> > kfree_debugcheck(objp);
> > - c = virt_to_cache(objp);
> > + c = page_to_cache(page);
> > debug_check_no_locks_freed(objp, c->object_size);
> >
> > debug_check_no_obj_freed(objp, c->object_size);
>
> Sorry for the delay, I've finally had time to look at this again.
>
> So, this only handles the kfree() case, not the kmem_cache_free() nor
> kmem_cache_free_bulk() cases, so it misses all the non-kmalloc
> allocations (and kfree() ultimately calls down to kmem_cache_free()).
> Similarly, my proposed patch missed the kfree() path. :P
yes
> As I work on a replacement, is the goal to avoid the checks while
> under local_irq_save()? (i.e. I can't just put the check in
> virt_to_cache(), etc.)
You would have to check all callers of virt_to_cache. I would simply
replace BUG_ON(!PageSlab()) in cache_from_obj. kmem_cache_free already
handles NULL cache. kmem_cache_free_bulk and build_detached_freelist can
be made to do so.
--
Michal Hocko
SUSE Labs