2014-10-01 21:07:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On Mon, 29 Sep 2014 21:47:14 -0400 Sasha Levin <[email protected]> wrote:

> Currently we're seeing a few issues which are unexplainable by looking at the
> data we see and are most likely caused by a memory corruption caused
> elsewhere.
>
> This is wasting time for folks who are trying to figure out an issue provided
> a stack trace that can't really point out the real issue.
>
> This patch introduces poisoning on struct page, vm_area_struct, and mm_struct,
> and places checks in busy paths to catch corruption early.
>
> This series was tested, and it detects corruption in vm_area_struct. Right now
> I'm working on figuring out the source of the corruption, (which is a long
> standing bug) using KASan, but the current code is useful as it is.

Is this still useful if/when kasan is in place?

It looks fairly cheap - I wonder if it should simply fall under
CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.


2014-10-01 21:39:50

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/01/2014 05:07 PM, Andrew Morton wrote:
> On Mon, 29 Sep 2014 21:47:14 -0400 Sasha Levin <[email protected]> wrote:
>
>> Currently we're seeing a few issues which are unexplainable by looking at the
>> data we see and are most likely caused by a memory corruption caused
>> elsewhere.
>>
>> This is wasting time for folks who are trying to figure out an issue provided
>> a stack trace that can't really point out the real issue.
>>
>> This patch introduces poisoning on struct page, vm_area_struct, and mm_struct,
>> and places checks in busy paths to catch corruption early.
>>
>> This series was tested, and it detects corruption in vm_area_struct. Right now
>> I'm working on figuring out the source of the corruption, (which is a long
>> standing bug) using KASan, but the current code is useful as it is.
>
> Is this still useful if/when kasan is in place?

Yes, the corruption we're seeing happens inside the struct rather than around it.
kasan doesn't look there.

When kasan is merged, we could complement this patchset by making kasan trap on
when the poison is getting written, rather than triggering a BUG in some place
else after we saw the corruption.

> It looks fairly cheap - I wonder if it should simply fall under
> CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.

Config options are cheap as well :)

I'd rather expand it further and add poison/kasan trapping into other places such
as the vma interval tree rather than having to keep it "cheap".


Thanks,
Sasha

2014-10-01 21:48:36

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On Wed, 01 Oct 2014 17:39:39 -0400 Sasha Levin <[email protected]> wrote:

> > It looks fairly cheap - I wonder if it should simply fall under
> > CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.
>
> Config options are cheap as well :)

Thing is, lots of people are enabling CONFIG_DEBUG_VM, but a smaller
number of people will enable CONFIG_DEBUG_VM_POISON. Less coverage.

Defaulting to y if CONFIG_DEBUG_VM might help, but if people do `make
oldconfig' when CONFIG_DEBUG_VM=n, their CONFIG_DEBUG_VM_POISON will
get set to `n' and will remain that way when they set CONFIG_DEBUG_VM
again.

2014-10-02 03:51:54

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/01/2014 05:48 PM, Andrew Morton wrote:
> On Wed, 01 Oct 2014 17:39:39 -0400 Sasha Levin <[email protected]> wrote:
>
>>> It looks fairly cheap - I wonder if it should simply fall under
>>> CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.
>>
>> Config options are cheap as well :)
>
> Thing is, lots of people are enabling CONFIG_DEBUG_VM, but a smaller
> number of people will enable CONFIG_DEBUG_VM_POISON. Less coverage.
>
> Defaulting to y if CONFIG_DEBUG_VM might help, but if people do `make
> oldconfig' when CONFIG_DEBUG_VM=n, their CONFIG_DEBUG_VM_POISON will
> get set to `n' and will remain that way when they set CONFIG_DEBUG_VM
> again.

In that case, what about:

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index db41b15..b2c7038 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -546,6 +546,7 @@ config DEBUG_VM_RB
config DEBUG_VM_POISON
bool "Poison VM structures"
depends on DEBUG_VM
+ def_bool y
help
Add poison to the beggining and end of various VM structure to
detect memory corruption in VM management code.

We'll default to "Y" in 'make oldconfig' and it'll automatically be switched
on when the user selects CONFIG_DEBUG_VM=y, but we still keep the advantages
of having it in a different config option.


Thanks,
Sasha

2014-10-02 09:24:57

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On Wed, 1 Oct 2014, Sasha Levin wrote:
> On 10/01/2014 05:07 PM, Andrew Morton wrote:
> > On Mon, 29 Sep 2014 21:47:14 -0400 Sasha Levin <[email protected]> wrote:
> >
> >> Currently we're seeing a few issues which are unexplainable by looking at the
> >> data we see and are most likely caused by a memory corruption caused
> >> elsewhere.
> >>
> >> This is wasting time for folks who are trying to figure out an issue provided
> >> a stack trace that can't really point out the real issue.
> >>
> >> This patch introduces poisoning on struct page, vm_area_struct, and mm_struct,
> >> and places checks in busy paths to catch corruption early.
> >>
> >> This series was tested, and it detects corruption in vm_area_struct. Right now
> >> I'm working on figuring out the source of the corruption, (which is a long
> >> standing bug) using KASan, but the current code is useful as it is.
> >
> > Is this still useful if/when kasan is in place?
>
> Yes, the corruption we're seeing happens inside the struct rather than around it.
> kasan doesn't look there.
>
> When kasan is merged, we could complement this patchset by making kasan trap on
> when the poison is getting written, rather than triggering a BUG in some place
> else after we saw the corruption.
>
> > It looks fairly cheap - I wonder if it should simply fall under
> > CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.
>
> Config options are cheap as well :)
>
> I'd rather expand it further and add poison/kasan trapping into other places such
> as the vma interval tree rather than having to keep it "cheap".

I like to run with CONFIG_DEBUG_VM, and would not want this stuff
turned on in my builds (especially not the struct page enlargement);
so I'm certainly with you in preferring a separate option.

But it all seems very ad hoc to me. Are people going to be adding
more and more mm structures into it, ad infinitum? And adding
CONFIG_DEBUG_SCHED_POISON one day when someone notices corruption
of a scheduler structure? etc etc.

What does this add on top of slab poisoning? Some checks in some
mm places while the object is active, I guess: why not base those
on slab poisoning? And add them in as appropriate to the problem
at hand, when a problem is seen.

I think these patches are fine for investigating whatever is the
problem currently afflicting you and mm under trinity; but we all
have our temporary debugging patches, I don't think all deserve
preservation in everyone else's kernel, that amounts to far more
clutter than any are worth.

I'm glad to hear they've confirmed some vm_area_struct corruption:
any ideas on where that's coming from?

Hugh

2014-10-02 14:58:34

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/02/2014 05:23 AM, Hugh Dickins wrote:
> On Wed, 1 Oct 2014, Sasha Levin wrote:
>> On 10/01/2014 05:07 PM, Andrew Morton wrote:
>>> On Mon, 29 Sep 2014 21:47:14 -0400 Sasha Levin <[email protected]> wrote:
>>>
>>>> Currently we're seeing a few issues which are unexplainable by looking at the
>>>> data we see and are most likely caused by a memory corruption caused
>>>> elsewhere.
>>>>
>>>> This is wasting time for folks who are trying to figure out an issue provided
>>>> a stack trace that can't really point out the real issue.
>>>>
>>>> This patch introduces poisoning on struct page, vm_area_struct, and mm_struct,
>>>> and places checks in busy paths to catch corruption early.
>>>>
>>>> This series was tested, and it detects corruption in vm_area_struct. Right now
>>>> I'm working on figuring out the source of the corruption, (which is a long
>>>> standing bug) using KASan, but the current code is useful as it is.
>>>
>>> Is this still useful if/when kasan is in place?
>>
>> Yes, the corruption we're seeing happens inside the struct rather than around it.
>> kasan doesn't look there.
>>
>> When kasan is merged, we could complement this patchset by making kasan trap on
>> when the poison is getting written, rather than triggering a BUG in some place
>> else after we saw the corruption.
>>
>>> It looks fairly cheap - I wonder if it should simply fall under
>>> CONFIG_DEBUG_VM rather than the new CONFIG_DEBUG_VM_POISON.
>>
>> Config options are cheap as well :)
>>
>> I'd rather expand it further and add poison/kasan trapping into other places such
>> as the vma interval tree rather than having to keep it "cheap".
>
> I like to run with CONFIG_DEBUG_VM, and would not want this stuff
> turned on in my builds (especially not the struct page enlargement);
> so I'm certainly with you in preferring a separate option.
>
> But it all seems very ad hoc to me. Are people going to be adding
> more and more mm structures into it, ad infinitum? And adding
> CONFIG_DEBUG_SCHED_POISON one day when someone notices corruption
> of a scheduler structure? etc etc.

That was my plan, yes.

> What does this add on top of slab poisoning? Some checks in some
> mm places while the object is active, I guess: why not base those
> on slab poisoning? And add them in as appropriate to the problem
> at hand, when a problem is seen.

The extra you're getting is detecting corruption that happened
inside the object rather than around it. In the case of poisoning
working along with kasan you don't have to limit it to slab either,
so you can detect issues in static objects as well.

fwiw, there's currently a long standing issue with corruption inside
spinlocks in sched code. This sort of issues always exist, so (at least)
my kernel would always have poisoning in some form from now on.

> I think these patches are fine for investigating whatever is the
> problem currently afflicting you and mm under trinity; but we all
> have our temporary debugging patches, I don't think all deserve
> preservation in everyone else's kernel, that amounts to far more
> clutter than any are worth.

If the issue is lines of code we can look into making it cleaner.

> I'm glad to hear they've confirmed some vm_area_struct corruption:
> any ideas on where that's coming from?

Nope, I've added kasan poisoning to vm_area_struct but it has not
reproduced since then, I've just hit bunch of different issues.


Thanks,
Sasha

2014-10-02 15:13:25

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On Thu, Oct 02, 2014 at 02:23:08AM -0700, Hugh Dickins wrote:

> I think these patches are fine for investigating whatever is the
> problem currently afflicting you and mm under trinity; but we all
> have our temporary debugging patches, I don't think all deserve
> preservation in everyone else's kernel, that amounts to far more
> clutter than any are worth.

One problem with keeping things like this in -mm (or other non-Linus tree)
is that they bit-rot quickly, and become a pain to apply, especially if
they are perpetually on top of other changes in -mm.

I looked at trying these patches on Linus' tree when Sasha posted them,
but lost motivation when I realized they needed other bits of -mm too.

It may be that after Andrews 3.18+ mega-merge things would be simpler,
but I have a feeling it wouldn't be long before the situation would
arise again.

Dave

2014-10-07 22:16:18

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/02/2014 07:58 AM, Sasha Levin wrote:
>> > What does this add on top of slab poisoning? Some checks in some
>> > mm places while the object is active, I guess: why not base those
>> > on slab poisoning? And add them in as appropriate to the problem
>> > at hand, when a problem is seen.
> The extra you're getting is detecting corruption that happened
> inside the object rather than around it.

Isn't this more akin to redzoning that poisoning?

I'm not sure I follow the logic here. The poison is inside the object,
but it's now at the edges. With slub at least, you get a redzone right
after the object(s):

{ OBJ } | REDZONE | { OBJ } | REDZONE | ...

With this patch, you'd get something along these lines:

{ POISON | OBJ | POISON } { POISON | OBJ | POISON } ...

So if somebody overflows OBJ, they'll hit the redzone/poison in either
case. If they're randomly scribbling on memory, their likelihood of
hitting the redzone/poison is proportional to the size of the
redzone/poison.

The only place this really helps is if someone overflows from a
non-redzoned page or structure in to the beginning of a slub redzoned
one. The fact that the redzone is at the end means we'll miss it.

But, all that means is that we should probably add redzones to the
beginning of slub objects, not just the end. That doesn't help us with
'struct page' of course, but it does for the mm_struct and vma.

2014-10-08 16:44:12

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/07/2014 06:16 PM, Dave Hansen wrote:
> On 10/02/2014 07:58 AM, Sasha Levin wrote:
>>>> What does this add on top of slab poisoning? Some checks in some
>>>> mm places while the object is active, I guess: why not base those
>>>> on slab poisoning? And add them in as appropriate to the problem
>>>> at hand, when a problem is seen.
>> The extra you're getting is detecting corruption that happened
>> inside the object rather than around it.
>
> Isn't this more akin to redzoning that poisoning?
>
> I'm not sure I follow the logic here. The poison is inside the object,
> but it's now at the edges. With slub at least, you get a redzone right
> after the object(s):
>
> { OBJ } | REDZONE | { OBJ } | REDZONE | ...
>
> With this patch, you'd get something along these lines:
>
> { POISON | OBJ | POISON } { POISON | OBJ | POISON } ...
>
> So if somebody overflows OBJ, they'll hit the redzone/poison in either
> case. If they're randomly scribbling on memory, their likelihood of
> hitting the redzone/poison is proportional to the size of the
> redzone/poison.
>
> The only place this really helps is if someone overflows from a
> non-redzoned page or structure in to the beginning of a slub redzoned
> one. The fact that the redzone is at the end means we'll miss it.
>
> But, all that means is that we should probably add redzones to the
> beginning of slub objects, not just the end. That doesn't help us with
> 'struct page' of course, but it does for the mm_struct and vma.

This patchset is based on an actual issue we're seeing where the vma
gets corrupted without triggering any of the slub redzones.

Testing this patchset locally confirmed that while slub redzones stay
intact, the poison fields get overwritten - so now we're able to catch
the corruption after it happened.

I'm not sure what's the scenario that causes that, but once we figure
that out I could have a better response to your question...


Thanks,
Sasha

2014-10-09 19:11:54

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 0/5] mm: poison critical mm/ structs

On 10/02/2014 05:23 AM, Hugh Dickins wrote:
> I'm glad to hear they've confirmed some vm_area_struct corruption:
> any ideas on where that's coming from?

Hugh,

I think that what we're seeing isn't a corruption of vm_area_struct
per-se, but something weirder.

I've poisoned every spot where vm_area_struct is allocated, and yet
there seems to be nothing that's hitting that field before we end
up using a "zeroed out" vm_area_struct.

The results are the same both with and without kasan, there seems
to be no corruption happening anywhere, but we somehow end up with
an empty vm_area_struct.

It also somewhat makes sense considering that we're seeing no slub
corruption either. Either something is zeroing out *exactly*
vm_area_struct, or it's not really corruption...


Thanks,
Sasha