2015-11-12 04:32:45

by Minchan Kim

Subject: [PATCH v3 00/17] MADV_FREE support

MADV_FREE has been sitting in linux-next for a long time. I think there were two reasons.

1. The MADV_FREE code on the reclaim path was a real mess.

2. Andrew really wanted to hear from userland people who want to use
the syscall.

A few months ago, Daniel Micay (an active jemalloc contributor) asked me
to make progress on upstreaming, but I was busy at the time, so it took
me a long while to revisit the code. I finally cleaned up the mess
recently, which addresses issue #1.

In addition, Daniel and Jason (the jemalloc maintainer) recently asked
Andrew about it again and said it would be great to have even with the
current swap dependency, so Andrew decided to take it for v4.4.

However, some concerns remained.

* hotness

Some people think MADV_FREEed pages are really cold while others do not.
See the description of "mm: add knob to tune lazyfreeing" for details.

* swap dependency

In the old version, MADV_FREE was equal to MADV_DONTNEED on a swapless
system because we don't have an aged anonymous LRU list without swap.
So there have been requests for MADV_FREE to support swapless systems.

To address these issues, this version adds a new LRU list for
hinted pages and a tuning knob. With that, we can support swapless
systems without zapping hinted pages instantly.

Please review and comment.

I have tested it on v4.3-rc7 and haven't found any problems so far.

git: git://git.kernel.org/pub/scm/linux/kernel/git/minchan/linux.git
branch: mm/madv_free-v4.3-rc7-v3-lazyfreelru

At this stage, I don't think we need to write a man page.
That can be done once the policy and implementation are solid.

* Changes from v2
* add new LRU list and tuning knob
* support swapless

* Changes from v1
* Don't do unnecessary TLB flush - Shaohua
* Added Acked-by - Hugh, Michal
* Merge deactivate_page and deactivate_file_page
* Add pmd_dirty/pmd_mkclean patches for several arches
* Add lazy THP split patch
* Drop [email protected] - Delivery Failure

Chen Gang (1):
arch: uapi: asm: mman.h: Let MADV_FREE have same value for all
architectures

Minchan Kim (16):
mm: support madvise(MADV_FREE)
mm: define MADV_FREE for some arches
mm: free swp_entry in madvise_free
mm: move lazily freed pages to inactive list
mm: clear PG_dirty to mark page freeable
mm: mark stable page dirty in KSM
x86: add pmd_[dirty|mkclean] for THP
sparc: add pmd_[dirty|mkclean] for THP
powerpc: add pmd_[dirty|mkclean] for THP
arm: add pmd_mkclean for THP
arm64: add pmd_mkclean for THP
mm: don't split THP page when syscall is called
mm: introduce wrappers to add new LRU
mm: introduce lazyfree LRU list
mm: support MADV_FREE on swapless system
mm: add knob to tune lazyfreeing

Documentation/sysctl/vm.txt | 13 +++
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/arm/include/asm/pgtable-3level.h | 1 +
arch/arm64/include/asm/pgtable.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/powerpc/include/asm/pgtable-ppc64.h | 2 +
arch/sparc/include/asm/pgtable_64.h | 9 ++
arch/x86/include/asm/pgtable.h | 5 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/huge_mm.h | 3 +
include/linux/memcontrol.h | 1 +
include/linux/mm_inline.h | 83 ++++++++++++++-
include/linux/mmzone.h | 16 ++-
include/linux/page-flags.h | 5 +
include/linux/rmap.h | 1 +
include/linux/swap.h | 18 +++-
include/linux/vm_event_item.h | 3 +-
include/trace/events/vmscan.h | 38 ++++---
include/uapi/asm-generic/mman-common.h | 1 +
kernel/sysctl.c | 9 ++
mm/compaction.c | 14 ++-
mm/huge_memory.c | 51 +++++++--
mm/ksm.c | 6 ++
mm/madvise.c | 171 ++++++++++++++++++++++++++++++
mm/memcontrol.c | 44 +++++++-
mm/memory-failure.c | 7 +-
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 28 ++---
mm/page_alloc.c | 3 +
mm/rmap.c | 14 +++
mm/swap.c | 128 +++++++++++++++-------
mm/swap_state.c | 11 +-
mm/truncate.c | 2 +-
mm/vmscan.c | 157 ++++++++++++++++++++-------
mm/vmstat.c | 4 +
40 files changed, 713 insertions(+), 153 deletions(-)

--
1.9.1


2015-11-12 04:36:34

by Minchan Kim

Subject: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

Linux doesn't have the ability to free pages lazily, while other OSes
have supported this for a long time via madvise(MADV_FREE).

The gain is clear: under memory pressure, the kernel can discard freed
pages instead of swapping them out or invoking the OOM killer.

Without memory pressure, freed pages can be reused by userspace without
any additional overhead (e.g. page fault + allocation + zeroing).

Jason Evans said:

: Facebook has been using MAP_UNINITIALIZED
: (https://lkml.org/lkml/2012/1/18/308) in some of its applications for
: several years, but there are operational costs to maintaining this
: out-of-tree in our kernel and in jemalloc, and we are anxious to retire it
: in favor of MADV_FREE. When we first enabled MAP_UNINITIALIZED it
: increased throughput for much of our workload by ~5%, and although the
: benefit has decreased using newer hardware and kernels, there is still
: enough benefit that we cannot reasonably retire it without a replacement.
:
: Aside from Facebook operations, there are numerous broadly used
: applications that would benefit from MADV_FREE. The ones that immediately
: come to mind are redis, varnish, and MariaDB. I don't have much insight
: into Android internals and development process, but I would hope to see
: MADV_FREE support eventually end up there as well to benefit applications
: linked with the integrated jemalloc.
:
: jemalloc will use MADV_FREE once it becomes available in the Linux kernel.
: In fact, jemalloc already uses MADV_FREE or equivalent everywhere it's
: available: *BSD, OS X, Windows, and Solaris -- every platform except Linux
: (and AIX, but I'm not sure it even compiles on AIX). The lack of
: MADV_FREE on Linux forced me down a long series of increasingly
: sophisticated heuristics for madvise() volume reduction, and even so this
: remains a common performance issue for people using jemalloc on Linux.
: Please integrate MADV_FREE; many people will benefit substantially.

How it works:

When the madvise syscall is called, the VM clears the dirty bit of the ptes
in the range. When memory pressure happens, the VM checks the dirty bit of
the page table entry; if it is still "clean", the page is a "lazyfree" page,
so the VM can discard it instead of swapping it out. If there was a store to
the page before the VM picked it for reclaim, the dirty bit is set, so the
VM swaps the page out instead of discarding it.

The first heavy users would be general-purpose allocators (e.g. jemalloc,
tcmalloc, and hopefully glibc), and jemalloc/tcmalloc already support the
feature on other OSes (e.g. FreeBSD).
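
For reference, here is a minimal userspace sketch of how a caller would use
the hint. This example is not part of the original posting; MADV_FREE is
assumed to come from the installed uapi headers, with a fallback to the
value 8 that this series settles on.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumption: the value this series settles on */
#endif

#define CHUNK	(64UL << 20)	/* 64M of anonymous memory */

int main(void)
{
	char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	memset(p, 1, CHUNK);	/* dirty every page */

	/*
	 * Hint that the contents are no longer needed.  Under memory
	 * pressure the kernel may discard the clean pages instead of
	 * swapping them out.
	 */
	if (madvise(p, CHUNK, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/* A later store re-dirties the page, so it is swapped, not dropped. */
	p[0] = 2;

	munmap(p, CHUNK);
	return 0;
}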

barrios@blaptop:~/benchmark/ebizzy$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 12
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 2
Stepping: 3
CPU MHz: 3200.185
BogoMIPS: 6400.53
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
ebizzy benchmark (./ebizzy -S 10 -n 512)

Higher avg is better.

vanilla-jemalloc MADV_free-jemalloc

1 thread
records: 10 records: 10
avg: 2961.90 avg: 12069.70
std: 71.96(2.43%) std: 186.68(1.55%)
max: 3070.00 max: 12385.00
min: 2796.00 min: 11746.00

2 thread
records: 10 records: 10
avg: 5020.00 avg: 17827.00
std: 264.87(5.28%) std: 358.52(2.01%)
max: 5244.00 max: 18760.00
min: 4251.00 min: 17382.00

4 thread
records: 10 records: 10
avg: 8988.80 avg: 27930.80
std: 1175.33(13.08%) std: 3317.33(11.88%)
max: 9508.00 max: 30879.00
min: 5477.00 min: 21024.00

8 thread
records: 10 records: 10
avg: 13036.50 avg: 33739.40
std: 170.67(1.31%) std: 5146.22(15.25%)
max: 13371.00 max: 40572.00
min: 12785.00 min: 24088.00

16 thread
records: 10 records: 10
avg: 11092.40 avg: 31424.20
std: 710.60(6.41%) std: 3763.89(11.98%)
max: 12446.00 max: 36635.00
min: 9949.00 min: 25669.00

32 thread
records: 10 records: 10
avg: 11067.00 avg: 34495.80
std: 971.06(8.77%) std: 2721.36(7.89%)
max: 12010.00 max: 38598.00
min: 9002.00 min: 30636.00

In summary, MADV_FREE is much faster than MADV_DONTNEED.

Acked-by: Hugh Dickins <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/rmap.h | 1 +
include/linux/vm_event_item.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 132 +++++++++++++++++++++++++++++++++
mm/rmap.c | 7 ++
mm/swap_state.c | 5 +-
mm/vmscan.c | 10 ++-
mm/vmstat.c | 1 +
8 files changed, 153 insertions(+), 5 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 29446aeef36e..f4c992826242 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,6 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
+ TTU_FREE = 8, /* free mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 9246d32dc973..2b1cef88b827 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -25,6 +25,7 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
PGFREE, PGACTIVATE, PGDEACTIVATE,
PGFAULT, PGMAJFAULT,
+ PGLAZYFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index ddc3b36f1046..7a94102b7a02 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,6 +34,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/mm/madvise.c b/mm/madvise.c
index c889fcbb530e..a8813f7b37b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -20,6 +20,9 @@
#include <linux/backing-dev.h>
#include <linux/swap.h>
#include <linux/swapops.h>
+#include <linux/mmu_notifier.h>
+
+#include <asm/tlb.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -32,6 +35,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
return 0;
default:
/* be safe, default to 1. list exceptions explicitly */
@@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}

+static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+
+{
+ struct mmu_gather *tlb = walk->private;
+ struct mm_struct *mm = tlb->mm;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;
+ pte_t *pte, ptent;
+ struct page *page;
+
+ split_huge_page_pmd(vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+ pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ for (; addr != end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (PageSwapCache(page)) {
+ if (!trylock_page(page))
+ continue;
+
+ if (!try_to_free_swap(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ ClearPageDirty(page);
+ unlock_page(page);
+ }
+
+ if (pte_young(ptent) || pte_dirty(ptent)) {
+ /*
+ * Some of architecture(ex, PPC) don't update TLB
+ * with set_pte_at and tlb_remove_tlb_entry so for
+ * the portability, remap the pte with old|clean
+ * after pte clearing.
+ */
+ ptent = ptep_get_and_clear_full(mm, addr, pte,
+ tlb->fullmm);
+
+ ptent = pte_mkold(ptent);
+ ptent = pte_mkclean(ptent);
+ set_pte_at(mm, addr, pte, ptent);
+ tlb_remove_tlb_entry(tlb, pte, addr);
+ }
+ }
+
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+ cond_resched();
+ return 0;
+}
+
+static void madvise_free_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk free_walk = {
+ .pmd_entry = madvise_free_pte_range,
+ .mm = vma->vm_mm,
+ .private = tlb,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &free_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static int madvise_free_single_vma(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ unsigned long start, end;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ /* MADV_FREE works for only anon vma at the moment */
+ if (!vma_is_anonymous(vma))
+ return -EINVAL;
+
+ start = max(vma->vm_start, start_addr);
+ if (start >= vma->vm_end)
+ return -EINVAL;
+ end = min(vma->vm_end, end_addr);
+ if (end <= vma->vm_start)
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start, end);
+ update_hiwater_rss(mm);
+
+ mmu_notifier_invalidate_range_start(mm, start, end);
+ madvise_free_page_range(&tlb, vma, start, end);
+ mmu_notifier_invalidate_range_end(mm, start, end);
+ tlb_finish_mmu(&tlb, start, end);
+
+ return 0;
+}
+
+static long madvise_free(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ *prev = vma;
+ return madvise_free_single_vma(vma, start, end);
+}
+
/*
* Application no longer needs these pages. If the pages are dirty,
* it's OK to just throw them away. The app will be more careful about
@@ -379,6 +502,14 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_FREE:
+ /*
+ * XXX: In this implementation, MADV_FREE works like
+ * MADV_DONTNEED on swapless system or full swap.
+ */
+ if (get_nr_swap_pages() > 0)
+ return madvise_free(vma, prev, start, end);
+ /* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
default:
@@ -398,6 +529,7 @@ madvise_behavior_valid(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_FREE:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/rmap.c b/mm/rmap.c
index f5b5c1f3dcd7..9449e91839ab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,6 +1374,12 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

+ if (!PageDirty(page) && (flags & TTU_FREE)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ }
+
if (PageSwapCache(page)) {
/*
* Store the swap location in the pte.
@@ -1414,6 +1420,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
} else
dec_mm_counter(mm, MM_FILEPAGES);

+discard:
page_remove_rmap(page);
page_cache_release(page);

diff --git a/mm/swap_state.c b/mm/swap_state.c
index d504adb7fa5f..10f63eded7b7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -185,13 +185,12 @@ int add_to_swap(struct page *page, struct list_head *list)
* deadlock in the swap out path.
*/
/*
- * Add it to the swap cache and mark it dirty
+ * Add it to the swap cache.
*/
err = add_to_swap_cache(page, entry,
__GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN);

- if (!err) { /* Success */
- SetPageDirty(page);
+ if (!err) {
return 1;
} else { /* -ENOMEM radix-tree allocation failure */
/*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7f63a9381f71..7a415b9fdd34 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -906,6 +906,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
int may_enter_fs;
enum page_references references = PAGEREF_RECLAIM_CLEAN;
bool dirty, writeback;
+ bool freeable = false;

cond_resched();

@@ -1049,6 +1050,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
+ freeable = true;
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1060,8 +1062,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* processes. Try to unmap it here.
*/
if (page_mapped(page) && mapping) {
- switch (try_to_unmap(page,
- ttu_flags|TTU_BATCH_FLUSH)) {
+ switch (try_to_unmap(page, freeable ?
+ (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
+ (ttu_flags | TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1186,6 +1189,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
__clear_page_locked(page);
free_it:
+ if (freeable && !PageDirty(page))
+ count_vm_event(PGLAZYFREED);
+
nr_reclaimed++;

/*
diff --git a/mm/vmstat.c b/mm/vmstat.c
index fbf14485a049..59d45b22355f 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -759,6 +759,7 @@ const char * const vmstat_text[] = {

"pgfault",
"pgmajfault",
+ "pglazyfreed",

TEXTS_FOR_ZONES("pgrefill")
TEXTS_FOR_ZONES("pgsteal_kswapd")
--
1.9.1

2015-11-12 04:38:08

by Minchan Kim

Subject: [PATCH v3 02/17] mm: define MADV_FREE for some arches

Most architectures use asm-generic, but alpha, mips, parisc and xtensa
need their own definitions.

This patch defines MADV_FREE for them, so it should fix the build
breakage on those architectures.

Maybe I should split this up and feed the pieces to the arch maintainers,
but it is included here for mmotm convenience.

Cc: Michael Kerrisk <[email protected]>
Cc: Richard Henderson <[email protected]>
Cc: Ivan Kokshaysky <[email protected]>
Cc: "James E.J. Bottomley" <[email protected]>
Cc: Helge Deller <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: Chris Zankel <[email protected]>
Acked-by: Max Filippov <[email protected]>
Reported-by: kbuild test robot <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/xtensa/include/uapi/asm/mman.h | 1 +
4 files changed, 4 insertions(+)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 0086b472bc2b..836fbd44f65b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
+#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index cfcb876cae6b..106e741aa7ee 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,6 +67,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 294d251ca7b2..6cb8db76fd4e 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,6 +40,7 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
+#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
#define MADV_REMOVE 9 /* remove these pages & resources */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 201aec0e0446..1b19f25bc567 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,6 +80,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
#define MADV_REMOVE 9 /* remove these pages & resources */
--
1.9.1

2015-11-12 04:32:47

by Minchan Kim

Subject: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

From: Chen Gang <[email protected]>

For the uapi, we need to try to keep macros at the same value across
architectures. MADV_FREE was added to the main branch recently, so we need
to redefine MADV_FREE accordingly.

At present, the value '8' can be shared by all architectures, so redefine
it to '8'.

Cc: [email protected] <[email protected]>,
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: [email protected] <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Minchan Kim <[email protected]>
Signed-off-by: Chen Gang <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 2 +-
arch/mips/include/uapi/asm/mman.h | 2 +-
arch/parisc/include/uapi/asm/mman.h | 2 +-
arch/xtensa/include/uapi/asm/mman.h | 2 +-
include/uapi/asm-generic/mman-common.h | 2 +-
5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 836fbd44f65b..0b8a5de7aee3 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -44,9 +44,9 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_SPACEAVAIL 5 /* ensure resources are available */
#define MADV_DONTNEED 6 /* don't need these pages */
-#define MADV_FREE 7 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index 106e741aa7ee..d247f5457944 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -67,9 +67,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 6cb8db76fd4e..700d83fd9352 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -40,9 +40,9 @@
#define MADV_SPACEAVAIL 5 /* insure that resources are reserved */
#define MADV_VPS_PURGE 6 /* Purge pages from VM page cache */
#define MADV_VPS_INHERIT 7 /* Inherit parents page size */
-#define MADV_FREE 8 /* free pages only if memory pressure */

/* common/generic parameters */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index 1b19f25bc567..77eaca434071 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -80,9 +80,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 7a94102b7a02..869595947873 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -34,9 +34,9 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
-#define MADV_FREE 5 /* free pages only if memory pressure */

/* common parameters: try to keep these consistent across architectures */
+#define MADV_FREE 8 /* free pages only if memory pressure */
#define MADV_REMOVE 9 /* remove these pages & resources */
#define MADV_DONTFORK 10 /* don't inherit across fork */
#define MADV_DOFORK 11 /* do inherit across fork */
--
1.9.1

2015-11-12 04:38:06

by Minchan Kim

Subject: [PATCH v3 04/17] mm: free swp_entry in madvise_free

When I tested the piece of code below with 12 processes (i.e. 512M * 12 = 6G
consumed) on my machine (3G RAM + 12 CPUs + 8G swap), madvise_free was
significantly slower (i.e. about 2x) than madvise_dontneed.

loop = 5;
mmap(512M);
while (loop--) {
	memset(512M);
	madvise(MADV_FREE or MADV_DONTNEED);
}
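
For completeness, here is a self-contained version of the test as I read it.
This is a sketch, not the original program: the 512M mapping, the five
iterations and the MADV_FREE/MADV_DONTNEED choice come from the description
above, while the error handling, the fallback define and the command-line
switch are my additions.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumption: the value this series settles on */
#endif

#define SIZE	(512UL << 20)	/* 512M per process */

int main(int argc, char **argv)
{
	/* pass "dontneed" to test MADV_DONTNEED instead of MADV_FREE */
	int advice = (argc > 1 && !strcmp(argv[1], "dontneed")) ?
					MADV_DONTNEED : MADV_FREE;
	int loop = 5;
	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	while (loop--) {
		memset(p, 1, SIZE);	/* dirty every page */
		if (madvise(p, SIZE, advice))
			perror("madvise");
	}

	munmap(p, SIZE);
	return 0;
}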

The reason is lots of swapin.

1) dontneed: 1,612 swapin
2) madvfree: 879,585 swapin

If we find that hinted pages have already been swapped out when the syscall
is called, it's pointless to keep the swap entries in the ptes.
Instead, let's free those cold pages, because swap-in is more expensive
than (page allocation + zeroing).

With this patch, swap-in was reduced from 879,585 to 1,878, so the elapsed
time improved:

1) dontneed: 6.10user 233.50system 0:50.44elapsed
2) madvfree: 6.03user 401.17system 1:30.67elapsed
3) madvfree + this patch: 6.70user 339.14system 1:04.45elapsed

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 26 +++++++++++++++++++++++++-
1 file changed, 25 insertions(+), 1 deletion(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index a8813f7b37b3..6240a5de4a3a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -270,6 +270,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
spinlock_t *ptl;
pte_t *pte, ptent;
struct page *page;
+ int nr_swap = 0;

split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
@@ -280,8 +281,24 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
for (; addr != end; pte++, addr += PAGE_SIZE) {
ptent = *pte;

- if (!pte_present(ptent))
+ if (pte_none(ptent))
continue;
+ /*
+ * If the pte has swp_entry, just clear page table to
+ * prevent swap-in which is more expensive rather than
+ * (page allocation + zeroing).
+ */
+ if (!pte_present(ptent)) {
+ swp_entry_t entry;
+
+ entry = pte_to_swp_entry(ptent);
+ if (non_swap_entry(entry))
+ continue;
+ nr_swap--;
+ free_swap_and_cache(entry);
+ pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
+ continue;
+ }

page = vm_normal_page(vma, addr, ptent);
if (!page)
@@ -317,6 +334,13 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
}
}

+ if (nr_swap) {
+ if (current->mm == mm)
+ sync_mm_rss(mm);
+
+ add_mm_counter(mm, MM_SWAPENTS, nr_swap);
+ }
+
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
--
1.9.1

2015-11-12 04:36:32

by Minchan Kim

Subject: [PATCH v3 05/17] mm: move lazily freed pages to inactive list

MADV_FREE is a hint that it's okay to discard pages under memory pressure,
and we rely on the reclaimers (i.e. kswapd and direct reclaim) to free them,
so there is no value in keeping them on the active anonymous LRU list.
This patch moves them to the head of the inactive LRU list.

This means that MADV_FREE-ed pages which were living on the inactive list
are reclaimed first, because they are more likely to be cold than recently
active pages.

An arguable issue with the approach is whether we should put the page at the
head or the tail of the inactive list. I chose the head because the kernel
cannot be sure whether the page is really cold or warm for every MADV_FREE
usecase, but at least we know it's not *hot*, so landing at the head of the
inactive list is a compromise for various usecases.

This fixes the suboptimal behavior of MADV_FREE where pages living on the
active list would sit there for a long time even under memory pressure
while the inactive list was reclaimed heavily. That basically defeats the
whole purpose of using MADV_FREE: helping the system free memory that
might not be used again.

Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Shaohua Li <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/swap.h | 2 +-
mm/madvise.c | 3 +++
mm/swap.c | 62 +++++++++++++++++++++++++++++-----------------------
mm/truncate.c | 2 +-
4 files changed, 40 insertions(+), 29 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 7ba7dccaf0e7..8e944c0cedea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ extern void lru_add_drain(void);
extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
-extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/mm/madvise.c b/mm/madvise.c
index 6240a5de4a3a..3462a3ca9690 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -317,6 +317,9 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unlock_page(page);
}

+ if (PageActive(page))
+ deactivate_page(page);
+
if (pte_young(ptent) || pte_dirty(ptent)) {
/*
* Some of architecture(ex, PPC) don't update TLB
diff --git a/mm/swap.c b/mm/swap.c
index 983f692a47fd..a2f2cd458de0 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -44,7 +44,7 @@ int page_cluster;

static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
-static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -733,13 +733,13 @@ void lru_cache_add_active_or_unevictable(struct page *page,
}

/*
- * If the page can not be invalidated, it is moved to the
+ * If the file page can not be invalidated, it is moved to the
* inactive list to speed up its reclaim. It is moved to the
* head of the list, rather than the tail, to give the flusher
* threads some time to write it out, as this is much more
* effective than the single-page writeout from reclaim.
*
- * If the page isn't page_mapped and dirty/writeback, the page
+ * If the file page isn't page_mapped and dirty/writeback, the page
* could reclaim asap using PG_reclaim.
*
* 1. active, mapped page -> none
@@ -752,32 +752,36 @@ void lru_cache_add_active_or_unevictable(struct page *page,
* In 4, why it moves inactive's head, the VM expects the page would
* be write it out by flusher threads as this is much more effective
* than the single-page writeout from reclaim.
+ *
+ * If @page is anonymous page, it is moved to the inactive list.
*/
-static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru, file;
- bool active;
+ int lru;
+ bool file, active;

- if (!PageLRU(page))
+ if (!PageLRU(page) || PageUnevictable(page))
return;

- if (PageUnevictable(page))
- return;
+ file = page_is_file_cache(page);
+ active = PageActive(page);
+ lru = page_lru_base_type(page);

- /* Some processes are using the page */
- if (page_mapped(page))
+ if (!file && !active)
return;

- active = PageActive(page);
- file = page_is_file_cache(page);
- lru = page_lru_base_type(page);
+ if (file && page_mapped(page))
+ return;

del_page_from_lru_list(page, lruvec, lru + active);
ClearPageActive(page);
- ClearPageReferenced(page);
add_page_to_lru_list(page, lruvec, lru);

+ if (!file)
+ goto out;
+
+ ClearPageReferenced(page);
if (PageWriteback(page) || PageDirty(page)) {
/*
* PG_reclaim could be raced with end_page_writeback
@@ -793,9 +797,10 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
list_move_tail(&page->lru, &lruvec->lists[lru]);
__count_vm_event(PGROTATED);
}
-
+out:
if (active)
__count_vm_event(PGDEACTIVATE);
+
update_page_reclaim_stat(lruvec, file, 0);
}

@@ -821,22 +826,25 @@ void lru_add_drain_cpu(int cpu)
local_irq_restore(flags);
}

- pvec = &per_cpu(lru_deactivate_file_pvecs, cpu);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
if (pagevec_count(pvec))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);

activate_page_drain(cpu);
}

/**
- * deactivate_file_page - forcefully deactivate a file page
+ * deactivate_page - forcefully deactivate a page
* @page: page to deactivate
*
- * This function hints the VM that @page is a good reclaim candidate,
- * for example if its invalidation fails due to the page being dirty
- * or under writeback.
+ * This function hints the VM that @page is a good reclaim candidate to
+ * accelerate the reclaim of @page.
+ * For example,
+ * 1. Invalidation of file-page fails due to the page being dirty or under
+ * writeback.
+ * 2. MADV_FREE hinted anonymous page.
*/
-void deactivate_file_page(struct page *page)
+void deactivate_page(struct page *page)
{
/*
* In a workload with many unevictable page such as mprotect,
@@ -846,11 +854,11 @@ void deactivate_file_page(struct page *page)
return;

if (likely(get_page_unless_zero(page))) {
- struct pagevec *pvec = &get_cpu_var(lru_deactivate_file_pvecs);
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);

if (!pagevec_add(pvec, page))
- pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
- put_cpu_var(lru_deactivate_file_pvecs);
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
}
}

@@ -882,7 +890,7 @@ void lru_add_drain_all(void)

if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
- pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad97102..cf8d44679364 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -488,7 +488,7 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
* of interest and try to speed up its reclaim.
*/
if (!ret)
- deactivate_file_page(page);
+ deactivate_page(page);
count += ret;
}
pagevec_remove_exceptionals(&pvec);
--
1.9.1

2015-11-12 04:36:30

by Minchan Kim

Subject: [PATCH v3 06/17] mm: clear PG_dirty to mark page freeable

Basically, MADV_FREE relies on the dirty bit in the page table entry to
decide whether the VM is allowed to discard the page or not. IOW, if the
page table entry has the dirty bit set, the VM must not discard the page.

However, as an example, if a swap-in happens via a read fault, the page
table entry doesn't have the dirty bit set, so MADV_FREE could wrongly
discard the page.

To avoid the problem, MADV_FREE did additional checks with PageDirty
and PageSwapCache. That worked because a swapped-in page lives in the
swap cache, and once it is evicted from the swap cache, the page has the
PG_dirty flag. So the two page flag checks effectively prevent wrong
discarding by MADV_FREE.

However, a problem with the above logic is that a swapped-in page keeps
PG_dirty after it is removed from the swap cache, so the VM can no longer
consider the page freeable even if madvise_free is called later.

Look at the example below for details.

ptr = malloc();
memset(ptr);
..
..
.. heavy memory pressure so all of the pages are swapped out
..
..
var = *ptr;        -> a page is swapped in and could be removed from
                      the swap cache. Then, the page table entry does
                      not have the dirty bit set, but the page
                      descriptor has PG_dirty.
..
..
madvise_free(ptr); -> It doesn't clear PG_dirty of the page.
..
..
..
.. heavy memory pressure again.
.. This time, the VM cannot discard the page because the page
.. still has *PG_dirty*.

To solve the problem, this patch clears PG_dirty only if the page is owned
exclusively by the current process when madvise is called, because PG_dirty
represents the ptes' dirtiness across several processes, so we can clear it
only if we own the page exclusively.

Acked-by: Michal Hocko <[email protected]>
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 3462a3ca9690..4e67ba0b1104 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -304,11 +304,19 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
if (!page)
continue;

- if (PageSwapCache(page)) {
+ if (PageSwapCache(page) || PageDirty(page)) {
if (!trylock_page(page))
continue;
+ /*
+ * If page is shared with others, we couldn't clear
+ * PG_dirty of the page.
+ */
+ if (page_count(page) != 1 + !!PageSwapCache(page)) {
+ unlock_page(page);
+ continue;
+ }

- if (!try_to_free_swap(page)) {
+ if (PageSwapCache(page) && !try_to_free_swap(page)) {
unlock_page(page);
continue;
}
--
1.9.1

2015-11-12 04:36:27

by Minchan Kim

Subject: [PATCH v3 07/17] mm: mark stable page dirty in KSM

The MADV_FREE patchset changes page reclaim to simply free a clean
anonymous page with no dirty ptes, instead of swapping it out; but
KSM uses clean write-protected ptes to reference the stable ksm page.
So be sure to mark that page dirty, so it's never mistakenly discarded.

[hughd: adjusted comments]
Acked-by: Hugh Dickins <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
mm/ksm.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/ksm.c b/mm/ksm.c
index 7ee101eaacdf..18d2b7afecff 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1053,6 +1053,12 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
*/
set_page_stable_node(page, NULL);
mark_page_accessed(page);
+ /*
+ * Page reclaim just frees a clean page with no dirty
+ * ptes: make sure that the ksm page would be swapped.
+ */
+ if (!PageDirty(page))
+ SetPageDirty(page);
err = 0;
} else if (pages_identical(page, kpage))
err = replace_page(vma, page, kpage, orig_pte);
--
1.9.1

2015-11-12 04:36:29

by Minchan Kim

Subject: [PATCH v3 08/17] x86: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
1 file changed, 5 insertions(+)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 867da5bbb4a3..b964d54300e1 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -267,6 +267,11 @@ static inline pmd_t pmd_mkold(pmd_t pmd)
return pmd_clear_flags(pmd, _PAGE_ACCESSED);
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ return pmd_clear_flags(pmd, _PAGE_DIRTY);
+}
+
static inline pmd_t pmd_wrprotect(pmd_t pmd)
{
return pmd_clear_flags(pmd, _PAGE_RW);
--
1.9.1

2015-11-12 04:34:50

by Minchan Kim

Subject: [PATCH v3 09/17] sparc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 131d36fcd07a..5833dc5ee7d7 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -717,6 +717,15 @@ static inline pmd_t pmd_mkdirty(pmd_t pmd)
return __pmd(pte_val(pte));
}

+static inline pmd_t pmd_mkclean(pmd_t pmd)
+{
+ pte_t pte = __pte(pmd_val(pmd));
+
+ pte = pte_mkclean(pte);
+
+ return __pmd(pte_val(pte));
+}
+
static inline pmd_t pmd_mkyoung(pmd_t pmd)
{
pte_t pte = __pte(pmd_val(pmd));
--
1.9.1

2015-11-12 04:34:51

by Minchan Kim

Subject: [PATCH v3 10/17] powerpc: add pmd_[dirty|mkclean] for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_dirty and pmd_mkclean for MADV_FREE support on THP
pages.

Signed-off-by: Minchan Kim <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
arch/powerpc/include/asm/pgtable-ppc64.h | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64.h b/arch/powerpc/include/asm/pgtable-ppc64.h
index fa1dfb7f7b48..85e15c8067be 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64.h
@@ -507,9 +507,11 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
#define pmd_pfn(pmd) pte_pfn(pmd_pte(pmd))
#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_young(pmd) pte_young(pmd_pte(pmd))
+#define pmd_dirty(pmd) pte_dirty(pmd_pte(pmd))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_wrprotect(pmd) pte_pmd(pte_wrprotect(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))

--
1.9.1

2015-11-12 04:34:46

by Minchan Kim

Subject: [PATCH v3 11/17] arm: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_mkclean for MADV_FREE support on THP pages.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index a745a2a53853..6d6012a320b2 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -249,6 +249,7 @@ PMD_BIT_FUNC(mkold, &= ~PMD_SECT_AF);
PMD_BIT_FUNC(mksplitting, |= L_PMD_SECT_SPLITTING);
PMD_BIT_FUNC(mkwrite, &= ~L_PMD_SECT_RDONLY);
PMD_BIT_FUNC(mkdirty, |= L_PMD_SECT_DIRTY);
+PMD_BIT_FUNC(mkclean, &= ~L_PMD_SECT_DIRTY);
PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);

#define pmd_mkhuge(pmd) (__pmd(pmd_val(pmd) & ~PMD_TABLE_BIT))
--
1.9.1

2015-11-12 04:34:45

by Minchan Kim

Subject: [PATCH v3 12/17] arm64: add pmd_mkclean for THP

MADV_FREE needs pmd_dirty and pmd_mkclean to detect recent overwrites of
the contents, since the MADV_FREE syscall can be called on THP pages.

This patch adds pmd_mkclean for MADV_FREE support on THP pages.

Signed-off-by: Minchan Kim <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 26b066690593..a945263addd4 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -325,6 +325,7 @@ void pmdp_splitting_flush(struct vm_area_struct *vma, unsigned long address,
#define pmd_mksplitting(pmd) pte_pmd(pte_mkspecial(pmd_pte(pmd)))
#define pmd_mkold(pmd) pte_pmd(pte_mkold(pmd_pte(pmd)))
#define pmd_mkwrite(pmd) pte_pmd(pte_mkwrite(pmd_pte(pmd)))
+#define pmd_mkclean(pmd) pte_pmd(pte_mkclean(pmd_pte(pmd)))
#define pmd_mkdirty(pmd) pte_pmd(pte_mkdirty(pmd_pte(pmd)))
#define pmd_mkyoung(pmd) pte_pmd(pte_mkyoung(pmd_pte(pmd)))
#define pmd_mknotpresent(pmd) (__pmd(pmd_val(pmd) & ~PMD_TYPE_MASK))
--
1.9.1

2015-11-12 04:32:58

by Minchan Kim

Subject: [PATCH v3 13/17] mm: don't split THP page when syscall is called

We don't need to split a THP page when the MADV_FREE syscall is called.
The split can be done later, when the VM decides to free the page on the
reclaim path under heavy memory pressure, so we can avoid an unnecessary
THP split.

For that, this patch changes two things:

1. __split_huge_page_map

It does pte_mkdirty on the subpages only if pmd_dirty is true.

2. __split_huge_page_refcount

It removes the unconditional marking of PG_dirty on the subpages.

Cc: Kirill A. Shutemov <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/huge_mm.h | 3 +++
mm/huge_memory.c | 46 ++++++++++++++++++++++++++++++++++++++++++----
mm/madvise.c | 12 +++++++++++-
3 files changed, 56 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index ecb080d6ff42..e9db238a75c1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -19,6 +19,9 @@ extern struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
unsigned long addr,
pmd_t *pmd,
unsigned int flags);
+extern int madvise_free_huge_pmd(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr);
extern int zap_huge_pmd(struct mmu_gather *tlb,
struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bbac913f96bc..b8c9b44af864 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1453,6 +1453,41 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

+int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr)
+
+{
+ spinlock_t *ptl;
+ pmd_t orig_pmd;
+ struct page *page;
+ struct mm_struct *mm = tlb->mm;
+
+ if (__pmd_trans_huge_lock(pmd, vma, &ptl) != 1)
+ return 1;
+
+ orig_pmd = *pmd;
+ if (is_huge_zero_pmd(orig_pmd))
+ goto out;
+
+ page = pmd_page(orig_pmd);
+ if (PageActive(page))
+ deactivate_page(page);
+
+ if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
+ orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
+ tlb->fullmm);
+ orig_pmd = pmd_mkold(orig_pmd);
+ orig_pmd = pmd_mkclean(orig_pmd);
+
+ set_pmd_at(mm, addr, pmd, orig_pmd);
+ tlb_remove_pmd_tlb_entry(tlb, pmd, addr);
+ }
+out:
+ spin_unlock(ptl);
+
+ return 0;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
@@ -1752,8 +1787,8 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
- (1L << PG_unevictable)));
- page_tail->flags |= (1L << PG_dirty);
+ (1L << PG_unevictable) |
+ (1L << PG_dirty)));

/* clear PageTail before overwriting first_page */
smp_wmb();
@@ -1787,7 +1822,6 @@ static void __split_huge_page_refcount(struct page *page,

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
- BUG_ON(!PageDirty(page_tail));
BUG_ON(!PageSwapBacked(page_tail));

lru_add_page_tail(page, page_tail, lruvec, list);
@@ -1831,10 +1865,12 @@ static int __split_huge_page_map(struct page *page,
int ret = 0, i;
pgtable_t pgtable;
unsigned long haddr;
+ bool dirty;

pmd = page_check_address_pmd(page, mm, address,
PAGE_CHECK_ADDRESS_PMD_SPLITTING_FLAG, &ptl);
if (pmd) {
+ dirty = pmd_dirty(*pmd);
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);
if (pmd_write(*pmd))
@@ -1850,7 +1886,9 @@ static int __split_huge_page_map(struct page *page,
* permissions across VMAs.
*/
entry = mk_pte(page + i, vma->vm_page_prot);
- entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ if (dirty)
+ entry = pte_mkdirty(entry);
+ entry = maybe_mkwrite(entry, vma);
if (!pmd_write(*pmd))
entry = pte_wrprotect(entry);
if (!pmd_young(*pmd))
diff --git a/mm/madvise.c b/mm/madvise.c
index 4e67ba0b1104..27ed057c0bd7 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -271,8 +271,17 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
pte_t *pte, ptent;
struct page *page;
int nr_swap = 0;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ if (next - addr != HPAGE_PMD_SIZE)
+ split_huge_page_pmd(vma, addr, pmd);
+ else if (!madvise_free_huge_pmd(tlb, vma, pmd, addr))
+ goto next;
+ /* fall through */
+ }

- split_huge_page_pmd(vma, addr, pmd);
if (pmd_trans_unstable(pmd))
return 0;

@@ -355,6 +364,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
cond_resched();
+next:
return 0;
}

--
1.9.1

2015-11-12 04:32:56

by Minchan Kim

Subject: [PATCH v3 14/17] mm: introduce wrappers to add new LRU

We have been using the binary variable "file" to identify whether a page is
on the anon LRU or the file LRU. That works, but it becomes an obstacle once
we add a new LRU.

So, this patch introduces some wrapper functions to handle it.

Signed-off-by: Minchan Kim <[email protected]>
---
include/linux/mm_inline.h | 64 +++++++++++++++++++++++++++++++++++++++++--
include/trace/events/vmscan.h | 24 ++++++++--------
mm/compaction.c | 2 +-
mm/huge_memory.c | 5 ++--
mm/memory-failure.c | 7 ++---
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 3 +-
mm/migrate.c | 26 ++++++------------
mm/swap.c | 22 ++++++---------
mm/vmscan.c | 12 ++++----
10 files changed, 104 insertions(+), 64 deletions(-)

diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index cf55945c83fb..5e08a354f936 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -8,8 +8,8 @@
* page_is_file_cache - should the page be on a file LRU or anon LRU?
* @page: the page to test
*
- * Returns 1 if @page is page cache page backed by a regular filesystem,
- * or 0 if @page is anonymous, tmpfs or otherwise ram or swap backed.
+ * Returns true if @page is page cache page backed by a regular filesystem,
+ * or false if @page is anonymous, tmpfs or otherwise ram or swap backed.
* Used by functions that manipulate the LRU lists, to sort a page
* onto the right LRU list.
*
@@ -17,7 +17,7 @@
* needs to survive until the page is last deleted from the LRU, which
* could be as far down as __page_cache_release.
*/
-static inline int page_is_file_cache(struct page *page)
+static inline bool page_is_file_cache(struct page *page)
{
return !PageSwapBacked(page);
}
@@ -56,6 +56,64 @@ static inline enum lru_list page_lru_base_type(struct page *page)
}

/**
+ * lru_index - which LRU list is lru on for accouting update_page_reclaim_stat
+ *
+ * Used for LRU list index arithmetic.
+ *
+ * Returns 0 if @lru is anon, 1 if it is file.
+ */
+static inline int lru_index(enum lru_list lru)
+{
+ int base;
+
+ switch (lru) {
+ case LRU_INACTIVE_ANON:
+ case LRU_ACTIVE_ANON:
+ base = 0;
+ break;
+ case LRU_INACTIVE_FILE:
+ case LRU_ACTIVE_FILE:
+ base = 1;
+ break;
+ default:
+ BUG();
+ }
+ return base;
+}
+
+/*
+ * page_off_isolate - which LRU list was page on for accouting NR_ISOLATED.
+ * @page: the page to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int page_off_isolate(struct page *page)
+{
+ int lru = NR_ISOLATED_ANON;
+
+ if (!PageSwapBacked(page))
+ lru = NR_ISOLATED_FILE;
+ return lru;
+}
+
+/**
+ * lru_off_isolate - which LRU list was @lru on for accouting NR_ISOLATED.
+ * @lru: the lru to test
+ *
+ * Returns the LRU list a page was on, as an index into the array of
+ * zone_page_state;
+ */
+static inline int lru_off_isolate(enum lru_list lru)
+{
+ int base = NR_ISOLATED_FILE;
+
+ if (lru <= LRU_ACTIVE_ANON)
+ base = NR_ISOLATED_ANON;
+ return base;
+}
+
+/**
* page_off_lru - which LRU list was page on? clearing its lru flags.
* @page: the page to test
*
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f66476b96264..4e9e86733849 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -30,9 +30,9 @@
(RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(file) \
+#define trace_shrink_flags(lru) \
( \
- (file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
+ (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
(RECLAIM_WB_ASYNC) \
)

@@ -271,9 +271,9 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file),
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru),

TP_STRUCT__entry(
__field(int, order)
@@ -281,7 +281,7 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__field(unsigned long, nr_scanned)
__field(unsigned long, nr_taken)
__field(isolate_mode_t, isolate_mode)
- __field(int, file)
+ __field(enum lru_list, lru)
),

TP_fast_assign(
@@ -290,16 +290,16 @@ DECLARE_EVENT_CLASS(mm_vmscan_lru_isolate_template,
__entry->nr_scanned = nr_scanned;
__entry->nr_taken = nr_taken;
__entry->isolate_mode = isolate_mode;
- __entry->file = file;
+ __entry->lru = lru;
),

- TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu file=%d",
+ TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu lru=%d",
__entry->isolate_mode,
__entry->order,
__entry->nr_requested,
__entry->nr_scanned,
__entry->nr_taken,
- __entry->file)
+ __entry->lru)
);

DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
@@ -309,9 +309,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_lru_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

@@ -322,9 +322,9 @@ DEFINE_EVENT(mm_vmscan_lru_isolate_template, mm_vmscan_memcg_isolate,
unsigned long nr_scanned,
unsigned long nr_taken,
isolate_mode_t isolate_mode,
- int file),
+ enum lru_list lru),

- TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, file)
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, isolate_mode, lru)

);

diff --git a/mm/compaction.c b/mm/compaction.c
index c5c627aae996..d888fa248ebb 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -632,7 +632,7 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
return;

list_for_each_entry(page, &cc->migratepages, lru)
- count[!!page_is_file_cache(page)]++;
+ count[page_off_isolate(page) - NR_ISOLATED_ANON]++;

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b8c9b44af864..d020aec63717 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2218,8 +2218,7 @@ void __khugepaged_exit(struct mm_struct *mm)

static void release_pte_page(struct page *page)
{
- /* 0 stands for page_is_file_cache(page) == false */
- dec_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ dec_zone_page_state(page, page_off_isolate(page));
unlock_page(page);
putback_lru_page(page);
}
@@ -2302,7 +2301,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
goto out;
}
/* 0 stands for page_is_file_cache(page) == false */
- inc_zone_page_state(page, NR_ISOLATED_ANON + 0);
+ inc_zone_page_state(page, page_off_isolate(page));
VM_BUG_ON_PAGE(!PageLocked(page), page);
VM_BUG_ON_PAGE(PageLRU(page), page);

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 95882692e747..abf50e00705b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1682,16 +1682,15 @@ static int __soft_offline_page(struct page *page, int flags)
put_hwpoison_page(page);
if (!ret) {
LIST_HEAD(pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
MIGRATE_SYNC, MR_MEMORY_FAILURE);
if (ret) {
if (!list_empty(&pagelist)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page,
+ page_off_isolate(page));
putback_lru_page(page);
}

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index aa992e2df58a..7c8360744551 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1449,8 +1449,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
put_page(page);
list_add_tail(&page->lru, &source);
move_pages--;
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));

} else {
#ifdef CONFIG_DEBUG_VM
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 87a177917cb2..856b6eb07e42 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -930,8 +930,7 @@ static void migrate_page_add(struct page *page, struct list_head *pagelist,
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
if (!isolate_lru_page(page)) {
list_add_tail(&page->lru, pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
}
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 842ecd7aaf7f..87ebf0833b84 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -91,8 +91,7 @@ void putback_movable_pages(struct list_head *l)
continue;
}
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
if (unlikely(isolated_balloon_page(page)))
balloon_page_putback(page);
else
@@ -964,8 +963,7 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
* restored.
*/
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
/* Soft-offlined page shouldn't go through lru cache list */
if (reason == MR_MEMORY_FAILURE) {
put_page(page);
@@ -1278,8 +1276,7 @@ static int do_move_page_to_node_array(struct mm_struct *mm,
err = isolate_lru_page(page);
if (!err) {
list_add_tail(&page->lru, &pagelist);
- inc_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ inc_zone_page_state(page, page_off_isolate(page));
}
put_and_set:
/*
@@ -1622,8 +1619,6 @@ static bool numamigrate_update_ratelimit(pg_data_t *pgdat,

static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
{
- int page_lru;
-
VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page), page);

/* Avoid migrating to a node that is nearly full */
@@ -1645,8 +1640,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
return 0;
}

- page_lru = page_is_file_cache(page);
- mod_zone_page_state(page_zone(page), NR_ISOLATED_ANON + page_lru,
+ mod_zone_page_state(page_zone(page), page_off_isolate(page),
hpage_nr_pages(page));

/*
@@ -1704,8 +1698,7 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
if (nr_remaining) {
if (!list_empty(&migratepages)) {
list_del(&page->lru);
- dec_zone_page_state(page, NR_ISOLATED_ANON +
- page_is_file_cache(page));
+ dec_zone_page_state(page, page_off_isolate(page));
putback_lru_page(page);
}
isolated = 0;
@@ -1735,7 +1728,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
pg_data_t *pgdat = NODE_DATA(node);
int isolated = 0;
struct page *new_page = NULL;
- int page_lru = page_is_file_cache(page);
+ int page_lru = page_off_isolate(page);
unsigned long mmun_start = address & HPAGE_PMD_MASK;
unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
pmd_t orig_entry;
@@ -1794,8 +1787,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
/* Retake the callers reference and putback on LRU */
get_page(page);
putback_lru_page(page);
- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru, -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);

goto out_unlock;
}
@@ -1847,9 +1839,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
count_vm_numa_events(NUMA_PAGE_MIGRATE, HPAGE_PMD_NR);

- mod_zone_page_state(page_zone(page),
- NR_ISOLATED_ANON + page_lru,
- -HPAGE_PMD_NR);
+ mod_zone_page_state(page_zone(page), page_lru, -HPAGE_PMD_NR);
return isolated;

out_fail:
diff --git a/mm/swap.c b/mm/swap.c
index a2f2cd458de0..367940d093ad 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -490,21 +490,20 @@ void rotate_reclaimable_page(struct page *page)
}

static void update_page_reclaim_stat(struct lruvec *lruvec,
- int file, int rotated)
+ int lru, int rotated)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- reclaim_stat->recent_scanned[file]++;
+ reclaim_stat->recent_scanned[lru]++;
if (rotated)
- reclaim_stat->recent_rotated[file]++;
+ reclaim_stat->recent_rotated[lru]++;
}

static void __activate_page(struct page *page, struct lruvec *lruvec,
void *arg)
{
if (PageLRU(page) && !PageActive(page) && !PageUnevictable(page)) {
- int file = page_is_file_cache(page);
- int lru = page_lru_base_type(page);
+ enum lru_list lru = page_lru_base_type(page);

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
@@ -513,7 +512,7 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,
trace_mm_lru_activate(page);

__count_vm_event(PGACTIVATE);
- update_page_reclaim_stat(lruvec, file, 1);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 1);
}
}

@@ -758,7 +757,7 @@ void lru_cache_add_active_or_unevictable(struct page *page,
static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int lru;
+ enum lru_list lru;
bool file, active;

if (!PageLRU(page) || PageUnevictable(page))
@@ -801,7 +800,7 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
if (active)
__count_vm_event(PGDEACTIVATE);

- update_page_reclaim_stat(lruvec, file, 0);
+ update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}

/*
@@ -1002,8 +1001,6 @@ EXPORT_SYMBOL(__pagevec_release);
void lru_add_page_tail(struct page *page, struct page *page_tail,
struct lruvec *lruvec, struct list_head *list)
{
- const int file = 0;
-
VM_BUG_ON_PAGE(!PageHead(page), page);
VM_BUG_ON_PAGE(PageCompound(page_tail), page);
VM_BUG_ON_PAGE(PageLRU(page_tail), page);
@@ -1034,14 +1031,13 @@ void lru_add_page_tail(struct page *page, struct page *page_tail,
}

if (!PageUnevictable(page))
- update_page_reclaim_stat(lruvec, file, PageActive(page_tail));
+ update_page_reclaim_stat(lruvec, 0, PageActive(page_tail));
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,
void *arg)
{
- int file = page_is_file_cache(page);
int active = PageActive(page);
enum lru_list lru = page_lru(page);

@@ -1049,7 +1045,7 @@ static void __pagevec_lru_add_fn(struct page *page, struct lruvec *lruvec,

SetPageLRU(page);
add_page_to_lru_list(page, lruvec, lru);
- update_page_reclaim_stat(lruvec, file, active);
+ update_page_reclaim_stat(lruvec, lru_index(lru), active);
trace_mm_lru_insertion(page, lru);
}

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7a415b9fdd34..f731084c3a23 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1398,7 +1398,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,

*nr_scanned = scan;
trace_mm_vmscan_lru_isolate(sc->order, nr_to_scan, scan,
- nr_taken, mode, is_file_lru(lru));
+ nr_taken, mode, lru_index(lru));
return nr_taken;
}

@@ -1599,7 +1599,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
&nr_scanned, sc, isolate_mode, lru);

__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);

if (global_reclaim(sc)) {
__mod_zone_page_state(zone, NR_PAGES_SCANNED, nr_scanned);
@@ -1633,7 +1633,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

putback_inactive_pages(lruvec, &page_list);

- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);

spin_unlock_irq(&zone->lru_lock);

@@ -1701,7 +1701,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(file));
+ trace_shrink_flags(lru));
return nr_reclaimed;
}

@@ -1800,7 +1800,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

__count_zone_vm_events(PGREFILL, zone, nr_scanned);
__mod_zone_page_state(zone, NR_LRU_BASE + lru, -nr_taken);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), nr_taken);
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {
@@ -1857,7 +1857,7 @@ static void shrink_active_list(unsigned long nr_to_scan,

move_active_pages_to_lru(lruvec, &l_active, &l_hold, lru);
move_active_pages_to_lru(lruvec, &l_inactive, &l_hold, lru - LRU_ACTIVE);
- __mod_zone_page_state(zone, NR_ISOLATED_ANON + file, -nr_taken);
+ __mod_zone_page_state(zone, lru_off_isolate(lru), -nr_taken);
spin_unlock_irq(&zone->lru_lock);

mem_cgroup_uncharge_list(&l_hold);
--
1.9.1

2015-11-12 04:33:48

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 15/17] mm: introduce lazyfree LRU list

There are issues with supporting MADV_FREE.

* MADV_FREE pages' hotness

It's really arguable: some think they are cold while others think
they are not. It's workload-dependent, so I don't think there is a
single right answer. IOW, we need a tunable knob.

* MADV_FREE on swapless system

Currently, we instantly free MADV_FREEed pages on a swapless system
because we don't have an aged anonymous LRU list there, so there is
no later chance to discard them.

I tried to solve it with the inactive anonymous LRU list, without
introducing a new LRU list, but that needs a few hooks in the reclaim
path to preserve the old behavior, which I didn't like. Moreover, it
makes implementing the tuning knob hard.

To address these issues, this patch adds a new LazyFree LRU list and
functions for its statistics. Pages on the list have the PG_lazyfree
flag, which overlays PG_mappedtodisk (that should be safe because no
anonymous page can have that flag).

If the user calls madvise(start, len, MADV_FREE), pages in the range
move from the anonymous LRU to the lazyfree LRU. When memory pressure
happens, they can be discarded because there has been no store to
them since the hint. If there is a store, they move back to the
active anonymous LRU list.

In this patch, the aging of lazyfree pages is very basic: it just
discards all pages in the list whenever memory pressure happens.
That is enough to prove the approach works. A later patch will
implement the policy.
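
For reference, here is a minimal userspace sketch of how an allocator
would use the hint with this series. It is illustration only and makes
assumptions: it needs headers that define MADV_FREE, and the fallback
value 8 below is just a placeholder, not necessarily the value this
series assigns.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* assumed fallback; use the value from the real headers */
#endif

int main(void)
{
	size_t len = 1UL << 20;	/* 1MiB of anonymous memory */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Dirty the pages so they sit on the anon LRU. */
	memset(buf, 0xa5, len);

	/*
	 * Hint that the contents are no longer needed.  The pages move to
	 * the lazyfree LRU and may be discarded under memory pressure;
	 * until then, reads still return the old data.
	 */
	if (madvise(buf, len, MADV_FREE))
		perror("madvise(MADV_FREE)");

	/*
	 * A later store dirties the pages again, which cancels the hint;
	 * the kernel keeps them (back on the anon LRU) instead of
	 * discarding them.
	 */
	buf[0] = 1;

	munmap(buf, len);
	return 0;
}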

Signed-off-by: Minchan Kim <[email protected]>
---
drivers/base/node.c | 2 +
drivers/staging/android/lowmemorykiller.c | 3 +-
fs/proc/meminfo.c | 2 +
include/linux/mm_inline.h | 25 +++++++++--
include/linux/mmzone.h | 11 +++--
include/linux/page-flags.h | 5 +++
include/linux/rmap.h | 2 +-
include/linux/swap.h | 1 +
include/linux/vm_event_item.h | 4 +-
include/trace/events/vmscan.h | 18 +++++---
mm/compaction.c | 12 ++++--
mm/huge_memory.c | 4 +-
mm/madvise.c | 3 +-
mm/memcontrol.c | 14 +++++-
mm/migrate.c | 2 +
mm/page_alloc.c | 3 ++
mm/rmap.c | 15 +++++--
mm/swap.c | 48 +++++++++++++++++++++
mm/vmscan.c | 71 +++++++++++++++++++++++++------
mm/vmstat.c | 3 ++
20 files changed, 203 insertions(+), 45 deletions(-)

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 560751bad294..f7a1f2107b43 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -70,6 +70,7 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
"Node %d Unevictable: %8lu kB\n"
+ "Node %d LazyFree: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,6 +84,7 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
nid, K(node_page_state(nid, NR_UNEVICTABLE)),
+ nid, K(node_page_state(nid, NR_LZFREE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index 872bd603fd0d..658c16a653c2 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -72,7 +72,8 @@ static unsigned long lowmem_count(struct shrinker *s,
return global_page_state(NR_ACTIVE_ANON) +
global_page_state(NR_ACTIVE_FILE) +
global_page_state(NR_INACTIVE_ANON) +
- global_page_state(NR_INACTIVE_FILE);
+ global_page_state(NR_INACTIVE_FILE) +
+ global_page_state(NR_LZFREE);
}

static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index d3ebf2e61853..3444f7c4e0b6 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -102,6 +102,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
"Unevictable: %8lu kB\n"
+ "LazyFree: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,6 +160,7 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
K(pages[LRU_UNEVICTABLE]),
+ K(pages[LRU_LZFREE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index 5e08a354f936..7342400f434d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -26,6 +26,10 @@ static __always_inline void add_page_to_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(PageActive(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, nr_pages);
list_add(&page->lru, &lruvec->lists[lru]);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, nr_pages);
@@ -35,6 +39,10 @@ static __always_inline void del_page_from_lru_list(struct page *page,
struct lruvec *lruvec, enum lru_list lru)
{
int nr_pages = hpage_nr_pages(page);
+
+ if (lru == LRU_LZFREE)
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+
mem_cgroup_update_lru_size(lruvec, lru, -nr_pages);
list_del(&page->lru);
__mod_zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru, -nr_pages);
@@ -46,12 +54,14 @@ static __always_inline void del_page_from_lru_list(struct page *page,
*
* Used for LRU list index arithmetic.
*
- * Returns the base LRU type - file or anon - @page should be on.
+ * Returns the base LRU type - file or anon or lazyfree - @page should be on.
*/
static inline enum lru_list page_lru_base_type(struct page *page)
{
if (page_is_file_cache(page))
return LRU_INACTIVE_FILE;
+ if (PageLazyFree(page))
+ return LRU_LZFREE;
return LRU_INACTIVE_ANON;
}

@@ -60,7 +70,7 @@ static inline enum lru_list page_lru_base_type(struct page *page)
*
* Used for LRU list index arithmetic.
*
- * Returns 0 if @lru is anon, 1 if it is file.
+ * Returns 0 if @lru is anon, 1 if it is file, 2 if it is lazyfree
*/
static inline int lru_index(enum lru_list lru)
{
@@ -75,6 +85,9 @@ static inline int lru_index(enum lru_list lru)
case LRU_ACTIVE_FILE:
base = 1;
break;
+ case LRU_LZFREE:
+ base = 2;
+ break;
default:
BUG();
}
@@ -90,10 +103,12 @@ static inline int lru_index(enum lru_list lru)
*/
static inline int page_off_isolate(struct page *page)
{
- int lru = NR_ISOLATED_ANON;
+ int lru = NR_ISOLATED_LZFREE;

if (!PageSwapBacked(page))
lru = NR_ISOLATED_FILE;
+ else if (PageLazyFree(page))
+ lru = NR_ISOLATED_LZFREE;
return lru;
}

@@ -106,10 +121,12 @@ static inline int page_off_isolate(struct page *page)
*/
static inline int lru_off_isolate(enum lru_list lru)
{
- int base = NR_ISOLATED_FILE;
+ int base = NR_ISOLATED_LZFREE;

if (lru <= LRU_ACTIVE_ANON)
base = NR_ISOLATED_ANON;
+ else if (lru <= LRU_ACTIVE_FILE)
+ base = NR_ISOLATED_FILE;
return base;
}

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d94347737292..1aaa436da0d5 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -121,6 +121,7 @@ enum zone_stat_item {
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
NR_UNEVICTABLE, /* " " " " " */
+ NR_LZFREE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -140,6 +141,7 @@ enum zone_stat_item {
NR_WRITEBACK_TEMP, /* Writeback using temporary buffers */
NR_ISOLATED_ANON, /* Temporary isolated pages from anon lru */
NR_ISOLATED_FILE, /* Temporary isolated pages from file lru */
+ NR_ISOLATED_LZFREE, /* Temporary isolated pages from lzfree lru */
NR_SHMEM, /* shmem pages (included tmpfs/GEM pages) */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
@@ -178,6 +180,7 @@ enum lru_list {
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
LRU_UNEVICTABLE,
+ LRU_LZFREE,
NR_LRU_LISTS
};

@@ -207,10 +210,11 @@ struct zone_reclaim_stat {
* The higher the rotated/scanned ratio, the more valuable
* that cache is.
*
- * The anon LRU stats live in [0], file LRU stats in [1]
+ * The anon LRU stats live in [0], file LRU stats in [1],
+ * lazyfree LRU stats in [2]
*/
- unsigned long recent_rotated[2];
- unsigned long recent_scanned[2];
+ unsigned long recent_rotated[3];
+ unsigned long recent_scanned[3];
};

struct lruvec {
@@ -224,6 +228,7 @@ struct lruvec {
/* Mask used at gathering information at once (see memcontrol.c) */
#define LRU_ALL_FILE (BIT(LRU_INACTIVE_FILE) | BIT(LRU_ACTIVE_FILE))
#define LRU_ALL_ANON (BIT(LRU_INACTIVE_ANON) | BIT(LRU_ACTIVE_ANON))
+#define LRU_ALL_LZFREE (BIT(LRU_LZFREE))
#define LRU_ALL ((1 << NR_LRU_LISTS) - 1)

/* Isolate clean file */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 416509e26d6d..14f0643af5c4 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -115,6 +115,9 @@ enum pageflags {
#endif
__NR_PAGEFLAGS,

+ /* MADV_FREE */
+ PG_lazyfree = PG_mappedtodisk,
+
/* Filesystems */
PG_checked = PG_owner_priv_1,

@@ -343,6 +346,8 @@ TESTPAGEFLAG_FALSE(Ksm)

u64 stable_page_flags(struct page *page);

+PAGEFLAG(LazyFree, lazyfree);
+
static inline int PageUptodate(struct page *page)
{
int ret = test_bit(PG_uptodate, &(page)->flags);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index f4c992826242..edace84b45d5 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -85,7 +85,7 @@ enum ttu_flags {
TTU_UNMAP = 1, /* unmap mode */
TTU_MIGRATION = 2, /* migration mode */
TTU_MUNLOCK = 4, /* munlock mode */
- TTU_FREE = 8, /* free mode */
+ TTU_LZFREE = 8, /* lazyfree mode */

TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e944c0cedea..f0310eeab3ec 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -308,6 +308,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_page(struct page *page);
+extern void add_page_to_lazyfree_list(struct page *page);
extern void swap_setup(void);

extern void add_page_to_unevictable_list(struct page *page);
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 2b1cef88b827..7ebfd7ca992d 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -23,9 +23,9 @@

enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
FOR_ALL_ZONES(PGALLOC),
- PGFREE, PGACTIVATE, PGDEACTIVATE,
+ PGFREE, PGACTIVATE, PGDEACTIVATE, PGLZFREE,
PGFAULT, PGMAJFAULT,
- PGLAZYFREED,
+ PGLZFREED,
FOR_ALL_ZONES(PGREFILL),
FOR_ALL_ZONES(PGSTEAL_KSWAPD),
FOR_ALL_ZONES(PGSTEAL_DIRECT),
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 4e9e86733849..a7ce9169b0fa 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -12,28 +12,32 @@

#define RECLAIM_WB_ANON 0x0001u
#define RECLAIM_WB_FILE 0x0002u
+#define RECLAIM_WB_LZFREE 0x0004u
#define RECLAIM_WB_MIXED 0x0010u
-#define RECLAIM_WB_SYNC 0x0004u /* Unused, all reclaim async */
-#define RECLAIM_WB_ASYNC 0x0008u
+#define RECLAIM_WB_SYNC 0x0040u /* Unused, all reclaim async */
+#define RECLAIM_WB_ASYNC 0x0080u

#define show_reclaim_flags(flags) \
(flags) ? __print_flags(flags, "|", \
{RECLAIM_WB_ANON, "RECLAIM_WB_ANON"}, \
{RECLAIM_WB_FILE, "RECLAIM_WB_FILE"}, \
+ {RECLAIM_WB_LZFREE, "RECLAIM_WB_LZFREE"}, \
{RECLAIM_WB_MIXED, "RECLAIM_WB_MIXED"}, \
{RECLAIM_WB_SYNC, "RECLAIM_WB_SYNC"}, \
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"

#define trace_reclaim_flags(page) ( \
- (page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (page_is_file_cache(page) ? RECLAIM_WB_FILE : \
+ (PageLazyFree(page) ? RECLAIM_WB_LZFREE : \
+ RECLAIM_WB_ANON)) | (RECLAIM_WB_ASYNC) \
)

-#define trace_shrink_flags(lru) \
+#define trace_shrink_flags(lru_idx) \
( \
- (lru ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
- (RECLAIM_WB_ASYNC) \
+ (lru_idx == 1 ? RECLAIM_WB_FILE : (lru_idx == 0 ? \
+ RECLAIM_WB_ANON : RECLAIM_WB_LZFREE)) | \
+ (RECLAIM_WB_ASYNC) \
)

TRACE_EVENT(mm_vmscan_kswapd_sleep,
diff --git a/mm/compaction.c b/mm/compaction.c
index d888fa248ebb..cc40c766de38 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -626,7 +626,7 @@ isolate_freepages_range(struct compact_control *cc,
static void acct_isolated(struct zone *zone, struct compact_control *cc)
{
struct page *page;
- unsigned int count[2] = { 0, };
+ unsigned int count[3] = { 0, };

if (list_empty(&cc->migratepages))
return;
@@ -636,21 +636,25 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)

mod_zone_page_state(zone, NR_ISOLATED_ANON, count[0]);
mod_zone_page_state(zone, NR_ISOLATED_FILE, count[1]);
+ mod_zone_page_state(zone, NR_ISOLATED_LZFREE, count[2]);
}

/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(struct zone *zone)
{
- unsigned long active, inactive, isolated;
+ unsigned long active, inactive, lzfree, isolated;

inactive = zone_page_state(zone, NR_INACTIVE_FILE) +
zone_page_state(zone, NR_INACTIVE_ANON);
active = zone_page_state(zone, NR_ACTIVE_FILE) +
zone_page_state(zone, NR_ACTIVE_ANON);
+ lzfree = zone_page_state(zone, NR_LZFREE);
+
isolated = zone_page_state(zone, NR_ISOLATED_FILE) +
- zone_page_state(zone, NR_ISOLATED_ANON);
+ zone_page_state(zone, NR_ISOLATED_ANON) +
+ zone_page_state(zone, NR_ISOLATED_LZFREE);

- return isolated > (inactive + active) / 2;
+ return isolated > (inactive + active + lzfree) / 2;
}

/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d020aec63717..6da441618548 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1470,8 +1470,7 @@ int madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
goto out;

page = pmd_page(orig_pmd);
- if (PageActive(page))
- deactivate_page(page);
+ add_page_to_lazyfree_list(page);

if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
@@ -1787,6 +1786,7 @@ static void __split_huge_page_refcount(struct page *page,
(1L << PG_mlocked) |
(1L << PG_uptodate) |
(1L << PG_active) |
+ (1L << PG_lazyfree) |
(1L << PG_unevictable) |
(1L << PG_dirty)));

diff --git a/mm/madvise.c b/mm/madvise.c
index 27ed057c0bd7..7c88c6cfe300 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -334,8 +334,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unlock_page(page);
}

- if (PageActive(page))
- deactivate_page(page);
+ add_page_to_lazyfree_list(page);

if (pte_young(ptent) || pte_dirty(ptent)) {
/*
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c57c4423c688..1dc599ce1bcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -109,6 +109,7 @@ static const char * const mem_cgroup_lru_names[] = {
"inactive_file",
"active_file",
"unevictable",
+ "lazyfree",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -1402,6 +1403,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
static bool test_mem_cgroup_node_reclaimable(struct mem_cgroup *memcg,
int nid, bool noswap)
{
+ if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_LZFREE))
+ return true;
if (mem_cgroup_node_nr_lru_pages(memcg, nid, LRU_ALL_FILE))
return true;
if (noswap || !total_swap_pages)
@@ -3120,6 +3123,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
{ "total", LRU_ALL },
{ "file", LRU_ALL_FILE },
{ "anon", LRU_ALL_ANON },
+ { "lazyfree", LRU_ALL_LZFREE },
{ "unevictable", BIT(LRU_UNEVICTABLE) },
};
const struct numa_stat *stat;
@@ -3231,8 +3235,8 @@ static int memcg_stat_show(struct seq_file *m, void *v)
int nid, zid;
struct mem_cgroup_per_zone *mz;
struct zone_reclaim_stat *rstat;
- unsigned long recent_rotated[2] = {0, 0};
- unsigned long recent_scanned[2] = {0, 0};
+ unsigned long recent_rotated[3] = {0, 0};
+ unsigned long recent_scanned[3] = {0, 0};

for_each_online_node(nid)
for (zid = 0; zid < MAX_NR_ZONES; zid++) {
@@ -3241,13 +3245,19 @@ static int memcg_stat_show(struct seq_file *m, void *v)

recent_rotated[0] += rstat->recent_rotated[0];
recent_rotated[1] += rstat->recent_rotated[1];
+ recent_rotated[2] += rstat->recent_rotated[2];
recent_scanned[0] += rstat->recent_scanned[0];
recent_scanned[1] += rstat->recent_scanned[1];
+ recent_scanned[2] += rstat->recent_scanned[2];
}
seq_printf(m, "recent_rotated_anon %lu\n", recent_rotated[0]);
seq_printf(m, "recent_rotated_file %lu\n", recent_rotated[1]);
+ seq_printf(m, "recent_rotated_lzfree %lu\n",
+ recent_rotated[2]);
seq_printf(m, "recent_scanned_anon %lu\n", recent_scanned[0]);
seq_printf(m, "recent_scanned_file %lu\n", recent_scanned[1]);
+ seq_printf(m, "recent_scanned_lzfree %lu\n",
+ recent_scanned[2]);
}
#endif

diff --git a/mm/migrate.c b/mm/migrate.c
index 87ebf0833b84..945e5655cd69 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -508,6 +508,8 @@ void migrate_page_copy(struct page *newpage, struct page *page)
SetPageChecked(newpage);
if (PageMappedToDisk(page))
SetPageMappedToDisk(newpage);
+ if (PageLazyFree(page))
+ SetPageLazyFree(newpage);

if (PageDirty(page)) {
clear_page_dirty_for_io(page);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 48aaf7b9f253..5d0321c3bc82 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3712,6 +3712,7 @@ void show_free_areas(unsigned int filter)

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
+ " lazy_free:%lu isolated_lazyfree:%lu\n"
" unevictable:%lu dirty:%lu writeback:%lu unstable:%lu\n"
" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
@@ -3722,6 +3723,8 @@ void show_free_areas(unsigned int filter)
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_FILE),
global_page_state(NR_ISOLATED_FILE),
+ global_page_state(NR_LZFREE),
+ global_page_state(NR_ISOLATED_LZFREE),
global_page_state(NR_UNEVICTABLE),
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
diff --git a/mm/rmap.c b/mm/rmap.c
index 9449e91839ab..75bd68bc8abc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1374,10 +1374,17 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
swp_entry_t entry = { .val = page_private(page) };
pte_t swp_pte;

- if (!PageDirty(page) && (flags & TTU_FREE)) {
- /* It's a freeable page by MADV_FREE */
- dec_mm_counter(mm, MM_ANONPAGES);
- goto discard;
+ if ((flags & TTU_LZFREE)) {
+ VM_BUG_ON_PAGE(!PageLazyFree(page), page);
+ if (!PageDirty(page)) {
+ /* It's a freeable page by MADV_FREE */
+ dec_mm_counter(mm, MM_ANONPAGES);
+ goto discard;
+ } else {
+ set_pte_at(mm, address, pte, pteval);
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
}

if (PageSwapCache(page)) {
diff --git a/mm/swap.c b/mm/swap.c
index 367940d093ad..11c1eb147fd4 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -45,6 +45,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);

/*
* This path almost never happens for VM activity - pages are normally
@@ -507,6 +508,10 @@ static void __activate_page(struct page *page, struct lruvec *lruvec,

del_page_from_lru_list(page, lruvec, lru);
SetPageActive(page);
+ if (lru == LRU_LZFREE) {
+ ClearPageLazyFree(page);
+ lru = LRU_INACTIVE_ANON;
+ }
lru += LRU_ACTIVE;
add_page_to_lru_list(page, lruvec, lru);
trace_mm_lru_activate(page);
@@ -767,6 +772,9 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
active = PageActive(page);
lru = page_lru_base_type(page);

+ if (lru == LRU_LZFREE)
+ return;
+
if (!file && !active)
return;

@@ -803,6 +811,29 @@ static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, lru_index(lru), 0);
}

+static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ VM_BUG_ON_PAGE(!PageAnon(page), page);
+
+ if (PageLRU(page) && !PageLazyFree(page) &&
+ !PageUnevictable(page)) {
+ unsigned int nr_pages = 1;
+ bool active = PageActive(page);
+
+ del_page_from_lru_list(page, lruvec,
+ LRU_INACTIVE_ANON + active);
+ ClearPageActive(page);
+ SetPageLazyFree(page);
+ add_page_to_lru_list(page, lruvec, LRU_LZFREE);
+
+ if (PageTransHuge(page))
+ nr_pages = HPAGE_PMD_NR;
+ count_vm_events(PGLZFREE, nr_pages);
+ update_page_reclaim_stat(lruvec, 2, 0);
+ }
+}
+
/*
* Drain pages out of the cpu's pagevecs.
* Either "cpu" is the current CPU, and preemption has already been
@@ -829,9 +860,25 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);

+ pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+
activate_page_drain(cpu);
}

+void add_page_to_lazyfree_list(struct page *page)
+{
+ if (PageLRU(page) && !PageLazyFree(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_lazyfree_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
+ put_cpu_var(lru_lazyfree_pvecs);
+ }
+}
+
/**
* deactivate_page - forcefully deactivate a page
* @page: page to deactivate
@@ -890,6 +937,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
schedule_work_on(cpu, work);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f731084c3a23..3a7d57cbceb3 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -197,7 +197,8 @@ static unsigned long zone_reclaimable_pages(struct zone *zone)
int nr;

nr = zone_page_state(zone, NR_ACTIVE_FILE) +
- zone_page_state(zone, NR_INACTIVE_FILE);
+ zone_page_state(zone, NR_INACTIVE_FILE) +
+ zone_page_state(zone, NR_LZFREE);

if (get_nr_swap_pages() > 0)
nr += zone_page_state(zone, NR_ACTIVE_ANON) +
@@ -918,6 +919,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,

VM_BUG_ON_PAGE(PageActive(page), page);
VM_BUG_ON_PAGE(page_zone(page) != zone, page);
+ VM_BUG_ON_PAGE((ttu_flags & TTU_LZFREE) &&
+ !PageLazyFree(page), page);

sc->nr_scanned++;

@@ -1050,7 +1053,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
goto keep_locked;
if (!add_to_swap(page, page_list))
goto activate_locked;
- freeable = true;
+ if (ttu_flags & TTU_LZFREE) {
+ freeable = true;
+ } else {
+ /*
+ * anon-LRU list can have !PG_dirty &&
+ * !PG_swapcache && clean pte until
+ * lru_lazyfree_pvec is flushed.
+ */
+ SetPageDirty(page);
+ }
may_enter_fs = 1;

/* Adding to swap updated mapping */
@@ -1063,8 +1075,9 @@ static unsigned long shrink_page_list(struct list_head *page_list,
*/
if (page_mapped(page) && mapping) {
switch (try_to_unmap(page, freeable ?
- (ttu_flags | TTU_BATCH_FLUSH | TTU_FREE) :
- (ttu_flags | TTU_BATCH_FLUSH))) {
+ (ttu_flags | TTU_BATCH_FLUSH) :
+ ((ttu_flags & ~TTU_LZFREE) |
+ TTU_BATCH_FLUSH))) {
case SWAP_FAIL:
goto activate_locked;
case SWAP_AGAIN:
@@ -1190,7 +1203,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
if (freeable && !PageDirty(page))
- count_vm_event(PGLAZYFREED);
+ count_vm_event(PGLZFREED);

nr_reclaimed++;

@@ -1458,7 +1471,7 @@ int isolate_lru_page(struct page *page)
* the LRU list will go small and be scanned faster than necessary, leading to
* unnecessary swapping, thrashing and OOM.
*/
-static int too_many_isolated(struct zone *zone, int file,
+static int too_many_isolated(struct zone *zone, int lru_index,
struct scan_control *sc)
{
unsigned long inactive, isolated;
@@ -1469,12 +1482,21 @@ static int too_many_isolated(struct zone *zone, int file,
if (!sane_reclaim(sc))
return 0;

- if (file) {
- inactive = zone_page_state(zone, NR_INACTIVE_FILE);
- isolated = zone_page_state(zone, NR_ISOLATED_FILE);
- } else {
+ switch (lru_index) {
+ case 0:
inactive = zone_page_state(zone, NR_INACTIVE_ANON);
isolated = zone_page_state(zone, NR_ISOLATED_ANON);
+ break;
+ case 1:
+ inactive = zone_page_state(zone, NR_INACTIVE_FILE);
+ isolated = zone_page_state(zone, NR_ISOLATED_FILE);
+ break;
+ case 2:
+ inactive = zone_page_state(zone, NR_LZFREE);
+ isolated = zone_page_state(zone, NR_ISOLATED_LZFREE);
+ break;
+ default:
+ BUG();
}

/*
@@ -1515,6 +1537,10 @@ putback_inactive_pages(struct lruvec *lruvec, struct list_head *page_list)

SetPageLRU(page);
lru = page_lru(page);
+ if (lru == LRU_LZFREE + LRU_ACTIVE) {
+ ClearPageLazyFree(page);
+ lru = LRU_ACTIVE_ANON;
+ }
add_page_to_lru_list(page, lruvec, lru);

if (is_active_lru(lru)) {
@@ -1578,7 +1604,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
struct zone *zone = lruvec_zone(lruvec);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;

- while (unlikely(too_many_isolated(zone, file, sc))) {
+ while (unlikely(too_many_isolated(zone, lru_index(lru), sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);

/* We are about to die and free our memory. Return now. */
@@ -1613,7 +1639,10 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
if (nr_taken == 0)
return 0;

- nr_reclaimed = shrink_page_list(&page_list, zone, sc, TTU_UNMAP,
+ nr_reclaimed = shrink_page_list(&page_list, zone, sc,
+ (lru != LRU_LZFREE) ?
+ TTU_UNMAP :
+ TTU_UNMAP|TTU_LZFREE,
&nr_dirty, &nr_unqueued_dirty, &nr_congested,
&nr_writeback, &nr_immediate,
false);
@@ -1701,7 +1730,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
zone_idx(zone),
nr_scanned, nr_reclaimed,
sc->priority,
- trace_shrink_flags(lru));
+ trace_shrink_flags(lru_index(lru)));
return nr_reclaimed;
}

@@ -2194,6 +2223,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
+ unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
@@ -2204,6 +2234,7 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
+ nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2221,6 +2252,19 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

+ while (nr_to_scan_lzfree) {
+ nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
+ nr_to_scan_lzfree -= nr_to_scan;
+
+ nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
+ sc, LRU_LZFREE);
+ }
+
+ if (nr_reclaimed >= nr_to_reclaim) {
+ sc->nr_reclaimed += nr_reclaimed;
+ return;
+ }
+
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
nr[LRU_INACTIVE_FILE]) {
@@ -2364,6 +2408,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
*/
pages_for_compaction = (2UL << sc->order);
inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
+ inactive_lru_pages += zone_page_state(zone, NR_LZFREE);
if (get_nr_swap_pages() > 0)
inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
if (sc->nr_reclaimed < pages_for_compaction &&
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 59d45b22355f..df95d9473bba 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -704,6 +704,7 @@ const char * const vmstat_text[] = {
"nr_inactive_file",
"nr_active_file",
"nr_unevictable",
+ "nr_lazyfree",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
@@ -721,6 +722,7 @@ const char * const vmstat_text[] = {
"nr_writeback_temp",
"nr_isolated_anon",
"nr_isolated_file",
+ "nr_isolated_lazyfree",
"nr_shmem",
"nr_dirtied",
"nr_written",
@@ -756,6 +758,7 @@ const char * const vmstat_text[] = {
"pgfree",
"pgactivate",
"pgdeactivate",
+ "pglazyfree",

"pgfault",
"pgmajfault",
--
1.9.1

2015-11-12 04:32:55

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 16/17] mm: support MADV_FREE on swapless system

Historically, we have completely disabled reclaiming of anonymous
pages on systems that are swapped off or configured without swap.
That made sense, but the problem for lazyfree pages is that we never
got a chance to discard MADV_FREE hinted pages in the reclaim path on
those systems.

That's why the current MADV_FREE implementation drops pages
instantly, like MADV_DONTNEED, on a swapless system, so users on
those systems couldn't get the benefit of MADV_FREE.

Now we have the lazyfree LRU list to keep MADV_FREEed pages, so we
can scan and discard them on a swapless system without relying on the
anonymous LRU list.

Signed-off-by: Minchan Kim <[email protected]>
---
mm/madvise.c | 7 +------
mm/swap_state.c | 6 ------
mm/vmscan.c | 37 +++++++++++++++++++++++++++----------
3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 7c88c6cfe300..3a4c3f7efe20 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -547,12 +547,7 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_FREE:
- /*
- * XXX: In this implementation, MADV_FREE works like
- * MADV_DONTNEED on swapless system or full swap.
- */
- if (get_nr_swap_pages() > 0)
- return madvise_free(vma, prev, start, end);
+ return madvise_free(vma, prev, start, end);
/* passthrough */
case MADV_DONTNEED:
return madvise_dontneed(vma, prev, start, end);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 10f63eded7b7..49c683b02ee4 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -170,12 +170,6 @@ int add_to_swap(struct page *page, struct list_head *list)
if (!entry.val)
return 0;

- if (unlikely(PageTransHuge(page)))
- if (unlikely(split_huge_page_to_list(page, list))) {
- swapcache_free(entry);
- return 0;
- }
-
/*
* Radix-tree node allocations from PF_MEMALLOC contexts could
* completely exhaust the page allocator. __GFP_NOMEMALLOC
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3a7d57cbceb3..cd65db9d3004 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -611,13 +611,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
bool reclaimed)
{
unsigned long flags;
- struct mem_cgroup *memcg;
+ struct mem_cgroup *memcg = NULL;
+ int expected = mapping ? 2 : 1;

BUG_ON(!PageLocked(page));
BUG_ON(mapping != page_mapping(page));
+ VM_BUG_ON_PAGE(mapping == NULL && !PageLazyFree(page), page);
+
+ if (mapping) {
+ memcg = mem_cgroup_begin_page_stat(page);
+ spin_lock_irqsave(&mapping->tree_lock, flags);
+ }

- memcg = mem_cgroup_begin_page_stat(page);
- spin_lock_irqsave(&mapping->tree_lock, flags);
/*
* The non racy check for a busy page.
*
@@ -643,14 +648,18 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
* Note that if SetPageDirty is always performed via set_page_dirty,
* and thus under tree_lock, then this ordering is not required.
*/
- if (!page_freeze_refs(page, 2))
+ if (!page_freeze_refs(page, expected))
goto cannot_free;
/* note: atomic_cmpxchg in page_freeze_refs provides the smp_rmb */
if (unlikely(PageDirty(page))) {
- page_unfreeze_refs(page, 2);
+ page_unfreeze_refs(page, expected);
goto cannot_free;
}

+ /* No more work to do with backing store */
+ if (!mapping)
+ return 1;
+
if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
mem_cgroup_swapout(page, swap);
@@ -687,8 +696,10 @@ static int __remove_mapping(struct address_space *mapping, struct page *page,
return 1;

cannot_free:
- spin_unlock_irqrestore(&mapping->tree_lock, flags);
- mem_cgroup_end_page_stat(memcg);
+ if (mapping) {
+ spin_unlock_irqrestore(&mapping->tree_lock, flags);
+ mem_cgroup_end_page_stat(memcg);
+ }
return 0;
}

@@ -1051,7 +1062,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
if (PageAnon(page) && !PageSwapCache(page)) {
if (!(sc->gfp_mask & __GFP_IO))
goto keep_locked;
- if (!add_to_swap(page, page_list))
+ if (unlikely(PageTransHuge(page)) &&
+ unlikely(split_huge_page_to_list(page,
+ page_list)))
+ goto activate_locked;
+ if (total_swap_pages &&
+ !add_to_swap(page, page_list))
goto activate_locked;
if (ttu_flags & TTU_LZFREE) {
freeable = true;
@@ -1073,7 +1089,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
* The page is mapped into the page tables of one or more
* processes. Try to unmap it here.
*/
- if (page_mapped(page) && mapping) {
+ if (page_mapped(page) && (mapping || freeable)) {
switch (try_to_unmap(page, freeable ?
(ttu_flags | TTU_BATCH_FLUSH) :
((ttu_flags & ~TTU_LZFREE) |
@@ -1190,7 +1206,8 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (!mapping || !__remove_mapping(mapping, page, true))
+ if ((!mapping && !freeable) ||
+ !__remove_mapping(mapping, page, true))
goto keep_locked;

/*
--
1.9.1

2015-11-12 04:33:47

by Minchan Kim

[permalink] [raw]
Subject: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

The hotness of MADV_FREEed pages is very arguable: some think they
are hot while others think they are cold.

Quote from Shaohua
"
My main concern is the policy how we should treat the FREE pages. Moving it to
inactive lru is definitely a good start, I'm wondering if it's enough. The
MADV_FREE increases memory pressure and cause unnecessary reclaim because of
the lazy memory free. While MADV_FREE is intended to be a better replacement of
MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
memory immediately. So I hope the MADV_FREE doesn't have impact on memory
pressure too. I'm thinking of adding an extra lru list and watermark for this
to make sure FREE pages can be freed before system wide page reclaim. As you
said, this is arguable, but I hope we can discuss about this issue more.
"

Quote from me
"
It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
but it's not true because the page would be dirty state when VM want to reclaim.

I'm also against with your's suggestion which let's discard FREEed page before
system wide page reclaim because system would have lots of clean cold page
caches or anonymous pages. In such case, reclaiming of them would be better.
Yeb, it's really workload-dependent so we might need some heuristic which is
normally what we want to avoid.

Having said that, I agree with you we could do better than the deactivation
and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
"ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
When the MADV_FREE is called, we could move hinted pages from anon-LRU to
ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
it could promote it to active-anon-LRU which would be very natural aging
concept because it means someone touches the page recently.
With that, I don't want to bias one side and don't want to add some knob for
tuning the heuristic but let's rely on common fair aging scheme of VM.
"

Quote from Johannes
"
thread 1:
Even if we're wrong about the aging of those MADV_FREE pages, their
contents are invalidated; they can be discarded freely, and restoring
them is a mere GFP_ZERO allocation. All other anonymous pages have to
be written to disk, and potentially be read back.

[ Arguably, MADV_FREE pages should even be reclaimed before inactive
page cache. It's the same cost to discard both types of pages, but
restoring page cache involves IO. ]

It probably makes sense to stop thinking about them as anonymous pages
entirely at this point when it comes to aging. They're really not. The
LRU lists are split to differentiate access patterns and cost of page
stealing (and restoring). From that angle, MADV_FREE pages really have
nothing in common with in-use anonymous pages, and so they shouldn't
be on the same LRU list.

thread:2
What about them is hot? They contain garbage, you have to write to
them before you can use them. Granted, you might have to refetch
cachelines if you don't do cacheline-aligned populating writes, but
you can do a lot of them before it's more expensive than doing IO.

"

Quote from Daniel
"
thread:1
Keep in mind that this is memory the kernel wouldn't be getting back at
all if the allocator wasn't going out of the way to purge it, and they
aren't going to go out of their way to purge it if it means the kernel
is going to steal the pages when there isn't actually memory pressure.

An allocator would be using MADV_DONTNEED if it didn't expect that the
pages were going to be used again shortly. MADV_FREE indicates that it
has time to inform the kernel that they're unused but they could still
be very hot.

thread:2
It's hot because applications churn through memory via the allocator.

Drop the pages and the application is now churning through page faults
and zeroing rather than simply reusing memory. It's not something that
may happen, it *will* happen. A page in the page cache *may* be reused,
but often won't be, especially when the I/O patterns don't line up well
with the way it works.

The whole point of the feature is not requiring the allocator to have
elaborate mechanisms for aging pages and throttling purging. That ends
up resulting in lots of memory held by userspace where the kernel can't
reclaim it under memory pressure. If it's dropped before page cache, it
isn't going to be able to replace any of that logic in allocators.

The page cache is speculative. Page caching by allocators is not really
speculative. Using MADV_FREE on the pages at all is speculative. The
memory is probably going to be reused fairly soon (unless the process
exits, and then it doesn't matter), but purging will end up reducing
memory usage for the portions that aren't.

It would be a different story for a full unpinning/pinning feature since
that would have other use cases (speculative caches), but this is really
only useful in allocators.
"
You can read the whole thread at https://lkml.org/lkml/2015/11/4/51

Since the issue is arguable and there is no single right answer, I
think we should provide a knob, "lazyfreeness" (I hope someone
suggests a better name).

It is similar to swappiness: higher values discard MADV_FREE pages
more aggressively. If memory pressure happens while the system is
still reclaiming at DEF_PRIORITY (e.g. there are plenty of clean cold
caches), the VM doesn't discard any hinted pages until the scanning
priority is raised.

If memory pressure is higher (i.e. the priority has dropped below
DEF_PRIORITY), it scans

nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness(default: 20) / 50

pages from the lazyfree LRU.

If the system has little free memory and file cache left, it starts
to discard MADV_FREEed pages unconditionally, even if the user set
lazyfreeness to 0.
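
To make the arithmetic concrete, below is a small standalone C model of
the lazyfree scan target this patch computes in get_scan_count(). It is
a simplified sketch, not kernel code: the helper name and the sample
numbers in main() are made up, and DEF_PRIORITY is 12 as in the kernel.

#include <stdio.h>

#define DEF_PRIORITY	12

/* Simplified model of the lazyfree part of get_scan_count(). */
static unsigned long lzfree_scan_target(unsigned long nr_to_reclaim,
					int priority, int lazyfreeness,
					unsigned long nr_lzfree,
					unsigned long zone_free,
					unsigned long zone_file,
					unsigned long high_wmark)
{
	unsigned long scan;

	/* Scales with pressure; zero while we still scan at DEF_PRIORITY. */
	scan = nr_to_reclaim * (DEF_PRIORITY - priority);
	scan = scan * lazyfreeness / 50;

	/* Last resort: almost no free memory and file cache left. */
	if (!scan && zone_file + zone_free <= high_wmark)
		scan = nr_lzfree >> priority;

	return scan < nr_lzfree ? scan : nr_lzfree;
}

int main(void)
{
	/* e.g. 32 pages to reclaim at priority 10 with the default knob of 20 */
	printf("scan %lu lazyfree pages\n",
	       lzfree_scan_target(32, 10, 20, 1024, 4096, 8192, 2048));
	return 0;
}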

Signed-off-by: Minchan Kim <[email protected]>
---
Documentation/sysctl/vm.txt | 13 +++++++++
drivers/base/node.c | 4 +--
fs/proc/meminfo.c | 4 +--
include/linux/memcontrol.h | 1 +
include/linux/mmzone.h | 9 +++---
include/linux/swap.h | 15 ++++++++++
kernel/sysctl.c | 9 ++++++
mm/memcontrol.c | 32 +++++++++++++++++++++-
mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
mm/vmstat.c | 2 +-
10 files changed, 121 insertions(+), 35 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index a4482fceacec..c1dc63381f2c 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -56,6 +56,7 @@ files can be found in mm/swap.c.
- percpu_pagelist_fraction
- stat_interval
- swappiness
+- lazyfreeness
- user_reserve_kbytes
- vfs_cache_pressure
- zone_reclaim_mode
@@ -737,6 +738,18 @@ The default value is 60.

==============================================================

+lazyfreeness
+
+This control is used to define how aggressive the kernel will discard
+MADV_FREE hinted pages. Higher values will increase agressiveness,
+lower values decrease the amount of discarding. A value of 0 instructs
+the kernel not to initiate discarding until the amount of free and
+file-backed pages is less than the high water mark in a zone.
+
+The default value is 20.
+
+==============================================================
+
- user_reserve_kbytes

When overcommit_memory is set to 2, "never overcommit" mode, reserve
diff --git a/drivers/base/node.c b/drivers/base/node.c
index f7a1f2107b43..3b0bf1b78b2e 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
"Node %d Inactive(anon): %8lu kB\n"
"Node %d Active(file): %8lu kB\n"
"Node %d Inactive(file): %8lu kB\n"
- "Node %d Unevictable: %8lu kB\n"
"Node %d LazyFree: %8lu kB\n"
+ "Node %d Unevictable: %8lu kB\n"
"Node %d Mlocked: %8lu kB\n",
nid, K(i.totalram),
nid, K(i.freeram),
@@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
- nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_LZFREE)),
+ nid, K(node_page_state(nid, NR_UNEVICTABLE)),
nid, K(node_page_state(nid, NR_MLOCK)));

#ifdef CONFIG_HIGHMEM
diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
index 3444f7c4e0b6..f47e6a5aa2e5 100644
--- a/fs/proc/meminfo.c
+++ b/fs/proc/meminfo.c
@@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
"Inactive(anon): %8lu kB\n"
"Active(file): %8lu kB\n"
"Inactive(file): %8lu kB\n"
- "Unevictable: %8lu kB\n"
"LazyFree: %8lu kB\n"
+ "Unevictable: %8lu kB\n"
"Mlocked: %8lu kB\n"
#ifdef CONFIG_HIGHMEM
"HighTotal: %8lu kB\n"
@@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
K(pages[LRU_INACTIVE_ANON]),
K(pages[LRU_ACTIVE_FILE]),
K(pages[LRU_INACTIVE_FILE]),
- K(pages[LRU_UNEVICTABLE]),
K(pages[LRU_LZFREE]),
+ K(pages[LRU_UNEVICTABLE]),
K(global_page_state(NR_MLOCK)),
#ifdef CONFIG_HIGHMEM
K(i.totalhigh),
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 3e3318ddfc0e..5522ff733506 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -210,6 +210,7 @@ struct mem_cgroup {
int under_oom;

int swappiness;
+ int lzfreeness;
/* OOM-Killer disable */
int oom_kill_disable;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 1aaa436da0d5..cca514a9701d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -120,8 +120,8 @@ enum zone_stat_item {
NR_ACTIVE_ANON, /* " " " " " */
NR_INACTIVE_FILE, /* " " " " " */
NR_ACTIVE_FILE, /* " " " " " */
- NR_UNEVICTABLE, /* " " " " " */
NR_LZFREE, /* " " " " " */
+ NR_UNEVICTABLE, /* " " " " " */
NR_MLOCK, /* mlock()ed pages found and moved off LRU */
NR_ANON_PAGES, /* Mapped anonymous pages */
NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
@@ -179,14 +179,15 @@ enum lru_list {
LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
- LRU_UNEVICTABLE,
LRU_LZFREE,
+ LRU_UNEVICTABLE,
NR_LRU_LISTS
};

#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
-
-#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_anon_file_lru(lru) \
+ for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
+#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)

static inline int is_file_lru(enum lru_list lru)
{
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f0310eeab3ec..73bcdc9d0e88 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
unsigned long *nr_scanned);
extern unsigned long shrink_all_memory(unsigned long nr_pages);
extern int vm_swappiness;
+extern int vm_lazyfreeness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;

@@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
return memcg->swappiness;
}

+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
+{
+ /* root ? */
+ if (mem_cgroup_disabled() || !memcg->css.parent)
+ return vm_lazyfreeness;
+
+ return memcg->lzfreeness;
+}
+
#else
static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
{
return vm_swappiness;
}
+
+static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
+{
+ return vm_lazyfreeness;
+}
#endif
#ifdef CONFIG_MEMCG_SWAP
extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e69201d8094e..2496b10c08e9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+ {
+ .procname = "lazyfreeness",
+ .data = &vm_lazyfreeness,
+ .maxlen = sizeof(vm_lazyfreeness),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
#ifdef CONFIG_HUGETLB_PAGE
{
.procname = "nr_hugepages",
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1dc599ce1bcb..5bdbe2a20dc0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
"active_anon",
"inactive_file",
"active_file",
- "unevictable",
"lazyfree",
+ "unevictable",
};

#define THRESHOLDS_EVENTS_TARGET 128
@@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
return 0;
}

+static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ return mem_cgroup_lzfreeness(memcg);
+}
+
+static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ if (val > 100)
+ return -EINVAL;
+
+ if (css->parent)
+ memcg->lzfreeness = val;
+ else
+ vm_lazyfreeness = val;
+
+ return 0;
+}
+
static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
{
struct mem_cgroup_threshold_ary *t;
@@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
.write_u64 = mem_cgroup_swappiness_write,
},
{
+ .name = "lazyfreeness",
+ .read_u64 = mem_cgroup_lzfreeness_read,
+ .write_u64 = mem_cgroup_lzfreeness_write,
+ },
+ {
.name = "move_charge_at_immigrate",
.read_u64 = mem_cgroup_move_charge_read,
.write_u64 = mem_cgroup_move_charge_write,
@@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
memcg->use_hierarchy = parent->use_hierarchy;
memcg->oom_kill_disable = parent->oom_kill_disable;
memcg->swappiness = mem_cgroup_swappiness(parent);
+ memcg->lzfreeness = mem_cgroup_lzfreeness(parent);

if (parent->use_hierarchy) {
page_counter_init(&memcg->memory, &parent->memory);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd65db9d3004..f1abc8a6ca31 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -141,6 +141,10 @@ struct scan_control {
*/
int vm_swappiness = 60;
/*
+ * From 0 .. 100. Higher means more lazy freeing.
+ */
+int vm_lazyfreeness = 20;
+/*
* The total number of pages which are beyond the high watermark within all
* zones.
*/
@@ -2012,10 +2016,11 @@ enum scan_balance {
*
* nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
* nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
+ * nr[4] = lazy free pages to scan;
*/
static void get_scan_count(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *nr,
- unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *nr, unsigned long *lru_pages)
{
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
u64 fraction[2];
@@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
struct zone *zone = lruvec_zone(lruvec);
unsigned long anon_prio, file_prio;
enum scan_balance scan_balance;
- unsigned long anon, file;
+ unsigned long anon, file, lzfree;
bool force_scan = false;
unsigned long ap, fp;
enum lru_list lru;
bool some_scanned;
int pass;
+ unsigned long scan_lzfree = 0;

/*
* If the zone or memcg is small, nr[l] can be 0. This
@@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
/* Only use force_scan on second pass. */
for (pass = 0; !some_scanned && pass < 2; pass++) {
*lru_pages = 0;
- for_each_evictable_lru(lru) {
+ for_each_anon_file_lru(lru) {
int file = is_file_lru(lru);
unsigned long size;
unsigned long scan;
@@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
some_scanned |= !!scan;
}
}
+
+ lzfree = get_lru_size(lruvec, LRU_LZFREE);
+ if (lzfree) {
+ scan_lzfree = sc->nr_to_reclaim *
+ (DEF_PRIORITY - sc->priority);
+ scan_lzfree = div64_u64(scan_lzfree *
+ lzfreeness, 50);
+ if (!scan_lzfree) {
+ unsigned long zonefile, zonefree;
+
+ zonefree = zone_page_state(zone, NR_FREE_PAGES);
+ zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
+ zone_page_state(zone, NR_INACTIVE_FILE);
+ if (unlikely(zonefile + zonefree <=
+ high_wmark_pages(zone))) {
+ scan_lzfree = get_lru_size(lruvec,
+ LRU_LZFREE) >> sc->priority;
+ }
+ }
+ }
+
+ nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
}

#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
@@ -2235,23 +2263,22 @@ static inline void init_tlb_ubc(void)
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
static void shrink_lruvec(struct lruvec *lruvec, int swappiness,
- struct scan_control *sc, unsigned long *lru_pages)
+ int lzfreeness, struct scan_control *sc,
+ unsigned long *lru_pages)
{
unsigned long nr[NR_LRU_LISTS];
unsigned long targets[NR_LRU_LISTS];
unsigned long nr_to_scan;
- unsigned long nr_to_scan_lzfree;
enum lru_list lru;
unsigned long nr_reclaimed = 0;
unsigned long nr_to_reclaim = sc->nr_to_reclaim;
struct blk_plug plug;
bool scan_adjusted;

- get_scan_count(lruvec, swappiness, sc, nr, lru_pages);
+ get_scan_count(lruvec, swappiness, lzfreeness, sc, nr, lru_pages);

/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
- nr_to_scan_lzfree = get_lru_size(lruvec, LRU_LZFREE);

/*
* Global reclaiming within direct reclaim at DEF_PRIORITY is a normal
@@ -2269,22 +2296,9 @@ static void shrink_lruvec(struct lruvec *lruvec, int swappiness,

init_tlb_ubc();

- while (nr_to_scan_lzfree) {
- nr_to_scan = min(nr_to_scan_lzfree, SWAP_CLUSTER_MAX);
- nr_to_scan_lzfree -= nr_to_scan;
-
- nr_reclaimed += shrink_inactive_list(nr_to_scan, lruvec,
- sc, LRU_LZFREE);
- }
-
- if (nr_reclaimed >= nr_to_reclaim) {
- sc->nr_reclaimed += nr_reclaimed;
- return;
- }
-
blk_start_plug(&plug);
while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
- nr[LRU_INACTIVE_FILE]) {
+ nr[LRU_INACTIVE_FILE] || nr[LRU_LZFREE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;

@@ -2466,7 +2480,7 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
unsigned long lru_pages;
unsigned long scanned;
struct lruvec *lruvec;
- int swappiness;
+ int swappiness, lzfreeness;

if (mem_cgroup_low(root, memcg)) {
if (!sc->may_thrash)
@@ -2476,9 +2490,11 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,

lruvec = mem_cgroup_zone_lruvec(zone, memcg);
swappiness = mem_cgroup_swappiness(memcg);
+ lzfreeness = mem_cgroup_lzfreeness(memcg);
scanned = sc->nr_scanned;

- shrink_lruvec(lruvec, swappiness, sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness,
+ sc, &lru_pages);
zone_lru_pages += lru_pages;

if (memcg && is_classzone)
@@ -2944,6 +2960,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
};
struct lruvec *lruvec = mem_cgroup_zone_lruvec(zone, memcg);
int swappiness = mem_cgroup_swappiness(memcg);
+ int lzfreeness = mem_cgroup_lzfreeness(memcg);
unsigned long lru_pages;

sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
@@ -2960,7 +2977,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *memcg,
* will pick up pages from other mem cgroup's as well. We hack
* the priority and make it zero.
*/
- shrink_lruvec(lruvec, swappiness, &sc, &lru_pages);
+ shrink_lruvec(lruvec, swappiness, lzfreeness, &sc, &lru_pages);

trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);

diff --git a/mm/vmstat.c b/mm/vmstat.c
index df95d9473bba..43effd0374d9 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,8 +703,8 @@ const char * const vmstat_text[] = {
"nr_active_anon",
"nr_inactive_file",
"nr_active_file",
- "nr_unevictable",
"nr_lazyfree",
+ "nr_unevictable",
"nr_mlock",
"nr_anon_pages",
"nr_mapped",
--
1.9.1
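
For completeness, a small userspace sketch of reading and tuning the new knob
(a sketch only: it assumes the sysctl is exposed as /proc/sys/vm/lazyfreeness,
which is what the vm_table entry above implies, and the write needs root):

#include <stdio.h>

int main(void)
{
        int val = -1;
        FILE *f = fopen("/proc/sys/vm/lazyfreeness", "r");

        if (f) {
                if (fscanf(f, "%d", &val) == 1)
                        printf("current lazyfreeness: %d\n", val);
                fclose(f);
        }

        f = fopen("/proc/sys/vm/lazyfreeness", "w");
        if (!f) {
                perror("write lazyfreeness (needs root and the patch applied)");
                return 1;
        }
        fprintf(f, "%d\n", 40);  /* > 20 (default): discard hinted pages sooner */
        fclose(f);
        return 0;
}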

2015-11-12 04:50:21

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Wed, Nov 11, 2015 at 8:32 PM, Minchan Kim <[email protected]> wrote:
>
> Linux doesn't have the ability to free pages lazily, while other OSes have
> long supported this via madvise(MADV_FREE).
>
> The gain is clear: the kernel can discard freed pages rather than swapping
> them out or hitting OOM when memory pressure happens.


>
> When the madvise syscall is called, the VM clears the dirty bit of the ptes
> in the range. If memory pressure happens, the VM checks the dirty bit in the
> page table and, if it is still "clean", the page is a "lazyfree" page, so the
> VM can discard it instead of swapping it out. If there was a store to the
> page before the VM picked it for reclaim, the dirty bit is set, so the VM
> swaps the page out instead of discarding it.
>

I realize that this lends itself to an efficient implementation, but
it's certainly the case that the kernel *could* use the accessed bit
instead of the dirty bit to give more sensible user semantics, and the
semantics that rely on the dirty bit make me uncomfortable from an ABI
perspective.

I also think that the kernel should commit to either zeroing the page
or leaving it unchanged in response to MADV_FREE (even if the decision
of which to do is made later on). I think that your patch series does
this, but only after a few of the patches are applied (the swap entry
freeing), and I think that it should be a real guaranteed part of the
semantics and maybe have a test case.

--Andy
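
A minimal userspace sketch of the test case suggested above (not part of the
series; it assumes MADV_FREE is defined as 8, as patch 03 proposes, and leaves
the memory-pressure step as a comment):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* value proposed for all arches in this series */
#endif

int main(void)
{
        size_t len = 64 * 4096;
        unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        size_t i;

        if (p == MAP_FAILED)
                return 1;

        memset(p, 0xaa, len);                   /* old data */
        if (madvise(p, len, MADV_FREE))         /* hint: contents disposable */
                return 1;

        /* ... apply memory pressure here so some pages get discarded ... */

        for (i = 0; i < len; i++) {
                if (p[i] != 0xaa && p[i] != 0x00) {
                        fprintf(stderr, "byte %zu is neither old data nor zero\n", i);
                        return 1;
                }
        }
        printf("every byte read back as either the old data or zero\n");
        return 0;
}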

2015-11-12 05:21:38

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

> I also think that the kernel should commit to either zeroing the page
> or leaving it unchanged in response to MADV_FREE (even if the decision
> of which to do is made later on). I think that your patch series does
> this, but only after a few of the patches are applied (the swap entry
> freeing), and I think that it should be a real guaranteed part of the
> semantics and maybe have a test case.

This would be a good thing to test because it would be required to add
MADV_FREE_UNDO down the road. It would mean the same semantics as the
MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
value in that for the sake of migrating existing software too.

For one example, it could be dropped into Firefox:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp

And in Chromium:

https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc

Worth noting that both also support the API for pinning/unpinning that's
used by Android's ashmem too. Linux really needs a feature like this for
caches. Firefox simply doesn't drop the memory at all on Linux right now:

https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp

(Lock == pin, Unlock == unpin)

For reference:

https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx
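
As a rough sketch of the allocator-side usage being described (the fallback
path and the helper name are this example's assumptions, not something the
series defines):

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8     /* value proposed by this series */
#endif

/*
 * Purge a cached-but-unused range: prefer the lazy hint, so the pages can be
 * reused without a fault if nothing reclaims them in the meantime, and fall
 * back to MADV_DONTNEED on kernels that reject MADV_FREE.
 */
static int purge_unused(void *addr, size_t len)
{
        if (madvise(addr, len, MADV_FREE) == 0)
                return 0;
        if (errno == EINVAL)
                return madvise(addr, len, MADV_DONTNEED);
        return -1;
}

int main(void)
{
        size_t len = 16 * 4096;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;
        /* ... the allocator fills and later stops using this chunk ... */
        if (purge_unused(p, len))
                perror("purge_unused");
        return 0;
}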



2015-11-12 11:26:26

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 01:32:57PM +0900, Minchan Kim wrote:
> @@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
> return 0;
> }
>
> +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +
> +{
> + struct mmu_gather *tlb = walk->private;
> + struct mm_struct *mm = tlb->mm;
> + struct vm_area_struct *vma = walk->vma;
> + spinlock_t *ptl;
> + pte_t *pte, ptent;
> + struct page *page;
> +
> + split_huge_page_pmd(vma, addr, pmd);
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + arch_enter_lazy_mmu_mode();
> + for (; addr != end; pte++, addr += PAGE_SIZE) {
> + ptent = *pte;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page)
> + continue;
> +
> + if (PageSwapCache(page)) {

Could you put VM_BUG_ON_PAGE(PageTransCompound(page), page) here?
Just in case.

> + if (!trylock_page(page))
> + continue;
> +
> + if (!try_to_free_swap(page)) {
> + unlock_page(page);
> + continue;
> + }
> +
> + ClearPageDirty(page);
> + unlock_page(page);

Hm. Do we handle pages shared over fork() here?
Shouldn't we ignore pages with mapcount > 0?

> + }
> +
> + if (pte_young(ptent) || pte_dirty(ptent)) {
> + /*
> + * Some of architecture(ex, PPC) don't update TLB
> + * with set_pte_at and tlb_remove_tlb_entry so for
> + * the portability, remap the pte with old|clean
> + * after pte clearing.
> + */
> + ptent = ptep_get_and_clear_full(mm, addr, pte,
> + tlb->fullmm);
> +
> + ptent = pte_mkold(ptent);
> + ptent = pte_mkclean(ptent);
> + set_pte_at(mm, addr, pte, ptent);
> + tlb_remove_tlb_entry(tlb, pte, addr);
> + }
> + }
> +
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte - 1, ptl);
> + cond_resched();
> + return 0;
> +}
>

--
Kirill A. Shutemov

2015-11-12 11:28:03

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

On Thu, Nov 12, 2015 at 01:32:59PM +0900, Minchan Kim wrote:
> From: Chen Gang <[email protected]>
>
> For uapi, we should try to let all macros have the same value, and MADV_FREE
> was added to the main branch recently, so MADV_FREE needs to be redefined.
>
> At present, '8' can be shared by all architectures, so redefine it to '8'.

Why not fold the patch into the previous one?
--
Kirill A. Shutemov

2015-11-12 19:44:58

by Shaohua Li

[permalink] [raw]
Subject: Re: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> MADV_FREEed pages' hotness is very arguable.
> Some people think they're hot while others think they're cold.
>
> Quote from Shaohua
> "
> My main concern is the policy how we should treat the FREE pages. Moving it to
> inactive lru is definitionly a good start, I'm wondering if it's enough. The
> MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> the lazy memory free. While MADV_FREE is intended to be a better replacement of
> MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> pressure too. I'm thinking of adding an extra lru list and wartermark for this
> to make sure FREE pages can be freed before system wide page reclaim. As you
> said, this is arguable, but I hope we can discuss about this issue more.
> "
>
> Quote from me
> "
> It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
> But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
> but it's not true because the page would be dirty state when VM want to reclaim.
>
> I'm also against with your's suggestion which let's discard FREEed page before
> system wide page reclaim because system would have lots of clean cold page
> caches or anonymous pages. In such case, reclaiming of them would be better.
> Yeb, it's really workload-dependent so we might need some heuristic which is
> normally what we want to avoid.
>
> Having said that, I agree with you we could do better than the deactivation
> and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
> "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
> fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
> When the MADV_FREE is called, we could move hinted pages from anon-LRU to
> ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
> it could promote it to acive-anon-LRU which would be very natural aging
> concept because it mean someone touches the page recenlty.
> With that, I don't want to bias one side and don't want to add some knob for
> tuning the heuristic but let's rely on common fair aging scheme of VM.
> "
>
> Quote from Johannes
> "
> thread 1:
> Even if we're wrong about the aging of those MADV_FREE pages, their
> contents are invalidated; they can be discarded freely, and restoring
> them is a mere GFP_ZERO allocation. All other anonymous pages have to
> be written to disk, and potentially be read back.
>
> [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> page cache. It's the same cost to discard both types of pages, but
> restoring page cache involves IO. ]
>
> It probably makes sense to stop thinking about them as anonymous pages
> entirely at this point when it comes to aging. They're really not. The
> LRU lists are split to differentiate access patterns and cost of page
> stealing (and restoring). From that angle, MADV_FREE pages really have
> nothing in common with in-use anonymous pages, and so they shouldn't
> be on the same LRU list.
>
> thread:2
> What about them is hot? They contain garbage, you have to write to
> them before you can use them. Granted, you might have to refetch
> cachelines if you don't do cacheline-aligned populating writes, but
> you can do a lot of them before it's more expensive than doing IO.
>
> "
>
> Quote from Daniel
> "
> thread:1
> Keep in mind that this is memory the kernel wouldn't be getting back at
> all if the allocator wasn't going out of the way to purge it, and they
> aren't going to go out of their way to purge it if it means the kernel
> is going to steal the pages when there isn't actually memory pressure.
>
> An allocator would be using MADV_DONTNEED if it didn't expect that the
> pages were going to be used against shortly. MADV_FREE indicates that it
> has time to inform the kernel that they're unused but they could still
> be very hot.
>
> thread:2
> It's hot because applications churn through memory via the allocator.
>
> Drop the pages and the application is now churning through page faults
> and zeroing rather than simply reusing memory. It's not something that
> may happen, it *will* happen. A page in the page cache *may* be reused,
> but often won't be, especially when the I/O patterns don't line up well
> with the way it works.
>
> The whole point of the feature is not requiring the allocator to have
> elaborate mechanisms for aging pages and throttling purging. That ends
> up resulting in lots of memory held by userspace where the kernel can't
> reclaim it under memory pressure. If it's dropped before page cache, it
> isn't going to be able to replace any of that logic in allocators.
>
> The page cache is speculative. Page caching by allocators is not really
> speculative. Using MADV_FREE on the pages at all is speculative. The
> memory is probably going to be reused fairly soon (unless the process
> exits, and then it doesn't matter), but purging will end up reducing
> memory usage for the portions that aren't.
>
> It would be a different story for a full unpinning/pinning feature since
> that would have other use cases (speculative caches), but this is really
> only useful in allocators.
> "
> You could read all thread from https://lkml.org/lkml/2015/11/4/51
>
> Yeah, since the issue is arguable and there is no single decision, I think it
> means we should provide the knob "lazyfreeness" (I hope someone
> suggests a better name).
>
> It's similar to swappiness: higher values discard MADV_FREE
> pages more aggressively. If memory pressure happens and the system works at
> DEF_PRIORITY (e.g. plenty of clean cold caches), the VM doesn't discard any
> hinted pages until the scanning priority is increased.
>
> If memory pressure is higher (i.e. the priority is not DEF_PRIORITY),
> it scans
>
> nr_to_reclaim * priority * lazyfreeness (def: 20) / 50
>
> If the system is low on free memory and file cache, it starts to discard
> MADV_FREEed pages unconditionally even if the user set lazyfreeness to 0.
>
> Signed-off-by: Minchan Kim <[email protected]>
> ---
> Documentation/sysctl/vm.txt | 13 +++++++++
> drivers/base/node.c | 4 +--
> fs/proc/meminfo.c | 4 +--
> include/linux/memcontrol.h | 1 +
> include/linux/mmzone.h | 9 +++---
> include/linux/swap.h | 15 ++++++++++
> kernel/sysctl.c | 9 ++++++
> mm/memcontrol.c | 32 +++++++++++++++++++++-
> mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
> mm/vmstat.c | 2 +-
> 10 files changed, 121 insertions(+), 35 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index a4482fceacec..c1dc63381f2c 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> - percpu_pagelist_fraction
> - stat_interval
> - swappiness
> +- lazyfreeness
> - user_reserve_kbytes
> - vfs_cache_pressure
> - zone_reclaim_mode
> @@ -737,6 +738,18 @@ The default value is 60.
>
> ==============================================================
>
> +lazyfreeness
> +
> +This control is used to define how aggressive the kernel will discard
> +MADV_FREE hinted pages. Higher values will increase agressiveness,
> +lower values decrease the amount of discarding. A value of 0 instructs
> +the kernel not to initiate discarding until the amount of free and
> +file-backed pages is less than the high water mark in a zone.
> +
> +The default value is 20.
> +
> +==============================================================
> +
> - user_reserve_kbytes
>
> When overcommit_memory is set to 2, "never overcommit" mode, reserve
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index f7a1f2107b43..3b0bf1b78b2e 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> "Node %d Inactive(anon): %8lu kB\n"
> "Node %d Active(file): %8lu kB\n"
> "Node %d Inactive(file): %8lu kB\n"
> - "Node %d Unevictable: %8lu kB\n"
> "Node %d LazyFree: %8lu kB\n"
> + "Node %d Unevictable: %8lu kB\n"
> "Node %d Mlocked: %8lu kB\n",
> nid, K(i.totalram),
> nid, K(i.freeram),
> @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> - nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> nid, K(node_page_state(nid, NR_LZFREE)),
> + nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> nid, K(node_page_state(nid, NR_MLOCK)));
>
> #ifdef CONFIG_HIGHMEM
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 3444f7c4e0b6..f47e6a5aa2e5 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> "Inactive(anon): %8lu kB\n"
> "Active(file): %8lu kB\n"
> "Inactive(file): %8lu kB\n"
> - "Unevictable: %8lu kB\n"
> "LazyFree: %8lu kB\n"
> + "Unevictable: %8lu kB\n"
> "Mlocked: %8lu kB\n"
> #ifdef CONFIG_HIGHMEM
> "HighTotal: %8lu kB\n"
> @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> K(pages[LRU_INACTIVE_ANON]),
> K(pages[LRU_ACTIVE_FILE]),
> K(pages[LRU_INACTIVE_FILE]),
> - K(pages[LRU_UNEVICTABLE]),
> K(pages[LRU_LZFREE]),
> + K(pages[LRU_UNEVICTABLE]),
> K(global_page_state(NR_MLOCK)),
> #ifdef CONFIG_HIGHMEM
> K(i.totalhigh),
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 3e3318ddfc0e..5522ff733506 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -210,6 +210,7 @@ struct mem_cgroup {
> int under_oom;
>
> int swappiness;
> + int lzfreeness;
> /* OOM-Killer disable */
> int oom_kill_disable;
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 1aaa436da0d5..cca514a9701d 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -120,8 +120,8 @@ enum zone_stat_item {
> NR_ACTIVE_ANON, /* " " " " " */
> NR_INACTIVE_FILE, /* " " " " " */
> NR_ACTIVE_FILE, /* " " " " " */
> - NR_UNEVICTABLE, /* " " " " " */
> NR_LZFREE, /* " " " " " */
> + NR_UNEVICTABLE, /* " " " " " */
> NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> NR_ANON_PAGES, /* Mapped anonymous pages */
> NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> @@ -179,14 +179,15 @@ enum lru_list {
> LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> - LRU_UNEVICTABLE,
> LRU_LZFREE,
> + LRU_UNEVICTABLE,
> NR_LRU_LISTS
> };
>
> #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> -
> -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> +#define for_each_anon_file_lru(lru) \
> + for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
>
> static inline int is_file_lru(enum lru_list lru)
> {
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index f0310eeab3ec..73bcdc9d0e88 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> unsigned long *nr_scanned);
> extern unsigned long shrink_all_memory(unsigned long nr_pages);
> extern int vm_swappiness;
> +extern int vm_lazyfreeness;
> extern int remove_mapping(struct address_space *mapping, struct page *page);
> extern unsigned long vm_total_pages;
>
> @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> return memcg->swappiness;
> }
>
> +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> +{
> + /* root ? */
> + if (mem_cgroup_disabled() || !memcg->css.parent)
> + return vm_lazyfreeness;
> +
> + return memcg->lzfreeness;
> +}
> +
> #else
> static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> {
> return vm_swappiness;
> }
> +
> +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> +{
> + return vm_lazyfreeness;
> +}
> #endif
> #ifdef CONFIG_MEMCG_SWAP
> extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index e69201d8094e..2496b10c08e9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> .extra1 = &zero,
> .extra2 = &one_hundred,
> },
> + {
> + .procname = "lazyfreeness",
> + .data = &vm_lazyfreeness,
> + .maxlen = sizeof(vm_lazyfreeness),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one_hundred,
> + },
> #ifdef CONFIG_HUGETLB_PAGE
> {
> .procname = "nr_hugepages",
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 1dc599ce1bcb..5bdbe2a20dc0 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> "active_anon",
> "inactive_file",
> "active_file",
> - "unevictable",
> "lazyfree",
> + "unevictable",
> };
>
> #define THRESHOLDS_EVENTS_TARGET 128
> @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> return 0;
> }
>
> +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> + return mem_cgroup_lzfreeness(memcg);
> +}
> +
> +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> + struct cftype *cft, u64 val)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> +
> + if (val > 100)
> + return -EINVAL;
> +
> + if (css->parent)
> + memcg->lzfreeness = val;
> + else
> + vm_lazyfreeness = val;
> +
> + return 0;
> +}
> +
> static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> {
> struct mem_cgroup_threshold_ary *t;
> @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> .write_u64 = mem_cgroup_swappiness_write,
> },
> {
> + .name = "lazyfreeness",
> + .read_u64 = mem_cgroup_lzfreeness_read,
> + .write_u64 = mem_cgroup_lzfreeness_write,
> + },
> + {
> .name = "move_charge_at_immigrate",
> .read_u64 = mem_cgroup_move_charge_read,
> .write_u64 = mem_cgroup_move_charge_write,
> @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> memcg->use_hierarchy = parent->use_hierarchy;
> memcg->oom_kill_disable = parent->oom_kill_disable;
> memcg->swappiness = mem_cgroup_swappiness(parent);
> + memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
>
> if (parent->use_hierarchy) {
> page_counter_init(&memcg->memory, &parent->memory);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index cd65db9d3004..f1abc8a6ca31 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -141,6 +141,10 @@ struct scan_control {
> */
> int vm_swappiness = 60;
> /*
> + * From 0 .. 100. Higher means more lazy freeing.
> + */
> +int vm_lazyfreeness = 20;
> +/*
> * The total number of pages which are beyond the high watermark within all
> * zones.
> */
> @@ -2012,10 +2016,11 @@ enum scan_balance {
> *
> * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> + * nr[4] = lazy free pages to scan;
> */
> static void get_scan_count(struct lruvec *lruvec, int swappiness,
> - struct scan_control *sc, unsigned long *nr,
> - unsigned long *lru_pages)
> + int lzfreeness, struct scan_control *sc,
> + unsigned long *nr, unsigned long *lru_pages)
> {
> struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> u64 fraction[2];
> @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> struct zone *zone = lruvec_zone(lruvec);
> unsigned long anon_prio, file_prio;
> enum scan_balance scan_balance;
> - unsigned long anon, file;
> + unsigned long anon, file, lzfree;
> bool force_scan = false;
> unsigned long ap, fp;
> enum lru_list lru;
> bool some_scanned;
> int pass;
> + unsigned long scan_lzfree = 0;
>
> /*
> * If the zone or memcg is small, nr[l] can be 0. This
> @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> /* Only use force_scan on second pass. */
> for (pass = 0; !some_scanned && pass < 2; pass++) {
> *lru_pages = 0;
> - for_each_evictable_lru(lru) {
> + for_each_anon_file_lru(lru) {
> int file = is_file_lru(lru);
> unsigned long size;
> unsigned long scan;
> @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> some_scanned |= !!scan;
> }
> }
> +
> + lzfree = get_lru_size(lruvec, LRU_LZFREE);
> + if (lzfree) {
> + scan_lzfree = sc->nr_to_reclaim *
> + (DEF_PRIORITY - sc->priority);

scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?
> + scan_lzfree = div64_u64(scan_lzfree *
> + lzfreeness, 50);
> + if (!scan_lzfree) {
> + unsigned long zonefile, zonefree;
> +
> + zonefree = zone_page_state(zone, NR_FREE_PAGES);
> + zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> + zone_page_state(zone, NR_INACTIVE_FILE);
> + if (unlikely(zonefile + zonefree <=
> + high_wmark_pages(zone))) {
> + scan_lzfree = get_lru_size(lruvec,
> + LRU_LZFREE) >> sc->priority;
> + }
> + }
> + }
> +
> + nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> }

It looks like there is no setting to reclaim only lazyfree pages. Could we have
an option for this? It's legitimate not to want to trash the page cache because
of lazyfree memory.

Thanks,
Shaohua
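
To make the priority dependence concrete, here is a small sketch of the
arithmetic in the quoted hunk (DEF_PRIORITY is 12 in mainline; nr_to_reclaim =
32 and the default lazyfreeness = 20 are just example inputs; the real code
additionally caps the result at the lazyfree LRU size):

#include <stdio.h>

#define DEF_PRIORITY    12      /* as in the mainline VM */

/* nr[LRU_LZFREE] target, per the hunk in get_scan_count() */
static unsigned long scan_lzfree(unsigned long nr_to_reclaim,
                                 int priority, int lazyfreeness)
{
        return nr_to_reclaim * (DEF_PRIORITY - priority) * lazyfreeness / 50;
}

int main(void)
{
        int prio;

        for (prio = DEF_PRIORITY; prio >= 0; prio--)
                printf("priority %2d -> lazyfree scan target %lu\n",
                       prio, scan_lzfree(32, prio, 20));
        /*
         * priority 12 -> 0: nothing is scanned at DEF_PRIORITY, which is
         * exactly the case Shaohua asks about; only the low-free/low-file
         * fallback in the patch can kick in there.
         */
        return 0;
}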

2015-11-13 06:14:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
> > I also think that the kernel should commit to either zeroing the page
> > or leaving it unchanged in response to MADV_FREE (even if the decision
> > of which to do is made later on). I think that your patch series does
> > this, but only after a few of the patches are applied (the swap entry
> > freeing), and I think that it should be a real guaranteed part of the
> > semantics and maybe have a test case.
>
> This would be a good thing to test because it would be required to add
> MADV_FREE_UNDO down the road. It would mean the same semantics as the
> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
> value in that for the sake of migrating existing software too.

So, do you mean that we could implement MADV_FREE_UNDO with a "read"
operation (just access-bit marking) easily in the future?

If so, it would be a good reason to change MADV_FREE from the dirty bit to
the access bit. Okay, I will look at that.

>
> For one example, it could be dropped into Firefox:
>
> https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferWindows.cpp
>
> And in Chromium:
>
> https://code.google.com/p/chromium/codesearch#chromium/src/base/memory/discardable_shared_memory.cc
>
> Worth noting that both also support the API for pinning/unpinning that's
> used by Android's ashmem too. Linux really needs a feature like this for
> caches. Firefox simply doesn't drop the memory at all on Linux right now:
>
> https://dxr.mozilla.org/mozilla-central/source/memory/volatile/VolatileBufferFallback.cpp
>
> (Lock == pin, Unlock == unpin)
>
> For reference:
>
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa366887(v=vs.85).aspx
>

2015-11-13 06:17:03

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On 13/11/15 01:15 AM, Minchan Kim wrote:
> On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
>>> I also think that the kernel should commit to either zeroing the page
>>> or leaving it unchanged in response to MADV_FREE (even if the decision
>>> of which to do is made later on). I think that your patch series does
>>> this, but only after a few of the patches are applied (the swap entry
>>> freeing), and I think that it should be a real guaranteed part of the
>>> semantics and maybe have a test case.
>>
>> This would be a good thing to test because it would be required to add
>> MADV_FREE_UNDO down the road. It would mean the same semantics as the
>> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
>> value in that for the sake of migrating existing software too.
>
> So, do you mean that we could implement MADV_FREE_UNDO with a "read"
> operation (just access-bit marking) easily in the future?
>
> If so, it would be a good reason to change MADV_FREE from the dirty bit to
> the access bit. Okay, I will look at that.

I just meant testing that the data is either zero or the old data if
it's read before it's written to. Not having it stay around once there
is a read. Not sure if that's what Andy meant.



2015-11-13 06:16:51

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Thu, Nov 12, 2015 at 01:26:20PM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 12, 2015 at 01:32:57PM +0900, Minchan Kim wrote:
> > @@ -256,6 +260,125 @@ static long madvise_willneed(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > +static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +
> > +{
> > + struct mmu_gather *tlb = walk->private;
> > + struct mm_struct *mm = tlb->mm;
> > + struct vm_area_struct *vma = walk->vma;
> > + spinlock_t *ptl;
> > + pte_t *pte, ptent;
> > + struct page *page;
> > +
> > + split_huge_page_pmd(vma, addr, pmd);
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +
> > + pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > + arch_enter_lazy_mmu_mode();
> > + for (; addr != end; pte++, addr += PAGE_SIZE) {
> > + ptent = *pte;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
> > + if (PageSwapCache(page)) {
>
> Could you put VM_BUG_ON_PAGE(PageTransCompound(page), page) here?
> Just in case.

No problem.

>
> > + if (!trylock_page(page))
> > + continue;
> > +
> > + if (!try_to_free_swap(page)) {
> > + unlock_page(page);
> > + continue;
> > + }
> > +
> > + ClearPageDirty(page);
> > + unlock_page(page);
>
> Hm. Do we handle pages shared over fork() here?
> Shouldn't we ignore pages with mapcount > 0?

It was handled in a later patch for historical reasons, but it's better
to fold that patch into this one.

Thanks for review!

2015-11-13 06:18:23

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 03/17] arch: uapi: asm: mman.h: Let MADV_FREE have same value for all architectures

On Thu, Nov 12, 2015 at 01:27:53PM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 12, 2015 at 01:32:59PM +0900, Minchan Kim wrote:
> > From: Chen Gang <[email protected]>
> >
> > For uapi, we should try to let all macros have the same value, and MADV_FREE
> > was added to the main branch recently, so MADV_FREE needs to be redefined.
> >
> > At present, '8' can be shared by all architectures, so redefine it to '8'.
>
> Why not fold the patch into the previous one?

Because it was a little bit arguable at that time whether we could use the
number 8 for all arches. If that is settled now, I can simply drop this patch.

2015-11-13 06:19:41

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 17/17] mm: add knob to tune lazyfreeing

On Thu, Nov 12, 2015 at 11:44:53AM -0800, Shaohua Li wrote:
> On Thu, Nov 12, 2015 at 01:33:13PM +0900, Minchan Kim wrote:
> > MADV_FREEed pages' hotness is very arguable.
> > Some people think they're hot while others think they're cold.
> >
> > Quote from Shaohua
> > "
> > My main concern is the policy how we should treat the FREE pages. Moving it to
> > inactive lru is definitionly a good start, I'm wondering if it's enough. The
> > MADV_FREE increases memory pressure and cause unnecessary reclaim because of
> > the lazy memory free. While MADV_FREE is intended to be a better replacement of
> > MADV_DONTNEED, MADV_DONTNEED doesn't have the memory pressure issue as it free
> > memory immediately. So I hope the MADV_FREE doesn't have impact on memory
> > pressure too. I'm thinking of adding an extra lru list and wartermark for this
> > to make sure FREE pages can be freed before system wide page reclaim. As you
> > said, this is arguable, but I hope we can discuss about this issue more.
> > "
> >
> > Quote from me
> > "
> > It seems the divergence comes from MADV_FREE is *replacement* of MADV_DONTNEED.
> > But I don't think so. If we could discard MADV_FREEed page *anytime*, I agree
> > but it's not true because the page would be dirty state when VM want to reclaim.
> >
> > I'm also against with your's suggestion which let's discard FREEed page before
> > system wide page reclaim because system would have lots of clean cold page
> > caches or anonymous pages. In such case, reclaiming of them would be better.
> > Yeb, it's really workload-dependent so we might need some heuristic which is
> > normally what we want to avoid.
> >
> > Having said that, I agree with you we could do better than the deactivation
> > and frankly speaking, I'm thinking of another LRU list(e.g. tentatively named
> > "ezreclaim LRU list"). What I have in mind is to age (anon|file|ez)
> > fairly. IOW, I want to percolate ez-LRU list reclaiming into get_scan_count.
> > When the MADV_FREE is called, we could move hinted pages from anon-LRU to
> > ez-LRU and then If VM find to not be able to discard a page in ez-LRU,
> > it could promote it to acive-anon-LRU which would be very natural aging
> > concept because it mean someone touches the page recenlty.
> > With that, I don't want to bias one side and don't want to add some knob for
> > tuning the heuristic but let's rely on common fair aging scheme of VM.
> > "
> >
> > Quote from Johannes
> > "
> > thread 1:
> > Even if we're wrong about the aging of those MADV_FREE pages, their
> > contents are invalidated; they can be discarded freely, and restoring
> > them is a mere GFP_ZERO allocation. All other anonymous pages have to
> > be written to disk, and potentially be read back.
> >
> > [ Arguably, MADV_FREE pages should even be reclaimed before inactive
> > page cache. It's the same cost to discard both types of pages, but
> > restoring page cache involves IO. ]
> >
> > It probably makes sense to stop thinking about them as anonymous pages
> > entirely at this point when it comes to aging. They're really not. The
> > LRU lists are split to differentiate access patterns and cost of page
> > stealing (and restoring). From that angle, MADV_FREE pages really have
> > nothing in common with in-use anonymous pages, and so they shouldn't
> > be on the same LRU list.
> >
> > thread:2
> > What about them is hot? They contain garbage, you have to write to
> > them before you can use them. Granted, you might have to refetch
> > cachelines if you don't do cacheline-aligned populating writes, but
> > you can do a lot of them before it's more expensive than doing IO.
> >
> > "
> >
> > Quote from Daniel
> > "
> > thread:1
> > Keep in mind that this is memory the kernel wouldn't be getting back at
> > all if the allocator wasn't going out of the way to purge it, and they
> > aren't going to go out of their way to purge it if it means the kernel
> > is going to steal the pages when there isn't actually memory pressure.
> >
> > An allocator would be using MADV_DONTNEED if it didn't expect that the
> > pages were going to be used against shortly. MADV_FREE indicates that it
> > has time to inform the kernel that they're unused but they could still
> > be very hot.
> >
> > thread:2
> > It's hot because applications churn through memory via the allocator.
> >
> > Drop the pages and the application is now churning through page faults
> > and zeroing rather than simply reusing memory. It's not something that
> > may happen, it *will* happen. A page in the page cache *may* be reused,
> > but often won't be, especially when the I/O patterns don't line up well
> > with the way it works.
> >
> > The whole point of the feature is not requiring the allocator to have
> > elaborate mechanisms for aging pages and throttling purging. That ends
> > up resulting in lots of memory held by userspace where the kernel can't
> > reclaim it under memory pressure. If it's dropped before page cache, it
> > isn't going to be able to replace any of that logic in allocators.
> >
> > The page cache is speculative. Page caching by allocators is not really
> > speculative. Using MADV_FREE on the pages at all is speculative. The
> > memory is probably going to be reused fairly soon (unless the process
> > exits, and then it doesn't matter), but purging will end up reducing
> > memory usage for the portions that aren't.
> >
> > It would be a different story for a full unpinning/pinning feature since
> > that would have other use cases (speculative caches), but this is really
> > only useful in allocators.
> > "
> > You could read all thread from https://lkml.org/lkml/2015/11/4/51
> >
> > Yeah, since the issue is arguable and there is no single decision, I think it
> > means we should provide the knob "lazyfreeness" (I hope someone
> > suggests a better name).
> >
> > It's similar to swappiness: higher values discard MADV_FREE
> > pages more aggressively. If memory pressure happens and the system works at
> > DEF_PRIORITY (e.g. plenty of clean cold caches), the VM doesn't discard any
> > hinted pages until the scanning priority is increased.
> >
> > If memory pressure is higher (i.e. the priority is not DEF_PRIORITY),
> > it scans
> >
> > nr_to_reclaim * priority * lazyfreeness (def: 20) / 50
> >
> > If the system is low on free memory and file cache, it starts to discard
> > MADV_FREEed pages unconditionally even if the user set lazyfreeness to 0.
> >
> > Signed-off-by: Minchan Kim <[email protected]>
> > ---
> > Documentation/sysctl/vm.txt | 13 +++++++++
> > drivers/base/node.c | 4 +--
> > fs/proc/meminfo.c | 4 +--
> > include/linux/memcontrol.h | 1 +
> > include/linux/mmzone.h | 9 +++---
> > include/linux/swap.h | 15 ++++++++++
> > kernel/sysctl.c | 9 ++++++
> > mm/memcontrol.c | 32 +++++++++++++++++++++-
> > mm/vmscan.c | 67 ++++++++++++++++++++++++++++-----------------
> > mm/vmstat.c | 2 +-
> > 10 files changed, 121 insertions(+), 35 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index a4482fceacec..c1dc63381f2c 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -56,6 +56,7 @@ files can be found in mm/swap.c.
> > - percpu_pagelist_fraction
> > - stat_interval
> > - swappiness
> > +- lazyfreeness
> > - user_reserve_kbytes
> > - vfs_cache_pressure
> > - zone_reclaim_mode
> > @@ -737,6 +738,18 @@ The default value is 60.
> >
> > ==============================================================
> >
> > +lazyfreeness
> > +
> > +This control is used to define how aggressive the kernel will discard
> > +MADV_FREE hinted pages. Higher values will increase agressiveness,
> > +lower values decrease the amount of discarding. A value of 0 instructs
> > +the kernel not to initiate discarding until the amount of free and
> > +file-backed pages is less than the high water mark in a zone.
> > +
> > +The default value is 20.
> > +
> > +==============================================================
> > +
> > - user_reserve_kbytes
> >
> > When overcommit_memory is set to 2, "never overcommit" mode, reserve
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index f7a1f2107b43..3b0bf1b78b2e 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -69,8 +69,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > "Node %d Inactive(anon): %8lu kB\n"
> > "Node %d Active(file): %8lu kB\n"
> > "Node %d Inactive(file): %8lu kB\n"
> > - "Node %d Unevictable: %8lu kB\n"
> > "Node %d LazyFree: %8lu kB\n"
> > + "Node %d Unevictable: %8lu kB\n"
> > "Node %d Mlocked: %8lu kB\n",
> > nid, K(i.totalram),
> > nid, K(i.freeram),
> > @@ -83,8 +83,8 @@ static ssize_t node_read_meminfo(struct device *dev,
> > nid, K(node_page_state(nid, NR_INACTIVE_ANON)),
> > nid, K(node_page_state(nid, NR_ACTIVE_FILE)),
> > nid, K(node_page_state(nid, NR_INACTIVE_FILE)),
> > - nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_LZFREE)),
> > + nid, K(node_page_state(nid, NR_UNEVICTABLE)),
> > nid, K(node_page_state(nid, NR_MLOCK)));
> >
> > #ifdef CONFIG_HIGHMEM
> > diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> > index 3444f7c4e0b6..f47e6a5aa2e5 100644
> > --- a/fs/proc/meminfo.c
> > +++ b/fs/proc/meminfo.c
> > @@ -101,8 +101,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > "Inactive(anon): %8lu kB\n"
> > "Active(file): %8lu kB\n"
> > "Inactive(file): %8lu kB\n"
> > - "Unevictable: %8lu kB\n"
> > "LazyFree: %8lu kB\n"
> > + "Unevictable: %8lu kB\n"
> > "Mlocked: %8lu kB\n"
> > #ifdef CONFIG_HIGHMEM
> > "HighTotal: %8lu kB\n"
> > @@ -159,8 +159,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> > K(pages[LRU_INACTIVE_ANON]),
> > K(pages[LRU_ACTIVE_FILE]),
> > K(pages[LRU_INACTIVE_FILE]),
> > - K(pages[LRU_UNEVICTABLE]),
> > K(pages[LRU_LZFREE]),
> > + K(pages[LRU_UNEVICTABLE]),
> > K(global_page_state(NR_MLOCK)),
> > #ifdef CONFIG_HIGHMEM
> > K(i.totalhigh),
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 3e3318ddfc0e..5522ff733506 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -210,6 +210,7 @@ struct mem_cgroup {
> > int under_oom;
> >
> > int swappiness;
> > + int lzfreeness;
> > /* OOM-Killer disable */
> > int oom_kill_disable;
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 1aaa436da0d5..cca514a9701d 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -120,8 +120,8 @@ enum zone_stat_item {
> > NR_ACTIVE_ANON, /* " " " " " */
> > NR_INACTIVE_FILE, /* " " " " " */
> > NR_ACTIVE_FILE, /* " " " " " */
> > - NR_UNEVICTABLE, /* " " " " " */
> > NR_LZFREE, /* " " " " " */
> > + NR_UNEVICTABLE, /* " " " " " */
> > NR_MLOCK, /* mlock()ed pages found and moved off LRU */
> > NR_ANON_PAGES, /* Mapped anonymous pages */
> > NR_FILE_MAPPED, /* pagecache pages mapped into pagetables.
> > @@ -179,14 +179,15 @@ enum lru_list {
> > LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
> > LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
> > LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
> > - LRU_UNEVICTABLE,
> > LRU_LZFREE,
> > + LRU_UNEVICTABLE,
> > NR_LRU_LISTS
> > };
> >
> > #define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
> > -
> > -#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_anon_file_lru(lru) \
> > + for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
> > +#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_LZFREE; lru++)
> >
> > static inline int is_file_lru(enum lru_list lru)
> > {
> > diff --git a/include/linux/swap.h b/include/linux/swap.h
> > index f0310eeab3ec..73bcdc9d0e88 100644
> > --- a/include/linux/swap.h
> > +++ b/include/linux/swap.h
> > @@ -330,6 +330,7 @@ extern unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
> > unsigned long *nr_scanned);
> > extern unsigned long shrink_all_memory(unsigned long nr_pages);
> > extern int vm_swappiness;
> > +extern int vm_lazyfreeness;
> > extern int remove_mapping(struct address_space *mapping, struct page *page);
> > extern unsigned long vm_total_pages;
> >
> > @@ -361,11 +362,25 @@ static inline int mem_cgroup_swappiness(struct mem_cgroup *memcg)
> > return memcg->swappiness;
> > }
> >
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *memcg)
> > +{
> > + /* root ? */
> > + if (mem_cgroup_disabled() || !memcg->css.parent)
> > + return vm_lazyfreeness;
> > +
> > + return memcg->lzfreeness;
> > +}
> > +
> > #else
> > static inline int mem_cgroup_swappiness(struct mem_cgroup *mem)
> > {
> > return vm_swappiness;
> > }
> > +
> > +static inline int mem_cgroup_lzfreeness(struct mem_cgroup *mem)
> > +{
> > + return vm_lazyfreeness;
> > +}
> > #endif
> > #ifdef CONFIG_MEMCG_SWAP
> > extern void mem_cgroup_swapout(struct page *page, swp_entry_t entry);
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index e69201d8094e..2496b10c08e9 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1268,6 +1268,15 @@ static struct ctl_table vm_table[] = {
> > .extra1 = &zero,
> > .extra2 = &one_hundred,
> > },
> > + {
> > + .procname = "lazyfreeness",
> > + .data = &vm_lazyfreeness,
> > + .maxlen = sizeof(vm_lazyfreeness),
> > + .mode = 0644,
> > + .proc_handler = proc_dointvec_minmax,
> > + .extra1 = &zero,
> > + .extra2 = &one_hundred,
> > + },
> > #ifdef CONFIG_HUGETLB_PAGE
> > {
> > .procname = "nr_hugepages",
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 1dc599ce1bcb..5bdbe2a20dc0 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -108,8 +108,8 @@ static const char * const mem_cgroup_lru_names[] = {
> > "active_anon",
> > "inactive_file",
> > "active_file",
> > - "unevictable",
> > "lazyfree",
> > + "unevictable",
> > };
> >
> > #define THRESHOLDS_EVENTS_TARGET 128
> > @@ -3288,6 +3288,30 @@ static int mem_cgroup_swappiness_write(struct cgroup_subsys_state *css,
> > return 0;
> > }
> >
> > +static u64 mem_cgroup_lzfreeness_read(struct cgroup_subsys_state *css,
> > + struct cftype *cft)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + return mem_cgroup_lzfreeness(memcg);
> > +}
> > +
> > +static int mem_cgroup_lzfreeness_write(struct cgroup_subsys_state *css,
> > + struct cftype *cft, u64 val)
> > +{
> > + struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +
> > + if (val > 100)
> > + return -EINVAL;
> > +
> > + if (css->parent)
> > + memcg->lzfreeness = val;
> > + else
> > + vm_lazyfreeness = val;
> > +
> > + return 0;
> > +}
> > +
> > static void __mem_cgroup_threshold(struct mem_cgroup *memcg, bool swap)
> > {
> > struct mem_cgroup_threshold_ary *t;
> > @@ -4085,6 +4109,11 @@ static struct cftype mem_cgroup_legacy_files[] = {
> > .write_u64 = mem_cgroup_swappiness_write,
> > },
> > {
> > + .name = "lazyfreeness",
> > + .read_u64 = mem_cgroup_lzfreeness_read,
> > + .write_u64 = mem_cgroup_lzfreeness_write,
> > + },
> > + {
> > .name = "move_charge_at_immigrate",
> > .read_u64 = mem_cgroup_move_charge_read,
> > .write_u64 = mem_cgroup_move_charge_write,
> > @@ -4305,6 +4334,7 @@ mem_cgroup_css_online(struct cgroup_subsys_state *css)
> > memcg->use_hierarchy = parent->use_hierarchy;
> > memcg->oom_kill_disable = parent->oom_kill_disable;
> > memcg->swappiness = mem_cgroup_swappiness(parent);
> > + memcg->lzfreeness = mem_cgroup_lzfreeness(parent);
> >
> > if (parent->use_hierarchy) {
> > page_counter_init(&memcg->memory, &parent->memory);
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index cd65db9d3004..f1abc8a6ca31 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -141,6 +141,10 @@ struct scan_control {
> > */
> > int vm_swappiness = 60;
> > /*
> > + * From 0 .. 100. Higher means more lazy freeing.
> > + */
> > +int vm_lazyfreeness = 20;
> > +/*
> > * The total number of pages which are beyond the high watermark within all
> > * zones.
> > */
> > @@ -2012,10 +2016,11 @@ enum scan_balance {
> > *
> > * nr[0] = anon inactive pages to scan; nr[1] = anon active pages to scan
> > * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
> > + * nr[4] = lazy free pages to scan;
> > */
> > static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > - struct scan_control *sc, unsigned long *nr,
> > - unsigned long *lru_pages)
> > + int lzfreeness, struct scan_control *sc,
> > + unsigned long *nr, unsigned long *lru_pages)
> > {
> > struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
> > u64 fraction[2];
> > @@ -2023,12 +2028,13 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > struct zone *zone = lruvec_zone(lruvec);
> > unsigned long anon_prio, file_prio;
> > enum scan_balance scan_balance;
> > - unsigned long anon, file;
> > + unsigned long anon, file, lzfree;
> > bool force_scan = false;
> > unsigned long ap, fp;
> > enum lru_list lru;
> > bool some_scanned;
> > int pass;
> > + unsigned long scan_lzfree = 0;
> >
> > /*
> > * If the zone or memcg is small, nr[l] can be 0. This
> > @@ -2166,7 +2172,7 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > /* Only use force_scan on second pass. */
> > for (pass = 0; !some_scanned && pass < 2; pass++) {
> > *lru_pages = 0;
> > - for_each_evictable_lru(lru) {
> > + for_each_anon_file_lru(lru) {
> > int file = is_file_lru(lru);
> > unsigned long size;
> > unsigned long scan;
> > @@ -2212,6 +2218,28 @@ static void get_scan_count(struct lruvec *lruvec, int swappiness,
> > some_scanned |= !!scan;
> > }
> > }
> > +
> > + lzfree = get_lru_size(lruvec, LRU_LZFREE);
> > + if (lzfree) {
> > + scan_lzfree = sc->nr_to_reclaim *
> > + (DEF_PRIORITY - sc->priority);
>
> scan_lzfree == 0 if sc->priority == DEF_PRIORITY, is this intended?
> > + scan_lzfree = div64_u64(scan_lzfree *
> > + lzfreeness, 50);
> > + if (!scan_lzfree) {
> > + unsigned long zonefile, zonefree;
> > +
> > + zonefree = zone_page_state(zone, NR_FREE_PAGES);
> > + zonefile = zone_page_state(zone, NR_ACTIVE_FILE) +
> > + zone_page_state(zone, NR_INACTIVE_FILE);
> > + if (unlikely(zonefile + zonefree <=
> > + high_wmark_pages(zone))) {
> > + scan_lzfree = get_lru_size(lruvec,
> > + LRU_LZFREE) >> sc->priority;
> > + }
> > + }
> > + }
> > +
> > + nr[LRU_LZFREE] = min(scan_lzfree, lzfree);
> > }
>
> It looks like there is no setting to reclaim only lazyfree pages. Could we have
> an option for this? It's legitimate not to want to trash the page cache because
> of lazyfree memory.

Once we introduce the knob, it could be doable.
I will do it in the next spin.

Thanks for the review!

2015-11-13 06:37:31

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 01:16:54AM -0500, Daniel Micay wrote:
> On 13/11/15 01:15 AM, Minchan Kim wrote:
> > On Thu, Nov 12, 2015 at 12:21:30AM -0500, Daniel Micay wrote:
> >>> I also think that the kernel should commit to either zeroing the page
> >>> or leaving it unchanged in response to MADV_FREE (even if the decision
> >>> of which to do is made later on). I think that your patch series does
> >>> this, but only after a few of the patches are applied (the swap entry
> >>> freeing), and I think that it should be a real guaranteed part of the
> >>> semantics and maybe have a test case.
> >>
> >> This would be a good thing to test because it would be required to add
> >> MADV_FREE_UNDO down the road. It would mean the same semantics as the
> >> MEM_RESET and MEM_RESET_UNDO features on Windows, and there's probably
> >> value in that for the sake of migrating existing software too.
> >
> > So, do you mean that we could implement MADV_FREE_UNDO with a "read"
> > operation (just access-bit marking) easily in the future?
> >
> > If so, it would be a good reason to change MADV_FREE from the dirty bit to
> > the access bit. Okay, I will look at that.
>
> I just meant testing that the data is either zero or the old data if
> it's read before it's written to. Not having it stay around once there
> is a read. Not sure if that's what Andy meant.

Either zero or the old data is guaranteed.
Now:

MADV_FREE(range)
A = read from the range
...
...
B = read from the range


A and B could have different values, but each value will be either the old data or zero.

But Andy wants a stricter ABI, so he suggested the access bit instead of the dirty bit.

MADV_FREE(range)
A = read from the range
...
...
B = read from the range

A and B cannot have different values.

And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
easily when we need it. Maybe, that's what you want. Right?
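
A small userspace sketch of that sequence (MADV_FREE taken as 8 per this
series; which outcome you see depends on memory pressure and on which bit the
kernel checks):

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8
#endif

int main(void)
{
        size_t len = 4096;
        volatile unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if ((void *)p == MAP_FAILED)
                return 1;

        p[0] = 0xaa;                            /* old data, pte dirty */
        madvise((void *)p, len, MADV_FREE);     /* pte now clean and old */

        {
                unsigned char a = p[0];         /* reads leave the pte clean */
                unsigned char b = p[0];

                /*
                 * Dirty-bit semantics (this series): the page stays
                 * discardable until the next store, so a and b may differ,
                 * but each is either 0xaa or 0.
                 * Access-bit semantics (Andy's suggestion): the first read
                 * would cancel the hint, so a and b would have to match.
                 */
                printf("a=%#x b=%#x\n", a, b);
        }

        p[0] = 0xbb;                            /* store: cancels the hint */
        return 0;
}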

2015-11-13 06:46:00

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

> And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
> easily when we need it. Maybe, that's what you want. Right?

Yes, but why the access bit instead of the dirty bit for that? It could
always be made more strict (i.e. access bit) in the future, while going
the other way won't be possible. So I think the dirty bit is really the
more conservative choice since if it turns out to be a mistake it can be
fixed without a backwards incompatible change.



2015-11-13 07:03:40

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
> > And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
> > easily when we need it. Maybe, that's what you want. Right?
>
> Yes, but why the access bit instead of the dirty bit for that? It could
> always be made more strict (i.e. access bit) in the future, while going
> the other way won't be possible. So I think the dirty bit is really the
> more conservative choice since if it turns out to be a mistake it can be
> fixed without a backwards incompatible change.

Absolutely true. That's why I have insisted on the dirty bit until now,
although I didn't explain the reason. But I thought you wanted to switch
to the access bit for the future, too. It seems MADV_FREE keeps bloating
over and over again before we know the real problems and use cases.
It's almost the same situation as volatile ranges, so I really want to
stop at a proper point, which the maintainer should decide, I hope.
Without that, we will make the feature a lot heavier just by brainstorming,
causing lots of churn in MM code without real benefit.
It would be very painful for us.

2015-11-13 08:13:24

by Daniel Micay

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On 13/11/15 02:03 AM, Minchan Kim wrote:
> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>> And now I am thinking if we use access bit, we could implement MADV_FREE_UNDO
>>> easily when we need it. Maybe, that's what you want. Right?
>>
>> Yes, but why the access bit instead of the dirty bit for that? It could
>> always be made more strict (i.e. access bit) in the future, while going
>> the other way won't be possible. So I think the dirty bit is really the
>> more conservative choice since if it turns out to be a mistake it can be
>> fixed without a backwards incompatible change.
>
> Absolutely true. That's why I insist on dirty bit until now although
> I didn't tell the reason. But I thought you wanted to change for using
> access bit for the future, too. It seems MADV_FREE start to bloat
> over and over again before knowing real problems and usecases.
> It's almost same situation with volatile ranges so I really want to
> stop at proper point which maintainer should decide, I hope.
> Without it, we will make the feature a lot heavy by just brain storming
> and then causes lots of churn in MM code without real bebenfit
> It would be very painful for us.

Well, I don't think you need more than a good API and an implementation
with no known bugs, kernel security concerns or backwards compatibility
issues. Configuration and API extensions are something for later (i.e.
land a baseline, then submit stuff like sysctl tunables). Just my take
on it though...



2015-11-13 19:46:30

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
> On 13/11/15 02:03 AM, Minchan Kim wrote:
>> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
>>>> easily when we need it. Maybe, that's what you want. Right?
>>>
>>> Yes, but why the access bit instead of the dirty bit for that? It could
>>> always be made more strict (i.e. access bit) in the future, while going
>>> the other way won't be possible. So I think the dirty bit is really the
>>> more conservative choice since if it turns out to be a mistake it can be
>>> fixed without a backwards incompatible change.
>>
>> Absolutely true. That's why I insist on dirty bit until now although
>> I didn't tell the reason. But I thought you wanted to change for using
>> access bit for the future, too. It seems MADV_FREE start to bloat
>> over and over again before knowing real problems and usecases.
>> It's almost same situation with volatile ranges so I really want to
>> stop at proper point which maintainer should decide, I hope.
>> Without it, we will make the feature a lot heavy by just brain storming
>> and then causes lots of churn in MM code without real bebenfit
>> It would be very painful for us.
>
> Well, I don't think you need more than a good API and an implementation
> with no known bugs, kernel security concerns or backwards compatibility
> issues. Configuration and API extensions are something for later (i.e.
> land a baseline, then submit stuff like sysctl tunables). Just my take
> on it though...
>

As long as it's anonymous MAP_PRIVATE only, then the security aspects
should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
and there's been a long history of interesting bugs there.

As for dirty vs accessed, an argument in favor of going straight to
accessed is that it means that users can write code like this without
worrying about whether they have a kernel that uses the dirty bit:

x = mmap(...);
*x = 1; /* mark it present */

/* i'm done with it */
*x = 1;
madvise(MADV_FREE, x, ...);

wait a while;

/* is it still there? */
if (*x == 1) {
        /* use whatever was cached there */
} else {
        /* reinitialize it */
        *x = 1;
}

With the dirty bit, this will look like it works, but on occasion
users will lose the race where they probe *x to see if the data was
lost and then the data gets lost before the next write comes in.

Sure, that load from *x could be changed to RMW or users could do a
dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
to do that, and the caching implications are a little bit worse.
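
As a concrete illustration of that dummy-write idea (the helper name is made
up for this sketch, and it assumes x[0] and x[1] sit on the same page):

/*
 * Sketch of the dummy-write probe: the store to x[1] re-dirties the
 * page, so the kernel can no longer discard it between the probe and
 * the reuse.  This only works because x[0] and x[1] share a page.
 */
static int region_still_cached(volatile unsigned int *x)
{
	x[1] = 1;		/* dummy write: cancels the lazy free */
	return x[0] == 1;	/* safe to probe the sentinel now */
}

The volatile qualifier keeps the compiler from eliding the store, which is
exactly the failure mode described next for the RMW variant.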

Note that switching to RMW is really really dangerous. Doing:

*x &= 1;
if (*x == 1) ...;

is safe on x86 if the compiler generates:

andl $1, (%[x]);
cmpl $1, (%[x]);

but is unsafe if the compiler generates:

movl (%[x]), %eax;
andl $1, %eax;
movl %eax, (%[x]);
cmpl $1, %eax;

and even worse if the write is omitted when "provably" unnecessary.

OTOH, if switching to the accessed bit is too much of a mess, then
using the dirty bit at first isn't so bad.

--Andy

--
Andy Lutomirski
AMA Capital Management, LLC

2015-11-16 02:12:58

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)

On Fri, Nov 13, 2015 at 11:46:07AM -0800, Andy Lutomirski wrote:
> On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
> > On 13/11/15 02:03 AM, Minchan Kim wrote:
> >> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
> >>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
> >>>> easily when we need it. Maybe, that's what you want. Right?
> >>>
> >>> Yes, but why the access bit instead of the dirty bit for that? It could
> >>> always be made more strict (i.e. access bit) in the future, while going
> >>> the other way won't be possible. So I think the dirty bit is really the
> >>> more conservative choice since if it turns out to be a mistake it can be
> >>> fixed without a backwards incompatible change.
> >>
> >> Absolutely true. That's why I insist on dirty bit until now although
> >> I didn't tell the reason. But I thought you wanted to change for using
> >> access bit for the future, too. It seems MADV_FREE start to bloat
> >> over and over again before knowing real problems and usecases.
> >> It's almost same situation with volatile ranges so I really want to
> >> stop at proper point which maintainer should decide, I hope.
> >> Without it, we will make the feature a lot heavy by just brain storming
> >> and then causes lots of churn in MM code without real bebenfit
> >> It would be very painful for us.
> >
> > Well, I don't think you need more than a good API and an implementation
> > with no known bugs, kernel security concerns or backwards compatibility
> > issues. Configuration and API extensions are something for later (i.e.
> > land a baseline, then submit stuff like sysctl tunables). Just my take
> > on it though...
> >
>
> As long as it's anonymous MAP_PRIVATE only, then the security aspects
> should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
> and there's been long history of interesting bugs there.
>
> As for dirty vs accessed, an argument in favor of going straight to
> accessed is that it means that users can write code like this without
> worrying about whether they have a kernel that uses the dirty bit:
>
> x = mmap(...);
> *x = 1; /* mark it present */
>
> /* i'm done with it */
> *x = 1;
> madvise(MADV_FREE, x, ...);
>
> wait a while;
>
> /* is it still there? */
> if (*x == 1) {
> /* use whatever was cached there */
> } else {
> /* reinitialize it */
> *x = 1;
> }
>
> With the dirty bit, this will look like it works, but on occasion
> users will lose the race where they probe *x to see if the data was
> lost and then the data gets lost before the next write comes in.
>
> Sure, that load from *x could be changed to RMW or users could do a
> dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
> to do that, and the caching implications are a little bit worse.

I think your example is a case where people abuse MADV_FREE.
What happens if the object (i.e., x) spans multiple pages?
The user would have to know the object's memory alignment and check all
of the pages which span the object. Hmm, I don't think that's good for an API.
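
To make the alignment problem concrete, a caller would have to do something
like the following sketch before hinting an arbitrary object (the helper name
and rounding strategy are made up for illustration, and a header defining
MADV_FREE is assumed); only the whole pages that lie entirely inside the
object can safely be hinted:

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Sketch: hint only the whole pages inside [obj, obj + len), since the
 * partial pages at either end may still hold live neighbouring data.
 */
static void free_hint_object(void *obj, size_t len)
{
	uintptr_t page  = (uintptr_t)sysconf(_SC_PAGESIZE);
	uintptr_t start = ((uintptr_t)obj + page - 1) & ~(page - 1);
	uintptr_t end   = ((uintptr_t)obj + len) & ~(page - 1);

	if (end > start)
		madvise((void *)start, end - start, MADV_FREE);
}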

>
> Note that switching to RMW is really really dangerous. Doing:
>
> *x &= 1;
> if (*x == 1) ...;
>
> is safe on x86 if the compiler generates:
>
> andl $1, (%[x]);
> cmpl $1, (%[x]);
>
> but is unsafe if the compiler generates:
>
> movl (%[x]), %eax;
> andl $1, %eax;
> movl %eax, (%[x]);
> cmpl $1, %eax;
>
> and even worse if the write is omitted when "provably" unnecessary.
>
> OTOH, if switching to the accessed bit is too much of a mess, then
> using the dirty bit at first isn't so bad.

Thanks! I want to use the dirty bit first.

About the access bit, I don't want to call it a mess, but I guess it would
change a lot of subtle things for all architectures, because we have used
the access bit as just a *hint* for aging while the dirty bit is a really
*critical marker* for system integrity. As an example, on x86 we don't
keep the access bit accurate, to reduce TLB flush IPIs. I don't know
what techniques other arches have used, but they might have similar ones.

Thanks.


>
> --Andy
>
> --
> Andy Lutomirski
> AMA Capital Management, LLC

2015-11-16 03:15:07

by yalin wang

[permalink] [raw]
Subject: Re: [PATCH v3 01/17] mm: support madvise(MADV_FREE)


> On Nov 16, 2015, at 10:13, Minchan Kim <[email protected]> wrote:
>
> On Fri, Nov 13, 2015 at 11:46:07AM -0800, Andy Lutomirski wrote:
>> On Fri, Nov 13, 2015 at 12:13 AM, Daniel Micay <[email protected]> wrote:
>>> On 13/11/15 02:03 AM, Minchan Kim wrote:
>>>> On Fri, Nov 13, 2015 at 01:45:52AM -0500, Daniel Micay wrote:
>>>>>> And now I am thinking if we use access bit, we could implment MADV_FREE_UNDO
>>>>>> easily when we need it. Maybe, that's what you want. Right?
>>>>>
>>>>> Yes, but why the access bit instead of the dirty bit for that? It could
>>>>> always be made more strict (i.e. access bit) in the future, while going
>>>>> the other way won't be possible. So I think the dirty bit is really the
>>>>> more conservative choice since if it turns out to be a mistake it can be
>>>>> fixed without a backwards incompatible change.
>>>>
>>>> Absolutely true. That's why I insist on dirty bit until now although
>>>> I didn't tell the reason. But I thought you wanted to change for using
>>>> access bit for the future, too. It seems MADV_FREE start to bloat
>>>> over and over again before knowing real problems and usecases.
>>>> It's almost same situation with volatile ranges so I really want to
>>>> stop at proper point which maintainer should decide, I hope.
>>>> Without it, we will make the feature a lot heavy by just brain storming
>>>> and then causes lots of churn in MM code without real bebenfit
>>>> It would be very painful for us.
>>>
>>> Well, I don't think you need more than a good API and an implementation
>>> with no known bugs, kernel security concerns or backwards compatibility
>>> issues. Configuration and API extensions are something for later (i.e.
>>> land a baseline, then submit stuff like sysctl tunables). Just my take
>>> on it though...
>>>
>>
>> As long as it's anonymous MAP_PRIVATE only, then the security aspects
>> should be okay. MADV_DONTNEED seems to work on pretty much any VMA,
>> and there's been long history of interesting bugs there.
>>
>> As for dirty vs accessed, an argument in favor of going straight to
>> accessed is that it means that users can write code like this without
>> worrying about whether they have a kernel that uses the dirty bit:
>>
>> x = mmap(...);
>> *x = 1; /* mark it present */
>>
>> /* i'm done with it */
>> *x = 1;
>> madvise(MADV_FREE, x, ...);
>>
>> wait a while;
>>
>> /* is it still there? */
>> if (*x == 1) {
>> /* use whatever was cached there */
>> } else {
>> /* reinitialize it */
>> *x = 1;
>> }
>>
>> With the dirty bit, this will look like it works, but on occasion
>> users will lose the race where they probe *x to see if the data was
>> lost and then the data gets lost before the next write comes in.
>>
>> Sure, that load from *x could be changed to RMW or users could do a
>> dummy write (e.g. x[1] = 1; if (*x == 1) ...), but people might forget
>> to do that, and the caching implications are a little bit worse.
>
> I think your example is the case what people abuse MADV_FREE.
> What happens if the object(ie, x) spans multiple pages?
> User should know object's memory align and investigate all of pages
> which span the object. Hmm, I don't think it's good for API.
>
>>
>> Note that switching to RMW is really really dangerous. Doing:
>>
>> *x &= 1;
>> if (*x == 1) ...;
>>
>> is safe on x86 if the compiler generates:
>>
>> andl $1, (%[x]);
>> cmpl $1, (%[x]);
>>
>> but is unsafe if the compiler generates:
>>
>> movl (%[x]), %eax;
>> andl $1, %eax;
>> movl %eax, (%[x]);
>> cmpl $1, %eax;
>>
>> and even worse if the write is omitted when "provably" unnecessary.
>>
>> OTOH, if switching to the accessed bit is too much of a mess, then
>> using the dirty bit at first isn't so bad.
>
> Thanks! I want to use dirty bit first.
>
> About access bit, I don't want to say it to mess but I guess it would
> change a lot subtle thing for all architectures. Because we have used
> access bit as just *hint* for aging while dirty bit is really
> *critical marker* for system integrity. A example in x86, we don't
> keep accuracy of access bit for reducing TLB flush IPI. I don't know
> what technique other arches have used but they might have.
>
> Thanks.
>
I think using the access bit is not easy to implement for anon pages in the kernel.
We are sure an anon page is always PageDirty() if it is !PageSwapCache(),
unless it is a MADV_FREE page.
But with the access bit, how do we distinguish a normal anon page from a MADV_FREE page?
It could be implemented with the access bit, but not easily; it needs more code change.

Thanks
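
The distinction being described can be written down roughly as follows; this
is only a sketch of the dirty-bit scheme in kernel-style C, simplified (the
series also rechecks pte dirty bits when unmapping) and not code from the
patch set:

/*
 * Sketch only: with dirty-bit semantics reclaim can tell a lazily
 * freeable anon page apart from a normal one without extra state.
 * A normal anon page that is not yet in the swap cache is always
 * dirty, so a clean one must have been hinted with MADV_FREE and
 * may simply be dropped instead of swapped out.
 */
static bool can_discard_anon(struct page *page)
{
	return PageAnon(page) && !PageSwapCache(page) && !PageDirty(page);
}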