2014-04-01 21:21:36

by Johannes Weiner

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

[ I tried to bring this up during LSFMM but it got drowned out.
Trying again :) ]

On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
> Optimistic method:
> 1) Userland marks a large range of data as volatile
> 2) Userland continues to access the data as it needs.
> 3) If userland accesses a page that has been purged, the kernel will
> send a SIGBUS
> 4) Userspace can trap the SIGBUS, mark the affected pages as
> non-volatile, and refill the data as needed before continuing on

As far as I understand, if a pointer to volatile memory makes it into
a syscall and the fault is trapped in kernel space, there won't be a
SIGBUS, the syscall will just return -EFAULT.

Handling this would mean annotating every syscall invocation to check
for -EFAULT, refill the data, and then restart the syscall. This is
complicated even before taking external libraries into account, which
may not propagate syscall returns properly or may not be reentrant at
the necessary granularity.

Another option is to never pass volatile memory pointers into the
kernel, but that too means that knowledge of volatility has to travel
alongside the pointers, which will either result in more complexity
throughout the application or severely limited scope of volatile
memory usage.

Either way, optimistic volatile pointers are nowhere near as
transparent to the application as the above description suggests,
which makes this usecase not very interesting, IMO. If we can support
it at little cost, why not, but I don't think we should complicate the
common usecases to support this one.


2014-04-01 21:36:06

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> [ I tried to bring this up during LSFMM but it got drowned out.
> Trying again :) ]
>
> On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
>> Optimistic method:
>> 1) Userland marks a large range of data as volatile
>> 2) Userland continues to access the data as it needs.
>> 3) If userland accesses a page that has been purged, the kernel will
>> send a SIGBUS
>> 4) Userspace can trap the SIGBUS, mark the affected pages as
>> non-volatile, and refill the data as needed before continuing on
>
> As far as I understand, if a pointer to volatile memory makes it into
> a syscall and the fault is trapped in kernel space, there won't be a
> SIGBUS, the syscall will just return -EFAULT.
>
> Handling this would mean annotating every syscall invocation to check
> for -EFAULT, refill the data, and then restart the syscall. This is
> complicated even before taking external libraries into account, which
> may not propagate syscall returns properly or may not be reentrant at
> the necessary granularity.
>
> Another option is to never pass volatile memory pointers into the
> kernel, but that too means that knowledge of volatility has to travel
> alongside the pointers, which will either result in more complexity
> throughout the application or severely limited scope of volatile
> memory usage.
>
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO. If we can support
> it at little cost, why not, but I don't think we should complicate the
> common usecases to support this one.
>

The whole EFAULT thing is a fundamental problem with the kernel
interface. This is not in any way the only place that suffers from it.

The fact that we cannot reliably get SIGSEGV or SIGBUS because the
memory may have been passed into a system call is an enormous problem.
The question is whether it is in any way fixable.

-hpa

2014-04-01 21:37:32

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO.
>

... however, I think you're still derating the value way too much. The
case of user space doing elastic memory management is more and more
common, and for a lot of those applications it is perfectly reasonable
to either not do system calls or to have to devolatilize first.

-hpa

2014-04-01 23:01:45

by Dave Hansen

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>> Either way, optimistic volatile pointers are nowhere near as
>> transparent to the application as the above description suggests,
>> which makes this usecase not very interesting, IMO.
>
> ... however, I think you're still derating the value way too much. The
> case of user space doing elastic memory management is more and more
> common, and for a lot of those applications it is perfectly reasonable
> to either not do system calls or to have to devolatilize first.

The SIGBUS is only in cases where the memory is set as volatile and
_then_ accessed, right?

John, this was something that the Mozilla guys asked for, right? Any
idea why this isn't ever a problem for them?

2014-04-02 04:04:04

by John Stultz

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> [ I tried to bring this up during LSFMM but it got drowned out.
> Trying again :) ]
>
> On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
>> Optimistic method:
>> 1) Userland marks a large range of data as volatile
>> 2) Userland continues to access the data as it needs.
>> 3) If userland accesses a page that has been purged, the kernel will
>> send a SIGBUS
>> 4) Userspace can trap the SIGBUS, mark the affected pages as
>> non-volatile, and refill the data as needed before continuing on
> As far as I understand, if a pointer to volatile memory makes it into
> a syscall and the fault is trapped in kernel space, there won't be a
> SIGBUS, the syscall will just return -EFAULT.
>
> Handling this would mean annotating every syscall invocation to check
> for -EFAULT, refill the data, and then restart the syscall. This is
> complicated even before taking external libraries into account, which
> may not propagate syscall returns properly or may not be reentrant at
> the necessary granularity.
>
> Another option is to never pass volatile memory pointers into the
> kernel, but that too means that knowledge of volatility has to travel
> alongside the pointers, which will either result in more complexity
> throughout the application or severely limited scope of volatile
> memory usage.
>
> Either way, optimistic volatile pointers are nowhere near as
> transparent to the application as the above description suggests,
> which makes this usecase not very interesting, IMO. If we can support
> it at little cost, why not, but I don't think we should complicate the
> common usecases to support this one.

So yea, thanks again for all the feedback at LSF-MM! I'm trying to get
things integrated for a v13 here shortly (although with visitors in town
this week it may not happen until next week).


So, maybe it's best to ignore the fact that folks want to do semi-crazy
user-space faulting via SIGBUS, at least to start with. Let's look at
the semantics for the "normal" case: mark volatile, never touch the
pages until you mark them non-volatile - basically where accessing
volatile pages is similar to a use-after-free bug.

So, for the most part, I'd say the proposed SIGBUS semantics don't
complicate things for this basic use-case, at least when compared with
things like zero-fill. If an application accidentally accesses a
purged volatile page, I think SIGBUS is the right thing to do. It will
most likely crash immediately, but that's better than it moving along
with silent corruption because it's mucking with zero-filled pages.

So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
you have a third option you're thinking of, I'd of course be interested
in hearing it.

Now... once you've chosen SIGBUS semantics, there will be folks who
will try to exploit the fact that we get SIGBUS on purged page access
(at least on the user-space side): they'll access pages while they are
volatile, and handle the SIGBUS to fix things up when one has been
purged. Those folks will have to be particularly careful not to pass
volatile data to the kernel, and if they do, they'll have to be smart
enough to handle the EFAULT, etc. That's really all their problem,
because they're being clever. :)

I've maybe made a mistake in talking at length about those use cases,
but I wanted to make sure I heard any suggestions folks had on how to
better address those cases (so far I've not heard any), and it sort of
helps wrap folks' heads around at least some of the potential
variations on the desired purging semantics (lru based cold page
purging, or entire object based purging).

Now, one other potential variant, which Keith brought up at LSF-MM, and
others have mentioned before, is to have *any* volatile page access
(purged or not) return a SIGBUS. This seems "safe" in that it protects
developers from themselves, and makes application behavior more
deterministic (rather than depending on memory pressure). However, it
also has the overhead of setting up the pte swp entries for each page
in order to trip the SIGBUS. Since folks have explicitly asked for it,
allowing non-purged volatile page access seems more flexible. And it's
cheaper. So that's what I've been leaning towards.

thanks again!
-john

2014-04-02 04:08:27

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 09:03 PM, John Stultz wrote:
>
> So, maybe its best to ignore the fact that folks want to do semi-crazy
> user-space faulting via SIGBUS. At least to start with. Lets look at the
> semantic for the "normal" mark volatile, never touch the pages until you
> mark non-volatile - basically where accessing volatile pages is similar
> to a use-after-free bug.
>
> So, for the most part, I'd say the proposed SIGBUS semantics don't
> complicate things for this basic use-case, at least when compared with
> things like zero-fill. If an applications accidentally accessed a
> purged volatile page, I think SIGBUS is the right thing to do. They most
> likely immediately crash, but its better then them moving along with
> silent corruption because they're mucking with zero-filled pages.
>
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.
>

People already do SIGBUS for mmap, so there is nothing new here.

> Now... once you've chosen SIGBUS semantics, there will be folks who will
> try to exploit the fact that we get SIGBUS on purged page access (at
> least on the user-space side) and will try to access pages that are
> volatile until they are purged and try to then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

Yep.

-hpa

2014-04-02 04:12:50

by John Stultz

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/01/2014 04:01 PM, Dave Hansen wrote:
> On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
>> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>>> Either way, optimistic volatile pointers are nowhere near as
>>> transparent to the application as the above description suggests,
>>> which makes this usecase not very interesting, IMO.
>> ... however, I think you're still derating the value way too much. The
>> case of user space doing elastic memory management is more and more
>> common, and for a lot of those applications it is perfectly reasonable
>> to either not do system calls or to have to devolatilize first.
> The SIGBUS is only in cases where the memory is set as volatile and
> _then_ accessed, right?
Not just set volatile and then accessed - the SIGBUS comes when a
volatile page has been purged and is then accessed without first being
made non-volatile.


> John, this was something that the Mozilla guys asked for, right? Any
> idea why this isn't ever a problem for them?
So one of their use cases for it is for library text. Basically they
want to decompress a compressed library file into memory. Then they plan
to mark the uncompressed pages volatile, and then be able to call into
it. Ideally for them, the kernel would only purge cold pages, leaving
the hot pages in memory. When they traverse a purged page, they handle
the SIGBUS and patch the page up.

Now... this is not what I'd consider a normal use case, but I was
hoping to illustrate some of the more interesting uses and demonstrate
the interface's flexibility.

Also it provided a clear example of benefits to doing LRU based
cold-page purging rather than full-object purging. Though I think the
same could be demonstrated in the simpler case of a large cache of
objects that the application wants to mark volatile in one pass,
unmarking sub-objects as it needs.

thanks
-john

2014-04-02 16:30:47

by Johannes Weiner

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> > [ I tried to bring this up during LSFMM but it got drowned out.
> > Trying again :) ]
> >
> > On Fri, Mar 21, 2014 at 02:17:30PM -0700, John Stultz wrote:
> >> Optimistic method:
> >> 1) Userland marks a large range of data as volatile
> >> 2) Userland continues to access the data as it needs.
> >> 3) If userland accesses a page that has been purged, the kernel will
> >> send a SIGBUS
> >> 4) Userspace can trap the SIGBUS, mark the affected pages as
> >> non-volatile, and refill the data as needed before continuing on
> > As far as I understand, if a pointer to volatile memory makes it into
> > a syscall and the fault is trapped in kernel space, there won't be a
> > SIGBUS, the syscall will just return -EFAULT.
> >
> > Handling this would mean annotating every syscall invocation to check
> > for -EFAULT, refill the data, and then restart the syscall. This is
> > complicated even before taking external libraries into account, which
> > may not propagate syscall returns properly or may not be reentrant at
> > the necessary granularity.
> >
> > Another option is to never pass volatile memory pointers into the
> > kernel, but that too means that knowledge of volatility has to travel
> > alongside the pointers, which will either result in more complexity
> > throughout the application or severely limited scope of volatile
> > memory usage.
> >
> > Either way, optimistic volatile pointers are nowhere near as
> > transparent to the application as the above description suggests,
> > which makes this usecase not very interesting, IMO. If we can support
> > it at little cost, why not, but I don't think we should complicate the
> > common usecases to support this one.
>
> So yea, thanks again for all the feedback at LSF-MM! I'm trying to get
> things integrated for a v13 here shortly (although with visitors in town
> this week it may not happen until next week).
>
>
> So, maybe its best to ignore the fact that folks want to do semi-crazy
> user-space faulting via SIGBUS. At least to start with. Lets look at the
> semantic for the "normal" mark volatile, never touch the pages until you
> mark non-volatile - basically where accessing volatile pages is similar
> to a use-after-free bug.
>
> So, for the most part, I'd say the proposed SIGBUS semantics don't
> complicate things for this basic use-case, at least when compared with
> things like zero-fill. If an applications accidentally accessed a
> purged volatile page, I think SIGBUS is the right thing to do. They most
> likely immediately crash, but its better then them moving along with
> silent corruption because they're mucking with zero-filled pages.
>
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

I'm bringing this up again because I see very few solid usecases for a
separate vrange() syscall once we have something like MADV_FREE and
MADV_REVIVE, which respectively clear the dirty bits of a range of
anon/tmpfs pages, and set them again, reporting whether any pages in
the given range were purged on revival.

So between zero-fill and SIGBUS, I'd prefer the one which results in
the simpler user interface / fewer system calls.

2014-04-02 16:34:32

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 09:30 AM, Johannes Weiner wrote:
>
> So between zero-fill and SIGBUS, I'd prefer the one which results in
> the simpler user interface / fewer system calls.
>

The use cases are different; I believe this should be a user space option.

-hpa

2014-04-02 16:36:46

by Johannes Weiner

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >>> Either way, optimistic volatile pointers are nowhere near as
> >>> transparent to the application as the above description suggests,
> >>> which makes this usecase not very interesting, IMO.
> >> ... however, I think you're still derating the value way too much. The
> >> case of user space doing elastic memory management is more and more
> >> common, and for a lot of those applications it is perfectly reasonable
> >> to either not do system calls or to have to devolatilize first.
> > The SIGBUS is only in cases where the memory is set as volatile and
> > _then_ accessed, right?
> Not just set volatile and then accessed, but when a volatile page has
> been purged and then accessed without being made non-volatile.
>
>
> > John, this was something that the Mozilla guys asked for, right? Any
> > idea why this isn't ever a problem for them?
> So one of their use cases for it is for library text. Basically they
> want to decompress a compressed library file into memory. Then they plan
> to mark the uncompressed pages volatile, and then be able to call into
> it. Ideally for them, the kernel would only purge cold pages, leaving
> the hot pages in memory. When they traverse a purged page, they handle
> the SIGBUS and patch the page up.

How big are these libraries compared to overall system size?

> Now.. this is not what I'd consider a normal use case, but was hoping to
> illustrate some of the more interesting uses and demonstrate the
> interfaces flexibility.

I'm just dying to hear a "normal" use case then. :)

> Also it provided a clear example of benefits to doing LRU based
> cold-page purging rather then full object purging. Though I think the
> same could be demonstrated in a simpler case of a large cache of objects
> that the applications wants to mark volatile in one pass, unmarking
> sub-objects as it needs.

Agreed.

2014-04-02 16:39:10

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 09:32 AM, H. Peter Anvin wrote:
> On 04/02/2014 09:30 AM, Johannes Weiner wrote:
>>
>> So between zero-fill and SIGBUS, I'd prefer the one which results in
>> the simpler user interface / fewer system calls.
>>
>
> The use cases are different; I believe this should be a user space option.
>

Case in point, for example: imagine a JIT. You *really* don't want to
zero-fill memory behind the back of your JIT, as all-zero memory is not
necessarily a trapping instruction (it isn't on x86, for example), and
if you are unlucky you may be modifying *part* of an instruction.

Thus, SIGBUS is the only safe option.

-hpa

2014-04-02 17:18:33

by Johannes Weiner

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 09:37:49AM -0700, H. Peter Anvin wrote:
> On 04/02/2014 09:32 AM, H. Peter Anvin wrote:
> > On 04/02/2014 09:30 AM, Johannes Weiner wrote:
> >>
> >> So between zero-fill and SIGBUS, I'd prefer the one which results in
> >> the simpler user interface / fewer system calls.
> >>
> >
> > The use cases are different; I believe this should be a user space option.
> >
>
> Case in point, for example: imagine a JIT. You *really* don't want to
> zero-fill memory behind the back of your JIT, as all zero memory may not
> be a trapping instruction (it isn't on x86, for example, and if you are
> unlucky you may be modifying *part* of an instruction.)

Yes, and I think this would be comparable to the compressed-library
usecase that John mentioned. What's special about these cases is that
the accesses are no longer under control of the application because
it's literally code that the CPU jumps into. It is obvious to me that
such a usecase would require SIGBUS handling. However, it seems that
in any usecase *besides* executable code caches, userspace would have
the ability to mark the pages non-volatile ahead of time, and thus not
require SIGBUS delivery.

Hence my follow-up question in the other mail about how large we
expect such code caches to become in practice in relationship to
overall system memory. Are code caches interesting reclaim candidates
to begin with? Are they big enough to make the machine thrash/swap
otherwise?

2014-04-02 17:40:30

by John Stultz

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <[email protected]> wrote:
> On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
>> On 04/01/2014 04:01 PM, Dave Hansen wrote:
>> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
>> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
>> > John, this was something that the Mozilla guys asked for, right? Any
>> > idea why this isn't ever a problem for them?
>> So one of their use cases for it is for library text. Basically they
>> want to decompress a compressed library file into memory. Then they plan
>> to mark the uncompressed pages volatile, and then be able to call into
>> it. Ideally for them, the kernel would only purge cold pages, leaving
>> the hot pages in memory. When they traverse a purged page, they handle
>> the SIGBUS and patch the page up.
>
> How big are these libraries compared to overall system size?

Mike or Taras would have to refresh my memory on this detail. My
recollection is it mostly has to do with keeping the on-disk size of
the library small, so it can load off of slow media very quickly.

>> Now.. this is not what I'd consider a normal use case, but was hoping to
>> illustrate some of the more interesting uses and demonstrate the
>> interfaces flexibility.
>
> I'm just dying to hear a "normal" use case then. :)

So the more "normal" use case would be marking objects volatile and
then non-volatile w/o accessing them in between. In this case the
zero-fill vs SIGBUS semantics don't really matter; it's just a
trade-off in how we handle applications deviating (intentionally or
not) from this use case.

So to maybe flesh out the context here for folks who are following
along (but weren't in the hallway at LSF :), Johannes made a fairly
interesting proposal (Johannes: please correct me where I'm slightly
off) to use only the dirty bits of the ptes to mark a page as
volatile. Then the kernel could reclaim these clean pages as it
needed, and when we marked the range as non-volatile, the pages would
be re-dirtied and, if any of them were missing, we could return a flag
with the purged state. This has somewhat different semantics than what
I've been working with for a while (for example, any write to a page
would implicitly clear its volatility), so I wasn't completely
comfortable with it, but figured I'd think about it to see if it could
be done. Particularly since it would in some ways simplify the
tmpfs/shm shared volatility that I'd eventually like to do.

After thinking it over in the hallway, I talked through some of the
details w/ Johannes, and there was one issue: while w/ anonymous
memory we can still add a VM_VOLATILE flag on the vma and thus get
SIGBUS semantics, for shared volatile ranges we don't have anything to
hang a volatile flag on w/o adding some new vma-like structure to the
address_space structure (much as we did in the past w/ earlier
volatile range implementations). That would negate much of the point
of using the dirty bits to simplify the shared volatility
implementation.

Thus Johannes is reasonably questioning the need for SIGBUS semantics,
since if it wasn't needed, the simpler page-cleaning based volatility
could potentially be used.


Now, while for the case I'm personally most interested in (ashmem),
zero-fill would technically be ok, since that's what Android does.
Even so, I don't think it's the best approach for the interface, since
applications may end up quite surprised by the results when they
accidentally don't follow the "don't touch volatile pages" rule.

That point aside, I think the other problem with the page-cleaning
volatility approach is that it has awkward side effects. For example:
say an application marks a range as volatile, and one page in the
range is then purged. The application, due to a bug or otherwise,
reads the volatile range. This causes the page to be zero-filled in,
and the application silently uses the corrupted data (which isn't
great). More problematic, though, is that by faulting the page in,
they've in effect lost the purge state for that page. When the
application then goes to mark the range as non-volatile, all pages are
present, so we'd report that no pages were purged. From an
application perspective this is pretty ugly.

Johannes: Any thoughts on this potential issue with your proposal? Am
I missing something else?

thanks
-john

2014-04-02 17:40:28

by Dave Hansen

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 10:18 AM, Johannes Weiner wrote:
> Hence my follow-up question in the other mail about how large we
> expect such code caches to become in practice in relationship to
> overall system memory. Are code caches interesting reclaim candidates
> to begin with? Are they big enough to make the machine thrash/swap
> otherwise?

A big chunk of the use cases here are for swapless systems anyway, so
this is the *only* way for them to reclaim anonymous memory. Their
choices are either to be constantly throwing away and rebuilding these
objects, or to leave them in memory effectively pinned.

In practice I did see ashmem (the Android thing that we're trying to
replace) get used a lot by the Android web browser when I was playing
with it. John said that it got used for storing decompressed copies of
images.

2014-04-02 17:48:07

by John Stultz

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen <[email protected]> wrote:
> On 04/02/2014 10:18 AM, Johannes Weiner wrote:
>> Hence my follow-up question in the other mail about how large we
>> expect such code caches to become in practice in relationship to
>> overall system memory. Are code caches interesting reclaim candidates
>> to begin with? Are they big enough to make the machine thrash/swap
>> otherwise?
>
> A big chunk of the use cases here are for swapless systems anyway, so
> this is the *only* way for them to reclaim anonymous memory. Their
> choices are either to be constantly throwing away and rebuilding these
> objects, or to leave them in memory effectively pinned.
>
> In practice I did see ashmem (the Android thing that we're trying to
> replace) get used a lot by the Android web browser when I was playing
> with it. John said that it got used for storing decompressed copies of
> images.

Although images are a simpler case where it's easy to not touch
volatile pages. I think Johannes is mostly concerned about cases where
volatile pages are being accessed while they are volatile. So far the
Mozilla folks are the only viable case (in my mind... folks may have
others) where pages are intentionally accessed while they're volatile,
and thus the only one that requires SIGBUS semantics.

I suspect handling the SIGBUS and patching up the purged page you
trapped on is likely much too complicated for most use cases. But I do
think SIGBUS is preferable to zero-fill on purged page access, just
because it's likely to make applications easier to debug.

thanks
-john

2014-04-02 17:59:09

by Johannes Weiner

Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <[email protected]> wrote:
> > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> >> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >> > John, this was something that the Mozilla guys asked for, right? Any
> >> > idea why this isn't ever a problem for them?
> >> So one of their use cases for it is for library text. Basically they
> >> want to decompress a compressed library file into memory. Then they plan
> >> to mark the uncompressed pages volatile, and then be able to call into
> >> it. Ideally for them, the kernel would only purge cold pages, leaving
> >> the hot pages in memory. When they traverse a purged page, they handle
> >> the SIGBUS and patch the page up.
> >
> > How big are these libraries compared to overall system size?
>
> Mike or Taras would have to refresh my memory on this detail. My
> recollection is it mostly has to do with keeping the on-disk size of
> the library small, so it can load off of slow media very quickly.
>
> >> Now.. this is not what I'd consider a normal use case, but was hoping to
> >> illustrate some of the more interesting uses and demonstrate the
> >> interfaces flexibility.
> >
> > I'm just dying to hear a "normal" use case then. :)
>
> So the more "normal" use cause would be marking objects volatile and
> then non-volatile w/o accessing them in-between. In this case the
> zero-fill vs SIGBUS semantics don't really matter, its really just a
> trade off in how we handle applications deviating (intentionally or
> not) from this use case.
>
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :), Johannes made a fairly
> interesting proposal (Johannes: Please correct me here where I'm maybe
> slightly off here) to use only the dirty bits of the ptes to mark a
> page as volatile. Then the kernel could reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied and if any of the pages were missing, we could
> return a flag with the purged state. This had some different
> semantics then what I've been working with for awhile (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> tmpfs/shm shared volatility that I'd eventually like to do.
>
> After thinking it over in the hallway, I talked some of the details w/
> Johannes, and there was one issue: while w/ anonymous memory we can
> still add a VM_VOLATILE flag on the vma to get SIGBUS semantics, w/
> shared volatile ranges we don't have anything to hang a volatile flag
> on w/o adding some new vma-like structure to the address_space
> structure (much as we did in the past w/ earlier volatile range
> implementations). This would negate much of the point of using the
> dirty bits to simplify the shared volatility implementation.
>
> Thus Johannes is reasonably questioning the need for SIGBUS semantics,
> since if it wasn't needed, the simpler page-cleaning based volatility
> could potentially be used.

Thanks for summarizing this again!

> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
>
> That point beside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: Say an application marks a range as volatile. One page in the
> range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged. From an
> application perspective this is pretty ugly.
>
> Johannes: Any thoughts on this potential issue with your proposal? Am
> I missing something else?

No, this is accurate. However, I don't really see how this is
different than any other use-after-free bug. If you access malloc
memory after free(), you might receive a SIGSEGV, you might see random
data, you might corrupt somebody else's data. This certainly isn't
nice, but it's not exactly new behavior, is it?

2014-04-02 18:07:21

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 10:40 AM, Dave Hansen <[email protected]> wrote:
> > On 04/02/2014 10:18 AM, Johannes Weiner wrote:
> >> Hence my follow-up question in the other mail about how large we
> >> expect such code caches to become in practice in relationship to
> >> overall system memory. Are code caches interesting reclaim candidates
> >> to begin with? Are they big enough to make the machine thrash/swap
> >> otherwise?
> >
> > A big chunk of the use cases here are for swapless systems anyway, so
> > this is the *only* way for them to reclaim anonymous memory. Their
> > choices are either to be constantly throwing away and rebuilding these
> > objects, or to leave them in memory effectively pinned.
> >
> > In practice I did see ashmem (the Android thing that we're trying to
> > replace) get used a lot by the Android web browser when I was playing
> > with it. John said that it got used for storing decompressed copies of
> > images.
>
> Although images are a simpler case where it's easier to not touch
> volatile pages. I think Johannes is mostly concerned about cases where
> volatile pages are accessed while they are volatile; so far the
> Mozilla folks are the only viable case (in my mind... folks may have
> others) where pages are intentionally accessed while volatile, thus
> requiring SIGBUS semantics.

Yes, absolutely, that is my only concern. Compressed images as in
Android can easily be marked non-volatile before they are accessed
again.

Code caches are harder because control is handed off to the CPU, but
I'm not entirely sure yet whether these are in fact interesting
reclaim candidates.

> I suspect handling the SIGBUS and patching up the purged page you
> trapped on is likely much too complicated for most use cases. But I do
> think SIGBUS is preferable to zero-fill on purged page access, just
> because it's likely to be easier to debug applications.

Fully agreed, but it seems a bit overkill to add a separate syscall, a
range-tree on top of shmem address_spaces, and an essentially new
programming model based on SIGBUS userspace fault handling (incl. all
the complexities and confusion this inevitably will bring when people
DO end up passing these pointers into kernel space) just to be a bit
nicer about use-after-free bugs in applications.

2014-04-02 18:32:41

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

Hi everyone,

On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> you have a third option you're thinking of, I'd of course be interested
> in hearing it.

I actually thought the way of being notified with a page fault (sigbus
or whatever) was the most efficient way of using volatile ranges.

Why call a syscall to know if you can still access the volatile
range, if there was no VM pressure before the access? Syscalls are
expensive; accessing the memory directly is not. Only if the page was
actually missing and a page fault fired would you take the slowpath.

The uses I see for this are plenty, like maintaining caches in
memory that may be big and would be nice to discard if there's VM
pressure; uncompressed jpeg images sound like a candidate too. So the
browser's size would shrink under VM pressure, instead of ending up
swapping out uncompressed image data that can be regenerated more
quickly with the CPU than with swapins.

> Now... once you've chosen SIGBUS semantics, there will be folks who
> will try to exploit the fact that we get SIGBUS on purged page access
> (at least on the user-space side): they'll access pages that are
> volatile until they are purged and then handle the SIGBUS to fix
> things up. Those folks exploiting that will have to be particularly
> careful not to pass volatile data to the kernel, and if they do they'll
> have to be smart enough to handle the EFAULT, etc. That's really all
> their problem, because they're being clever. :)

I'm actually working on a feature that would solve the problem for
syscalls accessing missing volatile pages. You'd never see -EFAULT,
because syscalls won't return early even if they encounter a missing
page in the volatile range dropped by VM pressure.

It's called userfaultfd. You call sys_userfaultfd(flags) and it
connects the current mm to a pseudo filedescriptor. The filedescriptor
works similarly to eventfd but with a different protocol.

You need a thread that will never access the userfault area with the
CPU, which is responsible for polling the userfaultfd and talking the
userfaultfd protocol to fill in missing pages. After a POLLIN event,
the userfault thread reads the virtual addresses of the fault that
must have happened on some other thread of the same mm, and then
writes back a "handled" virtual range into the fd, after the page (or
pages if multiple) have been regenerated and mapped in with
sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
swapping. Then depending on the "solved" range written back into the
fd, the kernel will wake up the thread or threads that were waiting in
kernel mode on the "handled" virtual range, and retry the fault
without ever exiting kernel mode.

We need this in KVM for running the guest on memory that is on other
nodes or other processes (postcopy live migration is the most common
use case but there are others like memory externalization and
cross-node KSM in the cloud, to keep a single copy of memory across
multiple nodes and externalized to the VM and to the host node).

This thread made me wonder if we could mix the two features: you
would then depend on MADV_USERFAULT and userfaultfd to deliver to
userland the "faults" happening on the volatile pages that have been
purged as a result of VM pressure.

I'm just saying this after Johannes mentioned the issue with syscalls
returning -EFAULT. Because that is the very issue that the userfaultfd
is going to solve for the KVM migration thread.

What I'm thinking now is to mark the volatile range also
MADV_USERFAULT and call userfaultfd, and instead of having the
cache regeneration "slow path" inside the SIGBUS handler, run it in
the userfault thread that polls the userfaultfd. Then you could write
the volatile ranges to disk with a write() syscall (or use any other
syscall on the volatile ranges), without having to worry about -EFAULT
being returned because one page was discarded. And if MADV_USERFAULT
is not called in combination with the vrange syscalls, it'd still
work without the userfault, just with the vrange syscalls only.

In short the idea would be to let the userfault code solve the fault
delivery to userland for you, and make the vrange syscalls focus only
on the page purging problem, without having to worry about what
happens when something accesses a missing page.

But if you don't intend to solve the syscall -EFAULT problem, well
then probably the overlap is still as thin as I thought it was before
(as also mentioned in the link below).

Thanks,
Andrea

PS. my last email about this from a more KVM centric point of view:

http://www.spinics.net/lists/kvm/msg101449.html

2014-04-02 19:01:03

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> That point beside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects. For
>> example: Say an application marks a range as volatile. One page in the
>> range is then purged. The application, due to a bug or otherwise,
>> reads the volatile range. This causes the page to be zero-filled in,
>> and the application silently uses the corrupted data (which isn't
>> great). More problematic though, is that by faulting the page in,
>> they've in effect lost the purge state for that page. When the
>> application then goes to mark the range as non-volatile, all pages are
>> present, so we'd return that no pages were purged. From an
>> application perspective this is pretty ugly.
>>
>> Johannes: Any thoughts on this potential issue with your proposal? Am
>> I missing something else?
>
> No, this is accurate. However, I don't really see how this is
> different than any other use-after-free bug. If you access malloc
> memory after free(), you might receive a SIGSEGV, you might see random
> data, you might corrupt somebody else's data. This certainly isn't
> nice, but it's not exactly new behavior, is it?

The part that troubles me is that I see the purged state as kernel
data being corrupted by userland in this case. The kernel will tell
userspace that no pages were purged, even though they were, only
because userspace made an errant read of a page and got garbage data
back.

thanks
-john

2014-04-02 19:28:03

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > you have a third option you're thinking of, I'd of course be interested
> > in hearing it.
>
> I actually thought the way of being notified with a page fault (sigbus
> or whatever) was the most efficient way of using volatile ranges.
>
> Why having to call a syscall to know if you can still access the
> volatile range, if there was no VM pressure before the access?
> syscalls are expensive, accessing the memory direct is not. Only if it
> page was actually missing and a page fault would fire, you'd take the
> slowpath.

Not everybody wants to actually come back for the data in the range;
allocators and message-passing applications just want to be able to
reuse the memory mapping.

By tying the volatility to the dirty bit in the page tables, an
allocator could simply clear those bits once on free(). When malloc()
hands out this region again, the user is expected to write, which will
either overwrite the old page, or, if it was purged, fault in a fresh
zero page. But there is no second syscall needed to clear volatility.

> > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > try to exploit the fact that we get SIGBUS on purged page access (at
> > least on the user-space side) and will try to access pages that are
> > volatile until they are purged and try to then handle the SIGBUS to fix
> > things up. Those folks exploiting that will have to be particularly
> > careful not to pass volatile data to the kernel, and if they do they'll
> > have to be smart enough to handle the EFAULT, etc. That's really all
> > their problem, because they're being clever. :)
>
> I'm actually working on feature that would solve the problem for the
> syscalls accessing missing volatile pages. So you'd never see a
> -EFAULT because all syscalls won't return even if they encounters a
> missing page in the volatile range dropped by the VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
>
> You need a thread that will never access the userfault area with the
> CPU, that is responsible to poll on the userfaultfd and talk the
> userfaultfd protocol to fill-in missing pages. The userfault thread
> after a POLLIN event reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back an "handled" virtual range into the fd, after the page (or
> pages if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> swapping. Then depending on the "solved" range written back into the
> fd, the kernel will wakeup the thread or threads that were waiting in
> kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.
>
> We need this in KVM for running the guest on memory that is on other
> nodes or other processes (postcopy live migration is the most common
> use case but there are others like memory externalization and
> cross-node KSM in the cloud, to keep a single copy of memory across
> multiple nodes and externalized to the VM and to the host node).
>
> This thread made me wonder if we could mix the two features and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as result of VM pressure.
>
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the userfaultfd
> is going to solve for the KVM migration thread.
>
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then calling userfaultfd and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it in
> the userfault thread that polls the userfaultfd. Then you could write
> the volatile ranges to disk with a write() syscall (or use any other
> syscall on the volatile ranges), without having to worry about -EFAULT
> being returned because one page was discarded. And if MADV_USERFAULT
> is not called in combination with vrange syscalls, then it'd still
> work without the userfault, but with the vrange syscalls only.
>
> In short the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls only focus
> on the page purging problem, without having to worry about what
> happens when something access a missing page.

Yes, the two seem certainly combinable to me.

madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace
fault handling. In the fault slowpath, you can then regenerate any
missing data and do MADV_FREE again if it should remain volatile. And
again, any actual writes to the region would clear volatility because
now the cache copy changed and discarding it would mean losing state.

2014-04-02 19:37:16

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 2, 2014 at 11:07 AM, Johannes Weiner <[email protected]> wrote:
> On Wed, Apr 02, 2014 at 10:48:03AM -0700, John Stultz wrote:
>> I suspect handling the SIGBUS and patching up the purged page you
>> trapped on is likely much to complicated for most use cases. But I do
>> think SIGBUS is preferable to zero-fill on purged page access, just
>> because its likely to be easier to debug applications.
>
> Fully agreed, but it seems a bit overkill to add a separate syscall, a
> range-tree on top of shmem address_spaces, and an essentially new
> programming model based on SIGBUS userspace fault handling (incl. all
> the complexities and confusion this inevitably will bring when people
> DO end up passing these pointers into kernel space) just to be a bit
> nicer about use-after-free bugs in applications.

It's more about making an interface that has graspable semantics for
userspace, instead of having the semantics be a side effect of the
implementation.

Tying volatility to the page-clean state and page-was-purged to
page-present seems problematic to me, because there are too many ways
to change the page-clean or page-present state outside of the
interface being proposed.

I feel this causes a cascade of corner cases that have to be explained
to users of the interface.

Also I disagree that we're adding a new programming model, as SIGBUSes
can already be caught; it's just that there's not usually much one can
do, whereas with volatile pages it's more likely something could be
done. And again, it's really just a side effect of having semantics
(SIGBUS on purged page access) that are more helpful from an
application's perspective.

As for the separate syscall: again, this is mainly needed to handle
allocation failures that happen mid-way through modifying the range.
There may still be a way to do the allocation first and only after it
succeeds do the modification. The vma merge/splitting logic doesn't
make this easy, but if we can be sure that on a failed split of 1 vma
-> 3 vmas (which may fail half way) we can re-merge w/o allocation and
error out (without having to do any other allocations), this might be
avoidable. I'm still wanting to look at this. If so, it would be
easier to re-add this support under madvise, if folks really really
don't like the new syscall. For the most part, having the separate
syscall allows us to discuss other details of the semantics, which to
me are more important than the syscall naming.

thanks
-john

2014-04-02 19:47:19

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <[email protected]> wrote:
> > On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> >> That point beside, I think the other problem with the page-cleaning
> >> volatility approach is that there are other awkward side effects. For
> >> example: Say an application marks a range as volatile. One page in the
> >> range is then purged. The application, due to a bug or otherwise,
> >> reads the volatile range. This causes the page to be zero-filled in,
> >> and the application silently uses the corrupted data (which isn't
> >> great). More problematic though, is that by faulting the page in,
> >> they've in effect lost the purge state for that page. When the
> >> application then goes to mark the range as non-volatile, all pages are
> >> present, so we'd return that no pages were purged. From an
> >> application perspective this is pretty ugly.
> >>
> >> Johannes: Any thoughts on this potential issue with your proposal? Am
> >> I missing something else?
> >
> > No, this is accurate. However, I don't really see how this is
> > different than any other use-after-free bug. If you access malloc
> > memory after free(), you might receive a SIGSEGV, you might see random
> > data, you might corrupt somebody else's data. This certainly isn't
> > nice, but it's not exactly new behavior, is it?
>
> The part that troubles me is that I see the purged state as kernel
> data being corrupted by userland in this case. The kernel will tell
> userspace that no pages were purged, even though they were. Only
> because userspace made an errant read of a page, and got garbage data
> back.

That sounds overly dramatic to me. First of all, this data still
reflects accurately the actions of userspace in this situation. And
secondly, the kernel does not rely on this data to be meaningful from
a userspace perspective to function correctly.

It's really nothing but a use-after-free bug that has consequences for
no-one but the faulty application. The thing that IS new is that even
a read is enough to corrupt your data in this case.

MADV_REVIVE could return 0 if all pages in the specified range were
present, -Esomething otherwise. That would be semantically sound
even if userspace messes up.

2014-04-02 19:51:53

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 11:31 AM, Andrea Arcangeli wrote:
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
>> Now... once you've chosen SIGBUS semantics, there will be folks who will
>> try to exploit the fact that we get SIGBUS on purged page access (at
>> least on the user-space side) and will try to access pages that are
>> volatile until they are purged and try to then handle the SIGBUS to fix
>> things up. Those folks exploiting that will have to be particularly
>> careful not to pass volatile data to the kernel, and if they do they'll
>> have to be smart enough to handle the EFAULT, etc. That's really all
>> their problem, because they're being clever. :)
> I'm actually working on feature that would solve the problem for the
> syscalls accessing missing volatile pages. So you'd never see a
> -EFAULT because all syscalls won't return even if they encounters a
> missing page in the volatile range dropped by the VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
So yeah! I actually think (it's been a while now) I mentioned your work
to Taras (or maybe he mentioned it to me?), but it did seem like
userfaultfd would be a better solution for the style of fault handling
they were thinking about. (Especially as actually handling SIGBUS and
doing something sane in a large threaded application seems very
difficult.)

That said, explaining volatile ranges as a concept has been difficult
enough without mixing in other new concepts :), so I'm hesitant to tie
the functionality together until it's clear the userfaultfd approach
is likely to land. But maybe I need to take a closer look at it.

thanks
-john

2014-04-02 20:13:44

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 12:47 PM, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <[email protected]> wrote:
>>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>>>> That point beside, I think the other problem with the page-cleaning
>>>> volatility approach is that there are other awkward side effects. For
>>>> example: Say an application marks a range as volatile. One page in the
>>>> range is then purged. The application, due to a bug or otherwise,
>>>> reads the volatile range. This causes the page to be zero-filled in,
>>>> and the application silently uses the corrupted data (which isn't
>>>> great). More problematic though, is that by faulting the page in,
>>>> they've in effect lost the purge state for that page. When the
>>>> application then goes to mark the range as non-volatile, all pages are
>>>> present, so we'd return that no pages were purged. From an
>>>> application perspective this is pretty ugly.
>>>>
>>>> Johannes: Any thoughts on this potential issue with your proposal? Am
>>>> I missing something else?
>>> No, this is accurate. However, I don't really see how this is
>>> different than any other use-after-free bug. If you access malloc
>>> memory after free(), you might receive a SIGSEGV, you might see random
>>> data, you might corrupt somebody else's data. This certainly isn't
>>> nice, but it's not exactly new behavior, is it?
>> The part that troubles me is that I see the purged state as kernel
>> data being corrupted by userland in this case. The kernel will tell
>> userspace that no pages were purged, even though they were. Only
>> because userspace made an errant read of a page, and got garbage data
>> back.
> That sounds overly dramatic to me. First of all, this data still
> reflects accurately the actions of userspace in this situation. And
> secondly, the kernel does not rely on this data to be meaningful from
> a userspace perspective to function correctly.
<insert dramatic-chipmunk video w/ text overlay "errant read corrupted
volatile page purge state!!!!1">

Maybe you're right, but I feel this is the sort of thing application
developers would be surprised and annoyed by.


> It's really nothing but a use-after-free bug that has consequences for
> no-one but the faulty application. The thing that IS new is that even
> a read is enough to corrupt your data in this case.
>
> MADV_REVIVE could return 0 if all pages in the specified range were
> present, -Esomething if otherwise. That would be semantically sound
> even if userspace messes up.

So it's semantically just a combined mincore+dirty operation...
and nothing more?

What are other folks thinking about this? Although I don't particularly
like it, I probably could go along with Johannes' approach, forgoing
SIGBUS for zero-fill and adopting the semantics that are, in my mind, a
bit stranger. This would allow for ashmem-style behavior w/ the
additional write-clears-volatile-state and read-clears-purged-state
constraints (which I don't think would be problematic for Android, but
am not totally sure).

But I do worry that these semantics are easier for kernel-mm-developers
to grasp, but are much much harder for application developers to
understand.

Additionally, unless we could really leave access-after-volatile as
totally undefined behavior, this would lock us into O(page) behavior
and would remove the possibility of the O(log(ranges)) behavior Minchan
and I were able to get (admittedly with more complicated code - but
something I was hoping we'd be able to get back to after the base
semantics and interface behavior were understood and merged). Since
applications will have bugs and will access pages after marking them
volatile, we won't be able to get away with that sort of behavioral
flexibility.

thanks
-john

2014-04-02 22:44:49

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed 02-04-14 13:13:34, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
> > On Wed, Apr 02, 2014 at 12:01:00PM -0700, John Stultz wrote:
> >> On Wed, Apr 2, 2014 at 10:58 AM, Johannes Weiner <[email protected]> wrote:
> >>> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> >>>> That point beside, I think the other problem with the page-cleaning
> >>>> volatility approach is that there are other awkward side effects. For
> >>>> example: Say an application marks a range as volatile. One page in the
> >>>> range is then purged. The application, due to a bug or otherwise,
> >>>> reads the volatile range. This causes the page to be zero-filled in,
> >>>> and the application silently uses the corrupted data (which isn't
> >>>> great). More problematic though, is that by faulting the page in,
> >>>> they've in effect lost the purge state for that page. When the
> >>>> application then goes to mark the range as non-volatile, all pages are
> >>>> present, so we'd return that no pages were purged. From an
> >>>> application perspective this is pretty ugly.
> >>>>
> >>>> Johannes: Any thoughts on this potential issue with your proposal? Am
> >>>> I missing something else?
> >>> No, this is accurate. However, I don't really see how this is
> >>> different than any other use-after-free bug. If you access malloc
> >>> memory after free(), you might receive a SIGSEGV, you might see random
> >>> data, you might corrupt somebody else's data. This certainly isn't
> >>> nice, but it's not exactly new behavior, is it?
> >> The part that troubles me is that I see the purged state as kernel
> >> data being corrupted by userland in this case. The kernel will tell
> >> userspace that no pages were purged, even though they were. Only
> >> because userspace made an errant read of a page, and got garbage data
> >> back.
> > That sounds overly dramatic to me. First of all, this data still
> > reflects accurately the actions of userspace in this situation. And
> > secondly, the kernel does not rely on this data to be meaningful from
> > a userspace perspective to function correctly.
> <insert dramatic-chipmunk video w/ text overlay "errant read corrupted
> volatile page purge state!!!!1">
>
> Maybe you're right, but I feel this is the sort of thing application
> developers would be surprised and annoyed by.
>
>
> > It's really nothing but a use-after-free bug that has consequences for
> > no-one but the faulty application. The thing that IS new is that even
> > a read is enough to corrupt your data in this case.
> >
> > MADV_REVIVE could return 0 if all pages in the specified range were
> > present, -Esomething if otherwise. That would be semantically sound
> > even if userspace messes up.
>
> So its semantically more of just a combined mincore+dirty operation..
> and nothing more?
>
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adapting the semantics that are in my mind a
> bit stranger. This would allow for ashmem-like style behavior w/ the
> additional write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
>
> But I do worry that these semantics are easier for kernel-mm-developers
> to grasp, but are much much harder for application developers to
> understand.
Yeah, I have to admit that although the simplicity of the implementation
looks compelling, the interface from a userspace POV looks weird.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2014-04-07 05:24:24

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 12:36:38PM -0400, Johannes Weiner wrote:
> On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> > On 04/01/2014 04:01 PM, Dave Hansen wrote:
> > > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> > >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> > >>> Either way, optimistic volatile pointers are nowhere near as
> > >>> transparent to the application as the above description suggests,
> > >>> which makes this usecase not very interesting, IMO.
> > >> ... however, I think you're still derating the value way too much. The
> > >> case of user space doing elastic memory management is more and more
> > >> common, and for a lot of those applications it is perfectly reasonable
> > >> to either not do system calls or to have to devolatilize first.
> > > The SIGBUS is only in cases where the memory is set as volatile and
> > > _then_ accessed, right?
> > Not just set volatile and then accessed, but when a volatile page has
> > been purged and then accessed without being made non-volatile.
> >
> >
> > > John, this was something that the Mozilla guys asked for, right? Any
> > > idea why this isn't ever a problem for them?
> > So one of their use cases for it is for library text. Basically they
> > want to decompress a compressed library file into memory. Then they plan
> > to mark the uncompressed pages volatile, and then be able to call into
> > it. Ideally for them, the kernel would only purge cold pages, leaving
> > the hot pages in memory. When they traverse a purged page, they handle
> > the SIGBUS and patch the page up.
>
> How big are these libraries compared to overall system size?

One example of JIT usage I had is 5MB for just a simple node.js
service. Actually, I'm not sure whether it was JIT code or something
else; what I saw was that the vmas were rwxp, so I guess they were JIT.
Anyway, it was a really simple script but it consumed 5MB. That's
really big for embedded WebOS, because other, more complicated services
could be running in parallel on the system.

>
> > Now.. this is not what I'd consider a normal use case, but was hoping to
> > illustrate some of the more interesting uses and demonstrate the
> > interface's flexibility.
>
> I'm just dying to hear a "normal" use case then. :)
>
> > Also it provided a clear example of benefits to doing LRU based
> > cold-page purging rather than full-object purging. Though I think the
> > same could be demonstrated in a simpler case of a large cache of objects
> > that the application wants to mark volatile in one pass, unmarking
> > sub-objects as it needs.
>
> Agreed.
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

--
Kind regards,
Minchan Kim

2014-04-07 05:48:35

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <[email protected]> wrote:
> > On Tue, Apr 01, 2014 at 09:12:44PM -0700, John Stultz wrote:
> >> On 04/01/2014 04:01 PM, Dave Hansen wrote:
> >> > On 04/01/2014 02:35 PM, H. Peter Anvin wrote:
> >> >> On 04/01/2014 02:21 PM, Johannes Weiner wrote:
> >> > John, this was something that the Mozilla guys asked for, right? Any
> >> > idea why this isn't ever a problem for them?
> >> So one of their use cases for it is for library text. Basically they
> >> want to decompress a compressed library file into memory. Then they plan
> >> to mark the uncompressed pages volatile, and then be able to call into
> >> it. Ideally for them, the kernel would only purge cold pages, leaving
> >> the hot pages in memory. When they traverse a purged page, they handle
> >> the SIGBUS and patch the page up.
> >
> > How big are these libraries compared to overall system size?
>
> Mike or Taras would have to refresh my memory on this detail. My
> recollection is it mostly has to do with keeping the on-disk size of
> the library small, so it can load off of slow media very quickly.
>
> >> Now.. this is not what I'd consider a normal use case, but was hoping to
> >> illustrate some of the more interesting uses and demonstrate the
> >> interface's flexibility.
> >
> > I'm just dying to hear a "normal" use case then. :)
>
> So the more "normal" use case would be marking objects volatile and
> then non-volatile without accessing them in between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
>
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :), Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a
> page as volatile. Then the kernel could reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied and if any of the pages were missing, we could

I'd like to understand more clearly what you and Hannes are thinking.
Do you mean that when we unmark the range, we should re-dirty all of
the pages' ptes, or just SetPageDirty?
If we re-dirty the ptes, the soft-dirty people (i.e., CRIU) might be
unhappy because it could create a lot of diff.
If we just do SetPageDirty, it would invalidate the writeout-avoidance
logic for swapped pages that are already on swap. That might be minor,
though, and the SetPageDirty model would suit a shared-vrange
implementation. But how could we know that any pages were missing
at unmark time? Where would we keep that information?
It's no problem for vrange-anon because we can keep the information
in the pte, but what about vrange-file (i.e., vrange-shared)? A shadow
entry in the radix tree? What are you thinking?

Another major concern is still syscall overhead.
A page-based scheme makes the syscall slow, so I'm afraid users might
not want to use it any more. :(
Frankly speaking, we don't have a concrete user, so I'm not sure how
severe the overhead is, but we can easily imagine that in the future
someone might want to mark many GB of memory volatile.

But I can't insist on the range-based option because it has downsides,
too. If we don't use the page-based model, the reclaim path clearly has
a big overhead scanning virtual memory to find victim pages; the worst
case is a single page in a huge multi-GB vma, and that page might even
be in another zone. :(
If we tried to optimize that path in the future to avoid burning CPU,
it could become very complicated, and I'm not sure it would work well.
We already have a similar issue with compaction. ;-)

So it's really a dilemma.

> return a flag with the purged state. This had some different
> semantics than what I've been working with for a while (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> tmpfs/shm shared volatility that I'd eventually like to do.
>
> After thinking it over in the hallway, I talked through some of the
> details with Johannes, and there was one issue: while with anonymous
> memory we can still add a VM_VOLATILE flag on the vma to get SIGBUS
> semantics, with shared volatile ranges we don't have anything to hang
> a volatile flag on without adding some new vma-like structure to the
> address_space structure (much as we did in the past with earlier
> volatile range implementations). This would negate much of the point
> of using the dirty bits to simplify the shared volatility
> implementation.
>
> Thus Johannes is reasonably questioning the need for SIGBUS semantics,
> since if it wasn't needed, the simpler page-cleaning based volatility
> could potentially be used.

I think the SIGBUS scenario isn't common, but in the JIT case it is
necessary, and the amount of RAM consumed would never be small in the
embedded world.

>
>
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think its the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
>
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: Say an application marks a range as volatile. One page in the
> range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged. From an
> application perspective this is pretty ugly.
>
> Johannes: Any thoughts on this potential issue with your proposal? Am
> I missing something else?
>
> thanks
> -john
>

--
Kind regards,
Minchan Kim

2014-04-07 06:10:55

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

Hello Andrea,

On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> Hi everyone,
>
> On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > you have a third option you're thinking of, I'd of course be interested
> > in hearing it.
>
> I actually thought the way of being notified with a page fault (sigbus
> or whatever) was the most efficient way of using volatile ranges.
>
> Why call a syscall to find out whether you can still access the
> volatile range, if there was no VM pressure before the access?
> Syscalls are expensive; accessing the memory directly is not. Only if
> the page was actually missing and a page fault fired would you take
> the slowpath.

True.

>
> The usages I see for this are plenty, like for maintaining caches in
> memory that may be big and would be nice to discard if there's VM
> pressure, and uncompressed jpeg images sound like a candidate too. So
> the browser's size would shrink under VM pressure, instead of ending up
> swapping out uncompressed image data that can be regenerated more
> quickly with the CPU than with swapins.

That's really the typical case vrange is targeting.

>
> > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > try to exploit the fact that we get SIGBUS on purged page access (at
> > least on the user-space side) and will try to access pages that are
> > volatile until they are purged and try to then handle the SIGBUS to fix
> > things up. Those folks exploiting that will have to be particularly
> > careful not to pass volatile data to the kernel, and if they do they'll
> > have to be smart enough to handle the EFAULT, etc. That's really all
> > their problem, because they're being clever. :)
>
> I'm actually working on a feature that would solve the problem of
> syscalls accessing missing volatile pages. So you'd never see
> -EFAULT, because syscalls won't return even if they encounter a
> missing page in a volatile range dropped under VM pressure.
>
> It's called userfaultfd. You call sys_userfaultfd(flags) and it
> connects the current mm to a pseudo filedescriptor. The filedescriptor
> works similarly to eventfd but with a different protocol.
>
> You need a thread that will never access the userfault area with the
> CPU, that is responsible to poll on the userfaultfd and talk the
> userfaultfd protocol to fill-in missing pages. The userfault thread
> after a POLLIN event reads the virtual addresses of the fault that
> must have happened on some other thread of the same mm, and then
> writes back an "handled" virtual range into the fd, after the page (or
> pages if multiple) have been regenerated and mapped in with
> sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> swapping. Then depending on the "solved" range written back into the
> fd, the kernel will wake up the thread or threads that were waiting in
> kernel mode on the "handled" virtual range, and retry the fault
> without ever exiting kernel mode.

Sounds flexible.

>
> We need this in KVM for running the guest on memory that is on other
> nodes or other processes (postcopy live migration is the most common
> use case but there are others like memory externalization and
> cross-node KSM in the cloud, to keep a single copy of memory across
> multiple nodes and externalized to the VM and to the host node).
>
> This thread made me wonder if we could mix the two features and you
> would then depend on MADV_USERFAULT and userfaultfd to deliver to
> userland the "faults" happening on the volatile pages that have been
> purged as result of VM pressure.
>
> I'm just saying this after Johannes mentioned the issue with syscalls
> returning -EFAULT. Because that is the very issue that the userfaultfd
> is going to solve for the KVM migration thread.
>
> What I'm thinking now would be to mark the volatile range also
> MADV_USERFAULT and then calling userfaultfd and instead of having the
> cache regeneration "slow path" inside the SIGBUS handler, to run it in
> the userfault thread that polls the userfaultfd. Then you could write
> the volatile ranges to disk with a write() syscall (or use any other
> syscall on the volatile ranges), without having to worry about -EFAULT
> being returned because one page was discarded. And if MADV_USERFAULT
> is not called in combination with vrange syscalls, then it'd still
> work without the userfault, but with the vrange syscalls only.
>
> In short the idea would be to let the userfault code solve the fault
> delivery to userland for you, and make the vrange syscalls only focus
> on the page purging problem, without having to worry about what
> happens when something access a missing page.
>
> But if you don't intend to solve the syscall -EFAULT problem, well
> then probably the overlap is still as thin as I thought it was before
> (like also mentioned in the below link).

Sounds doable. I will look into your patch.
Thanks for the reminder!

>
> Thanks,
> Andrea
>
> PS. my last email about this from a more KVM centric point of view:
>
> http://www.spinics.net/lists/kvm/msg101449.html
>

--
Kind regards,
Minchan Kim

2014-04-07 06:19:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 03:27:44PM -0400, Johannes Weiner wrote:
> On Wed, Apr 02, 2014 at 08:31:13PM +0200, Andrea Arcangeli wrote:
> > Hi everyone,
> >
> > On Tue, Apr 01, 2014 at 09:03:57PM -0700, John Stultz wrote:
> > > So between zero-fill and SIGBUS, I think SIGBUS makes the most sense. If
> > > you have a third option you're thinking of, I'd of course be interested
> > > in hearing it.
> >
> > I actually thought the way of being notified with a page fault (sigbus
> > or whatever) was the most efficient way of using volatile ranges.
> >
> > Why call a syscall to find out whether you can still access the
> > volatile range, if there was no VM pressure before the access?
> > Syscalls are expensive; accessing the memory directly is not. Only if
> > the page was actually missing and a page fault fired would you take
> > the slowpath.
>
> Not everybody wants to actually come back for the data in the range,
> allocators and message passing applications just want to be able to
> reuse the memory mapping.
>
> By tying the volatility to the dirty bit in the page tables, an
> allocator could simply clear those bits once on free(). When malloc()
> hands out this region again, the user is expected to write, which will
> either overwrite the old page, or, if it was purged, fault in a fresh
> zero page. But there is no second syscall needed to clear volatility.
>
> > > Now... once you've chosen SIGBUS semantics, there will be folks who will
> > > try to exploit the fact that we get SIGBUS on purged page access (at
> > > least on the user-space side) and will try to access pages that are
> > > volatile until they are purged and try to then handle the SIGBUS to fix
> > > things up. Those folks exploiting that will have to be particularly
> > > careful not to pass volatile data to the kernel, and if they do they'll
> > > have to be smart enough to handle the EFAULT, etc. That's really all
> > > their problem, because they're being clever. :)
> >
> > I'm actually working on a feature that would solve the problem of
> > syscalls accessing missing volatile pages. So you'd never see
> > -EFAULT, because syscalls won't return even if they encounter a
> > missing page in a volatile range dropped under VM pressure.
> >
> > It's called userfaultfd. You call sys_userfaultfd(flags) and it
> > connects the current mm to a pseudo filedescriptor. The filedescriptor
> > works similarly to eventfd but with a different protocol.
> >
> > You need a thread that will never access the userfault area with the
> > CPU, that is responsible to poll on the userfaultfd and talk the
> > userfaultfd protocol to fill-in missing pages. The userfault thread
> > after a POLLIN event reads the virtual addresses of the fault that
> > must have happened on some other thread of the same mm, and then
> > writes back an "handled" virtual range into the fd, after the page (or
> > pages if multiple) have been regenerated and mapped in with
> > sys_remap_anon_pages(), mremap or equivalent atomic pagetable page
> > swapping. Then depending on the "solved" range written back into the
> > fd, the kernel will wake up the thread or threads that were waiting in
> > kernel mode on the "handled" virtual range, and retry the fault
> > without ever exiting kernel mode.
> >
> > We need this in KVM for running the guest on memory that is on other
> > nodes or other processes (postcopy live migration is the most common
> > use case but there are others like memory externalization and
> > cross-node KSM in the cloud, to keep a single copy of memory across
> > multiple nodes and externalized to the VM and to the host node).
> >
> > This thread made me wonder if we could mix the two features and you
> > would then depend on MADV_USERFAULT and userfaultfd to deliver to
> > userland the "faults" happening on the volatile pages that have been
> > purged as result of VM pressure.
> >
> > I'm just saying this after Johannes mentioned the issue with syscalls
> > returning -EFAULT. Because that is the very issue that the userfaultfd
> > is going to solve for the KVM migration thread.
> >
> > What I'm thinking now would be to mark the volatile range also
> > MADV_USERFAULT and then calling userfaultfd and instead of having the
> > cache regeneration "slow path" inside the SIGBUS handler, to run it in
> > the userfault thread that polls the userfaultfd. Then you could write
> > the volatile ranges to disk with a write() syscall (or use any other
> > syscall on the volatile ranges), without having to worry about -EFAULT
> > being returned because one page was discarded. And if MADV_USERFAULT
> > is not called in combination with vrange syscalls, then it'd still
> > work without the userfault, but with the vrange syscalls only.
> >
> > In short the idea would be to let the userfault code solve the fault
> > delivery to userland for you, and make the vrange syscalls only focus
> > on the page purging problem, without having to worry about what
> > happens when something access a missing page.
>
> Yes, the two seem certainly combinable to me.
>
> madvise(MADV_FREE | MADV_USERFAULT) to allow purging and userspace
> fault handling. In the fault slowpath, you can then regenerate any
> missing data and do MADV_FREE again if it should remain volatile. And
> again, any actual writes to the region would clear volatility because
> now the cache copy changed and discarding it would mean losing state.

There's another scenario that the above can't cover: someone might
want to keep a range volatile permanently, until unmarking, so they
can generate cache pages in that range freely without further syscalls.
I mean, the above suggestion covers pages that were already mapped
when the syscall was called, but it can't cover pages faulted in
afterwards, so I think the vrange syscall is still needed.

>

--
Kind regards,
Minchan Kim

2014-04-08 03:32:37

by Kevin Easton

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <[email protected]> wrote:
> > I'm just dying to hear a "normal" use case then. :)
>
> So the more "normal" use case would be marking objects volatile and
> then non-volatile without accessing them in between. In this case the
> zero-fill vs SIGBUS semantics don't really matter; it's really just a
> trade-off in how we handle applications deviating (intentionally or
> not) from this use case.
>
> So to maybe flesh out the context here for folks who are following
> along (but weren't in the hallway at LSF :), Johannes made a fairly
> interesting proposal (Johannes: please correct me where I'm maybe
> slightly off) to use only the dirty bits of the ptes to mark a
> page as volatile. Then the kernel could reclaim these clean pages as
> it needed, and when we marked the range as non-volatile, the pages
> would be re-dirtied and if any of the pages were missing, we could
> return a flag with the purged state. This had some different
> semantics than what I've been working with for a while (for example,
> any writes to pages would implicitly clear volatility), so I wasn't
> completely comfortable with it, but figured I'd think about it to see
> if it could be done. Particularly since it would in some ways simplify
> tmpfs/shm shared volatility that I'd eventually like to do.
...
> Now, while for the case I'm personally most interested in (ashmem),
> zero-fill would technically be ok, since that's what Android does.
> Even so, I don't think it's the best approach for the interface, since
> applications may end up quite surprised by the results when they
> accidentally don't follow the "don't touch volatile pages" rule.
>
> That point aside, I think the other problem with the page-cleaning
> volatility approach is that there are other awkward side effects. For
> example: Say an application marks a range as volatile. One page in the
> range is then purged. The application, due to a bug or otherwise,
> reads the volatile range. This causes the page to be zero-filled in,
> and the application silently uses the corrupted data (which isn't
> great). More problematic though, is that by faulting the page in,
> they've in effect lost the purge state for that page. When the
> application then goes to mark the range as non-volatile, all pages are
> present, so we'd return that no pages were purged. From an
> application perspective this is pretty ugly.

The write-implicitly-clears-volatile semantics would actually be
an advantage for some use cases. If you have a volatile cache of
many sub-page-size objects, the application can just include at
the start of each page "int present, in_use;". "present" is set
to non-zero before marking volatile, and when the application wants
to unmark it as volatile, it writes to "in_use" and tests the value
of "present". No need for a syscall at all, although it does take a
minor fault.

The syscall would be better for the case of large objects, though.

Or is that fatally flawed?

- Kevin

2014-04-08 03:38:14

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/07/2014 09:32 PM, Kevin Easton wrote:
> On Wed, Apr 02, 2014 at 10:40:16AM -0700, John Stultz wrote:
>> On Wed, Apr 2, 2014 at 9:36 AM, Johannes Weiner <[email protected]> wrote:
>>> I'm just dying to hear a "normal" use case then. :)
>> So the more "normal" use case would be marking objects volatile and
>> then non-volatile without accessing them in between. In this case the
>> zero-fill vs SIGBUS semantics don't really matter; it's really just a
>> trade-off in how we handle applications deviating (intentionally or
>> not) from this use case.
>>
>> So to maybe flesh out the context here for folks who are following
>> along (but weren't in the hallway at LSF :), Johannes made a fairly
>> interesting proposal (Johannes: please correct me where I'm maybe
>> slightly off) to use only the dirty bits of the ptes to mark a
>> page as volatile. Then the kernel could reclaim these clean pages as
>> it needed, and when we marked the range as non-volatile, the pages
>> would be re-dirtied and if any of the pages were missing, we could
>> return a flag with the purged state. This had some different
>> semantics than what I've been working with for a while (for example,
>> any writes to pages would implicitly clear volatility), so I wasn't
>> completely comfortable with it, but figured I'd think about it to see
>> if it could be done. Particularly since it would in some ways simplify
>> tmpfs/shm shared volatility that I'd eventually like to do.
> ...
>> Now, while for the case I'm personally most interested in (ashmem),
>> zero-fill would technically be ok, since that's what Android does.
>> Even so, I don't think it's the best approach for the interface, since
>> applications may end up quite surprised by the results when they
>> accidentally don't follow the "don't touch volatile pages" rule.
>>
>> That point aside, I think the other problem with the page-cleaning
>> volatility approach is that there are other awkward side effects. For
>> example: Say an application marks a range as volatile. One page in the
>> range is then purged. The application, due to a bug or otherwise,
>> reads the volatile range. This causes the page to be zero-filled in,
>> and the application silently uses the corrupted data (which isn't
>> great). More problematic though, is that by faulting the page in,
>> they've in effect lost the purge state for that page. When the
>> application then goes to mark the range as non-volatile, all pages are
>> present, so we'd return that no pages were purged. From an
>> application perspective this is pretty ugly.
> The write-implicitly-clears-volatile semantics would actually be
> an advantage for some use cases. If you have a volatile cache of
> many sub-page-size objects, the application can just include at
> the start of each page "int present, in_use;". "present" is set
> to non-zero before marking volatile, and when the application wants
> unmark as volatile it writes to "in_use" and tests the value of
> "present". No need for a syscall at all, although it does take a
> minor fault.
>
> The syscall would be better for the case of large objects, though.
>
> Or is that fatally flawed?

Well, as you note, each object would then have to be page size or
smaller, which limits some of the potential use cases.

However, these semantics would be a better match for the MADV_FREE
proposal Minchan is pushing, so this method would work fine there.

thanks
-john

2014-04-11 19:32:31

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH 0/5] Volatile Ranges (v12) & LSF-MM discussion fodder

On 04/02/2014 01:13 PM, John Stultz wrote:
> On 04/02/2014 12:47 PM, Johannes Weiner wrote:
>
>> It's really nothing but a use-after-free bug that has consequences for
>> no-one but the faulty application. The thing that IS new is that even
>> a read is enough to corrupt your data in this case.
>>
>> MADV_REVIVE could return 0 if all pages in the specified range were
>> present, -Esomething if otherwise. That would be semantically sound
>> even if userspace messes up.
> So it's semantically just a combined mincore+dirty operation, and
> nothing more?
>
> What are other folks thinking about this? Although I don't particularly
> like it, I probably could go along with Johannes' approach, forgoing
> SIGBUS for zero-fill and adopting the semantics that are in my mind a
> bit stranger. This would allow for ashmem-like style behavior w/ the
> additional write-clears-volatile-state and read-clears-purged-state
> constraints (which I don't think would be problematic for Android, but
> am not totally sure).
>
> But I do worry that these semantics are easier for kernel-mm-developers
> to grasp, but are much much harder for application developers to
> understand.

So I don't feel like we've gotten enough feedback for consensus here.

Thus, to at least address other issues pointed out at LSF-MM, I'm going
to shortly send out a v13 of the patchset which keeps with the previous
approach instead of adopting Johannes' suggested approach here.

If folks do prefer Johannes' approach, please speak up as I'm willing to
give it a whirl, despite my concerns about the subtle semantics.

thanks
-john