Hi all,
<resending without unwanted HTML-ifying - apologies for the noise if
this appears twice for you!>
Recent changes have restricted a userspace interface used by our
product; specifically, a security patch to require CAP_SYS_ADMIN when
opening /proc/PID/pagemap
(https://github.com/torvalds/linux/commit/ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce,
original LKML discussion here: https://lkml.org/lkml/2015/3/9/864).
Although I've marked this as a "Regression", we do realise there are
legitimate security concerns over the original implementation of this
interface. Still, given the kernel's strong stance on preserving
userspace interfaces, we thought we ought to flag this quickly as
something that has changed application-relevant behaviour.
We believe this change came into released kernels with Linux 4.0. We
first observed problems when testing on Ubuntu 15.04 this week; I see
the patch is now backported to the various -stable kernel lines, so
I'd expect it to show up in other distros in due course. The obvious
solution (to simply run with CAP_SYS_ADMIN) is quite undesirable for
our product, which is a debugger; we're expecting our users to run
without special privileges.
In our use of /proc/PID/pagemap, we currently make use of the physical
pageframe addresses. We should be able to work with a scrambled
representation of these (Andy Lutomirski suggested this in the
original discussion - https://lkml.org/lkml/2015/3/16/1273) so long as
the scrambling remained consistent during the lifetime of the open
pagemap file. Alternatively, if physical addresses were simply zeroed
(also suggested by Pavel Emelyanov -
https://lkml.org/lkml/2015/3/9/871) we would be able to change our
code to rely only on the soft-dirty flag and thus still work
correctly.
I propose to follow up with a patch that provides unprivileged access
to /proc/PID/pagemap with the physical pageframe addresses zeroed.
Would this be an acceptable approach?
Thank you,
Mark Williamson
---
Undo Software - http://undo-software.com/
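
For readers unfamiliar with the interface, here is a minimal sketch of
how a userspace tool might decode one 64-bit pagemap entry. The bit
positions follow the kernel's pagemap documentation (bits 0-54 hold the
PFN when the page is present, bit 55 is soft-dirty, bit 62 is swapped,
bit 63 is present). The helper names are made up for illustration and
are not taken from any code discussed in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout of one 64-bit /proc/PID/pagemap entry (kernel documentation). */
#define PM_PFN_MASK   ((UINT64_C(1) << 55) - 1) /* bits 0-54: PFN if present */
#define PM_SOFT_DIRTY (UINT64_C(1) << 55)       /* bit 55: PTE is soft-dirty */
#define PM_SWAPPED    (UINT64_C(1) << 62)       /* bit 62: page is swapped   */
#define PM_PRESENT    (UINT64_C(1) << 63)       /* bit 63: page is present   */

/* Byte offset in the pagemap file of the entry describing vaddr. */
static inline uint64_t pagemap_offset(uint64_t vaddr, uint64_t page_size)
{
    assert(page_size != 0);
    return (vaddr / page_size) * sizeof(uint64_t);
}

static inline bool pm_present(uint64_t entry)
{
    return (entry & PM_PRESENT) != 0;
}

static inline bool pm_soft_dirty(uint64_t entry)
{
    return (entry & PM_SOFT_DIRTY) != 0;
}

/* The PFN field is only meaningful for present pages; report 0 otherwise. */
static inline uint64_t pm_pfn(uint64_t entry)
{
    return pm_present(entry) ? (entry & PM_PFN_MASK) : 0;
}
```

With the PFN field zeroed for unprivileged readers, pm_pfn() would
always return 0, while pm_present() and pm_soft_dirty() would keep
working.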
On 24 April 2015 at 08:01, Mark Williamson
<[email protected]> wrote:
> In our use of /proc/PID/pagemap, we currently make use of the physical
> pageframe addresses. We should be able to work with a scrambled
> representation of these (Andy Lutomirski suggested this in the
> original discussion - https://lkml.org/lkml/2015/3/16/1273) so long as
> the scrambling remained consistent during the lifetime of the open
> pagemap file. Alternatively, if physical addresses were simply zeroed
> (also suggested by Pavel Emelyanov -
> https://lkml.org/lkml/2015/3/9/871) we would be able to change our
> code to rely only on the soft-dirty flag and thus still work
> correctly.
I'm curious, what do you use the physical page addresses for?
Since you pointed to http://undo-software.com, which talks about
reversible debugging tools, I can guess you would use the soft-dirty
flag to implement copy-on-write snapshotting. I'm guessing you might
use physical page addresses for determining when the same page is
mapped twice (in the same process or different processes)?
Cheers,
Mark
On Fri, Apr 24, 2015 at 7:55 AM, Mark Williamson
<[email protected]> wrote:
>
> Although I've marked this as a "Regression", we do realise there are
> legitimate security concerns over the original implementation of this
> interface. Still, given the kernel's strong stance on preserving userspace
> interfaces, we thought we ought to flag this quickly as something that has
> changed application-relevant behaviour.
So the one exception to the regression rule is "security fixes", but
even for security fixes we do try to be as reasonable as humanly
possible to make them not break things.
Now, as you mentioned, one option is to not outright disallow accesses
to the /proc/PID/pagemap, but to at least hide the page frame numbers.
However, I don't believe that we have a good enough scrambling model
to make that reasonable. Remember: any attacker will be able to see
our scrambling code, so it would need to be both cryptographically
secure *and* use a truly random per-VM secret key. Quite frankly,
that's a _lot_ of effort for dubious gain...
So the "just show physical addresses as zero for non-root users"
(instead of the outright ban on opening the file) is likely the only
really viable alternative.
It sounds like that could work for you. So if you can modify the app
to do that, and send me a tested kernel patch that moves the
permission check into the read phase (remember to use the open-time
credentials in "file->f_cred" rather than the read-time credentials in
"current" - otherwise you can trick some suid program to read the file
that an unauthorized user opened), then we can have this fixed. Does
that sound reasonable?
Linus
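
A rough userspace model of the behaviour proposed above, for
illustration only: the privilege decision is captured once, when the
file is opened, and every later read masks the PFN bits for
unprivileged openers. struct pagemap_handle and filter_entry() are
hypothetical names. A real kernel patch would consult file->f_cred as
described above; this sketch is not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PFN_MASK ((UINT64_C(1) << 55) - 1) /* bits 0-54 of a pagemap entry */

/* Hypothetical stand-in for an open pagemap file. */
struct pagemap_handle {
    /* Recorded at open() time, never re-evaluated at read() time. */
    bool opener_had_cap_sys_admin;
};

/*
 * Returns the entry as the reader is allowed to see it: unchanged for a
 * privileged opener, with the PFN field zeroed for everyone else. The
 * flag bits (present, swapped, soft-dirty) are left intact either way.
 */
static uint64_t filter_entry(const struct pagemap_handle *h, uint64_t raw)
{
    assert(h != NULL);
    return h->opener_had_cap_sys_admin ? raw : (raw & ~PFN_MASK);
}
```

Because the decision lives in the handle rather than in whoever calls
read(), passing the open file to a more privileged process does not
reveal anything the opener could not see.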
On Fri, Apr 24, 2015 at 9:08 AM, Linus Torvalds
<[email protected]> wrote:
> On Fri, Apr 24, 2015 at 7:55 AM, Mark Williamson
> <[email protected]> wrote:
>>
>> Although I've marked this as a "Regression", we do realise there are
>> legitimate security concerns over the original implementation of this
>> interface. Still, given the kernel's strong stance on preserving userspace
>> interfaces, we thought we ought to flag this quickly as something that has
>> changed application-relevant behaviour.
>
> So the one exception to the regression rule is "security fixes", but
> even for security fixes we do try to be as reasonable as humanly
> possible to make them not break things.
>
> Now, as you mentioned, one option is to not outright disallow accesses
> to the /proc/PID/pagemap, but to at least hide the page frame numbers.
> However, I don't believe that we have a good enough scrambling model
> to make that reasonable. Remember: any attacker will be able to see
> our scrambling code, so it would need to be both cryptographically
> secure *and* use a truly random per-VM secret key. Quite frankly,
> that's a _lot_ of effort for dubious gain...
Even though I've been accused (correctly?) of suggesting that, I'm not
sure I like it anymore. Suppose I map some anonymous memory, learn
its (scrambled) pfn, then unmap it and remap a setuid file. Now I can
tell whether I've mapped the setuid file at the same pfn that was
mapped as my anonymous memory. IIRC that's sufficient for one of the
variants of Mark's attack.
--Andy
On Fri, Apr 24, 2015 at 9:10 AM, Andy Lutomirski <[email protected]> wrote:
>
> Even though I've been accused (correctly?) of suggesting that, I'm not
> sure I like it anymore. Suppose I map some anonymous memory, learn
> its (scrambled) pfn, then unmap it and remap a setuid file. Now I can
> tell whether I've mapped the setuid file at the same pfn that was
> mapped as my anonymous memory. IIRC that's sufficient for one of the
> variants of Mark's attack.
Ack. So we really do have to zero out the pfn entirely for security
reasons, and not just because it's less effort.
Linus
Hi Mark,
On Fri, Apr 24, 2015 at 4:26 PM, Mark Seaborn <[email protected]> wrote:
> I'm curious, what do you use the physical page addresses for?
>
> Since you pointed to http://undo-software.com, which talks about
> reversible debugging tools, I can guess you would use the soft-dirty
> flag to implement copy-on-write snapshotting. I'm guessing you might
> use physical page addresses for determining when the same page is
> mapped twice (in the same process or different processes)?
That's pretty much it. Actually, we're effectively using the physical
addresses to emulate soft-dirty. For certain operations (e.g. some
system calls) we need to track what memory has changed since we last
looked at the process state. We have a mechanism that forks a child
process, runs the system call, then refers to pagemap to figure out
what's been modified.
Currently, our mechanism compares the physical addresses of pages
before and after the syscall so that we can see which pages got CoWed.
This is perhaps a slightly "unconventional" use of the interface but
we support kernels that predate the soft-dirty mechanism and (as far
as we know) this is probably the best way we can answer "What got
changed?" on those releases.
Using the soft-dirty mechanism where available should make our code
both cleaner and faster, so if we can fix the pagemap file to allow
that then we'll be quite happy!
Cheers,
Mark
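
The comparison described above can be sketched roughly as follows,
assuming two snapshots of pagemap entries covering the same virtual
range, one taken before the system call and one after. The function
name and the snapshot arrays are hypothetical; this is not Undo's
actual code, only an illustration of the idea of spotting CoW by a
change of physical frame.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PM_PFN_MASK ((UINT64_C(1) << 55) - 1) /* bits 0-54: PFN    */
#define PM_PRESENT  (UINT64_C(1) << 63)       /* bit 63: present   */

/*
 * Stores in changed[] the index of every page whose backing frame differs
 * between the two snapshots (or which appeared/disappeared), and returns
 * how many such pages were found. changed[] must have room for n entries.
 */
static size_t find_cowed_pages(const uint64_t *before, const uint64_t *after,
                               size_t n, size_t *changed)
{
    size_t count = 0;

    assert(before != NULL && after != NULL && changed != NULL);
    for (size_t i = 0; i < n; i++) {
        int was_present = (before[i] & PM_PRESENT) != 0;
        int is_present = (after[i] & PM_PRESENT) != 0;

        if (was_present != is_present ||
            (is_present &&
             (before[i] & PM_PFN_MASK) != (after[i] & PM_PFN_MASK)))
            changed[count++] = i;
    }
    return count;
}
```

Once the PFN field reads as zero, every present page compares equal,
which is why this approach stops working for unprivileged users.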
Hi Linus,
Thanks for responding so quickly!
On Fri, Apr 24, 2015 at 5:08 PM, Linus Torvalds
<[email protected]> wrote:
> So the one exception to the regression rule is "security fixes", but
> even for security fixes we do try to be as reasonable as humanly
> possible to make them not break things.
Understood - there are clear reasons something had to be done here.
> Now, as you mentioned, one option is to not outright disallow accesses
> to the /proc/PID/pagemap, but to at least hide the page frame numbers.
> However, I don't believe that we have a good enough scrambling model
> to make that reasonable. Remember: any attacker will be able to see
> our scrambling code, so it would need to be both cryptographically
> secure *and* use a truly random per-VM secret key. Quite frankly,
> that's a _lot_ of effort for dubious gain...
*nod*
> So the "just show physical addresses as zero for non-root users"
> (instead of the outright ban on opening the file) is likely the only
> really viable alternative.
>
> It sounds like that could work for you. So if you can modify the app
> to do that, and send me a tested kernel patch that moves the
> permission check into the read phase (remember to use the open-time
> credentials in "file->f_cred" rather than the read-time credentials in
> "current" - otherwise you can trick some suid program to read the file
> that an unauthorized user opened), then we can have this fixed. Does
> that sound reasonable?
That sounds very reasonable, thank you! We'll cook up a patch and get
back to you.
Thanks,
Mark
Hi Andy,
On Fri, Apr 24, 2015 at 5:10 PM, Andy Lutomirski <[email protected]> wrote:
> Even though I've been accused (correctly?) of suggesting that, I'm not
> sure I like it anymore. Suppose I map some anonymous memory, learn
> its (scrambled) pfn, then unmap it and remap a setuid file. Now I can
> tell whether I've mapped the setuid file at the same pfn that was
> mapped as my anonymous memory. IIRC that's sufficient for one of the
> variants of Mark's attack.
In fairness, you may have mentioned it but it's entirely possible you
didn't originate the suggestion and I quoted out of context. Sorry
for implicating you ;-)
That's an attack that I hadn't considered when thinking about this
stuff. Zeroing the page frame numbers is an easier patch, so
arguments in favour of that are a happy answer as far as I'm
concerned!
Thanks,
Mark
Hi all,
We've been investigating further and found a snag with the PFN-hiding
approach discussed last week - looks like it won't be enough on all
the architectures we support. Our product runs on x86_32, x86_64 and
ARM. For now, it looks like soft-dirty is only available on x86_64.
A patch that simply zeros out the physical addresses in
/proc/PID/pagemap will therefore help us on x86_64 but we'll still
have problems on other platforms[1].
For context, we were previously using pagemap as a cross-platform way
to get soft-dirty-like functionality. Specifically, to ask "did a
process write to any pages since fork()" by comparing addresses and
deducing where CoW must have occurred. In the absence of soft-dirty
and the physical addresses, it looks like we can't figure that out
with the remaining information in pagemap.
If the pagemap file included the "writeable" bit from the PTE, we
think we'd have all the information required to deduce what we need
(although I realise that's a bit of a nasty workaround). If I
proposed including the PTE protection bits in pagemap, would that be
controversial? I'm guessing yes but thought it was worth a shot ;-)
Would anybody be able to suggest a more tasteful approach?
Thanks,
Mark
[1] I'd note that using soft-dirty is clearly the right approach for
us on x64 where available, and that ideally we'd use it on other
architectures too - cross-arch support for soft-dirty is a slightly
different discussion, for which I hope to post another thread.
On Fri, Apr 24, 2015 at 5:43 PM, Mark Williamson
<[email protected]> wrote:
> Hi Mark,
>
> On Fri, Apr 24, 2015 at 4:26 PM, Mark Seaborn <[email protected]> wrote:
>> I'm curious, what do you use the physical page addresses for?
>>
>> Since you pointed to http://undo-software.com, which talks about
>> reversible debugging tools, I can guess you would use the soft-dirty
>> flag to implement copy-on-write snapshotting. I'm guessing you might
>> use physical page addresses for determining when the same page is
>> mapped twice (in the same process or different processes)?
>
> That's pretty much it. Actually, we're effectively using the physical
> addresses to emulate soft-dirty. For certain operations (e.g. some
> system calls) we need to track what memory has changed since we last
> looked at the process state. We have a mechanism that forks a child
> process, runs the system call, then refers to pagemap to figure out
> what's been modified.
>
> Currently, our mechanism compares the physical addresses of pages
> before and after the syscall so that we can see which pages got CoWed.
> This is perhaps a slightly "unconventional" use of the interface but
> we support kernels that predate the soft-dirty mechanism and (as far
> as we know) this is probably the best way we can answer "What got
> changed?" on those releases.
>
> Using the soft-dirty mechanism where available should make our code
> both cleaner and faster, so if we can fix the pagemap file to allow
> that then we'll be quite happy!
>
> Cheers,
> Mark
Hi again,
On Wed, Apr 29, 2015 at 7:44 PM, Mark Williamson
<[email protected]> wrote:
> We've been investigating further and found a snag with the PFN-hiding
> approach discussed last week - looks like it won't be enough on all
> the architectures we support. Our product runs on x86_32, x86_64 and
> ARM. For now, it looks like soft-dirty is only available on x86_64.
> A patch that simply zeros out the physical addresses in
> /proc/PID/pagemap will therefore help us on x86_64 but we'll still
> have problems on other platforms[1].
Another thought occurs - although we *strictly* want to know "what got
written to", we might be able to get by with a superset of that, such
as "what got accessed, read or write"...
Thus, we could investigate clearing the Referenced bit (which I
understand we can do through /proc/PID/clear_refs) and then just treat
any subsequently-referenced pages as being potentially modified. It's
not ideal but it might be enough to get by...
I still feel a little nervous with this, since we support distros
(e.g. RHEL5) that are too old to have clear_refs. Still, it would
result in less disruption to the format of pagemap.
Thanks,
Mark
On Wed, Apr 29, 2015 at 07:44:57PM +0100, Mark Williamson wrote:
> Hi all,
>
> We've been investigating further and found a snag with the PFN-hiding
> approach discussed last week - looks like it won't be enough on all
> the architectures we support. Our product runs on x86_32, x86_64 and
> ARM. For now, it looks like soft-dirty is only available on x86_64.
> A patch that simply zeros out the physical addresses in
> /proc/PID/pagemap will therefore help us on x86_64 but we'll still
> have problems on other platforms[1].
>
> For context, we were previously using pagemap as a cross-platform way
> to get soft-dirty-like functionality. Specifically, to ask "did a
> process write to any pages since fork()" by comparing addresses and
> deducing where CoW must have occurred. In the absence of soft-dirty
> and the physical addresses, it looks like we can't figure that out
> with the remaining information in pagemap.
>
> If the pagemap file included the "writeable" bit from the PTE, we
> think we'd have all the information required to deduce what we need
> (although I realise that's a bit of a nasty workaround). If I
> proposed including the PTE protection bits in pagemap, would that be
> controversial? I'm guessing yes but thought it was worth a shot ;-)
> Would anybody be able to suggest a more tasteful approach?
Hmm... I have a hard time understanding how the writable bit is enough
to get soft-dirty-like functionality.
Let's say we have an anonymous mapping with COW set up after fork(). Its
PTEs are non-writable so that write-protect faults trigger COW. But you
can easily end up with the same non-writable PTE after breaking COW:
fork() again, or mprotect(PROT_READ) followed by
mprotect(PROT_READ|PROT_WRITE).
?
--
Kirill A. Shutemov
Hi,
On Wed, Apr 29, 2015 at 8:36 PM, Kirill A. Shutemov
<[email protected]> wrote:
> On Wed, Apr 29, 2015 at 07:44:57PM +0100, Mark Williamson wrote:
>> Hi all,
... snip ...
>> For context, we were previously using pagemap as a cross-platform way
>> to get soft-dirty-like functionality. Specifically, to ask "did a
>> process write to any pages since fork()" by comparing addresses and
>> deducing where CoW must have occurred. In the absence of soft-dirty
>> and the physical addresses, it looks like we can't figure that out
>> with the remaining information in pagemap.
>>
>> If the pagemap file included the "writeable" bit from the PTE, we
>> think we'd have all the information required to deduce what we need
>> (although I realise that's a bit of a nasty workaround). If I
>> proposed including the PTE protection bits in pagemap, would that be
>> controversial? I'm guessing yes but thought it was worth a shot ;-)
>> Would anybody be able to suggest a more tasteful approach?
>
> Emm.. I have hard time to understand how writable bit is enough to get
> soft-dirty-alike functionality.
In the general case, you are of course correct - in our specific case
I *think* we'd be able to manage OK ... (see below).
> Let's say we have anon-mapping with COW setup after the fork(). It's not
> writable PTEs to trigger COW on wp faults. But you can easily get to the
> same non-writable PTE after breaking COW: fork() again or
> mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) back.
I believe we'll be able to get away with this in our particular
usecase. The process is running in our debugger at the time and so we
can interpose on the system calls that are happening. That should
give us the opportunity to check for CoW-breaking before the debuggee
is allowed to alter page protections itself.
It ends up not being full soft-dirty behaviour but it's similar enough
to tell us what we need to know.
Cheers,
Mark
On Wed, Apr 29, 2015 at 12:36 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> Emm.. I have hard time to understand how writable bit is enough to get
> soft-dirty-alike functionality.
I don't think it is.
For anonymous pages, maybe you can play tricks with comparing the page
'anon_vma' with the vma->anon_vma.
I haven't really thought that through, but does something like
  static inline bool page_is_dirty_in_vma(struct page *page,
                                          struct vm_area_struct *vma)
  {
          struct anon_vma *anon_vma = vma->anon_vma;

          return page->mapping == (void *)anon_vma + PAGE_MAPPING_ANON;
  }
end up working as a "page has been dirtied in this mapping"?
If the page came from another process and hasn't been written to, it
will have the anon_vma pointing to the originating vma.
I may be high on some bad drugs, though. As mentioned, I didn't really
think this through.
Linus
On Wed, Apr 29, 2015 at 11:33 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 29, 2015 at 12:36 PM, Kirill A. Shutemov
> <[email protected]> wrote:
>>
>> Emm.. I have hard time to understand how writable bit is enough to get
>> soft-dirty-alike functionality.
>
> I don't think it is.
>
> For anonymous pages, maybe you can play tricks with comparing the page
> 'anon_vma' with the vma->anon_vma.
>
> I haven't really thought that through, but does something like
>
> static inline bool page_is_dirty_in_vma(struct page *page, struct
> vm_area_struct *vma)
> {
> struct anon_vma *anon_vma = vma->anon_vma;
>
> return page->mapping == (void *)anon_vma + PAGE_MAPPING_ANON;
> }
>
> end up working as a "page has been dirtied in this mapping"?
This is no longer true. After the recent fixes for "anon_vma endless
growing", a new vma might reuse an old anon_vma from a grandparent vma.
On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
>
> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> might reuse old anon_vma from grandparent vma.
Oh well. I guess that was too simple.
If Mark is ok with the rule that "it's not reliable if you have two
nested forks" (ie it only works if you exec for every fork you do), it
should still work, right? It sounds like Mark doesn't necessarily need
to handle the *generic* case.
Linus
On Wed, Apr 29, 2015 at 02:02:01PM -0700, Linus Torvalds wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
> >
> > This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> > might reuse old anon_vma from grandparent vma.
>
> Oh well. I guess that was too simple.
>
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.
This sounds too ugly to be exposed as ABI.
--
Kirill A. Shutemov
On Wed, Apr 29, 2015 at 2:05 PM, Kirill A. Shutemov
<[email protected]> wrote:
>
> This sounds too ugly to be exposed it as ABI.
Oh, pretty it ain't. However, regressions in many ways are worse. If
it makes it possible to not regress...
Linus
On Wed, Apr 29, 2015 at 02:18:49PM -0700, Linus Torvalds wrote:
> On Wed, Apr 29, 2015 at 2:05 PM, Kirill A. Shutemov
> <[email protected]> wrote:
> >
> > This sounds too ugly to be exposed it as ABI.
>
> Oh, pretty it ain't. However, regressions in many ways are worse. If
> it makes it possible to not regress...
One idea is to extend kcmp(2) with KCMP_PAGE. idx1 and idx2 are virtual
addresses in two processes. It returns 0 if the addresses point to the
same page and 3 otherwise.
Would it be enough for the use case?
I guess it could be too slow to check one page at a time...
Invent new kcmpv(2)? ;)
--
Kirill A. Shutemov
On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
>>
>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> might reuse old anon_vma from grandparent vma.
>
> Oh well. I guess that was too simple.
>
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.
What about exposing a shared/exclusive bit in pagemap: 1 if
page_mapcount() > 1, otherwise 0 (or vice versa)?
Seems like this should work for detecting CoWed pages in the child mm.
On Wed, Apr 29, 2015 at 10:02 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
>>
>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> might reuse old anon_vma from grandparent vma.
>
> Oh well. I guess that was too simple.
>
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.
Yes, it sounds like that should be OK for us. Our usecase is pretty
restricted, so we're a long way off requiring a generic solution.
Our code will always fork() a fresh child in which to monitor memory
changes. We run the operations we're interested in, use pagemap to
figure out "what changed" (by comparing whether the pagemap_entry_t
values are different from their parent) and then throw away the child
process.
Currently our code does an entry-by-entry compare of pagemap, so
anything that exposes writes as a change to values in there would
allow us to run unmodified. That would be really nice. That said, I
think we'd still be OK to modify our own code too if we can find a
solution that would continue to function on older kernel releases,
-stable trees, etc.
Thanks,
Mark
On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <[email protected]> wrote:
> On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
> <[email protected]> wrote:
>> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
>>>
>>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>>> might reuse old anon_vma from grandparent vma.
>>
>> Oh well. I guess that was too simple.
>>
>> If Mark is ok with the rule that "it's not reliably if you have two
>> nested forks" (ie it only works if you exec for every fork you do), it
>> should still work, right? It sounds like Mark doesn't necessarily need
>> to handle the *generic* case.
>
> What about exposing shared/exclusive bit in pagemap == 1 if
> page_mapcount() > 1, otherwise 0 (or vise versa).
>
> Seems like this should work for detecting CoWed pages in child mm.
Something like this (see patch in attachment)
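
The attached patch is not reproduced here. As a sketch of how userspace
might consume such a flag: the thread does not settle on a bit
position, so the code below assumes bit 56, which is where later kernel
documentation lists "page exclusively mapped". The function names are
hypothetical. The idea is that, in a freshly forked child, a present
page that is mapped exclusively is no longer shared with the parent and
so must have been written (or newly allocated) since fork().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: bit 56 carries the "exclusively mapped" flag. */
#define PM_MMAP_EXCLUSIVE (UINT64_C(1) << 56)
#define PM_PRESENT        (UINT64_C(1) << 63)

/* True if the child's page is present and not shared with anyone else. */
static bool child_page_is_private(uint64_t entry)
{
    return (entry & PM_PRESENT) && (entry & PM_MMAP_EXCLUSIVE);
}

/* Counts the pages in a child's pagemap snapshot that look CoWed. */
static size_t count_private_pages(const uint64_t *entries, size_t n)
{
    size_t count = 0;

    assert(entries != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (child_page_is_private(entries[i]))
            count++;
    return count;
}
```

This only gives a useful answer under the restricted conditions
discussed above (a fresh child that is thrown away afterwards); pages
shared for other reasons, or a parent that has exited, would confuse
it.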
On Thu, Apr 30, 2015 at 04:11:30PM +0300, Konstantin Khlebnikov wrote:
> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <[email protected]> wrote:
> > On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
> > <[email protected]> wrote:
> >> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
> >>>
> >>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> >>> might reuse old anon_vma from grandparent vma.
> >>
> >> Oh well. I guess that was too simple.
> >>
> >> If Mark is ok with the rule that "it's not reliably if you have two
> >> nested forks" (ie it only works if you exec for every fork you do), it
> >> should still work, right? It sounds like Mark doesn't necessarily need
> >> to handle the *generic* case.
> >
> > What about exposing shared/exclusive bit in pagemap == 1 if
> > page_mapcount() > 1, otherwise 0 (or vise versa).
> >
> > Seems like this should work for detecting CoWed pages in child mm.
>
> Something like this (see patch in attachment)
THP is not covered.
Any comments on kcmp() idea?
--
Kirill A. Shutemov
On Thu, Apr 30, 2015 at 4:22 PM, Kirill A. Shutemov
<[email protected]> wrote:
> On Thu, Apr 30, 2015 at 04:11:30PM +0300, Konstantin Khlebnikov wrote:
>> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <[email protected]> wrote:
>> > On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
>> > <[email protected]> wrote:
>> >> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <[email protected]> wrote:
>> >>>
>> >>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> >>> might reuse old anon_vma from grandparent vma.
>> >>
>> >> Oh well. I guess that was too simple.
>> >>
>> >> If Mark is ok with the rule that "it's not reliably if you have two
>> >> nested forks" (ie it only works if you exec for every fork you do), it
>> >> should still work, right? It sounds like Mark doesn't necessarily need
>> >> to handle the *generic* case.
>> >
>> > What about exposing shared/exclusive bit in pagemap == 1 if
>> > page_mapcount() > 1, otherwise 0 (or vise versa).
>> >
>> > Seems like this should work for detecting CoWed pages in child mm.
>>
>> Something like this (see patch in attachment)
>
> THP is not covered.
Ok. Thanks.
>
> Any comments on kcmp() idea?
Should work too. Supporting full equal-less-greater semantics seems
safe -- its obfuscation is strong enough for that.
>
> --
> Kirill A. Shutemov
Hi all,
On Thu, Apr 30, 2015 at 2:11 PM, Konstantin Khlebnikov <[email protected]> wrote:
> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <[email protected]> wrote:
>> What about exposing shared/exclusive bit in pagemap == 1 if
>> page_mapcount() > 1, otherwise 0 (or vise versa).
>>
>> Seems like this should work for detecting CoWed pages in child mm.
>
> Something like this (see patch in attachment)
Either something like this patch (updated to cover THPs), or Linus's
suggestion seems worth a try. Could I perhaps get a steer on which
would be more likely to be accepted / preferred?
Either way, we'd want to expose the resulting flag somewhere within
pagemap. We could do this either within the normal flags region, or
potentially even repurpose one of the (now censored) physical bits.
If there's a general feeling then I'll update my work-in-progress and
post it here.
Thanks,
Mark
>> Something like this (see patch in attachment)
>
> THP is not covered.
>
> Any comments on kcmp() idea?
It seems like a modified kcmp() would also be a valid approach but, as
you noted, probably speed-limited for our purposes. As you say, there
is the option of a vector-oriented equivalent. It seems like a
generally nice facility to have available in the kernel but my
suspicion is that it wouldn't be optimal for us...
My thinking is that using soft-dirty might give us the best
performance on platforms where it's available. We're only using
fork() as a cunning/hacky way of tracking what pages change;
soft-dirty would allow us to avoid that too, whereas using kcmp()
would still require the forking overhead.
Does that make sense, or have I missed something?
Thanks,
Mark
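
For comparison, the soft-dirty route needs no fork() at all. A minimal
sketch, assuming a kernel built with soft-dirty support: write "4" to
/proc/self/clear_refs to clear the soft-dirty bits, touch a page, then
read bit 55 of its pagemap entry. The function names are hypothetical,
and the code returns -1 rather than guessing when the proc files are
unavailable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define PM_SOFT_DIRTY (UINT64_C(1) << 55) /* bit 55 of a pagemap entry */

/* Returns 1 if the page at addr is soft-dirty, 0 if not, -1 on error. */
static int page_soft_dirty(int pagemap_fd, const void *addr)
{
    long psz = sysconf(_SC_PAGESIZE);
    uint64_t entry;
    off_t off;

    assert(psz > 0);
    off = (off_t)((uintptr_t)addr / (uintptr_t)psz * sizeof(entry));
    if (pread(pagemap_fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry))
        return -1;
    return (entry & PM_SOFT_DIRTY) != 0;
}

/* Clears soft-dirty bits for the calling process; returns 0 on success. */
static int clear_soft_dirty(void)
{
    int fd = open("/proc/self/clear_refs", O_WRONLY);
    ssize_t written;

    if (fd < 0)
        return -1;
    written = write(fd, "4", 1);
    close(fd);
    return written == 1 ? 0 : -1;
}

/* Returns 1 if a write after clearing was reported, 0 if not, -1 if unavailable. */
static int soft_dirty_roundtrip(void)
{
    long psz = sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    char *p;

    assert(psz > 0);
    if (fd < 0)
        return -1;
    p = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p != MAP_FAILED) {
        *(volatile char *)p = 1;          /* fault the page in */
        if (clear_soft_dirty() == 0) {
            *(volatile char *)p = 2;      /* the write we want to detect */
            result = page_soft_dirty(fd, p);
        }
        munmap(p, (size_t)psz);
    }
    close(fd);
    return result;
}
```

On a kernel without soft-dirty support the round trip would be
expected to report 0, which is the cross-architecture gap raised
earlier in the thread.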