Hi,
I was alerted by people trying to use the PERF_RECORD_MMAP2
record to disambiguate virtual address mappings that there is a case
where the record does not contain enough information.
As you know, the MMAP2 record adds the major, minor, ino number,
inode generation numbers to a mapping. But it does that only for
file or pseudo -file backed mappings. That covers file mmaps and also
SYSV shared memory segments.
However there is a another kind of situation that arises in some
multi-process benchmarks where a region of memory is cloned
using VM_CLONE. As such, the virtual addresses match between
the processes but the major, minor, inode, inode generation fields
are all zeroes because there is no inode associated with the mapping.
Yet, it is important for the tool to know the mappings between the
processes are pointing to the same physical data.
We need to cover this case and I am seeking for advice on how to
best address this need given that we discarded using the plain physical
address for disambiguation.
Thanks.
On Mon, Sep 30, 2013 at 05:44:41PM +0200, Stephane Eranian wrote:
> Hi,
>
> I was alerted by people trying to use the PERF_RECORD_MMAP2
> record to disambiguate virtual address mappings that there is a case
> where the record does not contain enough information.
>
> As you know, the MMAP2 record adds the major, minor, ino number,
> inode generation numbers to a mapping. But it does that only for
> file or pseudo -file backed mappings. That covers file mmaps and also
> SYSV shared memory segments.
>
> However there is a another kind of situation that arises in some
> multi-process benchmarks where a region of memory is cloned
> using VM_CLONE. As such, the virtual addresses match between
> the processes but the major, minor, inode, inode generation fields
> are all zeroes because there is no inode associated with the mapping.
> Yet, it is important for the tool to know the mappings between the
> processes are pointing to the same physical data.
>
> We need to cover this case and I am seeking for advice on how to
> best address this need given that we discarded using the plain physical
> address for disambiguation.
Urgh.. who in his bloody mind is playing VM_CLNOE games that is not
pthread_creatE() ?
On Mon, Sep 30, 2013 at 6:15 PM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Sep 30, 2013 at 05:44:41PM +0200, Stephane Eranian wrote:
>> Hi,
>>
>> I was alerted by people trying to use the PERF_RECORD_MMAP2
>> record to disambiguate virtual address mappings that there is a case
>> where the record does not contain enough information.
>>
>> As you know, the MMAP2 record adds the major, minor, ino number,
>> inode generation numbers to a mapping. But it does that only for
>> file or pseudo -file backed mappings. That covers file mmaps and also
>> SYSV shared memory segments.
>>
>> However there is a another kind of situation that arises in some
>> multi-process benchmarks where a region of memory is cloned
>> using VM_CLONE. As such, the virtual addresses match between
>> the processes but the major, minor, inode, inode generation fields
>> are all zeroes because there is no inode associated with the mapping.
>> Yet, it is important for the tool to know the mappings between the
>> processes are pointing to the same physical data.
>>
>> We need to cover this case and I am seeking for advice on how to
>> best address this need given that we discarded using the plain physical
>> address for disambiguation.
>
> Urgh.. who in his bloody mind is playing VM_CLNOE games that is not
> pthread_creatE() ?
Some matrix multiply benchmark, I guess.
On Mon, Sep 30, 2013 at 06:48:55PM +0200, Stephane Eranian wrote:
> On Mon, Sep 30, 2013 at 6:15 PM, Peter Zijlstra <[email protected]> wrote:
> > On Mon, Sep 30, 2013 at 05:44:41PM +0200, Stephane Eranian wrote:
> >> Hi,
> >>
> >> I was alerted by people trying to use the PERF_RECORD_MMAP2
> >> record to disambiguate virtual address mappings that there is a case
> >> where the record does not contain enough information.
> >>
> >> As you know, the MMAP2 record adds the major, minor, ino number,
> >> inode generation numbers to a mapping. But it does that only for
> >> file or pseudo -file backed mappings. That covers file mmaps and also
> >> SYSV shared memory segments.
> >>
> >> However there is a another kind of situation that arises in some
> >> multi-process benchmarks where a region of memory is cloned
> >> using VM_CLONE. As such, the virtual addresses match between
> >> the processes but the major, minor, inode, inode generation fields
> >> are all zeroes because there is no inode associated with the mapping.
> >> Yet, it is important for the tool to know the mappings between the
> >> processes are pointing to the same physical data.
> >>
> >> We need to cover this case and I am seeking for advice on how to
> >> best address this need given that we discarded using the plain physical
> >> address for disambiguation.
> >
> > Urgh.. who in his bloody mind is playing VM_CLNOE games that is not
> > pthread_creatE() ?
>
> Some matrix multiply benchmark, I guess.
So the problem is that we don't have a user visible address space
identifier; with CLONE_THREAD we have the thread group id that acts
like this. But for bare CLONE_VM usage there's nothing afaik.
On Mon, Sep 30, 2013 at 6:54 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Sep 30, 2013 at 06:48:55PM +0200, Stephane Eranian wrote:
> > On Mon, Sep 30, 2013 at 6:15 PM, Peter Zijlstra <[email protected]> wrote:
> > > On Mon, Sep 30, 2013 at 05:44:41PM +0200, Stephane Eranian wrote:
> > >> Hi,
> > >>
> > >> I was alerted by people trying to use the PERF_RECORD_MMAP2
> > >> record to disambiguate virtual address mappings that there is a case
> > >> where the record does not contain enough information.
> > >>
> > >> As you know, the MMAP2 record adds the major, minor, ino number,
> > >> inode generation numbers to a mapping. But it does that only for
> > >> file or pseudo -file backed mappings. That covers file mmaps and also
> > >> SYSV shared memory segments.
> > >>
> > >> However there is a another kind of situation that arises in some
> > >> multi-process benchmarks where a region of memory is cloned
> > >> using VM_CLONE. As such, the virtual addresses match between
> > >> the processes but the major, minor, inode, inode generation fields
> > >> are all zeroes because there is no inode associated with the mapping.
> > >> Yet, it is important for the tool to know the mappings between the
> > >> processes are pointing to the same physical data.
> > >>
> > >> We need to cover this case and I am seeking for advice on how to
> > >> best address this need given that we discarded using the plain physical
> > >> address for disambiguation.
> > >
> > > Urgh.. who in his bloody mind is playing VM_CLNOE games that is not
> > > pthread_creatE() ?
> >
> > Some matrix multiply benchmark, I guess.
>
> So the problem is that we don't have a user visible address space
> identifier; with CLONE_THREAD we have the thread group id that acts
> like this. But for bare CLONE_VM usage there's nothing afaik.
>From the tool's perspective, the MMAP2 record must contain enough information
to identify that the mapping points to the same physical pages in that
particular
case (multi-process + VM_CLONE). As we have it now all inode-related fields
are zero which is useless (indicates: no info). In other words, we need to make
up some unique number and stash it in the maj.min,ino triplet somehow.
On Tue, Oct 01, 2013 at 01:22:58PM +0200, Stephane Eranian wrote:
> > So the problem is that we don't have a user visible address space
> > identifier; with CLONE_THREAD we have the thread group id that acts
> > like this. But for bare CLONE_VM usage there's nothing afaik.
>
>
> From the tool's perspective, the MMAP2 record must contain enough information
> to identify that the mapping points to the same physical pages in that
> particular
> case (multi-process + VM_CLONE). As we have it now all inode-related fields
> are zero which is useless (indicates: no info). In other words, we need to make
> up some unique number and stash it in the maj.min,ino triplet somehow.
So the only thing I can come up with is something like the below;
supposedly the sha hash mixing a boot time random seed and the mm
pointer is enough to avoid it being a data leak.
And of course there's the possibility of a collision.
---
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -39,11 +39,15 @@
#include <linux/hw_breakpoint.h>
#include <linux/mm_types.h>
#include <linux/cgroup.h>
+#include <linux/random.h>
+#include <linux/cryptohash.h>
#include "internal.h"
#include <asm/irq_regs.h>
+static u64 __perf_rand_seed;
+
struct remote_function_call {
struct task_struct *p;
int (*func)(void *info);
@@ -5136,6 +5140,33 @@ static void perf_event_mmap_event(struct
min = MINOR(dev);
} else {
+ union {
+ struct {
+ u64 ino, gen;
+ u32 min;
+ };
+ u32 digest[SHA_DIGEST_WORDS];
+ } hash;
+ union {
+ struct {
+ u64 seed;
+ u64 ptr;
+ };
+ u8 message[SHA_MESSAGE_BYTES];
+ } data;
+ u32 workspace[SHA_WORKSPACE_WORDS];
+
+ memset(&data, 0, sizeof(data));
+ data.seed = __perf_rand_seed;
+ data.ptr = (u64)vma->vm_mm;
+
+ sha_init(hash.digest);
+ sha_transform(hash.digest, data.message, workspace);
+
+ gen = hash.gen;
+ ino = hash.ino;
+ min = hash.min;
+
if (arch_vma_name(mmap_event->vma)) {
name = strncpy(tmp, arch_vma_name(mmap_event->vma),
sizeof(tmp) - 1);
@@ -7895,6 +7926,8 @@ void __init perf_event_init(void)
/* do not patch jump label more than once per second */
jump_label_rate_limit(&perf_sched_events, HZ);
+ get_random_bytes(&__perf_rand_seed, sizeof(__perf_rand_seed));
+
/*
* Build time assertion that we keep the data_head at the intended
* location. IOW, validation we got the __reserved[] size right.
On Wed, Oct 02, 2013 at 01:23:16PM +0200, Peter Zijlstra wrote:
> So the only thing I can come up with is something like the below;
> supposedly the sha hash mixing a boot time random seed and the mm
> pointer is enough to avoid it being a data leak.
That is, right until it becomes feasible to run 2^64 SHA1 computations.
I've actually no idea how hard that is given todays GPU assisted
efforts.
* Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 01:23:16PM +0200, Peter Zijlstra wrote:
>
> > So the only thing I can come up with is something like the below;
> > supposedly the sha hash mixing a boot time random seed and the mm
> > pointer is enough to avoid it being a data leak.
>
> That is, right until it becomes feasible to run 2^64 SHA1 computations.
> I've actually no idea how hard that is given todays GPU assisted
> efforts.
Well, here are the possible cryptanalytic attacks I can think of:
- differential, because here you don't just have access to the hash
value but you can essentially feed highly correlated plaintext to the
hash at will, by starting/stopping threads, knowing their typical mm
pointer differences, etc.
I.e. less than 2^64, potentially a lot less.
- then there are timing attacks, and someone having access to a PMU
context and who can trigger this SHA1 computation arbitrarily in task
local context can run very accurate and low noise timing attacks...
I don't think the kernel's sha_transform() is hardened against timing
attacks, it's performance optimized so it has variable execution time
highly dependent on plaintext input - which leaks information about the
plaintext.
But then again, how realistic is an attack? All that effort just to
recover the raw kernel data pointer value of a struct mm? Dunno whether we
should worry about it.
Thanks,
Ingo
On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
> - then there are timing attacks, and someone having access to a PMU
> context and who can trigger this SHA1 computation arbitrarily in task
> local context can run very accurate and low noise timing attacks...
>
> I don't think the kernel's sha_transform() is hardened against timing
> attacks, it's performance optimized so it has variable execution time
> highly dependent on plaintext input - which leaks information about the
> plaintext.
Typical user doesn't have enough priv to profile kernel space; once you
do you also have enough priv to see kernel addresses outright (ie.
kallsyms etc..).
On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>> - then there are timing attacks, and someone having access to a PMU
>> context and who can trigger this SHA1 computation arbitrarily in task
>> local context can run very accurate and low noise timing attacks...
>>
>> I don't think the kernel's sha_transform() is hardened against timing
>> attacks, it's performance optimized so it has variable execution time
>> highly dependent on plaintext input - which leaks information about the
>> plaintext.
>
> Typical user doesn't have enough priv to profile kernel space; once you
> do you also have enough priv to see kernel addresses outright (ie.
> kallsyms etc..).
>
I was going to say just that. But that's not the default, paranoid level
is at 1 by default and not 2. So I supposedly can still do:
$ perf record -e cycles ......
In per-thread mode and collect kernel level addresses.
On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
> >> - then there are timing attacks, and someone having access to a PMU
> >> context and who can trigger this SHA1 computation arbitrarily in task
> >> local context can run very accurate and low noise timing attacks...
> >>
> >> I don't think the kernel's sha_transform() is hardened against timing
> >> attacks, it's performance optimized so it has variable execution time
> >> highly dependent on plaintext input - which leaks information about the
> >> plaintext.
> >
> > Typical user doesn't have enough priv to profile kernel space; once you
> > do you also have enough priv to see kernel addresses outright (ie.
> > kallsyms etc..).
> >
> I was going to say just that. But that's not the default, paranoid level
> is at 1 by default and not 2. So I supposedly can still do:
Oh right you are.. so yes that's a very viable avenue.
On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>> >> - then there are timing attacks, and someone having access to a PMU
>> >> context and who can trigger this SHA1 computation arbitrarily in task
>> >> local context can run very accurate and low noise timing attacks...
>> >>
>> >> I don't think the kernel's sha_transform() is hardened against timing
>> >> attacks, it's performance optimized so it has variable execution time
>> >> highly dependent on plaintext input - which leaks information about the
>> >> plaintext.
>> >
>> > Typical user doesn't have enough priv to profile kernel space; once you
>> > do you also have enough priv to see kernel addresses outright (ie.
>> > kallsyms etc..).
>> >
>> I was going to say just that. But that's not the default, paranoid level
>> is at 1 by default and not 2. So I supposedly can still do:
>
> Oh right you are.. so yes that's a very viable avenue.
You mean simply encoding the vma->vm_mm as the ino number, for instance.
On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
> >> >> - then there are timing attacks, and someone having access to a PMU
> >> >> context and who can trigger this SHA1 computation arbitrarily in task
> >> >> local context can run very accurate and low noise timing attacks...
> >> >>
> >> >> I don't think the kernel's sha_transform() is hardened against timing
> >> >> attacks, it's performance optimized so it has variable execution time
> >> >> highly dependent on plaintext input - which leaks information about the
> >> >> plaintext.
> >> >
> >> > Typical user doesn't have enough priv to profile kernel space; once you
> >> > do you also have enough priv to see kernel addresses outright (ie.
> >> > kallsyms etc..).
> >> >
> >> I was going to say just that. But that's not the default, paranoid level
> >> is at 1 by default and not 2. So I supposedly can still do:
> >
> > Oh right you are.. so yes that's a very viable avenue.
>
> You mean simply encoding the vma->vm_mm as the ino number, for instance.
Nah.. I think Kees would very much shoot us on the spot for doing that.
But with the paranoid level defaulting to 1 the PMU attack on the kernel
SHA implenentation is feasible.
* Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
> > - then there are timing attacks, and someone having access to a PMU
> > context and who can trigger this SHA1 computation arbitrarily in task
> > local context can run very accurate and low noise timing attacks...
> >
> > I don't think the kernel's sha_transform() is hardened against timing
> > attacks, it's performance optimized so it has variable execution time
> > highly dependent on plaintext input - which leaks information about the
> > plaintext.
>
> Typical user doesn't have enough priv to profile kernel space; once you
> do you also have enough priv to see kernel addresses outright (ie.
> kallsyms etc..).
I didn't mean profiling - that's not a 'timing attack'.
A simple RDTSC done around repeated calls to sha_transform() using kernel
functionality is.
Thanks,
Ingo
On Wed, Oct 2, 2013 at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
>> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>> >> >> - then there are timing attacks, and someone having access to a PMU
>> >> >> context and who can trigger this SHA1 computation arbitrarily in task
>> >> >> local context can run very accurate and low noise timing attacks...
>> >> >>
>> >> >> I don't think the kernel's sha_transform() is hardened against timing
>> >> >> attacks, it's performance optimized so it has variable execution time
>> >> >> highly dependent on plaintext input - which leaks information about the
>> >> >> plaintext.
>> >> >
>> >> > Typical user doesn't have enough priv to profile kernel space; once you
>> >> > do you also have enough priv to see kernel addresses outright (ie.
>> >> > kallsyms etc..).
>> >> >
>> >> I was going to say just that. But that's not the default, paranoid level
>> >> is at 1 by default and not 2. So I supposedly can still do:
>> >>
>> >> $ perf record -e cycles ......
>> >>
>> >> In per-thread mode and collect kernel level addresses.
>> >
>> > Oh right you are.. so yes that's a very viable avenue.
>>
>> You mean simply encoding the vma->vm_mm as the ino number, for instance.
>
> Nah.. I think Kees would very much shoot us on the spot for doing that.
> But with the paranoid level defaulting to 1 the PMU attack on the kernel
> SHA implenentation is feasible.
We already have other kernel address leaks (e.g. heap addresses via
INET_DIAG), and I'd like to avoid adding more. It'd be nice if there
was a common way to uniquely mask these values that are really just
"handles". We could use it both here and in the network code.
Would it be possible to just have a "regular" incrementing handle,
like fd, or are we talking about doing that map for all VMAs, which
would make that mapping unfeasible due to storage needs?
On the other hand, trying to hide kernel addresses from root is no
easy task, especially given that while kptr_restrict has a "2" level,
dmesg_restrict does not. Are these kernel-address-leaking perf modes
already limited to a specific POSIX capability?
-Kees
--
Kees Cook
Chrome OS Security
On Wed, Oct 2, 2013 at 7:14 PM, Kees Cook <[email protected]> wrote:
> On Wed, Oct 2, 2013 at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
>> On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
>>> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
>>> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>>> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>>> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>>> >> >> - then there are timing attacks, and someone having access to a PMU
>>> >> >> context and who can trigger this SHA1 computation arbitrarily in task
>>> >> >> local context can run very accurate and low noise timing attacks...
>>> >> >>
>>> >> >> I don't think the kernel's sha_transform() is hardened against timing
>>> >> >> attacks, it's performance optimized so it has variable execution time
>>> >> >> highly dependent on plaintext input - which leaks information about the
>>> >> >> plaintext.
>>> >> >
>>> >> > Typical user doesn't have enough priv to profile kernel space; once you
>>> >> > do you also have enough priv to see kernel addresses outright (ie.
>>> >> > kallsyms etc..).
>>> >> >
>>> >> I was going to say just that. But that's not the default, paranoid level
>>> >> is at 1 by default and not 2. So I supposedly can still do:
>>> >>
>>> >> $ perf record -e cycles ......
>>> >>
>>> >> In per-thread mode and collect kernel level addresses.
>>> >
>>> > Oh right you are.. so yes that's a very viable avenue.
>>>
>>> You mean simply encoding the vma->vm_mm as the ino number, for instance.
>>
>> Nah.. I think Kees would very much shoot us on the spot for doing that.
>> But with the paranoid level defaulting to 1 the PMU attack on the kernel
>> SHA implenentation is feasible.
>
> We already have other kernel address leaks (e.g. heap addresses via
> INET_DIAG), and I'd like to avoid adding more. It'd be nice if there
> was a common way to uniquely mask these values that are really just
> "handles". We could use it both here and in the network code.
>
> Would it be possible to just have a "regular" incrementing handle,
> like fd, or are we talking about doing that map for all VMAs, which
> would make that mapping unfeasible due to storage needs?
>
All we need is a way to report that two vmas point to the same
vma->vm_mm, i..e, same physical data. If I understand what
you are suggesting, you'd add some sort of generation number
to the vm_mm. Each new vm_mm gets a new number. That
would work, I think. No kernel addresses reported directly nor
hashed.
> On the other hand, trying to hide kernel addresses from root is no
> easy task, especially given that while kptr_restrict has a "2" level,
> dmesg_restrict does not. Are these kernel-address-leaking perf modes
> already limited to a specific POSIX capability?
>
> -Kees
>
> --
> Kees Cook
> Chrome OS Security
On Wed, Oct 2, 2013 at 10:20 AM, Stephane Eranian <[email protected]> wrote:
> On Wed, Oct 2, 2013 at 7:14 PM, Kees Cook <[email protected]> wrote:
>> On Wed, Oct 2, 2013 at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
>>> On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
>>>> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
>>>> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>>>> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>>>> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>>>> >> >> - then there are timing attacks, and someone having access to a PMU
>>>> >> >> context and who can trigger this SHA1 computation arbitrarily in task
>>>> >> >> local context can run very accurate and low noise timing attacks...
>>>> >> >>
>>>> >> >> I don't think the kernel's sha_transform() is hardened against timing
>>>> >> >> attacks, it's performance optimized so it has variable execution time
>>>> >> >> highly dependent on plaintext input - which leaks information about the
>>>> >> >> plaintext.
>>>> >> >
>>>> >> > Typical user doesn't have enough priv to profile kernel space; once you
>>>> >> > do you also have enough priv to see kernel addresses outright (ie.
>>>> >> > kallsyms etc..).
>>>> >> >
>>>> >> I was going to say just that. But that's not the default, paranoid level
>>>> >> is at 1 by default and not 2. So I supposedly can still do:
>>>> >>
>>>> >> $ perf record -e cycles ......
>>>> >>
>>>> >> In per-thread mode and collect kernel level addresses.
>>>> >
>>>> > Oh right you are.. so yes that's a very viable avenue.
>>>>
>>>> You mean simply encoding the vma->vm_mm as the ino number, for instance.
>>>
>>> Nah.. I think Kees would very much shoot us on the spot for doing that.
>>> But with the paranoid level defaulting to 1 the PMU attack on the kernel
>>> SHA implenentation is feasible.
>>
>> We already have other kernel address leaks (e.g. heap addresses via
>> INET_DIAG), and I'd like to avoid adding more. It'd be nice if there
>> was a common way to uniquely mask these values that are really just
>> "handles". We could use it both here and in the network code.
>>
>> Would it be possible to just have a "regular" incrementing handle,
>> like fd, or are we talking about doing that map for all VMAs, which
>> would make that mapping unfeasible due to storage needs?
>>
> All we need is a way to report that two vmas point to the same
> vma->vm_mm, i..e, same physical data. If I understand what
> you are suggesting, you'd add some sort of generation number
> to the vm_mm. Each new vm_mm gets a new number. That
> would work, I think. No kernel addresses reported directly nor
> hashed.
Right. Is that workable? It sounds like this handle is only needed at
inspection time, though. Is this uniqueness test limited to a single
process, or is this uniqueness test across processes?
-Kees
--
Kees Cook
Chrome OS Security
On Wed, Oct 2, 2013 at 7:29 PM, Kees Cook <[email protected]> wrote:
> On Wed, Oct 2, 2013 at 10:20 AM, Stephane Eranian <[email protected]> wrote:
>> On Wed, Oct 2, 2013 at 7:14 PM, Kees Cook <[email protected]> wrote:
>>> On Wed, Oct 2, 2013 at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
>>>> On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
>>>>> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>>>>> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>>>>> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>>>>> >> >> - then there are timing attacks, and someone having access to a PMU
>>>>> >> >> context and who can trigger this SHA1 computation arbitrarily in task
>>>>> >> >> local context can run very accurate and low noise timing attacks...
>>>>> >> >>
>>>>> >> >> I don't think the kernel's sha_transform() is hardened against timing
>>>>> >> >> attacks, it's performance optimized so it has variable execution time
>>>>> >> >> highly dependent on plaintext input - which leaks information about the
>>>>> >> >> plaintext.
>>>>> >> >
>>>>> >> > Typical user doesn't have enough priv to profile kernel space; once you
>>>>> >> > do you also have enough priv to see kernel addresses outright (ie.
>>>>> >> > kallsyms etc..).
>>>>> >> >
>>>>> >> I was going to say just that. But that's not the default, paranoid level
>>>>> >> is at 1 by default and not 2. So I supposedly can still do:
>>>>> >>
>>>>> >> $ perf record -e cycles ......
>>>>> >>
>>>>> >> In per-thread mode and collect kernel level addresses.
>>>>> >
>>>>> > Oh right you are.. so yes that's a very viable avenue.
>>>>>
>>>>> You mean simply encoding the vma->vm_mm as the ino number, for instance.
>>>>
>>>> Nah.. I think Kees would very much shoot us on the spot for doing that.
>>>> But with the paranoid level defaulting to 1 the PMU attack on the kernel
>>>> SHA implenentation is feasible.
>>>
>>> We already have other kernel address leaks (e.g. heap addresses via
>>> INET_DIAG), and I'd like to avoid adding more. It'd be nice if there
>>> was a common way to uniquely mask these values that are really just
>>> "handles". We could use it both here and in the network code.
>>>
>>> Would it be possible to just have a "regular" incrementing handle,
>>> like fd, or are we talking about doing that map for all VMAs, which
>>> would make that mapping unfeasible due to storage needs?
>>>
>> All we need is a way to report that two vmas point to the same
>> vma->vm_mm, i..e, same physical data. If I understand what
>> you are suggesting, you'd add some sort of generation number
>> to the vm_mm. Each new vm_mm gets a new number. That
>> would work, I think. No kernel addresses reported directly nor
>> hashed.
>
> Right. Is that workable? It sounds like this handle is only needed at
> inspection time, though. Is this uniqueness test limited to a single
> process, or is this uniqueness test across processes?
>
Each time that vm_mm is allocated we would allocate a new generation
number.
Uniqueness is across processes. But that's by construction of the
address space.
On Wed, Oct 2, 2013 at 10:49 AM, Stephane Eranian <[email protected]> wrote:
> On Wed, Oct 2, 2013 at 7:29 PM, Kees Cook <[email protected]> wrote:
>> On Wed, Oct 2, 2013 at 10:20 AM, Stephane Eranian <[email protected]> wrote:
>>> On Wed, Oct 2, 2013 at 7:14 PM, Kees Cook <[email protected]> wrote:
>>>> On Wed, Oct 2, 2013 at 6:37 AM, Peter Zijlstra <[email protected]> wrote:
>>>>> On Wed, Oct 02, 2013 at 03:13:16PM +0200, Stephane Eranian wrote:
>>>>>> On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>> > On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>>>>>> >> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>>>>>> >> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>>>>>> >> >> - then there are timing attacks, and someone having access to a PMU
>>>>>> >> >> context and who can trigger this SHA1 computation arbitrarily in task
>>>>>> >> >> local context can run very accurate and low noise timing attacks...
>>>>>> >> >>
>>>>>> >> >> I don't think the kernel's sha_transform() is hardened against timing
>>>>>> >> >> attacks, it's performance optimized so it has variable execution time
>>>>>> >> >> highly dependent on plaintext input - which leaks information about the
>>>>>> >> >> plaintext.
>>>>>> >> >
>>>>>> >> > Typical user doesn't have enough priv to profile kernel space; once you
>>>>>> >> > do you also have enough priv to see kernel addresses outright (ie.
>>>>>> >> > kallsyms etc..).
>>>>>> >> >
>>>>>> >> I was going to say just that. But that's not the default, paranoid level
>>>>>> >> is at 1 by default and not 2. So I supposedly can still do:
>>>>>> >>
>>>>>> >> $ perf record -e cycles ......
>>>>>> >>
>>>>>> >> In per-thread mode and collect kernel level addresses.
>>>>>> >
>>>>>> > Oh right you are.. so yes that's a very viable avenue.
>>>>>>
>>>>>> You mean simply encoding the vma->vm_mm as the ino number, for instance.
>>>>>
>>>>> Nah.. I think Kees would very much shoot us on the spot for doing that.
>>>>> But with the paranoid level defaulting to 1 the PMU attack on the kernel
>>>>> SHA implenentation is feasible.
>>>>
>>>> We already have other kernel address leaks (e.g. heap addresses via
>>>> INET_DIAG), and I'd like to avoid adding more. It'd be nice if there
>>>> was a common way to uniquely mask these values that are really just
>>>> "handles". We could use it both here and in the network code.
>>>>
>>>> Would it be possible to just have a "regular" incrementing handle,
>>>> like fd, or are we talking about doing that map for all VMAs, which
>>>> would make that mapping unfeasible due to storage needs?
>>>>
>>> All we need is a way to report that two vmas point to the same
>>> vma->vm_mm, i..e, same physical data. If I understand what
>>> you are suggesting, you'd add some sort of generation number
>>> to the vm_mm. Each new vm_mm gets a new number. That
>>> would work, I think. No kernel addresses reported directly nor
>>> hashed.
>>
>> Right. Is that workable? It sounds like this handle is only needed at
>> inspection time, though. Is this uniqueness test limited to a single
>> process, or is this uniqueness test across processes?
>>
> Each time that vm_mm is allocated we would allocate a new generation
> number.
Seems like a simple enough solution. Surely there must be a catch. :)
>
> Uniqueness is across processes. But that's by construction of the
> address space.
Okay, that's what I figured.
-Kees
--
Kees Cook
Chrome OS Security
On Wed, Oct 02, 2013 at 11:10:15AM -0700, Kees Cook wrote:
> Seems like a simple enough solution. Surely there must be a catch. :)
I didn't want to add this to the core mm just for perf..
On Wed, Oct 2, 2013 at 12:00 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 11:10:15AM -0700, Kees Cook wrote:
>> Seems like a simple enough solution. Surely there must be a catch. :)
>
> I didn't want to add this to the core mm just for perf..
It seems like it would be pretty inexpensive. It might also be
valuable in other situations. Not that I can think of any at the
moment. Additionally, it could likely be hidden by a CONFIG, so that
if perf isn't built in, there's no change?
-Kees
--
Kees Cook
Chrome OS Security
On Wed, Oct 02, 2013 at 12:38:54PM -0700, Kees Cook wrote:
> On Wed, Oct 2, 2013 at 12:00 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 02, 2013 at 11:10:15AM -0700, Kees Cook wrote:
> >> Seems like a simple enough solution. Surely there must be a catch. :)
> >
> > I didn't want to add this to the core mm just for perf..
>
> It seems like it would be pretty inexpensive. It might also be
> valuable in other situations. Not that I can think of any at the
> moment. Additionally, it could likely be hidden by a CONFIG, so that
> if perf isn't built in, there's no change?
You optimist, you think you can build a kernel without perf? ;-)
Its just that I would hate to add more completely global state to the
fork() path. The tasklist_lock might be hard to crack, but at least the
pid-hash could use per bucket locks (it doesn't apparently).
I suppose people don't really care that much about fork() performance;
which is sad. KSM and THP also add their own global locks :-(
On Wed, Oct 2, 2013 at 10:31 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 12:38:54PM -0700, Kees Cook wrote:
>> On Wed, Oct 2, 2013 at 12:00 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 02, 2013 at 11:10:15AM -0700, Kees Cook wrote:
>> >> Seems like a simple enough solution. Surely there must be a catch. :)
>> >
>> > I didn't want to add this to the core mm just for perf..
>>
>> It seems like it would be pretty inexpensive. It might also be
>> valuable in other situations. Not that I can think of any at the
>> moment. Additionally, it could likely be hidden by a CONFIG, so that
>> if perf isn't built in, there's no change?
>
> You optimist, you think you can build a kernel without perf? ;-)
>
> Its just that I would hate to add more completely global state to the
> fork() path. The tasklist_lock might be hard to crack, but at least the
> pid-hash could use per bucket locks (it doesn't apparently).
>
> I suppose people don't really care that much about fork() performance;
> which is sad. KSM and THP also add their own global locks :-(
I don't know the MM code but I assume that that vm_mm struct is
allocated dynamically
and maybe you already grabbing a lock while doing this. Could we
leverage that lock
to increment a global generation number?
On Thu, Oct 03, 2013 at 10:55:28AM +0200, Stephane Eranian wrote:
> I don't know the MM code but I assume that that vm_mm struct is
> allocated dynamically
> and maybe you already grabbing a lock while doing this. Could we
> leverage that lock
> to increment a global generation number?
Sure; something like so.. I just don't like global state nor adding to
mm_struct for just this.
---
include/linux/mm_types.h | 1 +
kernel/fork.c | 6 ++++++
2 files changed, 7 insertions(+)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d9851eeb6e1d..3877b1e72a5b 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -436,6 +436,7 @@ struct mm_struct {
int first_nid;
#endif
struct uprobes_state uprobes_state;
+ u64 mm_id;
};
/* first nid will either be a valid NID or one of these values */
diff --git a/kernel/fork.c b/kernel/fork.c
index 086fe73ad6bd..b315f6227629 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -523,6 +523,8 @@ static void mm_init_aio(struct mm_struct *mm)
#endif
}
+static u64 global_mm_id;
+
static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
{
atomic_set(&mm->mm_users, 1);
@@ -537,6 +539,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
spin_lock_init(&mm->page_table_lock);
mm_init_aio(mm);
mm_init_owner(mm, p);
+ mm->mm_id = 0;
if (likely(!mm_alloc_pgd(mm))) {
mm->def_flags = 0;
@@ -1422,6 +1425,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
*/
write_lock_irq(&tasklist_lock);
+ if (p->mm && !p->mm->mm_id)
+ p->mm->mm_id = ++global_mm_id;
+
/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
p->real_parent = current->real_parent;
On Thu, Oct 3, 2013 at 11:03 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Oct 03, 2013 at 10:55:28AM +0200, Stephane Eranian wrote:
>> I don't know the MM code but I assume that that vm_mm struct is
>> allocated dynamically
>> and maybe you already grabbing a lock while doing this. Could we
>> leverage that lock
>> to increment a global generation number?
>
> Sure; something like so.. I just don't like global state nor adding to
> mm_struct for just this.
>
I understand, I don't like global state either. I see you already
have uprobes state in there. Thus to me, the perf situation is
not much worse in terms of usefulness (frequency of use).
Thanks.
> ---
> include/linux/mm_types.h | 1 +
> kernel/fork.c | 6 ++++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d9851eeb6e1d..3877b1e72a5b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -436,6 +436,7 @@ struct mm_struct {
> int first_nid;
> #endif
> struct uprobes_state uprobes_state;
> + u64 mm_id;
> };
>
> /* first nid will either be a valid NID or one of these values */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 086fe73ad6bd..b315f6227629 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -523,6 +523,8 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> +static u64 global_mm_id;
> +
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> {
> atomic_set(&mm->mm_users, 1);
> @@ -537,6 +539,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> spin_lock_init(&mm->page_table_lock);
> mm_init_aio(mm);
> mm_init_owner(mm, p);
> + mm->mm_id = 0;
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
> @@ -1422,6 +1425,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> */
> write_lock_irq(&tasklist_lock);
>
> + if (p->mm && !p->mm->mm_id)
> + p->mm->mm_id = ++global_mm_id;
> +
> /* CLONE_PARENT re-uses the old parent */
> if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
> p->real_parent = current->real_parent;
On Thu, Oct 3, 2013 at 2:13 AM, Stephane Eranian <[email protected]> wrote:
> On Thu, Oct 3, 2013 at 11:03 AM, Peter Zijlstra <[email protected]> wrote:
>> On Thu, Oct 03, 2013 at 10:55:28AM +0200, Stephane Eranian wrote:
>>> I don't know the MM code but I assume that that vm_mm struct is
>>> allocated dynamically
>>> and maybe you already grabbing a lock while doing this. Could we
>>> leverage that lock
>>> to increment a global generation number?
>>
>> Sure; something like so.. I just don't like global state nor adding to
>> mm_struct for just this.
>>
> I understand, I don't like global state either. I see you already
> have uprobes state in there. Thus to me, the perf situation is
> not much worse in terms of usefulness (frequency of use).
>
> Thanks.
>
>> ---
>> include/linux/mm_types.h | 1 +
>> kernel/fork.c | 6 ++++++
>> 2 files changed, 7 insertions(+)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index d9851eeb6e1d..3877b1e72a5b 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -436,6 +436,7 @@ struct mm_struct {
>> int first_nid;
>> #endif
>> struct uprobes_state uprobes_state;
>> + u64 mm_id;
>> };
>>
>> /* first nid will either be a valid NID or one of these values */
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 086fe73ad6bd..b315f6227629 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -523,6 +523,8 @@ static void mm_init_aio(struct mm_struct *mm)
>> #endif
>> }
>>
>> +static u64 global_mm_id;
>> +
>> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>> {
>> atomic_set(&mm->mm_users, 1);
>> @@ -537,6 +539,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
>> spin_lock_init(&mm->page_table_lock);
>> mm_init_aio(mm);
>> mm_init_owner(mm, p);
>> + mm->mm_id = 0;
>>
>> if (likely(!mm_alloc_pgd(mm))) {
>> mm->def_flags = 0;
>> @@ -1422,6 +1425,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
>> */
>> write_lock_irq(&tasklist_lock);
>>
>> + if (p->mm && !p->mm->mm_id)
>> + p->mm->mm_id = ++global_mm_id;
>> +
>> /* CLONE_PARENT re-uses the old parent */
>> if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
>> p->real_parent = current->real_parent;
FWIW, I like this approach.
-Kees
--
Kees Cook
Chrome OS Security
On Thu, Oct 03, 2013 at 10:55:28AM +0200, Stephane Eranian wrote:
> On Wed, Oct 2, 2013 at 10:31 PM, Peter Zijlstra <[email protected]> wrote:
> > On Wed, Oct 02, 2013 at 12:38:54PM -0700, Kees Cook wrote:
> >> On Wed, Oct 2, 2013 at 12:00 PM, Peter Zijlstra <[email protected]> wrote:
> >> > On Wed, Oct 02, 2013 at 11:10:15AM -0700, Kees Cook wrote:
> >> >> Seems like a simple enough solution. Surely there must be a catch. :)
> >> >
> >> > I didn't want to add this to the core mm just for perf..
> >>
> >> It seems like it would be pretty inexpensive. It might also be
> >> valuable in other situations. Not that I can think of any at the
> >> moment. Additionally, it could likely be hidden by a CONFIG, so that
> >> if perf isn't built in, there's no change?
> >
> > You optimist, you think you can build a kernel without perf? ;-)
> >
> > Its just that I would hate to add more completely global state to the
> > fork() path. The tasklist_lock might be hard to crack, but at least the
> > pid-hash could use per bucket locks (it doesn't apparently).
> >
> > I suppose people don't really care that much about fork() performance;
> > which is sad. KSM and THP also add their own global locks :-(
>
> I don't know the MM code but I assume that that vm_mm struct is
> allocated dynamically
> and maybe you already grabbing a lock while doing this. Could we
> leverage that lock
> to increment a global generation number?
I thought Pavel added a mm identifier for criu, but can't find it
right now.
-Andi
--
[email protected] -- Speaking for myself only
On Wed, Oct 2, 2013 at 3:01 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Oct 02, 2013 at 02:59:32PM +0200, Stephane Eranian wrote:
>> On Wed, Oct 2, 2013 at 2:46 PM, Peter Zijlstra <[email protected]> wrote:
>> > On Wed, Oct 02, 2013 at 02:39:53PM +0200, Ingo Molnar wrote:
>> >> - then there are timing attacks, and someone having access to a PMU
>> >> context and who can trigger this SHA1 computation arbitrarily in task
>> >> local context can run very accurate and low noise timing attacks...
>> >>
>> >> I don't think the kernel's sha_transform() is hardened against timing
>> >> attacks, it's performance optimized so it has variable execution time
>> >> highly dependent on plaintext input - which leaks information about the
>> >> plaintext.
>> >
>> > Typical user doesn't have enough priv to profile kernel space; once you
>> > do you also have enough priv to see kernel addresses outright (ie.
>> > kallsyms etc..).
>> >
>> I was going to say just that. But that's not the default, paranoid level
>> is at 1 by default and not 2. So I supposedly can still do:
>
> Oh right you are.. so yes that's a very viable avenue.
I am going to try this out today. I think if it works well, we could
also simplify the
MMAP2 record and just pass this unique id for all cases.MMAP2 is only in rcX
release so far. Is that still possible?
On Mon, Oct 07, 2013 at 01:16:35PM +0200, Stephane Eranian wrote:
> I am going to try this out today. I think if it works well, we could
> also simplify the
> MMAP2 record and just pass this unique id for all cases.MMAP2 is only in rcX
> release so far. Is that still possible?
Yeah I think so.
Peter,
On Thu, Oct 3, 2013 at 11:03 AM, Peter Zijlstra <[email protected]> wrote:
> On Thu, Oct 03, 2013 at 10:55:28AM +0200, Stephane Eranian wrote:
>> I don't know the MM code but I assume that that vm_mm struct is
>> allocated dynamically
>> and maybe you already grabbing a lock while doing this. Could we
>> leverage that lock
>> to increment a global generation number?
>
> Sure; something like so.. I just don't like global state nor adding to
> mm_struct for just this.
>
> ---
> include/linux/mm_types.h | 1 +
> kernel/fork.c | 6 ++++++
> 2 files changed, 7 insertions(+)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index d9851eeb6e1d..3877b1e72a5b 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -436,6 +436,7 @@ struct mm_struct {
> int first_nid;
> #endif
> struct uprobes_state uprobes_state;
> + u64 mm_id;
> };
>
> /* first nid will either be a valid NID or one of these values */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 086fe73ad6bd..b315f6227629 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -523,6 +523,8 @@ static void mm_init_aio(struct mm_struct *mm)
> #endif
> }
>
> +static u64 global_mm_id;
> +
> static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> {
> atomic_set(&mm->mm_users, 1);
> @@ -537,6 +539,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p)
> spin_lock_init(&mm->page_table_lock);
> mm_init_aio(mm);
> mm_init_owner(mm, p);
> + mm->mm_id = 0;
>
> if (likely(!mm_alloc_pgd(mm))) {
> mm->def_flags = 0;
> @@ -1422,6 +1425,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
> */
> write_lock_irq(&tasklist_lock);
>
> + if (p->mm && !p->mm->mm_id)
> + p->mm->mm_id = ++global_mm_id;
> +
> /* CLONE_PARENT re-uses the old parent */
> if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
> p->real_parent = current->real_parent;
Ok, so I tried this. It does not cover shmat() cases unfortunately.
We need that same ++global_mm_id somewhere on that code path.
Haven't looked at it in details just yet.
On Mon, Oct 07, 2013 at 11:04:38PM +0200, Stephane Eranian wrote:
> Peter,
>
> Ok, so I tried this. It does not cover shmat() cases unfortunately.
> We need that same ++global_mm_id somewhere on that code path.
> Haven't looked at it in details just yet.
shmat() is a regular shared memory thingy right? So that should be
covered by the regular inode,dev bits, right?
On Tue, Oct 8, 2013 at 8:54 AM, Peter Zijlstra <[email protected]> wrote:
> On Mon, Oct 07, 2013 at 11:04:38PM +0200, Stephane Eranian wrote:
>> Peter,
>>
>> Ok, so I tried this. It does not cover shmat() cases unfortunately.
>> We need that same ++global_mm_id somewhere on that code path.
>> Haven't looked at it in details just yet.
>
> shmat() is a regular shared memory thingy right? So that should be
> covered by the regular inode,dev bits, right?
Yes, it is but I am trying to see whether or not we could unify that and
use a single u64 number to uniquely identify each mapping.
On Tue, Oct 08, 2013 at 09:15:30AM +0200, Stephane Eranian wrote:
> Yes, it is but I am trying to see whether or not we could unify that and
> use a single u64 number to uniquely identify each mapping.
No you cannot; two unrelated executables which have distinct mm_ids can
easily mmap() the same shared file.
On Tue, Oct 8, 2013 at 11:36 AM, Peter Zijlstra <[email protected]> wrote:
> On Tue, Oct 08, 2013 at 09:15:30AM +0200, Stephane Eranian wrote:
>> Yes, it is but I am trying to see whether or not we could unify that and
>> use a single u64 number to uniquely identify each mapping.
>
> No you cannot; two unrelated executables which have distinct mm_ids can
> easily mmap() the same shared file.
That seems to indicate the mm_ids is not attached to the right level of VM
data structure. But I am okay with keeping it that way and stashing the mm_id
as a pseudo inode number for the case of non file-backed mappings. If we
say maj=min=ino=gen=0 means no "info", then any other combinations can
be used to identify identical mappings. We use actual min,maj, ino, gen
for file backed, and maj=min=gen=0 + ino = mm_ids for the other cases.
That should work, shouldn't it?
On Tue, Oct 08, 2013 at 11:42:06AM +0200, Stephane Eranian wrote:
> On Tue, Oct 8, 2013 at 11:36 AM, Peter Zijlstra <[email protected]> wrote:
> > On Tue, Oct 08, 2013 at 09:15:30AM +0200, Stephane Eranian wrote:
> >> Yes, it is but I am trying to see whether or not we could unify that and
> >> use a single u64 number to uniquely identify each mapping.
> >
> > No you cannot; two unrelated executables which have distinct mm_ids can
> > easily mmap() the same shared file.
>
> That seems to indicate the mm_ids is not attached to the right level of VM
> data structure.
No that's not it.. shared mappings can create arbitrary couplings
between mm's, there's really nothing you can do about that.
This is really only about anonymous memory shared with CLONE_VM.
> But I am okay with keeping it that way and stashing the mm_id
> as a pseudo inode number for the case of non file-backed mappings. If we
> say maj=min=ino=gen=0 means no "info", then any other combinations can
> be used to identify identical mappings. We use actual min,maj, ino, gen
> for file backed, and maj=min=gen=0 + ino = mm_ids for the other cases.
> That should work, shouldn't it?
I'm not sure if min=maj=0 is a valid device, nor do I know if ino=0 is a
valid ino. But I suppose the name "//anon" should be a big enough clue
if both of those fail.
Stephane:
On 9/30/13 9:44 AM, Stephane Eranian wrote:
> I was alerted by people trying to use the PERF_RECORD_MMAP2
> record to disambiguate virtual address mappings that there is a case
> where the record does not contain enough information.
>
> As you know, the MMAP2 record adds the major, minor, ino number,
> inode generation numbers to a mapping. But it does that only for
> file or pseudo -file backed mappings. That covers file mmaps and also
> SYSV shared memory segments.
>
> However there is a another kind of situation that arises in some
> multi-process benchmarks where a region of memory is cloned
> using VM_CLONE. As such, the virtual addresses match between
> the processes but the major, minor, inode, inode generation fields
> are all zeroes because there is no inode associated with the mapping.
> Yet, it is important for the tool to know the mappings between the
> processes are pointing to the same physical data.
>
> We need to cover this case and I am seeking for advice on how to
> best address this need given that we discarded using the plain physical
> address for disambiguation.
If the current MMAP2 is not a complete solution for what you (Google)
need, should support be reverted before 3.12 is released? No sense in
making this part of the forever API if more work is needed on it.
David
* David Ahern <[email protected]> wrote:
> Stephane:
>
> On 9/30/13 9:44 AM, Stephane Eranian wrote:
> >I was alerted by people trying to use the PERF_RECORD_MMAP2
> >record to disambiguate virtual address mappings that there is a case
> >where the record does not contain enough information.
> >
> >As you know, the MMAP2 record adds the major, minor, ino number,
> >inode generation numbers to a mapping. But it does that only for
> >file or pseudo -file backed mappings. That covers file mmaps and also
> >SYSV shared memory segments.
> >
> >However there is a another kind of situation that arises in some
> >multi-process benchmarks where a region of memory is cloned
> >using VM_CLONE. As such, the virtual addresses match between
> >the processes but the major, minor, inode, inode generation fields
> >are all zeroes because there is no inode associated with the mapping.
> >Yet, it is important for the tool to know the mappings between the
> >processes are pointing to the same physical data.
> >
> >We need to cover this case and I am seeking for advice on how to
> >best address this need given that we discarded using the plain physical
> >address for disambiguation.
>
>
> If the current MMAP2 is not a complete solution for what you (Google)
> need, should support be reverted before 3.12 is released? No sense in
> making this part of the forever API if more work is needed on it.
Instead of a full revert we could just turn off the ABI portion minimally
and not recognize it for now. Assuming a more complete solution is in the
works for v3.13.
Thanks,
Ingo
On Tue, Oct 8, 2013 at 9:41 PM, Ingo Molnar <[email protected]> wrote:
>
> * David Ahern <[email protected]> wrote:
>
>> Stephane:
>>
>> On 9/30/13 9:44 AM, Stephane Eranian wrote:
>> >I was alerted by people trying to use the PERF_RECORD_MMAP2
>> >record to disambiguate virtual address mappings that there is a case
>> >where the record does not contain enough information.
>> >
>> >As you know, the MMAP2 record adds the major, minor, ino number,
>> >inode generation numbers to a mapping. But it does that only for
>> >file or pseudo -file backed mappings. That covers file mmaps and also
>> >SYSV shared memory segments.
>> >
>> >However there is a another kind of situation that arises in some
>> >multi-process benchmarks where a region of memory is cloned
>> >using VM_CLONE. As such, the virtual addresses match between
>> >the processes but the major, minor, inode, inode generation fields
>> >are all zeroes because there is no inode associated with the mapping.
>> >Yet, it is important for the tool to know the mappings between the
>> >processes are pointing to the same physical data.
>> >
>> >We need to cover this case and I am seeking for advice on how to
>> >best address this need given that we discarded using the plain physical
>> >address for disambiguation.
>>
>>
>> If the current MMAP2 is not a complete solution for what you (Google)
>> need, should support be reverted before 3.12 is released? No sense in
>> making this part of the forever API if more work is needed on it.
>
> Instead of a full revert we could just turn off the ABI portion minimally
> and not recognize it for now. Assuming a more complete solution is in the
> works for v3.13.
>
That's a possibility. They are also pieces in the perf tool itself.
We could certainly make the attr->mmap2 bit disappear.
I think it boils down to how can we uniquely identify virtual
mapping to the same physical data either via shmat(), files, VM_CLONE.
We had all covered but the last case with the ino approach. We don't have
a solution for VM_CLONE yet.
On 10/8/13 1:54 PM, Stephane Eranian wrote:
>>> If the current MMAP2 is not a complete solution for what you (Google)
>>> need, should support be reverted before 3.12 is released? No sense in
>>> making this part of the forever API if more work is needed on it.
>>
>> Instead of a full revert we could just turn off the ABI portion minimally
>> and not recognize it for now. Assuming a more complete solution is in the
>> works for v3.13.
>>
> That's a possibility. They are also pieces in the perf tool itself.
> We could certainly make the attr->mmap2 bit disappear.
>
> I think it boils down to how can we uniquely identify virtual
> mapping to the same physical data either via shmat(), files, VM_CLONE.
> We had all covered but the last case with the ino approach. We don't have
> a solution for VM_CLONE yet.
>
I was mainly thinking the 2 parts that generate MMAP2 events: kernel
side it's the mmap2 attribute and perf_event_mmap_output; userspace side
it's perf_event__synthesize_mmap_event. Certainly the rest of the
plumbing can be left in place until the solution is refined as needed.
David
* David Ahern <[email protected]> wrote:
> On 10/8/13 1:54 PM, Stephane Eranian wrote:
> >>>If the current MMAP2 is not a complete solution for what you (Google)
> >>>need, should support be reverted before 3.12 is released? No sense in
> >>>making this part of the forever API if more work is needed on it.
> >>
> >>Instead of a full revert we could just turn off the ABI portion minimally
> >>and not recognize it for now. Assuming a more complete solution is in the
> >>works for v3.13.
> >>
> >That's a possibility. They are also pieces in the perf tool itself.
> >We could certainly make the attr->mmap2 bit disappear.
> >
> >I think it boils down to how can we uniquely identify virtual
> >mapping to the same physical data either via shmat(), files, VM_CLONE.
> >We had all covered but the last case with the ino approach. We don't have
> >a solution for VM_CLONE yet.
> >
>
> I was mainly thinking the 2 parts that generate MMAP2 events: kernel
> side it's the mmap2 attribute and perf_event_mmap_output; userspace side
> it's perf_event__synthesize_mmap_event. Certainly the rest of the
> plumbing can be left in place until the solution is refined as needed.
Could some of you please send a patch ASAP?
Thanks,
Ingo
* Stephane Eranian <[email protected]> wrote:
> > Instead of a full revert we could just turn off the ABI portion
> > minimally and not recognize it for now. Assuming a more complete
> > solution is in the works for v3.13.
>
> That's a possibility. They are also pieces in the perf tool itself. We
> could certainly make the attr->mmap2 bit disappear.
>
> I think it boils down to how can we uniquely identify virtual mapping to
> the same physical data either via shmat(), files, VM_CLONE. We had all
> covered but the last case with the ino approach. We don't have a
> solution for VM_CLONE yet.
PeterZ didn't like exposing the physical RAM address, right?
Peter: could we expose the page frame number instead? (Maybe mix it with a
random seed through the hash-mixer and expose that.) User-space will thus
be able to discover whether two pages are shared or not, but not extract
other information.
The global MM ID thing looked rather ugly as well.
That way we could drop the inode info as well? That feels a bit hacky too.
Thanks,
Ingo
On Wed, Oct 09, 2013 at 11:59:06AM +0200, Ingo Molnar wrote:
> PeterZ didn't like exposing the physical RAM address, right?
More than not like; its useless and broken, see below.
> Peter: could we expose the page frame number instead? (Maybe mix it with a
> random seed through the hash-mixer and expose that.) User-space will thus
> be able to discover whether two pages are shared or not, but not extract
> other information.
No, all physical stuff is useless, a page can get moved about through
swap/ksm/thp/compaction/etc.
> The global MM ID thing looked rather ugly as well.
I'm still having the argument with Stephane -- although it fell off-list
-- on why this wouldn't work.
AFAICT the mm_id should work for CLONE_VM, as CLONE_VM inherits the
entire address space so all should be identical when mm_id match.
> That way we could drop the inode info as well? That feels a bit hacky too.
No, the inode stuff is required for shared pages, its the only way to
track them; and barring a filename (which we've so far used for this)
the dev:ino:gen thing is the only way to uniquely identify files -- it
even works where filenames don't, eg. filesystem namespaces and weird
things like shm that don't have stable filenames.