2020-02-14 20:34:25

by Qian Cai

[permalink] [raw]
Subject: [PATCH -next] fork: annotate a data race in vm_area_dup()

struct vm_area_struct could be accessed concurrently as noticed by
KCSAN,

write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
vma_interval_tree_insert+0x101/0x150:
rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
(inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
__vma_link_file+0x6e/0xe0
__vma_link_file at mm/mmap.c:629
vma_link+0xa2/0x120
mmap_region+0x753/0xb90
do_mmap+0x45c/0x710
vm_mmap_pgoff+0xc0/0x130
ksys_mmap_pgoff+0x1d1/0x300
__x64_sys_mmap+0x33/0x40
do_syscall_64+0x91/0xc44
entry_SYSCALL_64_after_hwframe+0x49/0xbe

read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
vm_area_dup+0x6a/0xe0
vm_area_dup at kernel/fork.c:362
__split_vma+0x72/0x2a0
__split_vma at mm/mmap.c:2661
split_vma+0x5a/0x80
mprotect_fixup+0x368/0x3f0
do_mprotect_pkey+0x263/0x420
__x64_sys_mprotect+0x51/0x70
do_syscall_64+0x91/0xc44
entry_SYSCALL_64_after_hwframe+0x49/0xbe

The write is holding mmap_sem while changing vm_area_struct.shared.rb.
Even though the read is lockless while making a copy, the clone will
have its own shared.rb reinitialized. Thus, mark it as an intentional
data race using the data_race() macro.

Signed-off-by: Qian Cai <[email protected]>
---
kernel/fork.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 41f784b6203a..81bdc6e8a6cf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,11 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig)
struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);

if (new) {
- *new = *orig;
+ /*
+ * @orig may be modified concurrently, but the clone will be
+ * reinitialized.
+ */
+ *new = data_race(*orig);
INIT_LIST_HEAD(&new->anon_vma_chain);
}
return new;
--
1.8.3.1


2020-02-17 22:31:38

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Fri, Feb 14, 2020 at 03:33:23PM -0500, Qian Cai wrote:
> struct vm_area_struct could be accessed concurrently as noticed by
> KCSAN,
>
> write to 0xffff9cf8bba08ad8 of 8 bytes by task 14263 on cpu 35:
> vma_interval_tree_insert+0x101/0x150:
> rb_insert_augmented_cached at include/linux/rbtree_augmented.h:58
> (inlined by) vma_interval_tree_insert at mm/interval_tree.c:23
> __vma_link_file+0x6e/0xe0
> __vma_link_file at mm/mmap.c:629
> vma_link+0xa2/0x120
> mmap_region+0x753/0xb90
> do_mmap+0x45c/0x710
> vm_mmap_pgoff+0xc0/0x130
> ksys_mmap_pgoff+0x1d1/0x300
> __x64_sys_mmap+0x33/0x40
> do_syscall_64+0x91/0xc44
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> read to 0xffff9cf8bba08a80 of 200 bytes by task 14262 on cpu 122:
> vm_area_dup+0x6a/0xe0
> vm_area_dup at kernel/fork.c:362
> __split_vma+0x72/0x2a0
> __split_vma at mm/mmap.c:2661
> split_vma+0x5a/0x80
> mprotect_fixup+0x368/0x3f0
> do_mprotect_pkey+0x263/0x420
> __x64_sys_mprotect+0x51/0x70
> do_syscall_64+0x91/0xc44
> entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The write is holding mmap_sem while changing vm_area_struct.shared.rb.
> Even though the read is lockless while making a copy, the clone will
> have its own shared.rb reinitialized. Thus, mark it as an intentional
> data race using the data_race() macro.

I'm confused. AFAICS both sides hold mmap_sem on write:

- vm_mmap_pgoff() takes mmap_sem for the write on the write side

- do_mprotect_pkey() takes mmap_sem for the write on the read side


What do I miss?

--
Kirill A. Shutemov

2020-02-18 04:01:51

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()



> On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <[email protected]> wrote:
>
> I'm confused. AFAICS both sides hold mmap_sem on write:
>
> - vm_mmap_pgoff() takes mmap_sem for the write on the write side
>
> - do_mprotect_pkey() takes mmap_sem for the write on the read side
>
>
> What do I miss?

Ah, good catch. I missed the locking for the read there. This is interesting because Marco
did confirmed that the concurrency could happen,

https://lore.kernel.org/lkml/[email protected]/

If that means KCSAN is not at fault, then I could think of two things,

1) someone downgrades the lock.

I don’t think that a case here. Only __do_munmap() will do that but I did not see how
it will affect us here.

2) the reader and writer are two different processes.

So, they held a different mmap_sem, but I can’t see how could two processes shared
the same vm_area_struct. Also, file->f_mapping->i_mmap was also stored in the
writer, but I can’t see how it was also loaded in the reader.

Any ideas?

2020-02-18 10:30:28

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Mon, Feb 17, 2020 at 10:59:47PM -0500, Qian Cai wrote:
>
>
> > On Feb 17, 2020, at 5:31 PM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > I'm confused. AFAICS both sides hold mmap_sem on write:
> >
> > - vm_mmap_pgoff() takes mmap_sem for the write on the write side
> >
> > - do_mprotect_pkey() takes mmap_sem for the write on the read side
> >
> >
> > What do I miss?
>
> Ah, good catch. I missed the locking for the read there. This is interesting because Marco
> did confirmed that the concurrency could happen,
>
> https://lore.kernel.org/lkml/[email protected]/
>
> If that means KCSAN is not at fault, then I could think of two things,
>
> 1) someone downgrades the lock.
>
> I don’t think that a case here. Only __do_munmap() will do that but I did not see how
> it will affect us here.
>
> 2) the reader and writer are two different processes.
>
> So, they held a different mmap_sem, but I can’t see how could two processes shared
> the same vm_area_struct. Also, file->f_mapping->i_mmap was also stored in the
> writer, but I can’t see how it was also loaded in the reader.
>
> Any ideas?

I think I've got this:

vm_area_dup() blindly copies all fields of orignal VMA to the new one.
This includes coping vm_area_struct::shared.rb which is normally protected
by i_mmap_lock. But this is fine because the read value will be
overwritten on the following __vma_link_file() under proper protectection.

So the fix is correct, but justificaiton is lacking.

Also, I would like to more fine-grained annotation: marking with
data_race() 200 bytes copy may hide other issues.

--
Kirill A. Shutemov

2020-02-18 12:41:58

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()



> On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <[email protected]> wrote:
>
> I think I've got this:
>
> vm_area_dup() blindly copies all fields of orignal VMA to the new one.
> This includes coping vm_area_struct::shared.rb which is normally protected
> by i_mmap_lock. But this is fine because the read value will be
> overwritten on the following __vma_link_file() under proper protectection.

Right, multiple processes could share the same file-based address space where those vma have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers got updated by other processes.

>
> So the fix is correct, but justificaiton is lacking.
>
> Also, I would like to more fine-grained annotation: marking with
> data_race() 200 bytes copy may hide other issues.

That is the harder part where I don’t think we have anything for that today. Macro, any suggestions? ASSERT_IGNORE_FIELD()?

2020-02-18 14:11:18

by Marco Elver

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Tue, 18 Feb 2020 at 13:40, Qian Cai <[email protected]> wrote:
>
>
>
> > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <[email protected]> wrote:
> >
> > I think I've got this:
> >
> > vm_area_dup() blindly copies all fields of orignal VMA to the new one.
> > This includes coping vm_area_struct::shared.rb which is normally protected
> > by i_mmap_lock. But this is fine because the read value will be
> > overwritten on the following __vma_link_file() under proper protectection.
>
> Right, multiple processes could share the same file-based address space where those vma have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers got updated by other processes.
>
> >
> > So the fix is correct, but justificaiton is lacking.
> >
> > Also, I would like to more fine-grained annotation: marking with
> > data_race() 200 bytes copy may hide other issues.
>
> That is the harder part where I don’t think we have anything for that today. Macro, any suggestions? ASSERT_IGNORE_FIELD()?

There is no nice interface I can think of. All options will just cause
more problems, inconsistencies, or annoyances.

Ideally, to not introduce more types of macros and keep it consistent,
ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) maybe what you're after:
"Check no concurrent writers to struct, except ignore the provided
fields".

This option doesn't quite work, unless you just restrict it to 1 field
(we can't use ranges, because e.g. vm_area_struct has
__randomize_layout). The next time around, you'll want 2 fields, and
it won't work. Also, do we know that 'shared.rb' is the only thing we
want to ignore?

If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
for structs with many fields as you'd have to list all of them. It's
similar to what you can already do currently (but I also don't
recommend because it's unmaintainable):

ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
... and so on ...
*new = data_race(*orig);

Also note that vm_area_struct has __randomize_layout, which makes
using ranges impossible. All in all, I don't see a terribly nice
option.

If, however, you knew that there are 1 or 2 fields only that you want
to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
if you want to make sure there are no writers nor _readers_).

Thanks,
-- Marco

2020-02-18 15:01:11

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> On Tue, 18 Feb 2020 at 13:40, Qian Cai <[email protected]> wrote:
> >
> >
> >
> > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <[email protected]> wrote:
> > >
> > > I think I've got this:
> > >
> > > vm_area_dup() blindly copies all fields of orignal VMA to the new one.
> > > This includes coping vm_area_struct::shared.rb which is normally protected
> > > by i_mmap_lock. But this is fine because the read value will be
> > > overwritten on the following __vma_link_file() under proper protectection.
> >
> > Right, multiple processes could share the same file-based address space where those vma have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers got updated by other processes.
> >
> > >
> > > So the fix is correct, but justificaiton is lacking.
> > >
> > > Also, I would like to more fine-grained annotation: marking with
> > > data_race() 200 bytes copy may hide other issues.
> >
> > That is the harder part where I don’t think we have anything for that today. Macro, any suggestions? ASSERT_IGNORE_FIELD()?
>
> There is no nice interface I can think of. All options will just cause
> more problems, inconsistencies, or annoyances.
>
> Ideally, to not introduce more types of macros and keep it consistent,
> ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) maybe what you're after:
> "Check no concurrent writers to struct, except ignore the provided
> fields".
>
> This option doesn't quite work, unless you just restrict it to 1 field
> (we can't use ranges, because e.g. vm_area_struct has
> __randomize_layout). The next time around, you'll want 2 fields, and
> it won't work. Also, do we know that 'shared.rb' is the only thing we
> want to ignore?
>
> If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> for structs with many fields as you'd have to list all of them. It's
> similar to what you can already do currently (but I also don't
> recommend because it's unmaintainable):
>
> ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> ... and so on ...
> *new = data_race(*orig);
>
> Also note that vm_area_struct has __randomize_layout, which makes
> using ranges impossible. All in all, I don't see a terribly nice
> option.
>
> If, however, you knew that there are 1 or 2 fields only that you want
> to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> if you want to make sure there are no writers nor _readers_).

I am testing an idea that just do,

lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
*new = data_race(*orig);

The idea is that as long as we have the exclusive mmap_sem held in all paths
(auditing indicated so), no writer should be able to mess up our vm_area_struct
except the "shared.rb" field which has no harm.

2020-02-18 15:18:35

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> > On Tue, 18 Feb 2020 at 13:40, Qian Cai <[email protected]> wrote:
> > >
> > >
> > >
> > > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <[email protected]> wrote:
> > > >
> > > > I think I've got this:
> > > >
> > > > vm_area_dup() blindly copies all fields of orignal VMA to the new one.
> > > > This includes coping vm_area_struct::shared.rb which is normally protected
> > > > by i_mmap_lock. But this is fine because the read value will be
> > > > overwritten on the following __vma_link_file() under proper protectection.
> > >
> > > Right, multiple processes could share the same file-based address space where those vma have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers got updated by other processes.
> > >
> > > >
> > > > So the fix is correct, but justificaiton is lacking.
> > > >
> > > > Also, I would like to more fine-grained annotation: marking with
> > > > data_race() 200 bytes copy may hide other issues.
> > >
> > > That is the harder part where I don’t think we have anything for that today. Macro, any suggestions? ASSERT_IGNORE_FIELD()?
> >
> > There is no nice interface I can think of. All options will just cause
> > more problems, inconsistencies, or annoyances.
> >
> > Ideally, to not introduce more types of macros and keep it consistent,
> > ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) maybe what you're after:
> > "Check no concurrent writers to struct, except ignore the provided
> > fields".
> >
> > This option doesn't quite work, unless you just restrict it to 1 field
> > (we can't use ranges, because e.g. vm_area_struct has
> > __randomize_layout). The next time around, you'll want 2 fields, and
> > it won't work. Also, do we know that 'shared.rb' is the only thing we
> > want to ignore?
> >
> > If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> > be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> > for structs with many fields as you'd have to list all of them. It's
> > similar to what you can already do currently (but I also don't
> > recommend because it's unmaintainable):
> >
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> > ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> > ... and so on ...
> > *new = data_race(*orig);
> >
> > Also note that vm_area_struct has __randomize_layout, which makes
> > using ranges impossible. All in all, I don't see a terribly nice
> > option.
> >
> > If, however, you knew that there are 1 or 2 fields only that you want
> > to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> > data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> > if you want to make sure there are no writers nor _readers_).
>
> I am testing an idea that just do,
>
> lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
> *new = data_race(*orig);
>
> The idea is that as long as we have the exclusive mmap_sem held in all paths
> (auditing indicated so), no writer should be able to mess up our vm_area_struct
> except the "shared.rb" field which has no harm.

Well, some fields protected by page_table_lock and can be written to
without exclusive mmap_sem. Probably even without any mmap_sem: pin
mm_struct + page_table_lock should be enough.

--
Kirill A. Shutemov

2020-02-18 16:47:01

by Qian Cai

[permalink] [raw]
Subject: Re: [PATCH -next] fork: annotate a data race in vm_area_dup()

On Tue, 2020-02-18 at 18:18 +0300, Kirill A. Shutemov wrote:
> On Tue, Feb 18, 2020 at 10:00:35AM -0500, Qian Cai wrote:
> > On Tue, 2020-02-18 at 15:09 +0100, Marco Elver wrote:
> > > On Tue, 18 Feb 2020 at 13:40, Qian Cai <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > On Feb 18, 2020, at 5:29 AM, Kirill A. Shutemov <[email protected]> wrote:
> > > > >
> > > > > I think I've got this:
> > > > >
> > > > > vm_area_dup() blindly copies all fields of orignal VMA to the new one.
> > > > > This includes coping vm_area_struct::shared.rb which is normally protected
> > > > > by i_mmap_lock. But this is fine because the read value will be
> > > > > overwritten on the following __vma_link_file() under proper protectection.
> > > >
> > > > Right, multiple processes could share the same file-based address space where those vma have been linked into address_space::i_mmap via vm_area_struct::shared.rb. Thus, the reader could see its shared.rb linkage pointers got updated by other processes.
> > > >
> > > > >
> > > > > So the fix is correct, but justificaiton is lacking.
> > > > >
> > > > > Also, I would like to more fine-grained annotation: marking with
> > > > > data_race() 200 bytes copy may hide other issues.
> > > >
> > > > That is the harder part where I don’t think we have anything for that today. Macro, any suggestions? ASSERT_IGNORE_FIELD()?
> > >
> > > There is no nice interface I can think of. All options will just cause
> > > more problems, inconsistencies, or annoyances.
> > >
> > > Ideally, to not introduce more types of macros and keep it consistent,
> > > ASSERT_EXCLUSIVE_FIELDS_EXCEPT(var, ...) maybe what you're after:
> > > "Check no concurrent writers to struct, except ignore the provided
> > > fields".
> > >
> > > This option doesn't quite work, unless you just restrict it to 1 field
> > > (we can't use ranges, because e.g. vm_area_struct has
> > > __randomize_layout). The next time around, you'll want 2 fields, and
> > > it won't work. Also, do we know that 'shared.rb' is the only thing we
> > > want to ignore?
> > >
> > > If you wanted something similar to ASSERT_EXCLUSIVE_BITS, it'd have to
> > > be ASSERT_EXCLUSIVE_FIELDS(var, ...), however, this is quite annoying
> > > for structs with many fields as you'd have to list all of them. It's
> > > similar to what you can already do currently (but I also don't
> > > recommend because it's unmaintainable):
> > >
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_start);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_end);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_next);
> > > ASSERT_EXCLUSIVE_WRITER(orig->vm_prev);
> > > ... and so on ...
> > > *new = data_race(*orig);
> > >
> > > Also note that vm_area_struct has __randomize_layout, which makes
> > > using ranges impossible. All in all, I don't see a terribly nice
> > > option.
> > >
> > > If, however, you knew that there are 1 or 2 fields only that you want
> > > to make sure are not modified concurrently, ASSERT_EXCLUSIVE_WRITER +
> > > data_race() would probably work well (or even ASSERT_EXCLUSIVE_ACCESS
> > > if you want to make sure there are no writers nor _readers_).
> >
> > I am testing an idea that just do,
> >
> > lockdep_assert_held_write(&orig->vm_mm->mmap_sem);
> > *new = data_race(*orig);
> >
> > The idea is that as long as we have the exclusive mmap_sem held in all paths
> > (auditing indicated so), no writer should be able to mess up our vm_area_struct
> > except the "shared.rb" field which has no harm.
>
> Well, some fields protected by page_table_lock and can be written to
> without exclusive mmap_sem. Probably even without any mmap_sem: pin
> mm_struct + page_table_lock should be enough.
>

How about this?

diff --git a/kernel/fork.c b/kernel/fork.c
index cb2ae49e497e..68f273e0ebf4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -359,7 +359,10 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct
*orig)
        struct vm_area_struct *new = kmem_cache_alloc(vm_area_cachep,
GFP_KERNEL);
 
        if (new) {
-               *new = *orig;
+               ASSERT_EXCLUSIVE_WRITER(orig->vm_flags);
+               ASSERT_EXCLUSIVE_WRITER(orig->vm_file);
+
+               *new = data_race(*orig);
                INIT_LIST_HEAD(&new->anon_vma_chain);
                new->vm_next = new->vm_prev = NULL;
        }