2013-06-04 07:22:17

by Frank Mehnert

Subject: Handling NUMA page migration

Hi,

our memory management on Linux hosts conflicts with NUMA page migration.
I assume this problem has existed for a long time, but Linux 3.8 introduced
automatic NUMA page balancing, which makes the problem visible on
multi-node hosts and leads to kernel oopses.

NUMA page migration means that the physical address of a page changes.
This is fatal if the application assumes that this never happens for
that page as it was supposed to be pinned.

We have two kinds of pinned memory:

A) 1. allocate memory in userland with mmap()
   2. madvise(MADV_DONTFORK)
   3. pin with get_user_pages()
   4. flush_dcache_page()
   5. vm_flags |= (VM_DONTCOPY | VM_LOCKED)
   (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
    VM_DONTCOPY | VM_LOCKED | 0xff)

B) 1. allocate memory with alloc_pages()
   2. SetPageReserved()
   3. vm_mmap() to allocate a userspace mapping
   4. vm_insert_page()
   5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
   (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND | 0xff)
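
To make the recipes concrete, here is a condensed, untested sketch of what
our driver does (function names and error handling are simplified and
illustrative, not our actual code):

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/sched.h>

/* Recipe A sketch: userland already did mmap() and madvise(MADV_DONTFORK)
 * (steps 1 and 2); the driver then pins the range. */
static int sketch_pin_user_range(unsigned long uaddr, int npages,
				 struct page **pages)
{
	struct vm_area_struct *vma;
	int rc, i;

	down_write(&current->mm->mmap_sem);
	vma = find_vma(current->mm, uaddr);
	if (!vma) {
		up_write(&current->mm->mmap_sem);
		return -EINVAL;
	}
	rc = get_user_pages(current, current->mm, uaddr, npages,  /* step 3 */
			    1 /* write */, 0 /* force */, pages, NULL);
	if (rc == npages) {
		for (i = 0; i < npages; i++)
			flush_dcache_page(pages[i]);              /* step 4 */
		vma->vm_flags |= VM_DONTCOPY | VM_LOCKED;         /* step 5 */
	}
	up_write(&current->mm->mmap_sem);
	return rc == npages ? 0 : -EFAULT;
}

/* Recipe B sketch: called for the VMA obtained via vm_mmap() (step 3). */
static int sketch_map_kernel_pages(struct vm_area_struct *vma)
{
	unsigned long addr;
	struct page *page;
	int rc;

	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		page = alloc_page(GFP_KERNEL);                    /* step 1 */
		if (!page)
			return -ENOMEM;
		SetPageReserved(page);                            /* step 2 */
		rc = vm_insert_page(vma, addr, page);             /* step 4 */
		if (rc)
			return rc;
	}
	vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;             /* step 5 */
	return 0;
}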

At least memory allocated as in B) is affected by automatic NUMA page
migration. I'm not sure about A).

1. How can I prevent automatic NUMA page migration on this memory?
2. Can NUMA page migration be handled on this kind of memory without
   preventing migration?

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany

Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603
Geschäftsführer: Jürgen Kunz

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher


2013-06-04 11:58:11

by Robin Holt

Subject: Re: Handling NUMA page migration

This is probably more appropriate to be directed at the linux-mm
mailing list.

On Tue, Jun 04, 2013 at 09:22:10AM +0200, Frank Mehnert wrote:
> [...]
> A) 1. allocate memory in userland with mmap()
> 2. madvise(MADV_DONTFORK)
> 3. pin with get_user_pages().
> 4. flush dcache_page()
> 5. vm_flags |= (VM_DONTCOPY | VM_LOCKED)
> (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> VM_DONTCOPY | VM_LOCKED | 0xff)

I don't think this type of allocation should be affected. The
get_user_pages() call should elevate the page reference counts, which
should prevent migration from completing. I would, however, wait for
a more definitive answer.
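
For reference, the guard I have in mind sits in the migration core and
looks roughly like this (paraphrased from memory, not verbatim 3.8 source):

/* Paraphrased shape of the reference-count check in
 * migrate_page_move_mapping(): migration bails out with -EAGAIN unless
 * it holds the only expected references, so the extra reference taken
 * by get_user_pages() keeps the page where it is. */
static int sketch_refcount_guard(struct page *page, int expected_count)
{
	if (page_count(page) != expected_count)
		return -EAGAIN;	/* pinned elsewhere: do not migrate */
	return 0;		/* safe to replace the page */
}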

> B) 1. allocate memory with alloc_pages()
> 2. SetPageReserved()
> 3. vm_mmap() to allocate a userspace mapping
> 4. vm_insert_page()
> 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND | 0xff)
>
> At least the memory allocated like B) is affected by automatic NUMA page
> migration. I'm not sure about A).
>
> 1. How can I prevent automatic NUMA page migration on this memory?
> 2. Can NUMA page migration also be handled on such kind of memory without
> preventing migration?

2013-06-04 12:15:08

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Tuesday 04 June 2013 13:58:07 Robin Holt wrote:
> This is probably more appropriate to be directed at the linux-mm
> mailing list.
>
> On Tue, Jun 04, 2013 at 09:22:10AM +0200, Frank Mehnert wrote:
> > [...]
>
> I don't think this type of allocation should be affected. The
> get_user_pages() call should elevate the pages reference count which
> should prevent migration from completing. I would, however, wait for
> a more definitive answer.

Thanks Robin! Actually case B) is more important for us so I'm waiting
for more feedback :)

Frank


--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany



2013-06-04 13:34:49

by Robin Holt

Subject: Re: Handling NUMA page migration

On Tue, Jun 04, 2013 at 02:14:45PM +0200, Frank Mehnert wrote:
> [...]
>
> Thanks Robin! Actually case B) is more important for us so I'm waiting
> for more feedback :)

If you have a good test case, you might want to try adding a get_page()
in there to see if that mitigates the problem. It would at least be
interesting to know if it has an effect.
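
That is, something along these lines in your case B setup (untested):

	page = alloc_page(GFP_KERNEL);
	if (!page)
		return -ENOMEM;
	SetPageReserved(page);
	get_page(page);	/* extra reference: migration should not complete */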

Robin




Attachments:
(No filename) (3.17 kB)
signature.asc (836.00 B)
Digital signature
Download all attachments

2013-06-04 14:02:35

by Michal Hocko

Subject: Re: Handling NUMA page migration

On Tue 04-06-13 14:14:45, Frank Mehnert wrote:
> [...]
>
> Thanks Robin! Actually case B) is more important for us so I'm waiting
> for more feedback :)

The manual node migration code seems to be OK in case B as well, because
Reserved pages are skipped (see check_pte_range, called from the
do_migrate_pages path).

Maybe the auto-NUMA code is missing this check, assuming that it cannot
encounter reserved pages.

migrate_misplaced_page relies on numamigrate_isolate_page, which relies
on isolate_lru_page, and that one expects an LRU page. Is your Reserved
page on the LRU list? That would be a bit unexpected.
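
Condensed, that path looks roughly like this (paraphrased, not verbatim
3.8 source):

/* Paraphrased shape of numamigrate_isolate_page() in mm/migrate.c:
 * isolate_lru_page() returns non-zero for pages that are not on the
 * LRU, so such pages are never handed to the migration code. */
static int sketch_numamigrate_isolate_page(pg_data_t *pgdat,
					   struct page *page)
{
	if (isolate_lru_page(page))	/* fails for non-LRU pages */
		return 0;		/* not isolated, not migrated */
	inc_zone_page_state(page, NR_ISOLATED_ANON + page_is_file_cache(page));
	return 1;			/* isolated; caller may migrate */
}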
--
Michal Hocko
SUSE Labs

2013-06-04 15:48:35

by Jerome Glisse

Subject: Re: Handling NUMA page migration

On Tue, Jun 04, 2013 at 02:14:45PM +0200, Frank Mehnert wrote:
> [...]

I was looking at the migration code lately, and while I am not at all an
expert in this area, I think there is a bug in the way handle_mm_fault
deals, or rather does not deal, with migration entries.

When a huge page is migrated, its pmd is replaced with a special swap-entry
pmd, which is a non-zero pmd that has none of the huge-pmd flags set, so
none of the handle_mm_fault paths detects it as a swap entry. They believe
it is a valid pmd and try to allocate a pte under it, which should oops.

The attached patch is what I believe should be done (not even
compile-tested).

Again, I might be missing a subtlety somewhere else and just missed where
huge migration entries are dealt with.

Cheers,
Jerome


Attachments:
0001-mm-properly-handle-fault-on-huge-page-migration.patch (1.35 kB)

2013-06-04 17:49:45

by Jerome Glisse

Subject: Re: Handling NUMA page migration

On Tue, Jun 4, 2013 at 11:45 AM, Jerome Glisse <[email protected]> wrote:
> [...]

Never mind, I was missing something: hugetlb_fault will handle it.

Cheers,
Jerome

2013-06-04 18:17:22

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Tuesday 04 June 2013 16:02:30 Michal Hocko wrote:
> [...]
>
> The manual node migration code seems to be OK in case B as well, because
> Reserved pages are skipped (see check_pte_range, called from the
> do_migrate_pages path).
>
> Maybe the auto-NUMA code is missing this check, assuming that it cannot
> encounter reserved pages.
>
> migrate_misplaced_page relies on numamigrate_isolate_page, which relies
> on isolate_lru_page, and that one expects an LRU page. Is your Reserved
> page on the LRU list? That would be a bit unexpected.

I will check this.

In the meantime I verified that my testcase does not fail if I pass
'numa_balancing=false' to the kernel, so it's definitely a NUMA balancing
problem.

I also did get_page() on all pages of method B, but the testcase still
failed, so this didn't help.

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-04 21:55:02

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Tuesday 04 June 2013 20:17:02 Frank Mehnert wrote:
> On Tuesday 04 June 2013 16:02:30 Michal Hocko wrote:
> > [...]
> > migrate_misplaced_page relies on numamigrate_isolate_page, which relies
> > on isolate_lru_page, and that one expects an LRU page. Is your Reserved
> > page on the LRU list? That would be a bit unexpected.
>
> I will check this.

I tested this now. When the Oops happens, PageLRU() of the corresponding
page struct is NOT set! I've patched the kernel to find that out. This is
case B from my original mail (alloc_pages(), SetPageReserved(), vm_mmap(),
vm_insert_page(), vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)) and PageLRU()
was clear after vm_insert_page().

Example of such an oops (the present bits of PMD and PTE are clear):

BUG: unable to handle kernel paging request at 00007ff493c7eff8
IP: [<ffffffffa039e17f>] 0xffffffffa039e17e
PGD 201b068067 PUD 381c082067 PMD 20063d2166 PTE 8000002005da9166
Oops: 0000 [#1] SMP
Modules linked in: pci_stub vboxpci(OF) vboxnetadp(OF) vboxnetflt(OF)
vboxdrv(OF) md4 nls_utf8 cifs fscache vesafb kvm_amd kvm psmouse serio_raw
microcode ib_mthca ib_mad ib_core amd64_edac_mod edac_core k10temp
edac_mce_amd joydev shpchp mac_hid lp parport i2c_nforce2 hid_generic usbhid
hid mptsas mptscsih mptbase scsi_transport_sas e1000 pata_acpi pata_amd
CPU 24
Pid: 2058, comm: EMT Tainted: GF O 3.8.0-23-generic #34 Sun
Microsystems Sun Fire X4600 M2/Sun Fire X4600 M2
RIP: 0010:[<ffffffffa039e17f>] [<ffffffffa039e17f>] 0xffffffffa039e17e
RSP: 0018:ffff88381bac1968 EFLAGS: 00010202
RAX: 00007ff493c7eff8 RBX: ffff88381bac1998 RCX: 0000000000000000
RDX: 0000000000000ff8 RSI: 0000000000000000 RDI: ffff88381bac1a18
RBP: ffff88381bac1988 R08: ffffc90029981000 R09: ffffc9002999c000
R10: ffff88381bac1998 R11: ffffffffa037aee0 R12: ffffc9002999c000
R13: ffffffffa002f98d R14: ffffffffa002f98d R15: ffffc9002999c000
FS: 00007ff4f59b7700(0000) GS:ffff883827c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007ff493c7eff8 CR3: 000000201b06f000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process EMT (pid: 2058, threadinfo ffff88381bac0000, task ffff88381b840000)
Stack:
0000000000000000 ffff88381bac1a60 ffff88381bac1ab8 ffffffffa002f98d
ffff88381bac1a28 ffffffffa039e5bd ffffffffa002f98d 0000000000000000
0000000000000000 0000000000000000 00007ff493c7e000 00007ff493c7eff8

Any more ideas? I'm happy to perform more tests.

Thanks,

Frank


--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-05 07:54:58

by Michal Hocko

Subject: Re: Handling NUMA page migration

On Tue 04-06-13 23:54:45, Frank Mehnert wrote:
> [...]
> I tested this now. When the Oops happens,

You didn't mention Oops before. Are you sure you are just not missing
any follow up fix?

> PageLRU() of the corresponding page struct is NOT set! I've patched
> the kernel to find that out.

At which state? When you set up your page or when the Oops happens?
Are you sure that your out-of-tree code plays well with the migration
code?

> This is case B from my original mail (alloc_pages(),
> SetPageReserved(), vm_mmap(), vm_insert_page(), vm_flags |=
> (VM_DONTEXPAND | VM_DONTDUMP)) and PageLRU() was clear after
> vm_insert_page().
>
> Example of such an oops (the present bits of PMD and PTE are clear):
>
> BUG: unable to handle kernel paging request at 00007ff493c7eff8

This is of no use: a) the stack trace is missing, and b) even if there
were one, you seem to have symbol names disabled, so you need to enable
CONFIG_KALLSYMS.

> [...]



--
Michal Hocko
SUSE Labs

2013-06-05 08:34:36

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 09:54:54 Michal Hocko wrote:
> On Tue 04-06-13 23:54:45, Frank Mehnert wrote:
> > On Tuesday 04 June 2013 20:17:02 Frank Mehnert wrote:
> > > On Tuesday 04 June 2013 16:02:30 Michal Hocko wrote:
> > > > On Tue 04-06-13 14:14:45, Frank Mehnert wrote:
> > > > > On Tuesday 04 June 2013 13:58:07 Robin Holt wrote:
> > > > > > This is probably more appropriate to be directed at the linux-mm
> > > > > > mailing list.
> > > > > >
> > > > > > On Tue, Jun 04, 2013 at 09:22:10AM +0200, Frank Mehnert wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > our memory management on Linux hosts conflicts with NUMA page
> > > > > > > migration. I assume this problem existed for a longer time but
> > > > > > > Linux 3.8 introduced automatic NUMA page balancing which makes
> > > > > > > the problem visible on multi-node hosts leading to kernel
> > > > > > > oopses.
> > > > > > >
> > > > > > > NUMA page migration means that the physical address of a page
> > > > > > > changes. This is fatal if the application assumes that this
> > > > > > > never happens for that page as it was supposed to be pinned.
> > > > > > >
> > > > > > > We have two kind of pinned memory:
> > > > > > >
> > > > > > > A) 1. allocate memory in userland with mmap()
> > > > > > >
> > > > > > > 2. madvise(MADV_DONTFORK)
> > > > > > > 3. pin with get_user_pages().
> > > > > > > 4. flush dcache_page()
> > > > > > > 5. vm_flags |= (VM_DONTCOPY | VM_LOCKED)
> > > > > > >
> > > > > > > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP |
> > > > > > > VM_DONTEXPAND
> > > > > > >
> > > > > > > VM_DONTCOPY | VM_LOCKED | 0xff)
> > > > > >
> > > > > > I don't think this type of allocation should be affected. The
> > > > > > get_user_pages() call should elevate the pages reference count
> > > > > > which should prevent migration from completing. I would,
> > > > > > however, wait for a more definitive answer.
> > > > >
> > > > > Thanks Robin! Actually case B) is more important for us so I'm
> > > > > waiting for more feedback :)
> > > >
> > > > The manual node migration code seems to be OK in case B as well
> > > > because Reserved are skipped (check check_pte_range called from down
> > > > the do_migrate_pages path).
> > > >
> > > > Maybe auto-numa code is missing this check assuming that it cannot
> > > > encounter reserved pages.
> > > >
> > > > migrate_misplaced_page relies on numamigrate_isolate_page which
> > > > relies on isolate_lru_page and that one expects a LRU page. Is your
> > > > Reserved page on the LRU list? That would be a bit unexpected.
> > >
> > > I will check this.
> >
> > I tested this now. When the Oops happens,
>
> You didn't mention Oops before. Are you sure you are just not missing
> any follow up fix?

Sorry, but remember, this is on a host running VirtualBox which is
executing code in ring 0.

> > PageLRU() of the corresponding page struct is NOT set! I've patched
> > the kernel to find that out.
>
> At which state? When you setup your page or when the Oops happens?
> Are you sure that your out-of-tree code plays well with the migration
> code?

I've added code to show_fault_oops(). This code determines the page struct
for the address where the ring 0 page fault happened. It then prints
the value of PageLRU(page) from that page struct as part of the Oops.
This was to check if the page is part of the LRU list or not. I hope
I did this right.
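
Roughly like this (simplified sketch; my actual debug patch resolves the
struct page slightly differently):

/* Debug aid near show_fault_oops(): look up the page backing the
 * faulting user address and report its LRU/Reserved state. */
static void report_page_state(unsigned long address)
{
	struct vm_area_struct *vma;
	struct page *page;

	if (!current->mm)
		return;
	vma = find_vma(current->mm, address);
	if (!vma || address < vma->vm_start)
		return;
	page = follow_page(vma, address, 0);
	if (page && !IS_ERR(page))
		printk(KERN_ALERT "PageLRU=%d PageReserved=%d\n",
		       PageLRU(page), PageReserved(page));
}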

> > This is case B from my original mail (alloc_pages(),
> > SetPageReserved(), vm_mmap(), vm_insert_page(), vm_flags |=
> > (VM_DONTEXPAND | VM_DONTDUMP)) and PageLRU() was clear after
> > vm_insert_page().
> >
> > Example of such an oops (the present bits of PMD and PTE are clear):
> >
> > BUG: unable to handle kernel paging request at 00007ff493c7eff8
>
> This is of no use. a) the strack trace is missing and b) even if there
> was one you seem to have symbol names disabled so you need to enable
> CONFIG_KALLSYMS.

There is no need to debug the kernel page fault; I already know it happens
inside the VirtualBox kernel code.

All I'm asking is how to debug this problem and how our code for
allocating memory may conflict with automatic NUMA page balancing.
These oopses are only triggered with automatic NUMA balancing.

I'm currently doing more tests but suggestions are welcome.

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-05 08:56:42

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 10:34:13 Frank Mehnert wrote:
> On Wednesday 05 June 2013 09:54:54 Michal Hocko wrote:
> > [...]

Just to repeat it for reference:

A) 1. allocate memory in userland with mmap()
   2. madvise(MADV_DONTFORK)
   3. pin with get_user_pages()
   4. flush_dcache_page()
   5. vm_flags |= (VM_DONTCOPY | VM_LOCKED)
   (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
    VM_DONTCOPY | VM_LOCKED | 0xff)

B) 1. allocate memory with alloc_pages()
   2. SetPageReserved()
   3. vm_mmap() to allocate a userspace mapping
   4. vm_insert_page()
   5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
   (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND | 0xff)

The frequent case is B.

I've just disabled CONFIG_TRANSPARENT_HUGEPAGE for testing purposes and
the Oops is still triggered when running my testcase.

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-05 09:10:52

by Michal Hocko

Subject: Re: Handling NUMA page migration

On Wed 05-06-13 10:34:13, Frank Mehnert wrote:
> On Wednesday 05 June 2013 09:54:54 Michal Hocko wrote:
> > On Tue 04-06-13 23:54:45, Frank Mehnert wrote:
> > > [...]
> > > I tested this now. When the Oops happens,
> >
> > You didn't mention Oops before. Are you sure you are just not missing
> > any follow up fix?
>
> Sorry, but remember, this is on a host running VirtualBox which is
> executing code in ring 0.

Then the problem might be almost anywhere... I am afraid I cannot help
you much with that. Good luck.

> > > PageLRU() of the corresponding page struct is NOT set! I've patched
> > > the kernel to find that out.
> >
> > At which state? When you setup your page or when the Oops happens?
> > Are you sure that your out-of-tree code plays well with the migration
> > code?
>
> I've added code to show_fault_oops(). This code determines the page struct
> for the address where the ring 0 page fault happened. It then prints
> the value of PageLRU(page) from that page struct as part of the Oops.
> This was to check if the page is part of the LRU list or not. I hope
> I did this right.

I am not sure this will tell you much. Your code would have to trip over
a page affected by the migration. And nothing indicates this so far.
--
Michal Hocko
SUSE Labs

2013-06-05 09:32:31

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 11:10:48 Michal Hocko wrote:
> On Wed 05-06-13 10:34:13, Frank Mehnert wrote:
> > On Wednesday 05 June 2013 09:54:54 Michal Hocko wrote:
> > > On Tue 04-06-13 23:54:45, Frank Mehnert wrote:
> > > > [...]
> > > > I tested this now. When the Oops happens,
> > >
> > > You didn't mention Oops before. Are you sure you are just not missing
> > > any follow up fix?
> >
> > Sorry, but remember, this is on a host running VirtualBox which is
> > executing code in ring 0.
>
> Then the problem might be almost anywhere... I am afraid I cannot help
> you much with that. Good luck.

Thank you very much for your help. As I said, this problem happens _only_
with NUMA_BALANCING enabled. I understand that you treat the VirtualBox
code as untrusted, but the reason for the problem is that some assumption
is obviously not met: the VirtualBox code assumes that the memory it
allocates using case A and case B is

1. always present and
2. always backed by the same physical memory

over its entire lifetime. Enabling NUMA_BALANCING seems to make this
assumption false. I only want to know why.

I posted the snippet of the Oops above to show that some present bits are
not set (in that case in the PMD and PTE); the question is why.

> > > > PageLRU() of the corresponding page struct is NOT set! I've patched
> > > > the kernel to find that out.
> > >
> > > At which state? When you setup your page or when the Oops happens?
> > > Are you sure that your out-of-tree code plays well with the migration
> > > code?
> >
> > I've added code to show_fault_oops(). This code determines the page
> > struct for the address where the ring 0 page fault happened. It then
> > prints the value of PageLRU(page) from that page struct as part of the
> > Oops. This was to check if the page is part of the LRU list or not. I
> > hope I did this right.
>
> I am not sure this will tell you much. Your code would have to trip over
> a page affected by the migration. And nothing indicates this so far.

I see, you don't believe me. I will add more code to the kernel to log
which pages were migrated.

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-05 09:56:33

by Michal Hocko

Subject: Re: Handling NUMA page migration

On Wed 05-06-13 11:32:15, Frank Mehnert wrote:
[...]
> Thank you very much for your help. As I said, this problem happens _only_
> with NUMA_BALANCING enabled. I understand that you treat the VirtualBox
> code as untrusted but the reason for the problem is that some assumption
> is obviously not met: The VirtualBox code assumes that the memory it
> allocates using case A and case B is
>
> 1. always present and
> 2. will always be backed by the same physical memory
>
> over the entire life time. Enabling NUMA_BALANCING seems to make this
> assumption false. I only want to know why.

As I said earlier, both the manual node migration and the numa_fault
handler do not migrate pages with an elevated ref count (your A case) or
pages that are not on the LRU. So if your Reserved pages might be on the
LRU, then you probably have to look into numamigrate_isolate_page and add
an exception for PageReserved pages. But I am a bit suspicious that this
is the cause, because reclaim doesn't consider PageReserved pages either,
so they could get reclaimed. Or maybe you have handled that path in your
kernel.

Or the other option is that you depend on timing or something like that
which no longer holds. That would be hard to debug, though.

> I see, you don't believe me. I will add more code to the kernel logging
> which pages were migrated.

A simple test for the PageReserved flag in numamigrate_isolate_page should
tell you more.
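
Something like this at the top of numamigrate_isolate_page (untested):

	/* probe: did automatic NUMA balancing pick up a Reserved page? */
	if (PageReserved(page)) {
		WARN_ONCE(1, "numa balancing isolating a Reserved page\n");
		return 0;	/* refuse to isolate it */
	}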

This would cover the migration part. Another potential problem could be
that the page might get unmapped and marked for a numa fault (see
do_numa_page). So maybe your code just assumes that the page never gets
unmapped?
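
For reference, on x86 a NUMA-hinting pte looks roughly like this
(paraphrased from the 3.8 pte_numa() helper):

/* A NUMA-hinting pte has _PAGE_NUMA set and _PAGE_PRESENT clear, so
 * code that inspects the pte directly sees a non-present entry until
 * do_numa_page() migrates (or not) and restores a present pte. */
static inline int sketch_pte_numa(pte_t pte)
{
	return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}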
--
Michal Hocko
SUSE Labs

2013-06-05 10:10:24

by Mel Gorman

Subject: Re: Handling NUMA page migration

On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > B) 1. allocate memory with alloc_pages()
> > 2. SetPageReserved()
> > 3. vm_mmap() to allocate a userspace mapping
> > 4. vm_insert_page()
> > 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND | 0xff)
> >
> > At least the memory allocated like B) is affected by automatic NUMA page
> > migration. I'm not sure about A).
> >
> > 1. How can I prevent automatic NUMA page migration on this memory?
> > 2. Can NUMA page migration also be handled on such kind of memory without
> > preventing migration?
> >

Page migration does not expect a PageReserved && PageLRU page. The only
reserved check that is made by migration is for the zero page and that
happens in the syscall path for move_pages() which is not used by either
compaction or automatic balancing.

At some point you must have a driver that is setting PageReserved on
anonymous pages that is later encountered by automatic numa balancing
during a NUMA hinting fault. I expect this is an out-of-tree driver or
a custom kernel of some sort. Memory should be pinned by elevating the
reference count of the page, not setting PageReserved.
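
That is, take real references on the pages, roughly (untested sketch):

	/* pin by reference count: one get_page() per pinned page,  */
	/* paired with put_page() when the memory is torn down      */
	for (i = 0; i < npages; i++)
		get_page(pages[i]);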

To be honest, it's not particularly clear how you avoid hitting the same
bug due to THP and memory compaction, but maybe your setup reaches a steady
state that simply never hits the problem, or it happens rarely and was not
identified.

--
Mel Gorman
SUSE Labs

2013-06-05 10:23:08

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 11:56:30 Michal Hocko wrote:
> On Wed 05-06-13 11:32:15, Frank Mehnert wrote:
> [...]
>
> > [...]
>
> As I said earlier. Both the manual node migration and numa_fault handler
> do not migrate pages with elevated ref count (your A case) and pages
> that are not on the LRU. So if your Referenced pages might be on the LRU
> then you probably have to look into numamigrate_isolate_page and do an
> exception for PageReserved pages. But I am a bit suspicious this is the
> cause because the reclaim doesn't consider PageReserved pages either so
> they could get reclaimed. Or maybe you have handled that path in your
> kernel.

Thanks, I will also investigate in this direction.

> Or the other option is that you depend on a timing or something like
> that which doesn't hold anymore. That would be hard to debug though.
>
> > I see, you don't believe me. I will add more code to the kernel logging
> > which pages were migrated.
>
> Simple test for PageReserved flag in numamigrate_isolate_page should
> tell you more.
>
> This would cover the migration part. Another potential problem could be
> that the page might get unmapped and marked for the numa fault (see
> do_numa_page). So maybe your code just assumes that the page never
> gets unmapped?

Exactly, that's the assumption -- therefore all these vm_flags tricks.
If this assumption is wrong or not always true, can this requirement
(page is _never_ unmapped) be met at all?

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany




2013-06-05 10:35:51

by Frank Mehnert

Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 12:10:19 Mel Gorman wrote:
> On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > > B) 1. allocate memory with alloc_pages()
> > >
> > > 2. SetPageReserved()
> > > 3. vm_mmap() to allocate a userspace mapping
> > > 4. vm_insert_page()
> > > 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > >
> > > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> > > 0xff)
> > >
> > > At least the memory allocated like B) is affected by automatic NUMA
> > > page migration. I'm not sure about A).
> > >
> > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > 2. Can NUMA page migration also be handled on such kind of memory
> > > without
> > >
> > > preventing migration?
>
> Page migration does not expect a PageReserved && PageLRU page. The only
> reserved check that is made by migration is for the zero page and that
> happens in the syscall path for move_pages() which is not used by either
> compaction or automatic balancing.
>
> At some point you must have a driver that is setting PageReserved on
> anonymous pages that are later encountered by automatic numa balancing
> during a NUMA hinting fault. I expect this is an out-of-tree driver or
> a custom kernel of some sort. Memory should be pinned by elevating the
> reference count of the page, not by setting PageReserved.

Yes, this is ring 0 code from VirtualBox. The VBox ring 0 driver performs
the steps shown above. Setting PageReserved is not only for pinning
but also for fork() protection. I've tried to do get_page() as well but
it did not help prevent the migration during NUMA balancing.

As I wrote, the code for allocating + mapping the memory assumes that
the memory is finally pinned and will never be unmapped. That assumption
might be wrong in general, or wrong under certain/rare conditions. I would
like to know these conditions and how we can prevent them from happening,
or how we can handle them correctly.

> It's not particularly clear how you avoid hitting the same bug due to THP
> and memory compaction, to be honest, but maybe your setup hits a steady
> state that simply never hits the problem, or it happens rarely and was
> not identified.

I'm currently using the stock Ubuntu 13.04 generic kernel (3.8.0-23),
patched with some additional logging code. It is true that this problem
could also be triggered by other kernel mechanisms as you described.

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany

Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603
Geschäftsführer: Jürgen Kunz

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher



2013-06-05 11:41:17

by Michal Hocko

[permalink] [raw]
Subject: Re: Handling NUMA page migration

On Wed 05-06-13 03:22:32, Frank Mehnert wrote:
> On Wednesday 05 June 2013 11:56:30 Michal Hocko wrote:
> > On Wed 05-06-13 11:32:15, Frank Mehnert wrote:
> > [...]
> >
> > > Thank you very much for your help. As I said, this problem happens _only_
> > > with NUMA_BALANCING enabled. I understand that you treat the VirtualBox
> > > code as untrusted but the reason for the problem is that some assumption
> > > is obviously not met: The VirtualBox code assumes that the memory it
> > > allocates using case A and case B is
> > >
> > > 1. always present and
> > > 2. will always be backed by the same physical memory
> > >
> > > over the entire lifetime. Enabling NUMA_BALANCING seems to make this
> > > assumption false. I only want to know why.
> >
> > As I said earlier, both the manual node migration and the numa_fault handler
> > do not migrate pages with an elevated ref count (your case A) or pages
> > that are not on the LRU. So if your Reserved pages might be on the LRU,
> > then you probably have to look into numamigrate_isolate_page and make an
> > exception for PageReserved pages. But I am a bit suspicious this is the
> > cause, because reclaim doesn't consider PageReserved pages either, so
> > they could get reclaimed. Or maybe you have handled that path in your
> > kernel.
>
> Thanks, I will also investigate in this direction.
>
> > Or the other option is that you depend on timing or something like
> > that which doesn't hold anymore. That would be hard to debug though.
> >
> > > I see, you don't believe me. I will add more code to the kernel logging
> > > which pages were migrated.
> >
> > A simple test for the PageReserved flag in numamigrate_isolate_page should
> > tell you more.
> >
> > This would cover the migration part. Another potential problem could be
> > that the page might get unmapped and marked for the numa fault (see
> > do_numa_page). So maybe your code just assumes that the page doesn't
> > even get unmapped?
>
> Exactly, that's the assumption -- therefore all these vm_flags tricks.
> If this assumption is wrong or not always true, can this requirement
> (page is _never_ unmapped) be met at all?

Yes, just pin the page with get_page(). Reserved pages are usually not
touched because they are not sitting on the LRU (that just doesn't make
any sense; why would we age such pages in the first place?).
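
For your case B that would mean something like this (sketch only; the
extra reference replaces the SetPageReserved() trick):

	struct page *page = alloc_pages(GFP_HIGHUSER | __GFP_ZERO, 0);
	if (!page)
		return -ENOMEM;
	/* Extra reference: migration sees an unexpected refcount and
	 * backs off, so the physical page cannot change underneath. */
	get_page(page);

	/* ... vm_mmap() / vm_insert_page() as before ... */

	/* Teardown, after the mapping is gone: drop the pin, then
	 * free the original allocation. */
	put_page(page);
	__free_pages(page, 0);
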
--
Michal Hocko
SUSE Labs

2013-06-05 12:34:07

by Mel Gorman

[permalink] [raw]
Subject: Re: Handling NUMA page migration

On Wed, Jun 05, 2013 at 12:35:35PM +0200, Frank Mehnert wrote:
> On Wednesday 05 June 2013 12:10:19 Mel Gorman wrote:
> > On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > > > B) 1. allocate memory with alloc_pages()
> > > >
> > > > 2. SetPageReserved()
> > > > 3. vm_mmap() to allocate a userspace mapping
> > > > 4. vm_insert_page()
> > > > 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > > >
> > > > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP | VM_DONTEXPAND |
> > > > 0xff)
> > > >
> > > > At least the memory allocated like B) is affected by automatic NUMA
> > > > page migration. I'm not sure about A).
> > > >
> > > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > > 2. Can NUMA page migration also be handled on such kind of memory
> > > > without
> > > >
> > > > preventing migration?
> >
> > Page migration does not expect a PageReserved && PageLRU page. The only
> > reserved check that is made by migration is for the zero page and that
> > happens in the syscall path for move_pages() which is not used by either
> > compaction or automatic balancing.
> >
> > At some point you must have a driver that is setting PageReserved on
> > anonymous pages that are later encountered by automatic numa balancing
> > during a NUMA hinting fault. I expect this is an out-of-tree driver or
> > a custom kernel of some sort. Memory should be pinned by elevating the
> > reference count of the page, not by setting PageReserved.
>
> Yes, this is ring 0 code from VirtualBox. The VBox ring 0 driver performs
> the steps shown above. Setting PageReserved is not only for pinning
> but also for fork() protection.

Offhand I don't see what setting PageReserved on an LRU page has to do
with fork() protection. If the VMA should not be copied by fork then use
MADV_DONTFORK.
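
That is, from userspace (a sketch; addr and len are whatever the
existing mmap() call produced):

	#include <sys/mman.h>

	/* The VMA is simply not copied into the child at fork(), so
	 * no COW sharing of the pinned pages can happen. */
	if (madvise(addr, len, MADV_DONTFORK) < 0)
		perror("madvise(MADV_DONTFORK)");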

> I've tried to do get_page() as well but
> it did not help prevent the migration during NUMA balancing.
>

I think you mean elevating the page count did not prevent the unmapping. The
elevated count should have prevented the actual migration but would not
prevent the unmapping.

> As I wrote, the code for allocating + mapping the memory assumes that
> the memory is finally pinned and will never be unmapped. That assumption
> might be wrong in general, or wrong under certain/rare conditions. I would
> like to know these conditions and how we can prevent them from happening,
> or how we can handle them correctly.

Memory compaction for THP allocations will break that assumption as
compaction ignores VM_LOCKED. I strongly suspect that if you did something
like move a process into a cpuset bound to another node, it would
also break. If a process like numad is running then it would probably
break VirtualBox as well, as it triggers migration from userspace. It is
a fragile assumption to make.

> > It's not particularly clear how you avoid hitting the same bug due to THP
> > and memory compaction, to be honest, but maybe your setup hits a steady
> > state that simply never hits the problem, or it happens rarely and was
> > not identified.
>
> I'm currently using the stock Ubuntu 13.04 generic kernel (3.8.0-23),

and an out-of-tree driver which is what is hitting the problem.

A few of your options, in order of estimated time to completion, are:

1. Disable numa balancing within your driver or fail to start if it's
running
2. Create a patch that adds a new NUMA_PTE_SCAN_IGNORE value for
mm->first_nid (see include/linux/mm_types.h). In kernel/sched/fair.c,
add a check that skips any mm with first_nid == NUMA_PTE_SCAN_IGNORE.
Document that only virtualbox needs this and set it within your
driver. This will not fix the compaction cases or numad using cpusets
to migrate your processes though
3. When the driver affects a region, set mm->numa_next_reset and
mm->numa_next_scan to large values to prevent the pages being unmapped.
This would be very fragile, could break again in the future and is ugly
4. Add a check in change_pte_range() for the !prot_numa case to check
PageReserved. This will prevent automatic numa balancing unmapping the
page. Document that only virtualbox requires this.
5. Add a check in change_pte_range() for an elevated page count.
Document that there is no point unmapping a page for a NUMA hinting
fault that will only fail migration later anyway, which is true albeit of
marginal benefit. Then, in the vbox driver, elevate the page count, do
away with the PageReserved trick, use MADV_DONTFORK to prevent copying
at fork time.
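
A rough sketch of what 5 could look like in the prot_numa branch of
change_pte_range() (the refcount heuristic here assumes a mapped
anonymous page that is not in the swap cache):

	struct page *page;

	page = vm_normal_page(vma, addr, oldpte);
	/* An extra pin means migration would fail on the elevated
	 * count later anyway, so unmapping the page for a hinting
	 * fault is pointless.  The baseline for a mapped anon page
	 * outside the swap cache is page_count == page_mapcount + 1. */
	if (page && page_count(page) > page_mapcount(page) + 1)
		continue;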

--
Mel Gorman
SUSE Labs

2013-06-06 10:09:35

by Frank Mehnert

[permalink] [raw]
Subject: Re: Handling NUMA page migration

On Wednesday 05 June 2013 14:34:00 Mel Gorman wrote:
> On Wed, Jun 05, 2013 at 12:35:35PM +0200, Frank Mehnert wrote:
> > On Wednesday 05 June 2013 12:10:19 Mel Gorman wrote:
> > > On Tue, Jun 04, 2013 at 06:58:07AM -0500, Robin Holt wrote:
> > > > > B) 1. allocate memory with alloc_pages()
> > > > >
> > > > > 2. SetPageReserved()
> > > > > 3. vm_mmap() to allocate a userspace mapping
> > > > > 4. vm_insert_page()
> > > > > 5. vm_flags |= (VM_DONTEXPAND | VM_DONTDUMP)
> > > > >
> > > > > (resulting flags are VM_MIXEDMAP | VM_DONTDUMP |
> > > > > VM_DONTEXPAND | 0xff)
> > > > >
> > > > > At least the memory allocated like B) is affected by automatic NUMA
> > > > > page migration. I'm not sure about A).
> > > > >
> > > > > 1. How can I prevent automatic NUMA page migration on this memory?
> > > > > 2. Can NUMA page migration also be handled on such kind of memory
> > > > > without
> > > > >
> > > > > preventing migration?
> > >
> > > Page migration does not expect a PageReserved && PageLRU page. The only
> > > reserved check that is made by migration is for the zero page and that
> > > happens in the syscall path for move_pages() which is not used by
> > > either compaction or automatic balancing.
> > >
> > > At some point you must have a driver that is setting PageReserved on
> > > anonymous pages that are later encountered by automatic numa balancing
> > > during a NUMA hinting fault. I expect this is an out-of-tree driver or
> > > a custom kernel of some sort. Memory should be pinned by elevating the
> > > reference count of the page, not by setting PageReserved.
> >
> > Yes, this is ring 0 code from VirtualBox. The VBox ring 0 driver performs
> > the steps shown above. Setting PageReserved is not only for pinning
> > but also for fork() protection.
>
> Offhand I don't see what setting PageReserved on an LRU page has to do
> with fork() protection. If the VMA should not be copied by fork then use
> MADV_DONTFORK.

I'm not sure either. That code has grown over the years and was even
working on Linux 2.4.

> > I've tried to do get_page() as well but
> > it did not help prevent the migration during NUMA balancing.
>
> I think you mean elevating the page count did not prevent the unmapping.
> The elevated count should have prevented the actual migration but would
> not prevent the unmapping.

Right, that's what I meant and your explanations make sense to me.

> > As I wrote, the code for allocating + mapping the memory assumes that
> > the memory is finally pinned and will never be unmapped. That assumption
> > might be wrong in general, or wrong under certain/rare conditions. I would
> > like to know these conditions and how we can prevent them from happening,
> > or how we can handle them correctly.
>
> Memory compaction for THP allocations will break that assumption as
> compaction ignores VM_LOCKED. I strongly suspect that if you did something
> like move a process into a cpuset bound to another node, it would
> also break. If a process like numad is running then it would probably
> break VirtualBox as well, as it triggers migration from userspace. It is
> a fragile assumption to make.
>
> > > It's not particularly clear how you avoid hitting the same bug due to
> > > THP and memory compaction, to be honest, but maybe your setup hits a
> > > steady state that simply never hits the problem, or it happens rarely
> > > and was not identified.
> >
> > I'm currently using the stock Ubuntu 13.04 generic kernel (3.8.0-23),
>
> and an out-of-tree driver which is what is hitting the problem.

Right.

> A few of your options, in order of estimated time to completion, are:
>
> 1. Disable numa balancing within your driver or fail to start if it's
> running
> 2. Create a patch that adds a new NUMA_PTE_SCAN_IGNORE value for
> mm->first_nid (see include/linux/mm_types.h). In kernel/sched/fair.c,
> add a check that skips any mm with first_nid == NUMA_PTE_SCAN_IGNORE.
> Document that only virtualbox needs this and set it within your
> driver. This will not fix the compaction cases or numad using cpusets
> to migrate your processes though
> 3. When the driver affects a region, set mm->numa_next_reset and
> mm->numa_next_scan to large values to prevent the pages being unmapped.
> This would be very fragile, could break again in the future and is ugly
> 4. Add a check in change_pte_range() for the !prot_numa case to check
> PageReserved. This will prevent automatic numa balancing unmapping the
> page. Document that only virtualbox requires this.
> 5. Add a check in change_pte_range() for an elevated page count.
> Document that there is no point unmapping a page for a NUMA hinting
> fault that will only fail migration later anyway, which is true albeit of
> marginal benefit. Then, in the vbox driver, elevate the page count, do
> away with the PageReserved trick, use MADV_DONTFORK to prevent copying
> at fork time.

Thank you for these suggestions! For now I tried your suggestion 4),
although I think you meant the prot_numa case, not the !prot_numa case,
correct?

It also turned out that we must not even do ptep_modify_prot_start() for
such ranges, so I added the PageReserved() check like this:

--- mm/mprotect.c 2013-06-05 18:24:41.564777871 +0200
+++ mm/mprotect.c 2013-06-05 17:16:47.689923398 +0200
@@ -54,14 +54,22 @@
 			pte_t ptent;
 			bool updated = false;
 
+			struct page *page;
+
+			page = vm_normal_page(vma, addr, oldpte);
+			if (page && PageReserved(page))
+				continue;
+
 			ptent = ptep_modify_prot_start(mm, addr, pte);
 			if (!prot_numa) {
 				ptent = pte_modify(ptent, newprot);
 				updated = true;
 			} else {
+#if 0
 				struct page *page;
 
 				page = vm_normal_page(vma, addr, oldpte);
+#endif
 				if (page) {
 					int this_nid = page_to_nid(page);
 					if (last_nid == -1)

With this change I cannot reproduce any problems anymore.

Adding such a change to the kernel would help us a lot. OTOH I wonder why it
is not possible to prevent these unmaps with other means, for instance for
VM areas with VM_IO set. Wouldn't that make sense?

What I didn't mention explicitly in my previous postings: I assume that all
these problems also come from using R3 addresses from R0 code. That might be
evil, but VirtualBox currently maps the complete guest address space into
the address space of the corresponding host process for simplicity reasons.
Mapping into R0 isn't possible, at least not on 32-bit hosts. But I would
like to know if R0 mappings (vmap()) would be affected by any kind of page
migration.
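
For context, by an R0 mapping I mean a linear kernel-side view of the
same alloc_pages() pages, along these lines (where pages and nr_pages
come from the case B allocation):

	void *va = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
	/* ... use the kernel-side view ... */
	vunmap(va);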

Thanks,

Frank
--
Dr.-Ing. Frank Mehnert | Software Development Director, VirtualBox
ORACLE Deutschland B.V. & Co. KG | Werkstr. 24 | 71384 Weinstadt, Germany

Hauptverwaltung: Riesstr. 25, D-80992 München
Registergericht: Amtsgericht München, HRA 95603
Geschäftsführer: Jürgen Kunz

Komplementärin: ORACLE Deutschland Verwaltung B.V.
Hertogswetering 163/167, 3543 AS Utrecht, Niederlande
Handelsregister der Handelskammer Midden-Niederlande, Nr. 30143697
Geschäftsführer: Alexander van der Ven, Astrid Kepper, Val Maher

